Bug 1506141 - Upgrade failed in turn at task [Restart journald] for the first time when run upgrade playbook on master hosts
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.7.0
Assignee: Michael Gugino
QA Contact: liujia
 
Reported: 2017-10-25 08:58 UTC by liujia
Modified: 2017-11-28 22:19 UTC (History)

Last Closed: 2017-11-28 22:19:09 UTC

Links
System: Red Hat Product Errata
ID: RHSA-2017:3188
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update
Last Updated: 2017-11-29 02:34:54 UTC

Description liujia 2017-10-25 08:58:05 UTC
Description of problem:
When upgrading OCP v3.6 to v3.7, the first run of the upgrade playbook on the master hosts fails at task [Restart journald].
TASK [Restart journald] *****************************************************************************************************************************************************
fatal: [x.x.x.x]: FAILED! => {"changed": false, "failed": true, "msg": "Unable to restart service systemd-journald: Job for systemd-journald.service failed because a fatal signal was delivered to the control process. See \"systemctl status systemd-journald.service\" and \"journalctl -xe\" for details.\n"}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.retry

However, checking systemd-journald.service on the host after the upgrade had failed and stopped showed that the service was running.

systemctl status systemd-journald.service
● systemd-journald.service - Journal Service
   Loaded: loaded (/usr/lib/systemd/system/systemd-journald.service; static; vendor preset: disabled)
   Active: active (running) since Wed 2017-10-25 03:02:26 EDT; 2min 8s ago
     Docs: man:systemd-journald.service(8)
           man:journald.conf(5)
 Main PID: 1510 (systemd-journal)
   Status: "Processing requests..."
   Memory: 38.4M
   CGroup: /system.slice/systemd-journald.service
           └─1510 /usr/lib/systemd/systemd-journald

The only workaround is to re-run the upgrade playbook. In an HA deployment, however, the re-run fails again on another master host. This repeats until every master host has failed once, and only then does a re-run of the upgrade work.

Version-Release number of the following components:
openshift-ansible-docs-3.7.0-0.178.0.git.0.27a1039.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install ocp v3.6
2. Upgrade v3.6 to v3.7

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
The number of upgrade failures equals the number of master hosts.

Comment 2 Scott Dodson 2017-10-25 14:29:36 UTC
I saw this myself the first time I installed with the journald rate limiting changes. Journald was also running fine by the time I went to look, and the suspicion was that some sort of transition was taking too long. I suspect we should just add a retry on this task and hope that improves the situation.

https://github.com/openshift/openshift-ansible/pull/3753#issuecomment-330971553

Comment 3 Michael Gugino 2017-10-25 16:14:44 UTC
I believe the issue here is that we are setting a persistence file and journald is not able to create it (i.e., it doesn't have permission to).

I will investigate this.

Comment 4 Michael Gugino 2017-10-25 16:47:04 UTC
According to our documentation, we must create the persistent log storage directories manually: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/s1-using_the_journal
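
For illustration, the manual step described in that documentation amounts to creating the persistent journal directory before journald is restarted. The following is a minimal Python sketch, assuming the default location /var/log/journal. The function name ensure_journal_dir is hypothetical, and this is not the change made in the openshift-ansible playbooks.

```python
import os
import tempfile


def ensure_journal_dir(path="/var/log/journal", mode=0o755):
    """Create the persistent journal directory if it is missing.

    Returns True if the directory was created, False if it already existed.
    The default path is an assumption (the usual persistent journal location).
    """
    already_there = os.path.isdir(path)
    os.makedirs(path, mode=mode, exist_ok=True)
    return not already_there


# Demonstrate against a temporary directory so the sketch runs without root.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "var", "log", "journal")
    first = ensure_journal_dir(demo_path)   # directory gets created
    second = ensure_journal_dir(demo_path)  # nothing left to do
    print(first, second)
```

On a real host the function would be called as root with the default path, followed by a restart of systemd-journald so that it picks up the directory.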

Comment 5 Michael Gugino 2017-10-25 17:00:42 UTC
PR Created: https://github.com/openshift/openshift-ansible/pull/5882

Comment 7 liujia 2017-10-30 06:45:10 UTC
Version:
openshift-ansible-3.7.0-0.185.0.git.0.eb61aff.el7.noarch

Steps:
1. HA install ocp v3.6
2. Upgrade v3.6 to latest v3.7

Still hit the issue.

The upgrade log shows that TASK [Create journald persistence directories] was run and that the corresponding directory was created.

# ls -la /var/log/journal/
total 8
drwxr-xr-x.  3 root root   46 Oct 30 02:36 .
drwxr-xr-x. 13 root root 4096 Oct 30 02:36 ..
drwxr-xr-x.  2 root root 4096 Oct 30 02:40 8cc63b6b56124a29b52a8d655f648e46

Upgrade log in attachment.

Comment 9 liujia 2017-10-30 08:09:48 UTC
The original workaround no longer works. After the upgrade has failed once on each master, re-running it for the (N+1)th time leads to a different issue, which results from the upgrade being run against a partially upgraded environment.

# openshift version
openshift v3.6.173.0.62
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

# oc version
oc v3.6.173.0.62
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-jliu-conha-lb-nfs-1.1030-luw.qe.rhcloud.com
openshift v3.7.0-0.184.0
kubernetes v1.7.6+a08f5eeb62

TASK [Upgrade all storage] **************************************************************************************************************************************************
fatal: [x.x.x.x]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"], "delta": "0:00:21.045391", "end": "2017-10-30 03:28:56.455407", "failed": true, "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2017-10-30 03:28:35.410016", "stderr": "", "stderr_lines": [], "stdout": "summary: total=819 errors=0 ignored=0 unchanged=750 migrated=69\nerror: exited without processing all resources: no kind \"ControllerRevisionList\" is registered for version \"apps/v1beta1\"", "stdout_lines": ["summary: total=819 errors=0 ignored=0 unchanged=750 migrated=69", "error: exited without processing all resources: no kind \"ControllerRevisionList\" is registered for version \"apps/v1beta1\""]}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.retry


Marked as urgent because this issue now blocks all upgrade testing.
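
The partially upgraded state shown above (client tools at v3.6, server at v3.7) can be detected from the `oc version` output. The following is a rough Python sketch under the assumption that the output always has the layout quoted in this comment. The names minor_versions and is_skewed are hypothetical helpers, not part of oc or openshift-ansible.

```python
import re

# Output copied from this comment, used as sample input.
SAMPLE = """\
oc v3.6.173.0.62
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-jliu-conha-lb-nfs-1.1030-luw.qe.rhcloud.com
openshift v3.7.0-0.184.0
kubernetes v1.7.6+a08f5eeb62
"""


def minor_versions(oc_version_output):
    """Return (client, server) as 'major.minor' strings, or None where not found."""
    client = server = None
    in_server = False
    for line in oc_version_output.splitlines():
        line = line.strip()
        if line.startswith("Server "):
            # Everything after this line describes the server side.
            in_server = True
            continue
        match = re.match(r"(oc|openshift) v(\d+)\.(\d+)", line)
        if not match:
            continue
        version = f"{match.group(2)}.{match.group(3)}"
        if match.group(1) == "oc" and not in_server:
            client = version
        elif match.group(1) == "openshift" and in_server:
            server = version
    return client, server


def is_skewed(oc_version_output):
    """True when both versions were found and they differ."""
    client, server = minor_versions(oc_version_output)
    return client is not None and server is not None and client != server


print(minor_versions(SAMPLE), is_skewed(SAMPLE))
```

A check like this only reports the mismatch; it does not explain why the storage migration failed.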

Comment 10 Michael Gugino 2017-10-30 13:57:32 UTC
I'm unsure what would be causing this.  My guess is too much is trying to be flushed at once and we're hitting some kind of race or overflow in journald.

I am not able to replicate this issue on a clean host running just the journald steps.

Retry PR Created: https://github.com/openshift/openshift-ansible/pull/5930
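
The retry idea discussed in this bug can be sketched as follows. This is a minimal Python illustration of retrying a service restart a few times with a pause in between, not the content of the PR above. The function name restart_with_retry, the retry count, and the delay are all assumptions chosen for the example.

```python
import subprocess
import time


def restart_with_retry(run=None, retries=3, delay=5.0, sleep=time.sleep):
    """Try to restart systemd-journald, retrying on failure.

    `run` is a callable returning an exit code; it defaults to invoking
    `systemctl restart systemd-journald`. Returns the 1-based number of the
    attempt that succeeded, or raises RuntimeError if every attempt failed.
    """
    if run is None:
        def run():
            cmd = ["systemctl", "restart", "systemd-journald"]
            return subprocess.run(cmd).returncode

    for attempt in range(1, retries + 1):
        if run() == 0:
            return attempt
        if attempt < retries:
            # Give journald time to settle before trying again.
            sleep(delay)
    raise RuntimeError(
        f"systemd-journald did not restart after {retries} attempts"
    )
```

Passing a fake `run` callable makes the behaviour easy to check without touching a real service. A first attempt that fails followed by one that succeeds returns 2, which matches the symptom reported here, where journald was already running fine shortly after the failed restart.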

Comment 11 liujia 2017-10-31 03:28:06 UTC
PR works well, please merge.

Comment 13 liujia 2017-11-01 02:05:38 UTC
This PR is still not in v3.7.0-0.188.0.

Comment 14 liujia 2017-11-02 07:28:35 UTC
Verified on openshift-ansible-3.7.0-0.189.0.git.0.d497c5e.el7.noarch.

Comment 17 errata-xmlrpc 2017-11-28 22:19:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

