1506141 – Upgrade failed in turn at task [Restart journald] for the first time when run upgrade playbook on master hosts

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	3.7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	3.7.0
Assignee:	Michael Gugino
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-10-25 08:58 UTC by liujia
Modified:	2017-11-28 22:19 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-11-28 22:19:09 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2017:3188	0	normal	SHIPPED_LIVE	Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update	2017-11-29 02:34:54 UTC

Description liujia 2017-10-25 08:58:05 UTC

Description of problem:
Upgrading OCP v3.6 to v3.7: the upgrade fails the first time the upgrade playbook is run against the master hosts. It fails at task [Restart journald].
TASK [Restart journald] *****************************************************************************************************************************************************
fatal: [x.x.x.x]: FAILED! => {"changed": false, "failed": true, "msg": "Unable to restart service systemd-journald: Job for systemd-journald.service failed because a fatal signal was delivered to the control process. See \"systemctl status systemd-journald.service\" and \"journalctl -xe\" for details.\n"}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.retry

However, checking systemd-journald.service on the host after the upgrade had failed and stopped showed that the service was running.

systemctl status systemd-journald.service
● systemd-journald.service - Journal Service
   Loaded: loaded (/usr/lib/systemd/system/systemd-journald.service; static; vendor preset: disabled)
   Active: active (running) since Wed 2017-10-25 03:02:26 EDT; 2min 8s ago
     Docs: man:systemd-journald.service(8)
           man:journald.conf(5)
 Main PID: 1510 (systemd-journal)
   Status: "Processing requests..."
   Memory: 38.4M
   CGroup: /system.slice/systemd-journald.service
           └─1510 /usr/lib/systemd/systemd-journald

The workaround is to re-run the upgrade playbook. In an HA deployment, however, each re-run fails on another master host until every master host has failed once; only then does a re-run succeed.

Version-Release number of the following components:
openshift-ansible-docs-3.7.0-0.178.0.git.0.27a1039.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install ocp v3.6
2. Upgrade v3.6 to v3.7
3.

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
The number of upgrade failures equals the number of master hosts.
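
The failure pattern reported here (one failed run per master, then a run that completes) can be pictured with a toy model. This is only a sketch under two assumptions that are not confirmed by the report: the playbook visits the masters in a fixed order and aborts at the first failure, and the journald restart fails only the first time it is attempted on a given master. The function name `runs_until_success` is hypothetical.

```python
def runs_until_success(master_count):
    """Count playbook runs until one completes, under the toy model.

    Assumptions (illustrative only): masters are visited in a fixed order,
    a run aborts at the first failing master, and the restart fails only
    on the first attempt per master.
    """
    failed_once = set()
    runs = 0
    while True:
        runs += 1
        for master in range(master_count):
            if master not in failed_once:
                failed_once.add(master)  # first restart on this master fails
                break                    # the run aborts at this master
        else:
            return runs  # no master failed, so this run completed
```

Under this model, `runs_until_success(3)` returns 4: three failed runs (one per master) followed by one successful run.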

Comment 2 Scott Dodson 2017-10-25 14:29:36 UTC

I saw this myself the first time I installed with the journald rate limiting changes. I too found journald running fine by the time I went to look. The suspicion was that some sort of transition is taking too long, so I suspect we should just add a retry on this task and hope that improves the situation.

https://github.com/openshift/openshift-ansible/pull/3753#issuecomment-330971553
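
A minimal sketch of the retry idea, written in Python rather than as an Ansible task. The retry count, the delay, and the function names `restart_with_retries` and `restart_journald` are illustrative assumptions and are not taken from any openshift-ansible change.

```python
import subprocess
import time


def restart_with_retries(run_restart, retries=3, delay=5.0, sleep=time.sleep):
    """Call run_restart() until it returns True or retries are exhausted.

    Returns the number of attempts made on success and raises RuntimeError
    if every attempt fails. retries and delay are assumed values.
    """
    for attempt in range(1, retries + 1):
        if run_restart():
            return attempt
        if attempt < retries:
            sleep(delay)  # give journald time to settle before trying again
    raise RuntimeError(f"restart still failing after {retries} attempts")


def restart_journald():
    """Restart systemd-journald; True when systemctl exits with status 0."""
    result = subprocess.run(
        ["systemctl", "restart", "systemd-journald"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

On a host, this would be invoked as `restart_with_retries(restart_journald)`. Passing the restart function in as an argument keeps the retry loop independent of systemd, so it can be exercised without root access.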

Comment 3 Michael Gugino 2017-10-25 16:14:44 UTC

I believe the issue here is that we are setting a persistence file, and journald is not able to create it (i.e., it doesn't have permission to).

I will investigate this.

Comment 4 Michael Gugino 2017-10-25 16:47:04 UTC

According to our documentation, we must create the persistence log storage directories manually: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/s1-using_the_journal
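
A rough sketch of that manual step, assuming the default persistent journal location `/var/log/journal`. The helper name `ensure_journal_dir` is hypothetical, and this is not the task added by the PR; journald would still need to be restarted afterwards for the directory to be picked up.

```python
from pathlib import Path


def ensure_journal_dir(journal_dir="/var/log/journal"):
    """Create the persistent journal directory if it is missing.

    Returns True when the directory had to be created and False when it
    already existed. The default path is journald's standard persistent
    storage location; creating it there requires root.
    """
    path = Path(journal_dir)
    if path.is_dir():
        return False
    path.mkdir(parents=True, exist_ok=True)
    return True
```

The path is a parameter so the helper can be tried against a scratch directory before being pointed at the real location.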

Comment 5 Michael Gugino 2017-10-25 17:00:42 UTC

PR Created: https://github.com/openshift/openshift-ansible/pull/5882

Comment 7 liujia 2017-10-30 06:45:10 UTC

Version:
openshift-ansible-3.7.0-0.185.0.git.0.eb61aff.el7.noarch

Steps:
1. HA install ocp v3.6
2. Upgrade v3.6 to latest v3.7

Still hit the issue.

The upgrade log shows that TASK [Create journald persistence directories] ran and that the corresponding directory was created.

# ls -la /var/log/journal/
total 8
drwxr-xr-x.  3 root root   46 Oct 30 02:36 .
drwxr-xr-x. 13 root root 4096 Oct 30 02:36 ..
drwxr-xr-x.  2 root root 4096 Oct 30 02:40 8cc63b6b56124a29b52a8d655f648e46

Upgrade log in attachment.

Comment 9 liujia 2017-10-30 08:09:48 UTC

The original workaround no longer works. After the upgrade has failed once on each master, re-running it for the (N+1)th time hits a different issue, which results from running the upgrade against a partially upgraded environment.

# openshift version
openshift v3.6.173.0.62
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

# oc version
oc v3.6.173.0.62
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-jliu-conha-lb-nfs-1.1030-luw.qe.rhcloud.com
openshift v3.7.0-0.184.0
kubernetes v1.7.6+a08f5eeb62
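
The version output above shows the local binaries at v3.6 while the server reports v3.7. A small sketch of a check for that kind of skew follows; it is only a heuristic, it assumes version lines shaped like the ones above, and the function names `major_minor` and `is_partially_upgraded` are hypothetical.

```python
import re


def major_minor(version_line):
    """Extract (major, minor) from a line such as 'openshift v3.6.173.0.62'."""
    match = re.search(r"v(\d+)\.(\d+)", version_line)
    if match is None:
        raise ValueError(f"no version found in {version_line!r}")
    return int(match.group(1)), int(match.group(2))


def is_partially_upgraded(client_line, server_line):
    """Heuristic: True when client and server differ in major.minor version."""
    return major_minor(client_line) != major_minor(server_line)
```

For the lines above, `is_partially_upgraded("openshift v3.6.173.0.62", "openshift v3.7.0-0.184.0")` returns True.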

TASK [Upgrade all storage] **************************************************************************************************************************************************
fatal: [x.x.x.x]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"], "delta": "0:00:21.045391", "end": "2017-10-30 03:28:56.455407", "failed": true, "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2017-10-30 03:28:35.410016", "stderr": "", "stderr_lines": [], "stdout": "summary: total=819 errors=0 ignored=0 unchanged=750 migrated=69\nerror: exited without processing all resources: no kind \"ControllerRevisionList\" is registered for version \"apps/v1beta1\"", "stdout_lines": ["summary: total=819 errors=0 ignored=0 unchanged=750 migrated=69", "error: exited without processing all resources: no kind \"ControllerRevisionList\" is registered for version \"apps/v1beta1\""]}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.retry
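
Note that the summary line in the stdout above reports errors=0 even though the command exited with rc 1; the failure comes from the trailing "error:" line. A minimal sketch for pulling the counters out of that stdout, assuming the `summary: key=value ...` shape seen above (the helper name `parse_migrate_summary` is hypothetical):

```python
import re

# Matches the single 'summary:' line and captures the key=value fields after it.
SUMMARY_RE = re.compile(r"^summary:\s+(.*)$", re.MULTILINE)


def parse_migrate_summary(stdout):
    """Return the counters from the 'summary:' line as a dict of ints."""
    match = SUMMARY_RE.search(stdout)
    if match is None:
        raise ValueError("no summary line found")
    counters = {}
    for field in match.group(1).split():
        key, _, value = field.partition("=")
        counters[key] = int(value)
    return counters


def has_error_line(stdout):
    """True when any stdout line starts with 'error:'."""
    return any(line.startswith("error:") for line in stdout.splitlines())
```

Applied to the stdout above, the counters are total=819, errors=0, ignored=0, unchanged=750, migrated=69, and `has_error_line` returns True, which is why checking the errors counter alone would miss this failure.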


Marked as urgent because the issue now blocks all upgrade testing.

Comment 10 Michael Gugino 2017-10-30 13:57:32 UTC

I'm unsure what would be causing this.  My guess is too much is trying to be flushed at once and we're hitting some kind of race or overflow in journald.

I am not able to replicate this issue on a clean host running just the journald steps.

Retry PR Created: https://github.com/openshift/openshift-ansible/pull/5930

Comment 11 liujia 2017-10-31 03:28:06 UTC

PR works well, please merge.

Comment 13 liujia 2017-11-01 02:05:38 UTC

This PR is still not in v3.7.0-0.188.0.

Comment 14 liujia 2017-11-02 07:28:35 UTC

Verified on openshift-ansible-3.7.0-0.189.0.git.0.d497c5e.el7.noarch.

Comment 17 errata-xmlrpc 2017-11-28 22:19:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
