Bug 1506141 - Upgrade fails at task [Restart journald] on each master host the first time the upgrade playbook is run
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Upgrade
Version: 3.7.0
Hardware/OS: Unspecified / Unspecified
Priority/Severity: urgent / urgent
Target Release: 3.7.0
Assigned To: Michael Gugino
QA Contact: liujia
Reported: 2017-10-25 04:58 EDT by liujia
Modified: 2017-11-28 17:19 EST
CC: 6 users
Doc Type: If docs needed, set a value
Last Closed: 2017-11-28 17:19:09 EST
Type: Bug

External Trackers:
Red Hat Product Errata RHSA-2017:3188 (normal, SHIPPED_LIVE): Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update (last updated 2017-11-28 21:34:54 EST)
Description liujia 2017-10-25 04:58:05 EDT
Description of problem:
Upgrading OCP v3.6 to v3.7. The upgrade failed the first time the upgrade playbook was run on the master hosts. It failed at task [Restart journald].
TASK [Restart journald] *****************************************************************************************************************************************************
fatal: [x.x.x.x]: FAILED! => {"changed": false, "failed": true, "msg": "Unable to restart service systemd-journald: Job for systemd-journald.service failed because a fatal signal was delivered to the control process. See \"systemctl status systemd-journald.service\" and \"journalctl -xe\" for details.\n"}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.retry

However, checking systemd-journald.service status on the host after the upgrade failed showed that the service was running:

systemctl status systemd-journald.service
● systemd-journald.service - Journal Service
   Loaded: loaded (/usr/lib/systemd/system/systemd-journald.service; static; vendor preset: disabled)
   Active: active (running) since Wed 2017-10-25 03:02:26 EDT; 2min 8s ago
     Docs: man:systemd-journald.service(8)
           man:journald.conf(5)
 Main PID: 1510 (systemd-journal)
   Status: "Processing requests..."
   Memory: 38.4M
   CGroup: /system.slice/systemd-journald.service
           └─1510 /usr/lib/systemd/systemd-journald

The only workaround is to re-run the upgrade playbook. For an HA deployment, however, it fails again on another master host each time, until every master host has failed once; only then does a re-run succeed.

Version-Release number of the following components:
openshift-ansible-docs-3.7.0-0.178.0.git.0.27a1039.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install ocp v3.6
2. Upgrade v3.6 to v3.7
3.

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
The number of upgrade failures equals the number of master hosts.
Comment 2 Scott Dodson 2017-10-25 10:29:36 EDT
I saw this myself the first time I installed with the journald rate-limiting changes. By the time I went to look, journald was running fine, and the suspicion was that some sort of transition was taking too long. I suspect we should just add a retry to this task and hope that improves the situation.

https://github.com/openshift/openshift-ansible/pull/3753#issuecomment-330971553
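A retry on the restart task, as suggested above, could look like the following Ansible sketch. This is an illustration only: the retry count, delay, and exact task shape are assumptions, not necessarily what the eventual fix used.

```yaml
# Hedged sketch: retry the journald restart a few times before failing.
# The retries/delay values are illustrative, not taken from the actual fix.
- name: Restart journald
  systemd:
    name: systemd-journald
    state: restarted
  register: journald_restart
  retries: 3
  delay: 5
  until: journald_restart is succeeded
```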
Comment 3 Michael Gugino 2017-10-25 12:14:44 EDT
I believe the issue here is that we are setting a persistence file, and journald is not able to create it (i.e., it doesn't have permission to).

I will investigate this.
Comment 4 Michael Gugino 2017-10-25 12:47:04 EDT
According to our documentation, we must create the persistent log storage directories manually: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/s1-using_the_journal
Comment 5 Michael Gugino 2017-10-25 13:00:42 EDT
PR Created: https://github.com/openshift/openshift-ansible/pull/5882
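Per the RHEL documentation linked above, enabling persistent journal storage requires creating /var/log/journal before journald is restarted. An Ansible sketch of that step follows; it is an approximation of the approach, not necessarily the exact tasks in the PR, and the ownership/mode values here follow systemd's usual tmpfiles defaults for this directory.

```yaml
# Hedged sketch: create the persistent journal directory before the
# [Restart journald] task runs. Group and mode follow systemd's tmpfiles
# defaults for /var/log/journal (setgid directory, group systemd-journal).
- name: Create journald persistence directories
  file:
    path: /var/log/journal
    state: directory
    owner: root
    group: systemd-journal
    mode: "2755"

- name: Apply systemd tmpfiles configuration to the new directory
  command: systemd-tmpfiles --create --prefix /var/log/journal
```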
Comment 7 liujia 2017-10-30 02:45:10 EDT
Version:
openshift-ansible-3.7.0-0.185.0.git.0.eb61aff.el7.noarch

Steps:
1. HA install ocp v3.6
2. Upgrade v3.6 to latest v3.7

Still hit the issue.

Checked the upgrade log and found that TASK [Create journald persistence directories] had run, and the corresponding directory had been created:

# ls -la /var/log/journal/
total 8
drwxr-xr-x.  3 root root   46 Oct 30 02:36 .
drwxr-xr-x. 13 root root 4096 Oct 30 02:36 ..
drwxr-xr-x.  2 root root 4096 Oct 30 02:40 8cc63b6b56124a29b52a8d655f648e46

Upgrade log in attachment.
Comment 9 liujia 2017-10-30 04:09:48 EDT
The original workaround no longer works. After the upgrade has failed once on each master and is re-run for the (N+1)th time, it hits a different issue, caused by running the upgrade against a partially upgraded environment.

# openshift version
openshift v3.6.173.0.62
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

# oc version
oc v3.6.173.0.62
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-jliu-conha-lb-nfs-1.1030-luw.qe.rhcloud.com
openshift v3.7.0-0.184.0
kubernetes v1.7.6+a08f5eeb62

TASK [Upgrade all storage] **************************************************************************************************************************************************
fatal: [x.x.x.x]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"], "delta": "0:00:21.045391", "end": "2017-10-30 03:28:56.455407", "failed": true, "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2017-10-30 03:28:35.410016", "stderr": "", "stderr_lines": [], "stdout": "summary: total=819 errors=0 ignored=0 unchanged=750 migrated=69\nerror: exited without processing all resources: no kind \"ControllerRevisionList\" is registered for version \"apps/v1beta1\"", "stdout_lines": ["summary: total=819 errors=0 ignored=0 unchanged=750 migrated=69", "error: exited without processing all resources: no kind \"ControllerRevisionList\" is registered for version \"apps/v1beta1\""]}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.retry


Marked as urgent because the issue now blocks all upgrade testing.
Comment 10 Michael Gugino 2017-10-30 09:57:32 EDT
I'm unsure what would be causing this. My guess is that too much is being flushed at once and we're hitting some kind of race or overflow in journald.

I am not able to replicate this issue on a clean host running just the journald steps.

Retry PR Created: https://github.com/openshift/openshift-ansible/pull/5930
Comment 11 liujia 2017-10-30 23:28:06 EDT
PR works well, please merge.
Comment 13 liujia 2017-10-31 22:05:38 EDT
This PR is still not in v3.7.0-0.188.0.
Comment 14 liujia 2017-11-02 03:28:35 EDT
Verified on openshift-ansible-3.7.0-0.189.0.git.0.d497c5e.el7.noarch.
Comment 17 errata-xmlrpc 2017-11-28 17:19:09 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
