Description of problem:
Upgrade ocp v3.6 to v3.7. The upgrade failed the first time the upgrade playbook ran against the master hosts, at task [Restart journald]:

TASK [Restart journald] *****************************************************************************************************************************************************
fatal: [x.x.x.x]: FAILED! => {"changed": false, "failed": true, "msg": "Unable to restart service systemd-journald: Job for systemd-journald.service failed because a fatal signal was delivered to the control process. See \"systemctl status systemd-journald.service\" and \"journalctl -xe\" for details.\n"}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.retry

However, checking systemd-journald.service on the host after the upgrade had failed and stopped, the service was running:

# systemctl status systemd-journald.service
● systemd-journald.service - Journal Service
   Loaded: loaded (/usr/lib/systemd/system/systemd-journald.service; static; vendor preset: disabled)
   Active: active (running) since Wed 2017-10-25 03:02:26 EDT; 2min 8s ago
     Docs: man:systemd-journald.service(8)
           man:journald.conf(5)
 Main PID: 1510 (systemd-journal)
   Status: "Processing requests..."
   Memory: 38.4M
   CGroup: /system.slice/systemd-journald.service
           └─1510 /usr/lib/systemd/systemd-journald

The only workaround is to re-run the upgrade playbook. For an HA deployment, though, it fails again on another master host each time, until every master host has failed once; only then does a re-run complete.

Version-Release number of the following components:
openshift-ansible-docs-3.7.0-0.178.0.git.0.27a1039.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install ocp v3.6
2. Upgrade v3.6 to v3.7

Actual results:
Upgrade fails.

Expected results:
Upgrade succeeds.

Additional info:
The number of upgrade failures equals the number of master hosts.
I saw this myself the first time I installed with the journald rate-limiting changes. Journald was also running fine by the time I went to look, and the suspicion was that some sort of transition was taking too long, so I suspect we should just add a retry on this task and hope that improves the situation. https://github.com/openshift/openshift-ansible/pull/3753#issuecomment-330971553
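A minimal sketch of what such a retry might look like as an Ansible task. The task name matches the failing task in the log above, but the module choice and the retry/delay values are illustrative assumptions, not the contents of any actual PR:

```yaml
# Hypothetical sketch: retry the journald restart a few times before
# giving up, in case the service is mid-transition when we hit it.
# retries/delay values are illustrative, not from the actual fix.
- name: Restart journald
  systemd:
    name: systemd-journald
    state: restarted
  register: journald_restart
  retries: 3
  delay: 5
  until: journald_restart is succeeded
```

With `register` + `until` + `retries`, Ansible re-runs the task up to three times, waiting five seconds between attempts, and only reports failure if every attempt fails.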
I believe the issue here is that we are setting a persistence file and journald is not able to create it (i.e., it doesn't have permission to). I will investigate this.
According to our documentation, the persistent log storage directories must be created manually: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/s1-using_the_journal
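The manual steps from that RHEL guide (create /var/log/journal, let systemd-tmpfiles apply the ownership/ACL rules, then signal journald) could be expressed as Ansible tasks roughly like the following. This is a hedged sketch of those documented steps, not the actual PR; the task names are illustrative:

```yaml
# Sketch of the persistent-journal setup steps from the RHEL System
# Administrator's Guide, as Ansible tasks. Names are illustrative.
- name: Create journald persistence directory
  file:
    path: /var/log/journal
    state: directory

- name: Apply tmpfiles ownership/ACL rules to the journal directory
  command: systemd-tmpfiles --create --prefix /var/log/journal

- name: Signal journald to start writing to the persistent location
  command: killall -USR1 systemd-journald
```

The USR1 signal tells a running systemd-journald to flush its runtime journal into the newly created persistent location without a full restart, which may sidestep the restart failure entirely.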
PR Created: https://github.com/openshift/openshift-ansible/pull/5882
Version: openshift-ansible-3.7.0-0.185.0.git.0.eb61aff.el7.noarch

Steps:
1. HA install ocp v3.6
2. Upgrade v3.6 to latest v3.7

Still hitting the issue. The upgrade log shows that TASK [Create journald persistence directories] did run, and the corresponding directory was created:

# ls -la /var/log/journal/
total 8
drwxr-xr-x.  3 root root   46 Oct 30 02:36 .
drwxr-xr-x. 13 root root 4096 Oct 30 02:36 ..
drwxr-xr-x.  2 root root 4096 Oct 30 02:40 8cc63b6b56124a29b52a8d655f648e46

Upgrade log is in the attachment.
The original workaround no longer works. After the upgrade fails once on each master and the playbook is re-run for the (N+1)th time, it hits a different issue, caused by running the upgrade against a partially upgraded environment:

# openshift version
openshift v3.6.173.0.62
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

# oc version
oc v3.6.173.0.62
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://qe-jliu-conha-lb-nfs-1.1030-luw.qe.rhcloud.com
openshift v3.7.0-0.184.0
kubernetes v1.7.6+a08f5eeb62

TASK [Upgrade all storage] **************************************************************************************************************************************************
fatal: [x.x.x.x]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"], "delta": "0:00:21.045391", "end": "2017-10-30 03:28:56.455407", "failed": true, "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2017-10-30 03:28:35.410016", "stderr": "", "stderr_lines": [], "stdout": "summary: total=819 errors=0 ignored=0 unchanged=750 migrated=69\nerror: exited without processing all resources: no kind \"ControllerRevisionList\" is registered for version \"apps/v1beta1\"", "stdout_lines": ["summary: total=819 errors=0 ignored=0 unchanged=750 migrated=69", "error: exited without processing all resources: no kind \"ControllerRevisionList\" is registered for version \"apps/v1beta1\""]}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.retry

Marking this as urgent because the issue now blocks all upgrade testing.
I'm unsure what is causing this. My guess is that too much is being flushed at once and we're hitting some kind of race or overflow in journald; I am not able to replicate the issue on a clean host running just the journald steps. Retry PR created: https://github.com/openshift/openshift-ansible/pull/5930
The PR works well; please merge.
This PR is still not in v3.7.0-0.188.0.
Verified on openshift-ansible-3.7.0-0.189.0.git.0.d497c5e.el7.noarch.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188