Bug 1506141
Summary: | Upgrade fails in turn at task [Restart journald] the first time the upgrade playbook is run on master hosts
---|---
Product: | OpenShift Container Platform
Component: | Cluster Version Operator
Reporter: | liujia <jiajliu>
Assignee: | Michael Gugino <mgugino>
QA Contact: | liujia <jiajliu>
Status: | CLOSED ERRATA
Severity: | urgent
Priority: | urgent
Version: | 3.7.0
Target Release: | 3.7.0
Hardware: | Unspecified
OS: | Unspecified
CC: | aos-bugs, jokerman, mmccomas, pportant, wsun, xtian
Last Closed: | 2017-11-28 22:19:09 UTC
Type: | Bug
Description
liujia
2017-10-25 08:58:05 UTC
I saw this the first time I installed with the journald rate limiting changes myself. I, too, had journald running fine by the time I went to look, and the suspicion was that some sort of transition was taking too long, so I suspect we should just add a retry on this task and hope that improves the situation. https://github.com/openshift/openshift-ansible/pull/3753#issuecomment-330971553

I believe the issue here is that we are setting a persistence file and journald is not able to create it (i.e., it does not have permission to). I will investigate this. According to our documentation, the persistent log storage directories must be created manually: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/s1-using_the_journal

Version:
openshift-ansible-3.7.0-0.185.0.git.0.eb61aff.el7.noarch

Steps:
1. HA install OCP v3.6.
2. Upgrade v3.6 to the latest v3.7.

Still hit the issue. The upgrade log shows that TASK [Create journald persistence directories] had run and the corresponding directory had been created:

    # ls -la /var/log/journal/
    total 8
    drwxr-xr-x.  3 root root   46 Oct 30 02:36 .
    drwxr-xr-x. 13 root root 4096 Oct 30 02:36 ..
    drwxr-xr-x.  2 root root 4096 Oct 30 02:40 8cc63b6b56124a29b52a8d655f648e46

The upgrade log is in the attachment. The original workaround no longer works. After the task fails once on each master and the upgrade is re-run for the N+1th time, it hits a different failure, because the upgrade then runs against a half-upgraded environment:

    # openshift version
    openshift v3.6.173.0.62
    kubernetes v1.6.1+5115d708d7
    etcd 3.2.1
    # oc version
    oc v3.6.173.0.62
    kubernetes v1.6.1+5115d708d7
    features: Basic-Auth GSSAPI Kerberos SPNEGO
    Server https://qe-jliu-conha-lb-nfs-1.1030-luw.qe.rhcloud.com
    openshift v3.7.0-0.184.0
    kubernetes v1.7.6+a08f5eeb62

    TASK [Upgrade all storage] **************************************************
    fatal: [x.x.x.x]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/oc",
    "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage",
    "--include=*", "--confirm"], "delta": "0:00:21.045391",
    "end": "2017-10-30 03:28:56.455407", "failed": true,
    "failed_when_result": true, "msg": "non-zero return code", "rc": 1,
    "start": "2017-10-30 03:28:35.410016", "stderr": "", "stderr_lines": [],
    "stdout": "summary: total=819 errors=0 ignored=0 unchanged=750 migrated=69\nerror: exited without processing all resources: no kind \"ControllerRevisionList\" is registered for version \"apps/v1beta1\"",
    "stdout_lines": ["summary: total=819 errors=0 ignored=0 unchanged=750 migrated=69", "error: exited without processing all resources: no kind \"ControllerRevisionList\" is registered for version \"apps/v1beta1\""]}
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.retry

Marked as urgent because the issue currently blocks all upgrade testing.

I'm unsure what is causing this. My guess is that too much is being flushed at once and we are hitting some kind of race or overflow in journald. I am not able to replicate the issue on a clean host running just the journald steps.

Retry PR created: https://github.com/openshift/openshift-ansible/pull/5930

The PR works well; please merge.

This PR is still not in v3.7.0-0.188.0.

Verified on openshift-ansible-3.7.0-0.189.0.git.0.d497c5e.el7.noarch.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
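For context, the retry approach discussed above can be sketched as Ansible tasks like the following. This is a minimal illustrative sketch, not the exact contents of PR 5930: the task names, module choices, and retry/delay values are assumptions.

```yaml
# Hypothetical sketch of the fix discussed above: pre-create the persistent
# journal directory (per the RHEL documentation linked in the comments), then
# retry the journald restart in case it races with journald's own transition.
# All names and values here are illustrative, not taken from PR 5930.
- name: Create journald persistence directories
  file:
    path: /var/log/journal
    state: directory
    mode: "0755"

- name: Restart journald
  service:
    name: systemd-journald
    state: restarted
  register: journald_restart
  retries: 3
  delay: 5
  until: journald_restart is succeeded
```

With `until`/`retries`/`delay`, Ansible re-runs the restart task up to three more times, five seconds apart, before failing the play, which matches the "just add a retry on this task" suggestion in the first comment.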