Bug 1504525
Summary: | Upgrade failed due to masters can not finish reconciling | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | liujia <jiajliu> |
Component: | Cluster Version Operator | Assignee: | Scott Dodson <sdodson> |
Status: | CLOSED ERRATA | QA Contact: | liujia <jiajliu> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 3.7.0 | CC: | aos-bugs, eparis, jokerman, mmccomas, sdodson, wsun |
Target Milestone: | --- | ||
Target Release: | 3.7.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | No Doc Update | |
Doc Text: |
undefined
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2017-11-28 22:18:08 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
liujia
2017-10-20 06:07:54 UTC
Container upgrade blocked. Simo, Should we switch from replace to patch? -- Scott Alternatively we can wrap it in retries but that's a dirty hack. Scott, two things here. It looks like the script is running for 3.6 -> 3.7 upgrade and it shouldn't, as 3.7 code will reconcile on it's own, this was meant to be run only for < 3.7 upgrades, so 3.5 -> 3.6 and (if we do it) for 3.6 to a higher 3.6 release. The other thing is that we probably want to rety a couple of times if it fails and then fianlly give up with a warning and not an error. Can you do that ? Version: atomic-openshift-utils-3.7.0-0.185.0.git.0.eb61aff.el7.noarch Checkin https://github.com/openshift/openshift-ansible/pull/5930 for blocker bug https://bugzilla.redhat.com/show_bug.cgi?id=1506141. Steps: 1. Container install ocp v3.6 2. Run pre-upgrade 3. Run upgrade #ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml Upgrade succeed. Unfortunately, hit the issue again when upgrade against another container cluster deployed on atomic hosts. TASK [fail] ***************************************************************************************************************************************************************** fatal: [localhost]: FAILED! => {"changed": false, "failed": true, "msg": "Upgrade cannot continue. The following masters did not finish reconciling: x.x.x.x"} The installer version is the same version with comment9. Upgrade log in attachment. Checked log, it seems it failed more earlier than [Fixup shared-resource-viewer role] this time. At task [openshift_cli : Copy client binaries/symlinks out of CLI image for use on the host], one of master failed and result the whole upgrade failed at the end. fatal: [x.x.x.x]: FAILED! => {"changed": false, "failed": true, "module_stderr": "Shared connection to host-8-241-39.host.centralci.eng.rdu2.redhat.com closed.\r\n", "module_stdout": "Traceback (most recent call last):\r\n File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 175, in <module>\r\n main()\r\n File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 166, in main\r\n binary_syncer.sync()\r\n File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 48, in sync\r\n return self._sync_docker()\r\n File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 95, in _sync_docker\r\n self._sync_binaries()\r\n File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 107, in _sync_binaries\r\n self._sync_binary('oc')\r\n File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 141, in _sync_binary\r\n shutil.move(src_path, dest_path)\r\n File \"/usr/lib64/python2.7/shutil.py\", line 301, in move\r\n copy2(src, real_dst)\r\n File \"/usr/lib64/python2.7/shutil.py\", line 130, in copy2\r\n copyfile(src, dst)\r\n File \"/usr/lib64/python2.7/shutil.py\", line 83, in copyfile\r\n with open(dst, 'wb') as fdst:\r\nIOError: [Errno 26] Text file busy: '/usr/local/bin/oc'\r\n", "msg": "MODULE FAILURE", "rc": 0} (In reply to liujia from comment #12) > Checked log, it seems it failed more earlier than [Fixup > shared-resource-viewer role] this time. At task [openshift_cli : Copy client > binaries/symlinks out of CLI image for use on the host], one of master > failed and result the whole upgrade failed at the end. > > fatal: [x.x.x.x]: FAILED! => {"changed": false, "failed": true, > "module_stderr": "Shared connection to > host-8-241-39.host.centralci.eng.rdu2.redhat.com closed.\r\n", > "module_stdout": "Traceback (most recent call last):\r\n File > \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", > line 175, in <module>\r\n main()\r\n File > \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", > line 166, in main\r\n binary_syncer.sync()\r\n File > \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", > line 48, in sync\r\n return self._sync_docker()\r\n File > \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", > line 95, in _sync_docker\r\n self._sync_binaries()\r\n File > \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", > line 107, in _sync_binaries\r\n self._sync_binary('oc')\r\n File > \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", > line 141, in _sync_binary\r\n shutil.move(src_path, dest_path)\r\n File > \"/usr/lib64/python2.7/shutil.py\", line 301, in move\r\n copy2(src, > real_dst)\r\n File \"/usr/lib64/python2.7/shutil.py\", line 130, in > copy2\r\n copyfile(src, dst)\r\n File > \"/usr/lib64/python2.7/shutil.py\", line 83, in copyfile\r\n with > open(dst, 'wb') as fdst:\r\nIOError: [Errno 26] Text file busy: > '/usr/local/bin/oc'\r\n", "msg": "MODULE FAILURE", "rc": 0} That happens when someone is using the oc command on the host. Are you running `oc watch` or something else that's running the oc command repeatedly while performing the upgrade? https://bugzilla.redhat.com/show_bug.cgi?id=1423363 is the bug for not being able to replace the oc command for containerized environments. Verified on openshift-ansible-3.7.0-0.188.0.git.0.aebb674.el7.noarch. Issue in comment 12 will be tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1423363. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188 |