Description of problem:
Run upgrade against a containerized HA env; the upgrade failed at task [Fixup shared-resource-viewer role].

Failure summary:

1. Hosts:   x.x.x.x
   Play:    Reconcile Cluster Roles and Cluster Role Bindings and Security Context Constraints
   Task:    Fixup shared-resource-viewer role
   Message: {u'returncode': 1, u'cmd': u'/usr/local/bin/oc replace -f /tmp/shared_resource_viewer_role.yaml -n openshift', u'results': {}, u'stderr': u'Error from server (Conflict): error when replacing "/tmp/shared_resource_viewer_role.yaml": Operation cannot be fulfilled on roles "shared-resource-viewer": the object has been modified; please apply your changes to the latest version and try again\n', u'stdout': u''}

2. Hosts:   localhost
   Play:    Gate on reconcile
   Task:    fail
   Message: Upgrade cannot continue. The following masters did not finish reconciling: x.x.x.x

The command can be run manually on the host, but the upgrade still fails afterwards:

# oc replace -f /tmp/shared_resource_viewer_role.yaml -n openshift
role "shared-resource-viewer" replaced

Version-Release number of the following components:
openshift-ansible-3.7.0-0.167.0.git.0.0e34535.el7.noarch
ansible-2.4.1.0-0.1.beta2.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Container install OCP v3.6
2. Run pre-upgrade
3. Run upgrade:
   # ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag.
Container upgrade blocked.
Simo, should we switch from replace to patch? -- Scott
Alternatively we can wrap it in retries but that's a dirty hack.
Scott, two things here. It looks like the script is running for the 3.6 -> 3.7 upgrade and it shouldn't: 3.7 code will reconcile on its own. This was meant to be run only for < 3.7 upgrades, i.e. 3.5 -> 3.6 and (if we do it) 3.6 to a later 3.6 release. The other thing is that we probably want to retry a couple of times if it fails and then finally give up with a warning rather than an error. Can you do that?
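The retry-then-warn behavior Simo describes might look like the Ansible sketch below. This is illustrative only (task text, variable names, and retry counts are assumptions, not the actual openshift-ansible code):

```yaml
# Sketch only: illustrative names/values, not the real openshift-ansible task.
- name: Fixup shared-resource-viewer role
  command: oc replace -f /tmp/shared_resource_viewer_role.yaml -n openshift
  register: fixup_result
  until: fixup_result.rc == 0
  retries: 3
  delay: 5
  ignore_errors: true        # exhausting the retries must not abort the upgrade

- name: Warn instead of failing if the role could not be replaced
  debug:
    msg: "WARNING: shared-resource-viewer role was not reconciled; replace it manually."
  when: fixup_result.rc != 0
```

The `until`/`retries`/`delay` combination re-runs the command on the Conflict error, and `ignore_errors` downgrades a final failure to a warning so the reconcile gate doesn't fail the whole upgrade.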
https://github.com/openshift/openshift-ansible/pull/5832 proposed fix
Version: atomic-openshift-utils-3.7.0-0.185.0.git.0.eb61aff.el7.noarch

Checked in https://github.com/openshift/openshift-ansible/pull/5930 for blocker bug https://bugzilla.redhat.com/show_bug.cgi?id=1506141.

Steps:
1. Container install OCP v3.6
2. Run pre-upgrade
3. Run upgrade:
   # ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml

Upgrade succeeded.
Unfortunately, hit the issue again when upgrading another containerized cluster deployed on Atomic hosts.

TASK [fail] *****************************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "failed": true, "msg": "Upgrade cannot continue. The following masters did not finish reconciling: x.x.x.x"}

The installer version is the same as in comment 9. Upgrade log in attachment.
Checked the log; this time it failed earlier than [Fixup shared-resource-viewer role]. At task [openshift_cli : Copy client binaries/symlinks out of CLI image for use on the host], one master failed, which caused the whole upgrade to fail at the end.

fatal: [x.x.x.x]: FAILED! => {"changed": false, "failed": true, "module_stderr": "Shared connection to host-8-241-39.host.centralci.eng.rdu2.redhat.com closed.", "msg": "MODULE FAILURE", "rc": 0}

module_stdout:
Traceback (most recent call last):
  File "/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py", line 175, in <module>
    main()
  File "/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py", line 166, in main
    binary_syncer.sync()
  File "/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py", line 48, in sync
    return self._sync_docker()
  File "/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py", line 95, in _sync_docker
    self._sync_binaries()
  File "/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py", line 107, in _sync_binaries
    self._sync_binary('oc')
  File "/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py", line 141, in _sync_binary
    shutil.move(src_path, dest_path)
  File "/usr/lib64/python2.7/shutil.py", line 301, in move
    copy2(src, real_dst)
  File "/usr/lib64/python2.7/shutil.py", line 130, in copy2
    copyfile(src, dst)
  File "/usr/lib64/python2.7/shutil.py", line 83, in copyfile
    with open(dst, 'wb') as fdst:
IOError: [Errno 26] Text file busy: '/usr/local/bin/oc'
(In reply to liujia from comment #12)
> Checked log, it seems it failed more earlier than [Fixup
> shared-resource-viewer role] this time. [traceback snipped]
> IOError: [Errno 26] Text file busy: '/usr/local/bin/oc'

That happens when someone is using the oc command on the host. Are you running `oc watch` or something else that runs the oc command repeatedly while performing the upgrade?
https://bugzilla.redhat.com/show_bug.cgi?id=1423363 is the bug for not being able to replace the oc command for containerized environments.
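For context on the ETXTBSY failure above: opening a currently executing binary for writing fails with "Text file busy" (which is what `shutil.move` falls back to when moving across filesystems), but renaming a new file over it succeeds, because rename() only swaps the directory entry. A minimal sketch of that pattern; the helper name is hypothetical, not the actual openshift-ansible module code:

```python
import os
import shutil
import tempfile


def replace_binary(src_path, dest_path):
    """Replace dest_path without ever opening it for write.

    Copying into a running executable raises IOError/OSError with
    ETXTBSY (Errno 26, 'Text file busy'); os.rename() replaces the
    directory entry instead, which works even while the old binary
    is executing. Hypothetical helper for illustration only.
    """
    dest_dir = os.path.dirname(dest_path)
    # Stage the new file on the same filesystem so os.rename() is atomic
    # and never degrades into a copy.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    os.close(fd)
    shutil.copy2(src_path, tmp_path)
    os.chmod(tmp_path, 0o755)      # binaries must stay executable
    os.rename(tmp_path, dest_path)  # atomic swap, no ETXTBSY
```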
Verified on openshift-ansible-3.7.0-0.188.0.git.0.aebb674.el7.noarch. Issue in comment 12 will be tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1423363.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188