Bug 1504525 - Upgrade failed due to masters can not finish reconciling
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Upgrade
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.7.0
Assigned To: Scott Dodson
QA Contact: liujia
Depends On:
Blocks:
Reported: 2017-10-20 02:07 EDT by liujia
Modified: 2017-11-28 17:18 EST
CC List: 6 users

See Also:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-28 17:18:08 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments


External Trackers
Tracker ID: Red Hat Product Errata RHSA-2017:3188
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update
Last Updated: 2017-11-28 21:34:54 EST

Description liujia 2017-10-20 02:07:54 EDT
Description of problem:
Run upgrade against a containerized HA env; upgrade failed at task [Fixup shared-resource-viewer role].

Failure summary:
  1. Hosts:    x.x.x.x
     Play:     Reconcile Cluster Roles and Cluster Role Bindings and Security Context Constraints
     Task:     Fixup shared-resource-viewer role
     Message:  {u'returncode': 1, u'cmd': u'/usr/local/bin/oc replace -f /tmp/shared_resource_viewer_role.yaml -n openshift', u'results': {}, u'stderr': u'Error from server (Conflict): error when replacing "/tmp/shared_resource_viewer_role.yaml": Operation cannot be fulfilled on roles "shared-resource-viewer": the object has been modified; please apply your changes to the latest version and try again\n', u'stdout': u''}

  2. Hosts:    localhost
     Play:     Gate on reconcile
     Task:     fail
     Message:  Upgrade cannot continue. The following masters did not finish reconciling: x.x.x.x

==============================
The command can be run manually on the host successfully, but the upgrade still failed after that.
# oc replace -f /tmp/shared_resource_viewer_role.yaml -n openshift
role "shared-resource-viewer" replaced

Version-Release number of the following components:
openshift-ansible-3.7.0-0.167.0.git.0.0e34535.el7.noarch
ansible-2.4.1.0-0.1.beta2.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Container install ocp v3.6
2. Run pre-upgrade
3. Run upgrade
#ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
Comment 3 liujia 2017-10-20 03:36:10 EDT
Container upgrade blocked.
Comment 4 Scott Dodson 2017-10-20 09:36:22 EDT
Simo,

Should we switch from replace to patch?

--
Scott
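
A hedged sketch of what "switch from replace to patch" could look like (not the actual openshift-ansible change): a merge patch does not submit a resourceVersion, so it avoids the Conflict that "oc replace" hits above. The patch file name and its contents are hypothetical here.

# oc patch role shared-resource-viewer -n openshift --type=merge -p "$(cat /tmp/shared_resource_viewer_patch.json)"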
Comment 5 Scott Dodson 2017-10-20 09:37:14 EDT
Alternatively we can wrap it in retries but that's a dirty hack.
Comment 6 Simo Sorce 2017-10-20 10:23:08 EDT
Scott,
two things here.

It looks like the script is running for the 3.6 -> 3.7 upgrade and it shouldn't, as the 3.7 code will reconcile on its own. This was meant to be run only for < 3.7 upgrades, so 3.5 -> 3.6 and (if we do it) 3.6 to a higher 3.6 release.

The other thing is that we probably want to retry a couple of times if it fails and then finally give up with a warning and not an error.

Can you do that?
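
A hedged sketch of that retry-then-warn behaviour, written as a plain shell loop rather than the actual openshift-ansible change (the real fix would live in the playbook task itself):

for attempt in 1 2 3; do
    # same command the failing task runs; path taken from the failure summary above
    oc replace -f /tmp/shared_resource_viewer_role.yaml -n openshift && exit 0
    sleep 5   # give whatever modified the role (e.g. the 3.7 reconciler) time to settle
done
echo "WARNING: could not update the shared-resource-viewer role after 3 attempts" >&2
exit 0   # give up with a warning, not an error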
Comment 7 Scott Dodson 2017-10-23 08:49:03 EDT
https://github.com/openshift/openshift-ansible/pull/5832 proposed fix
Comment 9 liujia 2017-10-31 02:14:40 EDT
Version:
atomic-openshift-utils-3.7.0-0.185.0.git.0.eb61aff.el7.noarch

Checked in https://github.com/openshift/openshift-ansible/pull/5930 for blocker bug https://bugzilla.redhat.com/show_bug.cgi?id=1506141.


Steps:
1. Container install ocp v3.6
2. Run pre-upgrade
3. Run upgrade
#ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml


Upgrade succeeded.
Comment 10 liujia 2017-10-31 02:24:08 EDT
Unfortunately, hit the issue again when upgrading another container cluster deployed on atomic hosts.

TASK [fail] *****************************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "failed": true, "msg": "Upgrade cannot continue. The following masters did not finish reconciling: x.x.x.x"}


The installer version is the same as in comment 9. The upgrade log is in the attachment.
Comment 12 liujia 2017-10-31 02:43:08 EDT
Checked the log; it seems it failed earlier than [Fixup shared-resource-viewer role] this time. At task [openshift_cli : Copy client binaries/symlinks out of CLI image for use on the host], one of the masters failed, and as a result the whole upgrade failed at the end.

fatal: [x.x.x.x]: FAILED! => {"changed": false, "failed": true, "module_stderr": "Shared connection to host-8-241-39.host.centralci.eng.rdu2.redhat.com closed.\r\n", "module_stdout": "Traceback (most recent call last):\r\n  File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 175, in <module>\r\n    main()\r\n  File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 166, in main\r\n    binary_syncer.sync()\r\n  File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 48, in sync\r\n    return self._sync_docker()\r\n  File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 95, in _sync_docker\r\n    self._sync_binaries()\r\n  File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 107, in _sync_binaries\r\n    self._sync_binary('oc')\r\n  File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 141, in _sync_binary\r\n    shutil.move(src_path, dest_path)\r\n  File \"/usr/lib64/python2.7/shutil.py\", line 301, in move\r\n    copy2(src, real_dst)\r\n  File \"/usr/lib64/python2.7/shutil.py\", line 130, in copy2\r\n    copyfile(src, dst)\r\n  File \"/usr/lib64/python2.7/shutil.py\", line 83, in copyfile\r\n    with open(dst, 'wb') as fdst:\r\nIOError: [Errno 26] Text file busy: '/usr/local/bin/oc'\r\n", "msg": "MODULE FAILURE", "rc": 0}
Comment 15 Scott Dodson 2017-10-31 09:30:14 EDT
(In reply to liujia from comment #12)
> Checked log, it seems it failed more earlier than [Fixup
> shared-resource-viewer role] this time. At task [openshift_cli : Copy client
> binaries/symlinks out of CLI image for use on the host], one of master
> failed and result the whole upgrade failed at the end.
> 
> fatal: [x.x.x.x]: FAILED! => {"changed": false, "failed": true,
> "module_stderr": "Shared connection to
> host-8-241-39.host.centralci.eng.rdu2.redhat.com closed.\r\n",
> "module_stdout": "Traceback (most recent call last):\r\n  File
> \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\",
> line 175, in <module>\r\n    main()\r\n  File
> \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\",
> line 166, in main\r\n    binary_syncer.sync()\r\n  File
> \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\",
> line 48, in sync\r\n    return self._sync_docker()\r\n  File
> \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\",
> line 95, in _sync_docker\r\n    self._sync_binaries()\r\n  File
> \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\",
> line 107, in _sync_binaries\r\n    self._sync_binary('oc')\r\n  File
> \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\",
> line 141, in _sync_binary\r\n    shutil.move(src_path, dest_path)\r\n  File
> \"/usr/lib64/python2.7/shutil.py\", line 301, in move\r\n    copy2(src,
> real_dst)\r\n  File \"/usr/lib64/python2.7/shutil.py\", line 130, in
> copy2\r\n    copyfile(src, dst)\r\n  File
> \"/usr/lib64/python2.7/shutil.py\", line 83, in copyfile\r\n    with
> open(dst, 'wb') as fdst:\r\nIOError: [Errno 26] Text file busy:
> '/usr/local/bin/oc'\r\n", "msg": "MODULE FAILURE", "rc": 0}

That happens when someone is using the oc command on the host. Are you running `oc watch` or something else that's running the oc command repeatedly while performing the upgrade?
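
For reference, a common way to avoid an ETXTBSY ("Text file busy") error when replacing a binary that is currently executing is to stage the new file on the same filesystem and rename it into place; rename(2) swaps the directory entry without opening the running file for write (it is the open-for-write in the traceback above that fails with errno 26). This is a hedged illustration only, with a made-up source path, not a claim about how openshift-ansible actually handles it:

# cp /path/to/new/oc /usr/local/bin/.oc.tmp
# chmod 0755 /usr/local/bin/.oc.tmp
# mv -f /usr/local/bin/.oc.tmp /usr/local/bin/oc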
Comment 16 Scott Dodson 2017-10-31 14:17:38 EDT
https://bugzilla.redhat.com/show_bug.cgi?id=1423363 is the bug for not being able to replace the oc command for containerized environments.
Comment 18 liujia 2017-11-01 23:10:10 EDT
Verified on openshift-ansible-3.7.0-0.188.0.git.0.aebb674.el7.noarch. Issue in comment 12 will be tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1423363.
Comment 21 errata-xmlrpc 2017-11-28 17:18:08 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
