Bug 1504525 - Upgrade failed because masters could not finish reconciling
Summary: Upgrade failed because masters could not finish reconciling
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.7.0
Assignee: Scott Dodson
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-10-20 06:07 UTC by liujia
Modified: 2017-11-28 22:18 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-28 22:18:08 UTC
Target Upstream Version:
Embargoed:




Links
System ID: Red Hat Product Errata RHSA-2017:3188
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update
Last Updated: 2017-11-29 02:34:54 UTC

Description liujia 2017-10-20 06:07:54 UTC
Description of problem:
Ran an upgrade against a containerized HA env; the upgrade failed at task [Fixup shared-resource-viewer role].

Failure summary:
  1. Hosts:    x.x.x.x
     Play:     Reconcile Cluster Roles and Cluster Role Bindings and Security Context Constraints
     Task:     Fixup shared-resource-viewer role
     Message:  {u'returncode': 1, u'cmd': u'/usr/local/bin/oc replace -f /tmp/shared_resource_viewer_role.yaml -n openshift', u'results': {}, u'stderr': u'Error from server (Conflict): error when replacing "/tmp/shared_resource_viewer_role.yaml": Operation cannot be fulfilled on roles "shared-resource-viewer": the object has been modified; please apply your changes to the latest version and try again\n', u'stdout': u''}

  2. Hosts:    localhost
     Play:     Gate on reconcile
     Task:     fail
     Message:  Upgrade cannot continue. The following masters did not finish reconciling: x.x.x.x

==============================
The command can be run manually on the host successfully, but afterwards the upgrade still failed.
# oc replace -f /tmp/shared_resource_viewer_role.yaml -n openshift
role "shared-resource-viewer" replaced

Version-Release number of the following components:
openshift-ansible-3.7.0-0.167.0.git.0.0e34535.el7.noarch
ansible-2.4.1.0-0.1.beta2.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Container install ocp v3.6
2. Run pre-upgrade
3. Run upgrade
#ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 3 liujia 2017-10-20 07:36:10 UTC
Container upgrade blocked.

Comment 4 Scott Dodson 2017-10-20 13:36:22 UTC
Simo,

Should we switch from replace to patch?

--
Scott
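
For reference, a patch sends only a delta that the API server merges into the latest version of the object, so it cannot hit this Conflict. A hypothetical illustration of the mechanics; the annotation key is a placeholder, not the actual change:

# Strategic-merge patch applied server-side against the current object,
# so no stale resourceVersion check is involved:
oc patch role shared-resource-viewer -n openshift \
    -p '{"metadata":{"annotations":{"example.org/note":"patched"}}}'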

Comment 5 Scott Dodson 2017-10-20 13:37:14 UTC
Alternatively we can wrap it in retries but that's a dirty hack.

Comment 6 Simo Sorce 2017-10-20 14:23:08 UTC
Scott,
two things here.

It looks like the script is running for the 3.6 -> 3.7 upgrade and it shouldn't, as the 3.7 code will reconcile on its own. This was meant to run only for < 3.7 upgrades, so 3.5 -> 3.6 and (if we do it) 3.6 to a higher 3.6 release.

The other thing is that we probably want to retry a couple of times if it fails and then finally give up with a warning and not an error.

Can you do that?
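
For reference, a minimal shell sketch of the retry-then-warn behavior described above (the loop count and delay are arbitrary; the actual fix is the openshift-ansible change proposed in the next comment):

replaced=0
for attempt in 1 2 3; do
    if oc replace -f /tmp/shared_resource_viewer_role.yaml -n openshift; then
        replaced=1
        break
    fi
    sleep 5   # give the concurrent reconciler time to settle
done
if [ "$replaced" -ne 1 ]; then
    # Warn instead of failing the upgrade:
    echo "WARNING: could not update the shared-resource-viewer role" >&2
fi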

Comment 7 Scott Dodson 2017-10-23 12:49:03 UTC
https://github.com/openshift/openshift-ansible/pull/5832 proposed fix

Comment 9 liujia 2017-10-31 06:14:40 UTC
Version:
atomic-openshift-utils-3.7.0-0.185.0.git.0.eb61aff.el7.noarch

Checked in https://github.com/openshift/openshift-ansible/pull/5930 for blocker bug https://bugzilla.redhat.com/show_bug.cgi?id=1506141.


Steps:
1. Container install ocp v3.6
2. Run pre-upgrade
3. Run upgrade
#ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml


Upgrade succeeded.

Comment 10 liujia 2017-10-31 06:24:08 UTC
Unfortunately, hit the issue again when upgrading another containerized cluster deployed on Atomic hosts.

TASK [fail] *****************************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "failed": true, "msg": "Upgrade cannot continue. The following masters did not finish reconciling: x.x.x.x"}


The installer version is the same as in comment 9. The upgrade log is in the attachment.

Comment 12 liujia 2017-10-31 06:43:08 UTC
Checked the log; it seems it failed earlier than [Fixup shared-resource-viewer role] this time. At task [openshift_cli : Copy client binaries/symlinks out of CLI image for use on the host], one of the masters failed, which caused the whole upgrade to fail at the end.

fatal: [x.x.x.x]: FAILED! => {"changed": false, "failed": true, "module_stderr": "Shared connection to host-8-241-39.host.centralci.eng.rdu2.redhat.com closed.\r\n", "module_stdout": "Traceback (most recent call last):\r\n  File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 175, in <module>\r\n    main()\r\n  File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 166, in main\r\n    binary_syncer.sync()\r\n  File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 48, in sync\r\n    return self._sync_docker()\r\n  File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 95, in _sync_docker\r\n    self._sync_binaries()\r\n  File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 107, in _sync_binaries\r\n    self._sync_binary('oc')\r\n  File \"/tmp/ansible_d4sRB6/ansible_module_openshift_container_binary_sync.py\", line 141, in _sync_binary\r\n    shutil.move(src_path, dest_path)\r\n  File \"/usr/lib64/python2.7/shutil.py\", line 301, in move\r\n    copy2(src, real_dst)\r\n  File \"/usr/lib64/python2.7/shutil.py\", line 130, in copy2\r\n    copyfile(src, dst)\r\n  File \"/usr/lib64/python2.7/shutil.py\", line 83, in copyfile\r\n    with open(dst, 'wb') as fdst:\r\nIOError: [Errno 26] Text file busy: '/usr/local/bin/oc'\r\n", "msg": "MODULE FAILURE", "rc": 0}

Comment 15 Scott Dodson 2017-10-31 13:30:14 UTC
(In reply to liujia from comment #12)
> Checked the log; it seems it failed earlier than [Fixup
> shared-resource-viewer role] this time. At task [openshift_cli : Copy client
> binaries/symlinks out of CLI image for use on the host], one of the masters
> failed, which caused the whole upgrade to fail at the end.
> 
> [full module traceback duplicated from comment 12, ending in:
> IOError: [Errno 26] Text file busy: '/usr/local/bin/oc']

That happens when someone is using the oc command on the host. Are you running `oc watch` or something else that's running the oc command repeatedly while performing the upgrade?
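
For background on the Errno 26 failure: ETXTBSY is raised when a file that is currently being executed is opened for writing. Per the traceback, shutil.move falls back to an open-and-copy, which trips over a running /usr/local/bin/oc. The usual workaround is to unlink the binary first and copy to a fresh inode; a sketch, with a placeholder source path:

# Unlinking only removes the directory entry; any running process keeps the
# old inode alive, so in-flight oc invocations are undisturbed:
rm -f /usr/local/bin/oc
# The copy then creates a brand-new inode and cannot hit ETXTBSY:
cp /tmp/oc-from-cli-image /usr/local/bin/oc   # placeholder source path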

Comment 16 Scott Dodson 2017-10-31 18:17:38 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1423363 is the bug for not being able to replace the oc command for containerized environments.

Comment 18 liujia 2017-11-02 03:10:10 UTC
Verified on openshift-ansible-3.7.0-0.188.0.git.0.aebb674.el7.noarch. Issue in comment 12 will be tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1423363.

Comment 21 errata-xmlrpc 2017-11-28 22:18:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

