Bug 1705642 - restart.yml role doesn't check/wait for node-config.yaml to land on the nodes - this adds that check
Summary: restart.yml role doesn't check/wait for node-config.yaml to land on the nodes...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.10.z
Assignee: Joseph Callen
QA Contact: Weihua Meng
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-05-02 16:12 UTC by Dan Yocum
Modified: 2019-07-24 13:47 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The sync pod creates node-config.yaml from the configmap, and this file is sometimes slow to appear. Consequence: The atomic-openshift-node service does not start properly. Fix: Restart the node service only once the file exists. Result: The upgrade process succeeds in a large cluster.
Clone Of:
Environment:
Last Closed: 2019-07-24 13:47:19 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:1755 0 None None None 2019-07-24 13:47:24 UTC

Description Dan Yocum 2019-05-02 16:12:34 UTC
Description of problem:

Upgrading from 3.9 to 3.10 can fail because node-config.yaml does not land on a node in time for the checks to finish.  This patch, written by Ben Draper at Experian, fixes that:

https://github.com/openshift/openshift-ansible/pull/11576

Also, here is a patch for 3.11 which may or may not be relevant - feel free to suggest changes to or close this PR:

https://github.com/openshift/openshift-ansible/pull/11574/files
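
A minimal sketch of the kind of check the PR adds (not the exact patch - see the PR diff for the real tasks): wait for the sync pod to write node-config.yaml before restarting the node service, so the restart does not race the configmap sync. The path and service name below are the standard OCP 3.x locations, but treat the task structure as illustrative only:

```yaml
# Illustrative only - the actual change lives in the restart.yml role.
- name: Wait for node-config.yaml to land on the node
  wait_for:
    path: /etc/origin/node/node-config.yaml
    timeout: 300

- name: Restart the node service once the config file exists
  service:
    name: atomic-openshift-node
    state: restarted
```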


Version-Release number of the following components:
rpm -q openshift-ansible

3.10.x

rpm -q ansible

ansible-2.6.x

How reproducible:

With a large number of nodes (>20), the failure occurs almost 100% of the time.  It can also happen in small clusters.

Comment 3 Weihua Meng 2019-07-15 06:53:31 UTC
Verified.
openshift-ansible-3.10.153-1.git.0.2363fa8.el7

upgrade OCP v3.9 cluster of 1 lb + 3 masters + 2 infra nodes + 21 compute nodes, success!

PLAY RECAP *********************************************************************
localhost                  : ok=38   changed=0    unreachable=0    failed=0   
qe-wmeng1ug39-lb-1.0715-j7k.qe.rhcloud.com : ok=71   changed=6    unreachable=0    failed=0   
qe-wmeng1ug39-master-etcd-1.0715-j7k.qe.rhcloud.com : ok=870  changed=237  unreachable=0    failed=0   
qe-wmeng1ug39-master-etcd-2.0715-j7k.qe.rhcloud.com : ok=394  changed=112  unreachable=0    failed=0   
qe-wmeng1ug39-master-etcd-3.0715-j7k.qe.rhcloud.com : ok=394  changed=112  unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-1.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-10.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-11.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-12.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-13.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-14.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-15.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-16.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-17.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-18.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-19.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-2.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-20.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-21.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-3.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-4.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-5.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-6.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-7.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-8.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-node-primary-9.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-nrri-1.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0   
qe-wmeng1ug39-nrri-2.0715-j7k.qe.rhcloud.com : ok=187  changed=60   unreachable=0    failed=0

Comment 5 errata-xmlrpc 2019-07-24 13:47:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1755

