Bug 1579932
Summary: 3.9.27 to 3.10.0-0.47.0 upgrade failed: oc get node failed, but node ready a short time later

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
| Component: | Installer | Assignee: | Vadim Rutkovsky <vrutkovs> |
| Status: | CLOSED ERRATA | QA Contact: | Vikas Laad <vlaad> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.10.0 | CC: | aos-bugs, jokerman, mifiedle, mmccomas, wmeng |
| Target Milestone: | --- | | |
| Target Release: | 3.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-07-30 19:16:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Attachments:
- Attachment 1438754 [details]: ansible -vvv log
- Attachment 1438755 [details]: node.json showing the oc get node command run after the install failed
We've got multiple problems related to node bootstrapping that we're addressing: 1) we're working on pre-pulling the requisite images early in the process, and 2) we've identified a race condition in certificate approval which may contribute to this. Let's get those two issues addressed and we'll check back on this. I'll try to find the bugs associated with these two problems and link them here after standup.

https://bugzilla.redhat.com/show_bug.cgi?id=1578790 is the pre-pull images bug.

https://github.com/openshift/openshift-ansible/pull/8172 added pre-pulling and is in openshift-ansible-3.10.0-0.51.0. Can we please re-test this and see if the problem has been resolved?

@vlaad is retesting this now.

Upgrade completed fine to openshift v3.10.0-0.51.0 with the following version of openshift-ansible: commit d0c4e258276e316d26d7322c4064df5b915f8fd6.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816
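For context on the two fixes discussed in the comments above, a couple of minimal sketches. First, the certificate-approval race: while a node's CSRs sit unapproved, `oc get node` reports it NotReady. Pending CSRs can be listed and approved by hand with standard `oc` commands (the bulk-approve pipeline below is illustrative, not the installer's actual mechanism):

```bash
# List certificate signing requests; a freshly upgraded node stays
# NotReady while its CSRs are still pending approval.
oc get csr

# Approve all outstanding CSRs in bulk (illustrative; re-approving an
# already-approved CSR is harmless).
oc get csr -o name | xargs oc adm certificate approve
```

Second, the pre-pull change from PR 8172 amounts to pulling the node's images before services are restarted, so readiness is not blocked on a slow registry pull. Roughly (the image reference here is an assumption for illustration, not taken from the PR):

```bash
# Pre-pull the node image ahead of the service restart; image name and
# tag are assumed for this sketch.
docker pull registry.access.redhat.com/openshift3/ose-node:v3.10
```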
Created attachment 1438752 [details]
inventory

Description of problem:
Control plane upgrade to 3.10 with the attached inventory using openshift-ansible 3.10.0-0.47.0. The upgrade failed on the first master at the `/usr/bin/oc get node ip-172-31-1-199.us-west-2.compute.internal` check, indicating the node was not ready. The install failed, and while investigating I ran the same command a while later and the node was ready (json file attached). All systems are AWS m4.xlarge (4 vCPU / 16 GB).

Version-Release number of the following components:

    root@ip-172-31-31-229: ~ # rpm -q openshift-ansible
    openshift-ansible-3.10.0-0.47.0.git.0.c018c8f.el7.noarch
    root@ip-172-31-31-229: ~ # rpm -q ansible
    ansible-2.4.4.0-1.el7ae.noarch
    root@ip-172-31-31-229: ~ # ansible --version
    ansible 2.4.4.0
      config file = /etc/ansible/ansible.cfg
      configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
      ansible python module location = /usr/lib/python2.7/site-packages/ansible
      executable location = /usr/bin/ansible
      python version = 2.7.5 (default, May 4 2018, 09:38:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-34)]

How reproducible:
Once so far

Steps to Reproduce:
1. Control plane upgrade of 3.9.27 to 3.10.0-0.47.0 in an HA cluster (see inventory)

Actual results:
The `oc get node ip-172-31-1-199.us-west-2.compute.internal` task failed, reporting the node as not ready; the same command run a short time later showed the node Ready. The ansible-playbook -vvv log and the node's JSON output are attached.
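As a footnote to the description, the flapping readiness can be reproduced by hand by polling the node's Ready condition. A minimal sketch, assuming the node name from this report (the jsonpath query is an illustration, not the playbook's actual check):

```bash
# Poll the node's Ready condition until it flips to "True".
NODE=ip-172-31-1-199.us-west-2.compute.internal

until [ "$(oc get node "$NODE" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" = "True" ]; do
  echo "Waiting for $NODE to become Ready..."
  sleep 10
done
echo "$NODE is Ready"
```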