Bug 1579932 - 3.9.27 to 3.10.0-0.47.0 upgrade failed: oc get node failed, but node ready a short time later
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.10.0
Assignee: Vadim Rutkovsky
QA Contact: Vikas Laad
Reported: 2018-05-18 17:44 UTC by Mike Fiedler
Modified: 2018-07-30 19:16 UTC (History)

Last Closed: 2018-07-30 19:16:09 UTC


Attachments
inventory (7.71 KB, text/plain)
2018-05-18 17:44 UTC, Mike Fiedler
ansible -vvv log (1.99 MB, text/plain)
2018-05-18 17:52 UTC, Mike Fiedler
node.json showing the oc get node command run after install failed (12.21 KB, text/plain)
2018-05-18 17:53 UTC, Mike Fiedler


Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2018:1816 | None | None | None | 2018-07-30 19:16:29 UTC

Description Mike Fiedler 2018-05-18 17:44:26 UTC
Created attachment 1438752
inventory

Description of problem:

Control plane upgrade from 3.9.27 to 3.10 with the attached inventory, using openshift-ansible 3.10.0-0.47.0. The upgrade failed on the first master at the /usr/bin/oc get node ip-172-31-1-199.us-west-2.compute.internal check, which indicated the node was not ready.

After the upgrade failed, I ran the same command a while later while investigating, and the node was ready (JSON file attached). All systems are AWS m4.xlarge (4 vCPU/16 GB).
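
The failure mode described above (a readiness check that fails once but would pass a short time later) can be illustrated with a small polling sketch. This is not what openshift-ansible does; it is a minimal Python illustration that assumes the node object carries the standard Kubernetes status.conditions list, as returned by oc get node NAME -o json. The helper names is_node_ready, fetch_node and wait_for_node_ready, the injectable fetch and sleep parameters, and the retry and delay defaults are all hypothetical.

```python
import json
import subprocess
import time


def is_node_ready(node):
    """Return True if the node reports a Ready condition with status "True"."""
    conditions = node.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def fetch_node(name):
    """Fetch a node object via the oc CLI (assumes oc is on PATH and logged in)."""
    out = subprocess.check_output(["oc", "get", "node", name, "-o", "json"])
    return json.loads(out)


def wait_for_node_ready(name, fetch=fetch_node, retries=30, delay=10,
                        sleep=time.sleep):
    """Poll until the node is Ready or the retry budget is exhausted."""
    for attempt in range(retries):
        try:
            if is_node_ready(fetch(name)):
                return True
        except (subprocess.CalledProcessError, ValueError):
            # The API may be briefly unavailable, or the node not yet registered.
            pass
        if attempt < retries - 1:
            sleep(delay)
    return False
```

With a retry budget like this, a node that becomes ready a few minutes after the first check would be reported as ready rather than failing the run; the right budget depends on the environment and is not taken from this report.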

Version-Release number of the following components:
root@ip-172-31-31-229: ~ # rpm -q openshift-ansible
openshift-ansible-3.10.0-0.47.0.git.0.c018c8f.el7.noarch
root@ip-172-31-31-229: ~ # rpm -q ansible
ansible-2.4.4.0-1.el7ae.noarch
root@ip-172-31-31-229: ~ # ansible --version
ansible 2.4.4.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May  4 2018, 09:38:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-34)]

How reproducible: Once so far

Steps to Reproduce:
1. Control plane upgrade of 3.9.27 to 3.10.0-0.47.0 in an HA cluster (see inventory)


Actual results:
The upgrade failed on the first master. The full output is in the attached ansible -vvv log.

Comment 1 Mike Fiedler 2018-05-18 17:52:45 UTC
Created attachment 1438754
ansible -vvv log

Comment 2 Mike Fiedler 2018-05-18 17:53:34 UTC
Created attachment 1438755
node.json showing the oc get node command run after install failed

Comment 3 Scott Dodson 2018-05-21 12:56:25 UTC
We have multiple problems related to node bootstrapping that we're addressing:

1) We're working on pre-pulling the required images early in the process.
2) We've identified a race condition in certificate approval that may contribute to this.

Let's get those two issues addressed and then check back on this. I'll try to find the bugs associated with these two problems and link them here after standup.
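
For readers unfamiliar with the second item: a bootstrapping node typically submits a certificate signing request (CSR) that has to be approved before the node can register and report Ready. The following is only an illustrative Python sketch, not the openshift-ansible fix. It assumes the JSON shape returned by oc get csr -o json, where a request with no entries under status.conditions is still pending. The function name pending_csr_names is hypothetical.

```python
def pending_csr_names(csr_list):
    """Return the names of CSRs that have no approval or denial condition yet."""
    names = []
    for item in csr_list.get("items", []):
        # A missing or empty conditions list means the request is still pending.
        conditions = item.get("status", {}).get("conditions") or []
        if not conditions:
            names.append(item["metadata"]["name"])
    return names
```

Each returned name could then be passed to oc adm certificate approve. A check that runs only once can miss a CSR that a node submits afterwards, which is one way a race of this kind could arise; the actual race is not described in this report.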

Comment 4 Scott Dodson 2018-05-21 13:16:27 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1578790 is the pre-pull images bug

Comment 5 Scott Dodson 2018-05-24 13:31:12 UTC
https://github.com/openshift/openshift-ansible/pull/8172 added pre-pulling and is in openshift-ansible-3.10.0-0.51.0. Can we please re-test this and see if the problem has been resolved?

Comment 6 Mike Fiedler 2018-05-24 15:55:28 UTC
@vlaad is retesting this now

Comment 7 Vikas Laad 2018-05-24 19:30:59 UTC
The upgrade to OpenShift v3.10.0-0.51.0 completed successfully with the following version of openshift-ansible:

commit d0c4e258276e316d26d7322c4064df5b915f8fd6

Comment 9 errata-xmlrpc 2018-07-30 19:16:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

