Bug 1730736

Summary: [3.10] Atomic Host - Upgrade failed at Task: Wait for node to be ready
Product: OpenShift Container Platform Reporter: Vikas Laad <vlaad>
Component: Node Assignee: Seth Jennings <sjenning>
Status: CLOSED WONTFIX QA Contact: Sunil Choudhary <schoudha>
Severity: high Docs Contact:
Priority: high    
Version: 3.10.0 CC: aos-bugs, dustymabe, jcallen, jialiu, jokerman, mmccomas, padillon, rkrawitz, sdodson, wmeng, wsun
Target Milestone: --- Keywords: Regression, TestBlocker
Target Release: 3.10.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1720978 Environment:
Last Closed: 2019-08-28 17:08:14 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1720978    
Bug Blocks: 1508040    
Attachments:
Description Flags
Node NotReady none

Comment 1 Russell Teague 2019-07-17 13:54:21 UTC
New bug for further investigation of the issue on Atomic Host.

Comment 4 Joseph Callen 2019-07-30 17:10:12 UTC
Fixed the ASB issue.  Now I see the issue reported.  Investigating.

Comment 16 Scott Dodson 2019-08-23 13:23:33 UTC
QE, there's suspicion that this may be related to a bug in the container runtime. Does this problem exist in the latest versions of Atomic Host?

Comment 17 Johnny Liu 2019-08-26 08:24:12 UTC
(In reply to Scott Dodson from comment #16)
> QE, there's suspicion that this may be related to a bug in the container
> runtime. Does this problem exist in the latest versions of Atomic Host?

@wmeng, please help check this once more.

Comment 18 Weihua Meng 2019-08-27 04:40:46 UTC
The latest Atomic Host hits this issue, too.
openshift-ansible-3.10.165-1.git.0.5ef95e3.el7

Red Hat Enterprise Linux Atomic Host 7.7.0
Linux 3.10.0-1062.el7.x86_64
docker-1.13.1-103.git7f2769b.el7.x86_64

When the upgrade failed:
# oc get nodes
NAME                                 STATUS                        ROLES     AGE       VERSION
wmengug4ah770-master-etcd-zone1-1    Ready                         master    15h       v1.10.0+b81c8f8
wmengug4ah770-master-etcd-zone2-1    Ready                         master    15h       v1.10.0+b81c8f8
wmengug4ah770-master-etcd-zone2-2    Ready                         master    15h       v1.10.0+b81c8f8
wmengug4ah770-node-zone1-primary-1   Ready                         compute   15h       v1.9.1+a0ce1bc657
wmengug4ah770-node-zone2-primary-1   Ready                         compute   15h       v1.9.1+a0ce1bc657
wmengug4ah770-nrriz-1                NotReady,SchedulingDisabled   infra     15h       v1.10.0+b81c8f8
wmengug4ah770-nrriz-2                Ready                         <none>    15h       v1.9.1+a0ce1bc657

upgrade log:
https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Run-Ansible-Playbooks-Nextge/470/consoleFull

Comment 20 Joseph Callen 2019-08-27 12:44:26 UTC
Created attachment 1608562
Node NotReady

atomic-openshift-node logs
ec2-23-20-104-227.compute-1.amazonaws.com

Comment 21 Joseph Callen 2019-08-27 12:46:57 UTC
Aug 26 21:52:42 ip-172-18-10-19.ec2.internal atomic-openshift-node[432]: I0826 21:52:42.130753     444 container_manager_linux.go:266] Creating device plugin manager: true
Aug 26 21:52:42 ip-172-18-10-19.ec2.internal atomic-openshift-node[432]: I0826 21:52:42.130766     444 manager.go:102] Creating Device Plugin manager at /var/lib/kubelet/device-plugins/kubelet.sock


[root@ip-172-18-10-19 ~]# ls -alh /var/lib/kubelet/device-plugins/
total 0
drwxr-xr-x. 2 root root  6 Aug 26 21:52 .
drwxr-x---. 3 root root 28 Aug 26 21:52 ..
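
The listing above shows an empty directory even though the log reports the device plugin manager being created at kubelet.sock. As an illustration only, a minimal Python sketch of checking for that socket programmatically might look like this. The helper name `socket_present` is hypothetical and is not part of any OpenShift tooling.

```python
import os
import stat


def socket_present(path: str) -> bool:
    """Return True if `path` exists and is a Unix domain socket."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        # Nothing exists at this path, which matches the empty listing above.
        return False
    return stat.S_ISSOCK(mode)
```

On the affected node, calling `socket_present("/var/lib/kubelet/device-plugins/kubelet.sock")` as root would be expected to return False, given the empty directory listing.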

Comment 27 Scott Dodson 2019-08-28 17:08:14 UTC
The root cause was determined to be the same as Bug 1508040.

The suggested workaround is to reboot the affected node and restart the upgrade.
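
To identify which node needs the reboot, one option is to scan `oc get nodes` output, such as that in comment 18, for a NotReady status. The following is a minimal Python sketch, assuming the column layout shown in that comment. The function name `not_ready_nodes` is hypothetical, and the sample text is trimmed from comment 18.

```python
from typing import List


def not_ready_nodes(oc_output: str) -> List[str]:
    """Return names of nodes whose STATUS column includes NotReady."""
    nodes = []
    for line in oc_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        name, status = fields[0], fields[1]
        # STATUS can be a comma-separated list, e.g. NotReady,SchedulingDisabled
        if "NotReady" in status.split(","):
            nodes.append(name)
    return nodes


# Sample rows copied from comment 18 (assumed to be representative).
SAMPLE = """\
NAME                                 STATUS                        ROLES     AGE       VERSION
wmengug4ah770-master-etcd-zone1-1    Ready                         master    15h       v1.10.0+b81c8f8
wmengug4ah770-nrriz-1                NotReady,SchedulingDisabled   infra     15h       v1.10.0+b81c8f8
wmengug4ah770-nrriz-2                Ready                         <none>    15h       v1.9.1+a0ce1bc657
"""

print(not_ready_nodes(SAMPLE))  # ['wmengug4ah770-nrriz-1']
```

Splitting the STATUS field on commas avoids matching on substrings and handles combined states such as NotReady,SchedulingDisabled.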