1730736 – [3.10] Atomic Host - Upgrade failed at Task: Wait for node to be ready

Summary: [3.10] Atomic Host - Upgrade failed at Task: Wait for node to be ready

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	3.10.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.10.z
Assignee:	Seth Jennings
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Depends On:	1720978
Blocks:	1508040

Reported:	2019-07-17 13:43 UTC by Vikas Laad
Modified:	2019-08-28 17:08 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1720978
Environment:
Last Closed:	2019-08-28 17:08:14 UTC
Target Upstream Version:
Embargoed:

Attachments
Node NotReady (5.73 MB, text/plain) 2019-08-27 12:44 UTC, Joseph Callen

Comment 1 Russell Teague 2019-07-17 13:54:21 UTC

New bug for further investigation of the issue on atomic host.

Comment 4 Joseph Callen 2019-07-30 17:10:12 UTC

Fixed the ASB issue.  Now I see the issue reported.  Investigating.

Comment 16 Scott Dodson 2019-08-23 13:23:33 UTC

QE, there's suspicion that this may be related to a bug in the container runtime, does this problem exist in the latest versions of Atomic Host?

Comment 17 Johnny Liu 2019-08-26 08:24:12 UTC

(In reply to Scott Dodson from comment #16)
> QE, there's suspicion that this may be related to a bug in the container
> runtime, does this problem exist in the latest versions of Atomic Host?

@wmeng, please help check this once more.

Comment 18 Weihua Meng 2019-08-27 04:40:46 UTC

The latest Atomic Host hits this issue, too.
openshift-ansible-3.10.165-1.git.0.5ef95e3.el7

Red Hat Enterprise Linux Atomic Host 7.7.0
Linux 3.10.0-1062.el7.x86_64
docker-1.13.1-103.git7f2769b.el7.x86_64

When the upgrade failed:
# oc get nodes
NAME                                 STATUS                        ROLES     AGE       VERSION
wmengug4ah770-master-etcd-zone1-1    Ready                         master    15h       v1.10.0+b81c8f8
wmengug4ah770-master-etcd-zone2-1    Ready                         master    15h       v1.10.0+b81c8f8
wmengug4ah770-master-etcd-zone2-2    Ready                         master    15h       v1.10.0+b81c8f8
wmengug4ah770-node-zone1-primary-1   Ready                         compute   15h       v1.9.1+a0ce1bc657
wmengug4ah770-node-zone2-primary-1   Ready                         compute   15h       v1.9.1+a0ce1bc657
wmengug4ah770-nrriz-1                NotReady,SchedulingDisabled   infra     15h       v1.10.0+b81c8f8
wmengug4ah770-nrriz-2                Ready                         <none>    15h       v1.9.1+a0ce1bc657

upgrade log:
https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Run-Ansible-Playbooks-Nextge/470/consoleFull
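
A minimal sketch, not part of the original report, of how a node listing like the one above could be checked mechanically for NotReady nodes and mixed versions. It is written in Python and assumes the five-column plain-text layout shown in this comment (NAME, STATUS, ROLES, AGE, VERSION) with no whitespace inside a field. The function names are hypothetical, and the sample rows are copied from the listing above.

```python
from collections import defaultdict

# Three rows copied from the `oc get nodes` output in this comment.
SAMPLE = """\
NAME                                 STATUS                        ROLES     AGE       VERSION
wmengug4ah770-master-etcd-zone1-1    Ready                         master    15h       v1.10.0+b81c8f8
wmengug4ah770-node-zone1-primary-1   Ready                         compute   15h       v1.9.1+a0ce1bc657
wmengug4ah770-nrriz-1                NotReady,SchedulingDisabled   infra     15h       v1.10.0+b81c8f8
"""


def parse_oc_get_nodes(text):
    """Turn the plain-text table into a list of dicts keyed by column name.

    Assumption: columns are separated by whitespace and no field contains
    whitespace, which holds for the listing in this bug.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def not_ready(nodes):
    """Names of nodes whose comma-separated STATUS includes NotReady."""
    return [n["NAME"] for n in nodes if "NotReady" in n["STATUS"].split(",")]


def by_version(nodes):
    """Group node names by reported VERSION to expose a half-finished upgrade."""
    groups = defaultdict(list)
    for n in nodes:
        groups[n["VERSION"]].append(n["NAME"])
    return dict(groups)


if __name__ == "__main__":
    nodes = parse_oc_get_nodes(SAMPLE)
    print("NotReady:", not_ready(nodes))
    for version, names in by_version(nodes).items():
        print(version, names)
```

Splitting STATUS on commas keeps `NotReady,SchedulingDisabled` from being mistaken for a plain `Ready` match.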

Comment 20 Joseph Callen 2019-08-27 12:44:26 UTC

Created attachment 1608562
Node NotReady

atomic-openshift-node logs
ec2-23-20-104-227.compute-1.amazonaws.com

Comment 21 Joseph Callen 2019-08-27 12:46:57 UTC

Aug 26 21:52:42 ip-172-18-10-19.ec2.internal atomic-openshift-node[432]: I0826 21:52:42.130753     444 container_manager_linux.go:266] Creating device plugin manager: true
Aug 26 21:52:42 ip-172-18-10-19.ec2.internal atomic-openshift-node[432]: I0826 21:52:42.130766     444 manager.go:102] Creating Device Plugin manager at /var/lib/kubelet/device-plugins/kubelet.sock


[root@ip-172-18-10-19 ~]# ls -alh /var/lib/kubelet/device-plugins/
total 0
drwxr-xr-x. 2 root root  6 Aug 26 21:52 .
drwxr-x---. 3 root root 28 Aug 26 21:52 ..
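
The log line above names a socket path, while the directory listing shows no such file. A minimal Python sketch, not from the original report, of that same check in script form follows. The path is taken from the log line; the function name is hypothetical. This bug does not state whether the absent socket is a cause or merely a symptom, so the check is only an observation aid.

```python
import os
import stat

# Path copied from the "Creating Device Plugin manager at ..." log line.
DEVICE_PLUGIN_SOCKET = "/var/lib/kubelet/device-plugins/kubelet.sock"


def socket_present(path):
    """Return True only if `path` exists and is a Unix domain socket.

    A missing path, or a path that is a regular file or directory,
    yields False. Other errors (for example permission denied) are
    left to propagate so they are not mistaken for "missing".
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return False
    return stat.S_ISSOCK(mode)


if __name__ == "__main__":
    state = "present" if socket_present(DEVICE_PLUGIN_SOCKET) else "missing"
    print(DEVICE_PLUGIN_SOCKET, "is", state)
```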

Comment 27 Scott Dodson 2019-08-28 17:08:14 UTC

The root cause was determined to be the same as Bug 1508040.

The suggested workaround is to reboot the affected node and restart the upgrade.
