Bug 1627268

Summary: [starter-us-east-1] NodeWithImpairedVolumes is not cleared after node reboot
Product: OpenShift Container Platform
Reporter: Justin Pierce <jupierce>
Component: Storage
Assignee: Bradley Childs <bchilds>
Status: CLOSED WONTFIX
QA Contact: Liang Xia <lxia>
Severity: low
Priority: high
Version: 3.11.0
CC: aos-bugs, aos-storage-staff, bchilds, hekumar, jokerman, lxia, mmccomas, mwoodson, nmalik, tnozicka
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-10-25 15:18:39 UTC
Type: Bug

Description Justin Pierce 2018-09-10 14:09:07 UTC
Description of problem:
During an upgrade from 3.10.9 to v3.11.0-0.21.0 (using openshift-ansible v3.11.0-0.28.0), one node lost a critical CNI configuration file. Prior to the upgrade, the node was recorded as Ready in the pre-upgrade report. After the upgrade, it reported NotReady and atomic-openshift-node was logging CNI errors:

Sep 10 13:23:58 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: I0910 13:23:58.532706   93041 cloud_request_manager.go:108] Node addresses from cloud provider for node "ip-172-31-63-97.e..." collected
Sep 10 13:23:59 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: I0910 13:23:59.215538   93041 container_manager_linux.go:428] [ContainerManager]: Discovered runtime cgroups name: /system...ker.service
Sep 10 13:24:00 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: W0910 13:24:00.548038   93041 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d
Sep 10 13:24:00 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: E0910 13:24:00.548211   93041 kubelet.go:2101] Container runtime network not ready: NetworkReady=false reason:NetworkPlugi...initialized
Sep 10 13:24:05 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: W0910 13:24:05.549542   93041 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d
Sep 10 13:24:05 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: E0910 13:24:05.549678   93041 kubelet.go:2101] Container runtime network not ready: NetworkReady=false reason:NetworkPlugi...initialized
Sep 10 13:24:08 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: I0910 13:24:08.532943   93041 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "ip-172-...2.internal"
Sep 10 13:24:08 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: I0910 13:24:08.539266   93041 cloud_request_manager.go:108] Node addresses from cloud provider for node "ip-172-31-63-97.e..." collected
Sep 10 13:24:10 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: W0910 13:24:10.550987   93041 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d
Sep 10 13:24:10 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: E0910 13:24:10.551085   93041 kubelet.go:2101] Container runtime network not ready: NetworkReady=false reason:NetworkPlugi...initialized
..etc..

The upgrade failed due to a CSR timeout. I believe this indicates some condition in which the CNI configuration file can be lost during the upgrade.

I manually restored the file content with the following, and the node recovered to Ready:

{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}
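
For reference, the kubelet above was scanning /etc/cni/net.d for network configs, so restoring the file amounts to writing that JSON back into that directory and restarting the node service. A minimal sketch, assuming a hypothetical filename 80-openshift-network.conf (the actual filename on the node may differ):

# Recreate the openshift-sdn CNI config in the directory the kubelet scans
cat <<'EOF' > /etc/cni/net.d/80-openshift-network.conf
{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}
EOF

# Restart the node service so the kubelet picks the config up
systemctl restart atomic-openshift-node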


Version-Release number of the following components:
ansible: 3.11.0-0.28.0
pre-upgrade ocp: 3.10.9
post-upgrade ocp: 3.11.0-0.21.0


Actual results:
08:54:10 TASK [openshift_node : Approve node certificates when bootstrapping] ***********
08:54:10 Sunday 09 September 2018  12:54:08 +0000 (0:00:00.255)       0:09:20.678 ****** 
08:54:10 FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
08:54:10 FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
08:54:16 FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).
08:54:17 ok: [starter-us-east-1-node-compute-f9fac -> None]
08:54:24 FAILED - RETRYING: Approve node certificates when bootstrapping (28 retries left).
...
08:57:18 FAILED - RETRYING: Approve node certificates when bootstrapping (1 retries left).
08:57:23 fatal: [starter-us-east-1-node-compute-123a1 -> None]: FAILED! => {"attempts": 30, "changed": false, "msg": "Could not find csr for nodes: ip-172-31-63-97.ec2.internal", "state": "unknown"}
08:57:23
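
For anyone hitting the same timeout: pending node CSRs can be inspected by hand with oc to confirm whether the node ever submitted one. A minimal sketch (the CSR name is illustrative):

# List outstanding certificate signing requests and who submitted them
oc get csr

# If a request from the node is pending, it can be approved manually (name is illustrative)
oc adm certificate approve csr-xxxxx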

Comment 2 Scott Dodson 2018-09-10 15:14:54 UTC
At the time this was reported, the node was NotReady,NotSchedulable. That shouldn't have prevented the node from getting SDN pods, which deploy the file that makes the node go Ready.

Justin has since deployed the missing CNI file and the node is now Ready and Schedulable; however, there are still no SDN pods scheduled on the node at this time.

Looking at the controllers I see two that seem to be active.

master-controllers-ip-172-31-54-162.ec2.internal spews errors like this:

I0910 14:57:32.526656       1 scheduler.go:194] Failed to schedule pod: auditorias/postgresql-1-69jfw
I0910 14:57:32.526730       1 factory.go:1416] Updating pod condition for auditorias/postgresql-1-69jfw to (PodScheduled==False)


master-controllers-ip-172-31-60-65.ec2.internal

I0910 15:01:15.114662       1 daemon_controller.go:880] Found failed daemon pod openshift-image-inspector/oso-image-inspector-5l9sg on node ip-172-31-60-66.ec2.internal, will try to kill it
I0910 15:01:15.114965       1 event.go:221] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-image-inspector", Name:"oso-image-inspector", UID:"952719f6-4743-11e8-bf06-12d641ec7610", APIVersion:"apps/v1", ResourceVersion:"3125998541", FieldPath:""}): type: 'Warning' reason: 'FailedDaemonPod' Found failed daemon pod openshift-image-inspector/oso-image-inspector-5l9sg on node ip-172-31-60-66.ec2.internal, will try to kill it
I0910 15:01:15.125678       1 controller_utils.go:596] Controller node-problem-detector deleting pod openshift-node-problem-detector/node-problem-detector-rksvt
I0910 15:01:15.126169       1 controller_utils.go:596] Controller node-problem-detector deleting pod openshift-node-problem-detector/node-problem-detector-ff6sq
I0910 15:01:15.126388       1 controller_utils.go:596] Controller node-problem-detector deleting pod openshift-node-problem-detector/node-problem-detector-2q7dc
I0910 15:01:15.126395       1 event.go:221] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-node-problem-detector", Name:"node-problem-detector", UID:"6cf71cdf-8f88-11e8-a8ba-1250f17a13c8", APIVersion:"apps/v1", ResourceVersion:"3126000189", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: node-problem-detector-7lm75

At this point I'd say whatever is responsible for scheduling the SDN pods is broken.
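
A quick way to check where the SDN daemonset has (and has not) placed pods, assuming the 3.11 layout where it runs as the sdn daemonset in the openshift-sdn namespace:

# Desired vs. currently scheduled pods for the SDN daemonset
oc get daemonset sdn -n openshift-sdn

# Is there an sdn pod on the affected node?
oc get pods -n openshift-sdn -o wide | grep ip-172-31-63-97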

Comment 4 Justin Pierce 2018-09-10 18:46:55 UTC
tnozicka found that the SDN (and other daemonset) pods were not being scheduled on the node because it had a taint: NodeWithImpairedVolumes. This taint was set at some unknown point in the past.

In 3.10, the normal SOP was to reboot such a node, which would cause the cloud provider to delete the node API object associated with it. However, in 3.11, when a node is rebooted, its associated API object is no longer deleted. In other words, the taint persists across reboots.
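
A quick way to confirm the taint is still present on the node object after a reboot (node name taken from this report):

# Show any taints carried by the node; daemonset pods will not land while NodeWithImpairedVolumes is present
oc describe node ip-172-31-63-97.ec2.internal | grep -A 3 Taints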

Changing the intent of this BZ (reflected in the new title) since I don't think an SRE team can reasonably keep up with clearing taints like this.

Comment 5 Michal Fojtik 2018-09-12 07:28:18 UTC
Moving to Pod team to investigate the persisting node taints.

Comment 7 Bradley Childs 2018-09-12 15:25:13 UTC
The workaround for this issue is to manually remove the NodeWithImpairedVolumes taint.
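
A minimal sketch of that workaround (node name taken from this report; the trailing '-' removes the taint):

# Remove the NodeWithImpairedVolumes taint from the affected node
oc adm taint nodes ip-172-31-63-97.ec2.internal NodeWithImpairedVolumes-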

It's an RFE for 4.0 as it could be complicated to automatically remove the taint.

Comment 8 Justin Pierce 2018-09-12 20:53:25 UTC
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1628381 to track documentation for accomplishing manual detection and removal at scale.
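
Until that documentation lands, one possible approach for doing this at scale, assuming jq is available on the admin host, is to select every node carrying the taint and strip it in a loop:

# List nodes that currently carry the NodeWithImpairedVolumes taint
oc get nodes -o json \
  | jq -r '.items[] | select(any(.spec.taints[]?; .key == "NodeWithImpairedVolumes")) | .metadata.name'

# Remove the taint from each matching node
oc get nodes -o json \
  | jq -r '.items[] | select(any(.spec.taints[]?; .key == "NodeWithImpairedVolumes")) | .metadata.name' \
  | xargs -r -I{} oc adm taint nodes {} NodeWithImpairedVolumes-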