Description of problem:

During an upgrade from 3.10.9 to v3.11.0-0.21.0 (using openshift-ansible v3.11.0-0.28.0), one node lost a critical CNI configuration file. Prior to the upgrade, the node was recorded as Ready in the pre-upgrade report. After the upgrade, it reported NotReady and atomic-openshift-node logged CNI errors:

Sep 10 13:23:58 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: I0910 13:23:58.532706 93041 cloud_request_manager.go:108] Node addresses from cloud provider for node "ip-172-31-63-97.e..." collected
Sep 10 13:23:59 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: I0910 13:23:59.215538 93041 container_manager_linux.go:428] [ContainerManager]: Discovered runtime cgroups name: /system...ker.service
Sep 10 13:24:00 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: W0910 13:24:00.548038 93041 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d
Sep 10 13:24:00 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: E0910 13:24:00.548211 93041 kubelet.go:2101] Container runtime network not ready: NetworkReady=false reason:NetworkPlugi...initialized
Sep 10 13:24:05 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: W0910 13:24:05.549542 93041 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d
Sep 10 13:24:05 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: E0910 13:24:05.549678 93041 kubelet.go:2101] Container runtime network not ready: NetworkReady=false reason:NetworkPlugi...initialized
Sep 10 13:24:08 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: I0910 13:24:08.532943 93041 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "ip-172-...2.internal"
Sep 10 13:24:08 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: I0910 13:24:08.539266 93041 cloud_request_manager.go:108] Node addresses from cloud provider for node "ip-172-31-63-97.e..." collected
Sep 10 13:24:10 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: W0910 13:24:10.550987 93041 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d
Sep 10 13:24:10 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: E0910 13:24:10.551085 93041 kubelet.go:2101] Container runtime network not ready: NetworkReady=false reason:NetworkPlugi...initialized
..etc..

The upgrade failed due to a CSR timeout. I believe this indicates some condition in which the CNI configuration file can be lost during the upgrade. I manually restored the file with the following content, and the node recovered to Ready:

{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}

Version-Release number of the following components:
ansible: 3.11.0-0.28.0
pre-upgrade ocp: 3.10.9
post-upgrade ocp: 3.11.0-0.21.0

Actual results:

08:54:10 TASK [openshift_node : Approve node certificates when bootstrapping] ***********
08:54:10 Sunday 09 September 2018  12:54:08 +0000 (0:00:00.255)       0:09:20.678 ******
08:54:10 FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
08:54:10 FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
08:54:16 FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).
08:54:17 ok: [starter-us-east-1-node-compute-f9fac -> None]
08:54:24 FAILED - RETRYING: Approve node certificates when bootstrapping (28 retries left).
...
08:57:18 FAILED - RETRYING: Approve node certificates when bootstrapping (1 retries left).
08:57:23 fatal: [starter-us-east-1-node-compute-123a1 -> None]: FAILED! => {"attempts": 30, "changed": false, "msg": "Could not find csr for nodes: ip-172-31-63-97.ec2.internal", "state": "unknown"}
08:57:23
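For reference, the manual restoration can be sketched as the script below. This is a sketch, not an official recovery procedure: the target directory is a parameter so it can be dry-run, and the filename 80-openshift-network.conf is an assumption — match whatever the SDN pod normally writes into /etc/cni/net.d on your nodes.

```shell
#!/bin/sh
# Sketch: restore the openshift-sdn CNI config lost during the upgrade.
# The target directory is an argument so the script can be tested
# outside a node; on the affected node it would be /etc/cni/net.d.
restore_cni_config() {
    dir="$1"
    mkdir -p "$dir"
    # Filename is an assumption; check what the SDN pod normally deploys.
    cat > "$dir/80-openshift-network.conf" <<'EOF'
{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}
EOF
}

# On the affected node (as root):
#   restore_cni_config /etc/cni/net.d
```

The kubelet rechecks /etc/cni/net.d periodically (that is the cni.go:172 loop in the log above), so once the file is back the node should transition to Ready without further intervention, as observed in this report.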
At the time this was reported the node was NotReady,NotSchedulable. That shouldn't have prevented the node from getting the SDN pod, which deploys the file that makes the node go Ready. Justin has since deployed the missing CNI file and the node is now Ready and Schedulable; however, there are still no SDN pods scheduled on the node at this time.

Looking at the controllers I see two that seem to be active.

master-controllers-ip-172-31-54-162.ec2.internal spews errors like this:

I0910 14:57:32.526656       1 scheduler.go:194] Failed to schedule pod: auditorias/postgresql-1-69jfw
I0910 14:57:32.526730       1 factory.go:1416] Updating pod condition for auditorias/postgresql-1-69jfw to (PodScheduled==False)

master-controllers-ip-172-31-60-65.ec2.internal:

I0910 15:01:15.114662       1 daemon_controller.go:880] Found failed daemon pod openshift-image-inspector/oso-image-inspector-5l9sg on node ip-172-31-60-66.ec2.internal, will try to kill it
I0910 15:01:15.114965       1 event.go:221] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-image-inspector", Name:"oso-image-inspector", UID:"952719f6-4743-11e8-bf06-12d641ec7610", APIVersion:"apps/v1", ResourceVersion:"3125998541", FieldPath:""}): type: 'Warning' reason: 'FailedDaemonPod' Found failed daemon pod openshift-image-inspector/oso-image-inspector-5l9sg on node ip-172-31-60-66.ec2.internal, will try to kill it
I0910 15:01:15.125678       1 controller_utils.go:596] Controller node-problem-detector deleting pod openshift-node-problem-detector/node-problem-detector-rksvt
I0910 15:01:15.126169       1 controller_utils.go:596] Controller node-problem-detector deleting pod openshift-node-problem-detector/node-problem-detector-ff6sq
I0910 15:01:15.126388       1 controller_utils.go:596] Controller node-problem-detector deleting pod openshift-node-problem-detector/node-problem-detector-2q7dc
I0910 15:01:15.126395       1 event.go:221] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-node-problem-detector", Name:"node-problem-detector", UID:"6cf71cdf-8f88-11e8-a8ba-1250f17a13c8", APIVersion:"apps/v1", ResourceVersion:"3126000189", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: node-problem-detector-7lm75

At this point I'd say whatever is responsible for scheduling the SDN pods is broken.
tnozicka found that the sdn pod (and other daemonset pods) were not being scheduled on the node because it had a taint: NodeWithImpairedVolumes. This taint was set at some unknown point in the past. In 3.10, the normal SOP was to reboot such a node, which would cause the cloud provider to delete the node API object associated with it. However, in 3.11, when a node is rebooted, its associated API object is no longer deleted. In other words, the taint persists across reboots. Changing the intent of this BZ (reflected in the new title) since I don't think an SRE team can reasonably keep up with clearing taints like this.
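To make the mechanism concrete: a NoSchedule taint blocks any pod, daemonset-managed or not, that does not carry a matching toleration. Below is a simplified model of that matching — a sketch, not the real scheduler code, and the sdn tolerations shown are illustrative assumptions rather than a copy of the actual daemonset spec.

```python
# Simplified model of Kubernetes taint/toleration matching, illustrating
# why the sdn daemonset pod stays unscheduled on a tainted node.

def tolerates(taint, toleration):
    """True if a single toleration matches a single taint."""
    if toleration.get("operator", "Equal") == "Exists":
        # "Exists" with no key tolerates everything; with a key it must match.
        key_ok = "key" not in toleration or toleration["key"] == taint["key"]
    else:
        key_ok = (toleration.get("key") == taint["key"]
                  and toleration.get("value") == taint.get("value"))
    effect_ok = ("effect" not in toleration
                 or toleration["effect"] == taint["effect"])
    return key_ok and effect_ok

def schedulable(node_taints, pod_tolerations):
    """A pod fits only if every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(t, tol) for tol in pod_tolerations)
        for t in node_taints
        if t["effect"] == "NoSchedule"
    )

# The lingering taint observed on the node:
taints = [{"key": "NodeWithImpairedVolumes", "value": "true",
           "effect": "NoSchedule"}]

# Hypothetical sdn tolerations: not-ready style taints are tolerated,
# but nothing matches NodeWithImpairedVolumes.
sdn_tolerations = [{"key": "node.kubernetes.io/not-ready",
                    "operator": "Exists", "effect": "NoSchedule"}]

print(schedulable(taints, sdn_tolerations))  # False: sdn pod is blocked
print(schedulable([], sdn_tolerations))      # True once the taint is removed
```

This is consistent with what was observed: restoring the CNI file made the node Ready, but the sdn pod still could not land until the taint was cleared.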
Moving to Pod team to investigate the persisting node taints.
The workaround for this issue is to manually remove the NodeWithImpairedVolumes taint. It's an RFE for 4.0, as it could be complicated to automatically remove the taint.
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1628381 to track documentation for accomplishing manual detection and removal at scale.
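Until that documentation lands, detection and removal at scale can be sketched as below. The taint key is the real one from this BZ; the script itself is a convenience wrapper of my own, not an official tool — it reads the JSON produced by `oc get nodes -o json` and prints one `oc adm taint` removal command per affected node (the trailing "-" on the taint key tells `oc adm taint` to remove it).

```python
# Sketch: find nodes carrying the NodeWithImpairedVolumes taint in
# `oc get nodes -o json` output and emit removal commands.
import json

TAINT_KEY = "NodeWithImpairedVolumes"

def impaired_nodes(nodes_json):
    """Return names of nodes that carry the NodeWithImpairedVolumes taint."""
    names = []
    for item in nodes_json.get("items", []):
        taints = item.get("spec", {}).get("taints") or []
        if any(t.get("key") == TAINT_KEY for t in taints):
            names.append(item["metadata"]["name"])
    return names

def removal_commands(nodes_json):
    """One `oc adm taint` removal command per impaired node."""
    return ["oc adm taint nodes %s %s-" % (name, TAINT_KEY)
            for name in impaired_nodes(nodes_json)]

# Typical use (hypothetical filename):
#   oc get nodes -o json > nodes.json
#   python clear_impaired_taints.py < nodes.json   # review, then pipe to sh
```

Reviewing the emitted commands before running them keeps this safe to apply cluster-wide.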