Description of problem:

During an upgrade from 3.10.9 to v3.11.0-0.21.0 (using openshift-ansible v3.11.0-0.28.0), one node lost a critical CNI configuration file. Prior to the upgrade, the node was recorded as Ready in the pre-upgrade report. After the upgrade, it reported NotReady and atomic-openshift-node logged CNI errors:

Sep 10 13:23:58 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: I0910 13:23:58.532706 93041 cloud_request_manager.go:108] Node addresses from cloud provider for node "ip-172-31-63-97.e..." collected
Sep 10 13:23:59 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: I0910 13:23:59.215538 93041 container_manager_linux.go:428] [ContainerManager]: Discovered runtime cgroups name: /system...ker.service
Sep 10 13:24:00 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: W0910 13:24:00.548038 93041 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d
Sep 10 13:24:00 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: E0910 13:24:00.548211 93041 kubelet.go:2101] Container runtime network not ready: NetworkReady=false reason:NetworkPlugi...initialized
Sep 10 13:24:05 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: W0910 13:24:05.549542 93041 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d
Sep 10 13:24:05 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: E0910 13:24:05.549678 93041 kubelet.go:2101] Container runtime network not ready: NetworkReady=false reason:NetworkPlugi...initialized
Sep 10 13:24:08 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: I0910 13:24:08.532943 93041 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "ip-172-...2.internal"
Sep 10 13:24:08 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: I0910 13:24:08.539266 93041 cloud_request_manager.go:108] Node addresses from cloud provider for node "ip-172-31-63-97.e..." collected
Sep 10 13:24:10 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: W0910 13:24:10.550987 93041 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d
Sep 10 13:24:10 ip-172-31-63-97.ec2.internal atomic-openshift-node[93041]: E0910 13:24:10.551085 93041 kubelet.go:2101] Container runtime network not ready: NetworkReady=false reason:NetworkPlugi...initialized
..etc..

The upgrade failed due to a CSR timeout. I believe this indicates some condition in which the CNI configuration file can be lost during the upgrade. I manually restored the file with the following content, and the node recovered to Ready:

{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}

Version-Release number of the following components:
ansible: 3.11.0-0.28.0
pre-upgrade ocp: 3.10.9
post-upgrade ocp: 3.11.0-0.21.0

Actual results:

08:54:10 TASK [openshift_node : Approve node certificates when bootstrapping] ***********
08:54:10 Sunday 09 September 2018  12:54:08 +0000 (0:00:00.255)       0:09:20.678 ******
08:54:10 FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
08:54:10 FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
08:54:16 FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).
08:54:17 ok: [starter-us-east-1-node-compute-f9fac -> None]
08:54:24 FAILED - RETRYING: Approve node certificates when bootstrapping (28 retries left).
...
08:57:18 FAILED - RETRYING: Approve node certificates when bootstrapping (1 retries left).
08:57:23 fatal: [starter-us-east-1-node-compute-123a1 -> None]: FAILED! => {"attempts": 30, "changed": false, "msg": "Could not find csr for nodes: ip-172-31-63-97.ec2.internal", "state": "unknown"}
08:57:23
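For reference, the manual restoration can be sketched as the script below. This is a sketch, not an official recovery procedure: the target directory is a parameter so it can be dry-run, and the filename 80-openshift-network.conf is an assumption — match whatever the SDN pod normally writes into /etc/cni/net.d on your nodes.

```shell
#!/bin/sh
# Sketch: restore the openshift-sdn CNI config lost during the upgrade.
# The target directory is an argument so the script can be tested
# outside a node; on the affected node it would be /etc/cni/net.d.
restore_cni_config() {
    dir="$1"
    mkdir -p "$dir"
    # Filename is an assumption; check what the SDN pod normally deploys.
    cat > "$dir/80-openshift-network.conf" <<'EOF'
{
  "cniVersion": "0.2.0",
  "name": "openshift-sdn",
  "type": "openshift-sdn"
}
EOF
}

# On the affected node (as root):
#   restore_cni_config /etc/cni/net.d
```

The kubelet rechecks /etc/cni/net.d periodically (that is the cni.go:172 loop in the log above), so once the file is back the node should transition to Ready without further intervention, as observed in this report.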
At the time this was reported the node was NotReady,NotSchedulable. That shouldn't have prevented the node from getting the SDN pod, which deploys the file that makes the node go Ready. Justin has since deployed the missing CNI file and the node is now Ready and Schedulable; however, there are still no SDN pods scheduled on the node at this time.

Looking at the controllers I see two that seem to be active.

master-controllers-ip-172-31-54-162.ec2.internal spews errors like this:

I0910 14:57:32.526656       1 scheduler.go:194] Failed to schedule pod: auditorias/postgresql-1-69jfw
I0910 14:57:32.526730       1 factory.go:1416] Updating pod condition for auditorias/postgresql-1-69jfw to (PodScheduled==False)

master-controllers-ip-172-31-60-65.ec2.internal:

I0910 15:01:15.114662       1 daemon_controller.go:880] Found failed daemon pod openshift-image-inspector/oso-image-inspector-5l9sg on node ip-172-31-60-66.ec2.internal, will try to kill it
I0910 15:01:15.114965       1 event.go:221] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-image-inspector", Name:"oso-image-inspector", UID:"952719f6-4743-11e8-bf06-12d641ec7610", APIVersion:"apps/v1", ResourceVersion:"3125998541", FieldPath:""}): type: 'Warning' reason: 'FailedDaemonPod' Found failed daemon pod openshift-image-inspector/oso-image-inspector-5l9sg on node ip-172-31-60-66.ec2.internal, will try to kill it
I0910 15:01:15.125678       1 controller_utils.go:596] Controller node-problem-detector deleting pod openshift-node-problem-detector/node-problem-detector-rksvt
I0910 15:01:15.126169       1 controller_utils.go:596] Controller node-problem-detector deleting pod openshift-node-problem-detector/node-problem-detector-ff6sq
I0910 15:01:15.126388       1 controller_utils.go:596] Controller node-problem-detector deleting pod openshift-node-problem-detector/node-problem-detector-2q7dc
I0910 15:01:15.126395       1 event.go:221] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-node-problem-detector", Name:"node-problem-detector", UID:"6cf71cdf-8f88-11e8-a8ba-1250f17a13c8", APIVersion:"apps/v1", ResourceVersion:"3126000189", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: node-problem-detector-7lm75

At this point I'd say whatever is responsible for scheduling the SDN pods is broken.
tnozicka found that the sdn pod (and other daemonset pods) were not being scheduled on the node because it had a taint: NodeWithImpairedVolumes. This taint was set at some unknown point in the past. In 3.10, the normal SOP was to reboot such a node, which would cause the cloud provider to delete the node API object associated with it. However, in 3.11, when a node is rebooted, its associated API object is no longer deleted. In other words, the taint persists across reboots. Changing the intent of this BZ (reflected in the new title) since I don't think an SRE team can reasonably keep up with clearing taints like this.
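To make the mechanism concrete: a NoSchedule taint blocks any pod, daemonset-managed or not, that does not carry a matching toleration. Below is a simplified model of that matching — a sketch, not the real scheduler code, and the sdn tolerations shown are illustrative assumptions rather than a copy of the actual daemonset spec.

```python
# Simplified model of Kubernetes taint/toleration matching, illustrating
# why the sdn daemonset pod stays unscheduled on a tainted node.

def tolerates(taint, toleration):
    """True if a single toleration matches a single taint."""
    if toleration.get("operator", "Equal") == "Exists":
        # "Exists" with no key tolerates everything; with a key it must match.
        key_ok = "key" not in toleration or toleration["key"] == taint["key"]
    else:
        key_ok = (toleration.get("key") == taint["key"]
                  and toleration.get("value") == taint.get("value"))
    effect_ok = ("effect" not in toleration
                 or toleration["effect"] == taint["effect"])
    return key_ok and effect_ok

def schedulable(node_taints, pod_tolerations):
    """A pod fits only if every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(t, tol) for tol in pod_tolerations)
        for t in node_taints
        if t["effect"] == "NoSchedule"
    )

# The lingering taint observed on the node:
taints = [{"key": "NodeWithImpairedVolumes", "value": "true",
           "effect": "NoSchedule"}]

# Hypothetical sdn tolerations: not-ready style taints are tolerated,
# but nothing matches NodeWithImpairedVolumes.
sdn_tolerations = [{"key": "node.kubernetes.io/not-ready",
                    "operator": "Exists", "effect": "NoSchedule"}]

print(schedulable(taints, sdn_tolerations))  # False: sdn pod is blocked
print(schedulable([], sdn_tolerations))      # True once the taint is removed
```

This is consistent with what was observed: restoring the CNI file made the node Ready, but the sdn pod still could not land until the taint was cleared.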
Moving to Pod team to investigate the persisting node taints.
The workaround for this issue is to manually remove the NodeWithImpairedVolumes taint. It's an RFE for 4.0, as it could be complicated to automatically remove the taint.
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1628381 to track documentation for accomplishing manual detection and removal at scale.
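Until that documentation lands, detection and removal at scale can be sketched as below. The taint key is the real one from this BZ; the script itself is a convenience wrapper of my own, not an official tool — it reads the JSON produced by `oc get nodes -o json` and prints one `oc adm taint` removal command per affected node (the trailing "-" on the taint key tells `oc adm taint` to remove it).

```python
# Sketch: find nodes carrying the NodeWithImpairedVolumes taint in
# `oc get nodes -o json` output and emit removal commands.
import json

TAINT_KEY = "NodeWithImpairedVolumes"

def impaired_nodes(nodes_json):
    """Return names of nodes that carry the NodeWithImpairedVolumes taint."""
    names = []
    for item in nodes_json.get("items", []):
        taints = item.get("spec", {}).get("taints") or []
        if any(t.get("key") == TAINT_KEY for t in taints):
            names.append(item["metadata"]["name"])
    return names

def removal_commands(nodes_json):
    """One `oc adm taint` removal command per impaired node."""
    return ["oc adm taint nodes %s %s-" % (name, TAINT_KEY)
            for name in impaired_nodes(nodes_json)]

# Typical use (hypothetical filename):
#   oc get nodes -o json > nodes.json
#   python clear_impaired_taints.py < nodes.json   # review, then pipe to sh
```

Reviewing the emitted commands before running them keeps this safe to apply cluster-wide.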