Bug 1627268
| Summary: | [starter-us-east-1] NodeWithImpairedVolumes is not cleared after node reboot | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Justin Pierce <jupierce> |
| Component: | Storage | Assignee: | Bradley Childs <bchilds> |
| Status: | CLOSED WONTFIX | QA Contact: | Liang Xia <lxia> |
| Severity: | low | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.11.0 | CC: | aos-bugs, aos-storage-staff, bchilds, hekumar, jokerman, lxia, mmccomas, mwoodson, nmalik, tnozicka |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-25 15:18:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Justin Pierce 2018-09-10 14:09:07 UTC
At the time this was reported, the node was NotReady,NotSchedulable. That shouldn't have prevented the node from getting SDN pods, which deploy the file that makes the node go Ready. Justin has since deployed the missing CNI file and the node is now Ready and Schedulable; however, there are still no SDN pods scheduled on the node at this time.

Looking at the controllers, I see two that seem to be active.

master-controllers-ip-172-31-54-162.ec2.internal spews errors like this:

    I0910 14:57:32.526656 1 scheduler.go:194] Failed to schedule pod: auditorias/postgresql-1-69jfw
    I0910 14:57:32.526730 1 factory.go:1416] Updating pod condition for auditorias/postgresql-1-69jfw to (PodScheduled==False)

master-controllers-ip-172-31-60-65.ec2.internal:

    I0910 15:01:15.114662 1 daemon_controller.go:880] Found failed daemon pod openshift-image-inspector/oso-image-inspector-5l9sg on node ip-172-31-60-66.ec2.internal, will try to kill it
    I0910 15:01:15.114965 1 event.go:221] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-image-inspector", Name:"oso-image-inspector", UID:"952719f6-4743-11e8-bf06-12d641ec7610", APIVersion:"apps/v1", ResourceVersion:"3125998541", FieldPath:""}): type: 'Warning' reason: 'FailedDaemonPod' Found failed daemon pod openshift-image-inspector/oso-image-inspector-5l9sg on node ip-172-31-60-66.ec2.internal, will try to kill it
    I0910 15:01:15.125678 1 controller_utils.go:596] Controller node-problem-detector deleting pod openshift-node-problem-detector/node-problem-detector-rksvt
    I0910 15:01:15.126169 1 controller_utils.go:596] Controller node-problem-detector deleting pod openshift-node-problem-detector/node-problem-detector-ff6sq
    I0910 15:01:15.126388 1 controller_utils.go:596] Controller node-problem-detector deleting pod openshift-node-problem-detector/node-problem-detector-2q7dc
    I0910 15:01:15.126395 1 event.go:221] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-node-problem-detector", Name:"node-problem-detector", UID:"6cf71cdf-8f88-11e8-a8ba-1250f17a13c8", APIVersion:"apps/v1", ResourceVersion:"3126000189", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: node-problem-detector-7lm75

At this point I'd say whatever is responsible for scheduling the SDN pods is broken.

tnozicka found that the SDN (and other DaemonSet) pods were not being scheduled on the node because it had a taint: NodeWithImpairedVolumes. This taint was set at some unknown point in the past. In 3.10, the normal SOP was to reboot such a node, which would cause the cloud provider to delete the node API object associated with it. However, in 3.11, when a node is rebooted, its associated API object is no longer deleted. In other words, the taint persists across reboots.

Changing the intent of this BZ (reflected in the new title), since I don't think an SRE team can reasonably keep up with clearing taints like this. Moving to the Pod team to investigate the persisting node taints.

The workaround for this issue is to manually remove the NodeWithImpairedVolumes taint (see the sketch below). It's an RFE for 4.0, as it could be complicated to automatically remove the taint.

Opened https://bugzilla.redhat.com/show_bug.cgi?id=1628381 to track documentation for accomplishing manual detection and removal at scale.
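The report does not spell out the workaround commands, so here is a minimal sketch, assuming cluster-admin credentials, jq available on the admin host, and that the taint key is NodeWithImpairedVolumes exactly as named above. The `<node-name>` placeholder and the xargs batch removal are illustrative assumptions, not commands taken from this bug.

```shell
# Confirm the taint on a node that DaemonSet pods are avoiding.
oc describe node <node-name> | grep -A2 -i taints

# List every node that currently carries the NodeWithImpairedVolumes taint.
oc get nodes -o json \
  | jq -r '.items[]
           | select(any(.spec.taints[]?; .key == "NodeWithImpairedVolumes"))
           | .metadata.name'

# Remove the taint from a single node. The trailing "-" tells `oc adm taint`
# to delete every taint with that key rather than add one.
oc adm taint nodes <node-name> NodeWithImpairedVolumes-

# At scale, feed the detected node names straight into the removal command.
oc get nodes -o json \
  | jq -r '.items[]
           | select(any(.spec.taints[]?; .key == "NodeWithImpairedVolumes"))
           | .metadata.name' \
  | xargs -r -I{} oc adm taint nodes {} NodeWithImpairedVolumes-
```

This is roughly the kind of procedure bug 1628381 was opened to document; the taint can also be removed by editing spec.taints on the node object directly with `oc edit node <node-name>`.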