Description of problem:

With the new drain behavior, if an alert is triggered because a pod cannot be evicted for 1 hour, and the pod is fixed after the alert has fired, the MCP cannot apply the configuration and is stuck reporting an error in the MCC logs.

Version-Release number of MCO (Machine Config Operator) (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-25-193227   True        False         5h22m   Cluster version is 4.11.0-0.nightly-2022-05-25-193227

Platform (AWS, VSphere, Metal, etc.):

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure): Yes

How reproducible: Always

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:
2. Profile:

Steps to Reproduce:

1. Create a PodDisruptionBudget

cat << EOF | oc create -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict
EOF

2. Create a pod that matches this PodDisruptionBudget so that the pod cannot be evicted

$ oc run --restart=Never --labels app=dontevict --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 2h

$ oc get pods
NAME                  READY   STATUS    RESTARTS   AGE
dont-evict-this-pod   1/1     Running   0          5m5s

3. Create a machine config resource that triggers a drain operation on the nodes

cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-drain-maxunavail
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - quiet
  kernelType: realtime
EOF

4. After 1 hour an MCDDrainError alert is raised, and the MCC logs report that the drain operation has failed

$ oc logs -n openshift-machine-config-operator $(oc get pod -n openshift-machine-config-operator -l k8s-app=machine-config-controller -o name)
I0526 14:09:59.897126       1 drain_controller.go:141] node sregidor-gcp2-fpc9w-worker-a-r5d9p.c.openshift-qe.internal: initiating drain
I0526 14:09:59.897169       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 3.6754288937169446 hours
I0526 14:09:59.897191       1 drain_controller.go:213] Error syncing node sregidor-gcp2-fpc9w-worker-a-r5d9p.c.openshift-qe.internal: node sregidor-gcp2-fpc9w-worker-a-r5d9p.c.openshift-qe.internal: drain exceeded timeout: 1h0m0s

The worker MCP reports a Degraded status.

5. Remove the pod that cannot be evicted

$ oc delete pod dont-evict-this-pod

Actual results:

The MCC is stuck reporting the error:

I0526 14:09:59.897126       1 drain_controller.go:141] node sregidor-gcp2-fpc9w-worker-a-r5d9p.c.openshift-qe.internal: initiating drain
I0526 14:09:59.897169       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 3.6754288937169446 hours
I0526 14:09:59.897191       1 drain_controller.go:213] Error syncing node sregidor-gcp2-fpc9w-worker-a-r5d9p.c.openshift-qe.internal: node sregidor-gcp2-fpc9w-worker-a-r5d9p.c.openshift-qe.internal: drain exceeded timeout: 1h0m0s

The worker pool continues reporting a Degraded status.

Expected results:

After manually removing the pod that causes the eviction problem, the MCP should be able to finish applying the machine configuration.

Additional info:

When the drain logic was in the MCD daemonsets, the MCP could apply the configuration without problems once the pod was manually deleted.
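Not part of the original report, but while reproducing it can help to watch the pool conditions and the MCO node annotations to see where the drain is stuck. A minimal sketch, assuming the standard machineconfiguration.openshift.io annotations; <node-name> is a placeholder for the affected worker:

# Pool-level view: the Degraded/NodeDegraded conditions carry the drain error
$ oc get mcp worker -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'

# Node-level view: currentConfig/desiredConfig, state, and the degraded reason
$ oc get node <node-name> -o yaml | grep 'machineconfiguration.openshift.io/'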
Marking as blocker due to a change of behaviour from previous releases, which may cause updates to stall. Does not apply to previous versions; the change is only in 4.11.
Verified on 4.11.0-0.nightly-2022-06-11-120123

1. Create the PDB

$ oc create -f pdb.yaml
poddisruptionbudget.policy/dontevict created

$ oc get pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
dontevict   1               N/A               0                     25s

2. Create the pod

$ oc run --restart=Never --labels app=dontevict --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 3h
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "dont-evict-this-pod" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "dont-evict-this-pod" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "dont-evict-this-pod" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "dont-evict-this-pod" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
pod/dont-evict-this-pod created

$ oc get pod
NAME                  READY   STATUS    RESTARTS   AGE
dont-evict-this-pod   1/1     Running   0          9s

3. Create a MC

$ oc create -f file-ig3.yaml
machineconfig.machineconfiguration.openshift.io/test-file created

4. Drain error is reported and the node is degraded

$ oc logs -n openshift-machine-config-operator machine-config-controller-fbc49f6f6-l5s8k -c machine-config-controller | grep 'drain exceeded timeout'
E0613 00:06:03.576184       1 drain_controller.go:305] node ip-10-0-218-5.us-east-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
E0613 00:07:34.365247       1 drain_controller.go:305] node ip-10-0-218-5.us-east-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c3fe10afdd8424f7d70826fc900d0d48   True      False      False      3              3                   3                     0                      169m
worker   rendered-worker-d9fcce43a37d625ffe3d684875ef35af   False     True       True       3              2                   2                     1                      169m

5. Delete the PDB

$ oc delete pdb/dontevict
poddisruptionbudget.policy "dontevict" deleted

6. The pod can now be evicted successfully and the drain controller recovers

I0613 01:14:08.235712       1 drain_controller.go:302] Previous node drain found. Drain has been going on for 2.1432277348847224 hours
E0613 01:14:08.235725       1 drain_controller.go:305] node ip-10-0-218-5.us-east-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
I0613 01:14:08.235728       1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: initiating drain
E0613 01:14:08.870044       1 drain_controller.go:106] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-6n6fj, openshift-cluster-node-tuning-operator/tuned-7mdqc, openshift-dns/dns-default-n7rmm, openshift-dns/node-resolver-d7gwx, openshift-image-registry/node-ca-jkrjv, openshift-ingress-canary/ingress-canary-96nm5, openshift-machine-config-operator/machine-config-daemon-wxx2k, openshift-monitoring/node-exporter-27bd8, openshift-multus/multus-additional-cni-plugins-qzgpp, openshift-multus/multus-d952k, openshift-multus/network-metrics-daemon-677z8, openshift-network-diagnostics/network-check-target-7bc27, openshift-sdn/sdn-sdq9d
I0613 01:14:08.871381       1 drain_controller.go:106] evicting pod default/dont-evict-this-pod
E0613 01:14:08.878517       1 drain_controller.go:106] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0613 01:14:13.881627       1 drain_controller.go:106] evicting pod default/dont-evict-this-pod
E0613 01:14:13.886294       1 drain_controller.go:106] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0613 01:14:18.888627       1 drain_controller.go:106] evicting pod default/dont-evict-this-pod
E0613 01:14:18.895783       1 drain_controller.go:106] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0613 01:14:23.896687       1 drain_controller.go:106] evicting pod default/dont-evict-this-pod
E0613 01:14:23.903074       1 drain_controller.go:106] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0613 01:14:28.904629       1 drain_controller.go:106] evicting pod default/dont-evict-this-pod
I0613 01:15:00.922616       1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: Evicted pod default/dont-evict-this-pod
I0613 01:15:00.922647       1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: operation successful; applying completion annotation
I0613 01:15:44.011771       1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: Reporting unready: node ip-10-0-218-5.us-east-2.compute.internal is reporting OutOfDisk=Unknown
I0613 01:15:44.055115       1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: changed taints
I0613 01:15:47.148004       1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: Reporting unready: node ip-10-0-218-5.us-east-2.compute.internal is reporting Unschedulable
I0613 01:15:47.176572       1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: changed taints
I0613 01:15:52.321968       1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: Completed update to rendered-worker-f77b25199d32ebeb8a8af036b8cd129d
I0613 01:15:57.322785       1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: uncordoning
I0613 01:15:57.322807       1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: initiating uncordon (currently schedulable: false)
I0613 01:15:57.345877       1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: uncordon succeeded (currently schedulable: true)
I0613 01:15:57.345892       1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: operation successful; applying completion annotation
I0613 01:15:57.369270       1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: changed taints
I0613 01:16:02.332765       1 status.go:90] Pool worker: All nodes are updated with rendered-worker-f77b25199d32ebeb8a8af036b8cd129d

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c3fe10afdd8424f7d70826fc900d0d48   True      False      False      3              3                   3                     0                      176m
worker   rendered-worker-f77b25199d32ebeb8a8af036b8cd129d   True      False      False      3              3                   3                     0                      176m
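For reference when MCDDrainError fires on a real cluster: the blocking budget is usually the one with zero allowed disruptions, which already shows up in the ALLOWED DISRUPTIONS column of oc get pdb (as in step 1). A hedged one-liner to list such budgets across all namespaces, assuming jq is available on the workstation:

$ oc get pdb -A -o json | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"'

As in step 5 above, either relaxing/deleting that PDB or removing the pod it protects lets the drain controller finish on its next retry.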
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069