Description of problem:

When the drain operation was executed by the daemonsets, a failed drain was retried after waiting 1 minute for the first 5 retries, and after waiting 5 minutes for every retry after that. After the drain execution was moved to the machine-config controller, every drain retry waits only 1 minute, no matter how many times the drain has already been retried.

Version-Release number of MCO (Machine Config Operator) (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-25-123329   True        False         3h21m   Cluster version is 4.11.0-0.nightly-2022-05-25-123329

Platform (AWS, VSphere, Metal, etc.):

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure):
Y

How reproducible:
Always

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:
2. Profile:

Steps to Reproduce:

1. Create a PodDisruptionBudget:

cat << EOF | oc create -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict
EOF

2. Create a pod that uses this PodDisruptionBudget, so that the pod cannot be evicted:

$ oc run --restart=Never --labels app=dontevict --image=quay.io/prometheus/busybox dont-evict-43245 -- sleep 2h

$ oc get pods
NAME                  READY   STATUS    RESTARTS   AGE
dont-evict-this-pod   1/1     Running   0          5m5s

3. Create a MachineConfig resource that triggers a drain operation on the nodes:

cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-drain-maxunavail
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - quiet
  kernelType: realtime
EOF

Actual results:

We can see the drain retries in the machine-config controller logs:

$ oc logs machine-config-controller-7b965cf5f7-vh6xw | grep Drain
I0601 11:16:29.400285 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:16:29.400321 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.02522114774361111 hours
I0601 11:18:00.172946 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:18:00.172995 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.050435779038611114 hours
I0601 11:19:30.929790 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:19:30.929825 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.07564600984527778 hours
I0601 11:21:01.697937 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:21:01.697979 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.10085938584583333 hours
I0601 11:22:32.462456 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:22:32.462528 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.12607176003444445 hours
I0601 11:24:03.243156 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:24:03.243195 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.1512886127861111 hours
I0601 11:25:34.014046 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:25:34.014092 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.1765027507986111 hours
I0601 11:27:04.789156 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:27:04.789200 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.2017180583877778 hours
I0601 11:28:35.560892 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s

Looking at the log timestamps, we can see that no matter how many times the drain operation is retried, the controller never waits 5 minutes before retrying.

Expected results:

From the 5th retry on, the machine-config controller should wait 5 minutes before attempting the drain again. Hence, the time gap between two drain attempts should be at least 5 minutes; it could be 6 or 7 minutes depending on the state of the cluster, but there should be at least 5 minutes between them.

Additional info:
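For reference, below is a minimal Go sketch of the retry interval behaviour described in this report: 1 minute for the first 5 retries, 5 minutes for every retry after that. The function name retryBackoff and its constants are made up for illustration and are not taken from the machine-config controller source; the sketch only shows the wait intervals the controller is expected to apply between drain attempts.

package main

import (
	"fmt"
	"time"
)

// retryBackoff returns how long to wait before the next drain retry:
// 1 minute for the first 5 retries, 5 minutes for every later retry.
// Illustrative only; the names here are hypothetical and this is not
// the machine-config controller's actual implementation.
func retryBackoff(retriesSoFar int) time.Duration {
	const (
		shortInterval = 1 * time.Minute
		longInterval  = 5 * time.Minute
		shortRetries  = 5
	)
	if retriesSoFar < shortRetries {
		return shortInterval
	}
	return longInterval
}

func main() {
	// Print the expected wait before each of the first 8 retries.
	for retry := 0; retry < 8; retry++ {
		fmt.Printf("retry %d: wait %s\n", retry+1, retryBackoff(retry))
	}
}

In the actual results above, the controller behaves as if only the 1 minute interval were ever used, regardless of the retry count.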
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069