Bug 2092442
| Summary: | Minimum time between drain retries is not the expected one | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Sergio <sregidor> |
| Component: | Machine Config Operator | Assignee: | Yu Qi Zhang <jerzhang> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Sergio <sregidor> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | low | | |
| Priority: | low | CC: | jerzhang, mkrejci, rioliu, wking |
| Version: | 4.11 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 11:15:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069
Description of problem:

When the drain operation was executed by the machine-config daemon (daemonset), a failed drain was retried after waiting 1 minute for the first 5 retries, and after waiting 5 minutes for the remaining retries. Since the drain execution was moved to the machine-config controller, all drain retries are executed after waiting only 1 minute, no matter how many times the drain has already been retried.

Version-Release number of MCO (Machine Config Operator) (if applicable):

```
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-25-123329   True        False         3h21m   Cluster version is 4.11.0-0.nightly-2022-05-25-123329
```

Platform (AWS, VSphere, Metal, etc.):

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure): Y

How reproducible: Always

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:
2. Profile:

Steps to Reproduce:

1. Create a PodDisruptionBudget:

```
cat << EOF | oc create -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict
EOF
```

2. Create a pod that uses this PodDisruptionBudget so that the pod cannot be evicted:

```
$ oc run --restart=Never --labels app=dontevict --image=quay.io/prometheus/busybox dont-evict-43245 -- sleep 2h

$ oc get pods
NAME                  READY   STATUS    RESTARTS   AGE
dont-evict-this-pod   1/1     Running   0          5m5s
```

3. Create a MachineConfig resource that triggers a drain operation on the nodes:

```
cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-drain-maxunavail
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - quiet
  kernelType: realtime
EOF
```

Actual results:

We can see the drain retries in the machine-config controller logs:

```
$ oc logs machine-config-controller-7b965cf5f7-vh6xw | grep Drain
I0601 11:16:29.400285 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:16:29.400321 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.02522114774361111 hours
I0601 11:18:00.172946 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:18:00.172995 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.050435779038611114 hours
I0601 11:19:30.929790 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:19:30.929825 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.07564600984527778 hours
I0601 11:21:01.697937 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:21:01.697979 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.10085938584583333 hours
I0601 11:22:32.462456 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:22:32.462528 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.12607176003444445 hours
I0601 11:24:03.243156 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:24:03.243195 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.1512886127861111 hours
I0601 11:25:34.014046 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:25:34.014092 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.1765027507986111 hours
I0601 11:27:04.789156 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:27:04.789200 1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.2017180583877778 hours
I0601 11:28:35.560892 1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
```

Looking at the log timestamps, we can see that no matter how many times the drain operation is retried, the controller never waits 5 minutes before retrying.

Expected results:

From the 5th retry on, the machine-config controller should wait 5 minutes before attempting the drain again. Hence, the time gap between two drain attempts should be at least 5 minutes. It could be 6 or 7 minutes depending on the state of the cluster, but there should be at least 5 minutes between them.

Additional info:
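For reference, the expected behaviour described above can be summarized with a minimal Go sketch of the retry-wait selection: 1 minute for the first 5 retries, 5 minutes afterwards. This is only an illustration; the names used here (nextDrainWait, shortWaitRetries, and so on) are hypothetical and are not taken from the actual drain_controller.go implementation.

```go
package main

import (
	"fmt"
	"time"
)

const (
	shortWait        = 1 * time.Minute // wait used for the first retries
	longWait         = 5 * time.Minute // wait expected for later retries
	shortWaitRetries = 5               // number of retries that use the short wait
)

// nextDrainWait returns how long the controller is expected to wait before
// the given retry attempt (1-based). Hypothetical helper, for illustration only.
func nextDrainWait(retryCount int) time.Duration {
	if retryCount <= shortWaitRetries {
		return shortWait
	}
	return longWait
}

func main() {
	// Print the expected wait for the first few retries.
	for attempt := 1; attempt <= 7; attempt++ {
		fmt.Printf("retry %d: wait %s\n", attempt, nextDrainWait(attempt))
	}
}
```

Running this sketch prints a 1-minute wait for the first five attempts and a 5-minute wait afterwards, which is the gap that should become visible between the drain retry timestamps in the controller log once the issue is fixed.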