Bug 2092442 - Minimum time between drain retries is not the expected one
Summary: Minimum time between drain retries is not the expected one
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.11.0
Assignee: Yu Qi Zhang
QA Contact: Sergio
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-01 14:39 UTC by Sergio
Modified: 2022-08-10 11:16 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:15:48 UTC
Target Upstream Version:
Embargoed:




Links
- Github openshift machine-config-operator pull 3178 (open): Bug 2092442: drain_controller: slow down retries for failing nodes. Last updated: 2022-06-03 01:00:38 UTC
- Red Hat Product Errata RHSA-2022:5069. Last updated: 2022-08-10 11:16:04 UTC

Description Sergio 2022-06-01 14:39:07 UTC
Description of problem:
When the drain operation was executed by the machine-config daemon (daemonset), a failed drain was retried after waiting 1 minute for each of the first 5 retries, and after waiting 5 minutes instead of 1 for the rest of the retries.

After moving the drain execution to the machine-config controller, all drain retries are executed after waiting only 1 minute, no matter how many times the drain has already been retried.
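
For illustration, a minimal Go sketch of the expected behavior (the function name and the retry threshold of 5 are taken from this report; this is not the actual drain_controller.go code):

// Hypothetical sketch of the expected per-node retry backoff; not the
// actual machine-config-operator implementation.
package drainsketch

import "time"

// requeueAfter returns how long the controller should wait before retrying
// a failed drain, given how many times it has already been retried.
func requeueAfter(retries int) time.Duration {
	if retries < 5 {
		// Early failures: retry after a short, 1-minute wait.
		return 1 * time.Minute
	}
	// The node keeps failing to drain: back off to 5 minutes between retries.
	return 5 * time.Minute
}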



Version-Release number of MCO (Machine Config Operator) (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-25-123329   True        False         3h21m   Cluster version is 4.11.0-0.nightly-2022-05-25-123329



Platform (AWS, VSphere, Metal, etc.):

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure): Y

How reproducible:
Always

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:

2. Profile:

Steps to Reproduce:
1. Create PodDisruptionBudget

cat << EOF | oc create -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
        app: dontevict
EOF

2. Create a pod matched by this PodDisruptionBudget so that the pod cannot be evicted

$ oc run --restart=Never --labels app=dontevict  --image=quay.io/prometheus/busybox dont-evict-43245 -- sleep 2h

$ oc get pods
NAME               READY   STATUS    RESTARTS   AGE
dont-evict-43245   1/1     Running   0          5m5s

3. Create a machine config resource that triggers a drain operation on the nodes

cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-drain-maxunavail
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - quiet
  kernelType: realtime
EOF



Actual results:
We can see the drain retries in the machine-config controller logs:

$ oc logs machine-config-controller-7b965cf5f7-vh6xw  | grep Drain

I0601 11:16:29.400285       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:16:29.400321       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.02522114774361111 hours
I0601 11:18:00.172946       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:18:00.172995       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.050435779038611114 hours
I0601 11:19:30.929790       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:19:30.929825       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.07564600984527778 hours
I0601 11:21:01.697937       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:21:01.697979       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.10085938584583333 hours
I0601 11:22:32.462456       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:22:32.462528       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.12607176003444445 hours
I0601 11:24:03.243156       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:24:03.243195       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.1512886127861111 hours
I0601 11:25:34.014046       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:25:34.014092       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.1765027507986111 hours
I0601 11:27:04.789156       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:27:04.789200       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.2017180583877778 hours
I0601 11:28:35.560892       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s


Looking at the log timestamps, we can see that no matter how many times the drain operation is retried, the controller never waits 5 minutes before retrying.
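
To make that check easy to repeat, below is a small, hypothetical Go helper (not part of the MCO; the klog timestamp format is assumed from the output above) that reads the controller log from stdin and prints the gap between successive retry messages:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	var prev time.Time
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		line := sc.Text()
		// Only look at the retry messages logged by the drain controller.
		if !strings.Contains(line, "Drain failed, but overall timeout has not been reached") {
			continue
		}
		// klog lines start with e.g. "I0601 11:16:29.400285 ..."; the second
		// field is the timestamp.
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		ts, err := time.Parse("15:04:05.000000", fields[1])
		if err != nil {
			continue
		}
		if !prev.IsZero() {
			fmt.Printf("gap since previous retry: %v\n", ts.Sub(prev))
		}
		prev = ts
	}
}

Piping the oc logs output shown above through this helper prints gaps of roughly 1m31s for every retry, never the expected 5 minutes.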



Expected results:

From the 5th retry on, the machine-config controller should wait 5 minutes before attempting the drain again. Hence, the time gap between two drain executions should be at least 5 minutes. It could be 6 or 7 minutes depending on the state of the cluster, but there should always be at least 5 minutes between them.


Additional info:

Comment 6 errata-xmlrpc 2022-08-10 11:15:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

