Bug 2092442

Summary: Minimum time between drain retries is not the expected one
Product: OpenShift Container Platform
Reporter: Sergio <sregidor>
Component: Machine Config Operator
Sub component: Machine Config Operator
Assignee: Yu Qi Zhang <jerzhang>
QA Contact: Sergio <sregidor>
Status: CLOSED ERRATA
Severity: low
Priority: low
CC: jerzhang, mkrejci, rioliu, wking
Version: 4.11
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-08-10 11:15:48 UTC
Type: Bug

Description Sergio 2022-06-01 14:39:07 UTC
Description of problem:
When the drain operation was executed by the machine-config daemon (daemonset), a failed drain was retried after waiting 1 minute for the first 5 retries, and after waiting 5 minutes (instead of 1) for every retry after that.

After moving the drain execution to the machine-config controller, every drain retry is executed after waiting only 1 minute, no matter how many times the drain has already been retried.
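
For context, the intended retry cadence can be sketched roughly as follows. The function and constant names below are illustrative only and are not taken from the actual drain_controller.go code:

package main

import (
    "fmt"
    "time"
)

// Illustrative constants; the real values live in the MCO's drain controller
// and may be named and wired differently.
const (
    shortRetryDelay = 1 * time.Minute
    longRetryDelay  = 5 * time.Minute
    shortRetryCount = 5 // failed attempts that still use the short delay
)

// retryDelay returns how long to wait before the next drain attempt, given
// how many attempts have already failed: 1 minute for the first five
// failures, 5 minutes for every failure after that.
func retryDelay(failedAttempts int) time.Duration {
    if failedAttempts <= shortRetryCount {
        return shortRetryDelay
    }
    return longRetryDelay
}

func main() {
    for attempt := 1; attempt <= 7; attempt++ {
        fmt.Printf("after failed attempt %d: wait %s\n", attempt, retryDelay(attempt))
    }
}

With this cadence, attempts 1-5 wait 1 minute and every later attempt waits 5 minutes; the behaviour reported here corresponds to always taking the short branch.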



Version-Release number of MCO (Machine Config Operator) (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-25-123329   True        False         3h21m   Cluster version is 4.11.0-0.nightly-2022-05-25-123329



Platform (AWS, VSphere, Metal, etc.):

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure): Y

How reproducible:
Always

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:

2. Profile:

Steps to Reproduce:
1. Create a PodDisruptionBudget

cat << EOF | oc create -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict
EOF

2. Create a pod matching this PodDisruptionBudget's selector so that the pod cannot be evicted (with minAvailable: 1 and only one matching pod, any eviction would violate the budget)

$ oc run --restart=Never --labels app=dontevict  --image=quay.io/prometheus/busybox dont-evict-43245 -- sleep 2h

$ oc get pods
NAME                  READY   STATUS    RESTARTS   AGE
dont-evict-43245      1/1     Running   0          5m5s

3. Create a MachineConfig resource that triggers a drain operation on the worker nodes

cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-drain-maxunavail
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - quiet
  kernelType: realtime
EOF



Actual results:
We can see the drain retries in the machine-config controller logs:

$ oc logs machine-config-controller-7b965cf5f7-vh6xw  | grep Drain

I0601 11:16:29.400285       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:16:29.400321       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.02522114774361111 hours
I0601 11:18:00.172946       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:18:00.172995       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.050435779038611114 hours
I0601 11:19:30.929790       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:19:30.929825       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.07564600984527778 hours
I0601 11:21:01.697937       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:21:01.697979       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.10085938584583333 hours
I0601 11:22:32.462456       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:22:32.462528       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.12607176003444445 hours
I0601 11:24:03.243156       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:24:03.243195       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.1512886127861111 hours
I0601 11:25:34.014046       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:25:34.014092       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.1765027507986111 hours
I0601 11:27:04.789156       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:27:04.789200       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.2017180583877778 hours
I0601 11:28:35.560892       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s


Looking at the log timestamps, the gap between consecutive failed drain attempts stays at roughly a minute and a half (11:16:29 -> 11:18:00 -> 11:19:30, and so on): no matter how many times the drain operation is retried, the controller never waits 5 minutes before retrying.
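
As a rough way to measure those intervals, the klog timestamps of the failure lines can be diffed directly. The following is a small throwaway sketch (not part of the MCO) that reads the grepped controller log on stdin and prints the gap between consecutive failed attempts:

package main

import (
    "bufio"
    "fmt"
    "os"
    "regexp"
    "strings"
    "time"
)

// Reads machine-config-controller log lines on stdin and prints the time
// elapsed between consecutive "Drain failed" messages. klog prefixes look
// like "I0601 11:16:29.400285"; only the time of day is parsed, which is
// enough for gaps of a few minutes within the same day.
func main() {
    tsRe := regexp.MustCompile(`^[IWEF]\d{4} (\d{2}:\d{2}:\d{2}\.\d{6})`)
    scanner := bufio.NewScanner(os.Stdin)
    var prev time.Time
    havePrev := false
    for scanner.Scan() {
        line := scanner.Text()
        if !strings.Contains(line, "Drain failed") {
            continue
        }
        m := tsRe.FindStringSubmatch(line)
        if m == nil {
            continue
        }
        t, err := time.Parse("15:04:05.000000", m[1])
        if err != nil {
            continue
        }
        if havePrev {
            fmt.Printf("gap since previous failed attempt: %s\n", t.Sub(prev).Round(time.Second))
        }
        prev, havePrev = t, true
    }
}

Saving it as, say, gaps.go and piping the controller log through it (oc logs <controller pod> | grep 'Drain failed' | go run gaps.go) prints gaps of roughly 1m31s for every retry above, where the later gaps should have been at least 5 minutes.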



Expected results:

From the 5th retry on, the machine-config controller should wait 5 minutes before attempting the drain again. Hence, the time gap between two consecutive drain attempts should be at least 5 minutes; it could be 6 or 7 depending on the state of the cluster, but there should never be less than 5 minutes between them.


Additional info:

Comment 6 errata-xmlrpc 2022-08-10 11:15:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069