Bug 2092442 - Minimum time between drain retries is not the expected one
Summary: Minimum time between drain retries is not the expected one
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.11.0
Assignee: Yu Qi Zhang
QA Contact: Sergio
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-01 14:39 UTC by Sergio
Modified: 2022-08-10 11:16 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:15:48 UTC
Target Upstream Version:
Embargoed:




Links
- Github openshift machine-config-operator pull 3178 (open): Bug 2092442: drain_controller: slow down retries for failing nodes. Last updated: 2022-06-03 01:00:38 UTC
- Red Hat Product Errata RHSA-2022:5069. Last updated: 2022-08-10 11:16:04 UTC

Description Sergio 2022-06-01 14:39:07 UTC
Description of problem:
When the drain operation was executed by the machine-config daemon (daemonset), a failed drain was retried after waiting 1 minute for each of the first 5 retries, and after waiting 5 minutes instead of 1 for the rest of the retries.

After moving the drain execution to the machine-config controller, all drain retries are executed after waiting only 1 minute, no matter how many times the drain has already been retried.
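
For illustration, a minimal Go sketch of the expected behavior (the function name and the retry threshold of 5 are taken from this report; this is not the actual drain_controller.go code):

// Hypothetical sketch of the expected per-node retry backoff; not the
// actual machine-config-operator implementation.
package drainsketch

import "time"

// requeueAfter returns how long the controller should wait before retrying
// a failed drain, given how many times it has already been retried.
func requeueAfter(retries int) time.Duration {
	if retries < 5 {
		// Early failures: retry after a short, 1-minute wait.
		return 1 * time.Minute
	}
	// The node keeps failing to drain: back off to 5 minutes between retries.
	return 5 * time.Minute
}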



Version-Release number of MCO (Machine Config Operator) (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-25-123329   True        False         3h21m   Cluster version is 4.11.0-0.nightly-2022-05-25-123329



Platform (AWS, VSphere, Metal, etc.):

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure): Y

How reproducible:
Always

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:

2. Profile:

Steps to Reproduce:
1. Create PodDisruptionBudget

cat << EOF | oc create -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
        app: dontevict
EOF

2. Create a pod matched by this PodDisruptionBudget so that the pod cannot be evicted

$ oc run --restart=Never --labels app=dontevict  --image=quay.io/prometheus/busybox dont-evict-43245 -- sleep 2h

$ oc get pods
NAME               READY   STATUS    RESTARTS   AGE
dont-evict-43245   1/1     Running   0          5m5s

3. Create a machine config resource that triggers a drain operation on the nodes

cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-drain-maxunavail
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - quiet
  kernelType: realtime
EOF



Actual results:
We can see the drain retries in the machine-config controller logs:

$ oc logs machine-config-controller-7b965cf5f7-vh6xw  | grep Drain

I0601 11:16:29.400285       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:16:29.400321       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.02522114774361111 hours
I0601 11:18:00.172946       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:18:00.172995       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.050435779038611114 hours
I0601 11:19:30.929790       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:19:30.929825       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.07564600984527778 hours
I0601 11:21:01.697937       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:21:01.697979       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.10085938584583333 hours
I0601 11:22:32.462456       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:22:32.462528       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.12607176003444445 hours
I0601 11:24:03.243156       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:24:03.243195       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.1512886127861111 hours
I0601 11:25:34.014046       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:25:34.014092       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.1765027507986111 hours
I0601 11:27:04.789156       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s
I0601 11:27:04.789200       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 0.2017180583877778 hours
I0601 11:28:35.560892       1 drain_controller.go:141] node ip-10-0-159-181.us-east-2.compute.internal: Drain failed, but overall timeout has not been reached. Waiting 1 minute then retrying. Error message from drain: error when evicting pods/"dont-evict-43245" -n "e2e-test-mco-dcln9": global timeout reached: 1m30s


Looking at the log timestamps, we can see that no matter how many times the drain operation is retried, the controller never waits 5 minutes before retrying.
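
To make that check easy to repeat, below is a small, hypothetical Go helper (not part of the MCO; the klog timestamp format is assumed from the output above) that reads the controller log from stdin and prints the gap between successive retry messages:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	var prev time.Time
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		line := sc.Text()
		// Only look at the retry messages logged by the drain controller.
		if !strings.Contains(line, "Drain failed, but overall timeout has not been reached") {
			continue
		}
		// klog lines start with e.g. "I0601 11:16:29.400285 ..."; the second
		// field is the timestamp.
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		ts, err := time.Parse("15:04:05.000000", fields[1])
		if err != nil {
			continue
		}
		if !prev.IsZero() {
			fmt.Printf("gap since previous retry: %v\n", ts.Sub(prev))
		}
		prev = ts
	}
}

Piping the oc logs output shown above through this helper prints gaps of roughly 1m31s for every retry, never the expected 5 minutes.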



Expected results:

From the 5th retry on, the machine-config controller should wait 5 minutes before attempting the drain again. Hence, the time gap between two drain executions should be at least 5 minutes. It could be 6 or 7 minutes depending on the state of the cluster, but there should always be at least 5 minutes between them.


Additional info:

Comment 6 errata-xmlrpc 2022-08-10 11:15:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

