Bug 2090794 - MachineConfigPool cannot apply a configuration after fixing the pods that caused a drain alert
Summary: MachineConfigPool cannot apply a configuration after fixing the pods that caused a drain alert
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Yu Qi Zhang
QA Contact: Sergio
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-26 14:41 UTC by Sergio
Modified: 2022-08-10 11:14 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:14:33 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 3167 0 None open Bug 2090794: drain controller: continue retry after 1h timeout 2022-05-30 20:58:23 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:14:54 UTC

Description Sergio 2022-05-26 14:41:56 UTC
Description of problem:

With the new drain behavior, if an alert is triggered because a pod could not be evicted for 1 hour, and we then fix the pod after the alert fires, the MCP cannot apply the configuration and stays stuck reporting an error in the MCC logs.



Version-Release number of MCO (Machine Config Operator) (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-25-193227   True        False         5h22m   Cluster version is 4.11.0-0.nightly-2022-05-25-193227



Platform (AWS, VSphere, Metal, etc.):

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure):
Yes

How reproducible:
Always


Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:

2. Profile:

Steps to Reproduce:
1. Create PodDisruptionBudget

cat << EOF | oc create -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
        app: dontevict
EOF

2. Create a pod that matches the PodDisruptionBudget selector so that the pod cannot be evicted

$ oc run --restart=Never --labels app=dontevict  --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 2h

$ oc get pods
NAME                  READY   STATUS    RESTARTS   AGE
dont-evict-this-pod   1/1     Running   0          5m5s
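
Note: with minAvailable: 1 and a single matching pod, the PDB allows zero disruptions, which is what blocks the eviction. For example, one way to confirm this before continuing:

$ oc get pdb dontevict -o jsonpath='{.status.disruptionsAllowed}{"\n"}'
0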

3. Create a machine config resource that triggers a drain operation on the nodes

cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-drain-maxunavail
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - quiet
  kernelType: realtime
EOF
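
Once the MachineConfig renders, the worker pool starts updating and the drain begins. For example, one way to follow the progress:

$ oc get mcp worker -w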

4. After 1 hour an MCDDrainError alert is raised, and the MCC logs show a message reporting that the drain operation has failed

$ oc logs -n openshift-machine-config-operator $(oc get pod -n openshift-machine-config-operator -l k8s-app=machine-config-controller  -o name)

I0526 14:09:59.897126       1 drain_controller.go:141] node sregidor-gcp2-fpc9w-worker-a-r5d9p.c.openshift-qe.internal: initiating drain
I0526 14:09:59.897169       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 3.6754288937169446 hours
I0526 14:09:59.897191       1 drain_controller.go:213] Error syncing node sregidor-gcp2-fpc9w-worker-a-r5d9p.c.openshift-qe.internal: node sregidor-gcp2-fpc9w-worker-a-r5d9p.c.openshift-qe.internal: drain exceeded timeout: 1h0m0s

The worker MCP reports a Degraded status.
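
For example, one way to confirm the condition directly from the pool status (it should print True while the drain is blocked):

$ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}{"\n"}'
True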

5. Remove the pod that cannot be evicted

$ oc delete pod dont-evict-this-pod
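
Alternatively, deleting the PodDisruptionBudget unblocks the eviction in the same way (this is the variant used in the verification below):

$ oc delete pdb dontevict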


Actual results:

The MCC is stuck reporting the error:

I0526 14:09:59.897126       1 drain_controller.go:141] node sregidor-gcp2-fpc9w-worker-a-r5d9p.c.openshift-qe.internal: initiating drain
I0526 14:09:59.897169       1 drain_controller.go:303] Previous node drain found. Drain has been going on for 3.6754288937169446 hours
I0526 14:09:59.897191       1 drain_controller.go:213] Error syncing node sregidor-gcp2-fpc9w-worker-a-r5d9p.c.openshift-qe.internal: node sregidor-gcp2-fpc9w-worker-a-r5d9p.c.openshift-qe.internal: drain exceeded timeout: 1h0m0s

The worker pool continues reporting Degraded status.


Expected results:

After manually removing the pod that causes the eviction problem, the MCP should be able to finish applying the machine configuration.
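
For example, one way to wait for the pool to converge once the pod is gone (the timeout value is arbitrary):

$ oc wait machineconfigpool/worker --for=condition=Updated --timeout=30m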


Additional info:

When the drain logic was in the daemonset (MCD), the MCP could apply the configuration without problems once the pod was manually deleted.

Comment 1 Yu Qi Zhang 2022-05-30 20:58:58 UTC
Marking as blocker due to a change of behaviour from previous updates, which may cause updates to stall.

Does not apply to previous versions; the change is only in 4.11.
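
With the linked fix (drain controller: continue retry after 1h timeout), the controller should keep retrying past the timeout and log that it will do so. For example, one way to check for the new behaviour in the MCC logs:

$ oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-controller -c machine-config-controller | grep 'Will continue to retry'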

Comment 3 Rio Liu 2022-06-13 01:25:11 UTC
verified on 4.11.0-0.nightly-2022-06-11-120123

1. create pdb

oc create -f pdb.yaml
poddisruptionbudget.policy/dontevict created

oc get pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
dontevict   1               N/A               0                     25s


2. create pod

oc run --restart=Never --labels app=dontevict --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 3h
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "dont-evict-this-pod" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "dont-evict-this-pod" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "dont-evict-this-pod" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "dont-evict-this-pod" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
pod/dont-evict-this-pod created

oc get pod
NAME                  READY   STATUS    RESTARTS   AGE
dont-evict-this-pod   1/1     Running   0          9s

3. create a mc

oc create -f file-ig3.yaml
machineconfig.machineconfiguration.openshift.io/test-file created

4. drain error and node is degraded.

oc logs -n openshift-machine-config-operator machine-config-controller-fbc49f6f6-l5s8k -c machine-config-controller|grep 'drain exceeded timeout'
E0613 00:06:03.576184       1 drain_controller.go:305] node ip-10-0-218-5.us-east-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
E0613 00:07:34.365247       1 drain_controller.go:305] node ip-10-0-218-5.us-east-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.

oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c3fe10afdd8424f7d70826fc900d0d48   True      False      False      3              3                   3                     0                      169m
worker   rendered-worker-d9fcce43a37d625ffe3d684875ef35af   False     True       True       3              2                   2                     1                      169m

5. delete pdb

oc delete pdb/dontevict
poddisruptionbudget.policy "dontevict" deleted

6. pod is evicted successfully and the drain controller recovers

I0613 01:14:08.235712       1 drain_controller.go:302] Previous node drain found. Drain has been going on for 2.1432277348847224 hours
E0613 01:14:08.235725       1 drain_controller.go:305] node ip-10-0-218-5.us-east-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
I0613 01:14:08.235728       1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: initiating drain
E0613 01:14:08.870044       1 drain_controller.go:106] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-6n6fj, openshift-cluster-node-tuning-operator/tuned-7mdqc, openshift-dns/dns-default-n7rmm, openshift-dns/node-resolver-d7gwx, openshift-image-registry/node-ca-jkrjv, openshift-ingress-canary/ingress-canary-96nm5, openshift-machine-config-operator/machine-config-daemon-wxx2k, openshift-monitoring/node-exporter-27bd8, openshift-multus/multus-additional-cni-plugins-qzgpp, openshift-multus/multus-d952k, openshift-multus/network-metrics-daemon-677z8, openshift-network-diagnostics/network-check-target-7bc27, openshift-sdn/sdn-sdq9d
I0613 01:14:08.871381       1 drain_controller.go:106] evicting pod default/dont-evict-this-pod
E0613 01:14:08.878517       1 drain_controller.go:106] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0613 01:14:13.881627       1 drain_controller.go:106] evicting pod default/dont-evict-this-pod
E0613 01:14:13.886294       1 drain_controller.go:106] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0613 01:14:18.888627       1 drain_controller.go:106] evicting pod default/dont-evict-this-pod
E0613 01:14:18.895783       1 drain_controller.go:106] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0613 01:14:23.896687       1 drain_controller.go:106] evicting pod default/dont-evict-this-pod
E0613 01:14:23.903074       1 drain_controller.go:106] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0613 01:14:28.904629       1 drain_controller.go:106] evicting pod default/dont-evict-this-pod
I0613 01:15:00.922616       1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: Evicted pod default/dont-evict-this-pod
I0613 01:15:00.922647       1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: operation successful; applying completion annotation
I0613 01:15:44.011771       1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: Reporting unready: node ip-10-0-218-5.us-east-2.compute.internal is reporting OutOfDisk=Unknown
I0613 01:15:44.055115       1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: changed taints
I0613 01:15:47.148004       1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: Reporting unready: node ip-10-0-218-5.us-east-2.compute.internal is reporting Unschedulable
I0613 01:15:47.176572       1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: changed taints
I0613 01:15:52.321968       1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: Completed update to rendered-worker-f77b25199d32ebeb8a8af036b8cd129d
I0613 01:15:57.322785       1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: uncordoning
I0613 01:15:57.322807       1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: initiating uncordon (currently schedulable: false)
I0613 01:15:57.345877       1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: uncordon succeeded (currently schedulable: true)
I0613 01:15:57.345892       1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: operation successful; applying completion annotation
I0613 01:15:57.369270       1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: changed taints
I0613 01:16:02.332765       1 status.go:90] Pool worker: All nodes are updated with rendered-worker-f77b25199d32ebeb8a8af036b8cd129d

oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c3fe10afdd8424f7d70826fc900d0d48   True      False      False      3              3                   3                     0                      176m
worker   rendered-worker-f77b25199d32ebeb8a8af036b8cd129d   True      False      False      3              3                   3                     0                      176m

Comment 5 errata-xmlrpc 2022-08-10 11:14:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

