Bug 2090794
Summary: | MachineConfigPool cannot apply a configuration after fixing the pods that caused a drain alert | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Sergio <sregidor>
Component: | Machine Config Operator | Assignee: | Yu Qi Zhang <jerzhang>
Sub component: | Machine Config Operator | QA Contact: | Sergio <sregidor>
Status: | CLOSED ERRATA | Severity: | high
Priority: | unspecified | CC: | mkrejci, rioliu
Version: | 4.11 | Target Release: | 4.11.0
Hardware: | Unspecified | OS: | Unspecified
Last Closed: | 2022-08-10 11:14:33 UTC | Type: | Bug
Description
Sergio 2022-05-26 14:41:56 UTC
Marking as blocker due to a change of behaviour from previous releases, which may cause updates to stall. This does not apply to previous versions; the change is only in 4.11.

Verified on 4.11.0-0.nightly-2022-06-11-120123.

1. Create the PDB:

```
$ oc create -f pdb.yaml
poddisruptionbudget.policy/dontevict created

$ oc get pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
dontevict   1               N/A               0                     25s
```

2. Create the pod:

```
$ oc run --restart=Never --labels app=dontevict --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 3h
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "dont-evict-this-pod" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "dont-evict-this-pod" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "dont-evict-this-pod" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "dont-evict-this-pod" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
pod/dont-evict-this-pod created

$ oc get pod
NAME                  READY   STATUS    RESTARTS   AGE
dont-evict-this-pod   1/1     Running   0          9s
```

3. Create a MachineConfig:

```
$ oc create -f file-ig3.yaml
machineconfig.machineconfiguration.openshift.io/test-file created
```

4. The drain fails and the node becomes degraded:

```
$ oc logs -n openshift-machine-config-operator machine-config-controller-fbc49f6f6-l5s8k -c machine-config-controller | grep 'drain exceeded timeout'
E0613 00:06:03.576184 1 drain_controller.go:305] node ip-10-0-218-5.us-east-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
E0613 00:07:34.365247 1 drain_controller.go:305] node ip-10-0-218-5.us-east-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
```
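The manifests used in steps 1 and 3 are not attached to this report. Consistent with the names and command output above, they might look like the following sketch; the selector, file path, file contents, and ignition version are assumptions, not taken from the actual files:

```yaml
# pdb.yaml (assumed contents): a PDB that blocks eviction of the test pod
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1          # matches the MIN AVAILABLE column in the output above
  selector:
    matchLabels:
      app: dontevict       # matches the label applied to the test pod
---
# file-ig3.yaml (assumed contents): a MachineConfig that writes a file,
# which forces a drain and reboot of each worker node
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: test-file
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/test-file.txt          # hypothetical path
          mode: 420
          contents:
            source: data:,test              # hypothetical contents
```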
```
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c3fe10afdd8424f7d70826fc900d0d48   True      False      False      3              3                   3                     0                      169m
worker   rendered-worker-d9fcce43a37d625ffe3d684875ef35af   False     True       True       3              2                   2                     1                      169m
```

5. Delete the PDB:

```
$ oc delete pdb/dontevict
poddisruptionbudget.policy "dontevict" deleted
```

6. The pod can now be evicted and the drain controller recovers:

```
I0613 01:14:08.235712 1 drain_controller.go:302] Previous node drain found. Drain has been going on for 2.1432277348847224 hours
E0613 01:14:08.235725 1 drain_controller.go:305] node ip-10-0-218-5.us-east-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
I0613 01:14:08.235728 1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: initiating drain
E0613 01:14:08.870044 1 drain_controller.go:106] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-6n6fj, openshift-cluster-node-tuning-operator/tuned-7mdqc, openshift-dns/dns-default-n7rmm, openshift-dns/node-resolver-d7gwx, openshift-image-registry/node-ca-jkrjv, openshift-ingress-canary/ingress-canary-96nm5, openshift-machine-config-operator/machine-config-daemon-wxx2k, openshift-monitoring/node-exporter-27bd8, openshift-multus/multus-additional-cni-plugins-qzgpp, openshift-multus/multus-d952k, openshift-multus/network-metrics-daemon-677z8, openshift-network-diagnostics/network-check-target-7bc27, openshift-sdn/sdn-sdq9d
I0613 01:14:08.871381 1 drain_controller.go:106] evicting pod default/dont-evict-this-pod
E0613 01:14:08.878517 1 drain_controller.go:106] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
```
```
I0613 01:14:13.881627 1 drain_controller.go:106] evicting pod default/dont-evict-this-pod
E0613 01:14:13.886294 1 drain_controller.go:106] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0613 01:14:18.888627 1 drain_controller.go:106] evicting pod default/dont-evict-this-pod
E0613 01:14:18.895783 1 drain_controller.go:106] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0613 01:14:23.896687 1 drain_controller.go:106] evicting pod default/dont-evict-this-pod
E0613 01:14:23.903074 1 drain_controller.go:106] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0613 01:14:28.904629 1 drain_controller.go:106] evicting pod default/dont-evict-this-pod
I0613 01:15:00.922616 1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: Evicted pod default/dont-evict-this-pod
I0613 01:15:00.922647 1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: operation successful; applying completion annotation
I0613 01:15:44.011771 1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: Reporting unready: node ip-10-0-218-5.us-east-2.compute.internal is reporting OutOfDisk=Unknown
I0613 01:15:44.055115 1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: changed taints
I0613 01:15:47.148004 1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: Reporting unready: node ip-10-0-218-5.us-east-2.compute.internal is reporting Unschedulable
I0613 01:15:47.176572 1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: changed taints
I0613 01:15:52.321968 1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: Completed update to rendered-worker-f77b25199d32ebeb8a8af036b8cd129d
I0613 01:15:57.322785 1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: uncordoning
I0613 01:15:57.322807 1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: initiating uncordon (currently schedulable: false)
I0613 01:15:57.345877 1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: uncordon succeeded (currently schedulable: true)
I0613 01:15:57.345892 1 drain_controller.go:141] node ip-10-0-218-5.us-east-2.compute.internal: operation successful; applying completion annotation
I0613 01:15:57.369270 1 node_controller.go:446] Pool worker[zone=us-east-2c]: node ip-10-0-218-5.us-east-2.compute.internal: changed taints
I0613 01:16:02.332765 1 status.go:90] Pool worker: All nodes are updated with rendered-worker-f77b25199d32ebeb8a8af036b8cd129d
```

```
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c3fe10afdd8424f7d70826fc900d0d48   True      False      False      3              3                   3                     0                      176m
worker   rendered-worker-f77b25199d32ebeb8a8af036b8cd129d   True      False      False      3              3                   3                     0                      176m
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
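As a side note, the PodSecurity warning printed when the test pod was created in step 2 lists exactly which securityContext fields the "restricted" profile requires. A version of the test pod that would avoid that warning might look like the following; this is a sketch derived from the warning text itself and was not part of the original reproduction:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dont-evict-this-pod
  labels:
    app: dontevict
spec:
  restartPolicy: Never
  containers:
    - name: dont-evict-this-pod
      image: quay.io/prometheus/busybox
      command: ["sleep", "3h"]
      securityContext:
        allowPrivilegeEscalation: false   # from the warning: must be set to false
        capabilities:
          drop: ["ALL"]                   # from the warning: drop all capabilities
        runAsNonRoot: true                # from the warning: must run as non-root
        seccompProfile:
          type: RuntimeDefault            # from the warning: RuntimeDefault or Localhost
```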