Bug 1861876
| Summary: | MCDDrainError firing as critical instead of as warning | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Kirsten Garrison <kgarriso> |
| Component: | Machine Config Operator | Assignee: | Kirsten Garrison <kgarriso> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.6 | CC: | wking |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:21:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1862538 | | |
Description
Kirsten Garrison
2020-07-29 19:15:48 UTC
Always firing as critical is very confusing and disruptive. It was an error to have this set to critical; it should be a warning until the upcoming MCO telemetry overhaul.

Verified on 4.6.0-0.nightly-2020-08-03-025909

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-03-025909   True        False         7h26m   Cluster version is 4.6.0-0.nightly-2020-08-03-025909

== Create PDB ==

$ cat << EOF > pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict
EOF
$ oc create -f pdb.yaml
poddisruptionbudget.policy/dontevict created
$ oc get pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
dontevict   1               N/A               0                     5s

== Run pod on a worker node that has the label associated with the pdb ==

$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-140-165.us-west-2.compute.internal   Ready    master   7h1m    v0.0.0-master+$Format:%h$
ip-10-0-148-228.us-west-2.compute.internal   Ready    worker   6h49m   v0.0.0-master+$Format:%h$
ip-10-0-170-248.us-west-2.compute.internal   Ready    master   7h1m    v0.0.0-master+$Format:%h$
ip-10-0-185-62.us-west-2.compute.internal    Ready    worker   6h49m   v0.0.0-master+$Format:%h$
ip-10-0-214-243.us-west-2.compute.internal   Ready    worker   6h49m   v0.0.0-master+$Format:%h$
ip-10-0-219-92.us-west-2.compute.internal    Ready    master   7h1m    v0.0.0-master+$Format:%h$
$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-148-228" } } }' --image=docker.io/busybox dont-evict-this-pod -- sleep 1h
pod/dont-evict-this-pod created
$ oc -n default get pods
NAME                  READY   STATUS    RESTARTS   AGE
dont-evict-this-pod   1/1     Running   0          9s

== Do something that causes a node drain. I deleted a MC I created here, but you can just add a file using a MC (see example below) ==

Example
----------------------------------------------------------
cat << EOF > file.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 0644
        path: /etc/test
EOF
oc create -f file.yaml
------------------------------
$ oc delete -f file.yaml
machineconfig.machineconfiguration.openshift.io "test-file" deleted
$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
00-worker                                          057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
01-master-container-runtime                        057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
01-master-kubelet                                  057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
01-worker-container-runtime                        057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
01-worker-kubelet                                  057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
99-master-generated-registries                     057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
99-master-ssh                                                                                 3.1.0             7h25m
99-worker-generated-registries                     057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
99-worker-ssh                                                                                 3.1.0             7h25m
rendered-master-f16dd8debc6fb0ad1212ffc7f386e67a   057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
rendered-worker-18530e85ada03eb1df754c4ede1fabec   057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             6h23m
rendered-worker-31140c01283f3a5cce98f76c006d563f   057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
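(Side note, not part of the original verification steps: once the MachineConfig is deleted, the MCO renders a new worker config and cordons/drains one node at a time. A minimal way to follow that live, assuming the same worker pool and node names as above, is to watch both resources:)

$ oc get mcp worker -w    # wait for UPDATING to flip to True on the worker pool
$ oc get nodes -w         # the node being drained should go Ready,SchedulingDisabled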
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-f16dd8debc6fb0ad1212ffc7f386e67a   True      False      False      3              3                   3                     0                      7h16m
worker   rendered-worker-18530e85ada03eb1df754c4ede1fabec   True      False      False      3              3                   3                     0                      7h16m
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-f16dd8debc6fb0ad1212ffc7f386e67a   True      False      False      3              3                   3                     0                      7h16m
worker   rendered-worker-18530e85ada03eb1df754c4ede1fabec   False     True       False      3              0                   0                     0                      7h16m
$ oc -n openshift-machine-config-operator get pods --field-selector spec.nodeName=ip-10-0-148-228.us-west-2.compute.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-fh7jv   2/2     Running   0          7h8m

== Wait for eviction to time out. Check MCD logs for "global timeout reached: 1m30s" ==

$ oc -n openshift-machine-config-operator logs machine-config-daemon-fh7jv -c machine-config-daemon
--SNIP--
I0803 20:54:17.764325    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:17.772371    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:22.784670    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:22.799447    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:27.803554    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:27.811602    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:32.811922    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:32.820051    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:37.827921    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:37.836011    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:42.853392    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:42.861594    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:47.863960    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:47.877359    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:52.879678    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:52.901840    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:57.919019    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:57.927111    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:55:02.940723    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:55:02.949544    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:55:07.951738    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
I0803 20:55:07.951780    2014 update.go:148] Draining failed with: error when evicting pod "dont-evict-this-pod": global timeout reached: 1m30s, retrying
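(Optional check, not part of the original steps: once the drain retries are exhausted, the worker pool should report a NodeDegraded condition carrying the same eviction error. A compact way to pull just that condition, as a sketch using oc JSONPath instead of the full YAML dump shown below:)

$ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}'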
$ oc get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-140-165.us-west-2.compute.internal   Ready                      master   7h36m   v0.0.0-master+$Format:%h$
ip-10-0-148-228.us-west-2.compute.internal   Ready,SchedulingDisabled   worker   7h24m   v0.0.0-master+$Format:%h$
ip-10-0-170-248.us-west-2.compute.internal   Ready                      master   7h36m   v0.0.0-master+$Format:%h$
ip-10-0-185-62.us-west-2.compute.internal    Ready                      worker   7h24m   v0.0.0-master+$Format:%h$
ip-10-0-214-243.us-west-2.compute.internal   Ready                      worker   7h24m   v0.0.0-master+$Format:%h$
ip-10-0-219-92.us-west-2.compute.internal    Ready                      master   7h36m   v0.0.0-master+$Format:%h$
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-f16dd8debc6fb0ad1212ffc7f386e67a   True      False      False      3              3                   3                     0                      7h35m
worker   rendered-worker-18530e85ada03eb1df754c4ede1fabec   False     True       True       3              0                   0                     1                      7h35m
$ oc get mcp/worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2020-08-03T13:21:08Z"
  generation: 4
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    pools.operator.machineconfiguration.openshift.io/worker: ""
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/mco-built-in: {}
          f:pools.operator.machineconfiguration.openshift.io/worker: {}
      f:spec:
        .: {}
        f:configuration: {}
        f:machineConfigSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:machineconfiguration.openshift.io/role: {}
        f:nodeSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:node-role.kubernetes.io/worker: {}
        f:paused: {}
    manager: machine-config-operator
    operation: Update
    time: "2020-08-03T13:21:08Z"
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name: {}
          f:source: {}
      f:status:
        .: {}
        f:conditions: {}
        f:configuration:
          .: {}
          f:name: {}
          f:source: {}
        f:degradedMachineCount: {}
        f:machineCount: {}
        f:observedGeneration: {}
        f:readyMachineCount: {}
        f:unavailableMachineCount: {}
        f:updatedMachineCount: {}
    manager: machine-config-controller
    operation: Update
    time: "2020-08-03T20:48:00Z"
  name: worker
  resourceVersion: "385650"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  uid: 60d50c8a-25c0-4dd9-a8ce-ccfd13ef389e
spec:
  configuration:
    name: rendered-worker-31140c01283f3a5cce98f76c006d563f
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2020-08-03T13:21:55Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2020-08-03T20:37:52Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2020-08-03T20:37:52Z"
    message: All nodes are updating to rendered-worker-31140c01283f3a5cce98f76c006d563f
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2020-08-03T20:48:00Z"
    message: 'Node ip-10-0-148-228.us-west-2.compute.internal is reporting: "failed to drain node (5 tries): timed out waiting for the condition: error when evicting pod \"dont-evict-this-pod\": global timeout reached: 1m30s"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2020-08-03T20:48:00Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  configuration:
    name: rendered-worker-18530e85ada03eb1df754c4ede1fabec
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: test-file
  degradedMachineCount: 1
  machineCount: 3
  observedGeneration: 4
  readyMachineCount: 0
  unavailableMachineCount: 1
  updatedMachineCount: 0
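(Optional, not part of the original steps: the MCD reports the failed drain through the mcd_drain metric, which is what the MCDDrainError expression shown below keys on. Instead of the web console, the raw metric can be queried from the CLI through the thanos-querier route; a sketch, assuming the logged-in user's bearer token is authorized to query cluster metrics:)

$ TOKEN=$(oc whoami -t)
$ THANOS=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$THANOS/api/v1/query" --data-urlencode 'query=mcd_drain'
# a non-empty result with a value > 0 for the drained node means the alert condition is met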
== Find the prometheus web URL ==

$ oc -n openshift-monitoring get routes
NAME                 HOST/PORT                                                                         PATH   SERVICES            PORT    TERMINATION          WILDCARD
alertmanager-main    alertmanager-main-openshift-monitoring.apps.mnguyen46.devcluster.openshift.com           alertmanager-main   web     reencrypt/Redirect   None
grafana              grafana-openshift-monitoring.apps.mnguyen46.devcluster.openshift.com                     grafana             https   reencrypt/Redirect   None
>> prometheus-k8s    prometheus-k8s-openshift-monitoring.apps.mnguyen46.devcluster.openshift.com              prometheus-k8s      web     reencrypt/Redirect   None
thanos-querier       thanos-querier-openshift-monitoring.apps.mnguyen46.devcluster.openshift.com              thanos-querier      web     reencrypt/Redirect   None

== Go to https://prometheus-k8s-openshift-monitoring.apps.mnguyen46.devcluster.openshift.com and log in as kubeadmin ==

== Click on Alerts and search for MCDDrainError ==

alert: MCDDrainError
expr: mcd_drain > 0
labels:
  severity: warning
annotations:
  message: 'Drain failed on {{ $labels.node }} , updates may be blocked. For more details: oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon'
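(Optional, not part of the original steps: the firing alert, including its severity label, can also be confirmed from the CLI via the built-in Prometheus ALERTS series; a sketch, assuming the same token and thanos-querier route as in the earlier note. The returned labels should show severity="warning" rather than critical; the web UI lists the same two firing instances shown below.)

$ TOKEN=$(oc whoami -t)
$ THANOS=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$THANOS/api/v1/query" --data-urlencode 'query=ALERTS{alertname="MCDDrainError"}'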
Labels: alertname="MCDDrainError" drain_time="602.309273355 sec" endpoint="metrics" err="5 tries: error when evicting pod "dont-evict-this-pod": global timeout reached: 1m30s" instance="10.0.148.228:9001" job="machine-config-daemon" namespace="openshift-machine-config-operator" pod="machine-config-daemon-fh7jv" service="machine-config-daemon" severity="warning"
State: firing
Active Since: 2020-08-03 20:48:36.021562042 +0000 UTC
Value: 1.5964876759266927e+09

Labels: alertname="MCDDrainError" drain_time="602.542385785 sec" endpoint="metrics" err="5 tries: error when evicting pod "dont-evict-this-pod": global timeout reached: 1m30s" instance="10.0.148.228:9001" job="machine-config-daemon" namespace="openshift-machine-config-operator" pod="machine-config-daemon-fh7jv" service="machine-config-daemon" severity="warning"
State: firing
Active Since: 2020-08-03 20:58:36.021562042 +0000 UTC
Value: 1.596488278520987e+09

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196