Description of problem:

Seen in a 4.5.5 cluster:

[FIRING:6] MCDDrainError machine-config-daemon (metrics 10.0.171.29:9001 openshift-machine-config-operator machine-config-daemon-nnq9x openshift-monitoring/k8s machine-config-daemon critical)
Drain failed on , updates may be blocked. For more details: oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon

Version-Release number of selected component (if applicable): 4.5.5

How reproducible: Unknown.

Steps to Reproduce:
1. Launch a 4.5.5 cluster.
2. Set a PDB on a pod that forbids eviction.
3. Push a new MachineConfig, or take another action that causes the machine-config operator to try to roll the machine set.

Actual results:
"Drain failed on , updates may be blocked..." is missing the rendered node name.

Expected results:
"Drain failed on {node-name-for-10.0.171.29}, updates may be blocked..."

Additional info:
The error message template has had {{ $labels.node }} since the alert was born [1]. It is not clear to me why the mcd_drain metric would be missing the label, or whether it has the label and this is just an error-template-side problem. There is unrelated MCDDrainError discussion in bug 1829999.

[1]: https://github.com/openshift/machine-config-operator/blame/7f087773b6e8369806ab9b1a98fdd18ba996a8a1/install/0000_90_machine-config-operator_01_prometheus-rules.yaml#L27
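For context, the rule in [1] is shaped roughly like the sketch below. The expr and severity shown here are assumptions inferred from the firing alert above and the mcd_drain_err series seen later in this bug; the exact rule may differ by release:

  - alert: MCDDrainError
    expr: mcd_drain_err > 0
    labels:
      severity: critical
    annotations:
      message: "Drain failed on {{ $labels.node }}, updates may be blocked. For more details: oc logs -f -n {{ $labels.namespace }} {{ $labels.pod }} -c machine-config-daemon"

If the series behind the alert carries no node label, {{ $labels.node }} renders as the empty string, which would produce exactly the "Drain failed on ," message seen above.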
Can confirm this is happening on 4.4.14 as well. No label called node=<nodename> or similar is present on the metric in Prometheus.
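For anyone checking their own cluster, one way to dump the labels on the series is to query the Prometheus API from inside the prometheus-k8s pod (a sketch; it assumes curl is available in the prometheus container):

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -s 'http://localhost:9090/api/v1/query?query=mcd_drain_err'

The "metric" object in the JSON response lists every label on the series; on affected clusters the node label should be absent there.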
Moving to 4.7, since this is not a blocking issue for 4.6.
Verified on 4.7.0-0.nightly-2020-11-10-093436

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-11-10-093436   True        False         4h38m   Cluster version is 4.7.0-0.nightly-2020-11-10-093436

$ cat << EOF > pdb.yaml
> apiVersion: policy/v1beta1
> kind: PodDisruptionBudget
> metadata:
>   name: dontevict
> spec:
>   minAvailable: 1
>   selector:
>     matchLabels:
>       app: dontevict
> EOF

$ oc create -f pdb.yaml
poddisruptionbudget.policy/dontevict created

$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-143-20.us-west-2.compute.internal    Ready    master   4h44m   v1.19.2+9c2f84c
ip-10-0-154-71.us-west-2.compute.internal    Ready    worker   4h31m   v1.19.2+9c2f84c
ip-10-0-171-153.us-west-2.compute.internal   Ready    master   4h40m   v1.19.2+9c2f84c
ip-10-0-189-196.us-west-2.compute.internal   Ready    worker   4h31m   v1.19.2+9c2f84c
ip-10-0-194-240.us-west-2.compute.internal   Ready    worker   4h31m   v1.19.2+9c2f84c
ip-10-0-209-84.us-west-2.compute.internal    Ready    master   4h40m   v1.19.2+9c2f84c

$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-154-71"} } }' --image=docker.io/busybox dont-evict-this-pod -- sleep 1h
pod/dont-evict-this-pod created

$ oc get pods
NAME                  READY   STATUS              RESTARTS   AGE
dont-evict-this-pod   0/1     ContainerCreating   0          5s

$ cat << EOF > file.yaml
> apiVersion: machineconfiguration.openshift.io/v1
> kind: MachineConfig
> metadata:
>   labels:
>     machineconfiguration.openshift.io/role: worker
>   name: test-file
> spec:
>   config:
>     ignition:
>       version: 3.1.0
>     storage:
>       files:
>       - contents:
>           source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
>         filesystem: root
>         mode: 0644
>         path: /etc/test
> EOF

$ oc create -f file.yaml
machineconfig.machineconfiguration.openshift.io/test-file created

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
00-worker                                          da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
01-master-container-runtime                        da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
01-master-kubelet                                  da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
01-worker-container-runtime                        da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
01-worker-kubelet                                  da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
03-worker-extensions                                                                          3.1.0             3h21m
99-master-generated-registries                     da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
99-master-ssh                                                                                 3.1.0             4h49m
99-worker-generated-registries                     da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
99-worker-ssh                                                                                 3.1.0             4h49m
rendered-master-8d25b9ae487bc5e7ffb021bd93bfff7d   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
rendered-worker-69dac79db33505219af92d594dbbc383   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
rendered-worker-e6858708d022f5e2ad4b50ef033be75a   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             3h21m
test-file                                                                                     3.1.0             3s

$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-e6858708d022f5e2ad4b50ef033be75a   False     True       False      3              0                   0                     0                      4h45m

$ oc get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-143-20.us-west-2.compute.internal    Ready                      master   4h46m   v1.19.2+9c2f84c
ip-10-0-154-71.us-west-2.compute.internal    Ready                      worker   4h32m   v1.19.2+9c2f84c
ip-10-0-171-153.us-west-2.compute.internal   Ready                      master   4h41m   v1.19.2+9c2f84c
ip-10-0-189-196.us-west-2.compute.internal   Ready                      worker   4h32m   v1.19.2+9c2f84c
ip-10-0-194-240.us-west-2.compute.internal   Ready,SchedulingDisabled   worker   4h33m   v1.19.2+9c2f84c
ip-10-0-209-84.us-west-2.compute.internal    Ready                      master   4h41m   v1.19.2+9c2f84c

$ oc get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-143-20.us-west-2.compute.internal    Ready                      master   4h51m   v1.19.2+9c2f84c
ip-10-0-154-71.us-west-2.compute.internal    Ready,SchedulingDisabled   worker   4h38m   v1.19.2+9c2f84c
ip-10-0-171-153.us-west-2.compute.internal   Ready                      master   4h47m   v1.19.2+9c2f84c
ip-10-0-189-196.us-west-2.compute.internal   Ready                      worker   4h38m   v1.19.2+9c2f84c
ip-10-0-194-240.us-west-2.compute.internal   Ready                      worker   4h38m   v1.19.2+9c2f84c
ip-10-0-209-84.us-west-2.compute.internal    Ready                      master   4h47m   v1.19.2+9c2f84c

$ oc -n openshift-machine-config-operator get pods --field-selector spec.nodeName=ip-10-0-154-71.us-west-2.compute.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-7n6bf   2/2     Running   0          4h38m

$ oc -n openshift-machine-config-operator logs machine-config-daemon-7n6bf -c machine-config-daemon
...
I1110 21:47:52.933055    2072 daemon.go:344] evicting pod default/dont-evict-this-pod
E1110 21:47:52.962506    2072 daemon.go:344] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1110 21:47:57.962645    2072 daemon.go:344] evicting pod default/dont-evict-this-pod
E1110 21:47:57.970946    2072 daemon.go:344] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1110 21:48:02.971070    2072 daemon.go:344] evicting pod default/dont-evict-this-pod
E1110 21:48:03.013410    2072 daemon.go:344] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1110 21:48:08.013504    2072 daemon.go:344] evicting pod default/dont-evict-this-pod
E1110 21:48:08.021002    2072 daemon.go:344] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1110 21:48:13.021128    2072 daemon.go:344] evicting pod default/dont-evict-this-pod
E1110 21:48:13.030356    2072 daemon.go:344] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

$ oc -n openshift-monitoring get routes
NAME                HOST/PORT                                                                        PATH   SERVICES            PORT         TERMINATION          WILDCARD
alertmanager-main   alertmanager-main-openshift-monitoring.apps.mnguyen47.devcluster.openshift.com          alertmanager-main   web          reencrypt/Redirect   None
grafana             grafana-openshift-monitoring.apps.mnguyen47.devcluster.openshift.com                    grafana             https        reencrypt/Redirect   None
prometheus-k8s      prometheus-k8s-openshift-monitoring.apps.mnguyen47.devcluster.openshift.com             prometheus-k8s      web          reencrypt/Redirect   None
thanos-querier      thanos-querier-openshift-monitoring.apps.mnguyen47.devcluster.openshift.com             thanos-querier      web          reencrypt/Redirect   None

Prometheus shows:

mcd_drain_err{container="oauth-proxy",endpoint="metrics",err="WaitTimeout",instance="10.0.154.71:9001",job="machine-config-daemon",namespace="openshift-machine-config-operator",node="ip-10-0-154-71.us-west-2.compute.internal",pod="machine-config-daemon-7n6bf",service="machine-config-daemon"}
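The node label (node="ip-10-0-154-71.us-west-2.compute.internal") is now present on the metric, so the alert template has what it needs. As an optional further check, one can query Prometheus's built-in ALERTS series, which carries the labels of the firing alert:

ALERTS{alertname="MCDDrainError"}

Run in the Prometheus UI, this should show the firing alert with the node label attached, and the rendered message should then read "Drain failed on ip-10-0-154-71.us-west-2.compute.internal, updates may be blocked...".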
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633