Bug 1866873
Summary: | MCDDrainError "Drain failed on , updates may be blocked" missing rendered node name | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking>
Component: | Machine Config Operator | Assignee: | Kirsten Garrison <kgarriso>
Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen>
Severity: | low | Docs Contact: |
Priority: | unspecified | |
Version: | 4.5 | CC: | jerzhang, jnaess, kgarriso, mkrejci
Target Milestone: | --- | |
Target Release: | 4.7.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-02-24 15:15:21 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1906298 | |
Description
W. Trevor King
2020-08-06 16:52:15 UTC
Can confirm this is happening on 4.4.14 as well. No label called node=<nodename> or similar is present on the metric in Prometheus.

Moving to 4.7, since this is not a blocking issue for 4.6.

Verified on 4.7.0-0.nightly-2020-11-10-093436:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-11-10-093436   True        False         4h38m   Cluster version is 4.7.0-0.nightly-2020-11-10-093436
$ cat << EOF > pdb.yaml
> apiVersion: policy/v1beta1
> kind: PodDisruptionBudget
> metadata:
>   name: dontevict
> spec:
>   minAvailable: 1
>   selector:
>     matchLabels:
>       app: dontevict
> EOF
$ oc create -f pdb.yaml
poddisruptionbudget.policy/dontevict created
$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-143-20.us-west-2.compute.internal    Ready    master   4h44m   v1.19.2+9c2f84c
ip-10-0-154-71.us-west-2.compute.internal    Ready    worker   4h31m   v1.19.2+9c2f84c
ip-10-0-171-153.us-west-2.compute.internal   Ready    master   4h40m   v1.19.2+9c2f84c
ip-10-0-189-196.us-west-2.compute.internal   Ready    worker   4h31m   v1.19.2+9c2f84c
ip-10-0-194-240.us-west-2.compute.internal   Ready    worker   4h31m   v1.19.2+9c2f84c
ip-10-0-209-84.us-west-2.compute.internal    Ready    master   4h40m   v1.19.2+9c2f84c
$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-154-71"} } }' --image=docker.io/busybox dont-evict-this-pod -- sleep 1h
pod/dont-evict-this-pod created
$ oc get pods
NAME                  READY   STATUS              RESTARTS   AGE
dont-evict-this-pod   0/1     ContainerCreating   0          5s
$ cat << EOF > file.yaml
> apiVersion: machineconfiguration.openshift.io/v1
> kind: MachineConfig
> metadata:
>   labels:
>     machineconfiguration.openshift.io/role: worker
>   name: test-file
> spec:
>   config:
>     ignition:
>       version: 3.1.0
>     storage:
>       files:
>       - contents:
>           source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
>         filesystem: root
>         mode: 0644
>         path: /etc/test
> EOF
$ oc create -f file.yaml
machineconfig.machineconfiguration.openshift.io/test-file created
$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
00-worker                                          da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
01-master-container-runtime                        da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
01-master-kubelet                                  da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
01-worker-container-runtime                        da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
01-worker-kubelet                                  da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
03-worker-extensions                                                                          3.1.0             3h21m
99-master-generated-registries                     da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
99-master-ssh                                                                                 3.1.0             4h49m
99-worker-generated-registries                     da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
99-worker-ssh                                                                                 3.1.0             4h49m
rendered-master-8d25b9ae487bc5e7ffb021bd93bfff7d   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
rendered-worker-69dac79db33505219af92d594dbbc383   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h43m
rendered-worker-e6858708d022f5e2ad4b50ef033be75a   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             3h21m
test-file                                                                                     3.1.0             3s
$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-e6858708d022f5e2ad4b50ef033be75a   False     True       False      3              0                   0                     0                      4h45m
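(For reference, the base64 payload in the test-file MachineConfig is just filler; any file content would do, since the point is only to force a new rendered worker config and trigger a drain. Decoding it locally shows three throwaway server lines:)

$ echo c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK | base64 -d
server foo.example.net maxdelay 0.4 offline
server bar.example.net maxdelay 0.4 offline
server baz.example.net maxdelay 0.4 offline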
$ oc get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-143-20.us-west-2.compute.internal    Ready                      master   4h46m   v1.19.2+9c2f84c
ip-10-0-154-71.us-west-2.compute.internal    Ready                      worker   4h32m   v1.19.2+9c2f84c
ip-10-0-171-153.us-west-2.compute.internal   Ready                      master   4h41m   v1.19.2+9c2f84c
ip-10-0-189-196.us-west-2.compute.internal   Ready                      worker   4h32m   v1.19.2+9c2f84c
ip-10-0-194-240.us-west-2.compute.internal   Ready,SchedulingDisabled   worker   4h33m   v1.19.2+9c2f84c
ip-10-0-209-84.us-west-2.compute.internal    Ready                      master   4h41m   v1.19.2+9c2f84c
$ oc get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-143-20.us-west-2.compute.internal    Ready                      master   4h51m   v1.19.2+9c2f84c
ip-10-0-154-71.us-west-2.compute.internal    Ready,SchedulingDisabled   worker   4h38m   v1.19.2+9c2f84c
ip-10-0-171-153.us-west-2.compute.internal   Ready                      master   4h47m   v1.19.2+9c2f84c
ip-10-0-189-196.us-west-2.compute.internal   Ready                      worker   4h38m   v1.19.2+9c2f84c
ip-10-0-194-240.us-west-2.compute.internal   Ready                      worker   4h38m   v1.19.2+9c2f84c
ip-10-0-209-84.us-west-2.compute.internal    Ready                      master   4h47m   v1.19.2+9c2f84c
$ oc -n openshift-machine-config-operator get pods --field-selector spec.nodeName=ip-10-0-154-71.us-west-2.compute.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-7n6bf   2/2     Running   0          4h38m
$ oc -n openshift-machine-config-operator logs machine-config-daemon-7n6bf -c machine-config-daemon
...
I1110 21:47:52.933055    2072 daemon.go:344] evicting pod default/dont-evict-this-pod
E1110 21:47:52.962506    2072 daemon.go:344] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1110 21:47:57.962645    2072 daemon.go:344] evicting pod default/dont-evict-this-pod
E1110 21:47:57.970946    2072 daemon.go:344] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1110 21:48:02.971070    2072 daemon.go:344] evicting pod default/dont-evict-this-pod
E1110 21:48:03.013410    2072 daemon.go:344] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1110 21:48:08.013504    2072 daemon.go:344] evicting pod default/dont-evict-this-pod
E1110 21:48:08.021002    2072 daemon.go:344] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1110 21:48:13.021128    2072 daemon.go:344] evicting pod default/dont-evict-this-pod
E1110 21:48:13.030356    2072 daemon.go:344] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

$ oc -n openshift-monitoring get routes
NAME                HOST/PORT                                                                         PATH   SERVICES            PORT    TERMINATION          WILDCARD
alertmanager-main   alertmanager-main-openshift-monitoring.apps.mnguyen47.devcluster.openshift.com           alertmanager-main   web     reencrypt/Redirect   None
grafana             grafana-openshift-monitoring.apps.mnguyen47.devcluster.openshift.com                     grafana             https   reencrypt/Redirect   None
prometheus-k8s      prometheus-k8s-openshift-monitoring.apps.mnguyen47.devcluster.openshift.com              prometheus-k8s      web     reencrypt/Redirect   None
thanos-querier      thanos-querier-openshift-monitoring.apps.mnguyen47.devcluster.openshift.com              thanos-querier      web     reencrypt/Redirect   None

Prometheus shows:

mcd_drain_err{container="oauth-proxy",endpoint="metrics",err="WaitTimeout",instance="10.0.154.71:9001",job="machine-config-daemon",namespace="openshift-machine-config-operator",node="ip-10-0-154-71.us-west-2.compute.internal",pod="machine-config-daemon-7n6bf",service="machine-config-daemon"}
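The mcd_drain_err metric now carries a node="..." label identifying the drained node, which is exactly what the MCDDrainError alert message was missing. As a rough sketch only (this is not the exact rule shipped by the machine-config-operator; the real expression, duration, and severity may differ), a rule that interpolates the node name from this metric could look like:

- alert: MCDDrainError
  # hypothetical expression and duration, for illustration only
  expr: mcd_drain_err > 0
  for: 1h
  labels:
    severity: warning
  annotations:
    # renders e.g. "Drain failed on ip-10-0-154-71.us-west-2.compute.internal, updates may be blocked"
    message: "Drain failed on {{ $labels.node }}, updates may be blocked"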
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633