Bug 1861876

Summary: MCDDrainError firing as critical instead of as warning
Product: OpenShift Container Platform
Component: Machine Config Operator
Reporter: Kirsten Garrison <kgarriso>
Assignee: Kirsten Garrison <kgarriso>
Status: CLOSED ERRATA
QA Contact: Michael Nguyen <mnguyen>
Severity: medium
Priority: medium
Version: 4.6
CC: wking
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-10-27 16:21:20 UTC
Type: Bug
Bug Blocks: 1862538    

Description Kirsten Garrison 2020-07-29 19:15:48 UTC
MCDDrainError fires as critical on all drain failures, which is very misleading, especially given PDBs. It should have been written as a warning.

Comment 1 Kirsten Garrison 2020-07-29 19:17:12 UTC
Always firing as critical is very confusing and disruptive. It was an error to set this severity; it should be warning until the upcoming MCO telemetry overhaul.
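
The shipped severity can also be confirmed directly on a cluster; a sketch, assuming the MCO publishes its alerting rules as a PrometheusRule object in its own namespace:

$ oc -n openshift-machine-config-operator get prometheusrules -o yaml \
    | grep -B 2 -A 8 'alert: MCDDrainError'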

Comment 4 Michael Nguyen 2020-08-03 21:21:26 UTC
Verified on 4.6.0-0.nightly-2020-08-03-025909

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-03-025909   True        False         7h26m   Cluster version is 4.6.0-0.nightly-2020-08-03-025909


== Create PDB ==
$ cat << EOF > pdb.yaml 
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict
EOF
$ oc create -f pdb.yaml 
poddisruptionbudget.policy/dontevict created
$ oc get pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
dontevict   1               N/A               0                     5s



== Run pod on a worker node that has the label associated with the pdb ==
$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-140-165.us-west-2.compute.internal   Ready    master   7h1m    v0.0.0-master+$Format:%h$
ip-10-0-148-228.us-west-2.compute.internal   Ready    worker   6h49m   v0.0.0-master+$Format:%h$
ip-10-0-170-248.us-west-2.compute.internal   Ready    master   7h1m    v0.0.0-master+$Format:%h$
ip-10-0-185-62.us-west-2.compute.internal    Ready    worker   6h49m   v0.0.0-master+$Format:%h$
ip-10-0-214-243.us-west-2.compute.internal   Ready    worker   6h49m   v0.0.0-master+$Format:%h$
ip-10-0-219-92.us-west-2.compute.internal    Ready    master   7h1m    v0.0.0-master+$Format:%h$
$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-148-228" } } }' --image=docker.io/busybox dont-evict-this-pod -- sleep 1h
pod/dont-evict-this-pod created
$ oc -n default get pods
NAME                  READY   STATUS    RESTARTS   AGE
dont-evict-this-pod   1/1     Running   0          9s
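
To confirm the pod actually landed on the node that will be drained before moving on (-o wide adds the NODE column):

$ oc -n default get pod dont-evict-this-pod -o wide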

== Do something that causes a node drain.  I deleted an MC I had created here, but you can just add a file using an MC (see example below) ==

Example
----------------------------------------------------------
  cat << EOF > file.yaml 
  apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  metadata:
    labels:
      machineconfiguration.openshift.io/role: worker
    name: test-file
  spec:
    config:
      ignition:
        version: 2.2.0
      storage:
        files:
        - contents:
            source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
          filesystem: root
          mode: 0644
          path: /etc/test
  EOF

  oc create -f file.yaml

----------------------------------------------------------
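
The example above uses Ignition spec 2.2.0, which the MCO accepts and translates; a roughly equivalent spec 3.1.0 form (matching the rendered configs on this cluster) is sketched below, using a trivial data:,test payload (it writes the literal text "test") instead of the base64 blob:

$ cat << EOF > file-v3.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:,test
        mode: 0644
        path: /etc/test
EOF
$ oc create -f file-v3.yaml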


$ oc delete -f file.yaml 
machineconfig.machineconfiguration.openshift.io "test-file" deleted
$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
00-worker                                          057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
01-master-container-runtime                        057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
01-master-kubelet                                  057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
01-worker-container-runtime                        057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
01-worker-kubelet                                  057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
99-master-generated-registries                     057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
99-master-ssh                                                                                 3.1.0             7h25m
99-worker-generated-registries                     057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
99-worker-ssh                                                                                 3.1.0             7h25m
rendered-master-f16dd8debc6fb0ad1212ffc7f386e67a   057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
rendered-worker-18530e85ada03eb1df754c4ede1fabec   057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             6h23m
rendered-worker-31140c01283f3a5cce98f76c006d563f   057d852d0d10f94120aaa91e771503baa5b3c242   3.1.0             7h15m
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-f16dd8debc6fb0ad1212ffc7f386e67a   True      False      False      3              3                   3                     0                      7h16m
worker   rendered-worker-18530e85ada03eb1df754c4ede1fabec   True      False      False      3              3                   3                     0                      7h16m
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-f16dd8debc6fb0ad1212ffc7f386e67a   True      False      False      3              3                   3                     0                      7h16m
worker   rendered-worker-18530e85ada03eb1df754c4ede1fabec   False     True       False      3              0                   0                     0                      7h16m
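
Rather than polling oc get mcp repeatedly, the pool transition can also be streamed (-w watches for updates):

$ oc get mcp -w
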
$  oc  -n openshift-machine-config-operator get pods --field-selector spec.nodeName=ip-10-0-148-228.us-west-2.compute.internal
NAME                          READY   STATUS    RESTARTS   AGE
machine-config-daemon-fh7jv   2/2     Running   0          7h8m

== Wait for eviction to time out.  Check MCD logs for "global timeout reached: 1m30s" == 

$ oc  -n openshift-machine-config-operator logs machine-config-daemon-fh7jv -c machine-config-daemon
--SNIP--
I0803 20:54:17.764325    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:17.772371    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:22.784670    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:22.799447    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:27.803554    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:27.811602    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:32.811922    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:32.820051    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:37.827921    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:37.836011    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:42.853392    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:42.861594    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:47.863960    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:47.877359    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:52.879678    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:52.901840    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:57.919019    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:57.927111    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:55:02.940723    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:55:02.949544    2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:55:07.951738    2014 daemon.go:341] evicting pod default/dont-evict-this-pod
I0803 20:55:07.951780    2014 update.go:148] Draining failed with: error when evicting pod "dont-evict-this-pod": global timeout reached: 1m30s, retrying
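
Optionally, confirm the underlying metric directly, since the alert expression is mcd_drain > 0 (see the rule below). A sketch, assuming curl is present in the machine-config-daemon image; the daemon is host-networked and serves metrics on port 9001, matching the instance label in the alert:

$ oc -n openshift-machine-config-operator exec machine-config-daemon-fh7jv \
    -c machine-config-daemon -- curl -s http://localhost:9001/metrics | grep mcd_drain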

$ oc get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-140-165.us-west-2.compute.internal   Ready                      master   7h36m   v0.0.0-master+$Format:%h$
ip-10-0-148-228.us-west-2.compute.internal   Ready,SchedulingDisabled   worker   7h24m   v0.0.0-master+$Format:%h$
ip-10-0-170-248.us-west-2.compute.internal   Ready                      master   7h36m   v0.0.0-master+$Format:%h$
ip-10-0-185-62.us-west-2.compute.internal    Ready                      worker   7h24m   v0.0.0-master+$Format:%h$
ip-10-0-214-243.us-west-2.compute.internal   Ready                      worker   7h24m   v0.0.0-master+$Format:%h$
ip-10-0-219-92.us-west-2.compute.internal    Ready                      master   7h36m   v0.0.0-master+$Format:%h$
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-f16dd8debc6fb0ad1212ffc7f386e67a   True      False      False      3              3                   3                     0                      7h35m
worker   rendered-worker-18530e85ada03eb1df754c4ede1fabec   False     True       True       3              0                   0                     1                      7h35m
$ oc get mcp/worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2020-08-03T13:21:08Z"
  generation: 4
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    pools.operator.machineconfiguration.openshift.io/worker: ""
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/mco-built-in: {}
          f:pools.operator.machineconfiguration.openshift.io/worker: {}
      f:spec:
        .: {}
        f:configuration: {}
        f:machineConfigSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:machineconfiguration.openshift.io/role: {}
        f:nodeSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:node-role.kubernetes.io/worker: {}
        f:paused: {}
    manager: machine-config-operator
    operation: Update
    time: "2020-08-03T13:21:08Z"
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name: {}
          f:source: {}
      f:status:
        .: {}
        f:conditions: {}
        f:configuration:
          .: {}
          f:name: {}
          f:source: {}
        f:degradedMachineCount: {}
        f:machineCount: {}
        f:observedGeneration: {}
        f:readyMachineCount: {}
        f:unavailableMachineCount: {}
        f:updatedMachineCount: {}
    manager: machine-config-controller
    operation: Update
    time: "2020-08-03T20:48:00Z"
  name: worker
  resourceVersion: "385650"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  uid: 60d50c8a-25c0-4dd9-a8ce-ccfd13ef389e
spec:
  configuration:
    name: rendered-worker-31140c01283f3a5cce98f76c006d563f
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2020-08-03T13:21:55Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2020-08-03T20:37:52Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2020-08-03T20:37:52Z"
    message: All nodes are updating to rendered-worker-31140c01283f3a5cce98f76c006d563f
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2020-08-03T20:48:00Z"
    message: 'Node ip-10-0-148-228.us-west-2.compute.internal is reporting: "failed
      to drain node (5 tries): timed out waiting for the condition: error when evicting
      pod \"dont-evict-this-pod\": global timeout reached: 1m30s"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2020-08-03T20:48:00Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  configuration:
    name: rendered-worker-18530e85ada03eb1df754c4ede1fabec
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: test-file
  degradedMachineCount: 1
  machineCount: 3
  observedGeneration: 4
  readyMachineCount: 0
  unavailableMachineCount: 1
  updatedMachineCount: 0

== Find the prometheus web URL ==
$ oc -n openshift-monitoring get routes
NAME                HOST/PORT                                                                        PATH   SERVICES            PORT    TERMINATION          WILDCARD
alertmanager-main   alertmanager-main-openshift-monitoring.apps.mnguyen46.devcluster.openshift.com          alertmanager-main   web     reencrypt/Redirect   None
grafana             grafana-openshift-monitoring.apps.mnguyen46.devcluster.openshift.com                    grafana             https   reencrypt/Redirect   None
>> prometheus-k8s      prometheus-k8s-openshift-monitoring.apps.mnguyen46.devcluster.openshift.com             prometheus-k8s      web     reencrypt/Redirect   None
thanos-querier      thanos-querier-openshift-monitoring.apps.mnguyen46.devcluster.openshift.com             thanos-querier      web     reencrypt/Redirect   None

== Go to https://prometheus-k8s-openshift-monitoring.apps.mnguyen46.devcluster.openshift.com and log in as kubeadmin ==

== Click on Alerts and search for MCDDrainError ==


alert: MCDDrainError
expr: mcd_drain > 0
labels:
  severity: warning
annotations:
  message: 'Drain failed on {{ $labels.node }} , updates may be blocked. For more details:  oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon'

Labels: alertname="MCDDrainError" drain_time="602.309273355 sec" endpoint="metrics" err="5 tries: error when evicting pod "dont-evict-this-pod": global timeout reached: 1m30s" instance="10.0.148.228:9001" job="machine-config-daemon" namespace="openshift-machine-config-operator" pod="machine-config-daemon-fh7jv" service="machine-config-daemon" severity="warning"
State: firing    Active Since: 2020-08-03 20:48:36.021562042 +0000 UTC    Value: 1.5964876759266927e+09

Labels: alertname="MCDDrainError" drain_time="602.542385785 sec" endpoint="metrics" err="5 tries: error when evicting pod "dont-evict-this-pod": global timeout reached: 1m30s" instance="10.0.148.228:9001" job="machine-config-daemon" namespace="openshift-machine-config-operator" pod="machine-config-daemon-fh7jv" service="machine-config-daemon" severity="warning"
State: firing    Active Since: 2020-08-03 20:58:36.021562042 +0000 UTC    Value: 1.596488278520987e+09

Comment 6 errata-xmlrpc 2020-10-27 16:21:20 UTC
Since the problem described in this bug report should be resolved in a
recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images),
and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196