Bug 1861876
| Summary: | MCDDrainError firing as critical instead of as warning | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Kirsten Garrison <kgarriso> |
| Component: | Machine Config Operator | Assignee: | Kirsten Garrison <kgarriso> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.6 | CC: | wking |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:21:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1862538 | | |
Description
Kirsten Garrison
2020-07-29 19:15:48 UTC
Always firing as critical is very confusing and disruptive. It was an error to set this alert to critical; it should be a warning until the upcoming MCO telemetry overhaul.

Verified on 4.6.0-0.nightly-2020-08-03-025909:
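The fix itself is a one-line change: the `severity` label on the MCDDrainError alerting rule moves from `critical` to `warning`. A minimal sketch of the corrected Prometheus rule, built from the expression and annotation shown in the verification output later in this bug (the group name is illustrative, not taken from the MCO source):

```yaml
groups:
- name: mcd-alerts  # illustrative group name, not the MCO's actual rule group
  rules:
  - alert: MCDDrainError
    expr: mcd_drain > 0
    labels:
      severity: warning  # previously critical; the subject of this bug
    annotations:
      message: 'Drain failed on {{ $labels.node }} , updates may be blocked. For more details: oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon'
```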
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.0-0.nightly-2020-08-03-025909 True False 7h26m Cluster version is 4.6.0-0.nightly-2020-08-03-025909
== Create PDB ==
$ cat << EOF > pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict
EOF
$ oc create -f pdb.yaml
poddisruptionbudget.policy/dontevict created
$ oc get pdb
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
dontevict 1 N/A 0 5s
== Run pod on a worker node that has the label associated with the pdb ==
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-140-165.us-west-2.compute.internal Ready master 7h1m v0.0.0-master+$Format:%h$
ip-10-0-148-228.us-west-2.compute.internal Ready worker 6h49m v0.0.0-master+$Format:%h$
ip-10-0-170-248.us-west-2.compute.internal Ready master 7h1m v0.0.0-master+$Format:%h$
ip-10-0-185-62.us-west-2.compute.internal Ready worker 6h49m v0.0.0-master+$Format:%h$
ip-10-0-214-243.us-west-2.compute.internal Ready worker 6h49m v0.0.0-master+$Format:%h$
ip-10-0-219-92.us-west-2.compute.internal Ready master 7h1m v0.0.0-master+$Format:%h$
$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-148-228" } } }' --image=docker.io/busybox dont-evict-this-pod -- sleep 1h
pod/dont-evict-this-pod created
$ oc -n default get pods
NAME READY STATUS RESTARTS AGE
dont-evict-this-pod 1/1 Running 0 9s
== Do something that causes a node drain. Here I deleted a MachineConfig I had created earlier, but you can simply add a file via a MachineConfig instead (see example below) ==
Example
----------------------------------------------------------
cat << EOF > file.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 0644
        path: /etc/test
EOF
oc create -f file.yaml
------------------------------
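The base64 payload in the example MachineConfig above is plain text; decoding it locally shows the content that would be written to /etc/test (three placeholder chrony-style server lines):

```shell
# Decode the data: URI payload from the example MachineConfig above.
echo 'c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK' | base64 -d
# server foo.example.net maxdelay 0.4 offline
# server bar.example.net maxdelay 0.4 offline
# server baz.example.net maxdelay 0.4 offline
```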
$ oc delete -f file.yaml
machineconfig.machineconfiguration.openshift.io "test-file" deleted
$ oc get mc
NAME GENERATEDBYCONTROLLER IGNITIONVERSION AGE
00-master 057d852d0d10f94120aaa91e771503baa5b3c242 3.1.0 7h15m
00-worker 057d852d0d10f94120aaa91e771503baa5b3c242 3.1.0 7h15m
01-master-container-runtime 057d852d0d10f94120aaa91e771503baa5b3c242 3.1.0 7h15m
01-master-kubelet 057d852d0d10f94120aaa91e771503baa5b3c242 3.1.0 7h15m
01-worker-container-runtime 057d852d0d10f94120aaa91e771503baa5b3c242 3.1.0 7h15m
01-worker-kubelet 057d852d0d10f94120aaa91e771503baa5b3c242 3.1.0 7h15m
99-master-generated-registries 057d852d0d10f94120aaa91e771503baa5b3c242 3.1.0 7h15m
99-master-ssh 3.1.0 7h25m
99-worker-generated-registries 057d852d0d10f94120aaa91e771503baa5b3c242 3.1.0 7h15m
99-worker-ssh 3.1.0 7h25m
rendered-master-f16dd8debc6fb0ad1212ffc7f386e67a 057d852d0d10f94120aaa91e771503baa5b3c242 3.1.0 7h15m
rendered-worker-18530e85ada03eb1df754c4ede1fabec 057d852d0d10f94120aaa91e771503baa5b3c242 3.1.0 6h23m
rendered-worker-31140c01283f3a5cce98f76c006d563f 057d852d0d10f94120aaa91e771503baa5b3c242 3.1.0 7h15m
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-f16dd8debc6fb0ad1212ffc7f386e67a True False False 3 3 3 0 7h16m
worker rendered-worker-18530e85ada03eb1df754c4ede1fabec True False False 3 3 3 0 7h16m
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-f16dd8debc6fb0ad1212ffc7f386e67a True False False 3 3 3 0 7h16m
worker rendered-worker-18530e85ada03eb1df754c4ede1fabec False True False 3 0 0 0 7h16m
$ oc -n openshift-machine-config-operator get pods --field-selector spec.nodeName=ip-10-0-148-228.us-west-2.compute.internal
NAME READY STATUS RESTARTS AGE
machine-config-daemon-fh7jv 2/2 Running 0 7h8m
== Wait for eviction to time out. Check MCD logs for "global timeout reached: 1m30s" ==
$ oc -n openshift-machine-config-operator logs machine-config-daemon-fh7jv -c machine-config-daemon
--SNIP--
I0803 20:54:17.764325 2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:17.772371 2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:22.784670 2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:22.799447 2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:27.803554 2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:27.811602 2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:32.811922 2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:32.820051 2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:37.827921 2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:37.836011 2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:42.853392 2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:42.861594 2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:47.863960 2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:47.877359 2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:52.879678 2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:52.901840 2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:54:57.919019 2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:54:57.927111 2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:55:02.940723 2014 daemon.go:341] evicting pod default/dont-evict-this-pod
E0803 20:55:02.949544 2014 daemon.go:341] error when evicting pod "dont-evict-this-pod" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0803 20:55:07.951738 2014 daemon.go:341] evicting pod default/dont-evict-this-pod
I0803 20:55:07.951780 2014 update.go:148] Draining failed with: error when evicting pod "dont-evict-this-pod": global timeout reached: 1m30s, retrying
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-140-165.us-west-2.compute.internal Ready master 7h36m v0.0.0-master+$Format:%h$
ip-10-0-148-228.us-west-2.compute.internal Ready,SchedulingDisabled worker 7h24m v0.0.0-master+$Format:%h$
ip-10-0-170-248.us-west-2.compute.internal Ready master 7h36m v0.0.0-master+$Format:%h$
ip-10-0-185-62.us-west-2.compute.internal Ready worker 7h24m v0.0.0-master+$Format:%h$
ip-10-0-214-243.us-west-2.compute.internal Ready worker 7h24m v0.0.0-master+$Format:%h$
ip-10-0-219-92.us-west-2.compute.internal Ready master 7h36m v0.0.0-master+$Format:%h$
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-f16dd8debc6fb0ad1212ffc7f386e67a True False False 3 3 3 0 7h35m
worker rendered-worker-18530e85ada03eb1df754c4ede1fabec False True True 3 0 0 1 7h35m
$ oc get mcp/worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2020-08-03T13:21:08Z"
  generation: 4
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    pools.operator.machineconfiguration.openshift.io/worker: ""
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/mco-built-in: {}
          f:pools.operator.machineconfiguration.openshift.io/worker: {}
      f:spec:
        .: {}
        f:configuration: {}
        f:machineConfigSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:machineconfiguration.openshift.io/role: {}
        f:nodeSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:node-role.kubernetes.io/worker: {}
        f:paused: {}
    manager: machine-config-operator
    operation: Update
    time: "2020-08-03T13:21:08Z"
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name: {}
          f:source: {}
      f:status:
        .: {}
        f:conditions: {}
        f:configuration:
          .: {}
          f:name: {}
          f:source: {}
        f:degradedMachineCount: {}
        f:machineCount: {}
        f:observedGeneration: {}
        f:readyMachineCount: {}
        f:unavailableMachineCount: {}
        f:updatedMachineCount: {}
    manager: machine-config-controller
    operation: Update
    time: "2020-08-03T20:48:00Z"
  name: worker
  resourceVersion: "385650"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  uid: 60d50c8a-25c0-4dd9-a8ce-ccfd13ef389e
spec:
  configuration:
    name: rendered-worker-31140c01283f3a5cce98f76c006d563f
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2020-08-03T13:21:55Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2020-08-03T20:37:52Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2020-08-03T20:37:52Z"
    message: All nodes are updating to rendered-worker-31140c01283f3a5cce98f76c006d563f
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2020-08-03T20:48:00Z"
    message: 'Node ip-10-0-148-228.us-west-2.compute.internal is reporting: "failed
      to drain node (5 tries): timed out waiting for the condition: error when evicting
      pod \"dont-evict-this-pod\": global timeout reached: 1m30s"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2020-08-03T20:48:00Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  configuration:
    name: rendered-worker-18530e85ada03eb1df754c4ede1fabec
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: test-file
  degradedMachineCount: 1
  machineCount: 3
  observedGeneration: 4
  readyMachineCount: 0
  unavailableMachineCount: 1
  updatedMachineCount: 0
== Find the prometheus web URL ==
$ oc -n openshift-monitoring get routes
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
alertmanager-main alertmanager-main-openshift-monitoring.apps.mnguyen46.devcluster.openshift.com alertmanager-main web reencrypt/Redirect None
grafana grafana-openshift-monitoring.apps.mnguyen46.devcluster.openshift.com grafana https reencrypt/Redirect None
>> prometheus-k8s prometheus-k8s-openshift-monitoring.apps.mnguyen46.devcluster.openshift.com prometheus-k8s web reencrypt/Redirect None
thanos-querier thanos-querier-openshift-monitoring.apps.mnguyen46.devcluster.openshift.com thanos-querier web reencrypt/Redirect None
== Go to https://prometheus-k8s-openshift-monitoring.apps.mnguyen46.devcluster.openshift.com and log in as kubeadmin ==
== Click on Alerts and search for MCDDrainError ==
alert: MCDDrainError
expr: mcd_drain > 0
labels:
  severity: warning
annotations:
  message: 'Drain failed on {{ $labels.node }} , updates may be blocked. For more details: oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon'
Labels | State | Active Since | Value
alertname="MCDDrainError" drain_time="602.309273355 sec" endpoint="metrics" err="5 tries: error when evicting pod "dont-evict-this-pod": global timeout reached: 1m30s" instance="10.0.148.228:9001" job="machine-config-daemon" namespace="openshift-machine-config-operator" pod="machine-config-daemon-fh7jv" service="machine-config-daemon" severity="warning" | firing | 2020-08-03 20:48:36.021562042 +0000 UTC | 1.5964876759266927e+09
alertname="MCDDrainError" drain_time="602.542385785 sec" endpoint="metrics" err="5 tries: error when evicting pod "dont-evict-this-pod": global timeout reached: 1m30s" instance="10.0.148.228:9001" job="machine-config-daemon" namespace="openshift-machine-config-operator" pod="machine-config-daemon-fh7jv" service="machine-config-daemon" severity="warning" | firing | 2020-08-03 20:58:36.021562042 +0000 UTC | 1.596488278520987e+09
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196