Bug 1968019
| Summary: | drain timeout and pool degrading period is too short |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Machine Config Operator |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| Version: | 4.7 |
| Target Milestone: | --- |
| Target Release: | 4.8.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | Kirsten Garrison <kgarriso> |
| Assignee: | Kirsten Garrison <kgarriso> |
| QA Contact: | Michael Nguyen <mnguyen> |
| CC: | jerzhang, wking |
| Doc Type: | Bug Fix |
| Story Points: | --- |
| Clones: | 1987221 (view as bug list) |
| Last Closed: | 2021-07-27 23:11:38 UTC |
| Type: | Bug |
| Regression: | --- |
| Bug Blocks: | 1968759, 1973006 |

Doc Text:
Cause: The drain retry period and timeout are too short for an average cluster.
Consequence: Failures and alerts fire prematurely when a normal cluster would simply need more time to finish draining.
Fix: Bump the timeout so that a failure is only reported and an alert fired after 1 hour of failed drain attempts.
Result: The cluster operator produces more meaningful failures and alerts and no longer degrades prematurely for an average cluster.
Description (Kirsten Garrison, 2021-06-04 17:58:53 UTC):

This would be an intermediate fix, related to https://bugzilla.redhat.com/show_bug.cgi?id=1952694.

Verified on 4.8.0-0.nightly-2021-06-08-034312:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.8.0-0.nightly-2021-06-08-034312 True False 7h38m Cluster version is 4.8.0-0.nightly-2021-06-08-034312
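The target node used in the nodeSelector override below (ip-10-0-166-111) is simply one of the cluster's workers; a command like the following (standard oc usage, not part of the original verification output) lists the candidates to pick from:

$ oc get nodes -l node-role.kubernetes.io/worker -o name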
$ cat pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict
$ oc create -f pdb.yaml
poddisruptionbudget.policy/dontevict created
$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-166-111"} } }' --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 3h
=== Wait for the pod to start, then add a file through a MachineConfig to start the drain process ===
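Optionally, once the pod is running, its placement and the disruption budget can be double-checked with standard commands (not part of the original output). With minAvailable: 1 and a single matching pod, the PDB should report 0 allowed disruptions, which is what blocks the eviction during the drain:

$ oc get pod dont-evict-this-pod -o wide
$ oc get pdb dontevict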
$ cat file-ig3.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 0644
        path: /etc/test
$ oc create -f file-ig3.yaml
machineconfig.machineconfiguration.openshift.io/test-file created
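For reference, the base64 payload in the data URL is arbitrary content used only to force a change to /etc/test; it should decode to three placeholder chrony-style server lines:

$ echo 'c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK' | base64 -d
server foo.example.net maxdelay 0.4 offline
server bar.example.net maxdelay 0.4 offline
server baz.example.net maxdelay 0.4 offline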
=== Wait for 1 hour to capture the error message ===
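Instead of watching the daemon logs for the full hour, something like the following should block until the pool reports the new condition (assuming oc wait handles MachineConfigPool conditions; the timeout is padded past the 1-hour drain window):

$ oc wait mcp/worker --for=condition=NodeDegraded --timeout=70m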
$ oc -n openshift-machine-config-operator logs -f machine-config-daemon-2jq6z -c machine-config-daemon
I0608 22:03:17.252850 2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:17.264125 2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:22.264254 2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:22.282463 2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:27.282712 2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:27.291016 2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:32.291138 2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:32.306597 2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:37.307020 2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:37.322719 2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0608 22:03:37.785235 2015 writer.go:135] Marking Degraded due to: failed to drain node : ip-10-0-166-111.us-west-2.compute.internal after 1 hour
I0608 22:03:37.803835 2015 update.go:549] Checking Reconcilable for config rendered-worker-c9f2639c99f57ce9882509c2ab05eb74 to rendered-worker-125367f12e4ddd19c61b945ee92f721a
I0608 22:03:37.835755 2015 update.go:1863] Starting update from rendered-worker-c9f2639c99f57ce9882509c2ab05eb74 to rendered-worker-125367f12e4ddd19c61b945ee92f721a: &{osUpdate:false kargs:false fips:false passwd:false files:true units:false kernelType:false extensions:false}
I0608 22:03:37.869113 2015 update.go:451] File diff: /etc/test was deleted
I0608 22:03:37.869244 2015 update.go:461] File diff: /etc/testing was added
I0608 22:03:37.869272 2015 update.go:1863] Node has been successfully cordoned
I0608 22:03:37.872625 2015 update.go:1863] Update prepared; beginning drain
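If the full daemon log is too noisy, the degrade message can be filtered out directly (same log as above):

$ oc -n openshift-machine-config-operator logs machine-config-daemon-2jq6z -c machine-config-daemon | grep 'Marking Degraded'
E0608 22:03:37.785235 2015 writer.go:135] Marking Degraded due to: failed to drain node : ip-10-0-166-111.us-west-2.compute.internal after 1 hour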
$ oc get mcp/worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2021-06-08T12:50:28Z"
  generation: 5
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    pools.operator.machineconfiguration.openshift.io/worker: ""
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/mco-built-in: {}
          f:pools.operator.machineconfiguration.openshift.io/worker: {}
      f:spec:
        .: {}
        f:configuration:
          .: {}
          f:source: {}
        f:machineConfigSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:machineconfiguration.openshift.io/role: {}
        f:nodeSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:node-role.kubernetes.io/worker: {}
        f:paused: {}
    manager: machine-config-operator
    operation: Update
    time: "2021-06-08T12:50:28Z"
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name: {}
          f:source: {}
      f:status:
        .: {}
        f:conditions: {}
        f:configuration:
          .: {}
          f:name: {}
          f:source: {}
        f:degradedMachineCount: {}
        f:machineCount: {}
        f:observedGeneration: {}
        f:readyMachineCount: {}
        f:unavailableMachineCount: {}
        f:updatedMachineCount: {}
    manager: machine-config-controller
    operation: Update
    time: "2021-06-08T12:52:24Z"
  name: worker
  resourceVersion: "266675"
  uid: 14e25e63-4cd4-469a-ac1f-7491b4a7e504
spec:
  configuration:
    name: rendered-worker-125367f12e4ddd19c61b945ee92f721a
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: test-file
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2021-06-08T12:52:45Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2021-06-08T21:01:07Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2021-06-08T21:01:07Z"
    message: All nodes are updating to rendered-worker-125367f12e4ddd19c61b945ee92f721a
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2021-06-08T22:03:42Z"
    message: 'Node ip-10-0-166-111.us-west-2.compute.internal is reporting: "failed
      to drain node : ip-10-0-166-111.us-west-2.compute.internal after 1 hour"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2021-06-08T22:03:42Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  configuration:
    name: rendered-worker-c9f2639c99f57ce9882509c2ab05eb74
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: test-file
  degradedMachineCount: 1
  machineCount: 3
  observedGeneration: 5
  readyMachineCount: 1
  unavailableMachineCount: 1
  updatedMachineCount: 1
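The NodeDegraded condition can also be pulled out on its own rather than dumping the whole pool object (standard jsonpath filtering; the message matches the condition shown above):

$ oc get mcp/worker -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}'
Node ip-10-0-166-111.us-west-2.compute.internal is reporting: "failed to drain node : ip-10-0-166-111.us-west-2.compute.internal after 1 hour"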
See screenshot for prometheus MCDDrainErr firing.
Created attachment 1789463 [details]: Drain Error on Prometheus
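As a CLI cross-check of the screenshot, the alert definition can be located in the shipped PrometheusRule objects (assuming MCDDrainErr is defined there; the namespace and exact rule layout are not shown in this bug):

$ oc get prometheusrules -A -o yaml | grep -B2 -A8 MCDDrainErr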
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438