Bug 1968019 - drain timeout and pool degrading period is too short
Summary: drain timeout and pool degrading period is too short
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Kirsten Garrison
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1968759 1973006
 
Reported: 2021-06-04 17:58 UTC by Kirsten Garrison
Modified: 2021-07-30 23:41 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The drain period and timeout were too short for an average cluster. Consequence: Failures and alerts were reported prematurely when a normal cluster would just need more time. Fix: Bump the timeouts so that a failure is reported and an alert fired only after 1 hour of failed drain attempts. Result: The cluster operator will report more meaningful failures and alerts, and will not prematurely degrade an average cluster.
Clone Of:
Clones: 1987221
Environment:
Last Closed: 2021-07-27 23:11:38 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Drain Error on Prometheus (85.85 KB, image/png)
2021-06-08 22:11 UTC, Michael Nguyen
no flags


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2597 0 None open Bug 1968019: Bump drain timeout to 1h 2021-06-04 18:00:37 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:11:55 UTC

Description Kirsten Garrison 2021-06-04 17:58:53 UTC
Description of problem:

The drain period and timeout that cause a pool to be marked degraded are too short for an average cluster. This leads to reported upgrade failures and alerts when the user would just need a bit more time (especially given that any degraded pool is now surfaced as an upgrade blocker). It isn't uncommon for nodes to need between 15 minutes and 1 hour to drain, so bump the timeouts to only fire alerts and report a failure after at least 1 hour of drain attempts.


Actual results:
Nodes that need a reasonable amount of time to drain error out.


Expected results:

A node that needs up to an hour to drain should be able to do so without causing an error.
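
For illustration only, the new behavior amounts to a retry-until-deadline loop: keep retrying the drain and only report a failure (and let MCDDrainErr fire) once a full hour of attempts has elapsed. This is a conceptual sketch in shell, not the actual MCO code; the node name is just the one used in the verification below.

NODE=ip-10-0-166-111.us-west-2.compute.internal
deadline=$(( $(date +%s) + 3600 ))              # only degrade/alert after 1 hour of failed drains
until oc adm drain "$NODE" --ignore-daemonsets --timeout=60s; do
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "failed to drain node : $NODE after 1 hour" >&2   # same shape as the MCD error below
    break
  fi
  sleep 5                                       # the MCD retries evictions every 5s
done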

Comment 1 Kirsten Garrison 2021-06-04 18:00:06 UTC
This would be an intermediate fix, related to: https://bugzilla.redhat.com/show_bug.cgi?id=1952694

Comment 4 Michael Nguyen 2021-06-08 22:11:22 UTC
Verified on 4.8.0-0.nightly-2021-06-08-034312

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-08-034312   True        False         7h38m   Cluster version is 4.8.0-0.nightly-2021-06-08-034312

$ cat pdb.yaml 
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict

$ oc create -f pdb.yaml 
poddisruptionbudget.policy/dontevict created

$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-166-111"} } }' --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 3h
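
Not part of the original steps, but a quick sanity check that the pod landed on the target node and is covered by the PDB:

$ oc get pod dont-evict-this-pod -o wide
$ oc get pdb dontevict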

=== wait for pod to start then add a file through MC to start the drain process ===

$ cat file-ig3.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 0644
        path: /etc/test

$ oc create -f file-ig3.yaml 
machineconfig.machineconfiguration.openshift.io/test-file created
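
Optional: to watch the rollout start and to find the machine-config-daemon pod running on the target node (the k8s-app=machine-config-daemon label is the MCD daemonset's selector) for the log command below:

$ oc get mcp worker
$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon -o wide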

=== Wait for 1 hour to capture error message ===

$ oc -n openshift-machine-config-operator logs -f machine-config-daemon-2jq6z -c machine-config-daemon
I0608 22:03:17.252850    2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:17.264125    2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:22.264254    2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:22.282463    2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:27.282712    2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:27.291016    2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:32.291138    2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:32.306597    2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:37.307020    2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:37.322719    2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0608 22:03:37.785235    2015 writer.go:135] Marking Degraded due to: failed to drain node : ip-10-0-166-111.us-west-2.compute.internal after 1 hour
I0608 22:03:37.803835    2015 update.go:549] Checking Reconcilable for config rendered-worker-c9f2639c99f57ce9882509c2ab05eb74 to rendered-worker-125367f12e4ddd19c61b945ee92f721a
I0608 22:03:37.835755    2015 update.go:1863] Starting update from rendered-worker-c9f2639c99f57ce9882509c2ab05eb74 to rendered-worker-125367f12e4ddd19c61b945ee92f721a: &{osUpdate:false kargs:false fips:false passwd:false files:true units:false kernelType:false extensions:false}
I0608 22:03:37.869113    2015 update.go:451] File diff: /etc/test was deleted
I0608 22:03:37.869244    2015 update.go:461] File diff: /etc/testing was added
I0608 22:03:37.869272    2015 update.go:1863] Node has been successfully cordoned
I0608 22:03:37.872625    2015 update.go:1863] Update prepared; beginning drain
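
The per-node drain state can also be read from the MCD's state annotation (assuming the usual machineconfiguration.openshift.io/state annotation, which should report Degraded at this point):

$ oc get node ip-10-0-166-111.us-west-2.compute.internal -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}'
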
$ oc get mcp/worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2021-06-08T12:50:28Z"
  generation: 5
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    pools.operator.machineconfiguration.openshift.io/worker: ""
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/mco-built-in: {}
          f:pools.operator.machineconfiguration.openshift.io/worker: {}
      f:spec:
        .: {}
        f:configuration:
          .: {}
          f:source: {}
        f:machineConfigSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:machineconfiguration.openshift.io/role: {}
        f:nodeSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:node-role.kubernetes.io/worker: {}
        f:paused: {}
    manager: machine-config-operator
    operation: Update
    time: "2021-06-08T12:50:28Z"
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name: {}
          f:source: {}
      f:status:
        .: {}
        f:conditions: {}
        f:configuration:
          .: {}
          f:name: {}
          f:source: {}
        f:degradedMachineCount: {}
        f:machineCount: {}
        f:observedGeneration: {}
        f:readyMachineCount: {}
        f:unavailableMachineCount: {}
        f:updatedMachineCount: {}
    manager: machine-config-controller
    operation: Update
    time: "2021-06-08T12:52:24Z"
  name: worker
  resourceVersion: "266675"
  uid: 14e25e63-4cd4-469a-ac1f-7491b4a7e504
spec:
  configuration:
    name: rendered-worker-125367f12e4ddd19c61b945ee92f721a
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: test-file
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2021-06-08T12:52:45Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2021-06-08T21:01:07Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2021-06-08T21:01:07Z"
    message: All nodes are updating to rendered-worker-125367f12e4ddd19c61b945ee92f721a
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2021-06-08T22:03:42Z"
    message: 'Node ip-10-0-166-111.us-west-2.compute.internal is reporting: "failed
      to drain node : ip-10-0-166-111.us-west-2.compute.internal after 1 hour"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2021-06-08T22:03:42Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  configuration:
    name: rendered-worker-c9f2639c99f57ce9882509c2ab05eb74
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: test-file
  degradedMachineCount: 1
  machineCount: 3
  observedGeneration: 5
  readyMachineCount: 1
  unavailableMachineCount: 1
  updatedMachineCount: 1

See the attached screenshot for the Prometheus MCDDrainErr alert firing.
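
The same NodeDegraded condition can be pulled out without dumping the whole pool, e.g.:

$ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}{"\n"}'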

Comment 5 Michael Nguyen 2021-06-08 22:11:47 UTC
Created attachment 1789463 [details]
Drain Error on Prometheus

Comment 8 errata-xmlrpc 2021-07-27 23:11:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

