Bug 1968019

Summary: drain timeout and pool degrading period are too short
Product: OpenShift Container Platform
Reporter: Kirsten Garrison <kgarriso>
Component: Machine Config Operator
Assignee: Kirsten Garrison <kgarriso>
Status: CLOSED ERRATA
QA Contact: Michael Nguyen <mnguyen>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 4.7
CC: jerzhang, wking
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: the drain period and timeout are too short for an average cluster. Consequence: failures and alerts fire prematurely when a normal cluster would just need more time. Fix: bump the timeouts to only report a failure and fire an alert after 1 hour of failed drain attempts. Result: the cluster operator will have more meaningful failures and alerts, and will not prematurely degrade for an average cluster.
Story Points: ---
Clone Of:
Cloned As: 1987221 (view as bug list)
Environment:
Last Closed: 2021-07-27 23:11:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1968759, 1973006    
Attachments:
  Drain Error on Prometheus (flags: none)

Description Kirsten Garrison 2021-06-04 17:58:53 UTC
Description of problem:

The drain period and timeout which cause a pool to degrade are too short for an average cluster. This leads to reported upgrade failures and alerts when the user would just need a bit more time (especially given that any degraded pool is now surfaced as an upgrade blocker). It isn't uncommon for nodes to need between 15 minutes and 1 hour to drain, so bump the timeouts to only fire alerts and report a failure after at least 1 hour of drain attempts.


Actual results:
Nodes that need a reasonable amount of time to drain error out.


Expected results:

A node that needs an hour to drain correctly should be able to do so without causing an error.

Comment 1 Kirsten Garrison 2021-06-04 18:00:06 UTC
This is an intermediate fix related to: https://bugzilla.redhat.com/show_bug.cgi?id=1952694

Comment 4 Michael Nguyen 2021-06-08 22:11:22 UTC
Verified on 4.8.0-0.nightly-2021-06-08-034312

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-08-034312   True        False         7h38m   Cluster version is 4.8.0-0.nightly-2021-06-08-034312

$ cat pdb.yaml 
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict

$ oc create -f pdb.yaml 
poddisruptionbudget.policy/dontevict created

$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-166-111"} } }' --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 3h
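For context on why the evictions below never succeed: with exactly one matching pod and `minAvailable: 1`, the budget allows zero voluntary disruptions. A minimal Python sketch of that arithmetic (simplified; the real disruption controller also handles `maxUnavailable`, percentages, and unhealthy pods):

```python
# Sketch of the PodDisruptionBudget arithmetic the Eviction API applies.
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions permitted while still honoring minAvailable."""
    return max(healthy_pods - min_available, 0)

# One matching pod with minAvailable: 1 -> zero disruptions allowed,
# so every eviction of dont-evict-this-pod is rejected and retried.
print(disruptions_allowed(healthy_pods=1, min_available=1))  # 0
```

This is what keeps the drain spinning for the full hour in the verification below.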

=== wait for pod to start then add a file through MC to start the drain process ===

$ cat file-ig3.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 0644
        path: /etc/test
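The `source` field above embeds the file contents as a base64 data URL; decoding it shows the payload is just a throwaway three-line test file:

```python
import base64

# Decode the data URL from the MachineConfig above to see what lands in /etc/test.
data_url = (
    "data:text/plain;charset=utf;base64,"
    "c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFy"
    "LmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5u"
    "ZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK"
)
payload = data_url.split("base64,", 1)[1]
print(base64.b64decode(payload).decode())
```

The decoded content is three `server ... maxdelay 0.4 offline` lines; any file change works here, since the point is only to trigger a drain.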

$ oc create -f file-ig3.yaml 
machineconfig.machineconfiguration.openshift.io/test-file created

=== Wait for 1 hour to capture error message ===

$ oc -n openshift-machine-config-operator logs -f machine-config-daemon-2jq6z -c machine-config-daemon
I0608 22:03:17.252850    2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:17.264125    2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:22.264254    2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:22.282463    2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:27.282712    2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:27.291016    2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:32.291138    2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:32.306597    2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:37.307020    2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:37.322719    2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0608 22:03:37.785235    2015 writer.go:135] Marking Degraded due to: failed to drain node : ip-10-0-166-111.us-west-2.compute.internal after 1 hour
I0608 22:03:37.803835    2015 update.go:549] Checking Reconcilable for config rendered-worker-c9f2639c99f57ce9882509c2ab05eb74 to rendered-worker-125367f12e4ddd19c61b945ee92f721a
I0608 22:03:37.835755    2015 update.go:1863] Starting update from rendered-worker-c9f2639c99f57ce9882509c2ab05eb74 to rendered-worker-125367f12e4ddd19c61b945ee92f721a: &{osUpdate:false kargs:false fips:false passwd:false files:true units:false kernelType:false extensions:false}
I0608 22:03:37.869113    2015 update.go:451] File diff: /etc/test was deleted
I0608 22:03:37.869244    2015 update.go:461] File diff: /etc/testing was added
I0608 22:03:37.869272    2015 update.go:1863] Node has been successfully cordoned
I0608 22:03:37.872625    2015 update.go:1863] Update prepared; beginning drain
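The log above shows the loop this bug changes: evictions are retried every 5 seconds, and only once the failures have persisted for a full hour is the node marked Degraded. A simplified Python sketch of that behavior (an illustration with an injected fake clock, not the actual machine-config-daemon code):

```python
import datetime

DRAIN_RETRY_INTERVAL = datetime.timedelta(seconds=5)  # "will retry after 5s" in the logs
DRAIN_TIMEOUT = datetime.timedelta(hours=1)           # the bumped failure window

class FakeClock:
    """Illustrative clock: each reading advances time by 5 seconds,
    standing in for the sleep between eviction retries."""
    def __init__(self):
        self.t = datetime.datetime(2021, 6, 8, 21, 3, 37)
    def __call__(self):
        self.t += DRAIN_RETRY_INTERVAL
        return self.t

def drain_with_timeout(try_evict, now):
    """Retry `try_evict` until it succeeds or DRAIN_TIMEOUT has elapsed."""
    start = now()
    while not try_evict():
        if now() - start >= DRAIN_TIMEOUT:
            return "Degraded: failed to drain node after 1 hour"
        # (the real daemon sleeps DRAIN_RETRY_INTERVAL here)
    return "drained"

print(drain_with_timeout(lambda: False, FakeClock()))
```

With an eviction that never succeeds (as the PDB guarantees here), this returns the Degraded message only after the equivalent of one hour of 5-second retries.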
$ oc get mcp/worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2021-06-08T12:50:28Z"
  generation: 5
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    pools.operator.machineconfiguration.openshift.io/worker: ""
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/mco-built-in: {}
          f:pools.operator.machineconfiguration.openshift.io/worker: {}
      f:spec:
        .: {}
        f:configuration:
          .: {}
          f:source: {}
        f:machineConfigSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:machineconfiguration.openshift.io/role: {}
        f:nodeSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:node-role.kubernetes.io/worker: {}
        f:paused: {}
    manager: machine-config-operator
    operation: Update
    time: "2021-06-08T12:50:28Z"
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name: {}
          f:source: {}
      f:status:
        .: {}
        f:conditions: {}
        f:configuration:
          .: {}
          f:name: {}
          f:source: {}
        f:degradedMachineCount: {}
        f:machineCount: {}
        f:observedGeneration: {}
        f:readyMachineCount: {}
        f:unavailableMachineCount: {}
        f:updatedMachineCount: {}
    manager: machine-config-controller
    operation: Update
    time: "2021-06-08T12:52:24Z"
  name: worker
  resourceVersion: "266675"
  uid: 14e25e63-4cd4-469a-ac1f-7491b4a7e504
spec:
  configuration:
    name: rendered-worker-125367f12e4ddd19c61b945ee92f721a
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: test-file
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2021-06-08T12:52:45Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2021-06-08T21:01:07Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2021-06-08T21:01:07Z"
    message: All nodes are updating to rendered-worker-125367f12e4ddd19c61b945ee92f721a
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2021-06-08T22:03:42Z"
    message: 'Node ip-10-0-166-111.us-west-2.compute.internal is reporting: "failed
      to drain node : ip-10-0-166-111.us-west-2.compute.internal after 1 hour"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2021-06-08T22:03:42Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  configuration:
    name: rendered-worker-c9f2639c99f57ce9882509c2ab05eb74
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: test-file
  degradedMachineCount: 1
  machineCount: 3
  observedGeneration: 5
  readyMachineCount: 1
  unavailableMachineCount: 1
  updatedMachineCount: 1
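The `NodeDegraded` condition above carries the per-node drain failure message. A hypothetical Python helper (names are illustrative, not an MCO API) that pulls that message out of the `.status.conditions` list as parsed from `oc get mcp/worker -o json`:

```python
# Hypothetical helper: surface the NodeDegraded message from MCP conditions.
def node_degraded_message(conditions):
    for cond in conditions:
        if cond["type"] == "NodeDegraded" and cond["status"] == "True":
            return cond["message"]
    return None

# Conditions taken from the MCP status shown above.
conditions = [
    {"type": "Updating", "status": "True",
     "message": "All nodes are updating to rendered-worker-125367f12e4ddd19c61b945ee92f721a"},
    {"type": "NodeDegraded", "status": "True",
     "message": 'Node ip-10-0-166-111.us-west-2.compute.internal is reporting: '
                '"failed to drain node : ip-10-0-166-111.us-west-2.compute.internal after 1 hour"'},
]
print(node_degraded_message(conditions))
```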

See the attached screenshot for the Prometheus MCDDrainErr alert firing.

Comment 5 Michael Nguyen 2021-06-08 22:11:47 UTC
Created attachment 1789463 [details]
Drain Error on Prometheus

Comment 8 errata-xmlrpc 2021-07-27 23:11:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438