Description of problem:
The drain timeout that marks a pool Degraded is too short for an average cluster. This leads to reported upgrade failures and alerts when the user would just need a bit more time (especially since any degraded pool is now surfaced as an upgrade blocker). It isn't uncommon for nodes to need between 15 minutes and 1 hour to drain, so bump the timeouts so that alerts fire and a failure is reported only after at least 1 hour of drain attempts.

Actual results:
Nodes that need a reasonable amount of time to drain error out.

Expected results:
A node that needs an hour to drain should be able to complete the drain without raising an error.
This would be an intermediate fix, related to: https://bugzilla.redhat.com/show_bug.cgi?id=1952694
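For reference while reproducing this, one way to see the drain retries (and therefore how long a node has actually been draining) is to follow the machine-config-daemon log on the affected node, as the verification below does. A minimal sketch, assuming the MCD pod name for the node is looked up first (the grep pattern simply matches the retry messages the daemon already logs):

$ oc -n openshift-machine-config-operator get pods -o wide | grep machine-config-daemon
$ oc -n openshift-machine-config-operator logs <machine-config-daemon-pod> -c machine-config-daemon | grep 'evicting pod'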
Verified on 4.8.0-0.nightly-2021-06-08-034312

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-08-034312   True        False         7h38m   Cluster version is 4.8.0-0.nightly-2021-06-08-034312

$ cat pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict

$ oc create -f pdb.yaml
poddisruptionbudget.policy/dontevict created

$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-166-111"} } }' --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 3h

=== wait for pod to start then add a file through MC to start the drain process ===

$ cat file-ig3.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 0644
        path: /etc/test

$ oc create -f file-ig3.yaml
machineconfig.machineconfiguration.openshift.io/test-file created

=== Wait for 1 hour to capture error message ===

$ oc -n openshift-machine-config-operator logs -f machine-config-daemon-2jq6z -c machine-config-daemon
I0608 22:03:17.252850 2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:17.264125 2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:22.264254 2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:22.282463 2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:27.282712 2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:27.291016 2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:32.291138 2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:32.306597 2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0608 22:03:37.307020 2015 daemon.go:330] evicting pod default/dont-evict-this-pod
E0608 22:03:37.322719 2015 daemon.go:330] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0608 22:03:37.785235 2015 writer.go:135] Marking Degraded due to: failed to drain node : ip-10-0-166-111.us-west-2.compute.internal after 1 hour
I0608 22:03:37.803835 2015 update.go:549] Checking Reconcilable for config rendered-worker-c9f2639c99f57ce9882509c2ab05eb74 to rendered-worker-125367f12e4ddd19c61b945ee92f721a
I0608 22:03:37.835755 2015 update.go:1863] Starting update from rendered-worker-c9f2639c99f57ce9882509c2ab05eb74 to rendered-worker-125367f12e4ddd19c61b945ee92f721a: &{osUpdate:false kargs:false fips:false passwd:false files:true units:false kernelType:false extensions:false}
I0608 22:03:37.869113 2015 update.go:451] File diff: /etc/test was deleted
I0608 22:03:37.869244 2015 update.go:461] File diff: /etc/testing was added
I0608 22:03:37.869272 2015 update.go:1863] Node has been successfully cordoned
I0608 22:03:37.872625 2015 update.go:1863] Update prepared; beginning drain

$ oc get mcp/worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2021-06-08T12:50:28Z"
  generation: 5
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    pools.operator.machineconfiguration.openshift.io/worker: ""
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/mco-built-in: {}
          f:pools.operator.machineconfiguration.openshift.io/worker: {}
      f:spec:
        .: {}
        f:configuration:
          .: {}
          f:source: {}
        f:machineConfigSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:machineconfiguration.openshift.io/role: {}
        f:nodeSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:node-role.kubernetes.io/worker: {}
        f:paused: {}
    manager: machine-config-operator
    operation: Update
    time: "2021-06-08T12:50:28Z"
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name: {}
          f:source: {}
      f:status:
        .: {}
        f:conditions: {}
        f:configuration:
          .: {}
          f:name: {}
          f:source: {}
        f:degradedMachineCount: {}
        f:machineCount: {}
        f:observedGeneration: {}
        f:readyMachineCount: {}
        f:unavailableMachineCount: {}
        f:updatedMachineCount: {}
    manager: machine-config-controller
    operation: Update
    time: "2021-06-08T12:52:24Z"
  name: worker
  resourceVersion: "266675"
  uid: 14e25e63-4cd4-469a-ac1f-7491b4a7e504
spec:
  configuration:
    name: rendered-worker-125367f12e4ddd19c61b945ee92f721a
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: test-file
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2021-06-08T12:52:45Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2021-06-08T21:01:07Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2021-06-08T21:01:07Z"
    message: All nodes are updating to rendered-worker-125367f12e4ddd19c61b945ee92f721a
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2021-06-08T22:03:42Z"
    message: 'Node ip-10-0-166-111.us-west-2.compute.internal is reporting: "failed to drain node : ip-10-0-166-111.us-west-2.compute.internal after 1 hour"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2021-06-08T22:03:42Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  configuration:
    name: rendered-worker-c9f2639c99f57ce9882509c2ab05eb74
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: test-file
  degradedMachineCount: 1
  machineCount: 3
  observedGeneration: 5
  readyMachineCount: 1
  unavailableMachineCount: 1
  updatedMachineCount: 1

See the attached screenshot for the Prometheus MCDDrainErr alert firing.
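As a shortcut, the degraded message can be pulled out of the pool without dumping the full YAML above, and the verification setup can then be cleaned up so the drain completes. This is a sketch, assuming the pod and PDB names used in the steps above; once the blocking pod and PDB are gone, the drain should succeed and the pool should recover on a subsequent sync:

$ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}'
$ oc delete pod dont-evict-this-pod
$ oc delete pdb dontevict
$ oc get mcp worker -w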
Created attachment 1789463 [details]
Drain Error on Prometheus
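The same check can be made from the Prometheus/console query UI without the screenshot; a sketch using the built-in ALERTS metric, with a regex on the alert name since the shorthand above (MCDDrainErr) may not match the exact alert name:

ALERTS{alertname=~"MCDDrain.*", alertstate="firing"}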
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438