Description of problem:

4.4 endurance cluster has an unschedulable degraded node following an upgrade.

$ oc get nodes
NAME                                         STATUS                     ROLES    AGE   VERSION
ip-10-0-128-197.us-east-2.compute.internal   Ready                      worker   40d   v1.17.1+3288478
ip-10-0-154-233.us-east-2.compute.internal   Ready                      master   40d   v1.17.1+3288478
ip-10-0-179-242.us-east-2.compute.internal   Ready                      master   40d   v1.17.1+3288478
ip-10-0-181-0.us-east-2.compute.internal     Ready,SchedulingDisabled   worker   40d   v1.17.1+b8568b3
ip-10-0-193-215.us-east-2.compute.internal   Ready                      master   40d   v1.17.1+3288478
ip-10-0-209-180.us-east-2.compute.internal   Ready                      worker   40d   v1.17.1+3288478

machineconfigpool reports a degraded node:

$ oc get machineconfigpools
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-dcbb36a0fdec5e9d178f2c6f4d3ef732   True      False      False      3              3                   3                     0                      40d
worker   rendered-worker-1054910af316c74e80557c4840e0ebb7   False     True       True       3              2                   2                     1                      40d

The node appears to be degraded because of the following, per oc get machineconfigpools worker -o yaml:

message: 'Node ip-10-0-181-0.us-east-2.compute.internal is reporting: "failed to drain node
  (5 tries): timed out waiting for the condition: [error when evicting pod \"inline-volume-tester2-qgf7t\":
  pods \"inline-volume-tester2-qgf7t\" is forbidden: unable to create new content in namespace
  e2e-ephemeral-879 because it is being terminated, error when evicting pod \"inline-volume-tester2-b26lg\":
  pods \"inline-volume-tester2-b26lg\" is forbidden: unable to create new content in namespace
  e2e-ephemeral-6805 because it is being terminated]"'

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-07-18-033102 upgrading to 4.4.0-0.nightly-2020-08-03-094545
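For anyone triaging a similar state: the same drain failure is also surfaced directly on the node object via the machine-config daemon annotations. A minimal sketch of reading them (the annotation keys match the ones visible in later comments on this bug; bracket syntax is used so the dots in the key can be escaped):

# Per-node MCD state and reason for the wedged worker
$ oc get node ip-10-0-181-0.us-east-2.compute.internal \
    -o jsonpath="{.metadata.annotations['machineconfiguration\.openshift\.io/state']}"
Degraded
$ oc get node ip-10-0-181-0.us-east-2.compute.internal \
    -o jsonpath="{.metadata.annotations['machineconfiguration\.openshift\.io/reason']}"
failed to drain node (5 tries): timed out waiting for the condition: ...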
This would be an expected state, as the MCO will only report degraded on a degraded master; the MCO will report ready if we have worker issues. We have
Oops! Hit save too early. In Slack there was a concern that the MCO was NOT reporting degraded, but this is the expected state: we only report the MCO as degraded on the master pool. There are plans to report on both pools, but as of 4.4 this is correct. To recover the worker, the pods above would have to be resolved so the drain can continue successfully. In the meantime, the upgrade will not be blocked by that degraded worker via the MCO.

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-dcbb36a0fdec5e9d178f2c6f4d3ef732   True      False      False      3              3                   3                     0                      44d
worker   rendered-worker-1054910af316c74e80557c4840e0ebb7   False     True       True       3              2                   2                     1                      44d

$ oc get clusteroperators
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.4.0-0.nightly-2020-07-18-033102   True        False         False      9d
...
monitoring       4.4.0-0.nightly-2020-07-18-033102   False       True          True       4d8h

As seen above, the upgrade itself is stalled on:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-07-18-033102   True        True          4d7h    Unable to apply 4.4.0-0.nightly-2020-08-03-094545: the cluster operator monitoring has not yet successfully rolled out
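If the worker needs to be unstuck manually, one option (a sketch only, not an official procedure; the pod and namespace names are taken from the drain error above) is to bypass the eviction API, which is what the drain is blocked on. Eviction fails because creating the Eviction subresource counts as new content in a terminating namespace, while a direct force delete does not create anything:

# Force-delete the pods that eviction cannot process in the terminating namespaces
$ oc delete pod inline-volume-tester2-qgf7t -n e2e-ephemeral-879 --force --grace-period=0
$ oc delete pod inline-volume-tester2-b26lg -n e2e-ephemeral-6805 --force --grace-period=0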
The MCO behavior may be understood, but the cluster is still broken: monitoring can't finish upgrading because it runs a daemonset on all nodes, and not all nodes are available. So the node availability issue still needs to be addressed.
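The stuck rollout should be visible from the daemonset status in the monitoring namespace (a sketch; with one node cordoned or unavailable, READY will stay below DESIRED for daemonsets such as node-exporter):

$ oc get daemonsets -n openshift-monitoring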
Sending to the Storage team, since the node appears to be degraded because of:

message: 'Node ip-10-0-181-0.us-east-2.compute.internal is reporting: "failed to drain node
  (5 tries): timed out waiting for the condition: [error when evicting pod \"inline-volume-tester2-qgf7t\":
  pods \"inline-volume-tester2-qgf7t\" is forbidden: unable to create new content in namespace
  e2e-ephemeral-879 because it is being terminated, error when evicting pod \"inline-volume-tester2-b26lg\":
  pods \"inline-volume-tester2-b26lg\" is forbidden: unable to create new content in namespace
  e2e-ephemeral-6805 because it is being terminated]"'

This looks like another case of storage e2es not fully cleaning up and ending up blocking node drain.
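To confirm that the e2e namespaces are indeed wedged in deletion, something like the following works (a sketch; the namespace name comes from the eviction errors above, and the namespace status conditions report what is still blocking removal):

# Leftover e2e namespaces stuck in Terminating
$ oc get namespaces | grep Terminating
# What is still holding one of them open
$ oc get namespace e2e-ephemeral-879 -o jsonpath='{.status.conditions}'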
4.5 cluster has the same situation now:

$ oc get nodes
NAME                                         STATUS                     ROLES    AGE   VERSION
ip-10-0-135-250.us-east-2.compute.internal   NotReady                   worker   29d   v1.18.3+012b3ec
ip-10-0-136-106.us-east-2.compute.internal   Ready                      master   29d   v1.18.3+2cf11e2
ip-10-0-160-237.us-east-2.compute.internal   Ready                      master   29d   v1.18.3+2cf11e2
ip-10-0-163-71.us-east-2.compute.internal    Ready                      worker   29d   v1.18.3+08c38ef
ip-10-0-206-204.us-east-2.compute.internal   Ready                      worker   29d   v1.18.3+08c38ef
ip-10-0-208-27.us-east-2.compute.internal    Ready                      master   29d   v1.18.3+2cf11e2
ip-10-0-212-130.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   29d   v1.18.3+012b3ec

machineconfiguration.openshift.io/reason: failed to drain node (5 tries): timed out waiting for the condition: [error when evicting pod "inline-volume-tester2-htdh4": pods "inline-...
new 4.5 cluster hit this again:

Annotations:  csi.volume.kubernetes.io/nodeid: {"csi-hostpath-e2e-ephemeral-1022":"ip-10-0-155-235.us-east-2.compute.internal","csi-hostpath-e2e-ephemeral-2550":"ip-10-0-155-235.us-east...
              machine.openshift.io/machine: openshift-machine-api/ci-op-p9li5xkx-endura-qtpk9-worker-us-east-2a-bdqcp
              machineconfiguration.openshift.io/currentConfig: rendered-worker-dc0045d9ef2c8ff945f25cc920caa3ed
              machineconfiguration.openshift.io/desiredConfig: rendered-worker-8f7a35c8bcc3905d5dd473fa360b9ff0
              machineconfiguration.openshift.io/reason: failed to drain node (5 tries): timed out waiting for the condition: error when evicting pod "inline-volume-tester-g9qsq": pods "inline-vo...
              machineconfiguration.openshift.io/state: Degraded
              volumes.kubernetes.io/controller-managed-attach-detach: true
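The currentConfig/desiredConfig mismatch in those annotations is the quick tell that the node is wedged mid-update. A sketch of checking it directly (<node> is a placeholder for the affected worker):

$ oc get node <node> -o jsonpath="{.metadata.annotations['machineconfiguration\.openshift\.io/currentConfig']} {.metadata.annotations['machineconfiguration\.openshift\.io/desiredConfig']}"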
I am moving this to 4.6.0; this is fixed in 4.6.0 by https://github.com/openshift/origin/pull/24981. Moving this to MODIFIED. Let's test this in 4.6.0.
Given that this is a duplicate of bug 1814282, I'll mark this as VERIFIED. If we hit this issue again in our endurance tests, please feel free to reopen it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196