Bug 1864281

Summary: endurance cluster has an unavailable node
Product: OpenShift Container Platform Reporter: Ben Parees <bparees>
Component: Storage    Assignee: Hemant Kumar <hekumar>
Storage sub component: Kubernetes QA Contact: Qin Ping <piqin>
Status: CLOSED ERRATA Docs Contact:
Severity: low    
Priority: low CC: aos-bugs, hekumar, jokerman, jsafrane, kgarriso
Version: 4.4    Keywords: Reopened
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:22:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Ben Parees 2020-08-03 18:41:25 UTC
Description of problem:
The 4.4 endurance cluster has an unschedulable, degraded node following an upgrade.

$ oc get nodes
NAME                                         STATUS                     ROLES    AGE   VERSION
ip-10-0-128-197.us-east-2.compute.internal   Ready                      worker   40d   v1.17.1+3288478
ip-10-0-154-233.us-east-2.compute.internal   Ready                      master   40d   v1.17.1+3288478
ip-10-0-179-242.us-east-2.compute.internal   Ready                      master   40d   v1.17.1+3288478
ip-10-0-181-0.us-east-2.compute.internal     Ready,SchedulingDisabled   worker   40d   v1.17.1+b8568b3
ip-10-0-193-215.us-east-2.compute.internal   Ready                      master   40d   v1.17.1+3288478
ip-10-0-209-180.us-east-2.compute.internal   Ready                      worker   40d   v1.17.1+3288478


machineconfigpool reports a degraded node:
$ oc get machineconfigpools
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-dcbb36a0fdec5e9d178f2c6f4d3ef732   True      False      False      3              3                   3                     0                      40d
worker   rendered-worker-1054910af316c74e80557c4840e0ebb7   False     True       True       3              2                   2                     1                      40d


node appears to be degraded because of:
    message: 'Node ip-10-0-181-0.us-east-2.compute.internal is reporting: "failed
      to drain node (5 tries): timed out waiting for the condition: [error when evicting
      pod \"inline-volume-tester2-qgf7t\": pods \"inline-volume-tester2-qgf7t\" is
      forbidden: unable to create new content in namespace e2e-ephemeral-879 because
      it is being terminated, error when evicting pod \"inline-volume-tester2-b26lg\":
      pods \"inline-volume-tester2-b26lg\" is forbidden: unable to create new content
      in namespace e2e-ephemeral-6805 because it is being terminated]"'


per oc get machineconfigpools worker -o yaml
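
One way to inspect what is blocking the drain (a sketch; the namespace and pod names below are taken from the degraded message above, and the exact output will vary):

$ oc get namespace e2e-ephemeral-879 e2e-ephemeral-6805   # both stuck in Terminating
$ oc get pods -n e2e-ephemeral-879                        # the inline-volume-tester2 pod that cannot be evicted
$ oc get pods -n e2e-ephemeral-6805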



Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-07-18-033102 upgrading to
4.4.0-0.nightly-2020-08-03-094545

Comment 2 Kirsten Garrison 2020-08-07 20:59:04 UTC
This would be an expected state, as the MCO will only be degraded on a degraded master; the MCO will report ready if we have worker issues. We have

Comment 3 Kirsten Garrison 2020-08-07 22:35:31 UTC
Oops! hit save too early:

In Slack there was a concern that the MCO was NOT reporting degraded, but this is the expected state: we only report the MCO as degraded on the master pool. There are plans to report on both pools, but as of 4.4 this is correct. To recover the worker, the pod above would have to be resolved so the drain can continue successfully. In the meantime, the upgrade will not be blocked by that degraded worker via the MCO.
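
For reference, a possible way to unstick the drain (a sketch only, not something that was run on this cluster) would be to delete the stuck pods so the machine-config daemon can retry the drain on its next attempt:

$ oc delete pod inline-volume-tester2-qgf7t -n e2e-ephemeral-879 --grace-period=0 --force
$ oc delete pod inline-volume-tester2-b26lg -n e2e-ephemeral-6805 --grace-period=0 --force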

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-dcbb36a0fdec5e9d178f2c6f4d3ef732   True      False      False      3              3                   3                     0                      44d
worker   rendered-worker-1054910af316c74e80557c4840e0ebb7   False     True       True       3              2                   2                     1                      44d

$ oc get clusteroperators
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE

machine-config                             4.4.0-0.nightly-2020-07-18-033102   True        False         False      9d
...
monitoring                                 4.4.0-0.nightly-2020-07-18-033102   False       True          True       4d8h

As seen above, the upgrade itself is stalled on:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-07-18-033102   True        True          4d7h    Unable to apply 4.4.0-0.nightly-2020-08-03-094545: the cluster operator monitoring has not yet successfully rolled out

Comment 4 Ben Parees 2020-08-18 15:46:28 UTC
The MCO behavior may be understood, but the cluster is still broken.

The monitoring operator can't finish upgrading because it runs a daemonset on all nodes, and not all nodes are available.

So the node availability issue still needs to be addressed.
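
One way to see this (a sketch; the exact daemonset names depend on the monitoring stack version) is to compare desired vs. ready counts for the monitoring daemonsets and read the operator's conditions:

$ oc get daemonset -n openshift-monitoring          # e.g. node-exporter DESIRED vs READY/AVAILABLE
$ oc get clusteroperator monitoring -o yaml         # status.conditions explains what has not rolled out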

Comment 5 Ben Parees 2020-08-18 15:48:03 UTC
Sending to the Storage team since, per:

node appears to be degraded because of:
    message: 'Node ip-10-0-181-0.us-east-2.compute.internal is reporting: "failed
      to drain node (5 tries): timed out waiting for the condition: [error when evicting
      pod \"inline-volume-tester2-qgf7t\": pods \"inline-volume-tester2-qgf7t\" is
      forbidden: unable to create new content in namespace e2e-ephemeral-879 because
      it is being terminated, error when evicting pod \"inline-volume-tester2-b26lg\":
      pods \"inline-volume-tester2-b26lg\" is forbidden: unable to create new content
      in namespace e2e-ephemeral-6805 because it is being terminated]"'

this looks like another case of storage e2e tests not fully cleaning up and ending up blocking the node drain.
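
A rough way to spot this pattern (a sketch; names vary per run) is to look for leftover e2e namespaces stuck in Terminating and for test pods still sitting on the cordoned node:

$ oc get namespaces | grep e2e-ephemeral            # leftover test namespaces, typically Terminating
$ oc get pods --all-namespaces --field-selector spec.nodeName=ip-10-0-181-0.us-east-2.compute.internal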

Comment 6 Ben Parees 2020-08-20 12:57:57 UTC
The 4.5 cluster has the same situation now:

$ oc get nodes
NAME                                         STATUS                     ROLES    AGE   VERSION
ip-10-0-135-250.us-east-2.compute.internal   NotReady                   worker   29d   v1.18.3+012b3ec
ip-10-0-136-106.us-east-2.compute.internal   Ready                      master   29d   v1.18.3+2cf11e2
ip-10-0-160-237.us-east-2.compute.internal   Ready                      master   29d   v1.18.3+2cf11e2
ip-10-0-163-71.us-east-2.compute.internal    Ready                      worker   29d   v1.18.3+08c38ef
ip-10-0-206-204.us-east-2.compute.internal   Ready                      worker   29d   v1.18.3+08c38ef
ip-10-0-208-27.us-east-2.compute.internal    Ready                      master   29d   v1.18.3+2cf11e2
ip-10-0-212-130.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   29d   v1.18.3+012b3ec



                    machineconfiguration.openshift.io/reason:
                      failed to drain node (5 tries): timed out waiting for the condition: [error when evicting pod "inline-volume-tester2-htdh4": pods "inline-...

Comment 7 Ben Parees 2020-09-28 21:56:15 UTC
A new 4.5 cluster hit this again:
Annotations:        csi.volume.kubernetes.io/nodeid:
                      {"csi-hostpath-e2e-ephemeral-1022":"ip-10-0-155-235.us-east-2.compute.internal","csi-hostpath-e2e-ephemeral-2550":"ip-10-0-155-235.us-east...
                    machine.openshift.io/machine: openshift-machine-api/ci-op-p9li5xkx-endura-qtpk9-worker-us-east-2a-bdqcp
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-dc0045d9ef2c8ff945f25cc920caa3ed
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-8f7a35c8bcc3905d5dd473fa360b9ff0
                    machineconfiguration.openshift.io/reason:
                      failed to drain node (5 tries): timed out waiting for the condition: error when evicting pod "inline-volume-tester-g9qsq": pods "inline-vo...
                    machineconfiguration.openshift.io/state: Degraded
                    volumes.kubernetes.io/controller-managed-attach-detach: true

Comment 15 Hemant Kumar 2020-10-20 14:46:17 UTC
I am moving this to 4.6.0; this is fixed in 4.6.0 by https://github.com/openshift/origin/pull/24981. Moving this to MODIFIED. Let's test this in 4.6.0.

Comment 17 Qin Ping 2020-10-21 01:28:28 UTC
Since this is a duplicate of bug 1814282, I'll mark this as verified. If we hit this issue in our endurance test again, please feel free to reopen it.

Comment 20 errata-xmlrpc 2020-10-27 16:22:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196