Description of problem:
We see a lot of clusters via Telemetry reporting static pod operators degraded with reason: "NodeControllerDegraded: The master node(s) "name" not ready". This condition prevents rolling upgrades for those clusters, and further debugging is required to determine the real reason the masters are not ready.

In 4.4, we merged https://github.com/openshift/library-go/pull/660, which changes this message to include the last transition time, reason and message directly from the Node object. This detailed message is now reported instead of just "not ready".

Steps to Reproduce:
1. Make a master node report NotReady
2. Wait for the NodeControllerDegraded condition to appear for the operator
3. Check that the message now includes the detailed reason

Actual results:

Expected results:

Additional info:
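For reference, a rough sketch of how a message of this shape can be assembled from a Node's Ready condition. This is Python for illustration only, not the library-go code from the PR. The function names, the fallback text for a node with no Ready condition, and the comma separator for multiple nodes are assumptions; the message layout is copied from the verification output in this bug.

```python
from datetime import datetime


def ready_condition(node):
    """Return the node's Ready condition dict, or None if it has none."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond
    return None


def describe_not_ready(node):
    """Return a detail string for a node that is not Ready, else None."""
    name = node["metadata"]["name"]
    cond = ready_condition(node)
    if cond is None:
        # Assumption: wording for this case is hypothetical, not from the PR.
        return f'node "{name}" has no Ready condition'
    if cond.get("status") == "True":
        return None
    # Render the RFC 3339 UTC timestamp in the layout seen in the operator message.
    since = datetime.strptime(cond["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ")
    since_text = since.strftime("%Y-%m-%d %H:%M:%S +0000 UTC")
    return (
        f'node "{name}" not ready since {since_text} '
        f'because {cond.get("reason", "")} ({cond.get("message", "")})'
    )


def degraded_message(nodes):
    """Build the condition message for all not-ready master nodes, or None."""
    details = [d for d in (describe_not_ready(n) for n in nodes) if d]
    if not details:
        return None
    # Assumption: the separator used for several nodes is a guess.
    return "The master nodes not ready: " + ", ".join(details)
```

The input is expected to be a Node object parsed from JSON, for example the output of `oc get no <name> -o json`.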
Verified in 4.4.0-0.nightly-2020-02-16-221315

oc get no
...
ip-10-0-143-9.ap-northeast-1.compute.internal   Ready   master   9h   v1.17.1
...

# shut down a master so that it reports NotReady
oc debug no/ip-10-0-143-9.ap-northeast-1.compute.internal -- chroot /host shutdown -h now
Starting pod/ip-10-0-143-9ap-northeast-1computeinternal-debug ...
To use host binaries, run `chroot /host`
^C

oc get no
...
ip-10-0-143-9.ap-northeast-1.compute.internal   NotReady   master   9h   v1.17.1
...

# check that NodeControllerDegraded has the time, reason and message from the node YAML
oc get kubeapiserver cluster -o yaml
...
  conditions:
  - lastTransitionTime: "2020-02-17T12:41:25Z"
    message: 'The master nodes not ready: node "ip-10-0-143-9.ap-northeast-1.compute.internal" not ready since 2020-02-17 12:41:25 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)'
    reason: MasterNodesReady
    status: "True"
    type: NodeControllerDegraded

oc get co/kube-apiserver -o yaml
...
  conditions:
  - lastTransitionTime: "2020-02-17T12:43:43Z"
    message: 'NodeControllerDegraded: The master nodes not ready: node "ip-10-0-143-9.ap-northeast-1.compute.internal" not ready since 2020-02-17 12:41:25 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)'
    reason: NodeController_MasterNodesReady
    status: "True"
    type: Degraded

oc get no ip-10-0-143-9.ap-northeast-1.compute.internal -o yaml
...
  - lastHeartbeatTime: "2020-02-17T12:39:10Z"
    lastTransitionTime: "2020-02-17T12:41:25Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: Ready
...
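The manual comparison above could also be scripted. The following is a minimal sketch, assuming the ClusterOperator and Node objects have already been fetched as JSON (for example with `-o json` instead of `-o yaml`) and parsed into dicts. The helper names are hypothetical, and the check only looks for the node name, reason and message as substrings of the Degraded message.

```python
def find_condition(obj, cond_type):
    """Return the status condition of the given type from a parsed object, or None."""
    for cond in obj.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond
    return None


def degraded_message_has_node_details(cluster_operator, node):
    """Check that the operator's Degraded message carries the node's Ready details."""
    degraded = find_condition(cluster_operator, "Degraded")
    ready = find_condition(node, "Ready")
    if degraded is None or ready is None:
        return False
    if degraded.get("status") != "True":
        return False
    message = degraded.get("message", "")
    wanted = [
        node["metadata"]["name"],
        ready.get("reason", ""),
        ready.get("message", ""),
    ]
    # Every detail must be non-empty and present in the operator message.
    return all(item and item in message for item in wanted)
```

With the objects shown in this comment, the function returns True; with the pre-4.4 message that only says "not ready", it would return False because the reason and message are missing.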
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581