+++ This bug was initially created as a clone of Bug #1817419 +++

Description of problem:
The operator should auto-detect the problem and retry quickly once the cause of NodeInstallerDegraded disappears, instead of staying stuck in Degraded.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-03-25-223508

How reproducible:
So far hit once

Steps to Reproduce:
1. Install a fresh env with the latest 4.4 nightly; the env matrix is upi-on-gcp, disconnected-remove_rhcos_worker-fips-ovn.

2. After the installation succeeds, check `oc get co` and find:
kube-apiserver   4.4.0-0.nightly-2020-03-25-223508   True   True   True   144m

3. Check `oc get po -n openshift-kube-apiserver --show-labels`; all pods under openshift-kube-apiserver are:
NAME                                                                      READY   STATUS      RESTARTS   AGE   LABELS
installer-8-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal          0/1     Completed   0          68m   app=installer
kube-apiserver-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal       4/4     Running     0          68m   apiserver=true,app=openshift-kube-apiserver,revision=8
kube-apiserver-qe-yapei44debug-03260202-m-1.c.openshift-qe.internal       4/4     Running     4          86m   apiserver=true,app=openshift-kube-apiserver,revision=7
kube-apiserver-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal       4/4     Running     4          88m   apiserver=true,app=openshift-kube-apiserver,revision=7
revision-pruner-7-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal    0/1     Completed   0          82m   app=pruner
revision-pruner-7-qe-yapei44debug-03260202-m-1.c.openshift-qe.internal    0/1     Completed   0          73m   app=pruner
revision-pruner-7-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal    0/1     Completed   0          66m   app=pruner
revision-pruner-8-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal    0/1     Completed   0          66m   app=pruner
revision-pruner-8-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal    0/1     Completed   0          65m   app=pruner

Check `oc logs deploy/kube-apiserver-operator -n openshift-kube-apiserver-operator`; it shows:
... NodeInstallerDegraded: pods \"installer-8-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal\" not found\nNodeControllerDegraded: The master nodes not ready: node \"qe-yapei44debug-03260202-m-2.c.openshift-qe.internal\" not ready since 2020-03-26 07:12:00 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)"...

But `oc get no` shows no issue:
NAME                                                   STATUS   ROLES    AGE    VERSION
qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   Ready    master   114m   v1.17.1
qe-yapei44debug-03260202-m-1.c.openshift-qe.internal   Ready    master   114m   v1.17.1
qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   Ready    master   114m   v1.17.1
qe-yapei44debug-03260202-w-a-l-rhel-0                  Ready    worker   34m    v1.17.1
qe-yapei44debug-03260202-w-a-l-rhel-1                  Ready    worker   34m    v1.17.1

Networking QE also helped debug; the network has no issue either.
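(Side note: the same Degraded text can usually be read straight from the clusteroperator conditions rather than the operator log. A minimal sketch, assuming standard oc/kubectl JSONPath filter support; output formatting may differ:

$ oc get clusteroperator kube-apiserver \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'

This prints the aggregated NodeInstallerDegraded / NodeControllerDegraded message without grepping the operator deployment logs.)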
4. Force the operator to retry rolling out:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'

After a few minutes, co/kube-apiserver quickly becomes normal:
kube-apiserver   4.4.0-0.nightly-2020-03-25-223508   True   False   False   173m

The pods also become normal:
$ ogpkas
kube-apiserver-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   4/4   Running   0   6m28s   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-qe-yapei44debug-03260202-m-1.c.openshift-qe.internal   4/4   Running   0   8m23s   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   4/4   Running   0   10m     apiserver=true,app=openshift-kube-apiserver,revision=9

Actual results:
3. The operator stays stuck rolling out the static pods even though step 4 shows the cause is gone.

Expected results:
3. The operator should auto-detect the cause and automatically retry the rollout once the cause is gone.

Additional info:
Bug is filed from https://coreos.slack.com/archives/CH76YSYSC/p1585214604225000?thread_ts=1585210139.199800&cid=CH76YSYSC with discussion there.

--- Additional comment from Abu Kashem on 2020-04-07 14:40:18 UTC ---

Facts:
- This is happening in upi-on-gcp and infrequently.
- All kube-apiserver pods are running successfully.
- The operator is reporting a misleading status in the clusteroperator object. I am assuming this will not block upgrade. (correct?)
- There is a workaround to fix this issue; that's a plus.
- Troubleshooting may not be obvious; we have to check the operator log to find what the issue is. That's a minus.

Since this is reporting misleading information in the ClusterOperator object, ideally we would want to fix it in 4.4. But given the time constraint I think we can defer it to 4.5. This bug exists on 4.5 and 4.3, so we believe it also exists on 4.4. It is worth fixing.
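For anyone troubleshooting a similar report: one way to check whether the installer rollout has actually converged (for example after the forced redeployment above) is to compare the per-node revisions the operator tracks. A minimal sketch, assuming the kubeapiserver/cluster resource exposes status.nodeStatuses with nodeName/currentRevision/targetRevision fields; verify the field names against your cluster before relying on it:

$ oc get kubeapiserver/cluster \
    -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{"\t"}{.currentRevision}{"\t"}{.targetRevision}{"\n"}{end}'

When all masters report the same currentRevision, the rollout has converged; a master that keeps lagging behind while co/kube-apiserver stays Degraded matches the symptom described above.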
Closing as duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1817419. Please don't clone bugs in advance; the developer will assess necessary backports and create clones. Exception: CVEs.

*** This bug has been marked as a duplicate of bug 1817419 ***