+++ This bug was initially created as a clone of Bug #1817419 +++

Description of problem:
The operator should automatically detect when the cause of NodeInstallerDegraded disappears and retry promptly, instead of staying stuck in Degraded.

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2020-04-07-141343

How reproducible:
Seen in at least two clusters: one IPI on AWS, the other IPI on AWS (FIPS off) with OVN.

Steps to Reproduce:
1. Install a fresh env from the latest 4.3 nightly.
2. After the installation succeeds, check `oc get co`; kube-apiserver is Degraded:
kube-apiserver   4.3.0-0.nightly-2020-04-07-141343   True   True   True   150m
3. Check `oc get po -n openshift-kube-apiserver --show-labels`; all pods under openshift-kube-apiserver are Running, but they are split across revisions 6 and 7:
NAME                                               READY   STATUS    RESTARTS   AGE    LABELS
kube-apiserver-ip-...39-158.....compute.internal   3/3     Running   3          103m   apiserver=true,app=openshift-kube-apiserver,revision=7
kube-apiserver-ip-...47-36.....compute.internal    3/3     Running   3          111m   apiserver=true,app=openshift-kube-apiserver,revision=6
kube-apiserver-ip-...71-131.....compute.internal   3/3     Running   3          109m   apiserver=true,app=openshift-kube-apiserver,revision=6

Check `oc logs deploy/kube-apiserver-operator -n openshift-kube-apiserver-operator`; it shows:
...'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: pods \"installer-7-ip-...-139-158.....compute.internal\" not found" to "NodeControllerDegraded: The master nodes not ready...

4. Force the operator to retry the rollout:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
After a few minutes, co/kube-apiserver quickly becomes normal:
kube-apiserver   4.3.0-0.nightly-2020-04-07-141343   True   False   False   156m
The pods also become normal:
kube-apiserver-ip-...39-158.....compute.internal   3/3   Running   0   5m56s   apiserver=true,app=openshift-kube-apiserver,revision=8
kube-apiserver-ip-...47-36.....compute.internal    3/3   Running   0   4m8s    apiserver=true,app=openshift-kube-apiserver,revision=8
kube-apiserver-ip-...71-131.....compute.internal   3/3   Running   0   2m24s   apiserver=true,app=openshift-kube-apiserver,revision=8

Actual results:
1. The operator stays stuck rolling out the static pods even though step 4 shows the cause is gone.

Expected results:
1. The operator should automatically detect that the cause is gone and retry the rollout.

Additional info:
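For anyone hitting the same symptom, "stuck" vs. "recovered" can be judged by whether all kube-apiserver pods have converged to a single revision, as in the label output above. A minimal sketch: the `revisions_converged` helper below is hypothetical (not part of `oc`), and on a live cluster the revision list would be extracted from the pod labels rather than hard-coded as here.

```shell
# Hypothetical helper: succeed only when every pod reports the same
# revision (input: one revision number per line on stdin).
revisions_converged() {
  [ "$(sort -u | wc -l)" -eq 1 ]
}

# On a live cluster the revision list could come from something like:
#   oc get po -n openshift-kube-apiserver -l apiserver=true \
#     -o jsonpath='{range .items[*]}{.metadata.labels.revision}{"\n"}{end}'
# Here we feed in the values seen before and after the forced redeployment.
printf '7\n6\n6\n' | revisions_converged || echo "stuck: mixed revisions"
printf '8\n8\n8\n' | revisions_converged && echo "rollout converged"
```

Polling this in a loop after the `oc patch` above gives a cheap signal for when the forced redeployment has finished.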
Closing as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1817419. Please don't clone bugs in advance; the developer will assess necessary backports and create clones. Exception: CVEs.

*** This bug has been marked as a duplicate of bug 1817419 ***