If there is a new pending revision for the static pod installer controller, the backoff for fallbacks and retries should not apply; instead, we want to progress as quickly as possible (that is what we always did here, we just forgot to do it for the backoff).
*** Bug 1993802 has been marked as a duplicate of this bug. ***
PR https://github.com/openshift/cluster-kube-apiserver-operator/pull/1200 adds one condition, 'if operatorStatus.LatestAvailableRevision > currNodeState.TargetRevision': no backoff is applied if a new revision is pending; otherwise, fallback will occur.

- Case where operatorStatus.LatestAvailableRevision > currNodeState.TargetRevision does not hold. Verification steps on a single-node cluster:

1. Add foo: ["bar"] as an additional argument for kube-apiserver via unsupportedConfigOverrides.

$ oc edit kubeapiserver cluster
spec:
  .....
  unsupportedConfigOverrides:
    apiServerArguments:
      foo:
      - bar

Wait for a few minutes.

$ oc get co | grep -v '.True.*False.*False'
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.0-0.nightly-2021-08-17-122812   True        True          True       34m     StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-ci-ln-xpw43jt-f76d1-trjjv-master-0 was rolled back to revision 9 due to waiting for kube-apiserver static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd": dial tcp [::1]:6443: connect: connection refused

$ oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 9
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 8
      lastFailedReason: OperandFailedFallback
      lastFailedRevision: 9
      lastFailedRevisionErrors:
      - 'fallback to last-known-good revision 8 took place after: waiting for kube-apiserver static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd": dial tcp [::1]:6443: connect: connection refused (NetworkError)'
      lastFailedTime: "2021-08-18T02:28:14Z"
      lastFallbackCount: 1
      nodeName: ci-ln-xpw43jt-f76d1-trjjv-master-0
      targetRevision: 9
    readyReplicas: 0
kind: List
metadata:

2. Wait a while longer (about 30 mins), until lastFallbackCount goes up to 2.

$ oc get co | grep -v '.True.*False.*False'
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.0-0.nightly-2021-08-17-122812   True        True          True       67m     StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-ci-ln-xpw43jt-f76d1-trjjv-master-0 was rolled back to revision 9 due to waiting for kube-apiserver static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd": dial tcp [::1]:6443: connect: connection refused

$ oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 9
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 8
      lastFailedReason: OperandFailedFallback
      lastFailedRevision: 9
      lastFailedRevisionErrors:
      - 'fallback to last-known-good revision 8 took place after: waiting for kube-apiserver static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd": dial tcp [::1]:6443: connect: connection refused (NetworkError)'
      lastFailedTime: "2021-08-18T02:56:52Z"
      lastFallbackCount: 2
      nodeName: ci-ln-xpw43jt-f76d1-trjjv-master-0
      targetRevision: 9
    readyReplicas: 0
kind: List
metadata:

3. Then remove the additional argument from unsupportedConfigOverrides. This triggers a new revision.

$ oc edit kubeapiserver   # Remove the additional invalid parameters from unsupportedConfigOverrides
kubeapiserver.operator.openshift.io/cluster edited

$ oc get po -n openshift-kube-apiserver --show-labels -l apiserver
NAME                                                READY   STATUS    RESTARTS   AGE   LABELS
kube-apiserver-ci-ln-xpw43jt-f76d1-trjjv-master-0   5/5     Running   0          10m   apiserver=true,app=openshift-kube-apiserver,revision=10

$ oc get co | grep -v '.True.*False.*False'
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE

- Common case: operatorStatus.LatestAvailableRevision > currNodeState.TargetRevision

Forced several rollouts of kube-apiserver:

for i in {1..5}; do echo "rollout $i"; oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'; sleep 300;done &

$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-656bd8d466-k7nq4 | grep 'because static pod is pending'
I0818 03:31:21.628375 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 11, but has not made progress because static pod is pending
I0818 03:31:24.341428 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 11, but has not made progress because static pod is pending
I0818 03:31:28.037967 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 11, but has not made progress because static pod is pending
I0818 03:31:30.203295 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 11, but has not made progress because static pod is pending
I0818 03:36:26.609207 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 12, but has not made progress because static pod is pending
I0818 03:36:28.379478 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 12, but has not made progress because static pod is pending
I0818 03:36:31.520424 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 12, but has not made progress because static pod is pending
I0818 03:36:34.734073 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 12, but has not made progress because static pod is pending
I0818 03:41:07.375868 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 13, but has not made progress because static pod is pending
I0818 03:41:08.257594 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 13, but has not made progress because static pod is pending
I0818 03:46:12.782345 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 14, but has not made progress because static pod is pending
I0818 03:46:13.237851 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 14, but has not made progress because static pod is pending
I0818 03:51:52.757578 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 15, but has not made progress because static pod is pending
I0818 03:51:56.088486 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 15, but has not made progress because static pod is pending
I0818 03:51:59.858662 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 15, but has not made progress because static pod is pending

Revisions 11 to 15 were caused by the forced rollouts; there was no backoff, since each time only a new revision was pending.

From the above results, the PR fix works as expected, so moving the bug to VERIFIED.
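
The log check in the verification above can also be summarized programmatically. A minimal sketch, assuming the operator log keeps the exact message wording shown in this comment; the helper name pendingRevisions and the regular expression are made up for this example and are not part of the operator.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// pendingRe matches the "static pod is pending" message quoted in the
// verification log and captures the revision the node is moving to.
var pendingRe = regexp.MustCompile(`is in transition to (\d+), but has not made progress because static pod is pending`)

// pendingRevisions returns the distinct target revisions seen in the given
// log lines, in order of first appearance. Non-matching lines are ignored.
func pendingRevisions(lines []string) []int {
	seen := map[int]bool{}
	var revs []int
	for _, line := range lines {
		m := pendingRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		rev, err := strconv.Atoi(m[1])
		if err != nil {
			continue
		}
		if !seen[rev] {
			seen[rev] = true
			revs = append(revs, rev)
		}
	}
	return revs
}

func main() {
	// Two lines copied from the operator log in the verification above.
	lines := []string{
		`I0818 03:31:21.628375 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 11, but has not made progress because static pod is pending`,
		`I0818 03:36:26.609207 1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 12, but has not made progress because static pod is pending`,
	}
	fmt.Println(pendingRevisions(lines))
}
```

Seeing one distinct revision per forced rollout, roughly 5 minutes apart (matching the sleep 300 in the loop), is what indicates that no backoff delayed the new revisions.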
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759