Bug 1989633

Summary: staticpod/installer: backoff should not apply if latestAvailableRevision > targetRevision
Product: OpenShift Container Platform
Reporter: Stefan Schimanski <sttts>
Component: kube-apiserver
Assignee: Stefan Schimanski <sttts>
Status: CLOSED ERRATA
QA Contact: Ke Wang <kewang>
Severity: high
Docs Contact:
Priority: high
Version: 4.9
CC: aos-bugs, mfojtik, xxia
Target Milestone: ---
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-18 17:44:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1993800

Description Stefan Schimanski 2021-08-03 16:01:17 UTC
If a new revision is pending for the static pod installer controller, the backoff for fallbacks and retries should not apply; we want to progress as quickly as possible. (That is what we always did here; we just forgot it for the backoff.)

Comment 2 Stefan Schimanski 2021-08-17 08:51:06 UTC
*** Bug 1993802 has been marked as a duplicate of this bug. ***

Comment 3 Ke Wang 2021-08-18 04:28:43 UTC
PR https://github.com/openshift/cluster-kube-apiserver-operator/pull/1200 adds one condition, 'if operatorStatus.LatestAvailableRevision > currNodeState.TargetRevision'. If a new revision is pending, no backoff is applied; otherwise, fallback will occur.

- Case where operatorStatus.LatestAvailableRevision > currNodeState.TargetRevision does not hold. Verification steps on a single-node cluster:
1. Add foo: ["bar"] as an additional argument for kube-apiserver via unsupportedConfigOverrides
$  oc edit kubeapiserver cluster
spec:
  .....
  unsupportedConfigOverrides:
    apiServerArguments:
      foo:
      - bar

Wait for a few minutes,

$ oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver                             4.9.0-0.nightly-2021-08-17-122812   True        True          True       34m     StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-ci-ln-xpw43jt-f76d1-trjjv-master-0 was rolled back to revision 9 due to waiting for kube-apiserver static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd": dial tcp [::1]:6443: connect: connection refused

$ oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 9
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 8
      lastFailedReason: OperandFailedFallback
      lastFailedRevision: 9
      lastFailedRevisionErrors:
      - 'fallback to last-known-good revision 8 took place after: waiting for kube-apiserver
        static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd":
        dial tcp [::1]:6443: connect: connection refused (NetworkError)'
      lastFailedTime: "2021-08-18T02:28:14Z"
      lastFallbackCount: 1
      nodeName: ci-ln-xpw43jt-f76d1-trjjv-master-0
      targetRevision: 9
    readyReplicas: 0
kind: List
metadata:


Wait some more minutes (about 30) until lastFallbackCount goes up to 2:

$ oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver                             4.9.0-0.nightly-2021-08-17-122812   True        True          True       67m     StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-ci-ln-xpw43jt-f76d1-trjjv-master-0 was rolled back to revision 9 due to waiting for kube-apiserver static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd": dial tcp [::1]:6443: connect: connection refused

$ oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 9
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 8
      lastFailedReason: OperandFailedFallback
      lastFailedRevision: 9
      lastFailedRevisionErrors:
      - 'fallback to last-known-good revision 8 took place after: waiting for kube-apiserver
        static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd":
        dial tcp [::1]:6443: connect: connection refused (NetworkError)'
      lastFailedTime: "2021-08-18T02:56:52Z"
      lastFallbackCount: 2
      nodeName: ci-ln-xpw43jt-f76d1-trjjv-master-0
      targetRevision: 9
    readyReplicas: 0
kind: List
metadata:

3. Then remove the additional argument from unsupportedConfigOverrides. This triggers a new revision.
$ oc edit kubeapiserver  # Remove the additional invalid parameter from unsupportedConfigOverrides
kubeapiserver.operator.openshift.io/cluster edited

$ oc get po -n openshift-kube-apiserver --show-labels -l apiserver
NAME                                                READY   STATUS    RESTARTS   AGE   LABELS
kube-apiserver-ci-ln-xpw43jt-f76d1-trjjv-master-0   5/5     Running   0          10m   apiserver=true,app=openshift-kube-apiserver,revision=10

$ oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE


- Common case: operatorStatus.LatestAvailableRevision > currNodeState.TargetRevision
Forced several rollouts of kube-apiserver:
for i in {1..5}; do echo "rollout $i"; oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'; sleep 300;done &

$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-656bd8d466-k7nq4  | grep 'because static pod is pending'

I0818 03:31:21.628375       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 11, but has not made progress because static pod is pending
I0818 03:31:24.341428       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 11, but has not made progress because static pod is pending
I0818 03:31:28.037967       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 11, but has not made progress because static pod is pending
I0818 03:31:30.203295       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 11, but has not made progress because static pod is pending
I0818 03:36:26.609207       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 12, but has not made progress because static pod is pending
I0818 03:36:28.379478       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 12, but has not made progress because static pod is pending
I0818 03:36:31.520424       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 12, but has not made progress because static pod is pending
I0818 03:36:34.734073       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 12, but has not made progress because static pod is pending
I0818 03:41:07.375868       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 13, but has not made progress because static pod is pending
I0818 03:41:08.257594       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 13, but has not made progress because static pod is pending
I0818 03:46:12.782345       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 14, but has not made progress because static pod is pending
I0818 03:46:13.237851       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 14, but has not made progress because static pod is pending
I0818 03:51:52.757578       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 15, but has not made progress because static pod is pending
I0818 03:51:56.088486       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 15, but has not made progress because static pod is pending
I0818 03:51:59.858662       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 15, but has not made progress because static pod is pending

Revisions 11 to 15 were caused by the forced rollouts; no backoff is applied when a new revision is pending. From the above results, the PR fix works as expected, so I am moving the bug to VERIFIED.

Comment 6 errata-xmlrpc 2021-10-18 17:44:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759