Bug 1993800

Summary: 4.8: Static pod installer backoff broken WAS: [arbiter] OCP Console fails and goes into an endless loop during authentication after a set of temporary network disruptions which separate cluster zones
Product: OpenShift Container Platform Reporter: Stefan Schimanski <sttts>
Component: kube-apiserver Assignee: Stefan Schimanski <sttts>
Status: CLOSED ERRATA QA Contact: Ke Wang <kewang>
Severity: high Docs Contact:
Priority: high    
Version: 4.8 CC: aos-bugs, jokerman, kewang, mbukatov, mfojtik, surbania, xxia
Target Milestone: ---   
Target Release: 4.8.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1987005
: 1993802 (view as bug list) Environment:
Last Closed: 2021-09-21 08:01:31 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1989633    
Bug Blocks:    

Comment 1 Ke Wang 2021-09-01 09:08:12 UTC
This bug's PR is dev-approved but not yet merged, so I am following DPTP-660 to do pre-merge verification (the QE pre-merge verification goal of OCPQE-815), using the bot to launch a cluster with the open PR. Here are the verification steps:

To verify this PR we need a single-node cluster. In 4.8 there is no fallback yet, and the backoff applies only to failing installers, not to failing operands.

1. set installer error probability to 1.0
 
$ oc edit kubeapiserver cluster
$ oc get kubeapiserver cluster -oyaml | grep -A2 unsupportedConfigOverrides
  unsupportedConfigOverrides:
    installerErrorInjection:
      failPropability: 1.0
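
For a scripted run, step 1 could be done without the interactive editor. The following is only a sketch: the helper name build_error_injection_patch is made up for illustration, and it assumes that a JSON merge patch on spec.unsupportedConfigOverrides is accepted the same way as the manual edit shown above. The key is spelled failPropability, exactly as in the output above.

```shell
# Build a JSON merge patch that sets the installer error-injection probability.
# $1 is the probability, for example 1.0.
# The function name is hypothetical; only the field names come from the output above.
build_error_injection_patch() {
  printf '{"spec":{"unsupportedConfigOverrides":{"installerErrorInjection":{"failPropability":%s}}}}' "$1"
}

# Assumed usage against a live cluster (not executed here):
#   oc patch kubeapiserver/cluster --type=merge -p "$(build_error_injection_patch 1.0)"
build_error_injection_patch 1.0
echo
```

The oc call is left as a comment so the snippet only prints the patch body; check the result with the oc get command from step 1 before triggering a revision.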

2. trigger a revision and wait until the retry backoff goes up to maximum (10 min, after roughly 10 retries)
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'       

Wed 01 Sep 2021 12:23:28 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.8.0-0.ci.test-2021-09-01-032024-ci-ln-w76hfgk-latest   True        True          False      36m

Wed 01 Sep 2021 12:23:29 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                           READY   STATUS      RESTARTS   AGE   LABELS
...
installer-7-ip-xx-x-xxx-xxx.us-west-1.compute.internal         0/1     Completed   0          33m   app=installer
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal      5/5     Running     0          31m   apiserver=true,app=openshift-kube-apiserver,revision=7
revision-pruner-4-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Completed   0          36m   app=pruner
revision-pruner-7-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Completed   0          29m   app=pruner

Wed 01 Sep 2021 12:23:30 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 8
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      nodeName: ip-xx-x-xxx-xxx.us-west-1.compute.internal
      targetRevision: 8
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

...

Wed 01 Sep 2021 12:23:36 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.8.0-0.ci.test-2021-09-01-032024-ci-ln-w76hfgk-latest   True        True          False      36m

Wed 01 Sep 2021 12:23:38 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                           READY   STATUS      RESTARTS   AGE   LABELS
...
installer-8-ip-xx-x-xxx-xxx.us-west-1.compute.internal         0/1     Error       0          8s    app=installer
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal      5/5     Running     0          31m   apiserver=true,app=openshift-kube-apiserver,revision=7
...

Wed 01 Sep 2021 12:23:39 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 8
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      lastFailedCount: 1
      lastFailedRevision: 8
      lastFailedRevisionErrors:
      - no detailed termination message, see `oc get -oyaml -n "openshift-kube-apiserver"
        pods "installer-8-ip-xx-x-xxx-xxx.us-west-1.compute.internal"`
      lastFailedTime: "2021-09-01T04:23:35Z"
      nodeName: ip-xx-x-xxx-xxx.us-west-1.compute.internal
      targetRevision: 8
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

...

Wed 01 Sep 2021 01:04:22 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.8.0-0.ci.test-2021-09-01-032024-ci-ln-w76hfgk-latest   True        True          True       77m

Wed 01 Sep 2021 01:04:23 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                             READY   STATUS      RESTARTS   AGE   LABELS
...
installer-8-retry-1-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          40m   app=installer
installer-8-retry-2-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          39m   app=installer
installer-8-retry-3-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          38m   app=installer
installer-8-retry-4-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          37m   app=installer
installer-8-retry-5-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          35m   app=installer
installer-8-retry-6-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          32m   app=installer
installer-8-retry-7-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          20m   app=installer
installer-8-retry-8-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          14m   app=installer
installer-8-retry-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          8s    app=installer
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal        5/5     Running     0          72m   apiserver=true,app=openshift-kube-apiserver,revision=7
...

Wed 01 Sep 2021 01:04:24 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 8
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      lastFailedCount: 10
      lastFailedRevision: 8
      lastFailedRevisionErrors:
      - no detailed termination message, see `oc get -oyaml -n "openshift-kube-apiserver"
        pods "installer-8-retry-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal"`
      lastFailedTime: "2021-09-01T05:04:22Z"
      nodeName: ip-xx-x-xxx-xxx.us-west-1.compute.internal
      targetRevision: 8
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
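
To judge whether the retry interval really grows, the AGE column of the installer-8-retry-* pods above can be turned into gaps between consecutive retries. This is a rough sketch with hypothetical helper names (age_to_seconds, retry_gaps); it assumes AGE values only use the d, h, m and s units, and since oc rounds ages, the gaps are approximate.

```shell
# Convert an AGE string such as "40m", "8s" or "6m17s" to seconds.
age_to_seconds() {
  age=$1
  total=0
  while [ -n "$age" ]; do
    num=${age%%[a-z]*}         # leading digits
    rest=${age#"$num"}         # unit letter plus whatever follows
    unit=${rest%"${rest#?}"}   # first character of rest
    age=${rest#?}              # remainder for the next round
    case $unit in
      s) total=$((total + num)) ;;
      m) total=$((total + num * 60)) ;;
      h) total=$((total + num * 3600)) ;;
      d) total=$((total + num * 86400)) ;;
      *) return 1 ;;           # missing or unknown unit
    esac
  done
  echo "$total"
}

# Print the gap in seconds between each pair of consecutive ages,
# given oldest first (as the pods are listed above).
retry_gaps() {
  prev=
  for a in "$@"; do
    cur=$(age_to_seconds "$a") || return 1
    if [ -n "$prev" ]; then
      echo $((prev - cur))
    fi
    prev=$cur
  done
}

# Ages of installer-8-retry-1 .. retry-9 from the listing above.
retry_gaps 40m 39m 38m 37m 35m 32m 20m 14m 8s
```

Each printed number is the time between one retry pod and the next, so growing numbers indicate that the backoff is taking effect.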

3. after the first 10 retries, remove the failPropability: 1.0 setting and trigger a new revision at the same time

Wed Sep 01 13:04:32 [kewang@kewang-fedora]$ oc edit kubeapiserver cluster
kubeapiserver.operator.openshift.io/cluster edited

Wed Sep 01 13:04:46 [kewang@kewang-fedora]$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'
kubeapiserver.operator.openshift.io/cluster patched

...

4. watch that the installer for the new revision is created within seconds, not only after the 10 min maximum backoff.

Wed 01 Sep 2021 01:05:19 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.8.0-0.ci.test-2021-09-01-032024-ci-ln-w76hfgk-latest   True        True          True       78m

----> New installer 9 is created.

Wed 01 Sep 2021 01:05:20 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                             READY   STATUS              RESTARTS   AGE   LABELS
...
installer-8-ip-xx-x-xxx-xxx.us-west-1.compute.internal           0/1     Error               0          41m   app=installer
...
installer-8-retry-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error               0          65s   app=installer
installer-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal           0/1     ContainerCreating   0          2s    app=installer
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal        5/5     Running             0          73m   apiserver=true,app=openshift-kube-apiserver,revision=7
...

Wed 01 Sep 2021 01:05:21 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 9
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      lastFailedRevision: 8
      nodeName: ip-xx-x-xxx-xxx.us-west-1.compute.internal
      targetRevision: 9
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Wed 01 Sep 2021 01:05:27 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.8.0-0.ci.test-2021-09-01-032024-ci-ln-w76hfgk-latest   True        True          True       78m

Wed 01 Sep 2021 01:05:28 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                             READY   STATUS      RESTARTS   AGE   LABELS
...
installer-8-ip-xx-x-xxx-xxx.us-west-1.compute.internal           0/1     Error       0          41m   app=installer
...
installer-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal           1/1     Running     0          10s   app=installer
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal        5/5     Running     0          73m   apiserver=true,app=openshift-kube-apiserver,revision=7
...

Wed 01 Sep 2021 01:05:29 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 9
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      lastFailedRevision: 8
      nodeName: ip-xx-x-xxx-xxx.us-west-1.compute.internal
      targetRevision: 9
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Wed 01 Sep 2021 01:05:35 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.8.0-0.ci.test-2021-09-01-032024-ci-ln-w76hfgk-latest   True        True          True       78m

Wed 01 Sep 2021 01:05:36 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                             READY   STATUS      RESTARTS   AGE   LABELS
...
installer-8-retry-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          81s   app=installer
installer-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal           0/1     Completed   0          18s   app=installer
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal        5/5     Running     0          73m   apiserver=true,app=openshift-kube-apiserver,revision=7
...


Wed 01 Sep 2021 01:09:05 PM CST
oc get co | grep -v '.True.*False.*False'

Wed 01 Sep 2021 01:09:06 PM CST
oc get pod -n openshift-kube-apiserver --show-labels

Wed 01 Sep 2021 01:09:06 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'

...

Wed 01 Sep 2021 01:10:04 PM CST
oc get co | grep -v '.True.*False.*False'

Wed 01 Sep 2021 01:10:05 PM CST
oc get pod -n openshift-kube-apiserver --show-labels

Wed 01 Sep 2021 01:10:05 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 9
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      lastFailedRevision: 8
      nodeName: ip-xx-x-xxx-xxx.us-west-1.compute.internal
      targetRevision: 9
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Wed 01 Sep 2021 01:10:31 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.8.0-0.ci.test-2021-09-01-032024-ci-ln-w76hfgk-latest   True        True          True       83m

Wed 01 Sep 2021 01:10:32 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                             READY   STATUS      RESTARTS   AGE     LABELS
...
installer-8-retry-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          6m17s   app=installer
installer-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal           0/1     Completed   0          5m14s   app=installer
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal        5/5     Running     0          21s     apiserver=true,app=openshift-kube-apiserver,revision=9
revision-pruner-4-ip-xx-x-xxx-xxx.us-west-1.compute.internal     0/1     Completed   0          83m     app=pruner
revision-pruner-7-ip-xx-x-xxx-xxx.us-west-1.compute.internal     0/1     Completed   0          76m     app=pruner
revision-pruner-8-ip-xx-x-xxx-xxx.us-west-1.compute.internal     0/1     Completed   0          46m     app=pruner

Wed 01 Sep 2021 01:10:33 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 9
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      lastFailedRevision: 8
      nodeName: ip-xx-x-xxx-xxx.us-west-1.compute.internal
      targetRevision: 9
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

...

Wed 01 Sep 2021 01:11:36 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE

Wed 01 Sep 2021 01:11:37 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                             READY   STATUS      RESTARTS   AGE     LABELS
...
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal        5/5     Running     0          86s     apiserver=true,app=openshift-kube-apiserver,revision=9
...

<---- the new installer of the new revision was created
The above results are as expected, so the bug is pre-merge verified. After the PR is merged, the bot will move the bug to VERIFIED automatically.
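
The check in step 4 was done by reading the pod listing by eye. A scripted variant might look like the sketch below. The function names has_installer and wait_for_installer are hypothetical, the 60 second default timeout is an arbitrary choice, and the pod naming pattern (installer-<revision>-<node> for the first attempt, installer-<revision>-retry-<n>-<node> for retries) is taken from the listings above.

```shell
# Succeed if the pod listing on stdin contains a first-attempt installer pod
# (not a retry) for revision $1.
has_installer() {
  awk -v rev="$1" '
    $1 ~ ("^installer-" rev "-") && $1 !~ ("^installer-" rev "-retry-") { found = 1 }
    END { exit !found }
  '
}

# Hypothetical helper: poll until the installer for revision $1 shows up,
# giving up after $2 seconds (default 60). Needs a logged-in oc session.
wait_for_installer() {
  deadline=$(( $(date +%s) + ${2:-60} ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if oc get pod -n openshift-kube-apiserver | has_installer "$1"; then
      return 0
    fi
    sleep 2
  done
  return 1
}
```

Calling wait_for_installer 9 right after the oc patch in step 3 would return success only if installer-9 appears before the timeout, which is the behaviour this verification expects.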

Comment 4 Ke Wang 2021-09-14 01:26:26 UTC
I can't wait for the bot to move the bug to VERIFIED, because the person in charge of the errata is pressing for it.

Comment 6 errata-xmlrpc 2021-09-21 08:01:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.12 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3511