Bug 1993800 - 4.8: Static pod installer backoff broken WAS: [arbiter] OCP Console fails and goes into an endless loop during authentication after a set of temporary network disruptions which separates cluster zones
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.z
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On: 1989633
Blocks:
 
Reported: 2021-08-16 07:40 UTC by Stefan Schimanski
Modified: 2021-09-21 08:01 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1987005
Clones: 1993802
Environment:
Last Closed: 2021-09-21 08:01:31 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 648 0 None None None 2021-09-01 09:08:46 UTC
Github openshift cluster-kube-apiserver-operator pull 1214 0 None None None 2021-09-01 09:08:47 UTC
Github openshift cluster-kube-controller-manager-operator pull 560 0 None None None 2021-09-01 09:08:48 UTC
Github openshift cluster-kube-scheduler-operator pull 366 0 None None None 2021-09-01 09:08:48 UTC
Github openshift library-go pull 1180 0 None None None 2021-09-01 09:08:49 UTC
Red Hat Product Errata RHBA-2021:3511 0 None None None 2021-09-21 08:01:45 UTC

Internal Links: 1987005

Comment 1 Ke Wang 2021-09-01 09:08:12 UTC
This bug's PR is dev-approved but not yet merged, so per issue DPTP-660 I'm doing the pre-merge verification (the QE pre-merge verification goal of issue OCPQE-815), using the bot to launch a cluster with the open PR. Here are the verification steps:

To verify this PR we need a single-node cluster; in 4.8 there is no fallback yet, and the backoff applies only to failing installers, not to failing operands.

1. Set the installer error injection probability to 1.0 (note: failPropability is the field's actual spelling).
 
$ oc edit kubeapiserver cluster
$ oc get kubeapiserver cluster -oyaml | grep -A2 unsupportedConfigOverrides
  unsupportedConfigOverrides:
    installerErrorInjection:
      failPropability: 1.0

2. trigger a revision and wait until the retry backoff goes up to maximum (10 min, after roughly 10 retries)
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'       

Wed 01 Sep 2021 12:23:28 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.8.0-0.ci.test-2021-09-01-032024-ci-ln-w76hfgk-latest   True        True          False      36m

Wed 01 Sep 2021 12:23:29 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                           READY   STATUS      RESTARTS   AGE   LABELS
...
installer-7-ip-xx-x-xxx-xxx.us-west-1.compute.internal         0/1     Completed   0          33m   app=installer
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal      5/5     Running     0          31m   apiserver=true,app=openshift-kube-apiserver,revision=7
revision-pruner-4-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Completed   0          36m   app=pruner
revision-pruner-7-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Completed   0          29m   app=pruner

Wed 01 Sep 2021 12:23:30 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 8
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      nodeName: ip-xx-x-xxx-xxx.us-west-1.compute.internal
      targetRevision: 8
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

...

Wed 01 Sep 2021 12:23:36 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.8.0-0.ci.test-2021-09-01-032024-ci-ln-w76hfgk-latest   True        True          False      36m

Wed 01 Sep 2021 12:23:38 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                           READY   STATUS      RESTARTS   AGE   LABELS
...
installer-8-ip-xx-x-xxx-xxx.us-west-1.compute.internal         0/1     Error       0          8s    app=installer
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal      5/5     Running     0          31m   apiserver=true,app=openshift-kube-apiserver,revision=7
...

Wed 01 Sep 2021 12:23:39 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 8
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      lastFailedCount: 1
      lastFailedRevision: 8
      lastFailedRevisionErrors:
      - no detailed termination message, see `oc get -oyaml -n "openshift-kube-apiserver"
        pods "installer-8-ip-xx-x-xxx-xxx.us-west-1.compute.internal"`
      lastFailedTime: "2021-09-01T04:23:35Z"
      nodeName: ip-xx-x-xxx-xxx.us-west-1.compute.internal
      targetRevision: 8
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

...

Wed 01 Sep 2021 01:04:22 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.8.0-0.ci.test-2021-09-01-032024-ci-ln-w76hfgk-latest   True        True          True       77m

Wed 01 Sep 2021 01:04:23 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                             READY   STATUS      RESTARTS   AGE   LABELS
...
installer-8-retry-1-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          40m   app=installer
installer-8-retry-2-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          39m   app=installer
installer-8-retry-3-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          38m   app=installer
installer-8-retry-4-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          37m   app=installer
installer-8-retry-5-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          35m   app=installer
installer-8-retry-6-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          32m   app=installer
installer-8-retry-7-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          20m   app=installer
installer-8-retry-8-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          14m   app=installer
installer-8-retry-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          8s    app=installer
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal        5/5     Running     0          72m   apiserver=true,app=openshift-kube-apiserver,revision=7
...

Wed 01 Sep 2021 01:04:24 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 8
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      lastFailedCount: 10
      lastFailedRevision: 8
      lastFailedRevisionErrors:
      - no detailed termination message, see `oc get -oyaml -n "openshift-kube-apiserver"
        pods "installer-8-retry-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal"`
      lastFailedTime: "2021-09-01T05:04:22Z"
      nodeName: ip-xx-x-xxx-xxx.us-west-1.compute.internal
      targetRevision: 8
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

3. After the first 10 retries, remove the failPropability: 1.0 override and trigger a new revision at the same time.

Wed Sep 01 13:04:32 [kewang@kewang-fedora]$ oc edit kubeapiserver cluster
kubeapiserver.operator.openshift.io/cluster edited

Wed Sep 01 13:04:46 [kewang@kewang-fedora]$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'
kubeapiserver.operator.openshift.io/cluster patched

...

4. Watch that the installer for the new revision is created in far less than 10 minutes, i.e. within seconds rather than after waiting out the backoff.

Wed 01 Sep 2021 01:05:19 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.8.0-0.ci.test-2021-09-01-032024-ci-ln-w76hfgk-latest   True        True          True       78m

----> New installer 9 is created.

Wed 01 Sep 2021 01:05:20 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                             READY   STATUS              RESTARTS   AGE   LABELS
...
installer-8-ip-xx-x-xxx-xxx.us-west-1.compute.internal           0/1     Error               0          41m   app=installer
...
installer-8-retry-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error               0          65s   app=installer
installer-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal           0/1     ContainerCreating   0          2s    app=installer
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal        5/5     Running             0          73m   apiserver=true,app=openshift-kube-apiserver,revision=7
...

Wed 01 Sep 2021 01:05:21 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 9
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      lastFailedRevision: 8
      nodeName: ip-xx-x-xxx-xxx.us-west-1.compute.internal
      targetRevision: 9
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Wed 01 Sep 2021 01:05:27 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.8.0-0.ci.test-2021-09-01-032024-ci-ln-w76hfgk-latest   True        True          True       78m

Wed 01 Sep 2021 01:05:28 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                             READY   STATUS      RESTARTS   AGE   LABELS
...
installer-8-ip-xx-x-xxx-xxx.us-west-1.compute.internal           0/1     Error       0          41m   app=installer
...
installer-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal           1/1     Running     0          10s   app=installer
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal        5/5     Running     0          73m   apiserver=true,app=openshift-kube-apiserver,revision=7
...

Wed 01 Sep 2021 01:05:29 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 9
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      lastFailedRevision: 8
      nodeName: ip-xx-x-xxx-xxx.us-west-1.compute.internal
      targetRevision: 9
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Wed 01 Sep 2021 01:05:35 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.8.0-0.ci.test-2021-09-01-032024-ci-ln-w76hfgk-latest   True        True          True       78m

Wed 01 Sep 2021 01:05:36 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                             READY   STATUS      RESTARTS   AGE   LABELS
...
installer-8-retry-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          81s   app=installer
installer-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal           0/1     Completed   0          18s   app=installer
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal        5/5     Running     0          73m   apiserver=true,app=openshift-kube-apiserver,revision=7
...


Wed 01 Sep 2021 01:09:05 PM CST
oc get co | grep -v '.True.*False.*False'

Wed 01 Sep 2021 01:09:06 PM CST
oc get pod -n openshift-kube-apiserver --show-labels

Wed 01 Sep 2021 01:09:06 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'

...

Wed 01 Sep 2021 01:10:04 PM CST
oc get co | grep -v '.True.*False.*False'

Wed 01 Sep 2021 01:10:05 PM CST
oc get pod -n openshift-kube-apiserver --show-labels

Wed 01 Sep 2021 01:10:05 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 9
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      lastFailedRevision: 8
      nodeName: ip-xx-x-xxx-xxx.us-west-1.compute.internal
      targetRevision: 9
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Wed 01 Sep 2021 01:10:31 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.8.0-0.ci.test-2021-09-01-032024-ci-ln-w76hfgk-latest   True        True          True       83m

Wed 01 Sep 2021 01:10:32 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                             READY   STATUS      RESTARTS   AGE     LABELS
...
installer-8-retry-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal   0/1     Error       0          6m17s   app=installer
installer-9-ip-xx-x-xxx-xxx.us-west-1.compute.internal           0/1     Completed   0          5m14s   app=installer
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal        5/5     Running     0          21s     apiserver=true,app=openshift-kube-apiserver,revision=9
revision-pruner-4-ip-xx-x-xxx-xxx.us-west-1.compute.internal     0/1     Completed   0          83m     app=pruner
revision-pruner-7-ip-xx-x-xxx-xxx.us-west-1.compute.internal     0/1     Completed   0          76m     app=pruner
revision-pruner-8-ip-xx-x-xxx-xxx.us-west-1.compute.internal     0/1     Completed   0          46m     app=pruner

Wed 01 Sep 2021 01:10:33 PM CST
oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 9
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      lastFailedRevision: 8
      nodeName: ip-xx-x-xxx-xxx.us-west-1.compute.internal
      targetRevision: 9
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

...

Wed 01 Sep 2021 01:11:36 PM CST
oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                                                  AVAILABLE   PROGRESSING   DEGRADED   SINCE

Wed 01 Sep 2021 01:11:37 PM CST
oc get pod -n openshift-kube-apiserver --show-labels
NAME                                                             READY   STATUS      RESTARTS   AGE     LABELS
...
kube-apiserver-ip-xx-x-xxx-xxx.us-west-1.compute.internal        5/5     Running     0          86s     apiserver=true,app=openshift-kube-apiserver,revision=9
...

<---- the installer for the new revision was created
The above results are as expected, so the bug is pre-merge verified. After the PR is merged, the bot will automatically move the bug to VERIFIED.
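The fixed behavior the outputs demonstrate is that the backoff is tied to the revision that failed: once a newer target revision appears, the failure count no longer applies and its first installer starts immediately (note lastFailedCount is gone from the status for revision 9). A sketch of that rule, using field names from the nodeStatuses output above (hypothetical helper, not the actual library-go code):

```go
package main

import "fmt"

// nodeStatus mirrors the fields shown in the oc output above.
type nodeStatus struct {
	lastFailedRevision int
	lastFailedCount    int
	targetRevision     int
}

// resetBackoffForNewRevision models the fix: the retry backoff applies only
// while the target is still the revision that failed; a newer target
// revision clears the failure count, so its first installer is not delayed.
func resetBackoffForNewRevision(s nodeStatus) nodeStatus {
	if s.targetRevision != s.lastFailedRevision {
		s.lastFailedCount = 0
	}
	return s
}

func main() {
	// Revision 8 failed 10 times; revision 9 is now the target.
	s := nodeStatus{lastFailedRevision: 8, lastFailedCount: 10, targetRevision: 9}
	s = resetBackoffForNewRevision(s)
	fmt.Println(s.lastFailedCount) // 0: installer-9 is created within seconds
}
```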

Comment 4 Ke Wang 2021-09-14 01:26:26 UTC
We can't wait for the bot to move the bug to VERIFIED, because the person in charge of the relevant errata is urging us to move it forward, so it is being moved manually.

Comment 6 errata-xmlrpc 2021-09-21 08:01:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.12 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3511

