Bug 1989633 - staticpod/installer: backoff should not apply if latestAvailableRevision > targetRevision
Summary: staticpod/installer: backoff should not apply if latestAvailableRevision > targetRevision
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Duplicates: 1993802
Depends On:
Blocks: 1993800
 
Reported: 2021-08-03 16:01 UTC by Stefan Schimanski
Modified: 2021-10-18 17:44 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:44:13 UTC
Target Upstream Version:
Embargoed:




Links
System ID                                                      Last Updated
Github openshift/cluster-kube-apiserver-operator pull 1200     2021-08-03 16:02:23 UTC
Red Hat Product Errata RHSA-2021:3759                          2021-10-18 17:44:29 UTC

Description Stefan Schimanski 2021-08-03 16:01:17 UTC
If there is a new pending revision for the static pod installer controller, the backoff for fallbacks and retries should not apply; instead we want to progress as quickly as possible (that is what we always did here, we just forgot it for the backoff).

Comment 2 Stefan Schimanski 2021-08-17 08:51:06 UTC
*** Bug 1993802 has been marked as a duplicate of this bug. ***

Comment 3 Ke Wang 2021-08-18 04:28:43 UTC
From PR https://github.com/openshift/cluster-kube-apiserver-operator/pull/1200, there is one condition, 'if operatorStatus.LatestAvailableRevision > currNodeState.TargetRevision': no backoff is applied if a new revision is pending; otherwise, the fallback will occur.
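For illustration only, a minimal sketch of that condition in Go (hypothetical types and helper names, not the actual installer controller code from PR 1200): the retry/fallback backoff is applied only when no newer revision is pending.

// Hypothetical sketch of the backoff-skip condition; types and the
// backoff parameters are illustrative, not the operator's real code.
package main

import (
	"fmt"
	"time"
)

type operatorStatus struct {
	LatestAvailableRevision int32
}

type nodeState struct {
	TargetRevision  int32
	LastFailedCount int
}

// retryDelay returns how long to wait before retrying the installer for a node.
// If a newer revision is already pending, no backoff is applied so the node
// can move to the new revision as quickly as possible.
func retryDelay(status operatorStatus, node nodeState) time.Duration {
	if status.LatestAvailableRevision > node.TargetRevision {
		return 0 // new revision pending: progress immediately, no backoff
	}
	// Otherwise apply an exponential backoff based on previous failures
	// (capped; the real controller uses its own backoff parameters).
	delay := time.Duration(1<<node.LastFailedCount) * 30 * time.Second
	if maxDelay := 10 * time.Minute; delay > maxDelay {
		delay = maxDelay
	}
	return delay
}

func main() {
	// Newer revision pending: no backoff.
	fmt.Println(retryDelay(operatorStatus{LatestAvailableRevision: 10}, nodeState{TargetRevision: 9, LastFailedCount: 2})) // 0s
	// No newer revision: backoff applies.
	fmt.Println(retryDelay(operatorStatus{LatestAvailableRevision: 9}, nodeState{TargetRevision: 9, LastFailedCount: 2})) // 2m0s
}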

- Case: operatorStatus.LatestAvailableRevision is not greater than currNodeState.TargetRevision. Verification steps on a single-node cluster are as follows.
1. Add foo: ["bar"] as an additional argument for kube-apiserver via unsupportedConfigOverrides
$ oc edit kubeapiserver cluster
spec:
  .....
  unsupportedConfigOverrides:
    apiServerArguments:
      foo:
      - bar

Wait for a few minutes,

$ oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver                             4.9.0-0.nightly-2021-08-17-122812   True        True          True       34m     StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-ci-ln-xpw43jt-f76d1-trjjv-master-0 was rolled back to revision 9 due to waiting for kube-apiserver static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd": dial tcp [::1]:6443: connect: connection refused

$ oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 9
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 8
      lastFailedReason: OperandFailedFallback
      lastFailedRevision: 9
      lastFailedRevisionErrors:
      - 'fallback to last-known-good revision 8 took place after: waiting for kube-apiserver
        static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd":
        dial tcp [::1]:6443: connect: connection refused (NetworkError)'
      lastFailedTime: "2021-08-18T02:28:14Z"
      lastFallbackCount: 1
      nodeName: ci-ln-xpw43jt-f76d1-trjjv-master-0
      targetRevision: 9
    readyReplicas: 0
kind: List
metadata:


2. Wait several more minutes (about 30 minutes) until the lastFallbackCount goes up to 2

$ oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver                             4.9.0-0.nightly-2021-08-17-122812   True        True          True       67m     StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-ci-ln-xpw43jt-f76d1-trjjv-master-0 was rolled back to revision 9 due to waiting for kube-apiserver static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd": dial tcp [::1]:6443: connect: connection refused

$ oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
    latestAvailableRevision: 9
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 8
      lastFailedReason: OperandFailedFallback
      lastFailedRevision: 9
      lastFailedRevisionErrors:
      - 'fallback to last-known-good revision 8 took place after: waiting for kube-apiserver
        static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd":
        dial tcp [::1]:6443: connect: connection refused (NetworkError)'
      lastFailedTime: "2021-08-18T02:56:52Z"
      lastFallbackCount: 2
      nodeName: ci-ln-xpw43jt-f76d1-trjjv-master-0
      targetRevision: 9
    readyReplicas: 0
kind: List
metadata:

3. Then remove the additional argument from unsupportedConfigOverrides. This triggers a new revision.
$ oc edit kubeapiserver  # Remove the invalid additional argument from unsupportedConfigOverrides
kubeapiserver.operator.openshift.io/cluster edited

$ oc get po -n openshift-kube-apiserver --show-labels -l apiserver
NAME                                                READY   STATUS    RESTARTS   AGE   LABELS
kube-apiserver-ci-ln-xpw43jt-f76d1-trjjv-master-0   5/5     Running   0          10m   apiserver=true,app=openshift-kube-apiserver,revision=10

$ oc get co | grep -v '.True.*False.*False'
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE


- Common case: operatorStatus.LatestAvailableRevision > currNodeState.TargetRevision
Force repeated rollouts of kube-apiserver:
$ for i in {1..5}; do echo "rollout $i"; oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'; sleep 300; done &
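
While the loop runs, the pending-revision condition can be observed, for example, by comparing the operator's latestAvailableRevision with the node's targetRevision (field names as shown in the status output above); whenever a forced rollout produces a new revision before the node finishes the previous one, LatestAvailableRevision > TargetRevision holds and no backoff should be applied:

$ oc get kubeapiserver cluster -o jsonpath='latest={.status.latestAvailableRevision} target={.status.nodeStatuses[0].targetRevision}{"\n"}'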

$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-656bd8d466-k7nq4  | grep 'because static pod is pending'

I0818 03:31:21.628375       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 11, but has not made progress because static pod is pending
I0818 03:31:24.341428       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 11, but has not made progress because static pod is pending
I0818 03:31:28.037967       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 11, but has not made progress because static pod is pending
I0818 03:31:30.203295       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 11, but has not made progress because static pod is pending
I0818 03:36:26.609207       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 12, but has not made progress because static pod is pending
I0818 03:36:28.379478       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 12, but has not made progress because static pod is pending
I0818 03:36:31.520424       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 12, but has not made progress because static pod is pending
I0818 03:36:34.734073       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 12, but has not made progress because static pod is pending
I0818 03:41:07.375868       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 13, but has not made progress because static pod is pending
I0818 03:41:08.257594       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 13, but has not made progress because static pod is pending
I0818 03:46:12.782345       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 14, but has not made progress because static pod is pending
I0818 03:46:13.237851       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 14, but has not made progress because static pod is pending
I0818 03:51:52.757578       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 15, but has not made progress because static pod is pending
I0818 03:51:56.088486       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 15, but has not made progress because static pod is pending
I0818 03:51:59.858662       1 installer_controller.go:512] "ci-ln-xpw43jt-f76d1-trjjv-master-0" is in transition to 15, but has not made progress because static pod is pending

Revisions 11 through 15 were caused by the forced rollouts; no backoff was applied because a new revision was pending. From the above results, the PR fix worked as expected, so moving the bug to VERIFIED.

Comment 6 errata-xmlrpc 2021-10-18 17:44:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

