Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1985997

Summary: kube-apiserver in SNO must not brick the cluster when a config observer outputs invalid data that would eventually converge towards a running system in HA setup
Product: OpenShift Container Platform Reporter: Lukasz Szaszkiewicz <lszaszki>
Component: kube-apiserver    Assignee: Lukasz Szaszkiewicz <lszaszki>
Status: CLOSED ERRATA QA Contact: Ke Wang <kewang>
Severity: high Docs Contact:
Priority: high    
Version: 4.9    CC: aos-bugs, mfojtik, sttts, xxia
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-18 17:41:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Lukasz Szaszkiewicz 2021-07-26 13:13:59 UTC

Comment 4 Ke Wang 2021-09-13 11:03:16 UTC
Verification steps:

1. Add foo: ["bar"] as an additional argument for kube-apiserver via unsupportedConfigOverrides:
$ oc edit kubeapiserver cluster
spec:
  .....
  unsupportedConfigOverrides:
    apiServerArguments:
      foo:
      - bar
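
Alternatively (a sketch, not part of the original steps), the same override can be applied non-interactively with a merge patch; the field names match the YAML above:
$ oc patch kubeapiserver cluster --type=merge -p '{"spec":{"unsupportedConfigOverrides":{"apiServerArguments":{"foo":["bar"]}}}}'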

In another terminal, run the script test.sh:
#!/usr/bin/env bash
while true; do
  oc get co/kube-apiserver
  oc get kubeapiserver -o yaml | grep -A15 'latestAvailableRevision'
  sleep 30
done
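
Roughly equivalent (a sketch; it assumes the watch utility is available where oc runs), the same polling can be done without a script:
$ watch -n 30 'oc get co/kube-apiserver; oc get kubeapiserver -o yaml | grep -A15 latestAvailableRevision'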

Wait a few minutes; output similar to the following is displayed.

NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.0-0.nightly-2021-09-10-170926   True        True          False      6h5m    NodeInstallerProgressing: 1 nodes are at revision 5; 0 nodes have achieved new revision 6

    latestAvailableRevision: 6
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 5
      nodeName: kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal
      targetRevision: 6
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
...

NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.0-0.nightly-2021-09-10-170926   True        True          False      6h16m   NodeInstallerProgressing: 1 nodes are at revision 5; 0 nodes have achieved new revision 6
    latestAvailableRevision: 6
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 5
      lastFailedReason: OperandFailedFallback
      lastFailedRevision: 6
      lastFailedRevisionErrors:
      - 'fallback to last-known-good revision 5 took place after: waiting for kube-apiserver
        static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd":
        dial tcp [::1]:6443: connect: connection refused (NetworkError)'
      lastFailedTime: "2021-09-13T08:22:23Z"
      lastFallbackCount: 1
      nodeName: kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal
      targetRevision: 6
    readyReplicas: 0
kind: List
metadata:
...
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.0-0.nightly-2021-09-10-170926   True        True          True       6h18m   StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal was rolled back to revision 6 due to waiting for kube-apiserver static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd": dial tcp [::1]:6443: connect: connection refused
    latestAvailableRevision: 6
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 5
      lastFailedReason: OperandFailedFallback
      lastFailedRevision: 6
      lastFailedRevisionErrors:
      - 'fallback to last-known-good revision 5 took place after: waiting for kube-apiserver
        static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd":
        dial tcp [::1]:6443: connect: connection refused (NetworkError)'
      lastFailedTime: "2021-09-13T08:22:23Z"
      lastFallbackCount: 1
      nodeName: kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal
      targetRevision: 6
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  
2. Wait a few more minutes, until the lastFallbackCount value is 1 and the message above is shown, then remove the additional argument from unsupportedConfigOverrides (one way is sketched below). This triggers a new revision, and the kube-apiserver comes back.
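
A minimal way to do that removal (a sketch; it assumes no other overrides are in use, since a null merge patch clears the whole field):
$ oc patch kubeapiserver cluster --type=merge -p '{"spec":{"unsupportedConfigOverrides":null}}'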
  
...
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.0-0.nightly-2021-09-10-170926   True        False         False      6h26m   
    latestAvailableRevision: 7
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      lastFailedReason: OperandFailedFallback
      lastFailedRevision: 6
      lastFailedRevisionErrors:
      - 'fallback to last-known-good revision 5 took place after: waiting for kube-apiserver
        static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd":
        dial tcp [::1]:6443: connect: connection refused (NetworkError)'
      lastFailedTime: "2021-09-13T08:22:23Z"
      lastFallbackCount: 1
      nodeName: kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.0-0.nightly-2021-09-10-170926   True        False         False      6h27m   

Log in to the master node; the current kube-apiserver-last-known-good symlink points to kube-apiserver-pod-7, set by the startup-monitor:
sh-4.4# ls kube-apiserver-last-known-good -l
lrwxrwxrwx. 1 root root 81 Sep 13 08:34 kube-apiserver-last-known-good -> /etc/kubernetes/static-pod-resources/kube-apiserver-pod-7/kube-apiserver-pod.yaml
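
For reference, the same check can be run without an interactive node shell (a sketch; the node name is taken from the output above, and the symlink location under /etc/kubernetes/static-pod-resources is assumed):
$ oc debug node/kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal -- chroot /host ls -l /etc/kubernetes/static-pod-resources/kube-apiserver-last-known-good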

We can see that a startup-monitor pod was created by the installer for each revision of kube-apiserver and then removed after the operand became ready:
sh-4.4# find . -name 'kube-apiserver-startup-monitor-pod.yaml' 
./kube-apiserver-pod-2/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-2/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-3/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-3/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-4/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-4/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-5/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-5/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-6/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-6/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-7/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-7/kube-apiserver-startup-monitor-pod.yaml

sh-4.4# journalctl -b -u crio | grep -E '(Creating container|Removed container).*kube-apiserver-startup-monitor'
...
Sep 13 08:33:54 kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal crio[1621]: time="2021-09-13 08:33:54.866229951Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-startup-monitor-kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal/startup-monitor" id=9483384c-922e-4352-a2e4-a70f40323c1d name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 13 08:39:57 kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal crio[1621]: time="2021-09-13 08:39:57.883444823Z" level=info msg="Removed container 066f86863e5aa322da1de33381e1c8e79f368a99a885f4d1f0c7125d0bb1632b: openshift-kube-apiserver/kube-apiserver-startup-monitor-kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal/startup-monitor" id=a796ddad-272e-4ac3-ad29-3f94351ddacc name=/runtime.v1alpha2.RuntimeService/RemoveContainer

From the above, we can see the startup-monitor watched the operand pod for readiness and the fallback mechanism works as expected, so moving the bug to VERIFIED.

Comment 6 errata-xmlrpc 2021-10-18 17:41:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759