Bug 1985997 - kube-apiserver in SNO must not brick the cluster when a config observer outputs invalid data that would eventually converge towards a running system in HA setup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Lukasz Szaszkiewicz
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-07-26 13:13 UTC by Lukasz Szaszkiewicz
Modified: 2021-10-18 17:42 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:41:12 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift api pull 978 | 0 | None | open | Bug 1985997: operator/v1: clarify nodeStatus.last{failedCount,fallbackCount} | 2021-08-04 21:45:27 UTC
Github | openshift cluster-kube-apiserver-operator pull 1177 | 0 | None | open | Bug 1985997: wires the startup monitor | 2021-08-09 10:38:48 UTC
Github | openshift cluster-kube-apiserver-operator pull 1189 | 0 | None | open | Bug 1985997: wires startup monitor related controllers | 2021-08-09 10:38:48 UTC
Github | openshift cluster-kube-apiserver-operator pull 1194 | 0 | None | None | None | 2021-08-09 10:38:48 UTC
Github | openshift cluster-kube-apiserver-operator pull 1196 | 0 | None | None | None | 2021-08-09 10:38:49 UTC
Github | openshift cluster-kube-apiserver-operator pull 1197 | 0 | None | None | None | 2021-08-09 10:38:50 UTC
Github | openshift cluster-kube-apiserver-operator pull 1198 | 0 | None | None | None | 2021-08-04 21:45:31 UTC
Github | openshift release pull 20795 | 0 | None | None | None | 2021-08-04 21:45:31 UTC
Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:42:29 UTC

Description Lukasz Szaszkiewicz 2021-07-26 13:13:59 UTC

Comment 4 Ke Wang 2021-09-13 11:03:16 UTC
Verification steps:

1. Add foo: ["bar"] as an additional argument for kube-apiserver via unsupportedConfigOverrides:
$ oc edit kubeapiserver cluster
spec:
  .....
  unsupportedConfigOverrides:
    apiServerArguments:
      foo:
      - bar
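
The same override can probably be applied without an interactive editor. The following is a minimal sketch, not part of the original verification: override_patch is a hypothetical helper that only builds a JSON merge patch for the field shown above, and applying it assumes a logged-in oc session with sufficient rights.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a JSON merge patch that sets one extra
# kube-apiserver argument under
# spec.unsupportedConfigOverrides.apiServerArguments.
# $1 = argument name, $2 = argument value (neither may contain quotes).
override_patch() {
  printf '{"spec":{"unsupportedConfigOverrides":{"apiServerArguments":{"%s":["%s"]}}}}\n' "$1" "$2"
}

# Usage (assumption: run against a disposable test cluster only):
#   oc patch kubeapiserver cluster --type=merge -p "$(override_patch foo bar)"
```

A merge patch leaves the rest of the spec untouched, which keeps the change limited to the one argument under test.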

In another terminal, run the following script (test.sh):
#!/usr/bin/env bash
# Poll the operator status and the per-node revision status every 30 seconds.
while true; do
  oc get co/kube-apiserver
  oc get kubeapiserver -oyaml | grep -A15 'latestAvailableRevision'
  sleep 30
done
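
Instead of reading the polled output by eye, the fallback counter can be extracted from it. This is a rough sketch under the assumption that the YAML looks like the nodeStatuses output in this comment; max_fallback_count is a hypothetical helper name, not something the operator provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kubeapiserver YAML on stdin and print the largest
# lastFallbackCount value found, or 0 when the field does not appear
# (that is, no fallback has been recorded yet).
max_fallback_count() {
  awk '{
         for (i = 1; i < NF; i++)
           if ($i == "lastFallbackCount:" && $(i + 1) + 0 > max)
             max = $(i + 1) + 0
       }
       END { print max + 0 }'
}

# Usage (assumption: logged-in oc session):
#   until [ "$(oc get kubeapiserver cluster -o yaml | max_fallback_count)" -ge 1 ]; do
#     sleep 30
#   done
```

Taking the maximum across nodes is a design choice for the single-node case described here; on a multi-node cluster a per-node check may be more useful.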

Wait a few minutes; output like the following is displayed.

NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.0-0.nightly-2021-09-10-170926   True        True          False      6h5m    NodeInstallerProgressing: 1 nodes are at revision 5; 0 nodes have achieved new revision 6

    latestAvailableRevision: 6
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 5
      nodeName: kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal
      targetRevision: 6
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
...

NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.0-0.nightly-2021-09-10-170926   True        True          False      6h16m   NodeInstallerProgressing: 1 nodes are at revision 5; 0 nodes have achieved new revision 6
    latestAvailableRevision: 6
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 5
      lastFailedReason: OperandFailedFallback
      lastFailedRevision: 6
      lastFailedRevisionErrors:
      - 'fallback to last-known-good revision 5 took place after: waiting for kube-apiserver
        static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd":
        dial tcp [::1]:6443: connect: connection refused (NetworkError)'
      lastFailedTime: "2021-09-13T08:22:23Z"
      lastFallbackCount: 1
      nodeName: kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal
      targetRevision: 6
    readyReplicas: 0
kind: List
metadata:
...
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.0-0.nightly-2021-09-10-170926   True        True          True       6h18m   StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal was rolled back to revision 6 due to waiting for kube-apiserver static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd": dial tcp [::1]:6443: connect: connection refused
    latestAvailableRevision: 6
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 5
      lastFailedReason: OperandFailedFallback
      lastFailedRevision: 6
      lastFailedRevisionErrors:
      - 'fallback to last-known-good revision 5 took place after: waiting for kube-apiserver
        static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd":
        dial tcp [::1]:6443: connect: connection refused (NetworkError)'
      lastFailedTime: "2021-09-13T08:22:23Z"
      lastFallbackCount: 1
      nodeName: kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal
      targetRevision: 6
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  
2. Wait a few more minutes until lastFallbackCount is 1 and the message above is shown, then remove the additional argument from unsupportedConfigOverrides. This triggers a new revision, and kube-apiserver comes back.
  
...
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.0-0.nightly-2021-09-10-170926   True        False         False      6h26m   
    latestAvailableRevision: 7
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 7
      lastFailedReason: OperandFailedFallback
      lastFailedRevision: 6
      lastFailedRevisionErrors:
      - 'fallback to last-known-good revision 5 took place after: waiting for kube-apiserver
        static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd":
        dial tcp [::1]:6443: connect: connection refused (NetworkError)'
      lastFailedTime: "2021-09-13T08:22:23Z"
      lastFallbackCount: 1
      nodeName: kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.9.0-0.nightly-2021-09-10-170926   True        False         False      6h27m   

After logging in to the master node, kube-apiserver-last-known-good now points to kube-apiserver-pod-7, as set by the startup-monitor:
sh-4.4# ls kube-apiserver-last-known-good -l
lrwxrwxrwx. 1 root root 81 Sep 13 08:34 kube-apiserver-last-known-good -> /etc/kubernetes/static-pod-resources/kube-apiserver-pod-7/kube-apiserver-pod.yaml
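
The symlink check can also be scripted. The sketch below assumes the link target contains a kube-apiserver-pod-<N> directory, as in the listing above; last_known_good_revision is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the revision number that a last-known-good
# symlink points at. Assumes the target path contains a
# "kube-apiserver-pod-<N>/" component. Returns non-zero when the path is
# not a symlink or the pattern is not found.
last_known_good_revision() {
  local target rev
  target="$(readlink "$1")" || return 1
  rev="$(printf '%s\n' "$target" |
    sed -n 's|.*kube-apiserver-pod-\([0-9][0-9]*\)/.*|\1|p')"
  [ -n "$rev" ] || return 1
  printf '%s\n' "$rev"
}

# Usage, from the directory that holds the symlink on the node:
#   last_known_good_revision kube-apiserver-last-known-good
```

Comparing the printed number with currentRevision from the operator status is one way to confirm that the last-known-good pointer followed the successful rollout.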

We can see that the installer created a startup-monitor for each revision of kube-apiserver and that it was removed after the operand became ready:
sh-4.4# find . -name 'kube-apiserver-startup-monitor-pod.yaml' 
./kube-apiserver-pod-2/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-2/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-3/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-3/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-4/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-4/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-5/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-5/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-6/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-6/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-7/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-7/kube-apiserver-startup-monitor-pod.yaml

sh-4.4# journalctl -b -u crio | grep -E '(Creating container|Removed container).*kube-apiserver-startup-monitor'
...
Sep 13 08:33:54 kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal crio[1621]: time="2021-09-13 08:33:54.866229951Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-startup-monitor-kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal/startup-monitor" id=9483384c-922e-4352-a2e4-a70f40323c1d name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 13 08:39:57 kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal crio[1621]: time="2021-09-13 08:39:57.883444823Z" level=info msg="Removed container 066f86863e5aa322da1de33381e1c8e79f368a99a885f4d1f0c7125d0bb1632b: openshift-kube-apiserver/kube-apiserver-startup-monitor-kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal/startup-monitor" id=a796ddad-272e-4ac3-ad29-3f94351ddacc name=/runtime.v1alpha2.RuntimeService/RemoveContainer
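
To summarize the journal check rather than scan it manually, the create and remove events can be counted. This is a sketch that assumes the crio log messages keep the "Creating container" and "Removed container" wording shown above; startup_monitor_events is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read crio journal lines on stdin and count how many
# startup-monitor containers were created and how many were removed.
startup_monitor_events() {
  awk '/kube-apiserver-startup-monitor/ {
         if (/Creating container/) created++
         else if (/Removed container/) removed++
       }
       END { printf "created=%d removed=%d\n", created, removed }'
}

# Usage on the node:
#   journalctl -b -u crio | startup_monitor_events
```

Matching counts would suggest that no startup-monitor container was left behind, although the counts alone do not show which revision each event belongs to.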

From the above, we can see that the startup-monitor watched the operand pods for readiness and that the fallback mechanism works as expected, so moving the bug to VERIFIED.

Comment 6 errata-xmlrpc 2021-10-18 17:41:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

