Bug 1985997
| Summary: | kube-apiserver in SNO must not brick the cluster when a config observer outputs invalid data that would eventually converge towards a running system in HA setup | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lukasz Szaszkiewicz <lszaszki> |
| Component: | kube-apiserver | Assignee: | Lukasz Szaszkiewicz <lszaszki> |
| Status: | CLOSED ERRATA | QA Contact: | Ke Wang <kewang> |
| Severity: | high | Docs Contact: | |
| Priority: | high | CC: | aos-bugs, mfojtik, sttts, xxia |
| Version: | 4.9 | Target Release: | 4.9.0 |
| Target Milestone: | --- | Hardware: | Unspecified |
| OS: | Unspecified | Fixed In Version: | |
| Doc Type: | If docs needed, set a value | Doc Text: | |
| Story Points: | --- | Type: | Bug |
| Last Closed: | 2021-10-18 17:41:12 UTC | | |
|
Description (Lukasz Szaszkiewicz, 2021-07-26 13:13:59 UTC)
Verification steps:
1. Add foo: ["bar"] as an additional argument to the kube-apiserver via unsupportedConfigOverrides:
$ oc edit kubeapiserver cluster
spec:
.....
unsupportedConfigOverrides:
apiServerArguments:
foo:
- bar
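The same override can also be applied non-interactively. This is a sketch using oc patch instead of the interactive oc edit step above (which is the procedure actually used for verification):
$ oc patch kubeapiserver cluster --type=merge -p '{"spec":{"unsupportedConfigOverrides":{"apiServerArguments":{"foo":["bar"]}}}}'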
In another terminal, run the script test.sh:
#!/usr/bin/env bash
# Poll the kube-apiserver ClusterOperator and revision status every 30 seconds.
while true; do
  oc get co/kube-apiserver
  oc get kubeapiserver -o yaml | grep -A15 'latestAvailableRevision'
  sleep 30
done
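Equivalently, a plain watch invocation gives the same periodic view without a script file (a sketch, assuming watch is available on the workstation):
$ watch -n 30 'oc get co/kube-apiserver; oc get kubeapiserver -oyaml | grep -A15 latestAvailableRevision'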
Wait a few minutes; output like the following is displayed:
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-apiserver 4.9.0-0.nightly-2021-09-10-170926 True True False 6h5m NodeInstallerProgressing: 1 nodes are at revision 5; 0 nodes have achieved new revision 6
latestAvailableRevision: 6
latestAvailableRevisionReason: ""
nodeStatuses:
- currentRevision: 5
nodeName: kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal
targetRevision: 6
readyReplicas: 0
kind: List
metadata:
resourceVersion: ""
selfLink: ""
...
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-apiserver 4.9.0-0.nightly-2021-09-10-170926 True True False 6h16m NodeInstallerProgressing: 1 nodes are at revision 5; 0 nodes have achieved new revision 6
latestAvailableRevision: 6
latestAvailableRevisionReason: ""
nodeStatuses:
- currentRevision: 5
lastFailedReason: OperandFailedFallback
lastFailedRevision: 6
lastFailedRevisionErrors:
- 'fallback to last-known-good revision 5 took place after: waiting for kube-apiserver
static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd":
dial tcp [::1]:6443: connect: connection refused (NetworkError)'
lastFailedTime: "2021-09-13T08:22:23Z"
lastFallbackCount: 1
nodeName: kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal
targetRevision: 6
readyReplicas: 0
kind: List
metadata:
...
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-apiserver 4.9.0-0.nightly-2021-09-10-170926 True True True 6h18m StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal was rolled back to revision 6 due to waiting for kube-apiserver static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd": dial tcp [::1]:6443: connect: connection refused
latestAvailableRevision: 6
latestAvailableRevisionReason: ""
nodeStatuses:
- currentRevision: 5
lastFailedReason: OperandFailedFallback
lastFailedRevision: 6
lastFailedRevisionErrors:
- 'fallback to last-known-good revision 5 took place after: waiting for kube-apiserver
static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd":
dial tcp [::1]:6443: connect: connection refused (NetworkError)'
lastFailedTime: "2021-09-13T08:22:23Z"
lastFallbackCount: 1
nodeName: kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal
targetRevision: 6
readyReplicas: 0
kind: List
metadata:
resourceVersion: ""
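The fallback bookkeeping can also be pulled out directly instead of grepping the full YAML; a sketch using jsonpath against the status fields shown above:
$ oc get kubeapiserver cluster -o jsonpath='{.status.nodeStatuses[0].lastFallbackCount}{"\n"}{.status.nodeStatuses[0].lastFailedReason}{"\n"}'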
2. Wait a few more minutes, until the lastFallbackCount value is 1 and the message above is shown, then remove the additional argument from unsupportedConfigOverrides. This triggers a new revision, and the kube-apiserver comes back.
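The removal can likewise be done non-interactively; a sketch that clears the whole override (this assumes no other unsupportedConfigOverrides are set on the cluster):
$ oc patch kubeapiserver cluster --type=merge -p '{"spec":{"unsupportedConfigOverrides":null}}'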
...
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-apiserver 4.9.0-0.nightly-2021-09-10-170926 True False False 6h26m
latestAvailableRevision: 7
latestAvailableRevisionReason: ""
nodeStatuses:
- currentRevision: 7
lastFailedReason: OperandFailedFallback
lastFailedRevision: 6
lastFailedRevisionErrors:
- 'fallback to last-known-good revision 5 took place after: waiting for kube-apiserver
static pod to listen on port 6443: Get "https://localhost:6443/healthz/etcd":
dial tcp [::1]:6443: connect: connection refused (NetworkError)'
lastFailedTime: "2021-09-13T08:22:23Z"
lastFallbackCount: 1
nodeName: kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal
readyReplicas: 0
kind: List
metadata:
resourceVersion: ""
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-apiserver 4.9.0-0.nightly-2021-09-10-170926 True False False 6h27m
Log in to the master node; the current kube-apiserver-last-known-good link has been pointed at kube-apiserver-pod-7 by the startup-monitor:
sh-4.4# ls kube-apiserver-last-known-good -l
lrwxrwxrwx. 1 root root 81 Sep 13 08:34 kube-apiserver-last-known-good -> /etc/kubernetes/static-pod-resources/kube-apiserver-pod-7/kube-apiserver-pod.yaml
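If SSH access is not at hand, the same shell can be reached with oc debug (a sketch; the node name comes from the nodeStatuses output above, and the symlink is assumed to live under /etc/kubernetes/static-pod-resources, per the target path shown):
$ oc debug node/kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal
sh-4.4# chroot /host
sh-4.4# cd /etc/kubernetes/static-pod-resources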
We can see that a startup-monitor pod was created by the installer for each revision of the kube-apiserver and then removed once the operand became ready:
sh-4.4# find . -name 'kube-apiserver-startup-monitor-pod.yaml'
./kube-apiserver-pod-2/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-2/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-3/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-3/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-4/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-4/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-5/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-5/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-6/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-6/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-7/configmaps/kube-apiserver-pod/kube-apiserver-startup-monitor-pod.yaml
./kube-apiserver-pod-7/kube-apiserver-startup-monitor-pod.yaml
sh-4.4# journalctl -b -u crio | grep -E '(Creating container|Removed container).*kube-apiserver-startup-monitor'
...
Sep 13 08:33:54 kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal crio[1621]: time="2021-09-13 08:33:54.866229951Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-startup-monitor-kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal/startup-monitor" id=9483384c-922e-4352-a2e4-a70f40323c1d name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 13 08:39:57 kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal crio[1621]: time="2021-09-13 08:39:57.883444823Z" level=info msg="Removed container 066f86863e5aa322da1de33381e1c8e79f368a99a885f4d1f0c7125d0bb1632b: openshift-kube-apiserver/kube-apiserver-startup-monitor-kewang-13sno1-gjpzm-master-0.c.openshift-qe.internal/startup-monitor" id=a796ddad-272e-4ac3-ad29-3f94351ddacc name=/runtime.v1alpha2.RuntimeService/RemoveContainer
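Once the rollout settles, the monitor pod itself should be gone from the operand namespace; a quick check (a sketch, expecting empty output after removal):
$ oc get pods -n openshift-kube-apiserver | grep startup-monitor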
From the crio log above, we can see the startup-monitor watched the operand pod for readiness, and the fallback mechanism works as expected, so moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:3759