Bug 1792382
| Summary: | apiserver did not report degraded when it had a crashlooping pod | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Parees <bparees> |
| Component: | openshift-apiserver | Assignee: | David Eads <deads> |
| Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.0 | CC: | aos-bugs, deads, mfojtik, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-04 11:24:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
**Description** (Ben Parees, 2020-01-17 15:24:14 UTC)
From the failed job's DaemonSet [1]:

```yaml
metadata:
  ...
  generation: 5
  ...
status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 2
  numberMisscheduled: 0
  numberReady: 2
  numberUnavailable: 1
  observedGeneration: 5
  updatedNumberScheduled: 1
```
So we have:
* numberAvailable > 0, so we are Available=True
* observedGeneration == generation, which is necessary, but not sufficient, for Progressing=False
* numberReady < currentNumberScheduled, so we are Progressing=True
* updatedNumberScheduled != currentNumberScheduled, so we are Progressing=True
* numberUnavailable > 0 for a while, so we should be Degraded=True
[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14591/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b7bf9a252c3ac57a660c46b89bb9fc3782163b156d57de3be5538286b90af020/namespaces/openshift-apiserver/apps/daemonsets.yaml
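The condition derivation in the bullets above can be sketched compactly. Below is a minimal Go illustration of that reasoning; the `DaemonSetStatus` struct here is a simplified, hypothetical stand-in (it folds `metadata.generation` into the status fields) and is not the real `apps/v1` type or the actual operator code:

```go
package main

import "fmt"

// DaemonSetStatus mirrors the fields quoted from the must-gather YAML above.
// Hypothetical simplified struct, not the real Kubernetes API type.
type DaemonSetStatus struct {
	Generation             int64
	ObservedGeneration     int64
	DesiredNumberScheduled int32
	CurrentNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
	NumberAvailable        int32
	NumberUnavailable      int32
}

// Conditions derives Available/Progressing/Degraded along the lines of the
// bullets above. Sketch only; the real check should also require that
// numberUnavailable > 0 persists for a grace period before going Degraded,
// which is elided here.
func Conditions(s DaemonSetStatus) (available, progressing, degraded bool) {
	available = s.NumberAvailable > 0
	progressing = s.ObservedGeneration != s.Generation ||
		s.NumberReady < s.CurrentNumberScheduled ||
		s.UpdatedNumberScheduled != s.CurrentNumberScheduled
	degraded = s.NumberUnavailable > 0
	return
}

func main() {
	// Values from the failed job's DaemonSet status quoted above.
	s := DaemonSetStatus{
		Generation: 5, ObservedGeneration: 5,
		DesiredNumberScheduled: 3, CurrentNumberScheduled: 3,
		UpdatedNumberScheduled: 1, NumberReady: 2,
		NumberAvailable: 2, NumberUnavailable: 1,
	}
	a, p, d := Conditions(s)
	fmt.Printf("Available=%v Progressing=%v Degraded=%v\n", a, p, d)
}
```

Feeding in the quoted status yields Available=true, Progressing=true, Degraded=true, matching the bullets: the operator should have reported Degraded but did not.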
Verified in 4.4.0-0.nightly-2020-01-22-073853 env:
```
$ oc get svc -n openshift-etcd etcd -o yaml > svc-etcd-openshift-etcd.yaml
$ oc delete svc -n openshift-etcd etcd
$ oc delete po apiserver-72xcp -n openshift-apiserver  ## force a replacement pod to start while etcd.openshift-etcd.svc is unresolvable
$ oc get pod -n openshift-apiserver
NAME              READY   STATUS             RESTARTS   AGE
apiserver-9lmgf   0/1     CrashLoopBackOff   3          4m29s
apiserver-nmp4z   1/1     Running            0          117m
apiserver-tg885   1/1     Running            0          115m
```
```
$ oc get pod apiserver-9lmgf -o yaml -n openshift-apiserver
...
      exitCode: 255
      finishedAt: "2020-01-22T10:58:15Z"
      message: |
        ...
        W0122 10:58:11.503375 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd.openshift-etcd.svc:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd.openshift-etcd.svc on 172.30.0.10:53: no such host". Reconnecting...
        F0122 10:58:15.061768 1 openshift_apiserver.go:420] context deadline exceeded
      reason: Error
      startedAt: "2020-01-22T10:57:53Z"
    name: openshift-apiserver
    ready: false
    restartCount: 3
    started: false
    state:
      waiting:
        message: back-off 40s restarting failed container=openshift-apiserver pod=apiserver-9lmgf_openshift-apiserver(3aa7e092-6c6c-46d5-9d62-5a095142563f)
        reason: CrashLoopBackOff
...
```
```
$ oc get co openshift-apiserver  ## DEGRADED becomes True
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-apiserver   4.4.0-0.nightly-2020-01-22-073853   True        False         True       117m
$ oc create -f svc-etcd-openshift-etcd.yaml
```
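The pod-level symptom above (a container stuck waiting with reason `CrashLoopBackOff`) is exactly what the operator must surface as Degraded. A minimal Go sketch of that detection follows; `ContainerStatus` is a simplified hypothetical struct, not the real `corev1` type, and this is an illustration of the check, not the shipped fix:

```go
package main

import "fmt"

// ContainerStatus is a simplified stand-in for the container status fields
// shown in the pod YAML above (hypothetical struct, not the corev1 type).
type ContainerStatus struct {
	Name          string
	Ready         bool
	RestartCount  int32
	WaitingReason string // state.waiting.reason, e.g. "CrashLoopBackOff"
}

// crashLooping reports whether any container is stuck in CrashLoopBackOff,
// the condition this bug says should drive the operator's Degraded status.
func crashLooping(statuses []ContainerStatus) bool {
	for _, cs := range statuses {
		if cs.WaitingReason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}

func main() {
	// Values taken from the apiserver-9lmgf pod output above.
	pod := []ContainerStatus{{
		Name: "openshift-apiserver", Ready: false,
		RestartCount: 3, WaitingReason: "CrashLoopBackOff",
	}}
	fmt.Println("degraded:", crashLooping(pod))
}
```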
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581