Description of problem: At the time of teardown/must-gather one of the openshift-apiserver pods was crashlooping, but the operator reported Available=True / Degraded=False.

job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14591

operator status: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14591/artifacts/e2e-aws-upgrade/clusteroperators.json

{
  "apiVersion": "config.openshift.io/v1",
  "kind": "ClusterOperator",
  "metadata": {
    "creationTimestamp": "2020-01-16T18:03:28Z",
    "generation": 1,
    "name": "openshift-apiserver",
    "resourceVersion": "25847",
    "selfLink": "/apis/config.openshift.io/v1/clusteroperators/openshift-apiserver",
    "uid": "7f983248-388a-11ea-ac03-12690cd56899"
  },
  "spec": {},
  "status": {
    "conditions": [
      {
        "lastTransitionTime": "2020-01-16T18:06:10Z",
        "reason": "AsExpected",
        "status": "False",
        "type": "Degraded"
      },
      {
        "lastTransitionTime": "2020-01-16T18:28:48Z",
        "reason": "AsExpected",
        "status": "False",
        "type": "Progressing"
      },
      {
        "lastTransitionTime": "2020-01-16T18:31:38Z",
        "reason": "AsExpected",
        "status": "True",
        "type": "Available"
      },
      {
        "lastTransitionTime": "2020-01-16T18:03:28Z",
        "reason": "AsExpected",
        "status": "True",
        "type": "Upgradeable"
      }
    ],
    ...

crashlooping pod: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14591/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b7bf9a252c3ac57a660c46b89bb9fc3782163b156d57de3be5538286b90af020/namespaces/openshift-apiserver/pods/apiserver-dhjz9/apiserver-dhjz9.yaml

- containerID: cri-o://82d109fa3f04c87fb2d5753c9eb8ff06be4ab43bb696c8065e9ce60472f8fc4f
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1d4abd6d6ce58625de347d7c5f57cd1ed5882824bcf158eb237b9639eafaca36
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1d4abd6d6ce58625de347d7c5f57cd1ed5882824bcf158eb237b9639eafaca36
  lastState:
    terminated:
      containerID: cri-o://82d109fa3f04c87fb2d5753c9eb8ff06be4ab43bb696c8065e9ce60472f8fc4f
      exitCode: 255
      finishedAt: 2020-01-16T19:46:17Z
      message: |
        shift.io count at <storage-prefix>//rangeallocations
        I0116 19:45:32.819197 1 client.go:361] parsed scheme: "endpoint"
        I0116 19:45:32.819262 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://etcd.openshift-etcd.svc:2379 0 <nil>}]
        I0116 19:45:42.830227 1 store.go:1342] Monitoring templates.template.openshift.io count at <storage-prefix>//templates
        I0116 19:45:42.830915 1 client.go:361] parsed scheme: "endpoint"
        I0116 19:45:42.830949 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://etcd.openshift-etcd.svc:2379 0 <nil>}]
        I0116 19:45:47.843368 1 store.go:1342] Monitoring templateinstances.template.openshift.io count at <storage-prefix>//templateinstances
        I0116 19:45:47.844017 1 client.go:361] parsed scheme: "endpoint"
        I0116 19:45:47.844064 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://etcd.openshift-etcd.svc:2379 0 <nil>}]
        I0116 19:45:47.856317 1 store.go:1342] Monitoring brokertemplateinstances.template.openshift.io count at <storage-prefix>//brokertemplateinstances
        I0116 19:45:47.885165 1 client.go:361] parsed scheme: "endpoint"
        I0116 19:45:47.885307 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://etcd.openshift-etcd.svc:2379 0 <nil>}]
        I0116 19:45:57.899321 1 store.go:1342] Monitoring users.user.openshift.io count at <storage-prefix>//users
        I0116 19:45:57.899963 1 client.go:361] parsed scheme: "endpoint"
        I0116 19:45:57.900064 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://etcd.openshift-etcd.svc:2379 0 <nil>}]
        W0116 19:46:12.934932 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd.openshift-etcd.svc:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd.openshift-etcd.svc on 172.30.0.10:53: no such host". Reconnecting...
        F0116 19:46:17.900152 1 openshift_apiserver.go:420] context deadline exceeded
      reason: Error
      startedAt: 2020-01-16T19:44:26Z
  name: openshift-apiserver
  ready: false
  restartCount: 21
  state:
    waiting:
      message: Back-off 5m0s restarting failed container=openshift-apiserver pod=apiserver-dhjz9_openshift-apiserver(04b732f9-388e-11ea-b835-12a069e7f9b5)
      reason: CrashLoopBackOff
From the failed job's DaemonSet [1]:

metadata:
  ...
  generation: 5
  ...
status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 2
  numberMisscheduled: 0
  numberReady: 2
  numberUnavailable: 1
  observedGeneration: 5
  updatedNumberScheduled: 1

So we have:

* numberAvailable > 0, so we are Available=True
* observedGeneration == generation, which is necessary, but not sufficient, for Progressing=False
* numberReady < currentNumberScheduled, so we should be Progressing=True
* updatedNumberScheduled != currentNumberScheduled, so we should be Progressing=True
* numberUnavailable > 0 for a while, so we should be Degraded=True

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14591/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b7bf9a252c3ac57a660c46b89bb9fc3782163b156d57de3be5538286b90af020/namespaces/openshift-apiserver/apps/daemonsets.yaml
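The condition rules in the list above can be sketched as a small Go function. This is an illustrative approximation, not the operator's actual code; the type and function names are made up, and the "for a while" grace period on Degraded is deliberately elided:

```go
package main

import "fmt"

// DaemonSetStatus mirrors the fields of the DaemonSet status quoted above.
type DaemonSetStatus struct {
	Generation             int64 // from metadata.generation
	ObservedGeneration     int64
	DesiredNumberScheduled int32
	CurrentNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
	NumberAvailable        int32
	NumberUnavailable      int32
}

// conditions applies the rules from the analysis above.
func conditions(s DaemonSetStatus) (available, progressing, degraded bool) {
	// numberAvailable > 0 => Available=True
	available = s.NumberAvailable > 0
	// Any of these => Progressing=True; observedGeneration == generation
	// alone is necessary but not sufficient for Progressing=False.
	progressing = s.ObservedGeneration != s.Generation ||
		s.NumberReady < s.CurrentNumberScheduled ||
		s.UpdatedNumberScheduled != s.CurrentNumberScheduled
	// A real operator would only report Degraded after numberUnavailable > 0
	// persists for a while; that grace period is elided here.
	degraded = s.NumberUnavailable > 0
	return
}

func main() {
	// The numbers from the failed job's DaemonSet status.
	s := DaemonSetStatus{
		Generation: 5, ObservedGeneration: 5,
		DesiredNumberScheduled: 3, CurrentNumberScheduled: 3,
		UpdatedNumberScheduled: 1, NumberReady: 2,
		NumberAvailable: 2, NumberUnavailable: 1,
	}
	a, p, d := conditions(s)
	fmt.Printf("Available=%v Progressing=%v Degraded=%v\n", a, p, d)
	// prints "Available=true Progressing=true Degraded=true"
}
```

Run against the failed job's numbers, this yields Available=True, Progressing=True, Degraded=True, while the operator actually reported Progressing=False and Degraded=False.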
Verified in 4.4.0-0.nightly-2020-01-22-073853 env:

$ oc get svc -n openshift-etcd etcd -o yaml > svc-etcd-openshift-etcd.yaml
$ oc delete svc -n openshift-etcd etcd
$ oc delete po apiserver-72xcp -n openshift-apiserver  ## make a new pod start while etcd.openshift-etcd.svc is unresolvable
$ oc get pod -n openshift-apiserver
NAME              READY   STATUS             RESTARTS   AGE
apiserver-9lmgf   0/1     CrashLoopBackOff   3          4m29s
apiserver-nmp4z   1/1     Running            0          117m
apiserver-tg885   1/1     Running            0          115m
$ oc get pod apiserver-9lmgf -o yaml -n openshift-apiserver
...
    exitCode: 255
    finishedAt: "2020-01-22T10:58:15Z"
    message: |
      ...
      W0122 10:58:11.503375 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd.openshift-etcd.svc:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd.openshift-etcd.svc on 172.30.0.10:53: no such host". Reconnecting...
      F0122 10:58:15.061768 1 openshift_apiserver.go:420] context deadline exceeded
    reason: Error
    startedAt: "2020-01-22T10:57:53Z"
  name: openshift-apiserver
  ready: false
  restartCount: 3
  started: false
  state:
    waiting:
      message: back-off 40s restarting failed container=openshift-apiserver pod=apiserver-9lmgf_openshift-apiserver(3aa7e092-6c6c-46d5-9d62-5a095142563f)
      reason: CrashLoopBackOff
...
$ oc get co openshift-apiserver  ## DEGRADED becomes True
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-apiserver   4.4.0-0.nightly-2020-01-22-073853   True        False         True       117m
$ oc create -f svc-etcd-openshift-etcd.yaml  ## restore the etcd service saved above
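The behavior verified above (the operator going Degraded=True once an apiserver pod crashloops) can be sketched as a check over pod container statuses. This is a hedged illustration, not the actual operator code; the `ContainerStatus` struct and `crashLooping` helper are hypothetical names:

```go
package main

import "fmt"

// ContainerStatus holds the fields inspected from pod.status.containerStatuses.
type ContainerStatus struct {
	Name          string
	Ready         bool
	RestartCount  int32
	WaitingReason string // state.waiting.reason, e.g. "CrashLoopBackOff"
}

// crashLooping reports whether any container is stuck in CrashLoopBackOff
// and not ready, i.e. whether the operator should go Degraded=True.
func crashLooping(statuses []ContainerStatus) bool {
	for _, cs := range statuses {
		if cs.WaitingReason == "CrashLoopBackOff" && !cs.Ready {
			return true
		}
	}
	return false
}

func main() {
	// Mirrors apiserver-9lmgf from the verification above.
	pod := []ContainerStatus{{
		Name: "openshift-apiserver", Ready: false,
		RestartCount: 3, WaitingReason: "CrashLoopBackOff",
	}}
	fmt.Println("Degraded:", crashLooping(pod))
	// prints "Degraded: true"
}
```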
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581