Bug 1792382 - apiserver did not report degraded when it had a crashlooping pod
Summary: apiserver did not report degraded when it had a crashlooping pod
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.4.0
Assignee: David Eads
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-01-17 15:24 UTC by Ben Parees
Modified: 2020-05-04 11:25 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:24:53 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub openshift/cluster-openshift-apiserver-operator pull 301 (closed): bug 1792382: large scale condition rewrites (2021-01-20 19:46:55 UTC)
- Red Hat Product Errata RHBA-2020:0581 (2020-05-04 11:25:35 UTC)

Description Ben Parees 2020-01-17 15:24:14 UTC
Description of problem:
At the time of teardown/must-gather, one of the openshift-apiserver pods was crashlooping, but the operator reported Available=True/Degraded=False.

job:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14591


operator status:
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14591/artifacts/e2e-aws-upgrade/clusteroperators.json

{
            "apiVersion": "config.openshift.io/v1",
            "kind": "ClusterOperator",
            "metadata": {
                "creationTimestamp": "2020-01-16T18:03:28Z",
                "generation": 1,
                "name": "openshift-apiserver",
                "resourceVersion": "25847",
                "selfLink": "/apis/config.openshift.io/v1/clusteroperators/openshift-apiserver",
                "uid": "7f983248-388a-11ea-ac03-12690cd56899"
            },
            "spec": {},
            "status": {
                "conditions": [
                    {
                        "lastTransitionTime": "2020-01-16T18:06:10Z",
                        "reason": "AsExpected",
                        "status": "False",
                        "type": "Degraded"
                    },
                    {
                        "lastTransitionTime": "2020-01-16T18:28:48Z",
                        "reason": "AsExpected",
                        "status": "False",
                        "type": "Progressing"
                    },
                    {
                        "lastTransitionTime": "2020-01-16T18:31:38Z",
                        "reason": "AsExpected",
                        "status": "True",
                        "type": "Available"
                    },
                    {
                        "lastTransitionTime": "2020-01-16T18:03:28Z",
                        "reason": "AsExpected",
                        "status": "True",
                        "type": "Upgradeable"
                    }
                ],
                ...
crashlooping pod:
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14591/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b7bf9a252c3ac57a660c46b89bb9fc3782163b156d57de3be5538286b90af020/namespaces/openshift-apiserver/pods/apiserver-dhjz9/apiserver-dhjz9.yaml


  - containerID: cri-o://82d109fa3f04c87fb2d5753c9eb8ff06be4ab43bb696c8065e9ce60472f8fc4f
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1d4abd6d6ce58625de347d7c5f57cd1ed5882824bcf158eb237b9639eafaca36
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1d4abd6d6ce58625de347d7c5f57cd1ed5882824bcf158eb237b9639eafaca36
    lastState:
      terminated:
        containerID: cri-o://82d109fa3f04c87fb2d5753c9eb8ff06be4ab43bb696c8065e9ce60472f8fc4f
        exitCode: 255
        finishedAt: 2020-01-16T19:46:17Z
        message: |
          shift.io count at <storage-prefix>//rangeallocations
          I0116 19:45:32.819197       1 client.go:361] parsed scheme: "endpoint"
          I0116 19:45:32.819262       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://etcd.openshift-etcd.svc:2379 0  <nil>}]
          I0116 19:45:42.830227       1 store.go:1342] Monitoring templates.template.openshift.io count at <storage-prefix>//templates
          I0116 19:45:42.830915       1 client.go:361] parsed scheme: "endpoint"
          I0116 19:45:42.830949       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://etcd.openshift-etcd.svc:2379 0  <nil>}]
          I0116 19:45:47.843368       1 store.go:1342] Monitoring templateinstances.template.openshift.io count at <storage-prefix>//templateinstances
          I0116 19:45:47.844017       1 client.go:361] parsed scheme: "endpoint"
          I0116 19:45:47.844064       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://etcd.openshift-etcd.svc:2379 0  <nil>}]
          I0116 19:45:47.856317       1 store.go:1342] Monitoring brokertemplateinstances.template.openshift.io count at <storage-prefix>//brokertemplateinstances
          I0116 19:45:47.885165       1 client.go:361] parsed scheme: "endpoint"
          I0116 19:45:47.885307       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://etcd.openshift-etcd.svc:2379 0  <nil>}]
          I0116 19:45:57.899321       1 store.go:1342] Monitoring users.user.openshift.io count at <storage-prefix>//users
          I0116 19:45:57.899963       1 client.go:361] parsed scheme: "endpoint"
          I0116 19:45:57.900064       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://etcd.openshift-etcd.svc:2379 0  <nil>}]
          W0116 19:46:12.934932       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd.openshift-etcd.svc:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd.openshift-etcd.svc on 172.30.0.10:53: no such host". Reconnecting...
          F0116 19:46:17.900152       1 openshift_apiserver.go:420] context deadline exceeded
        reason: Error
        startedAt: 2020-01-16T19:44:26Z
    name: openshift-apiserver
    ready: false
    restartCount: 21
    state:
      waiting:
        message: Back-off 5m0s restarting failed container=openshift-apiserver pod=apiserver-dhjz9_openshift-apiserver(04b732f9-388e-11ea-b835-12a069e7f9b5)
        reason: CrashLoopBackOff

Comment 2 W. Trevor King 2020-01-17 19:31:32 UTC
From the failed job's DaemonSet [1]:

  metadata:
    ...
    generation: 5
  ...
  status:
    currentNumberScheduled: 3
    desiredNumberScheduled: 3
    numberAvailable: 2
    numberMisscheduled: 0
    numberReady: 2
    numberUnavailable: 1
    observedGeneration: 5
    updatedNumberScheduled: 1

So we have:

* numberAvailable > 0, we are Available=True
* observedGeneration == generation, which is necessary, but not sufficient, for Progressing=False
* numberReady < currentNumberScheduled, so we are Progressing=True
* updatedNumberScheduled != currentNumberScheduled, so we are Progressing=True
* numberUnavailable > 0 for a while, so we should be Degraded=True

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14591/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b7bf9a252c3ac57a660c46b89bb9fc3782163b156d57de3be5538286b90af020/namespaces/openshift-apiserver/apps/daemonsets.yaml
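The condition logic described above can be sketched as follows. This is a hypothetical illustration (the function name and structure are assumptions, not the operator's actual Go code), using the DaemonSet numbers from the failed job:

```python
# Hypothetical sketch: deriving ClusterOperator-style conditions from
# DaemonSet status fields, following the reasoning in this comment.
# Not the operator's real implementation.

def conditions_from_daemonset(generation, status):
    """Map DaemonSet status numbers to Available/Progressing/Degraded booleans."""
    available = status["numberAvailable"] > 0
    progressing = (
        status["observedGeneration"] != generation
        or status["numberReady"] < status["currentNumberScheduled"]
        or status["updatedNumberScheduled"] != status["currentNumberScheduled"]
    )
    # In practice Degraded should only fire after pods stay unavailable
    # "for a while"; the grace period is elided in this sketch.
    degraded = status["numberUnavailable"] > 0
    return {"Available": available, "Progressing": progressing, "Degraded": degraded}

# Numbers taken from the failed job's DaemonSet status above:
status = {
    "currentNumberScheduled": 3,
    "desiredNumberScheduled": 3,
    "numberAvailable": 2,
    "numberMisscheduled": 0,
    "numberReady": 2,
    "numberUnavailable": 1,
    "observedGeneration": 5,
    "updatedNumberScheduled": 1,
}
print(conditions_from_daemonset(5, status))
# {'Available': True, 'Progressing': True, 'Degraded': True}
```

With these inputs all three conditions should be set, which is exactly what the bug is about: the operator reported Degraded=False despite numberUnavailable > 0.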

Comment 4 Xingxing Xia 2020-01-22 11:14:41 UTC
Verified in 4.4.0-0.nightly-2020-01-22-073853 env:
$ oc get svc -n openshift-etcd etcd -o yaml > svc-etcd-openshift-etcd.yaml
$ oc delete svc -n openshift-etcd etcd
$ oc delete po apiserver-72xcp -n openshift-apiserver  ## force a new pod to start while etcd.openshift-etcd.svc cannot resolve
$ oc get pod -n openshift-apiserver
NAME              READY   STATUS             RESTARTS   AGE
apiserver-9lmgf   0/1     CrashLoopBackOff   3          4m29s
apiserver-nmp4z   1/1     Running            0          117m
apiserver-tg885   1/1     Running            0          115m
$ oc get pod apiserver-9lmgf -o yaml -n openshift-apiserver
...
        exitCode: 255
        finishedAt: "2020-01-22T10:58:15Z"
        message: |
          ...
          W0122 10:58:11.503375       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd.openshift-etcd.svc:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd.openshift-etcd.svc on 172.30.0.10:53: no such host". Reconnecting...
          F0122 10:58:15.061768       1 openshift_apiserver.go:420] context deadline exceeded
        reason: Error
        startedAt: "2020-01-22T10:57:53Z"
    name: openshift-apiserver
    ready: false
    restartCount: 3
    started: false
    state:
      waiting:
        message: back-off 40s restarting failed container=openshift-apiserver pod=apiserver-9lmgf_openshift-apiserver(3aa7e092-6c6c-46d5-9d62-5a095142563f)
        reason: CrashLoopBackOff
...
$ oc get co openshift-apiserver  ## DEGRADED becomes True
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-apiserver                        4.4.0-0.nightly-2020-01-22-073853   True        False         True       117m
$ oc create -f svc-etcd-openshift-etcd.yaml
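The by-eye check of the DEGRADED column above can also be done programmatically against `oc get clusteroperator openshift-apiserver -o json`. A minimal sketch (helper name and the trimmed sample JSON are assumptions for illustration):

```python
# Hypothetical helper: extract one condition's status from ClusterOperator
# JSON, as produced by `oc get co <name> -o json`.
import json

def condition_status(co, cond_type):
    """Return the status string of the named condition, or "Unknown"."""
    for cond in co["status"]["conditions"]:
        if cond["type"] == cond_type:
            return cond["status"]
    return "Unknown"

# Trimmed sample shaped like this bug's clusteroperators.json, but after
# the fix, with Degraded flipped to True by the crashlooping pod:
co = json.loads("""{
  "kind": "ClusterOperator",
  "metadata": {"name": "openshift-apiserver"},
  "status": {"conditions": [
    {"type": "Degraded", "status": "True"},
    {"type": "Available", "status": "True", "reason": "AsExpected"}
  ]}
}""")

print(condition_status(co, "Degraded"))   # -> True
```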

Comment 6 errata-xmlrpc 2020-05-04 11:24:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

