Bug 1897361
Summary: | The MCO GCP-OP tests fail consistently on containerruntime tests with "connection refused" | |
--- | --- | --- | ---
Product: | OpenShift Container Platform | Reporter: | Yu Qi Zhang <jerzhang>
Component: | Node | Assignee: | Peter Hunt <pehunt>
Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | medium | |
Priority: | medium | CC: | aos-bugs, kgarriso, pchavva, pehunt, rphillips, wking
Version: | 4.7 | |
Target Milestone: | --- | |
Target Release: | 4.7.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-02-24 15:33:26 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1899368 | |
Description (Yu Qi Zhang, 2020-11-12 21:19:09 UTC)
Not sure if the component is correct, but this is the failing command: https://github.com/openshift/machine-config-operator/blob/b9ce2e4d84858331da74119b28dd5bdeb6c18dd0/test/e2e/utils_test.go#L167

which is trying to read the CRI-O config via: https://github.com/openshift/machine-config-operator/blob/e7baf4ed61c55f4d28abc770d1de00e052a50156/test/e2e/ctrcfg_test.go#L115

and thus resulting in:

```
utils_test.go:180:
    Error Trace:  utils_test.go:180
                  ctrcfg_test.go:181
                  ctrcfg_test.go:115
                  ctrcfg_test.go:29
    Error:        Expected nil, but got: &exec.ExitError{ProcessState:(*os.ProcessState)(0xc000215440), Stderr:[]uint8(nil)}
    Test:         TestContainerRuntimeConfigLogLevel
    Messages:     failed to exec cmd [cat /rootfs/etc/crio/crio.conf.d/00-default] on node ci-op-p6ihtdsq-1354f-fjg48-worker-c-w29h2:
                  I1112 20:44:02.562732 8521 request.go:645] Throttling request took 1.181831179s, request: GET:https://api.ci-op-p6ihtdsq-1354f.origin-ci-int-gce.dev.openshift.com:6443/apis/certificates.k8s.io/v1?timeout=32s
                  Error from server: error dialing backend: dial tcp 10.0.32.2:10250: connect: connection refused
```

---

Ryan found an SDN-related error in the kubelet logs on the target node, which we traced back to:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2213/pull-ci-openshift-machine-config-operator-release-4.6-e2e-gcp-op/1326980267755180032/artifacts/e2e-gcp-op/gather-extra/pods.json | jq -r '.items[] | select(.metadata.name == "sdn-q55tn").status.containerStatuses[] | select(.name == "kube-rbac-proxy")'
{
  "containerID": "cri-o://ea739d5211d6645011dcfeb0fe4c76265837ba7991ce644ccc06ecfc036aff97",
  "image": "registry.build01.ci.openshift.org/ci-op-l4ng4fvi/stable@sha256:9efd0551d14278a761a480667bf39da79ba93282509d828efa1ab207ce21033d",
  "imageID": "registry.build01.ci.openshift.org/ci-op-l4ng4fvi/stable@sha256:9efd0551d14278a761a480667bf39da79ba93282509d828efa1ab207ce21033d",
  "lastState": {
    "terminated": {
      "containerID": "cri-o://3cada243aed047937564c3d460c53fc182f344cace46204cbbbd755d11c14ccc",
      "exitCode": 1,
      "finishedAt": "2020-11-12T21:02:26Z",
      "message": "om None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\nTraceback (most recent call last):\n File \"<string>\", line 1, in <module>\n File \"/usr/lib64/python3.6/json/__init__.py\", line 299, in load\n parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)\n File \"/usr/lib64/python3.6/json/__init__.py\", line 354, in loads\n return _default_decoder.decode(s)\n File \"/usr/lib64/python3.6/json/decoder.py\", line 339, in decode\n obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n File \"/usr/lib64/python3.6/json/decoder.py\", line 357, in raw_decode\n raise JSONDecodeError(\"Expecting value\", s, err.value) from None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\nTraceback (most recent call last):\n File \"<string>\", line 1, in <module>\n File \"/usr/lib64/python3.6/json/__init__.py\", line 299, in load\n parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)\n File \"/usr/lib64/python3.6/json/__init__.py\", line 354, in loads\n return _default_decoder.decode(s)\n File \"/usr/lib64/python3.6/json/decoder.py\", line 339, in decode\n obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n File \"/usr/lib64/python3.6/json/decoder.py\", line 357, in raw_decode\n raise JSONDecodeError(\"Expecting value\", s, err.value) from None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\nTraceback (most recent call last):\n File \"<string>\", line 1, in <module>\n File \"/usr/lib64/python3.6/json/__init__.py\", line 299, in load\n parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)\n File \"/usr/lib64/python3.6/json/__init__.py\", line 354, in loads\n return _default_decoder.decode(s)\n File \"/usr/lib64/python3.6/json/decoder.py\", line 339, in decode\n obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n File \"/usr/lib64/python3.6/json/decoder.py\", line 357, in raw_decode\n raise JSONDecodeError(\"Expecting value\", s, err.value) from None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\n",
      "reason": "Error",
      "startedAt": "2020-11-12T21:02:22Z"
    }
  },
  "name": "kube-rbac-proxy",
  "ready": true,
  "restartCount": 1,
  "started": true,
  "state": {
    "running": {
      "startedAt": "2020-11-12T21:02:27Z"
    }
  }
}
```

kube-rbac-proxy has been fixed for 4.7 via bug 1885002. That fix hasn't been backported to 4.6, but 4.7 MCO CI is failing with the same 'oc rsh ...' kubelet connection errors (e.g. [1]), so the kube-rbac-proxy issue is a red herring here.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2098/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1326984157745647616

---

Trevor's summary with my diagnosis:

1. Test triggers a pool update.
2. Test waits until it thinks the pool is done.
3. The MCO race trips and the pool is claimed done too fast.
4. Test tries to 'oc rsh ...' a node.
5. That node is mid-roll, so the connection fails.
6. Test fails, complaining about the failed kubelet connection.

[1] shows a config render taking 2 seconds; rolling a change out to the cluster should take minutes. [2] has a sleep whose comment states there is a race during the rollout that causes pool readiness to be reported too early.

Increasing the Sleep() might fix the issue... Is there a way to not sleep and instead know for sure the rollout has happened?

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2098/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1326984157745647616/build-log.txt
[2] https://github.com/openshift/machine-config-operator/blob/e7baf4ed61c55f4d28abc770d1de00e052a50156/test/e2e/ctrcfg_test.go#L108-L110

Reassigning to MCO to comment on the race... perhaps there is a better way to update the test.
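To the question above about avoiding the sleep: one option would be to poll the MachineConfigPool object itself until it reports both the expected rendered config and fully updated machine counts, rather than sleeping for a fixed interval. The sketch below is illustrative only, not the existing MCO test helper; the waitForPoolRollout name, the getPool parameter, the 2-second poll interval, and the package name are assumptions, and a real test would plug in whatever pool getter and rendered-config name (e.g. from waitForRenderedConfig()) it already has.

```go
// Minimal sketch, assuming the machineconfiguration v1 types vendored in the
// MCO repo; waitForPoolRollout and getPool are hypothetical names, not part
// of the existing test suite.
package e2e_test

import (
	"time"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForPoolRollout polls the pool until it reports the target rendered
// config and a true Updated condition, instead of sleeping and hoping the
// rollout has both started and finished.
func waitForPoolRollout(getPool func() (*mcfgv1.MachineConfigPool, error), renderedConfig string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pool, err := getPool()
		if err != nil {
			return false, err
		}
		// The pool must point at the new rendered config; checking the Updated
		// condition alone can succeed before the controller has even observed
		// the change, which is exactly the race described in the summary above.
		if pool.Status.Configuration.Name != renderedConfig {
			return false, nil
		}
		// Every machine must have picked up the new config and be ready again.
		if pool.Status.UpdatedMachineCount != pool.Status.MachineCount ||
			pool.Status.ReadyMachineCount != pool.Status.MachineCount {
			return false, nil
		}
		// Finally, require the controller's own Updated condition.
		for _, cond := range pool.Status.Conditions {
			if cond.Type == mcfgv1.MachineConfigPoolUpdated {
				return cond.Status == corev1.ConditionTrue, nil
			}
		}
		return false, nil
	})
}
```

With a wait like this, step 2 of the summary would only return once every machine in the pool reports the new config, so the 'oc rsh ...' in step 4 should not land on a node that is still mid-roll.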
---

Those ctrcfg tests were written by and are owned by the node team; the MCD tests don't require a sleep because they don't have to do that "delete ctrcfg" step or account for it (https://github.com/openshift/machine-config-operator/blob/e7baf4ed61c55f4d28abc770d1de00e052a50156/test/e2e/ctrcfg_test.go#L110). We just need to waitForRenderedConfig() and waitForPoolComplete(). I don't have any background on that comment; looking at the history, perhaps Peter has more color on it for us (https://github.com/openshift/machine-config-operator/pull/1414#issuecomment-582619470). But the sleep has been around since 4.4 (backported to 4.3) and the ctrcfg tests have not changed since April, so I'm not sure why this is showing up now.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633