Description of problem:

Starting around Nov. 10/11, the MCO GCP-OP tests have not had a successful run. Tests fail consistently with:

failed to exec cmd [cat /rootfs/etc/crio/crio.conf.d/00-default] on node jerzhang-201112-test2-rsj5m-worker-c-qn7zk: Error from server: error dialing backend: dial tcp 10.0.32.4:10250: connect: connection refused

See the following for recent run examples:
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-release-4.6-e2e-gcp-op
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op

I bisected 4.7 nightly releases and determined that one of the following payloads started breaking the tests:
https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2020-11-10-194604
https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2020-11-10-205538
https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2020-11-10-220956
https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2020-11-10-232221

That is, I've tested 163946 (tested twice, both pass), 194604 (fail, with a slightly different error: timeout), and 232221 (fail: connection refused).

For 194604 the timeout error looks like:

TestContainerRuntimeConfigLogLevel: utils_test.go:86: Pool node-log-level-f62d2ba4-8f63-4d58-b41b-383fb23c5939 has completed rendered-node-log-level-f62d2ba4-8f63-4d58-b41b-383fb23c5939-7d3fbb531435299e13d0b1aa6cccfac3 (waited 3.514754937s)
TestContainerRuntimeConfigLogLevel: utils_test.go:180:
    Error Trace: utils_test.go:180
                 ctrcfg_test.go:181
                 ctrcfg_test.go:115
                 ctrcfg_test.go:29
    Error:       Expected nil, but got: &exec.ExitError{ProcessState:(*os.ProcessState)(0xc0003ffa80), Stderr:[]uint8(nil)}
    Test:        TestContainerRuntimeConfigLogLevel
    Messages:    failed to exec cmd [cat /rootfs/etc/crio/crio.conf.d/00-default] on node ci-ln-gfk2h4t-f76d1-gfcl2-worker-c-7lmqq: error: Timeout occured

which I'm not sure is related.

Offhand, none of the changes in the above nightlies looks very suspicious. In that timeframe I can also see the bump to cri-o for master:

cri-o cri-o-0-1.20.0-0.rhaos4.7.git8e23406.el8.9-x86_64 -> cri-o-0-1.20.0-0.rhaos4.7.git3b1a241.el8.10-x86_64

But that doesn't look like the change that caused this problem, so I'm not sure what changed to make the tests suddenly start failing. Let me know if there is any other info that might be helpful.

Version-Release number of selected component (if applicable): 4.6, 4.7

How reproducible: 100%

Steps to Reproduce: CI; manually running the MCO e2e tests

Actual results: fail

Expected results: pass

Additional info:
Not sure if the component is correct, but this is the failing command:
https://github.com/openshift/machine-config-operator/blob/b9ce2e4d84858331da74119b28dd5bdeb6c18dd0/test/e2e/utils_test.go#L167

which is attempting to read the crio config via:
https://github.com/openshift/machine-config-operator/blob/e7baf4ed61c55f4d28abc770d1de00e052a50156/test/e2e/ctrcfg_test.go#L115

and thus resulting in:

utils_test.go:180:
    Error Trace: utils_test.go:180
                 ctrcfg_test.go:181
                 ctrcfg_test.go:115
                 ctrcfg_test.go:29
    Error:       Expected nil, but got: &exec.ExitError{ProcessState:(*os.ProcessState)(0xc000215440), Stderr:[]uint8(nil)}
    Test:        TestContainerRuntimeConfigLogLevel
    Messages:    failed to exec cmd [cat /rootfs/etc/crio/crio.conf.d/00-default] on node ci-op-p6ihtdsq-1354f-fjg48-worker-c-w29h2:
                 I1112 20:44:02.562732 8521 request.go:645] Throttling request took 1.181831179s, request: GET:https://api.ci-op-p6ihtdsq-1354f.origin-ci-int-gce.dev.openshift.com:6443/apis/certificates.k8s.io/v1?timeout=32s
                 Error from server: error dialing backend: dial tcp 10.0.32.2:10250: connect: connection refused
Ryan found an SDN-related error in the kubelet logs on the target node, which we traced back to:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2213/pull-ci-openshift-machine-config-operator-release-4.6-e2e-gcp-op/1326980267755180032/artifacts/e2e-gcp-op/gather-extra/pods.json | jq -r '.items[] | select(.metadata.name == "sdn-q55tn").status.containerStatuses[] | select(.name == "kube-rbac-proxy")'
{
  "containerID": "cri-o://ea739d5211d6645011dcfeb0fe4c76265837ba7991ce644ccc06ecfc036aff97",
  "image": "registry.build01.ci.openshift.org/ci-op-l4ng4fvi/stable@sha256:9efd0551d14278a761a480667bf39da79ba93282509d828efa1ab207ce21033d",
  "imageID": "registry.build01.ci.openshift.org/ci-op-l4ng4fvi/stable@sha256:9efd0551d14278a761a480667bf39da79ba93282509d828efa1ab207ce21033d",
  "lastState": {
    "terminated": {
      "containerID": "cri-o://3cada243aed047937564c3d460c53fc182f344cace46204cbbbd755d11c14ccc",
      "exitCode": 1,
      "finishedAt": "2020-11-12T21:02:26Z",
      "message": "om None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\nTraceback (most recent call last):\n File \"<string>\", line 1, in <module>\n File \"/usr/lib64/python3.6/json/__init__.py\", line 299, in load\n parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)\n File \"/usr/lib64/python3.6/json/__init__.py\", line 354, in loads\n return _default_decoder.decode(s)\n File \"/usr/lib64/python3.6/json/decoder.py\", line 339, in decode\n obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n File \"/usr/lib64/python3.6/json/decoder.py\", line 357, in raw_decode\n raise JSONDecodeError(\"Expecting value\", s, err.value) from None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\nTraceback (most recent call last):\n File \"<string>\", line 1, in <module>\n File \"/usr/lib64/python3.6/json/__init__.py\", line 299, in load\n parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)\n File \"/usr/lib64/python3.6/json/__init__.py\", line 354, in loads\n return _default_decoder.decode(s)\n File \"/usr/lib64/python3.6/json/decoder.py\", line 339, in decode\n obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n File \"/usr/lib64/python3.6/json/decoder.py\", line 357, in raw_decode\n raise JSONDecodeError(\"Expecting value\", s, err.value) from None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\nTraceback (most recent call last):\n File \"<string>\", line 1, in <module>\n File \"/usr/lib64/python3.6/json/__init__.py\", line 299, in load\n parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)\n File \"/usr/lib64/python3.6/json/__init__.py\", line 354, in loads\n return _default_decoder.decode(s)\n File \"/usr/lib64/python3.6/json/decoder.py\", line 339, in decode\n obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n File \"/usr/lib64/python3.6/json/decoder.py\", line 357, in raw_decode\n raise JSONDecodeError(\"Expecting value\", s, err.value) from None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\n",
      "reason": "Error",
      "startedAt": "2020-11-12T21:02:22Z"
    }
  },
  "name": "kube-rbac-proxy",
  "ready": true,
  "restartCount": 1,
  "started": true,
  "state": {
    "running": {
      "startedAt": "2020-11-12T21:02:27Z"
    }
  }
}
kube-rbac-proxy has been fixed for 4.7 via bug 1885002. That fix hasn't been backported to 4.6, but 4.7 MCO CI is failing with the same 'oc rsh ...' kubelet connection errors (e.g. [1]), so the kube-rbac-proxy issue is a red herring here.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2098/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1326984157745647616
Trevor's summary with my diagnosis:

1. Test triggers a pool update.
2. Test waits until it thinks the pool is done.
3. MCO race trips; the pool is claimed done too fast.
4. Test tries to 'oc rsh ...' a node.
5. That node is mid-roll, so the connection fails.
6. Test fails, complaining about the failed kubelet connection.

[1] shows a config render taking 2 seconds; it should take minutes to roll out a change to the cluster. [2] has a sleep with a comment stating there is a race during the rollout that causes pool readiness to be reported too early. Increasing the Sleep() might fix the issue... Is there a way to not sleep and know for sure the rollout has happened?

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2098/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1326984157745647616/build-log.txt
[2] https://github.com/openshift/machine-config-operator/blob/e7baf4ed61c55f4d28abc770d1de00e052a50156/test/e2e/ctrcfg_test.go#L108-L110

Reassigning to MCO to comment on the race... perhaps there is a better way to update the test.
Those ctrcfg tests were written by and are owned by the node team. The MCD tests don't require a sleep, since they don't have to do that "delete ctrcfg" step or account for it (https://github.com/openshift/machine-config-operator/blob/e7baf4ed61c55f4d28abc770d1de00e052a50156/test/e2e/ctrcfg_test.go#L110); we just need to waitForRenderedConfig() and waitForPoolComplete(). I don't have any background on the comment; looking at the history, perhaps Peter has more color on it for us (https://github.com/openshift/machine-config-operator/pull/1414#issuecomment-582619470). But the sleep has been around since 4.4 (backported to 4.3), and the ctrcfg tests have not changed since April, so I'm not sure why this is surfacing only now.
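For what "pool complete" could mean without a timer: the waiters above ultimately need a predicate over the pool's reported status. A sketch of such a predicate follows; the struct and field names only loosely mirror what a MachineConfigPool reports and are illustrative, not the actual MCO API types.

```go
package main

import "fmt"

// poolStatus loosely mirrors, for illustration only, the counters a
// machine config pool reports while a rendered config rolls out.
type poolStatus struct {
	CurrentConfig        string // rendered config the pool is converging on
	MachineCount         int32
	UpdatedMachineCount  int32
	ReadyMachineCount    int32
	DegradedMachineCount int32
}

// poolDone treats the pool as complete only when it is on the expected
// rendered config and every machine is updated, ready, and not degraded.
func poolDone(s poolStatus, wantConfig string) bool {
	return s.CurrentConfig == wantConfig &&
		s.UpdatedMachineCount == s.MachineCount &&
		s.ReadyMachineCount == s.MachineCount &&
		s.DegradedMachineCount == 0
}

func main() {
	mid := poolStatus{CurrentConfig: "rendered-worker-abc", MachineCount: 3, UpdatedMachineCount: 1, ReadyMachineCount: 3}
	done := poolStatus{CurrentConfig: "rendered-worker-abc", MachineCount: 3, UpdatedMachineCount: 3, ReadyMachineCount: 3}
	fmt.Println(poolDone(mid, "rendered-worker-abc"), poolDone(done, "rendered-worker-abc")) // prints: false true
}
```

A check like this, evaluated on each poll of the live pool object, would only report readiness once every machine has actually picked up the new config, instead of trusting a single early "done" observation.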
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633