Bug 1897361 - The MCO GCP-OP tests fail consistently on containerruntime tests with "connection refused"
Summary: The MCO GCP-OP tests fail consistently on containerruntime tests with "connection refused"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 1899368
 
Reported: 2020-11-12 21:19 UTC by Yu Qi Zhang
Modified: 2021-05-19 14:52 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:33:26 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub: openshift/machine-config-operator pull 2229 (closed), "Bug 1897361: ctrcfg_test: Wait for our prior target config", last updated 2021-01-27 18:01:38 UTC
- Red Hat Product Errata: RHSA-2020:5633, last updated 2021-02-24 15:33:49 UTC

Description Yu Qi Zhang 2020-11-12 21:19:09 UTC
Description of problem:

Starting around Nov. 10/11, the MCO GCP-OP tests have not had a successful run. Tests fail consistently with:

failed to exec cmd [cat /rootfs/etc/crio/crio.conf.d/00-default] on node jerzhang-201112-test2-rsj5m-worker-c-qn7zk: Error from server: error dialing backend: dial tcp 10.0.32.4:10250: connect: connection refused

See https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-release-4.6-e2e-gcp-op and https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op for recent run examples.

I bisected 4.7 nightly releases and have determined that one of the following payloads started breaking the tests:
https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2020-11-10-194604
https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2020-11-10-205538
https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2020-11-10-220956
https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2020-11-10-232221

That is, I tested 163946 (twice; both runs passed), 194604 (failed, with a slightly different error: a timeout), and 232221 (failed with "connection refused").

For 194604 the timeout error looks like:

TestContainerRuntimeConfigLogLevel: utils_test.go:86: Pool node-log-level-f62d2ba4-8f63-4d58-b41b-383fb23c5939 has completed rendered-node-log-level-f62d2ba4-8f63-4d58-b41b-383fb23c5939-7d3fbb531435299e13d0b1aa6cccfac3 (waited 3.514754937s)
    TestContainerRuntimeConfigLogLevel: utils_test.go:180: 
        	Error Trace:	utils_test.go:180
        	            				ctrcfg_test.go:181
        	            				ctrcfg_test.go:115
        	            				ctrcfg_test.go:29
        	Error:      	Expected nil, but got: &exec.ExitError{ProcessState:(*os.ProcessState)(0xc0003ffa80), Stderr:[]uint8(nil)}
        	Test:       	TestContainerRuntimeConfigLogLevel
        	Messages:   	failed to exec cmd [cat /rootfs/etc/crio/crio.conf.d/00-default] on node ci-ln-gfk2h4t-f76d1-gfcl2-worker-c-7lmqq: error: Timeout occured

I'm not sure whether that timeout failure is related.

Offhand, none of the changes in the above nightlies look very suspicious. In that timeframe I can also see the cri-o bump for master:
cri-o	cri-o-0-1.20.0-0.rhaos4.7.git8e23406.el8.9-x86_64 -> cri-o-0-1.20.0-0.rhaos4.7.git3b1a241.el8.10-x86_64

But that doesn't look like the change that caused this problem.

So I'm not sure what changed to cause this to suddenly start failing. Let me know if there is any other info that might be helpful.

Version-Release number of selected component (if applicable):
4.6, 4.7

How reproducible:
100%

Steps to Reproduce:
In CI, or by manually running the MCO e2e tests.

Actual results:
fail

Expected results:
pass

Additional info:

Comment 2 Yu Qi Zhang 2020-11-12 21:31:32 UTC
Not sure if the component is correct, but this is the failing command:
https://github.com/openshift/machine-config-operator/blob/b9ce2e4d84858331da74119b28dd5bdeb6c18dd0/test/e2e/utils_test.go#L167

This tries to read the CRI-O config via:
https://github.com/openshift/machine-config-operator/blob/e7baf4ed61c55f4d28abc770d1de00e052a50156/test/e2e/ctrcfg_test.go#L115

This results in:
utils_test.go:180: 
        	Error Trace:	utils_test.go:180
        	            				ctrcfg_test.go:181
        	            				ctrcfg_test.go:115
        	            				ctrcfg_test.go:29
        	Error:      	Expected nil, but got: &exec.ExitError{ProcessState:(*os.ProcessState)(0xc000215440), Stderr:[]uint8(nil)}
        	Test:       	TestContainerRuntimeConfigLogLevel
        	Messages:   	failed to exec cmd [cat /rootfs/etc/crio/crio.conf.d/00-default] on node ci-op-p6ihtdsq-1354f-fjg48-worker-c-w29h2: I1112 20:44:02.562732    8521 request.go:645] Throttling request took 1.181831179s, request: GET:https://api.ci-op-p6ihtdsq-1354f.origin-ci-int-gce.dev.openshift.com:6443/apis/certificates.k8s.io/v1?timeout=32s
        	            	Error from server: error dialing backend: dial tcp 10.0.32.2:10250: connect: connection refused
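
For context on where the "dial tcp <node-ip>:10250" comes from: the command runs inside a pod on the target node (the /rootfs path suggests the machine-config-daemon pod), and the exec stream is proxied API server -> kubelet (port 10250) -> container runtime. If the kubelet on that node is restarting mid-roll, the client sees "error dialing backend ... connect: connection refused". Below is a minimal client-go sketch of that path, for illustration only; it is not the actual MCO test helper, and the namespace, label selector, and container name are assumptions.

// Minimal sketch: exec a command in the MCD pod on a given node via client-go.
// The exec stream is proxied through the kubelet on port 10250, which is why a
// restarting kubelet surfaces as "error dialing backend ... connection refused".
package main

import (
	"bytes"
	"context"
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/remotecommand"
)

// execOnNode runs cmd inside the machine-config-daemon pod scheduled on node.
func execOnNode(cfg *rest.Config, cs *kubernetes.Clientset, node string, cmd []string) (string, error) {
	// Find the daemonset pod on the target node (namespace/label are assumptions).
	pods, err := cs.CoreV1().Pods("openshift-machine-config-operator").List(context.TODO(),
		metav1.ListOptions{
			LabelSelector: "k8s-app=machine-config-daemon",
			FieldSelector: "spec.nodeName=" + node,
		})
	if err != nil || len(pods.Items) == 0 {
		return "", fmt.Errorf("no MCD pod found on node %s: %v", node, err)
	}

	// The pod "exec" subresource is what the API server proxies to the kubelet.
	req := cs.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace(pods.Items[0].Namespace).
		Name(pods.Items[0].Name).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: "machine-config-daemon",
			Command:   cmd,
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
	if err != nil {
		return "", err
	}
	var stdout, stderr bytes.Buffer
	// This Stream call returns "error dialing backend ... :10250" when the
	// kubelet on the target node is not listening.
	if err := exec.Stream(remotecommand.StreamOptions{Stdout: &stdout, Stderr: &stderr}); err != nil {
		return "", fmt.Errorf("exec %v failed: %w (stderr: %q)", cmd, err, stderr.String())
	}
	return stdout.String(), nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	out, err := execOnNode(cfg, cs, os.Args[1], []string{"cat", "/rootfs/etc/crio/crio.conf.d/00-default"})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}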

Comment 3 W. Trevor King 2020-11-12 21:47:34 UTC
Ryan found an SDN-related error in the kubelet logs on the target node, which we traced back to:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2213/pull-ci-openshift-machine-config-operator-release-4.6-e2e-gcp-op/1326980267755180032/artifacts/e2e-gcp-op/gather-extra/pods.json | jq -r '.items[] | select(.metadata.name == "sdn-q55tn").status.containerStatuses[] | select(.name == "kube-rbac-proxy")'
{
  "containerID": "cri-o://ea739d5211d6645011dcfeb0fe4c76265837ba7991ce644ccc06ecfc036aff97",
  "image": "registry.build01.ci.openshift.org/ci-op-l4ng4fvi/stable@sha256:9efd0551d14278a761a480667bf39da79ba93282509d828efa1ab207ce21033d",
  "imageID": "registry.build01.ci.openshift.org/ci-op-l4ng4fvi/stable@sha256:9efd0551d14278a761a480667bf39da79ba93282509d828efa1ab207ce21033d",
  "lastState": {
    "terminated": {
      "containerID": "cri-o://3cada243aed047937564c3d460c53fc182f344cace46204cbbbd755d11c14ccc",
      "exitCode": 1,
      "finishedAt": "2020-11-12T21:02:26Z",
      "message": "om None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\nTraceback (most recent call last):\n  File \"<string>\", line 1, in <module>\n  File \"/usr/lib64/python3.6/json/__init__.py\", line 299, in load\n    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)\n  File \"/usr/lib64/python3.6/json/__init__.py\", line 354, in loads\n    return _default_decoder.decode(s)\n  File \"/usr/lib64/python3.6/json/decoder.py\", line 339, in decode\n    obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n  File \"/usr/lib64/python3.6/json/decoder.py\", line 357, in raw_decode\n    raise JSONDecodeError(\"Expecting value\", s, err.value) from None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\nTraceback (most recent call last):\n  File \"<string>\", line 1, in <module>\n  File \"/usr/lib64/python3.6/json/__init__.py\", line 299, in load\n    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)\n  File \"/usr/lib64/python3.6/json/__init__.py\", line 354, in loads\n    return _default_decoder.decode(s)\n  File \"/usr/lib64/python3.6/json/decoder.py\", line 339, in decode\n    obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n  File \"/usr/lib64/python3.6/json/decoder.py\", line 357, in raw_decode\n    raise JSONDecodeError(\"Expecting value\", s, err.value) from None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\nTraceback (most recent call last):\n  File \"<string>\", line 1, in <module>\n  File \"/usr/lib64/python3.6/json/__init__.py\", line 299, in load\n    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)\n  File \"/usr/lib64/python3.6/json/__init__.py\", line 354, in loads\n    return _default_decoder.decode(s)\n  File \"/usr/lib64/python3.6/json/decoder.py\", line 339, in decode\n    obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n  File \"/usr/lib64/python3.6/json/decoder.py\", line 357, in raw_decode\n    raise JSONDecodeError(\"Expecting value\", s, err.value) from None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\n",
      "reason": "Error",
      "startedAt": "2020-11-12T21:02:22Z"
    }
  },
  "name": "kube-rbac-proxy",
  "ready": true,
  "restartCount": 1,
  "started": true,
  "state": {
    "running": {
      "startedAt": "2020-11-12T21:02:27Z"
    }
  }
}

Comment 4 W. Trevor King 2020-11-12 22:14:02 UTC
kube-rbac-proxy has been fixed for 4.7 via bug 1885002. It hasn't been backported to 4.6, but 4.7 MCO CI is failing with the same 'oc rsh ...' kubelet connection errors (e.g. [1]), so the kube-rbac-proxy issue is a red herring here.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2098/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1326984157745647616

Comment 5 Ryan Phillips 2020-11-12 22:29:48 UTC
Trevor's summary with my diagnosis:

  1. The test triggers a pool update.
  2. The test waits until it thinks the pool is done.
  3. An MCO race trips, and the pool is declared done too early.
  4. The test tries to 'oc rsh ...' into a node.
  5. That node is mid-roll, so the connection fails.
  6. The test fails, complaining about the failed kubelet connection.

[1] shows a config render completing in 2 seconds, while rolling a change out to the cluster should take minutes. [2] has a sleep with a comment stating there is a race during the rollout that causes pool readiness to be reported too early.

Increasing the Sleep() might fix the issue... Is there a way to avoid sleeping and know for sure that the rollout has happened? (See the sketch after the links below for one possible approach.)

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2098/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1326984157745647616/build-log.txt
[2] https://github.com/openshift/machine-config-operator/blob/e7baf4ed61c55f4d28abc770d1de00e052a50156/test/e2e/ctrcfg_test.go#L108-L110
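
One way to avoid the fixed sleep is to poll the MachineConfigPool until it reports the target rendered config on every machine, rather than trusting a delay. A minimal sketch of that idea follows; the poolStatus type and getPoolStatus() helper are hypothetical stand-ins for the MCO's generated clientset, and the actual fix may differ in detail.

// Sketch: poll pool status instead of sleeping for a fixed interval.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// poolStatus is a pared-down view of a MachineConfigPool's status; in a real
// test this would come from the MCO's generated clientset.
type poolStatus struct {
	RenderedConfig      string // status.configuration.name
	MachineCount        int32
	UpdatedMachineCount int32
	ReadyMachineCount   int32
}

// getPoolStatus is a hypothetical stand-in for a clientset Get(); it exists
// only so this sketch is self-contained.
func getPoolStatus(pool string) (poolStatus, error) {
	return poolStatus{}, fmt.Errorf("not implemented: fetch MachineConfigPool %q", pool)
}

// waitForPoolToReach blocks until the pool reports that every machine is on
// the target rendered config, or the timeout expires.
func waitForPoolToReach(pool, targetRendered string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		st, err := getPoolStatus(pool)
		if err != nil {
			return false, nil // tolerate transient errors and keep polling
		}
		done := st.RenderedConfig == targetRendered &&
			st.UpdatedMachineCount == st.MachineCount &&
			st.ReadyMachineCount == st.MachineCount
		return done, nil
	})
}

func main() {
	if err := waitForPoolToReach("worker", "rendered-worker-<hash>", 20*time.Minute); err != nil {
		fmt.Println("pool never reached target config:", err)
	}
}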

Reassigning to MCO to comment on the race... perhaps there is a better way to update the test.

Comment 6 Kirsten Garrison 2020-11-12 23:17:51 UTC
Those ctrcfg tests were written by and are owned by the node team. The MCD tests don't require a sleep because we don't have to do that "delete ctrcfg" step or account for it (https://github.com/openshift/machine-config-operator/blob/e7baf4ed61c55f4d28abc770d1de00e052a50156/test/e2e/ctrcfg_test.go#L110); we just need waitForRenderedConfig() and waitForPoolComplete().

I don't have any background on that comment; looking at the history, perhaps Peter has more color on it for us (https://github.com/openshift/machine-config-operator/pull/1414#issuecomment-582619470)? But the sleep has been around since 4.4 (backported to 4.3) and the ctrcfg tests have not changed since April, so I'm not sure why this is suddenly an issue now.
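
For the "delete ctrcfg" step specifically, the direction the eventual fix takes (per the linked PR 2229, "ctrcfg_test: Wait for our prior target config") is to record the pool's rendered config before the ContainerRuntimeConfig is applied and, after deleting it, wait until the pool is back on that prior config rather than sleeping. A rough sketch of that flow follows; the helper signatures are assumptions that mirror, but are not necessarily identical to, the real e2e utilities.

// Sketch of a ctrcfg e2e flow that waits for the prior rendered config
// instead of sleeping. Helper bodies are stubs; only the flow matters here.
package e2e

import "testing"

// Stand-in helpers (assumed signatures), stubbed so the sketch is self-contained.
func getPoolRenderedConfig(pool string) (string, error)            { panic("stub") }
func applyCtrcfg(t *testing.T, name string) (cleanup func())       { panic("stub") }
func waitForRenderedConfig(pool, excluding string) (string, error) { panic("stub") }
func waitForPoolComplete(pool, rendered string) error              { panic("stub") }

func TestCtrcfgRollsBackWithoutSleep(t *testing.T) {
	const pool = "worker"

	// 1. Remember which rendered config the pool is on right now.
	oldRendered, err := getPoolRenderedConfig(pool)
	if err != nil {
		t.Fatal(err)
	}

	// 2. Apply the ContainerRuntimeConfig and wait for the new rendered
	//    config to roll out to every node in the pool.
	cleanup := applyCtrcfg(t, "log-level-debug")
	newRendered, err := waitForRenderedConfig(pool, oldRendered)
	if err != nil {
		t.Fatal(err)
	}
	if err := waitForPoolComplete(pool, newRendered); err != nil {
		t.Fatal(err)
	}

	// ... assertions against the node's crio.conf.d would go here ...

	// 3. Delete the ctrcfg, then wait for the pool to report the prior
	//    rendered config again instead of sleeping for a fixed interval.
	cleanup()
	if err := waitForPoolComplete(pool, oldRendered); err != nil {
		t.Fatal(err)
	}
}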

Comment 10 errata-xmlrpc 2021-02-24 15:33:26 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

