Bug 1897361
Summary: | The MCO GCP-OP tests fail consistently on containerruntime tests with "connection refused" | |
--- | --- | --- | ---
Product: | OpenShift Container Platform | Reporter: | Yu Qi Zhang <jerzhang>
Component: | Node | Assignee: | Peter Hunt <pehunt>
Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | medium | |
Priority: | medium | CC: | aos-bugs, kgarriso, pchavva, pehunt, rphillips, wking
Version: | 4.7 | |
Target Milestone: | --- | |
Target Release: | 4.7.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-02-24 15:33:26 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1899368 | |
Description (Yu Qi Zhang, 2020-11-12 21:19:09 UTC)
Not sure if the component is correct, but this is the failing command: https://github.com/openshift/machine-config-operator/blob/b9ce2e4d84858331da74119b28dd5bdeb6c18dd0/test/e2e/utils_test.go#L167

which is trying to read the CRI-O config via: https://github.com/openshift/machine-config-operator/blob/e7baf4ed61c55f4d28abc770d1de00e052a50156/test/e2e/ctrcfg_test.go#L115

and thus resulting in:

```
utils_test.go:180:
    Error Trace:  utils_test.go:180
                  ctrcfg_test.go:181
                  ctrcfg_test.go:115
                  ctrcfg_test.go:29
    Error:        Expected nil, but got: &exec.ExitError{ProcessState:(*os.ProcessState)(0xc000215440), Stderr:[]uint8(nil)}
    Test:         TestContainerRuntimeConfigLogLevel
    Messages:     failed to exec cmd [cat /rootfs/etc/crio/crio.conf.d/00-default] on node ci-op-p6ihtdsq-1354f-fjg48-worker-c-w29h2:
                  I1112 20:44:02.562732 8521 request.go:645] Throttling request took 1.181831179s, request: GET:https://api.ci-op-p6ihtdsq-1354f.origin-ci-int-gce.dev.openshift.com:6443/apis/certificates.k8s.io/v1?timeout=32s
                  Error from server: error dialing backend: dial tcp 10.0.32.2:10250: connect: connection refused
```

---

Ryan found an SDN-related error in the kubelet logs on the target node, which we traced back to:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2213/pull-ci-openshift-machine-config-operator-release-4.6-e2e-gcp-op/1326980267755180032/artifacts/e2e-gcp-op/gather-extra/pods.json | jq -r '.items[] | select(.metadata.name == "sdn-q55tn").status.containerStatuses[] | select(.name == "kube-rbac-proxy")'
{
  "containerID": "cri-o://ea739d5211d6645011dcfeb0fe4c76265837ba7991ce644ccc06ecfc036aff97",
  "image": "registry.build01.ci.openshift.org/ci-op-l4ng4fvi/stable@sha256:9efd0551d14278a761a480667bf39da79ba93282509d828efa1ab207ce21033d",
  "imageID": "registry.build01.ci.openshift.org/ci-op-l4ng4fvi/stable@sha256:9efd0551d14278a761a480667bf39da79ba93282509d828efa1ab207ce21033d",
  "lastState": {
    "terminated": {
      "containerID": "cri-o://3cada243aed047937564c3d460c53fc182f344cace46204cbbbd755d11c14ccc",
      "exitCode": 1,
      "finishedAt": "2020-11-12T21:02:26Z",
      "message": "om None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\nTraceback (most recent call last):\n File \"<string>\", line 1, in <module>\n File \"/usr/lib64/python3.6/json/__init__.py\", line 299, in load\n parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)\n File \"/usr/lib64/python3.6/json/__init__.py\", line 354, in loads\n return _default_decoder.decode(s)\n File \"/usr/lib64/python3.6/json/decoder.py\", line 339, in decode\n obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n File \"/usr/lib64/python3.6/json/decoder.py\", line 357, in raw_decode\n raise JSONDecodeError(\"Expecting value\", s, err.value) from None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\nTraceback (most recent call last):\n File \"<string>\", line 1, in <module>\n File \"/usr/lib64/python3.6/json/__init__.py\", line 299, in load\n parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)\n File \"/usr/lib64/python3.6/json/__init__.py\", line 354, in loads\n return _default_decoder.decode(s)\n File \"/usr/lib64/python3.6/json/decoder.py\", line 339, in decode\n obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n File \"/usr/lib64/python3.6/json/decoder.py\", line 357, in raw_decode\n raise JSONDecodeError(\"Expecting value\", s, err.value) from None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\nTraceback (most recent call last):\n File \"<string>\", line 1, in <module>\n File \"/usr/lib64/python3.6/json/__init__.py\", line 299, in load\n parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)\n File \"/usr/lib64/python3.6/json/__init__.py\", line 354, in loads\n return _default_decoder.decode(s)\n File \"/usr/lib64/python3.6/json/decoder.py\", line 339, in decode\n obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n File \"/usr/lib64/python3.6/json/decoder.py\", line 357, in raw_decode\n raise JSONDecodeError(\"Expecting value\", s, err.value) from None\njson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\n",
      "reason": "Error",
      "startedAt": "2020-11-12T21:02:22Z"
    }
  },
  "name": "kube-rbac-proxy",
  "ready": true,
  "restartCount": 1,
  "started": true,
  "state": {
    "running": {
      "startedAt": "2020-11-12T21:02:27Z"
    }
  }
}
```

kube-rbac-proxy has been fixed for 4.7 via bug 1885002. That fix hasn't been backported to 4.6, but 4.7 MCO CI is failing with the same 'oc rsh ...' kubelet connection errors (e.g. [1]), so the kube-rbac-proxy issue is a red herring here.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2098/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1326984157745647616

---

Trevor's summary with my diagnosis:

1. Test triggers a pool update.
2. Test waits until it thinks the pool is done.
3. The MCO race trips and the pool is claimed done too fast.
4. Test tries to 'oc rsh ...' a node.
5. That node is mid-roll, so the connection fails.
6. Test fails, complaining about the failed kubelet connection.

[1] shows a config render taking 2 seconds; rolling a change out to the cluster should take minutes. [2] has a sleep whose comment states there is a race during the rollout that causes pool readiness to be reported too early.

Increasing the Sleep() might fix the issue... Is there a way to not sleep and instead know for sure the rollout has happened?

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2098/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1326984157745647616/build-log.txt
[2] https://github.com/openshift/machine-config-operator/blob/e7baf4ed61c55f4d28abc770d1de00e052a50156/test/e2e/ctrcfg_test.go#L108-L110

Reassigning to MCO to comment on the race... perhaps there is a better way to update the test.
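To the question above about avoiding the sleep: one option would be to poll the MachineConfigPool object itself until it reports both the expected rendered config and fully updated machine counts, rather than sleeping for a fixed interval. The sketch below is illustrative only, not the existing MCO test helper; the waitForPoolRollout name, the getPool parameter, the 2-second poll interval, and the package name are assumptions, and a real test would plug in whatever pool getter and rendered-config name (e.g. from waitForRenderedConfig()) it already has.

```go
// Minimal sketch, assuming the machineconfiguration v1 types vendored in the
// MCO repo; waitForPoolRollout and getPool are hypothetical names, not part
// of the existing test suite.
package e2e_test

import (
	"time"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForPoolRollout polls the pool until it reports the target rendered
// config and a true Updated condition, instead of sleeping and hoping the
// rollout has both started and finished.
func waitForPoolRollout(getPool func() (*mcfgv1.MachineConfigPool, error), renderedConfig string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pool, err := getPool()
		if err != nil {
			return false, err
		}
		// The pool must point at the new rendered config; checking the Updated
		// condition alone can succeed before the controller has even observed
		// the change, which is exactly the race described in the summary above.
		if pool.Status.Configuration.Name != renderedConfig {
			return false, nil
		}
		// Every machine must have picked up the new config and be ready again.
		if pool.Status.UpdatedMachineCount != pool.Status.MachineCount ||
			pool.Status.ReadyMachineCount != pool.Status.MachineCount {
			return false, nil
		}
		// Finally, require the controller's own Updated condition.
		for _, cond := range pool.Status.Conditions {
			if cond.Type == mcfgv1.MachineConfigPoolUpdated {
				return cond.Status == corev1.ConditionTrue, nil
			}
		}
		return false, nil
	})
}
```

With a wait like this, step 2 of the summary would only return once every machine in the pool reports the new config, so the 'oc rsh ...' in step 4 should not land on a node that is still mid-roll.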
---

Those ctrcfg tests were written by and are owned by the node team; the MCD tests don't require a sleep because they don't have to do that "delete ctrcfg" step or account for it (https://github.com/openshift/machine-config-operator/blob/e7baf4ed61c55f4d28abc770d1de00e052a50156/test/e2e/ctrcfg_test.go#L110). We just need to waitForRenderedConfig() and waitForPoolComplete(). I don't have any background on that comment; looking at the history, perhaps Peter has more color on it for us (https://github.com/openshift/machine-config-operator/pull/1414#issuecomment-582619470). But the sleep has been around since 4.4 (backported to 4.3) and the ctrcfg tests have not changed since April, so I'm not sure why this is showing up now.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633