Bug 2075015 - etcd-guard connection refused event repeating pathologically (payload blocking)
Summary: etcd-guard connection refused event repeating pathologically (payload blocking)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Haseeb Tariq
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-04-13 12:29 UTC by Devan Goodwin
Modified: 2022-08-10 11:07 UTC (History)
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:07:02 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 787 0 None Merged Bug 2075015: Revert "replace quorumguard and add readyz server" 2022-04-18 17:29:43 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:07:26 UTC

Description Devan Goodwin 2022-04-13 12:29:06 UTC
Problem is surfacing in this test:

openshift-tests.[sig-arch] events should not repeat pathologically

This problem is occurring only on AWS and Azure; GCP is for some reason unaffected. The test appears to be failing extremely frequently on these platforms, but not quite 100% of the time.

The problem is blocking payloads from shipping and is thus very urgent.

Example output: 

: [sig-arch] events should not repeat pathologically 	0s
{  1 events happened too frequently

event happened 65 times, something is wrong: ns/openshift-etcd pod/etcd-guard-ip-10-0-176-75.us-west-2.compute.internal node/ip-10-0-176-75.us-west-2.compute.internal - reason/ProbeError Readiness probe error: Get "https://10.0.176.75:9980/healthz": dial tcp 10.0.176.75:9980: connect: connection refused
body: 
}
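
To see exactly what the kubelet probe is hitting, a manual check along these lines should reproduce the connection refused (this is only a sketch: it assumes the usual oc debug node access and that curl is available under /host; the IP and port are taken from the event above):

# Run the same HTTPS GET the readiness probe performs, from the affected node.
❯ oc debug node/ip-10-0-176-75.us-west-2.compute.internal -- \
    chroot /host curl -ksS https://10.0.176.75:9980/healthz
# "connection refused" here means nothing is listening on :9980 at all,
# as opposed to a running server returning an error status.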

Suspecting this change to cluster-etcd-operator, which is new to the failing payloads:

    Bug 2063831: replace quorumguard and add readyz server #763
	
https://github.com/openshift/cluster-etcd-operator/pull/763 merged Apr 12 9:19 ADT

The PR did have an upgrade job run on it, but it appears that job uses GCP, which for some reason is not exhibiting this symptom; only Azure and AWS are affected: https://prow.ci.openshift.org/pr-history/?org=openshift&repo=cluster-etcd-operator&pr=763




Focusing on a job run with ONLY the pathological event failure, and only for etcd quorum guard pods: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1514016456453394432

Looks like this may be during install, not upgrade.

Raw uncompressed events we observed: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1514016456453394432/artifacts/e2e-aws-upgrade/openshift-e2e-test/artifacts/junit/e2e-events_20220412-234130.json



❯ cat e2e-events_20220412-234130.json | jq '.items[] | select(.locator | contains("etcd-guard-ip-10-0-176-75.us-west-2.compute.internal")) | select(.message | contains("connection refused")) | .from'
"2022-04-12T23:46:15Z"
"2022-04-12T23:46:15Z"
"2022-04-12T23:46:16Z"
"2022-04-12T23:46:16Z"
"2022-04-12T23:46:16Z"
"2022-04-12T23:46:16Z"
"2022-04-12T23:46:21Z"
"2022-04-12T23:46:21Z"
"2022-04-12T23:46:26Z"
"2022-04-12T23:46:26Z"
"2022-04-12T23:46:31Z"
"2022-04-12T23:46:31Z"
"2022-04-12T23:46:36Z"
"2022-04-12T23:46:36Z"
"2022-04-12T23:46:41Z"
"2022-04-12T23:46:41Z"
"2022-04-12T23:46:46Z"
"2022-04-12T23:46:46Z"
"2022-04-12T23:46:51Z"
"2022-04-12T23:46:51Z"
"2022-04-12T23:46:56Z"
"2022-04-12T23:51:11Z"


Pod logs unfortunately empty: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1514016456453394432/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-etcd_etcd-guard-ip-10-0-176-75.us-west-2.compute.internal_guard.log

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1514016456453394432/artifacts/e2e-aws-upgrade/gather-must-gather/artifacts/event-filter.html we can see that the pod etcd-guard-ip-10-0-176-75.us-west-2.compute.internal created the "guard" container at 23:46:14.

We then get connection refused 60 times, mostly up until 23:46:56, and one more hit at 23:51:11.
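
That spacing is consistent with a readiness probe firing roughly every 5 seconds while the port stays closed. A jq sketch over the same events file makes the gaps visible (assumes a jq with fromdateiso8601, i.e. 1.5 or later):

# Seconds between consecutive connection-refused events for this pod.
❯ cat e2e-events_20220412-234130.json | jq '
    [.items[]
     | select(.locator | contains("etcd-guard-ip-10-0-176-75.us-west-2.compute.internal"))
     | select(.message | contains("connection refused"))
     | .from | fromdateiso8601]
    | sort
    | [.[1:], .[:-1]] | transpose | map(.[0] - .[1])'
# Expect mostly 0- and 5-second gaps, plus one long gap before the final
# event at 23:51:11.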







Given the limited set of affected platforms, the problem can somewhat be seen here:
https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=openshift-tests.%5Bsig-arch%5D%20events%20should%20not%20repeat%20pathologically


Some jobs fail with ONLY this test failing.

If you'd like more examples, you can see the sub-jobs hanging off these aggregated jobs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-aws-sdn-upgrade-4.11-micro-release-openshift-release-analysis-aggregator/1514016467325030400

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-azure-ovn-upgrade-4.11-micro-release-openshift-release-analysis-aggregator/1514016426216656896




Others fail with more events, for example:

event happened 66 times, something is wrong: ns/openshift-etcd pod/etcd-guard-ip-10-0-137-77.ec2.internal node/ip-10-0-137-77.ec2.internal - reason/ProbeError Readiness probe error: Get "https://10.0.137.77:9980/healthz": dial tcp 10.0.137.77:9980: connect: connection refused
body: 

event happened 23 times, something is wrong: ns/openshift-kube-scheduler pod/openshift-kube-scheduler-guard-ip-10-0-228-139.ec2.internal node/ip-10-0-228-139.ec2.internal - reason/ProbeError Readiness probe error: Get "https://10.0.228.139:10259/healthz": dial tcp 10.0.228.139:10259: connect: connection refused
body: 

event happened 22 times, something is wrong: ns/openshift-kube-scheduler pod/openshift-kube-scheduler-guard-ip-10-0-228-139.ec2.internal node/ip-10-0-228-139.ec2.internal - reason/Unhealthy Readiness probe failed: Get "https://10.0.228.139:10259/healthz": dial tcp 10.0.228.139:10259: connect: connection refused



And some also fail with other tests such as:

: [sig-network] pods should successfully create sandboxes by other
: [sig-etcd] etcd leader changes are not excessive [Late] [Suite:openshift/conformance/parallel]

Comment 1 Devan Goodwin 2022-04-13 16:38:24 UTC
The revert PR looks to have confirmed this was the issue: AWS has passed and Azure is still running, but this is promising and we should proceed with the revert. https://github.com/openshift/cluster-etcd-operator/pull/787

Comment 4 ge liu 2022-04-15 01:45:17 UTC
The revert has been verified with 4.11, and the test should pass based on Comment 1.

Comment 6 errata-xmlrpc 2022-08-10 11:07:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

