Problem is surfacing in this test: openshift-tests.[sig-arch] events should not repeat pathologically

This problem is occurring only on AWS and Azure; GCP is for some reason unaffected. It appears to be failing extremely frequently on these platforms, but not quite 100% of the time. The problem is blocking payloads from shipping and is thus very urgent.

Example output:

: [sig-arch] events should not repeat pathologically 0s
{  1 events happened too frequently

event happened 65 times, something is wrong: ns/openshift-etcd pod/etcd-guard-ip-10-0-176-75.us-west-2.compute.internal node/ip-10-0-176-75.us-west-2.compute.internal - reason/ProbeError Readiness probe error: Get "https://10.0.176.75:9980/healthz": dial tcp 10.0.176.75:9980: connect: connection refused body: }

Suspecting this change to cluster-etcd-operator, which is new to the failing payloads:

Bug 2063831: replace quorumguard and add readyz server #763
https://github.com/openshift/cluster-etcd-operator/pull/763
merged Apr 12 9:19 ADT

The PR did have an upgrade job run on it, but it appears that job uses GCP, which for some reason is not exhibiting this symptom; only Azure and AWS are:
https://prow.ci.openshift.org/pr-history/?org=openshift&repo=cluster-etcd-operator&pr=763

Focusing on a job run with ONLY the pathological event failure, and only for etcd quorum guard pods:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1514016456453394432

Looks like this may be happening during install, not upgrade.

Raw uncompressed events we observed:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1514016456453394432/artifacts/e2e-aws-upgrade/openshift-e2e-test/artifacts/junit/e2e-events_20220412-234130.json

❯ cat e2e-events_20220412-234130.json | jq '.items[] | select(.locator | contains("etcd-guard-ip-10-0-176-75.us-west-2.compute.internal")) | select(.message | contains("connection refused")) | .from'
"2022-04-12T23:46:15Z"
"2022-04-12T23:46:15Z"
"2022-04-12T23:46:16Z"
"2022-04-12T23:46:16Z"
"2022-04-12T23:46:16Z"
"2022-04-12T23:46:16Z"
"2022-04-12T23:46:21Z"
"2022-04-12T23:46:21Z"
"2022-04-12T23:46:26Z"
"2022-04-12T23:46:26Z"
"2022-04-12T23:46:31Z"
"2022-04-12T23:46:31Z"
"2022-04-12T23:46:36Z"
"2022-04-12T23:46:36Z"
"2022-04-12T23:46:41Z"
"2022-04-12T23:46:41Z"
"2022-04-12T23:46:46Z"
"2022-04-12T23:46:46Z"
"2022-04-12T23:46:51Z"
"2022-04-12T23:46:51Z"
"2022-04-12T23:46:56Z"
"2022-04-12T23:51:11Z"

Pod logs are unfortunately empty:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1514016456453394432/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-etcd_etcd-guard-ip-10-0-176-75.us-west-2.compute.internal_guard.log

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1514016456453394432/artifacts/e2e-aws-upgrade/gather-must-gather/artifacts/event-filter.html we can see that pod etcd-guard-ip-10-0-176-75.us-west-2.compute.internal created container "guard" at 23:46:14. We then get connection refused 60 times, mostly up until 23:46:56, and one more hit at 23:51:11.
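As a quick sanity check on that window, the same filters can be re-used to summarize how many interval entries match and when the first and last occurred. This is just a convenience sketch against the same e2e-events file, using only the .items, .locator, .message, and .from fields already shown above:

❯ cat e2e-events_20220412-234130.json | jq '[.items[] | select(.locator | contains("etcd-guard-ip-10-0-176-75.us-west-2.compute.internal")) | select(.message | contains("connection refused")) | .from] | {count: length, first: min, last: max}'

(min/max on the ISO-8601 timestamps sort chronologically, so this gives the first and last hit without eyeballing the list.)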
Given the limited set of affected platforms, the problem can be seen to some extent in Sippy:
https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=openshift-tests.%5Bsig-arch%5D%20events%20should%20not%20repeat%20pathologically

Some jobs fail with ONLY this test failing. If you'd like more examples, you can see the sub-jobs hanging off these aggregated jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-aws-sdn-upgrade-4.11-micro-release-openshift-release-analysis-aggregator/1514016467325030400
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-azure-ovn-upgrade-4.11-micro-release-openshift-release-analysis-aggregator/1514016426216656896

Others fail with more events:

event happened 66 times, something is wrong: ns/openshift-etcd pod/etcd-guard-ip-10-0-137-77.ec2.internal node/ip-10-0-137-77.ec2.internal - reason/ProbeError Readiness probe error: Get "https://10.0.137.77:9980/healthz": dial tcp 10.0.137.77:9980: connect: connection refused body:

event happened 23 times, something is wrong: ns/openshift-kube-scheduler pod/openshift-kube-scheduler-guard-ip-10-0-228-139.ec2.internal node/ip-10-0-228-139.ec2.internal - reason/ProbeError Readiness probe error: Get "https://10.0.228.139:10259/healthz": dial tcp 10.0.228.139:10259: connect: connection refused body:

event happened 22 times, something is wrong: ns/openshift-kube-scheduler pod/openshift-kube-scheduler-guard-ip-10-0-228-139.ec2.internal node/ip-10-0-228-139.ec2.internal - reason/Unhealthy Readiness probe failed: Get "https://10.0.228.139:10259/healthz": dial tcp 10.0.228.139:10259: connect: connection refused}

And some also fail with other tests such as:

: [sig-network] pods should successfully create sandboxes by other
: [sig-etcd] etcd leader changes are not excessive [Late] [Suite:openshift/conformance/parallel]
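If you want to triage any of those sub-job runs, a grouped variant of the earlier jq query will list every locator with repeated connection-refused events along with its count. A sketch only; the filename is the e2e-events JSON from the run above and will differ per job:

❯ cat e2e-events_20220412-234130.json | jq -r '[.items[] | select(.message | contains("connection refused"))] | group_by(.locator) | sort_by(length) | reverse | .[] | "\(length)x \(.[0].locator)"'

Locators with high counts against guard pods correspond to the pathological events quoted above.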
The revert PR looks to have confirmed this was the issue: AWS has passed, and Azure is still running, but this is promising and we should proceed with the revert. https://github.com/openshift/cluster-etcd-operator/pull/787
The revert has been verified with 4.11, and the test should pass based on Comment 1.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069