Bug 1946781 - ClusterOperatorDown fires during 4.8 compact CI
Summary: ClusterOperatorDown fires during 4.8 compact CI
Keywords:
Status: CLOSED DUPLICATE of bug 1939580
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Standa Laznicka
QA Contact: pmali
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-04-06 19:51 UTC by W. Trevor King
Modified: 2021-04-30 14:02 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
job=release-openshift-origin-installer-e2e-aws-compact-4.7=all
Last Closed: 2021-04-30 14:02:54 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2021-04-06 19:51:45 UTC
The compact jobs are failing frequently in CI:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=92h&type=junit&name=compact&search=ClusterOperatorDown.*authentication' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact-serial (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-compact-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-compact-serial (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
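
The "impact" figures above appear to be the failure rate multiplied by the share of failures that match the search. That relation is inferred from the four lines shown here, not taken from the search tool's documentation. A minimal Python sketch that parses the summary lines and rechecks the arithmetic (the regular expression and the function name `parse_summary` are illustrative, not part of any CI tooling):

```python
import re

# Summary lines copied verbatim from the search output in this report.
SUMMARY = """\
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact-serial (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-compact-serial (all) - 2 runs, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-compact-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-compact-serial (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
"""

# Assumed line layout, based only on the lines above.
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, "
    r"(?P<failed>\d+)% failed, (?P<match>\d+)% of failures match = "
    r"(?P<impact>\d+)% impact$"
)


def parse_summary(text):
    """Return one dict per well-formed summary line; skip anything else."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # malformed or unrelated line
        failed = int(m["failed"])
        match = int(m["match"])
        rows.append({
            "job": m["job"],
            "runs": int(m["runs"]),
            "reported_impact_pct": int(m["impact"]),
            # Inferred relation: impact = failed% * match% / 100.
            "computed_impact_pct": failed * match / 100,
        })
    return rows


rows = parse_summary(SUMMARY)
for row in rows:
    print(row["job"], row["reported_impact_pct"], row["computed_impact_pct"])
```

The third line in `SUMMARY` is deliberately malformed to show that lines not matching the assumed layout are skipped rather than guessed at. With only two runs per job, each percentage moves in 50-point steps, so these figures indicate frequency only roughly.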

Picking one of those [1], the only failing test case was:

  : [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
  Run #0: Failed	8s
  fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Apr  6 07:33:02.223: Unexpected alerts fired or pending after the test run:

  alert ClusterOperatorDown fired for 1 seconds with labels: {endpoint="metrics", instance="10.0.200.205:9099", job="cluster-version-operator", name="authentication", namespace="openshift-cluster-version", pod="cluster-version-operator-b4756cb5f-kh6h8", service="cluster-version-operator", severity="critical", version="4.8.0-0.ci-2021-04-05-224633"}

From [2], the timeline for that job is something like:

* 06:59Z, Available=False with OAuthServerRouteEndpointAccessibleController_EndpointUnavailable::WellKnown_NotReady. Degraded=True with OAuthServerRouteEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError.
* 07:04Z, Available=False with WellKnown_NotReady, Degraded=True with WellKnownReadyController_SyncError.
* 07:09Z, operator goes happy.
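
For a rough sense of how long the operator reported Available=False, the timeline can be written down as data and the span computed. This is a sketch under stated assumptions: the times are the approximate ones listed above, the date 2021-04-06 is taken from the failure message, and "goes happy" is read as Available=True with Degraded=False. The names `at`, `TIMELINE`, and `unavailable_span` are illustrative.

```python
from datetime import datetime, timezone


def at(hhmm):
    """Build a UTC timestamp on 2021-04-06 (assumed job date) from 'HH:MM'."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2021, 4, 6, hour, minute, tzinfo=timezone.utc)


# Approximate condition snapshots from the timeline above.
TIMELINE = [
    (at("06:59"), {"Available": False, "Degraded": True}),
    (at("07:04"), {"Available": False, "Degraded": True}),
    # Assumption: "goes happy" means Available=True and Degraded=False.
    (at("07:09"), {"Available": True, "Degraded": False}),
]


def unavailable_span(timeline):
    """Time from the first Available=False snapshot to the next Available=True one."""
    start = next(t for t, cond in timeline if not cond["Available"])
    end = next(t for t, cond in timeline if t > start and cond["Available"])
    return end - start


span = unavailable_span(TIMELINE)
print(span)  # 0:10:00
```

Because the snapshots are minute-granularity approximations, the result is a lower-resolution estimate of the real window, not an exact measurement.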

[3] has 'Managed cluster should start all core operators' passing at 07:09:59Z, so possibly we need to keep the install-Progressing workaround a bit longer and put off the revert from [4].  Might also be related to (or a dup of) bug 1939580 or bug 1929922.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact/1379322361316118528
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact/1379322361316118528
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact/1379322361316118528/artifacts/e2e-aws-compact/openshift-e2e-test/artifacts/e2e-intervals.json
[4]: https://github.com/openshift/cluster-authentication-operator/pull/423

Comment 1 Standa Laznicka 2021-04-30 14:02:54 UTC
Sounds like a duplicate of the other "ClusterOperatorIsDown" BZ; I'm going to close this in its favour.

*** This bug has been marked as a duplicate of bug 1939580 ***

