We are making alerts that fire during upgrade a bug via e2e tests (because alerts generally surface another bug that we should fix before we ship), and the PR job that reports those errors has frequently seen the authentication operator go degraded during 4.8 to 4.8 upgrades:

alert ClusterOperatorDegraded fired for 210 seconds with labels: map[condition:Degraded endpoint:metrics instance:10.0.0.5:9099 job:cluster-version-operator name:authentication namespace:openshift-cluster-version pod:cluster-version-operator-b84d66998-rzglx reason:APIServerDeployment_UnavailablePod service:cluster-version-operator severity:critical]

in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1371561731540652032

openshift-apiserver has also been reported as impacted, so it's likely this is due to some disruption at a lower level. Needs triage and routing - the auth operator should not go degraded during an upgrade, so the alert shouldn't fire.
I'm going to use this bug to cover the variety of "authentication operator goes degraded / not available due to transient errors that could be retried" - roughly: the authentication operator is too sensitive to ANY error, generating noise and alerts. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1379871391536386048

I'm bumping this to urgent because it now causes 20-30% of upgrade failures (probably mostly due to the AWS issue, but it's also happening on GCP and others). I need triage and resolution of this to get CI back to green. Operators can't go degraded unless they're actually degraded.
Broadening this to cover both normal e2e runs and upgrades - it's happening in both parallel and serial runs: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26097/pull-ci-openshift-origin-master-e2e-aws-serial/1385694962938548224 although I'm not positive the serial and parallel runs are failing for the same reason.
Mentioning the update test-case for Sippy.
In [1]:

: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel] 8s
fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Apr 23 22:30:42.224: Unexpected alerts fired or pending after the test run:
alert ClusterOperatorDown fired for 6 seconds with labels: {endpoint="metrics", instance="10.0.201.220:9099", job="cluster-version-operator", name="authentication", namespace="openshift-cluster-version", pod="cluster-version-operator-8f7797cf7-sg62t", service="cluster-version-operator", severity="critical", version="4.8.0-0.ci.test-2021-04-23-205156-ci-op-0qjgszvb"}

The issue seems to be that the auth operator comes up declaring Available=False with reason WellKnown_NotReady (and possibly also Degraded=True). That goes away around 21:34Z, right about when the CVO declares install-complete at 21:33:38Z [2]. But the alert still fires for 6 seconds.

It's possible that the fix for this is "make installs faster, especially the ingress routes coming up". Alternatively, it could be "increase the 'for' delay before the CVO starts alerting on Available=False or Degraded=True ClusterOperators". Those aren't mutually exclusive. Other ideas?

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26097/pull-ci-openshift-origin-master-e2e-aws-serial/1385694962938548224
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26097/pull-ci-openshift-origin-master-e2e-aws-serial/1385694962938548224/artifacts/e2e-aws-serial/gather-extra/artifacts/clusterversion.json
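For concreteness, the "increase the 'for' delay" option would amount to something like the following rule shape. This is a hypothetical sketch on my part; the CVO's actual ClusterOperatorDown rule, its expression, and its threshold live in openshift/cluster-version-operator and may differ:

```yaml
# Hypothetical rule shape only - the real expression and threshold are
# owned by the cluster-version-operator, not copied from it.
- alert: ClusterOperatorDown
  expr: cluster_operator_up{job="cluster-version-operator"} == 0
  for: 10m  # longer delay so brief install-time Available=False blips don't fire
  labels:
    severity: critical
```

The trade-off is the usual one: a longer `for` suppresses install-time noise but also delays notification of a genuinely stuck operator by the same amount.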
Checking the original [1] in PromeCIeus too [2], cluster_operator_up{name="authentication"} shows the operator down during install (as in the previous comment), but also down again during two portions of the update phase. So I'm leaving this bug under auth for the update business, but we may need to spin off a CVO bug for the install-time alerting.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1371561731540652032
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1371561731540652032
If `WellKnown_NotReady` is the issue, then the problem is with kube-apiserver not rolling out fast enough - the CAO checks the kube-apiservers' statuses every 30s. I don't think there's anything I can do about the KAS rollouts, and I don't think we need to check those endpoints more often, especially once they all start serving the required content.

If the operator is reporting "Degraded", I think I can turn that into "Progressing" for a certain period of time instead.

However, if `WellKnown_NotReady` is really the only reason this alert is firing during installation, the "increase the 'for' delay before the CVO starts alerting on Available=False or Degraded=True ClusterOperators" solution would probably be the right one, unless we're able to make kube-apiserver rollouts faster.

Can you please explain what the `cluster_operator_up` metric denotes? When is it 0, and when is it 1?
> If the operator is reporting "Degraded", I think I can turn that to "Progressing" for a certain period of time instead.

If WellKnown_NotReady means your well-known route is not ready (seems likely), and auth needs that well-known route to function (I have no idea), then being Available=False makes sense to me.

> Can you please explain what the `cluster_operator_up` metric denotes? When is it 0, and when is it 1?

It's currently getting reshuffled in [1], after which it will mean "this ClusterOperator has not had Available=True in the past 10m". We may end up with a CVO-side change that avoids ClusterOperatorDown during at least some of the install window. However, checking a recent GCP update job [2], the response to the control plane rolling the nodes seems more twitchy:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1387175042994933760/artifacts/e2e-gcp-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/authentication condition/Available.*changed'
Apr 28 00:09:17.280 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:09:27.436 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:09:32.563 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:09:37.693 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:12:17.140 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:12:17.306 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:12:22.542 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:12:27.942 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:12:33.193 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:12:33.311 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:13:33.671 E clusteroperator/authentication condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: failing or missing response from https://10.128.0.80:8443/apis/oauth.openshift.io/v1: Get "https://10.128.0.80:8443/apis/oauth.openshift.io/v1": context deadline exceeded
Apr 28 00:13:33.806 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:13:38.662 E clusteroperator/authentication condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: failing or missing response from https://10.129.0.84:8443/apis/user.openshift.io/v1: Get "https://10.129.0.84:8443/apis/user.openshift.io/v1": dial tcp 10.129.0.84:8443: i/o timeout
Apr 28 00:13:38.818 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:15:48.383 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:15:48.507 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:16:18.531 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:16:18.658 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:16:23.806 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": dial tcp 35.196.253.222:443: i/o timeout (Client.Timeout exceeded while awaiting headers)
Apr 28 00:16:29.070 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:16:34.202 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:16:34.335 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:16:39.407 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:16:39.490 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:16:44.552 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:16:49.715 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:16:54.853 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:17:05.024 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:20:23.562 E clusteroperator/authentication condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: "user.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
Apr 28 00:20:23.740 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:20:38.634 E clusteroperator/authentication condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: "user.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
Apr 28 00:20:38.753 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:27:19.128 E clusteroperator/authentication condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: failing or missing response from https://10.130.0.12:8443/apis/oauth.openshift.io/v1: Get "https://10.130.0.12:8443/apis/oauth.openshift.io/v1": dial tcp 10.130.0.12:8443: i/o timeout
Apr 28 00:27:19.223 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:45:50.745 E clusteroperator/authentication condition/Available status/False reason/OAuthServerDeployment_NoDeployment changed: OAuthServerDeploymentAvailable: deployment/openshift-authentication: could not be retrieved
Apr 28 00:45:56.113 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well

Twitching Available=False, sometimes only for sub-second events, seems sub-optimal. Possibly the fix is non-auth, like "Kube API load balancing is busted", but we should fix whatever is causing the update-time twitching.

[1]: https://github.com/openshift/cluster-version-operator/pull/550
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1387175042994933760
*** Bug 1946781 has been marked as a duplicate of this bug. ***
Bringing over some information from bug 1946781, so we know if we need to reopen bug 1946781 after the fix for this one lands:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=92h&type=junit&name=compact&search=ClusterOperatorD.*authentication' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact-serial (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-compact (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
Components may not go Available=False for a single health-check failure. You must absorb some sustained detection of being down before you report Available=False. I would expect AT MINIMUM several failures in a row, spaced out over seconds, before you report unavailable (but by all means keep a metric of the number of failures and successes seen, and keep logging). It does appear there are other bugs here, including the apiserver potentially being put in rotation before it should be.
The linked fix should solve the authentication operator going degraded too early when it does not find the OAuth metadata it expects on the kube-apiserver side, but it won't cover the network errors.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438