We are making alerts that fire during upgrade a bug via e2e tests (because alerts generally surface another bug that we should fix before we ship), and the PR job that reports those errors has frequently seen the authentication operator go degraded during 4.8 to 4.8 upgrades:

alert ClusterOperatorDegraded fired for 210 seconds with labels: map[condition:Degraded endpoint:metrics instance:10.0.0.5:9099 job:cluster-version-operator name:authentication namespace:openshift-cluster-version pod:cluster-version-operator-b84d66998-rzglx reason:APIServerDeployment_UnavailablePod service:cluster-version-operator severity:critical]

in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1371561731540652032

openshift-apiserver has also been reported as impacted, so it's likely this is due to some disruption at a lower level. Needs triage and routing - the auth operator should not go degraded during an upgrade, so the alert shouldn't fire.
I'm going to use this bug to cover the variety of "authentication operator goes degraded / not available due to transient errors that could be retried" - roughly: the authentication operator is too sensitive to ANY error, generating noise and alerts. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp/1379871391536386048

I'm bumping this to urgent because it now causes 20-30% of upgrade failures (probably mostly due to the AWS issue, but it's also happening on GCP and others). I need triage and resolution of this to get CI back to green. Operators can't go degraded unless they're actually degraded.
Broadening this to cover both normal e2e runs and upgrades - it's happening in both parallel and serial runs: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26097/pull-ci-openshift-origin-master-e2e-aws-serial/1385694962938548224 although I'm not positive the serial and parallel runs are failing for the same reason.
Mentioning the update test-case for Sippy.
In [1]:

: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel] 8s
fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Apr 23 22:30:42.224: Unexpected alerts fired or pending after the test run:
alert ClusterOperatorDown fired for 6 seconds with labels: {endpoint="metrics", instance="10.0.201.220:9099", job="cluster-version-operator", name="authentication", namespace="openshift-cluster-version", pod="cluster-version-operator-8f7797cf7-sg62t", service="cluster-version-operator", severity="critical", version="4.8.0-0.ci.test-2021-04-23-205156-ci-op-0qjgszvb"}

The issue seems to be that the auth operator comes up declaring Available=False with reason WellKnown_NotReady (and possibly also Degraded=True). That goes away around 21:34Z, right about when the CVO declares install-complete at 21:33:38Z [2]. But the alert still fires for 6 seconds.

It's possible that the fix for this is "make installs faster, especially the ingress routes coming up". Alternatively, it could be "increase the 'for' delay before the CVO starts alerting on Available=False or Degraded=True ClusterOperators". Those aren't mutually exclusive. Other ideas?

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26097/pull-ci-openshift-origin-master-e2e-aws-serial/1385694962938548224
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26097/pull-ci-openshift-origin-master-e2e-aws-serial/1385694962938548224/artifacts/e2e-aws-serial/gather-extra/artifacts/clusterversion.json
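For concreteness, the "increase the 'for' delay" option would amount to something like the following rule shape. This is a hypothetical sketch on my part; the CVO's actual ClusterOperatorDown rule, its expression, and its threshold live in openshift/cluster-version-operator and may differ:

```yaml
# Hypothetical rule shape only - the real expression and threshold are
# owned by the cluster-version-operator, not copied from it.
- alert: ClusterOperatorDown
  expr: cluster_operator_up{job="cluster-version-operator"} == 0
  for: 10m  # longer delay so brief install-time Available=False blips don't fire
  labels:
    severity: critical
```

The trade-off is the usual one: a longer `for` suppresses install-time noise but also delays notification of a genuinely stuck operator by the same amount.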
Checking the original [1] in PromeCIeus too [2], cluster_operator_up{name="authentication"} shows the operator down during install (as in the previous comment), but also down again during two portions of the update phase. So I'm leaving this bug under auth for the update business, but we may need to spin off a CVO bug for the install-time alerting.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1371561731540652032
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1371561731540652032
If `WellKnown_NotReady` is the issue, then the problem is with kube-apiserver not rolling out fast enough - the CAO checks the kube-apiservers' statuses every 30s. I don't think there's anything I can do about the KAS rollouts, and I don't think we need to check those endpoints more often, especially once they all start serving the required content.

If the operator is reporting "Degraded", I think I can turn that into "Progressing" for a certain period of time instead.

However, if `WellKnown_NotReady` is really the only reason this alert is firing during installation, the "increase the 'for' delay before the CVO starts alerting on Available=False or Degraded=True ClusterOperators" solution would probably be the right one, unless we're able to make kube-apiserver rollouts faster.

Can you please explain what the `cluster_operator_up` metric denotes? When is it 0, and when is it 1?
> If the operator is reporting "Degraded", I think I can turn that to "Progressing" for a certain period of time instead.

If WellKnown_NotReady means your well-known route is not ready (seems likely), and auth needs that well-known route to function (I have no idea), then being Available=False makes sense to me.

> Can you please explain what the `cluster_operator_up` metric denotes? When is it 0, and when is it 1?

It's currently getting reshuffled in [1], after which it will mean "this ClusterOperator has not had Available=True in the past 10m". We may end up with a CVO-side change that avoids ClusterOperatorDown during at least some of the install window. However, checking a recent GCP update job [2], the response to the control plane rolling the nodes seems more twitchy:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1387175042994933760/artifacts/e2e-gcp-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/authentication condition/Available.*changed'
Apr 28 00:09:17.280 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:09:27.436 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:09:32.563 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:09:37.693 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:12:17.140 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:12:17.306 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:12:22.542 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:12:27.942 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:12:33.193 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:12:33.311 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:13:33.671 E clusteroperator/authentication condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: failing or missing response from https://10.128.0.80:8443/apis/oauth.openshift.io/v1: Get "https://10.128.0.80:8443/apis/oauth.openshift.io/v1": context deadline exceeded
Apr 28 00:13:33.806 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:13:38.662 E clusteroperator/authentication condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: failing or missing response from https://10.129.0.84:8443/apis/user.openshift.io/v1: Get "https://10.129.0.84:8443/apis/user.openshift.io/v1": dial tcp 10.129.0.84:8443: i/o timeout
Apr 28 00:13:38.818 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:15:48.383 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:15:48.507 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:16:18.531 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:16:18.658 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:16:23.806 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": dial tcp 35.196.253.222:443: i/o timeout (Client.Timeout exceeded while awaiting headers)
Apr 28 00:16:29.070 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:16:34.202 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:16:34.335 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:16:39.407 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:16:39.490 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:16:44.552 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:16:49.715 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:16:54.853 E clusteroperator/authentication condition/Available status/False reason/OAuthServerRouteEndpointAccessibleController_EndpointUnavailable changed: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-op-6lqqd1k2-17f95.*********************************/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 28 00:17:05.024 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:20:23.562 E clusteroperator/authentication condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: "user.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
Apr 28 00:20:23.740 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:20:38.634 E clusteroperator/authentication condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: "user.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
Apr 28 00:20:38.753 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:27:19.128 E clusteroperator/authentication condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: failing or missing response from https://10.130.0.12:8443/apis/oauth.openshift.io/v1: Get "https://10.130.0.12:8443/apis/oauth.openshift.io/v1": dial tcp 10.130.0.12:8443: i/o timeout
Apr 28 00:27:19.223 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well
Apr 28 00:45:50.745 E clusteroperator/authentication condition/Available status/False reason/OAuthServerDeployment_NoDeployment changed: OAuthServerDeploymentAvailable: deployment/openshift-authentication: could not be retrieved
Apr 28 00:45:56.113 W clusteroperator/authentication condition/Available status/True reason/AsExpected changed: All is well

Twitching Available=False, sometimes only for sub-second events, seems sub-optimal. Possibly the fix is non-auth, like "Kube API load balancing is busted", but we should fix whatever is causing the update-time twitching.

[1]: https://github.com/openshift/cluster-version-operator/pull/550
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1387175042994933760
*** Bug 1946781 has been marked as a duplicate of this bug. ***
Bringing over some information from bug 1946781, so we know if we need to reopen bug 1946781 after the fix for this one lands:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=92h&type=junit&name=compact&search=ClusterOperatorD.*authentication' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact-serial (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-compact (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
Components may not go Available=False for a single health-check failure. You must absorb some sustained detection of being down before you report Available=False. I would expect AT MINIMUM several failures in a row, spaced out over seconds, before you report unavailable (but by all means keep a metric of the number of failures and successes seen, and keep logging). It does appear there are other bugs here, including the apiserver potentially being put in rotation before it should be.
The linked fix should solve the authentication operator going degraded too early when it does not find the OAuth metadata it expects on the kube-apiserver side, but it won't cover the network errors.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438