Bug 2019832 - 4.10 Nightlies blocked: Failed to upgrade authentication, operator was degraded [NEEDINFO]
Summary: 4.10 Nightlies blocked: Failed to upgrade authentication, operator was degraded
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oauth-apiserver
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Sergiusz Urbaniak
QA Contact: Xingxing Xia
URL:
Whiteboard: EmergencyRequest
Depends On:
Blocks:
 
Reported: 2021-11-03 12:39 UTC by Devan Goodwin
Modified: 2022-03-10 16:25 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:24:56 UTC
Target Upstream Version:
Embargoed:
mfojtik: needinfo?




Links
- GitHub: openshift cluster-authentication-operator pull 503 (open) - Bug 2019832: pkg/operator: configure high inertia for apiserver and OAuthServer (last updated 2021-11-05 15:50:40 UTC)
- Red Hat Product Errata: RHSA-2022:0056 (last updated 2022-03-10 16:25:11 UTC)

Description Devan Goodwin 2021-11-03 12:39:01 UTC
The 4.10 nightly stream is failing; example payload:

https://amd64.ocp.releases.ci.openshift.org/releasestream/4.10.0-0.nightly/release/4.10.0-0.nightly-2021-11-03-020416

Here you will see the aggregated job attempting periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade 10 times, 9 of which failed on "Operator upgrade authentication".

The 10 prow jobs are linked, but let's take this one as a sample to rally around:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528

Note that this run also has a node segfault; the other failed runs do not exhibit this, so it is unclear whether we should ignore it for this prow job.

4.10 nightlies appear fully blocked, so we are setting severity urgent on this one.

Comment 1 Michal Fojtik 2021-11-03 12:59:59 UTC
** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority, engineers are asked to stop whatever they are doing and put everything else on hold.
Please be prepared with reasonable justification ready to discuss, and ensure your own management and engineering management are aware of and agree that this BZ is urgent. Keep in mind, urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:

1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message of must-gather.
2. Give the output of "oc get clusteroperators -o yaml".
3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed? [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the other, root-cause at the top.
6. Explain why (5) is likely the right order and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.

Comment 2 Sergiusz Urbaniak 2021-11-03 15:35:17 UTC
Quickly glancing over the failure, we see the auth operator reaches a good final state:

  conditions:
  - lastTransitionTime: "2021-11-03T03:37:15Z"
    message: All is well
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-11-03T02:47:15Z"
    message: 'AuthenticatorCertKeyProgressing: All is well'
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2021-11-03T02:47:15Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2021-11-03T02:29:46Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Upgradeable

We'll have to look earlier in the event chain.

Comment 4 Sergiusz Urbaniak 2021-11-04 15:23:40 UTC
We found the following faulty behavior of cluster-authentication-operator: if at least one oauth-apiserver instance becomes unavailable, cluster-authentication-operator reports Degraded status too quickly, because it does not respect the grace (inertia) period.

Steps to reproduce:
- drain a master node, or taint the master nodes with effect NoExecute
- observe the clusteroperator status, e.g. using `watch oc get co -o wide`

Actual result:
- cluster-authentication-operator goes Degraded quickly, after ~1 minute:

$ oc get co -o wide
authentication                             4.10.0-0.ci-2021-11-02-210304   True        False         True	31h     APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...

Expected result:
- cluster-authentication-operator respects the inertia period and does not go Degraded for ~30 minutes (see the sketch below)
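
For illustration, a minimal, self-contained Go sketch of the inertia idea described above (this is not the actual library-go implementation; the types and names are made up for the example):

  // Sketch of the "inertia" grace period: the operator only publishes
  // Degraded=True once the underlying problem has persisted for longer
  // than the inertia window. Illustrative only.
  package main

  import (
      "fmt"
      "time"
  )

  type condition struct {
      status             string    // "True" or "False"
      lastTransitionTime time.Time // when the underlying check last flipped
  }

  // effectiveDegraded returns the status that should be published, holding
  // back a new Degraded=True until it has persisted past the inertia window.
  func effectiveDegraded(internal condition, inertia time.Duration, now time.Time) string {
      if internal.status == "True" && now.Sub(internal.lastTransitionTime) < inertia {
          return "False" // still within the grace period; do not go Degraded yet
      }
      return internal.status
  }

  func main() {
      start := time.Now()
      c := condition{status: "True", lastTransitionTime: start}
      fmt.Println(effectiveDegraded(c, 30*time.Minute, start.Add(1*time.Minute)))  // False: too early
      fmt.Println(effectiveDegraded(c, 30*time.Minute, start.Add(31*time.Minute))) // True: persisted past inertia
  }

With this behavior, a single unavailable oauth-apiserver replica during an upgrade would not flip the clusteroperator to Degraded unless it stayed unavailable for the whole grace period.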

Comment 5 W. Trevor King 2021-11-04 21:45:12 UTC
What gives you the impression that it should be ~30 minutes?  The default "I have no idea what the underlying reasons mean" degraded inertia is 2 minutes [1] and I don't see the auth operator setting anything specific for its reason set [2].  If you're just saying "the default 2 minutes is too quick for APIServerDeploymentDegraded; we should set inertia to 30m for this particular reason", that makes sense to me.

[1]: https://github.com/openshift/library-go/blob/7a65fdb398e28782ee1650959a5e0419121e97ae/pkg/operator/status/status_controller.go#L75
[2]: https://github.com/openshift/cluster-authentication-operator/search?q=inertia

Comment 6 W. Trevor King 2021-11-04 21:48:40 UTC
[1] shows how to use WithDegradedInertia to tune this sort of thing in the status controller.

[1]: https://github.com/openshift/library-go/blob/874db8a3dac9034969ba6497aa53aa647bfe25f8/pkg/operator/status/status_controller_test.go#L305-L315
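
For illustration only, a rough, self-contained Go sketch of the kind of per-condition inertia mapping discussed above (default 2 minutes, 30 minutes for ^APIServerDeploymentDegraded$); the type and function names here are hypothetical and are not library-go's actual API:

  // Illustrative per-condition degraded inertia: a default grace period of
  // 2 minutes, and a longer 30-minute window for condition types matching
  // ^APIServerDeploymentDegraded$. Names are made up for this sketch.
  package main

  import (
      "fmt"
      "regexp"
      "time"
  )

  type inertiaRule struct {
      typeMatcher *regexp.Regexp
      duration    time.Duration
  }

  // inertiaFor picks the grace period for a given degraded condition type,
  // falling back to the default when no rule matches.
  func inertiaFor(conditionType string, defaultInertia time.Duration, rules []inertiaRule) time.Duration {
      for _, r := range rules {
          if r.typeMatcher.MatchString(conditionType) {
              return r.duration
          }
      }
      return defaultInertia
  }

  func main() {
      rules := []inertiaRule{
          {typeMatcher: regexp.MustCompile(`^APIServerDeploymentDegraded$`), duration: 30 * time.Minute},
      }
      fmt.Println(inertiaFor("APIServerDeploymentDegraded", 2*time.Minute, rules)) // 30m0s
      fmt.Println(inertiaFor("SomeOtherDegraded", 2*time.Minute, rules))           // 2m0s
  }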

Comment 7 Sergiusz Urbaniak 2021-11-05 09:00:30 UTC
We have a custom inertia: https://github.com/openshift/library-go/pull/1228

Comment 8 W. Trevor King 2021-11-05 13:55:53 UTC
Ah, ok.  I see you calling:

  WithStatusControllerPdbCompatibleHighInertia("APIServer"))

to set up a 30m inertia for ^APIServerDeploymentDegraded$ for ControlPlaneTopology=HighlyAvailable in [1], which landed for bug 2013222.  Looks like that hasn't been verified yet.  Maybe this one should be closed as a dup of that one?

[1]: https://github.com/openshift/cluster-authentication-operator/pull/499/commits/74461bc7c56daefe100f1c7d7b3166eb1e164488#diff-0d623dfd885adb20f991bda4c2453aebd732ca6dbb4d1d4be6e79805c3b48de6R525-R526
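
To illustrate the topology gating (and the OAuthServer coverage that this bug's PR adds on top of bug 2013222), a hedged, self-contained Go sketch follows; the condition-type names, in particular the OAuthServer one, and all function names are assumptions for illustration, not the operator's actual code:

  // Sketch of topology-gated high inertia: the 30-minute grace period for the
  // apiserver and OAuthServer deployment-degraded conditions is only applied
  // when the control plane is highly available; a single-node control plane
  // has no spare replica to wait for. All names are hypothetical.
  package main

  import (
      "fmt"
      "regexp"
  )

  // highInertiaMatchers returns the condition-type patterns that should get
  // the long grace period for the given control-plane topology.
  func highInertiaMatchers(controlPlaneTopology string) []*regexp.Regexp {
      if controlPlaneTopology != "HighlyAvailable" {
          return nil // fall back to the default inertia everywhere
      }
      return []*regexp.Regexp{
          regexp.MustCompile(`^APIServerDeploymentDegraded$`),
          regexp.MustCompile(`^OAuthServerDeploymentDegraded$`), // assumed name for the oauth-server condition
      }
  }

  func main() {
      fmt.Println(highInertiaMatchers("HighlyAvailable")) // two matchers: 30m inertia applies
      fmt.Println(highInertiaMatchers("SingleReplica"))   // empty: default 2m inertia applies
  }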

Comment 12 Xingxing Xia 2021-11-23 14:26:13 UTC
Sorry for not working on this bug in a timely manner; I was occupied by other tasks. I checked the discussions above and compared with bug 2013222, and I understand this bug supplements the high inertia with coverage for the oauthserver. With both apiserver and oauthserver covered, the coverage now looks complete. I double-checked https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=Operator%20upgrade%20authentication as I did for bug 2013222, and the test still looks very good. Moving this bug to VERIFIED. The only failure found is in OVN, e.g. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade/1461846603307421696, which I think is an OVN bug, since I often saw a crashlooping apiserver in past QE CI result analysis; I will double-check tomorrow:
"Failed to upgrade authentication, operator was degraded (APIServerDeployment_UnavailablePod): APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver (crashlooping container is waiting in apiserver-68459ccbf9-rtpdq pod)"

Comment 15 errata-xmlrpc 2022-03-10 16:24:56 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

