The 4.10 nightly stream is failing. Example payload: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.10.0-0.nightly/release/4.10.0-0.nightly-2021-11-03-020416

In there you will see the aggregated job attempting periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade 10x, 9 of which failed on "Operator upgrade authentication". The 10 prow jobs are linked, but let's take this one as a sample to rally around: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528

Note that this run also has a node segfault; the other failed runs do not exhibit this, so it is unclear whether we should ignore it on this prow job. 4.10 nightlies appear fully blocked, so we are setting severity urgent on this one.
** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority, Engineers are asked to stop whatever they are doing and put everything else on hold. Please be prepared to have reasonable justification ready to discuss, and ensure your own and engineering management are aware and agree this BZ is urgent. Keep in mind, urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:

1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message of must-gather.
2. Give the output of "oc get clusteroperators -o yaml".
3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed? [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the others, root cause at the top.
6. Explain why (5) is likely the right order and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.
Quickly glancing over the failure, we see the auth operator reaches a good final state:

  conditions:
  - lastTransitionTime: "2021-11-03T03:37:15Z"
    message: All is well
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-11-03T02:47:15Z"
    message: 'AuthenticatorCertKeyProgressing: All is well'
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2021-11-03T02:47:15Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2021-11-03T02:29:46Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Upgradeable

We'll have to look earlier in the events chain.
We found the following faulty behavior of cluster-authentication-operator: if at least one oauth-apiserver instance becomes unavailable, cluster-authentication-operator reports Degraded too quickly, as it does not respect the inertia grace period.

Steps to reproduce:
- drain a master node, or taint it with master=NoExecute
- observe the clusteroperator status, e.g. using `watch oc get co -o wide`

Actual result:
- cluster-authentication-operator goes Degraded quickly, after ~1 minute:

  $ oc get co -o wide
  authentication   4.10.0-0.ci-2021-11-02-210304   True   False   True   31h   APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...

Expected result:
- cluster-authentication-operator respects the inertia period and does not go Degraded until ~30 minutes
What gives you the impression that it should be ~30 minutes? The default "I have no idea what the underlying reasons mean" degraded inertia is 2 minutes [1], and I don't see the auth operator setting anything specific for its reason set [2]. If you're just saying "the default 2 minutes is too quick for APIServerDeploymentDegraded; we should set inertia to 30m for this particular reason", that makes sense to me.

[1]: https://github.com/openshift/library-go/blob/7a65fdb398e28782ee1650959a5e0419121e97ae/pkg/operator/status/status_controller.go#L75
[2]: https://github.com/openshift/cluster-authentication-operator/search?q=inertia
[1] shows how to use WithDegradedInertia to tune this sort of thing in the status controller. [1]: https://github.com/openshift/library-go/blob/874db8a3dac9034969ba6497aa53aa647bfe25f8/pkg/operator/status/status_controller_test.go#L305-L315
We have a custom inertia: https://github.com/openshift/library-go/pull/1228
Ah, ok. I see you calling:

  WithStatusControllerPdbCompatibleHighInertia("APIServer")

to set up a 30m inertia for ^APIServerDeploymentDegraded$ when ControlPlaneTopology=HighlyAvailable in [1], which landed for bug 2013222. It looks like that hasn't been verified yet. Maybe this one should be closed as a dup of that one?

[1]: https://github.com/openshift/cluster-authentication-operator/pull/499/commits/74461bc7c56daefe100f1c7d7b3166eb1e164488#diff-0d623dfd885adb20f991bda4c2453aebd732ca6dbb4d1d4be6e79805c3b48de6R525-R526
Sorry for not working on this bug in a timely manner; I was occupied by other tasks. Having checked the discussion above and compared with bug 2013222, I understand that this bug supplements it by covering the oauthserver with the high inertia. With both apiserver and oauthserver covered, the coverage now looks complete. I double-checked https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=Operator%20upgrade%20authentication as I did when verifying bug 2013222, and the test still looks very good. Moving this bug to VERIFIED.

The only failure I found was in OVN, e.g. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade/1461846603307421696, which I think is an OVN bug, since I often saw crashlooping apiservers in past QE CI result analysis; I will double-check tomorrow:

  "Failed to upgrade authentication, operator was degraded (APIServerDeployment_UnavailablePod): APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver (crashlooping container is waiting in apiserver-68459ccbf9-rtpdq pod)"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056