Bug 2013222 - Full breakage for nightly payload promotion [NEEDINFO]
Summary: Full breakage for nightly payload promotion
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Wally
QA Contact: Xingxing Xia
URL:
Whiteboard: EmergencyRequest
Depends On:
Blocks:
Reported: 2021-10-12 12:24 UTC by Devan Goodwin
Modified: 2021-11-16 14:12 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
mfojtik: needinfo?



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-authentication-operator pull 499 | 0 | None | open | Bug 2013222: pkg/operator: configure PDB high inertia | 2021-10-13 08:00:55 UTC
Github | openshift cluster-openshift-apiserver-operator pull 479 | 0 | None | open | Bug 2013222: wire apiservercontrollerset.WithStatusControllerPdbCompatibleHighInertia | 2021-10-13 07:55:28 UTC
Github | openshift library-go pull 1228 | 0 | None | Merged | Bug 2013222: operator/apiserver/controllerset: add options to WithClusterOperatorStatusController | 2021-10-13 07:55:29 UTC

Description Devan Goodwin 2021-10-12 12:24:56 UTC
The CI test "Operator upgrade authentication" is failing frequently; see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=Operator%20upgrade%20authentication

All nightly payloads have been rejected since 4.10.0-0.nightly-2021-10-10-083341

https://amd64.ocp.releases.ci.openshift.org/#4.10.0-0.nightly

David Eads has debugged this and provided a prototype solution here: https://github.com/openshift/library-go/pull/1227

Comment 1 Devan Goodwin 2021-10-12 12:27:48 UTC
This also affects the test at https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=Operator%20upgrade%20openshift-apiserver which perhaps makes it clearer why we chose this component first.

Comment 2 David Eads 2021-10-12 12:34:41 UTC
This is currently blocking payload promotion of nightlies which are used by QE and the extended company/org to consume new features and develop on top of.  Given the 100% block of promotion, urgent priority is appropriate here.

Comment 3 Michal Fojtik 2021-10-12 12:38:34 UTC
** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority Engineers are asked to stop whatever they are doing, putting everything else on hold.
Please be prepared to have reasonable justification ready to discuss, and ensure your own and engineering management are aware and agree this BZ is urgent. Keep in mind, urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:

1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message of must-gather.
2. Give the output of "oc get clusteroperators -o yaml".
3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the other, root-cause at the top.
6. Explain why (5) is likely the right order and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.

Comment 4 David Eads 2021-10-12 12:43:28 UTC
This is currently blocking payload promotion of nightlies which are used by QE and the extended company/org to consume new features and develop on top of.  Given the 100% block of promotion, urgent priority is appropriate here.

Comment 5 Michal Fojtik 2021-10-12 13:08:33 UTC
** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority Engineers are asked to stop whatever they are doing, putting everything else on hold.
Please be prepared to have reasonable justification ready to discuss, and ensure your own and engineering management are aware and agree this BZ is urgent. Keep in mind, urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:

1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message of must-gather.
2. Give the output of "oc get clusteroperators -o yaml".
3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the other, root-cause at the top.
6. Explain why (5) is likely the right order and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.

Comment 9 Xingxing Xia 2021-11-16 14:12:48 UTC
Sorry for not checking this sooner; I was occupied with other work. Today I looked at the code in https://github.com/openshift/library-go/pull/1228 and it seems hard to understand. (I will have a look at https://github.com/openshift/library-go/pull/1227 which it replaced and which has some discussion that should help.) For now I have to rely on the CI results at the links in comment 0, which show the issue is no longer seen. So moving to VERIFIED.

