Bug 1948080
Summary: | authentication should not set Available=False APIServices_Error with 503s | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
Component: | apiserver-auth | Assignee: | Sergiusz Urbaniak <surbania> |
Status: | CLOSED ERRATA | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | low | ||
Version: | 4.9 | CC: | aos-bugs, bshirren, lszaszki, mfojtik, sttts, surbania, wlewis, xxia |
Target Milestone: | --- | Keywords: | Upgrades |
Target Release: | 4.10.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | LifecycleReset tag-ci | ||
Fixed In Version: | | Doc Type: | No Doc Update |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2022-03-10 16:03:07 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
W. Trevor King
2021-04-10 00:17:33 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Fewer matching job names in the past 24h, but 100% impact means we're still hitting this very reliably:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/authentication+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 8 runs, 88% failed, 114% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 9 runs, 100% failed, 89% of failures match = 89% impact

The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Today I hit the oauth resource request 503s described above 5 times while running oc login in a `for` loop during an upgrade, as tested in bug 1912820#c14.

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

This really should've been a 4.8.0 blocker, but that intent was never conveyed to the assignees. I'm marking this as a blocker for 4.9.0 and would request that we backport this to 4.8 as soon as reasonable. We really need to get rid of the negative signal that we generate during upgrades when operators go degraded during normal operations.

sprint review: this BZ is being worked on.

self-assigning: need to recheck if the fixes (https://github.com/openshift/library-go/pull/1111, https://github.com/openshift/library-go/pull/1189) are in oauth-apiserver.

library-go was bumped in cluster-authentication-operator (https://github.com/openshift/cluster-authentication-operator/pull/457), so the fix from https://github.com/openshift/library-go/pull/1111 is in use now (starting from ~June 24). This leaves https://github.com/openshift/library-go/pull/1189, which still needs to be bumped in cluster-authentication-operator.

reviewed-in-sprint: waiting for PR merge

The following query [1] shows that reporting the availability got better but it is still not perfect.
There is still ongoing work to make it more resilient. For example, errors like "etcdserver: .*" will be fixed by https://github.com/openshift/kubernetes/pull/959. In general, network resiliency will be addressed in https://issues.redhat.com/browse/API-1139.

[1] https://search.ci.openshift.org/?search=clusteroperator%2Fauthentication+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=ovn%7Csingle-node%7C4.8%7C4.7&maxMatches=5&maxBytes=20971520&groupBy=job

@lukasz, using the query in Comment 47, there are 19 hits for 503 in the past 24h and 26 hits in the past 14 days. Moving the status to ASSIGNED. Please help to check.

$ curl -s 'https://search.ci.openshift.org/?search=clusteroperator%2Fauthentication+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=ovn%7Csingle-node%7C4.8%7C4.7&maxMatches=5&maxBytes=20971520&groupBy=job' | sed 's/<[^>]*>//g' | grep "statusCode = 503" | wc -l
19
$ curl -s 'https://search.ci.openshift.org/?search=clusteroperator%2Fauthentication+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*&maxAge=336h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=ovn%7Csingle-node%7C4.8%7C4.7&maxMatches=5&maxBytes=20971520&groupBy=job' | sed 's/<[^>]*>//g' | grep "statusCode = 503" | wc -l
26

In addition, using the query in Comment 2 and limiting the result to 4.9-related CI that includes the fix, the following job names match in the past 24h:

$ curl -s "https://search.ci.openshift.org/?search=clusteroperator%2Fauthentication+should+not+change+condition%2FAvailable&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job" | sed 's/<[^>]*>//g' | grep 'failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.9-upgrade-from-nightly-4.8-ocp-remote-libvirt-ppc64le (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade (all) - 90 runs, 36% failed, 181% of failures match = 64% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade (all) - 7 runs, 71% failed, 120% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade (all) - 4 runs, 75% failed, 100% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 11 runs, 9% failed, 1000% of failures match = 91% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 10 runs, 10% failed, 900% of failures match = 90% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 11 runs, 91% failed, 90% of failures match = 82% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade (all) - 4 runs, 75% failed, 67% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade (all) - 3 runs, 33% failed, 200% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.9-e2e-metal-ipi-upgrade (all) - 3 runs, 33% failed, 300% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-21714-periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Using the query in Comment 47, there are 3 hits for 503 in the past 24h and 52 hits in the past 14 days. Also, based on Comment 47 ('There is still ongoing work to make it more resilient'), moving the status to Verified.

$ curl -s 'https://search.ci.openshift.org/?search=clusteroperator%2Fauthentication+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=ovn%7Csingle-node%7C4.8%7C4.7&maxMatches=5&maxBytes=20971520&groupBy=job' | sed 's/<[^>]*>//g' | grep "statusCode = 503" | wc -l
3
$ curl -s 'https://search.ci.openshift.org/?search=clusteroperator%2Fauthentication+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*&maxAge=336h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=ovn%7Csingle-node%7C4.8%7C4.7&maxMatches=5&maxBytes=20971520&groupBy=job' | sed 's/<[^>]*>//g' | grep "statusCode = 503" | wc -l
52

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
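
The 503 counts quoted in the verification comments all come from the same curl | sed | grep pipeline, with only the maxAge value changing (24h versus 336h). A minimal sketch of how that check might be wrapped for re-running is below. The function names count_503 and query_503 are hypothetical and not part of any OpenShift tooling; the URL and the sed/grep filters are copied from the Comment 47 query, and grep -c stands in for grep | wc -l.

```shell
#!/bin/sh
# Hypothetical helpers; only the URL and the sed/grep filters come from the bug comments.

# Read HTML on stdin, strip tags, and print the number of lines
# that contain "statusCode = 503" (same count as grep ... | wc -l).
# Note: grep -c prints 0 but exits non-zero when nothing matches.
count_503() {
  sed 's/<[^>]*>//g' | grep -c 'statusCode = 503'
}

# Run the Comment 47 search for the given maxAge (for example 24h or 336h)
# and print the number of 503 hits in the result page.
query_503() {
  max_age="$1"
  curl -s "https://search.ci.openshift.org/?search=clusteroperator%2Fauthentication+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*&maxAge=${max_age}&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=ovn%7Csingle-node%7C4.8%7C4.7&maxMatches=5&maxBytes=20971520&groupBy=job" | count_503
}
```

With these defined, `query_503 24h` and `query_503 336h` would reproduce the two checks from the comments; the counts depend on what CI search returns at the time, so they will not match the historical numbers recorded here.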