Bug 1948080

Summary:	authentication should not set Available=False APIServices_Error with 503s
Product:	OpenShift Container Platform	Reporter:	W. Trevor King <wking>
Component:	apiserver-auth	Assignee:	Sergiusz Urbaniak <surbania>
Status:	CLOSED ERRATA	QA Contact:
Severity:	high	Docs Contact:
Priority:	low
Version:	4.9	CC:	aos-bugs, bshirren, lszaszki, mfojtik, sttts, surbania, wlewis, xxia
Target Milestone:	---	Keywords:	Upgrades
Target Release:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	LifecycleReset tag-ci
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-03-10 16:03:07 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description W. Trevor King 2021-04-10 00:17:33 UTC

From CI runs like [1]:

  [bz-apiserver-auth] clusteroperator/authentication should not change condition/Available
    Run #0: Failed	0s
    5 unexpected clusteroperator state transitions during e2e test run 

    Apr 09 13:17:56.703 - 55s   E clusteroperator/authentication condition/Available status/False reason/APIServicesAvailable: "oauth.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)\nAPIServicesAvailable: "user.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
    Apr 09 13:24:59.530 - 66ms  E clusteroperator/authentication condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: failing or missing response from https://10.129.0.59:8443/apis/user.openshift.io/v1: Get "https://10.129.0.59:8443/apis/user.openshift.io/v1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
    Apr 09 13:25:09.520 - 64ms  E clusteroperator/authentication condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: failing or missing response from https://10.129.0.59:8443/apis/oauth.openshift.io/v1: Get "https://10.129.0.59:8443/apis/oauth.openshift.io/v1": context deadline exceeded
    Apr 09 13:25:19.602 - 92ms  E clusteroperator/authentication condition/Available status/False reason/APIServicesAvailable: "oauth.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)\nAPIServicesAvailable: "user.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
    Apr 09 13:31:04.225 - 60s   E clusteroperator/authentication condition/Available status/False reason/APIServicesAvailable: "oauth.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)\nAPIServicesAvailable: "user.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
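
The five transitions above differ widely in length: two last about a minute and three are sub-100ms blips. A minimal sketch for summarizing such an excerpt, assuming the "<date> <time> - <duration> E ..." layout shown above and only the plain "Ns" and "Nms" duration forms seen here; `summarize_outages` is a hypothetical helper name, not part of the CI tooling:

```shell
# Summarize Available=False intervals from monitor lines read on stdin.
# Assumes field 4 is "-" and field 5 is a duration such as "55s" or "66ms".
summarize_outages() {
  awk '
    $4 == "-" && /condition\/Available status\/False/ {
      d = $5
      if (d ~ /^[0-9]+ms$/)     { sub(/ms$/, "", d); ms = d + 0 }
      else if (d ~ /^[0-9]+s$/) { sub(/s$/, "", d);  ms = d * 1000 }
      else next                 # skip duration formats not seen in the excerpt
      total += ms; n++
      if (ms >= 1000) longer++
    }
    END { printf "%d transitions, %d lasting >= 1s, %d ms total\n", n, longer, total }
  '
}
```

Fed the five lines above, this reports 5 transitions, 2 of them lasting at least 1s, and 115222 ms in total.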

Very popular:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/authentication+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 17 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 20 runs, 100% failed, 95% of failures match = 95% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 18 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 19 runs, 100% failed, 79% of failures match = 79% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 10 runs, 80% failed, 50% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 9 runs, 56% failed, 60% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 9 runs, 100% failed, 89% of failures match = 89% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact

Possibly a dup of some non-update bug, but if so, please mention the test-case in that bug for Sippy ;).

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152

Comment 1 Michal Fojtik 2021-05-10 01:14:28 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 2 W. Trevor King 2021-05-10 03:18:38 UTC

Fewer matching job names in the past 24h, but 100% impact means we're still hitting this very reliably:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/authentication+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 8 runs, 88% failed, 114% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 9 runs, 100% failed, 89% of failures match = 89% impact
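
Several of the 100% figures above come from jobs with a single run, so it may help to read impact next to the run count. A minimal sketch, assuming the line layout shown in the listing above; `impact_table` is a hypothetical helper name:

```shell
# Read "failures match" lines on stdin and print "<impact> <runs> <job>",
# highest impact first, with ties broken by the larger run count.
impact_table() {
  sed -n 's/^\(.*\) (all) - \([0-9]*\) runs, .* = \([0-9]*\)% impact *$/\3 \2 \1/p' |
    sort -k1,1nr -k2,2nr
}
```

It can be appended to the pipeline above in place of the final `sort`, for example `... | grep 'failures match' | impact_table`.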

Comment 3 Michal Fojtik 2021-05-10 04:14:35 UTC

The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 4 Lukasz Szaszkiewicz 2021-05-24 07:39:25 UTC

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 5 Xingxing Xia 2021-05-25 11:30:31 UTC

Today I hit the above 503s on oauth resource requests 5 times while running `oc login` in a `for` loop during an upgrade, as tested in bug 1912820#c14.

Comment 6 Lukasz Szaszkiewicz 2021-06-11 11:57:50 UTC

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 8 Lukasz Szaszkiewicz 2021-07-05 12:29:49 UTC

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 9 Scott Dodson 2021-07-15 18:54:26 UTC

This really should've been a 4.8.0 blocker, but that intent was never conveyed to the assignees. I'm marking this as a blocker for 4.9.0 and request that we backport it to 4.8 as soon as reasonable. We really need to get rid of the negative signal we generate during upgrades when operators go degraded during normal operations.

Comment 14 Sergiusz Urbaniak 2021-08-16 12:34:22 UTC

sprint review: this BZ is being worked on.

Comment 15 Sergiusz Urbaniak 2021-09-02 13:39:58 UTC

self-assigning: need to recheck if the fixes (https://github.com/openshift/library-go/pull/1111, https://github.com/openshift/library-go/pull/1189) are in oauth-apiserver.

Comment 16 Sergiusz Urbaniak 2021-09-03 08:11:25 UTC

library-go was bumped in cluster-authentication-operator by https://github.com/openshift/cluster-authentication-operator/pull/457, so the fix from https://github.com/openshift/library-go/pull/1111 is in use now (starting from ~June 24).

This leaves https://github.com/openshift/library-go/pull/1189 which needs to be bumped in cluster-authentication-operator.

Comment 17 Sergiusz Urbaniak 2021-09-03 13:46:09 UTC

reviewed-in-sprint: waiting for PR merge

Comment 47 Lukasz Szaszkiewicz 2021-09-17 09:12:04 UTC

The following query [1] shows that availability reporting has improved but is still not perfect. Work is still ongoing to make it more resilient. For example:

Errors like "etcdserver: .*" will be fixed by https://github.com/openshift/kubernetes/pull/959

In general, network resiliency will be addressed in https://issues.redhat.com/browse/API-1139

[1] https://search.ci.openshift.org/?search=clusteroperator%2Fauthentication+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=ovn%7Csingle-node%7C4.8%7C4.7&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 48 liyao 2021-09-24 07:07:55 UTC

Comment 49 liyao 2021-09-24 07:56:23 UTC

@lukasz, using the query from Comment 47, there are 19 hits for 503 in the past 24h and 26 hits in the past 14 days. Moving the status to ASSIGNED; please take a look.
$ curl -s 'https://search.ci.openshift.org/?search=clusteroperator%2Fauthentication+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=ovn%7Csingle-node%7C4.8%7C4.7&maxMatches=5&maxBytes=20971520&groupBy=job'  | sed 's/<[^>]*>//g'  | grep "statusCode = 503"  | wc -l
19
$ curl -s 'https://search.ci.openshift.org/?search=clusteroperator%2Fauthentication+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*&maxAge=336h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=ovn%7Csingle-node%7C4.8%7C4.7&maxMatches=5&maxBytes=20971520&groupBy=job'  | sed 's/<[^>]*>//g'  | grep "statusCode = 503"  | wc -l
26


In addition, using the query from Comment 2 limited to 4.9-related CI jobs that include the fix, the following job names matched in the past 24h:
$ curl -s "https://search.ci.openshift.org/?search=clusteroperator%2Fauthentication+should+not+change+condition%2FAvailable&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job" |  sed 's/<[^>]*>//g' | grep 'failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.9-upgrade-from-nightly-4.8-ocp-remote-libvirt-ppc64le (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade (all) - 90 runs, 36% failed, 181% of failures match = 64% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade (all) - 7 runs, 71% failed, 120% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade (all) - 4 runs, 75% failed, 100% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 11 runs, 9% failed, 1000% of failures match = 91% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 10 runs, 10% failed, 900% of failures match = 90% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 11 runs, 91% failed, 90% of failures match = 82% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade (all) - 4 runs, 75% failed, 67% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade (all) - 3 runs, 33% failed, 200% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.9-e2e-metal-ipi-upgrade (all) - 3 runs, 33% failed, 300% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-21714-periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Comment 51 liyao 2021-10-13 11:21:38 UTC

Using the query from Comment 47, there are 3 hits for 503 in the past 24h and 52 hits in the past 14 days. Also, per Comment 47, work is still ongoing to make availability reporting more resilient, so moving the status to VERIFIED.

$ curl -s 'https://search.ci.openshift.org/?search=clusteroperator%2Fauthentication+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=ovn%7Csingle-node%7C4.8%7C4.7&maxMatches=5&maxBytes=20971520&groupBy=job'  | sed 's/<[^>]*>//g'  | grep "statusCode = 503"  | wc -l
3
$ curl -s 'https://search.ci.openshift.org/?search=clusteroperator%2Fauthentication+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*&maxAge=336h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=ovn%7Csingle-node%7C4.8%7C4.7&maxMatches=5&maxBytes=20971520&groupBy=job'  | sed 's/<[^>]*>//g'  | grep "statusCode = 503"  | wc -l
52
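
To compare a 24h count against a 14-day count, it can help to turn the longer window into a per-day average. A minimal sketch, assuming the HTML has already been fetched with the same `curl -s` query as above; `count_503` and `per_day` are hypothetical helper names. The query also sets maxMatches=5, which presumably caps what the page displays, so these may be counts of displayed matches rather than exact totals:

```shell
# Count lines mentioning a 503 status in search output read on stdin.
# Same result as the `sed | grep | wc -l` pipeline above; note that
# `grep -c` prints 0 but exits non-zero when nothing matches.
count_503() {
  sed 's/<[^>]*>//g' | grep -c 'statusCode = 503'
}

# per_day HITS DAYS: average hits per day, to one decimal place.
per_day() {
  awk -v hits="$1" -v days="$2" 'BEGIN { printf "%.1f\n", hits / days }'
}
```

With the numbers above, `per_day 52 14` gives 3.7, so the 3 hits in the last 24h are slightly below the 14-day average; for Comment 49, `per_day 26 14` gives 1.9 against 19 hits in the last 24h.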

Comment 55 errata-xmlrpc 2022-03-10 16:03:07 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 56 Red Hat Bugzilla 2023-09-15 01:04:56 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days