Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1954420

Summary: auth operator fails with OAuthRouteCheckEndpointAccessibleControllerDegraded on 4.7.8->4.8 upgrade, WAS: Unable to apply (flowcontrol.apiserver.k8s.io/v1alpha1, Kind=FlowSchema) /openshift-sdn due to missing related FlowSchema
Product: OpenShift Container Platform Reporter: Ke Wang <kewang>
Component: apiserver-auth Assignee: Sergiusz Urbaniak <surbania>
Status: CLOSED INSUFFICIENT_DATA QA Contact: liyao
Severity: high Docs Contact:
Priority: medium    
Version: 4.8 CC: aos-bugs, mfojtik, mmasters, sttts, surbania, xxia
Target Milestone: --- Flags: mfojtik: needinfo?
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: tag-ci LifecycleStale
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1954481 Environment:
Last Closed: 2021-08-16 12:18:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Comment 2 Stefan Schimanski 2021-04-28 08:29:03 UTC
This was the plan: https://bugzilla.redhat.com/show_bug.cgi?id=1913399#c1

> - We should change the API version of the p&f rules to 'v1beta1' in 4.8. 
> - For 4.7, let's leave the version to 'v1alpha1'

Comment 3 Stefan Schimanski 2021-04-28 08:55:00 UTC
I created https://bugzilla.redhat.com/show_bug.cgi?id=1954481 for the v1alpha1 flowschema issue. I doubt it was the root cause for the upgrade issue here. Rather it looks like there is an issue with ingress. Standa will take over.

Comment 4 Standa Laznicka 2021-05-06 11:47:32 UTC
From the supplied must-gather I can see in the ingress operator that it started rolling out a new deployment at 2021-04-25T23:05:02.314Z, the authentication operator started failing with "context deadlines" on route connections at 2021-04-25T23:05:34.002014279Z, and stopped failing at about 2021-04-25T23:09:56.316104382Z.

Note that the service network was fine all the time (we have different checks for that). Therefore I'm moving this to the routing team.
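
As a rough check of the timeline above, the three quoted timestamps can be compared directly. The following is a minimal sketch, not output from the must-gather: the variable names and the `parse_ts` helper are hypothetical, and the nanosecond fractions are truncated to microseconds because Python's `datetime` does not store finer precision.

```python
import re
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp, truncating fractions to microseconds."""
    m = re.fullmatch(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z", ts)
    if m is None:
        raise ValueError(f"unrecognized timestamp: {ts!r}")
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    frac = (m.group(2) or "")[:6].ljust(6, "0")
    return base.replace(microsecond=int(frac), tzinfo=timezone.utc)


# Timestamps copied from this comment; the names are illustrative only.
rollout_start = parse_ts("2021-04-25T23:05:02.314Z")
auth_first_failure = parse_ts("2021-04-25T23:05:34.002014279Z")
auth_last_failure = parse_ts("2021-04-25T23:09:56.316104382Z")

# Seconds between the ingress rollout starting and the first auth failure.
delay = (auth_first_failure - rollout_start).total_seconds()
# Seconds during which the authentication operator kept failing.
duration = (auth_last_failure - auth_first_failure).total_seconds()

print(f"first auth failure {delay:.1f}s after rollout start")
print(f"auth route failures lasted {duration:.1f}s")
```

With these inputs the first failure comes roughly 32 seconds after the rollout began, and the failures span about 4 minutes 22 seconds.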

Comment 5 Stephen Greene 2021-05-06 19:12:18 UTC
The ingress operator and console operator also check their own respective routes via periodic HTTP requests, so I find it interesting that only the authentication operator is degraded in this case.

Looking at the must-gather in Comment 1, I am noticing the following:

The ingress operator is unable to successfully complete its route check in two specific instances:

2021-04-25T23:05:02.234039306Z 2021-04-25T23:05:02.233Z ERROR   operator.canary_controller      wait/wait.go:155        error performing canary route check     {"error": "expected canary request body to contain \"Healthcheck requested\""}
2021-04-25T23:06:04.893066410Z 2021-04-25T23:06:04.892Z ERROR   operator.canary_controller      wait/wait.go:155        error performing canary route check     {"error": "expected canary request body to contain \"Healthcheck requested\""}

Note the timestamp range here: 2021-04-25T23:05:00 -> 2021-04-25T23:06:00 approximately.

The console operator is also unable to complete its route check in several instances:

2021-04-25T23:05:21.529068924Z E0425 23:05:21.529007       1 status.go:78] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ugd-13682.qe.devcluster.openshift.com/health): <snip>
2021-04-25T23:05:27.332228322Z E0425 23:05:27.332172       1 status.go:78] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ugd-13682.qe.devcluster.openshift.com/health): <snip>
2021-04-25T23:05:33.356335391Z E0425 23:05:33.347681       1 status.go:78] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ugd-13682.qe.devcluster.openshift.com/health): <snip>
2021-04-25T23:05:39.173808147Z E0425 23:05:39.172484       1 status.go:78] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ugd-13682.qe.devcluster.openshift.com/health): <snip>
2021-04-25T23:05:44.952864967Z E0425 23:05:44.952800       1 status.go:78] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ugd-13682.qe.devcluster.openshift.com/health): <snip>
2021-04-25T23:05:50.738659254Z E0425 23:05:50.738581       1 status.go:78] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ugd-13682.qe.devcluster.openshift.com/health): <snip>
2021-04-25T23:05:56.541376962Z E0425 23:05:56.541302       1 status.go:78] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ugd-13682.qe.devcluster.openshift.com/health): <snip>

Again we can observe Ingress disruption approximately from 2021-04-25T23:05:00 -> 2021-04-25T23:06:00. This could be related to nodes rebooting during upgrades, among other things, and shouldn't be a deal breaker.
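
The disruption windows can be derived from the leading timestamps of the log lines quoted above. This is only a sketch under stated assumptions: the timestamps are copied from this comment, while the operator labels, `parse_ts`, and `failure_window` are hypothetical names introduced for illustration.

```python
from datetime import datetime, timezone


def parse_ts(ts):
    """Parse a UTC log timestamp, truncating nanoseconds to microseconds."""
    head, _, frac = ts.rstrip("Z").partition(".")
    dt = datetime.strptime(head, "%Y-%m-%dT%H:%M:%S")
    return dt.replace(microsecond=int(frac[:6].ljust(6, "0")), tzinfo=timezone.utc)


# Leading timestamps of the error lines quoted in this comment.
failures = {
    "ingress canary": [
        "2021-04-25T23:05:02.234039306Z",
        "2021-04-25T23:06:04.893066410Z",
    ],
    "console route": [
        "2021-04-25T23:05:21.529068924Z",
        "2021-04-25T23:05:27.332228322Z",
        "2021-04-25T23:05:33.356335391Z",
        "2021-04-25T23:05:39.173808147Z",
        "2021-04-25T23:05:44.952864967Z",
        "2021-04-25T23:05:50.738659254Z",
        "2021-04-25T23:05:56.541376962Z",
    ],
}


def failure_window(timestamps):
    """Return the earliest and latest failure time in a list of timestamps."""
    parsed = sorted(parse_ts(t) for t in timestamps)
    return parsed[0], parsed[-1]


for name, stamps in failures.items():
    first, last = failure_window(stamps)
    span = (last - first).total_seconds()
    print(f"{name}: {first:%H:%M:%S} -> {last:%H:%M:%S} ({span:.0f}s, {len(stamps)} errors)")
```

Both windows fall between 23:05:02 and 23:06:05, which matches the approximate one-minute disruption described here.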

The authentication operator stops logging route connectivity issues at 2021-04-25T23:09:56, which is a few minutes after the other operators have already done so (I'm not pasting any log snippets here because the messages are very long).

However, despite the normal log messages that carry into 2021-04-26, the authentication operator still shows degraded for reason "OAuthRouteCheckEndpointAccessibleController_SyncError".
I suspect that Ingress for authentication is actually working moments after 2021-04-25T23:09:56, but for whatever reason, the operator stays degraded.

Re-assigning to Standa to investigate why the authentication cluster-operator is reporting degraded on 2021-04-26 without any corresponding OAuthRouteCheckEndpointAccessibleController_SyncError events or messages being visible after 2021-04-25T23:09:56.
Apologies if I am barking up the wrong tree here, but I do not suspect this is an Ingress issue given that the console and ingress operators are reporting that all is well. Maybe I don't quite understand how the authentication route probing logic works. Let me know if you have any questions, Standa, or if there is anything else I can do to help here.

On a side note, I wonder if this issue is reproducible using later versions of OCP 4.7 and 4.8. And in general, I suspect this issue might not be 100% reproducible even then.

Comment 6 Michal Fojtik 2021-06-05 19:29:06 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 7 Sergiusz Urbaniak 2021-08-16 12:18:53 UTC
Closing, as we don't have sufficient data to investigate.