Bug 1954420
| Summary: | auth operator fails with OAuthRouteCheckEndpointAccessibleControllerDegraded on 4.7.8->4.8 upgrade, WAS: Unable to apply (flowcontrol.apiserver.k8s.io/v1alpha1, Kind=FlowSchema) /openshift-sdn due to missing related FlowSchema | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ke Wang <kewang> |
| Component: | apiserver-auth | Assignee: | Sergiusz Urbaniak <surbania> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | liyao |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.8 | CC: | aos-bugs, mfojtik, mmasters, sttts, surbania, xxia |
| Target Milestone: | --- | Flags: | mfojtik: needinfo? |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | tag-ci LifecycleStale | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1954481 (view as bug list) | Environment: | |
| Last Closed: | 2021-08-16 12:18:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Comment 2
Stefan Schimanski
2021-04-28 08:29:03 UTC
I created https://bugzilla.redhat.com/show_bug.cgi?id=1954481 for the v1alpha1 flowschema issue. I doubt it was the root cause of the upgrade issue here; rather, it looks like there is an issue with ingress. Standa will take over.

From the supplied must-gather I can see that the ingress operator started rolling out a new deployment at 2021-04-25T23:05:02.314Z, the authentication operator started failing with "context deadlines" on route connections at 2021-04-25T23:05:34.002014279Z, and stopped failing at about 2021-04-25T23:09:56.316104382Z. Note that the service network was fine the whole time (we have separate checks for that). Therefore I'm moving this to the routing team.

The ingress operator and console operator also check their own respective routes via periodic HTTP requests, so I find it interesting that only the authentication operator is degraded in this case.

Looking at the must-gather in Comment 1, I am noticing the following. The ingress operator is unable to successfully complete its route check in two specific instances:

2021-04-25T23:05:02.234039306Z 2021-04-25T23:05:02.233Z ERROR operator.canary_controller wait/wait.go:155 error performing canary route check {"error": "expected canary request body to contain \"Healthcheck requested\""}
2021-04-25T23:06:04.893066410Z 2021-04-25T23:06:04.892Z ERROR operator.canary_controller wait/wait.go:155 error performing canary route check {"error": "expected canary request body to contain \"Healthcheck requested\""}

Note the timestamp range here: approximately 2021-04-25T23:05:00 -> 2021-04-25T23:06:00.

The console operator is likewise unable to complete its route check in several instances:

2021-04-25T23:05:21.529068924Z E0425 23:05:21.529007 1 status.go:78] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ugd-13682.qe.devcluster.openshift.com/health): <snip>
2021-04-25T23:05:27.332228322Z E0425 23:05:27.332172 1 status.go:78] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ugd-13682.qe.devcluster.openshift.com/health): <snip>
2021-04-25T23:05:33.356335391Z E0425 23:05:33.347681 1 status.go:78] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ugd-13682.qe.devcluster.openshift.com/health): <snip>
2021-04-25T23:05:39.173808147Z E0425 23:05:39.172484 1 status.go:78] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ugd-13682.qe.devcluster.openshift.com/health): <snip>
2021-04-25T23:05:44.952864967Z E0425 23:05:44.952800 1 status.go:78] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ugd-13682.qe.devcluster.openshift.com/health): <snip>
2021-04-25T23:05:50.738659254Z E0425 23:05:50.738581 1 status.go:78] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ugd-13682.qe.devcluster.openshift.com/health): <snip>
2021-04-25T23:05:56.541376962Z E0425 23:05:56.541302 1 status.go:78] RouteHealthDegraded FailedGet failed to GET route (https://console-openshift-console.apps.ugd-13682.qe.devcluster.openshift.com/health): <snip>

Again we can observe ingress disruption approximately from 2021-04-25T23:05:00 -> 2021-04-25T23:06:00. This could be related to nodes rebooting during upgrades, among other things, and shouldn't be a deal breaker.
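For context on the checks quoted above: each of these operators essentially issues a periodic HTTP GET against its own route and reports itself degraded when the request fails or returns an unexpected body. The sketch below only illustrates that pattern; it is not the actual controller code, and the URL, expected body substring, interval, and function names are hypothetical.

```go
// Illustration only: a periodic route health probe in the spirit of the
// console/ingress operator checks discussed above. Not the real controllers.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// probeRoute performs one GET against the route URL and verifies that the
// response body contains the expected substring; a real controller would
// turn the returned error into a Degraded condition.
func probeRoute(client *http.Client, url, want string) error {
	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("failed to GET route (%s): %w", url, err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return fmt.Errorf("failed to read body from %s: %w", url, err)
	}
	if !strings.Contains(string(body), want) {
		return fmt.Errorf("expected route response to contain %q", want)
	}
	return nil
}

func main() {
	// A short client timeout makes a broken ingress path surface as a
	// request failure ("context deadline"-style) instead of hanging.
	client := &http.Client{Timeout: 10 * time.Second}
	url := "https://console-openshift-console.apps.example.com/health" // hypothetical route

	// Re-check on a fixed interval; here we just print the result.
	for range time.Tick(30 * time.Second) {
		if err := probeRoute(client, url, "ok"); err != nil { // "ok" is a placeholder body
			fmt.Println("route check failed:", err)
			continue
		}
		fmt.Println("route check succeeded")
	}
}
```

The point of the sketch is simply that a transient ingress blip during node reboots should produce a short burst of failed probes and then clear, which matches what the ingress and console operators show here.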
The authentication operator stops logging route connectivity issues at 2021-04-25T23:09:56, which is a few minutes after the other operators have already done so (I'm not pasting any log snippets here because the messages are very long). However, despite the normal log messages that carry into 2021-04-26, the authentication operator still shows degraded for reason "OAuthRouteCheckEndpointAccessibleController_SyncError". I suspect that ingress for authentication is actually working moments after 2021-04-25T23:09:56, but for whatever reason the operator stays degraded.

Re-assigning to Standa to investigate why the authentication cluster operator is reporting degraded on 2021-04-26 without any corresponding OAuthRouteCheckEndpointAccessibleController_SyncError events or messages being visible after 2021-04-25T23:09:56. Apologies if I am barking up the wrong tree here, but I do not suspect this is an ingress issue, given that the console and ingress operators are reporting that all is well. Maybe I don't quite understand how the authentication route probing logic works. Let me know if you have any questions, Standa, or if there is anything else I can do to help here.

On a side note, I wonder if this issue is reproducible using later versions of OCP 4.7 and 4.8. In general, I suspect this issue might not be 100% reproducible even then.

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason, or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.

Closing out as we don't have sufficient data to investigate.
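If anyone does retry this on later 4.7/4.8 builds, one way to watch for the stuck condition described above is to read the authentication ClusterOperator's status conditions directly. The following is a minimal debugging sketch using the Kubernetes dynamic client; it assumes a reachable kubeconfig and is not part of any operator.

```go
// Debugging sketch: print the status conditions of the "authentication"
// ClusterOperator so a stale Degraded condition (e.g. reason
// OAuthRouteCheckEndpointAccessibleController_SyncError) is easy to spot.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// ClusterOperator is a cluster-scoped resource under config.openshift.io/v1.
	gvr := schema.GroupVersionResource{
		Group:    "config.openshift.io",
		Version:  "v1",
		Resource: "clusteroperators",
	}
	co, err := dyn.Resource(gvr).Get(context.TODO(), "authentication", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Print type/status/reason for each condition.
	conditions, _, _ := unstructured.NestedSlice(co.Object, "status", "conditions")
	for _, c := range conditions {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		fmt.Printf("%v=%v reason=%v\n", cond["type"], cond["status"], cond["reason"])
	}
}
```

The quicker equivalent is of course `oc get clusteroperator authentication -o yaml` and inspecting status.conditions; the sketch is only useful if you want to poll the condition programmatically while reproducing the upgrade.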