Bug 2039417 - Authentication operator is in degraded state with error RouteDegraded: Unable to get or create required route openshift-authentication/oauth-openshift: Get "https://172.30.0.1:443/apis/route.openshift.io/v1/namespaces/openshift-authentication/routes/oauth- [NEEDINFO]
Summary: Authentication operator is in degraded state with error RouteDegraded: Unable ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.9.z
Assignee: Sebastian Łaskawiec
QA Contact: Yash Tripathi
URL:
Whiteboard:
Depends On: 2048412
Blocks:
Reported: 2022-01-11 17:12 UTC by RamaKasturi
Modified: 2022-02-10 06:33 UTC

Fixed In Version:
Doc Text:
Clone Of:
Clones: 2048412
Environment:
Last Closed: 2022-02-10 06:33:21 UTC
Target Upstream Version:
slaskawi: needinfo? (knarra)
slaskawi: needinfo? (eparis)
slaskawi: needinfo? (wlewis)


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-authentication-operator pull 542 | 0 | None | Merged | Bug 2039417: remove degraded condition 4.9 | 2022-02-02 19:10:37 UTC
Red Hat Product Errata | RHBA-2022:0340 | 0 | None | None | None | 2022-02-10 06:33:38 UTC

Description RamaKasturi 2022-01-11 17:12:26 UTC
Description of problem:
The authentication operator is in a degraded state with the error "RouteDegraded: Unable to get or create required route openshift-authentication/oauth-openshift: Get "https://172.30.0.1:443/apis/route.openshift.io/v1/namespaces/openshift-authentication/routes/oauth-openshift": context canceled"

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2022-01-10-190819

How reproducible:
Hit it once

Steps to Reproduce:
1. Upgrade an OCP cluster from 4.8 to 4.9

Actual results:
The upgrade is stuck with the error "RouteDegraded: Unable to get or create required route openshift-authentication/oauth-openshift: Get "https://172.30.0.1:443/apis/route.openshift.io/v1/namespaces/openshift-authentication/routes/oauth-openshift": context canceled", and the authentication operator is in a degraded state.

Expected results:
The authentication operator should not be in a degraded state, and the upgrade should complete successfully.

Additional info:
 oc get route -n openshift-authentication oauth-openshift
NAME              HOST/PORT                                                     PATH   SERVICES          PORT   TERMINATION WILDCARD
oauth-openshift   oauth-openshift.apps.knarra0111.qe.devcluster.openshift.com          oauth-openshift   6443   passthrough/Redirect None

Even though the route exists, the authentication operator still reports that it is unable to get or create it.

Comment 1 RamaKasturi 2022-01-11 17:56:44 UTC
Must-gather can be found at the link below

http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/2039417/must-gather.local.3625109277001075849/

Comment 2 Standa Laznicka 2022-01-12 08:32:02 UTC
How long does the state last? Does it eventually get healthy? What is the status of all the other operators?

Comment 3 RamaKasturi 2022-01-12 08:58:09 UTC
Hello Standa,

   Even after 25h, the cluster is in the same state; it does not become healthy on its own. The status of all other operators is below. I learnt from xxia that deleting the pod in openshift-authentication-operator might work as a workaround, but that did not bring the cluster back to a healthy state.

[knarra@knarra ~]$ oc get co 
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.0-0.nightly-2022-01-10-190819   True        False         True       25h     RouteDegraded: Unable to get or create required route openshift-authentication/oauth-openshift: Get "https://172.30.0.1:443/apis/route.openshift.io/v1/namespaces/openshift-authentication/routes/oauth-openshift": context canceled
baremetal                                  4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
cloud-controller-manager                   4.9.0-0.nightly-2022-01-10-190819   True        False         False      22h     
cloud-credential                           4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
cluster-autoscaler                         4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
config-operator                            4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
console                                    4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
csi-snapshot-controller                    4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
dns                                        4.8.0-0.nightly-2022-01-11-000651   True        False         False      25h     
etcd                                       4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
image-registry                             4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
ingress                                    4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
insights                                   4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
kube-apiserver                             4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
kube-controller-manager                    4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
kube-scheduler                             4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
kube-storage-version-migrator              4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
machine-api                                4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
machine-approver                           4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
machine-config                             4.8.0-0.nightly-2022-01-11-000651   True        False         False      25h     
marketplace                                4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
monitoring                                 4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
network                                    4.8.0-0.nightly-2022-01-11-000651   True        False         False      25h     
node-tuning                                4.9.0-0.nightly-2022-01-10-190819   True        False         False      22h     
openshift-apiserver                        4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
openshift-controller-manager               4.9.0-0.nightly-2022-01-10-190819   True        False         False      6h43m   
openshift-samples                          4.9.0-0.nightly-2022-01-10-190819   True        False         False      22h     
operator-lifecycle-manager                 4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
operator-lifecycle-manager-catalog         4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2022-01-10-190819   True        False         False      22h     
service-ca                                 4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
storage                                    4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h    

[knarra@knarra ~]$ oc delete pod authentication-operator-69c9b6f766-ps4vs -n openshift-authentication-operator
pod "authentication-operator-69c9b6f766-ps4vs" deleted
[knarra@knarra ~]$ oc get pods -n openshift-authentication-operator
NAME                                       READY   STATUS    RESTARTS   AGE
authentication-operator-69c9b6f766-m4mcg   1/1     Running   0          71s


[knarra@knarra ~]$ oc get pods -n openshift-authentication-operator
NAME                                       READY   STATUS    RESTARTS   AGE
authentication-operator-69c9b6f766-m4mcg   1/1     Running   0          6m31s
[knarra@knarra ~]$ 
[knarra@knarra ~]$ 
[knarra@knarra ~]$ 
[knarra@knarra ~]$ oc get co 
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.0-0.nightly-2022-01-10-190819   True        False         True       25h     RouteDegraded: Unable to get or create required route openshift-authentication/oauth-openshift: Get "https://172.30.0.1:443/apis/route.openshift.io/v1/namespaces/openshift-authentication/routes/oauth-openshift": context canceled
baremetal                                  4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
cloud-controller-manager                   4.9.0-0.nightly-2022-01-10-190819   True        False         False      22h     
cloud-credential                           4.9.0-0.nightly-2022-01-10-190819   True        False         False      26h     
cluster-autoscaler                         4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
config-operator                            4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
console                                    4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
csi-snapshot-controller                    4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
dns                                        4.8.0-0.nightly-2022-01-11-000651   True        False         False      25h     
etcd                                       4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
image-registry                             4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
ingress                                    4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
insights                                   4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
kube-apiserver                             4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
kube-controller-manager                    4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
kube-scheduler                             4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
kube-storage-version-migrator              4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
machine-api                                4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
machine-approver                           4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
machine-config                             4.8.0-0.nightly-2022-01-11-000651   True        False         False      25h     
marketplace                                4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
monitoring                                 4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
network                                    4.8.0-0.nightly-2022-01-11-000651   True        False         False      25h     
node-tuning                                4.9.0-0.nightly-2022-01-10-190819   True        False         False      22h     
openshift-apiserver                        4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
openshift-controller-manager               4.9.0-0.nightly-2022-01-10-190819   True        False         False      6h54m   
openshift-samples                          4.9.0-0.nightly-2022-01-10-190819   True        False         False      22h     
operator-lifecycle-manager                 4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
operator-lifecycle-manager-catalog         4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2022-01-10-190819   True        False         False      22h     
service-ca                                 4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h     
storage                                    4.9.0-0.nightly-2022-01-10-190819   True        False         False      25h

Comment 4 Xingxing Xia 2022-01-13 09:41:07 UTC
Kasturi, let's keep comments public when they contain no sensitive info, so that customers can search for and see them.

Some debugging info:
Checked the authentication-operator-69c9b6f766-ps4vs pod logs:
The first occurrence of the error was at this timestamp:

2022-01-11T10:17:54.439647048Z I0111 10:17:54.439626       1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2022-01-11T07:31:18Z","message":"RouteDegraded: Unable to get or create required route openshift-authentication/oauth-openshift: Get \"https://172.30.0.1:443/apis/route.openshift.io/v1/namespaces/openshift-authentication/routes/oauth-openshift\": context canceled","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2022-01-11T07:40:56Z","message":"AuthenticatorCertKeyProgressing: All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2022-01-11T07:31:18Z","message":"All is well","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2022-01-11T07:03:16Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}],"versions":[{"name":"operator","version":"4.9.0-0.nightly-2022-01-10-190819"},{"name":"oauth-apiserver","version":"4.8.0-0.nightly-2022-01-11-000651"},{"name":"oauth-openshift","version":"4.8.0-0.nightly-2022-01-11-000651_openshift"}]}}

The last occurrence of the error was at this timestamp:
2022-01-11T10:22:14.586241766Z I0111 10:22:14.585095       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"4c87c634-4e7c-4b4f-8838-5defba41d436", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/authentication changed: Degraded message changed from "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver (container is waiting in pending apiserver-5c6c98f584-sjwfx pod)\nRouteDegraded: Unable to get or create required route openshift-authentication/oauth-openshift: Get \"https://172.30.0.1:443/apis/route.openshift.io/v1/namespaces/openshift-authentication/routes/oauth-openshift\": context canceled" to "RouteDegraded: Unable to get or create required route openshift-authentication/oauth-openshift: Get \"https://172.30.0.1:443/apis/route.openshift.io/v1/namespaces/openshift-authentication/routes/oauth-openshift\": context canceled"

I checked the kube-apiserver pods' logs for oauth-openshift and found that one (and only one) pod had the following entries at these timestamps:

2022-01-11T10:17:52.206302985Z E0111 10:17:52.204625      15 timeout.go:135] post-timeout activity - time-elapsed: 2.125µs, GET "/apis/route.openshift.io/v1/namespaces/openshift-authentication/routes/oauth-openshift" result: <nil>
2022-01-11T10:17:52.237308180Z E0111 10:17:52.237212      15 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/oauth.openshift.io/v1/oauthclients?allowWatchBookmarks=true&resourceVersion=102375&timeout=5m18s&timeoutSeconds=318&watch=true" audit-ID="96378777-38cf-4145-b420-6e4b688036ea"
2022-01-11T10:17:52.302103510Z E0111 10:17:52.301936      15 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/route.openshift.io/v1/namespaces/openshift-authentication/routes?allowWatchBookmarks=true&fieldSelector=metadata.name%3Doauth-openshift&resourceVersion=101930&timeout=6m14s&timeoutSeconds=374&watch=true" audit-ID="8229c002-976d-4830-9b56-0208ec9449f6"

From the above, it looks like a transient kube-apiserver timeout left the authentication-operator temporarily unable to access the route object.

My terminal session shows that at 13:04:16 the operator was still reporting Degraded, and that at 13:10:11 the authentication-operator pod could access the route:
[xxia@pres 2022-01-11 13:04:16 GMT my]$ oc describe co authentication
Name:         authentication
...
Status:
  Conditions:
    Last Transition Time:  2022-01-11T10:19:54Z
    Message:               RouteDegraded: Unable to get or create required route openshift-authentication/oauth-openshift: Get "https://172.30.0.1:443/apis/route.openshift.io/v1/namespaces/openshift-authentication/routes/oauth-openshift": context canceled
    Reason:                Route_FailedCreate
    Status:                True
    Type:                  Degraded
...

[xxia@pres 2022-01-11 13:10:11 GMT my]$ oc rsh -n openshift-authentication-operator authentication-operator-69c9b6f766-ps4vs
sh-4.4# curl -k -H "Authorization: bearer `cat /var/run/secrets/kubernetes.io/serviceaccount/token`" https://172.30.0.1:443/apis/route.openshift.io/v1/namespaces/openshift-authentication/routes/oauth-openshift
{ 
  "kind": "Route",
  "apiVersion": "route.openshift.io/v1",
  "metadata": {
    "name": "oauth-openshift",
...
}

This means that although kube-apiserver became stable again and authentication-operator could later access the route object, authentication-operator had been stuck in Degraded since 10:19:54 and, as of my debugging at 13:10:11, had still not cleared the transient error state.

Comment 5 Krzysztof Ostrowski 2022-01-20 18:12:42 UTC
Assigning to @slaskawi@redhat.com

Comment 6 Sebastian Łaskawiec 2022-01-25 09:37:31 UTC
The behavior you described may indicate that the Degraded status is simply not being cleared (due to some sort of bug). Luckily, this can easily be checked by manually erasing .status.conditions from co authentication:
    use `oc edit co authentication` or `oc patch` to achieve this

Once you check this, could you please get back to me with the results? If the error is gone for good, that means we have a bug in CAO (cluster-authentication-operator). Since this bug is from 4.9, we need a reproducer for 4.10. This code has recently been updated, so this bug might be gone. If the error comes back, we'll need to investigate further.

Comment 7 Sebastian Łaskawiec 2022-01-26 07:55:24 UTC
I believe I found the error - the degraded condition is not cleared:
- When calling `common.GetOAuthServerRoute(c.routeLister, "OAuthConfigRoute")`, we collect a route and a slice of conditions [1]
- In case of an error, the `common.GetOAuthServerRoute` properly emits a degraded condition
- However once an error has been set, there's no code that erases it.

I'm lowering the priority to medium because there's a simple workaround - just use `oc edit co authentication` and manually remove the degraded condition.

I'll start working on a proper solution soon. 

[1] https://github.com/openshift/cluster-authentication-operator/blob/697a2960f59372b81e852a1bf135f9fe4b8b9e40/pkg/controllers/payload/payload_config_controller.go#L138

Comment 8 RamaKasturi 2022-01-27 10:52:39 UTC
Hello Sebastian,

   From the comment above I see that you have found the root cause. Would you still like me to reproduce the error, or is that no longer required?

Thanks
kasturi

Comment 9 Sebastian Łaskawiec 2022-01-27 13:17:41 UTC
No, thanks! I believe I know where this one is coming from.

Comment 11 Sebastian Łaskawiec 2022-01-28 15:17:13 UTC
I spoke with the API and OC command line tools teams: `kubectl edit` will not work here because of [1][2]. It is very likely that those changes will get into Kube 1.23.

In the meantime, use commands similar to this: 

  $ oc proxy
  $ curl -k -XPATCH -H "Accept: application/json" -H "Content-Type: application/json-patch+json" 'http://127.0.0.1:8001/apis/config.openshift.io/v1/clusteroperators/testing/status' -d '[{"op": "add", "path": "/status", "value": {"conditions": [{"lastTransitionTime": "2021-06-01T01:01:01Z", "type": "Upgradeable", "status": "False", "reason": "Testing", "message": "The whatsits are broken."}]}}]'

[1] https://github.com/kubernetes/kubernetes/issues/67455
[2] https://github.com/kubernetes/kubectl/issues/564

Comment 14 Mike Fiedler 2022-01-28 20:16:34 UTC
Hit this again in QE CI on a 4.8.28 -> 4.9.19 upgrade and could not get the curl command to work. Note: I updated the operator URL to use authentication and updated the lastTransitionTime. The PATCH seemed to succeed, but the status never changed and the upgrade remained stuck. Any other ideas for working around this?

[mifiedle@mffiedler aos-4_8]$ curl -k -XPATCH -H "Accept: application/json" -H "Content-Type: application/json-patch+json" 'http://127.0.0.1:8001/apis/config.openshift.io/v1/clusteroperators/authentication/status' -d '[{"op": "add", "path": "/status", "value": {"conditions": [{"lastTransitionTime": "2022-01-28T20:10:01Z", "type": "Upgradeable", "status": "False", "reason": "Testing", "message": "The whatsits are broken."}]}}]'

<space added for readability>

{"apiVersion":"config.openshift.io/v1","kind":"ClusterOperator","metadata":{"annotations":{"exclude.release.openshift.io/internal-openshift-hosted":"true","include.release.openshift.io/self-managed-high-availability":"true","include.release.openshift.io/single-node-developer":"true"},"creationTimestamp":"2022-01-28T14:09:28Z","generation":1,"managedFields":[{"apiVersion":"config.openshift.io/v1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:exclude.release.openshift.io/internal-openshift-hosted":{},"f:include.release.openshift.io/self-managed-high-availability":{},"f:include.release.openshift.io/single-node-developer":{}}},"f:spec":{},"f:status":{}},"manager":"cluster-version-operator","operation":"Update","time":"2022-01-28T14:09:29Z"},{"apiVersion":"config.openshift.io/v1","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:conditions":{}}},"manager":"curl","operation":"Update","subresource":"status","time":"2022-01-28T20:13:26Z"}],"name":"authentication","resourceVersion":"204877","uid":"e2b658fe-dcb7-4a9f-89ad-57694d55b0b1"},"spec":{},"status":{"conditions":[{"lastTransitionTime":"2022-01-28T20:10:01Z","message":"The whatsits are broken.","reason":"Testing","status":"False","type":"Upgradeable"}]}}

[mifiedle@mffiedler aos-4_8]$ 
[mifiedle@mffiedler aos-4_8]$ oc get co authentication
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.9.0-0.nightly-2022-01-24-212243   True        False         True       29s     RouteDegraded: Unable to get or create required route openshift-authentication/oauth-openshift: Get "https://172.30.0.1:443/apis/route.openshift.io/v1/namespaces/openshift-authentication/routes/oauth-openshift": context canceled

Comment 15 Mike Fiedler 2022-01-28 20:18:37 UTC
@slaskawi@redhat.com  I think this is a 4.9 upgrade blocker unless we have a working workaround.

Comment 16 Scott Dodson 2022-01-28 20:43:41 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

    example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
    example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?

    example: Up to 2 minute disruption in edge routing
    example: Up to 90 seconds of API downtime
    example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

    example: Issue resolves itself after five minutes
    example: Admin uses oc to fix things
    example: Admin must SSH to hosts, restore from backups, or other non standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

    example: No, it has always been like this we just never noticed
    example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 25 errata-xmlrpc 2022-02-10 06:33:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.19 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0340

