Bug 1980107
| Summary: | Cannot login to cluster, oauth reports unhealthy |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | oauth-apiserver |
| Version: | 4.8 |
| Hardware: | s390x |
| OS: | Unspecified |
| Status: | CLOSED NOTABUG |
| Severity: | high |
| Priority: | unspecified |
| Reporter: | Tom Dale <tdale> |
| Assignee: | Standa Laznicka <slaznick> |
| QA Contact: | liyao |
| CC: | amccrae, aos-bugs, Holger.Wolf, krmoser, mfojtik, surbania, tdale, wlewis, wolfgang.voesch, wvoesch |
| Target Milestone: | --- |
| Target Release: | --- |
| Doc Type: | If docs needed, set a value |
| Type: | Bug |
| Regression: | --- |
| Last Closed: | 2021-07-09 09:56:47 UTC |
| Bug Blocks: | 1934148 |
Description
Tom Dale, 2021-07-07 19:17:51 UTC
Created attachment 1799395 [details]: oc get logs from oauth pods and events

Created attachment 1799397 [details]: login failure with loglevel=8
Please provide a must-gather.

Tom Dale

Looks like this may be a resource-related issue. This cluster is on a shared System z server/CPC. Another LPAR hosting a cluster with very high resource utilization was also running; once we stopped the cluster on the other LPAR, this cluster became accessible soon after. Note that both before and after authentication was working, running `oc describe nodes | grep Resource -A 5` never showed resource utilization above 25% on any of the nodes. For must-gather logs taken after authentication started working again, see https://drive.google.com/file/d/1V7oPDOSqBi0DSg0osbG7nMX3TwuT0dr7/view?usp=sharing. If this is the case, I would still have expected `oc get co` to show failed clusteroperators, but as you can see above this was not the case.

Andy McCrae

Hi Tom,

The logs for the authentication operator do show it moved to degraded/unavailable a few times, for example:

```
2021-07-07T15:45:17.816543155Z I0707 15:45:17.801176 1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-07-07T15:44:11Z","message":"APIServerDeploymentDegraded: 2 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()","reason":"APIServerDeployment_UnavailablePod","status":"True","type":"Degraded"},{"lastTransitionTime":"2021-07-07T15:40:18Z","message":"AuthenticatorCertKeyProgressing: All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-07-07T15:45:15Z","message":"All is well","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-07-06T06:47:26Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
...
2021-07-07T15:45:49.905654721Z I0707 15:45:49.898832 1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-07-07T15:45:49Z","message":"All is well","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2021-07-07T15:40:18Z","message":"AuthenticatorCertKeyProgressing: All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-07-07T15:45:15Z","message":"All is well","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-07-06T06:47:26Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
```

It looks like it transitioned back to "Available": True, "Degraded": False, "Progressing": False very soon after though, so querying `oc get co` may not have shown the state transition when you looked. Given that the cluster operator logs show that it did transition, I'll close this one out. Let me know if you still have concerns though!

Tom Dale

Hey Andy,

Thanks for your help. Currently I have a cluster whose authentication clusteroperator shows available and not degraded since 9 hours ago. I had a script trying to log in every 10 minutes, and 15 times over the past 9 hours I encountered this authentication failure. Is this expected behavior?

```
oc get co authentication -o yaml
...
spec: {}
status:
  conditions:
  - lastTransitionTime: "2021-07-08T19:03:42Z"
    message: All is well
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-07-08T16:14:02Z"
    message: 'AuthenticatorCertKeyProgressing: All is well'
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2021-07-09T03:51:53Z"   <-- (9 hours ago)
    message: All is well
    reason: AsExpected
    status: "True"
    type: Available
...
```
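The pattern discussed above, where a clusteroperator flips to Degraded and back before anyone runs `oc get co`, can be caught by polling the operator's JSON status and diffing condition states between snapshots. Below is a minimal sketch of that idea, not part of the bug report; the helper names are hypothetical, and it assumes the parsed output of `oc get clusteroperator authentication -o json` is supplied as a dict (e.g. via `subprocess` and `json.loads`):

```python
def condition_map(co):
    """Map condition type -> (status, lastTransitionTime) from a
    clusteroperator dict parsed out of `oc ... -o json` output."""
    return {c["type"]: (c["status"], c["lastTransitionTime"])
            for c in co["status"]["conditions"]}

def flips(prev, curr):
    """Return condition types whose status changed between two snapshots,
    i.e. transitions that a single point-in-time `oc get co` would miss."""
    return [t for t in curr if t in prev and prev[t][0] != curr[t][0]]
```

Running `condition_map` on each poll (say, every 10 minutes, like the login script mentioned above) and logging whatever `flips` reports would record transient Degraded/Available transitions along with their `lastTransitionTime` stamps.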