Description of problem:

Installed OCP version 4.8.0-rc.3 on s390x using KVM and used the cluster for a few days. Now I cannot log in to the cluster; instead I get

```
oc login --server=9.12.23.92:6443 --insecure-skip-tls-verify -u kubeadmin
Authentication required for https://9.12.23.92:6443 (openshift)
Username: kubeadmin
Password:
Login failed (401 Unauthorized)
Verify you have provided correct credentials.
```

or

```
oc login --server=9.12.23.92:6443 --insecure-skip-tls-verify -u tdale
Error from server (InternalError): Internal error occurred: unexpected response: 400
```

Once I log in as system:admin, I can access the cluster with oc, and all cluster operators (including the authentication operator) show as healthy:

```
-> % oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-rc.3   True        False         False      6m57s
baremetal                                  4.8.0-rc.3   True        False         False      36h
cloud-credential                           4.8.0-rc.3   True        False         False      36h
cluster-autoscaler                         4.8.0-rc.3   True        False         False      36h
config-operator                            4.8.0-rc.3   True        False         False      36h
console                                    4.8.0-rc.3   True        False         False      38m
csi-snapshot-controller                    4.8.0-rc.3   True        False         False      36h
dns                                        4.8.0-rc.3   True        False         False      4h22m
etcd                                       4.8.0-rc.3   True        False         False      36h
image-registry                             4.8.0-rc.3   True        False         False      79m
ingress                                    4.8.0-rc.3   True        False         False      36h
insights                                   4.8.0-rc.3   True        False         False      36h
kube-apiserver                             4.8.0-rc.3   True        False         False      36h
kube-controller-manager                    4.8.0-rc.3   True        False         False      36h
kube-scheduler                             4.8.0-rc.3   True        False         False      36h
kube-storage-version-migrator              4.8.0-rc.3   True        False         False      111m
machine-api                                4.8.0-rc.3   True        False         False      36h
machine-approver                           4.8.0-rc.3   True        False         False      36h
machine-config                             4.8.0-rc.3   True        False         False      9m55s
marketplace                                4.8.0-rc.3   True        False         False      36h
monitoring                                 4.8.0-rc.3   True        False         False      6m44s
network                                    4.8.0-rc.3   True        False         False      36h
node-tuning                                4.8.0-rc.3   True        False         False      36h
openshift-apiserver                        4.8.0-rc.3   True        False         False      7m16s
openshift-controller-manager               4.8.0-rc.3   True        False         False      77m
openshift-samples                          4.8.0-rc.3   True        False         False      36h
operator-lifecycle-manager                 4.8.0-rc.3   True        False         False      36h
operator-lifecycle-manager-catalog         4.8.0-rc.3   True        False         False      36h
operator-lifecycle-manager-packageserver   4.8.0-rc.3   True        False         False      34m
service-ca                                 4.8.0-rc.3   True        False         False      36h
storage                                    4.8.0-rc.3   True        False         False      36h
```

When I look at events in the openshift-authentication namespace I get tons of the following:

```
15s   Warning   ProbeError   pod/oauth-openshift-6685f8f4b5-cj8qx   Liveness probe error: Get "https://10.129.0.44:6443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
body:
15s   Warning   Unhealthy    pod/oauth-openshift-6685f8f4b5-cj8qx   Liveness probe failed: Get "https://10.129.0.44:6443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```

and can see that the pods are Running:

```
-> % oc get pods -n openshift-authentication -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE                                          NOMINATED NODE   READINESS GATES
oauth-openshift-dbbd9559-5v2xs   1/1     Running   0          16m   10.130.0.53   master-2.pok-92.ocptest.pok.stglabs.ibm.com   <none>           <none>
oauth-openshift-dbbd9559-lqhf9   1/1     Running   0          14m   10.129.0.45   master-0.pok-92.ocptest.pok.stglabs.ibm.com   <none>           <none>
oauth-openshift-dbbd9559-xmqfz   1/1     Running   0          15m   10.128.0.50   master-1.pok-92.ocptest.pok.stglabs.ibm.com   <none>           <none>
```

These pods run for about 20 minutes before the replica sets start scaling down to 0 and then back up, as shown in the events:

```
164m   Normal   ScalingReplicaSet   deployment/oauth-openshift   Scaled down replica set oauth-openshift-8df757558 to 1
164m   Normal   ScalingReplicaSet   deployment/oauth-openshift   Scaled up replica set oauth-openshift-7f678d58b4 to 2
163m   Normal   ScalingReplicaSet   deployment/oauth-openshift   Scaled down replica set oauth-openshift-8df757558 to 0
163m   Normal   ScalingReplicaSet   deployment/oauth-openshift   Scaled up replica set oauth-openshift-7f678d58b4 to 3
103m   Normal   ScalingReplicaSet   deployment/oauth-openshift   Scaled down replica set oauth-openshift-7f678d58b4 to 2
103m   Normal   ScalingReplicaSet   deployment/oauth-openshift   Scaled up replica set oauth-openshift-9b6878b55 to 1
102m   Normal   ScalingReplicaSet   deployment/oauth-openshift
```

This isn't an issue with network access, as application routes and the console are available, although the console does not allow login either.

Version-Release number of selected component (if applicable):
Client Version: 4.8.0-202104292348.p0-a765590
Server Version: 4.8.0-rc.3
Kubernetes Version: v1.21.1+f36aa36

How reproducible:
Can't easily reproduce.

Additional info:
I had many test workloads running on the cluster before the authentication issues started. After logging in as system:admin I removed all these projects and still have the same authentication issues.
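For reference, the event and pod output above can be gathered with standard oc commands along these lines (a minimal sketch; the sort key and log tail length are my own choices, not part of the original commands):

```
# Probe failures and replica set scaling events for the OAuth server
oc get events -n openshift-authentication --sort-by=.lastTimestamp

# Pod, replica set, and log state of the oauth-openshift deployment
oc get pods -n openshift-authentication -o wide
oc get replicasets -n openshift-authentication
oc logs -n openshift-authentication deployment/oauth-openshift --tail=100
```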
Created attachment 1799395 [details] oc get logs from oauth pods and events
Created attachment 1799397 [details] login failure with loglevel=8
Please provide a must-gather.
Must-gather here https://drive.google.com/file/d/1xvei95Az6An-AKuXhDyvGPO3Q-7P4l5h/view?usp=sharing
Looks like this may be a resource-related issue. This cluster is on a shared System z server/CPC. Another LPAR with a cluster with very high resource utilization was also running. Once we stopped the cluster on the other LPAR, this cluster became accessible soon after. Note that both before and after authentication was working, running `oc describe nodes | grep Resource -A 5` never showed resource utilization above 25% on any of the nodes. For must-gather logs taken after authentication started working again, see here -> https://drive.google.com/file/d/1V7oPDOSqBi0DSg0osbG7nMX3TwuT0dr7/view?usp=sharing . If this is the case, I would still have expected `oc get co` to show failed clusteroperators, but as you can see above this was not the case.
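As a side note, `oc describe nodes | grep Resource -A 5` only reports scheduler-level requests and limits, not live consumption, and neither in-cluster view reflects contention at the hypervisor level on a shared CPC. A minimal sketch for comparing the two in-cluster views (assumes cluster monitoring/metrics are running; nothing here measures LPAR-level contention):

```
# Scheduler-level accounting: requested/limited CPU and memory per node
oc describe nodes | grep -A 5 Resource

# Live usage from the metrics pipeline (requires cluster monitoring)
oc adm top nodes
```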
Hi Tom,

The logs for the authentication operator do show it moved to degraded/unavailable a few times. For example:

```
2021-07-07T15:45:17.816543155Z I0707 15:45:17.801176 1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-07-07T15:44:11Z","message":"APIServerDeploymentDegraded: 2 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()","reason":"APIServerDeployment_UnavailablePod","status":"True","type":"Degraded"},{"lastTransitionTime":"2021-07-07T15:40:18Z","message":"AuthenticatorCertKeyProgressing: All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-07-07T15:45:15Z","message":"All is well","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-07-06T06:47:26Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
...
2021-07-07T15:45:49.905654721Z I0707 15:45:49.898832 1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-07-07T15:45:49Z","message":"All is well","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2021-07-07T15:40:18Z","message":"AuthenticatorCertKeyProgressing: All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-07-07T15:45:15Z","message":"All is well","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-07-06T06:47:26Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
```

It looks like it transitioned back to "Available": True, "Degraded": False, "Progressing": False very soon after, though, so querying `oc get co` may not have shown the state transition when you looked.

Given that the cluster operator logs show that it did transition, I'll close this one out. Let me know if you still have concerns though!
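Because these transitions reverted within roughly half a minute, a point-in-time `oc get co` can easily miss them. A minimal sketch for recording every condition change over time (assumes a cluster-admin kubeconfig; the output file name is an arbitrary choice):

```
#!/bin/bash
# Stream condition changes of the authentication clusteroperator and
# timestamp each one, so brief Degraded/Available flips are captured.
oc get clusteroperator authentication --watch \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} {end}{"\n"}' \
  | while read -r conditions; do
      echo "$(date -u +%FT%TZ)  $conditions"
    done | tee -a authentication-co-conditions.log
```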
Hey Andy,

Thanks for your help. I currently have a cluster where the authentication clusteroperator has shown as available and not degraded for the past 9 hours. I had a script trying to log in every 10 minutes, and 15 times over those 9 hours I hit this authentication failure. Is this expected behavior?

```
oc get co authentication -o yaml
...
spec: {}
status:
  conditions:
  - lastTransitionTime: "2021-07-08T19:03:42Z"
    message: All is well
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-07-08T16:14:02Z"
    message: 'AuthenticatorCertKeyProgressing: All is well'
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2021-07-09T03:51:53Z"   <-- (( 9 hours ago ))
    message: All is well
    reason: AsExpected
    status: "True"
    type: Available
...
```
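For context, the login check was roughly of this shape (a hedged sketch, not the exact script used: the server address matches the report, but the password file path, log file, and error handling are placeholders):

```
#!/bin/bash
# Periodic login probe: attempts an oc login every 10 minutes and logs
# whether it succeeded. Paths and file names below are hypothetical.
SERVER="9.12.23.92:6443"
PASS="$(cat /path/to/kubeadmin-password)"   # hypothetical credentials path

while true; do
  if oc login --server="$SERVER" --insecure-skip-tls-verify \
       -u kubeadmin -p "$PASS" >/dev/null 2>&1; then
    echo "$(date -u +%FT%TZ)  login OK"
  else
    echo "$(date -u +%FT%TZ)  login FAILED"
  fi
  sleep 600   # 10 minutes
done >> oauth-login-probe.log
```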