Bug 1980107 - Cannot login to cluster, oauth reports unhealthy
Summary: Cannot login to cluster, oauth reports unhealthy
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oauth-apiserver
Version: 4.8
Hardware: s390x
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Standa Laznicka
QA Contact: liyao
URL:
Whiteboard:
Depends On:
Blocks: ocp-48-z-tracker
 
Reported: 2021-07-07 19:17 UTC by Tom Dale
Modified: 2021-07-09 13:17 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-09 09:56:47 UTC
Target Upstream Version:
Embargoed:


Attachments
oc get logs from oauth pods and events (178.77 KB, text/plain), 2021-07-07 19:20 UTC, Tom Dale
login failure with loglevel=8 (12.12 KB, text/plain), 2021-07-07 19:23 UTC, Tom Dale

Description Tom Dale 2021-07-07 19:17:51 UTC
Description of problem:
Installed OCP version 4.8.0-rc.3 on s390x using KVM and used the cluster for a few days. Now I cannot log in to the cluster; instead I get:

```
oc login --server=9.12.23.92:6443 --insecure-skip-tls-verify -u kubeadmin
Authentication required for https://9.12.23.92:6443 (openshift)
Username: kubeadmin
Password:
Login failed (401 Unauthorized)
Verify you have provided correct credentials.
```
or
```
oc login --server=9.12.23.92:6443 --insecure-skip-tls-verify -u tdale
Error from server (InternalError): Internal error occurred: unexpected response: 400
```
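For reference, the "login failure with loglevel=8" attachment below was captured with roughly the same command plus verbose client-side logging:
```
# Same login attempt with verbose client logging; this is roughly how the
# "login failure with loglevel=8" attachment was captured
oc login --server=9.12.23.92:6443 --insecure-skip-tls-verify -u tdale --loglevel=8
```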

Once I log in as system:admin, I can access the cluster with oc, and all cluster operators (including the authentication operator) report healthy:
```
-> % oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-rc.3   True        False         False      6m57s
baremetal                                  4.8.0-rc.3   True        False         False      36h
cloud-credential                           4.8.0-rc.3   True        False         False      36h
cluster-autoscaler                         4.8.0-rc.3   True        False         False      36h
config-operator                            4.8.0-rc.3   True        False         False      36h
console                                    4.8.0-rc.3   True        False         False      38m
csi-snapshot-controller                    4.8.0-rc.3   True        False         False      36h
dns                                        4.8.0-rc.3   True        False         False      4h22m
etcd                                       4.8.0-rc.3   True        False         False      36h
image-registry                             4.8.0-rc.3   True        False         False      79m
ingress                                    4.8.0-rc.3   True        False         False      36h
insights                                   4.8.0-rc.3   True        False         False      36h
kube-apiserver                             4.8.0-rc.3   True        False         False      36h
kube-controller-manager                    4.8.0-rc.3   True        False         False      36h
kube-scheduler                             4.8.0-rc.3   True        False         False      36h
kube-storage-version-migrator              4.8.0-rc.3   True        False         False      111m
machine-api                                4.8.0-rc.3   True        False         False      36h
machine-approver                           4.8.0-rc.3   True        False         False      36h
machine-config                             4.8.0-rc.3   True        False         False      9m55s
marketplace                                4.8.0-rc.3   True        False         False      36h
monitoring                                 4.8.0-rc.3   True        False         False      6m44s
network                                    4.8.0-rc.3   True        False         False      36h
node-tuning                                4.8.0-rc.3   True        False         False      36h
openshift-apiserver                        4.8.0-rc.3   True        False         False      7m16s
openshift-controller-manager               4.8.0-rc.3   True        False         False      77m
openshift-samples                          4.8.0-rc.3   True        False         False      36h
operator-lifecycle-manager                 4.8.0-rc.3   True        False         False      36h
operator-lifecycle-manager-catalog         4.8.0-rc.3   True        False         False      36h
operator-lifecycle-manager-packageserver   4.8.0-rc.3   True        False         False      34m
service-ca                                 4.8.0-rc.3   True        False         False      36h
storage                                    4.8.0-rc.3   True        False         False      36h
```
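For context, the system:admin access mentioned above bypasses the oauth flow entirely by using the certificate-based kubeconfig written by the installer; the path below is only illustrative:
```
# Certificate-based admin access, independent of the oauth server
# (install directory path is illustrative, not the actual one used here)
export KUBECONFIG=/path/to/install-dir/auth/kubeconfig
oc whoami   # reports system:admin
oc get co   # output shown above
```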

When I look at events in the openshift-authentication namespace, I see many of the following:

```
15s         Warning   ProbeError          pod/oauth-openshift-6685f8f4b5-cj8qx    Liveness probe error: Get "https://10.129.0.44:6443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
body:
15s         Warning   Unhealthy           pod/oauth-openshift-6685f8f4b5-cj8qx    Liveness probe failed: Get "https://10.129.0.44:6443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```
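To narrow the probe failures down, the probe definition can be dumped and the endpoint hit directly from the node hosting the pod; a rough sketch (pod IP copied from the event above):
```
# Show the liveness probe configured on the oauth-openshift deployment
oc -n openshift-authentication get deployment oauth-openshift \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'

# From the master hosting the failing pod, hit the probe endpoint directly
# (-k because the serving cert is signed by an internal CA, -m 5 to mimic
# a short probe timeout)
curl -k -m 5 https://10.129.0.44:6443/healthz
```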
I can also see that the pods are Running:
```
-> % oc get pods -n openshift-authentication -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE                                          NOMINATED NODE   READINESS GATES
oauth-openshift-dbbd9559-5v2xs   1/1     Running   0          16m   10.130.0.53   master-2.pok-92.ocptest.pok.stglabs.ibm.com   <none>           <none>
oauth-openshift-dbbd9559-lqhf9   1/1     Running   0          14m   10.129.0.45   master-0.pok-92.ocptest.pok.stglabs.ibm.com   <none>           <none>
oauth-openshift-dbbd9559-xmqfz   1/1     Running   0          15m   10.128.0.50   master-1.pok-92.ocptest.pok.stglabs.ibm.com   <none>           <none>

```


These pods run for about 20 minutes before the ReplicaSets start scaling down to 0 and then back up, as shown in the events:
```
164m        Normal    ScalingReplicaSet   deployment/oauth-openshift              Scaled down replica set oauth-openshift-8df757558 to 1
164m        Normal    ScalingReplicaSet   deployment/oauth-openshift              Scaled up replica set oauth-openshift-7f678d58b4 to 2
163m        Normal    ScalingReplicaSet   deployment/oauth-openshift              Scaled down replica set oauth-openshift-8df757558 to 0
163m        Normal    ScalingReplicaSet   deployment/oauth-openshift              Scaled up replica set oauth-openshift-7f678d58b4 to 3
103m        Normal    ScalingReplicaSet   deployment/oauth-openshift              Scaled down replica set oauth-openshift-7f678d58b4 to 2
103m        Normal    ScalingReplicaSet   deployment/oauth-openshift              Scaled up replica set oauth-openshift-9b6878b55 to 1
102m        Normal    ScalingReplicaSet   deployment/oauth-openshift                      
```
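The same churn is also visible from the deployment's rollout history (sketch; revision output will differ):
```
# List the revisions the oauth-openshift deployment has rolled through
oc -n openshift-authentication rollout history deployment/oauth-openshift

# And the ReplicaSets left behind by those rollouts
oc -n openshift-authentication get rs
```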

This does not appear to be a network access issue, since application routes and the console are reachable, although the console does not allow login either.
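The route-level check behind that statement is roughly the following (the route host is a placeholder for this cluster's apps domain):
```
# Confirm the oauth route exists and is admitted
oc -n openshift-authentication get route oauth-openshift

# Probe its health endpoint from outside the cluster
# (host is a placeholder for the real apps domain)
curl -k https://oauth-openshift.apps.<cluster-domain>/healthz
```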


Version-Release number of selected component (if applicable):
Client Version: 4.8.0-202104292348.p0-a765590
Server Version: 4.8.0-rc.3
Kubernetes Version: v1.21.1+f36aa36

How reproducible:
Can't easily reproduce

Additional info:
I had many test workloads running on the cluster before the authentication issues started. After logging in as system:admin I removed all of these projects and still have the same authentication issues.

Comment 1 Tom Dale 2021-07-07 19:20:20 UTC
Created attachment 1799395 [details]
oc get logs from oauth pods and events

Comment 2 Tom Dale 2021-07-07 19:23:51 UTC
Created attachment 1799397 [details]
login failure with loglevel=8

Comment 3 Standa Laznicka 2021-07-08 07:34:38 UTC
Please provide a must-gather.
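For reference, the usual collection command (destination directory is just an example):
```
# Collect a default must-gather; --dest-dir is optional and the path is
# only an example
oc adm must-gather --dest-dir=./must-gather-1980107
```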

Comment 5 Tom Dale 2021-07-08 18:45:45 UTC
Looks like this may be a resource-related issue. This cluster is on a shared System z server/CPC. Another LPAR hosting a cluster with very high resource utilization was also running. Once we stopped the cluster on the other LPAR, this cluster became accessible soon after. Note that both before and after authentication started working, running `oc describe nodes | grep Resource -A 5` never showed resource utilization above 25% on any of the nodes. For must-gather logs taken after authentication started working again, see https://drive.google.com/file/d/1V7oPDOSqBi0DSg0osbG7nMX3TwuT0dr7/view?usp=sharing
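For reference, the resource checks I mean were along these lines (greps are approximate):
```
# Per-node requests/limits summary (source of the 25% figure above)
oc describe nodes | grep Resource -A 5

# Node pressure conditions, in case the LPAR contention showed up there
oc describe nodes | grep -i pressure
```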

If this is the case, I would still have expected `oc get co` to show failed clusteroperators, but as you can see above this was not the case.

Comment 6 Andy McCrae 2021-07-09 09:56:47 UTC
Hi Tom,

The logs for the authentication operator do show it moved to degraded/unavailable a few times. For example:

2021-07-07T15:45:17.816543155Z I0707 15:45:17.801176       1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-07-07T15:44:11Z","message":"APIServerDeploymentDegraded: 2 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()","reason":"APIServerDeployment_UnavailablePod","status":"True","type":"Degraded"},{"lastTransitionTime":"2021-07-07T15:40:18Z","message":"AuthenticatorCertKeyProgressing: All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-07-07T15:45:15Z","message":"All is well","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-07-06T06:47:26Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
...
2021-07-07T15:45:49.905654721Z I0707 15:45:49.898832       1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-07-07T15:45:49Z","message":"All is well","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2021-07-07T15:40:18Z","message":"AuthenticatorCertKeyProgressing: All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-07-07T15:45:15Z","message":"All is well","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-07-06T06:47:26Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}

It looks like it transitioned back to "Available": True, "Degraded": False, "Progressing": False very soon after, so querying `oc get co` may not have shown the state transition when you looked.
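If you want to catch those short-lived transitions without tailing the operator logs, something like this works (sketch):
```
# Watch the authentication clusteroperator for condition changes live
oc get co authentication -w

# Or print each condition with its lastTransitionTime
oc get co authentication \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.lastTransitionTime}{"\n"}{end}'
```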

Given that the cluster operator logs show that it did transition, I'll close this one out. Let me know if you still have concerns though!

Comment 7 Tom Dale 2021-07-09 13:17:04 UTC
Hey Andy,

Thanks for your help. Currently I have a cluster where the authentication clusteroperator has shown Available and not Degraded for the past 9 hours. I had a script trying to log in every 10 minutes, and 15 times over those 9 hours it hit this authentication failure. Is this expected behavior?
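The check in that script was roughly of this shape (password variable and log file are placeholders; the real script just recorded failure timestamps):
```
#!/bin/bash
# Try an oauth login every 10 minutes and log failures.
while true; do
  if ! oc login --server=https://9.12.23.92:6443 --insecure-skip-tls-verify \
       -u kubeadmin -p "${KUBEADMIN_PASSWORD}" >/dev/null 2>&1; then
    echo "$(date -u +%FT%TZ) login failed" >> login-failures.log
  fi
  sleep 600
done
```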

```
oc get co authentication -o yaml
...
spec: {}
status:
  conditions:
  - lastTransitionTime: "2021-07-08T19:03:42Z"
    message: All is well
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-07-08T16:14:02Z"
    message: 'AuthenticatorCertKeyProgressing: All is well'
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2021-07-09T03:51:53Z"   <-- (( 9 Hours ago ))
    message: All is well
    reason: AsExpected
    status: "True"
    type: Available
...
```

