Bug 1980107
| Summary: | Cannot login to cluster, oauth reports unhealthy |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | oauth-apiserver |
| Version: | 4.8 |
| Hardware: | s390x |
| OS: | Unspecified |
| Status: | CLOSED NOTABUG |
| Severity: | high |
| Priority: | unspecified |
| Reporter: | Tom Dale <tdale> |
| Assignee: | Standa Laznicka <slaznick> |
| QA Contact: | liyao |
| CC: | amccrae, aos-bugs, Holger.Wolf, krmoser, mfojtik, surbania, tdale, wlewis, wolfgang.voesch, wvoesch |
| Target Milestone: | --- |
| Target Release: | --- |
| Doc Type: | If docs needed, set a value |
| Type: | Bug |
| Regression: | --- |
| Last Closed: | 2021-07-09 09:56:47 UTC |
| Bug Blocks: | 1934148 |
Description
Tom Dale, 2021-07-07 19:17:51 UTC
Created attachment 1799395 [details]: oc get logs from oauth pods and events

Created attachment 1799397 [details]: login failure with loglevel=8
Please provide a must-gather.

Tom Dale

Looks like this may be a resource-related issue. This cluster is on a shared System z server/CPC. Another LPAR hosting a cluster with very high resource utilization was also running; once we stopped the cluster on the other LPAR, this cluster became accessible soon after. Note that both before and after authentication was working, running `oc describe nodes | grep Resource -A 5` never showed resource utilization above 25% on any of the nodes. For must-gather logs taken after authentication started working again, see https://drive.google.com/file/d/1V7oPDOSqBi0DSg0osbG7nMX3TwuT0dr7/view?usp=sharing. If this is the case, I would still have expected `oc get co` to show failed clusteroperators, but as you can see above this was not the case.

Andy McCrae

Hi Tom,

The logs for the authentication operator do show it moved to degraded/unavailable a few times, for example:

```
2021-07-07T15:45:17.816543155Z I0707 15:45:17.801176 1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-07-07T15:44:11Z","message":"APIServerDeploymentDegraded: 2 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()","reason":"APIServerDeployment_UnavailablePod","status":"True","type":"Degraded"},{"lastTransitionTime":"2021-07-07T15:40:18Z","message":"AuthenticatorCertKeyProgressing: All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-07-07T15:45:15Z","message":"All is well","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-07-06T06:47:26Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
...
2021-07-07T15:45:49.905654721Z I0707 15:45:49.898832 1 status_controller.go:211] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2021-07-07T15:45:49Z","message":"All is well","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2021-07-07T15:40:18Z","message":"AuthenticatorCertKeyProgressing: All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-07-07T15:45:15Z","message":"All is well","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-07-06T06:47:26Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
```

It looks like it transitioned back to "Available": True, "Degraded": False, "Progressing": False very soon after though, so querying `oc get co` may not have shown the state transition when you looked. Given that the cluster operator logs show that it did transition, I'll close this one out. Let me know if you still have concerns though!

Tom Dale

Hey Andy,

Thanks for your help. Currently I have a cluster whose authentication clusteroperator shows available and not degraded since 9 hours ago. I had a script trying to log in every 10 minutes, and 15 times over the past 9 hours I encountered this authentication failure. Is this expected behavior?

```
oc get co authentication -o yaml
...
spec: {}
status:
  conditions:
  - lastTransitionTime: "2021-07-08T19:03:42Z"
    message: All is well
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-07-08T16:14:02Z"
    message: 'AuthenticatorCertKeyProgressing: All is well'
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2021-07-09T03:51:53Z"   <-- (9 hours ago)
    message: All is well
    reason: AsExpected
    status: "True"
    type: Available
...
```
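The pattern discussed above, where a clusteroperator flips to Degraded and back before anyone runs `oc get co`, can be caught by polling the operator's JSON status and diffing condition states between snapshots. Below is a minimal sketch of that idea, not part of the bug report; the helper names are hypothetical, and it assumes the parsed output of `oc get clusteroperator authentication -o json` is supplied as a dict (e.g. via `subprocess` and `json.loads`):

```python
def condition_map(co):
    """Map condition type -> (status, lastTransitionTime) from a
    clusteroperator dict parsed out of `oc ... -o json` output."""
    return {c["type"]: (c["status"], c["lastTransitionTime"])
            for c in co["status"]["conditions"]}

def flips(prev, curr):
    """Return condition types whose status changed between two snapshots,
    i.e. transitions that a single point-in-time `oc get co` would miss."""
    return [t for t in curr if t in prev and prev[t][0] != curr[t][0]]
```

Running `condition_map` on each poll (say, every 10 minutes, like the login script mentioned above) and logging whatever `flips` reports would record transient Degraded/Available transitions along with their `lastTransitionTime` stamps.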