Description of problem:

On an IPI-installed OCP v4.5.0-rc.2 cluster on Azure, with 3 master and 3 worker nodes of type Standard_DS4_v2, oc commands from more than one jump host fail intermittently with "error: You must be logged in to the server (Unauthorized)".

While running the SVT Reliability test (https://github.com/openshift/svt/tree/master/reliability), which creates projects, runs builds, deploys quickstart apps, deletes projects, etc. for several days, the apiserver logs accumulated a large number of TLS handshake errors (over 120,000 in one openshift-apiserver pod):

# grep -c "TLS handshake error" oc_logs_apiserver*
oc_logs_apiserver-65b875f65c-87cv4_062520.txt:122543
oc_logs_apiserver-65b875f65c-pk8mc_062520.txt:0
oc_logs_apiserver-65b875f65c-wbx4w_062520.txt:0

# grep -c "TLS handshake error" kube*
kube-apiserver-qe-reliability-45-24csh-master-0_062520.txt:7
kube-apiserver-qe-reliability-45-24csh-master-1_062520.txt:0
kube-apiserver-qe-reliability-45-24csh-master-2_062520.txt:5

Back-to-back oc commands succeed and fail intermittently:

# oc get co
error: the server doesn't have a resource type "co"

# oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-rc.2   True        False         True       2d3h
cloud-credential                           4.5.0-rc.2   True        False         False      2d3h
cluster-autoscaler                         4.5.0-rc.2   True        False         False      2d3h
config-operator                            4.5.0-rc.2   True        False         False      2d3h
console                                    4.5.0-rc.2   True        False         False      2d3h
csi-snapshot-controller                    4.5.0-rc.2   True        False         False      2d3h
dns                                        4.5.0-rc.2   True        False         False      2d3h
etcd                                       4.5.0-rc.2   True        False         False      2d3h
image-registry                             4.5.0-rc.2   True        False         False      2d3h
ingress                                    4.5.0-rc.2   True        False         False      2d3h
insights                                   4.5.0-rc.2   True        False         False      2d3h
kube-apiserver                             4.5.0-rc.2   True        False         False      2d3h
kube-controller-manager                    4.5.0-rc.2   True        False         False      2d3h
kube-scheduler                             4.5.0-rc.2   True        False         False      2d3h
kube-storage-version-migrator              4.5.0-rc.2   True        False         False      2d3h
machine-api                                4.5.0-rc.2   True        False         False      2d3h
machine-approver                           4.5.0-rc.2   True        False         False      2d3h
machine-config                             4.5.0-rc.2   True        False         False      2d3h
marketplace                                4.5.0-rc.2   True        False         False      2d3h
monitoring                                 4.5.0-rc.2   False       False         True       121m
network                                    4.5.0-rc.2   True        False         False      2d3h
node-tuning                                4.5.0-rc.2   True        False         False      2d3h
openshift-apiserver                        4.5.0-rc.2   True        False         False      103m
openshift-controller-manager               4.5.0-rc.2   True        False         False      28h
openshift-samples                          4.5.0-rc.2   True        False         False      2d3h
operator-lifecycle-manager                 4.5.0-rc.2   True        False         False      2d3h
operator-lifecycle-manager-catalog         4.5.0-rc.2   True        False         False      2d3h
operator-lifecycle-manager-packageserver   4.5.0-rc.2   True        False         False      28h
service-ca                                 4.5.0-rc.2   True        False         False      2d3h
storage                                    4.5.0-rc.2   True        False         False      2d3h

# oc get nodes
NAME                                              STATUS   ROLES    AGE    VERSION
qe-reliability-45-24csh-master-0                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-1                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-2                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus1-j9hvv   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus2-52mnw   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus3-st8f2   Ready    worker   2d3h   v1.18.3+91d0edd

# oc get nodes
error: You must be logged in to the server (Unauthorized)

# oc get nodes
NAME                                              STATUS   ROLES    AGE    VERSION
qe-reliability-45-24csh-master-0                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-1                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-2                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus1-j9hvv   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus2-52mnw   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus3-st8f2   Ready    worker   2d3h   v1.18.3+91d0edd
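To quantify the flakiness, a loop can repeat a read-only command back to back and tally the Unauthorized failures. This is a minimal sketch added for illustration (not from the original report; the iteration count is arbitrary):

# Repeat a read-only oc command and count intermittent Unauthorized failures.
total=200; fail=0
for i in $(seq 1 "$total"); do
  # Capture stderr only; a healthy invocation prints nothing here.
  err=$(oc get nodes 2>&1 >/dev/null)
  echo "$err" | grep -q "Unauthorized" && fail=$((fail + 1))
done
echo "Unauthorized failures: $fail/$total"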
A verbose login shows the 401 is ultimately caused by a 503 from the aggregated authorization API, which in turn hit a TLS handshake timeout:

# oc login -u testuser-47 -p <user_passwd> --loglevel=10
. . .
I0626 16:42:56.424874 2053 request.go:1068] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
I0626 16:42:56.425295 2053 helpers.go:216] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server is currently unable to handle the request (post subjectaccessreviews.authorization.openshift.io)",
  "reason": "ServiceUnavailable",
  "details": {
    "group": "authorization.openshift.io",
    "kind": "subjectaccessreviews",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "Error trying to reach service: 'net/http: TLS handshake timeout'"
      }
    ]
  },
  "code": 503
}]
F0626 16:42:56.425326 2053 helpers.go:115] Error from server (ServiceUnavailable): the server is currently unable to handle the request (post subjectaccessreviews.authorization.openshift.io)

Version-Release number of selected component (if applicable):

# oc version
Client Version: 4.5.0-rc.2
Server Version: 4.5.0-rc.2
Kubernetes Version: v1.18.3+91d0edd

How reproducible:
Always, from two separate jump hosts running the same version of the oc client.

Steps to Reproduce:
1. IPI install of OCP v4.5.0-rc.2.
2. Download and install the oc client from https://openshift-release-artifacts.svc.ci.openshift.org/4.5.0-rc.2/openshift-client-linux-4.5.0-rc.2.tar.gz
3. Execute oc commands (oc login, oc get co, etc.), repeating them back to back.

Actual results:
oc commands intermittently fail with:
  error: You must be logged in to the server (Unauthorized)
The apiserver logs show TLS handshake errors.

Expected results:
No errors, and no TLS handshake errors in the apiserver logs.

Additional info:
Link to must-gather logs, openshift-apiserver logs, kube-apiserver logs, and oc command outputs in the next comment.
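The per-pod TLS handshake error counts quoted in the description came from saved log files; equivalent counts can be pulled straight from a live cluster. A rough sketch, assuming cluster-admin access (pod names differ per install, and unrelated pods in these namespaces will simply report 0):

# Count "TLS handshake error" occurrences per apiserver pod, live.
for ns in openshift-apiserver openshift-kube-apiserver; do
  for pod in $(oc -n "$ns" get pods -o name); do
    echo -n "$ns $pod: "
    oc -n "$ns" logs --all-containers "$pod" 2>/dev/null | grep -c "TLS handshake error"
  done
done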
Checked the must-gather in comment 1, found:

$ grep "https://10..*:8443" namespaces/openshift-apiserver-operator/pods/openshift-apiserver-operator-688cdf6c5-qmnfc/openshift-apiserver-operator/openshift-apiserver-operator/logs/current.log
...
2020-06-26T13:09:24.644499509Z I0626 13:09:24.644452 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"fce402b7-ae29-476d-a3cd-b201378f5181", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-apiserver changed: Available changed from True to False ("APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.129.0.64:8443/apis/apps.openshift.io/v1: Get https://10.129.0.64:8443/apis/apps.openshift.io/v1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)")
...

^ All grep results are from the pod with IP 10.129.0.64, which is on one master.

$ grep "podIP.*10.129.0.64" namespaces/openshift-apiserver/pods/apiserver-65b875f65c-*/*.yaml
openshift-apiserver/pods/apiserver-65b875f65c-87cv4/apiserver-65b875f65c-87cv4.yaml: podIP: 10.129.0.64

$ vi namespaces/openshift-apiserver/pods/apiserver-65b875f65c-87cv4/openshift-apiserver/openshift-apiserver/logs/current.log
...
2020-06-26T16:53:29.52567754Z I0626 16:53:29.525656 1 log.go:172] http: TLS handshake error from 10.128.0.1:52796: EOF
2020-06-26T16:53:30.611304762Z I0626 16:53:30.611244 1 log.go:172] http: TLS handshake error from 10.128.0.1:52804: EOF
...

^ This "TLS handshake error from 10.128.0.1:52804: EOF" is summarized in bug 1825219#c19.
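The same mapping can be reproduced against a live cluster instead of a must-gather. A sketch (the IP is the one from the operator events above; the jsonpath expression is just one way to read the aggregated API's Available condition):

# Which openshift-apiserver pod owns the endpoint IP the operator flagged?
BAD_IP=10.129.0.64
oc -n openshift-apiserver get pods -o wide | grep "$BAD_IP"
# Check the Available condition on the aggregated API named in the event:
oc get apiservice v1.apps.openshift.io \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}{"\n"}'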
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.38 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2641