Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1851549

Summary: oc commands fail intermittently with TLS Handshake timeout Errors on Azure IPI installed cluster
Product: OpenShift Container Platform
Reporter: Walid A. <wabouham>
Component: Networking
Assignee: mcambria <mcambria>
Networking sub component: openshift-sdn
QA Contact: Arti Sood <asood>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: aconstan, anusaxen, aos-bugs, bbennett, kuiwang, mcambria, mfojtik, mifiedle, oarribas, palonsor, rsandu, talessio, vuberti, xxia, zzhao
Version: 4.5
Keywords: FastFix
Target Milestone: ---
Target Release: 4.6.z
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-14 07:16:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1967994
Bug Blocks: 1836052

Description Walid A. 2020-06-26 22:20:40 UTC
Description of problem:
On an IPI-installed OCP v4.5.0-rc.2 cluster on Azure, with 3 master and 3 worker nodes of type Standard_DS4_v2, oc commands run from more than one jump host fail intermittently with "error: You must be logged in to the server (Unauthorized)". During the SVT Reliability run (https://github.com/openshift/svt/tree/master/reliability), which creates projects and builds, deploys quickstart apps, deletes projects, etc. over several days, the API server logs accumulated a large number of TLS handshake errors, almost all of them on a single openshift-apiserver pod:

# grep -c "TLS handshake error" oc_logs_apiserver*
oc_logs_apiserver-65b875f65c-87cv4_062520.txt:122543
oc_logs_apiserver-65b875f65c-pk8mc_062520.txt:0
oc_logs_apiserver-65b875f65c-wbx4w_062520.txt:0

# grep -c "TLS handshake error" kube*
kube-apiserver-qe-reliability-45-24csh-master-0_062520.txt:7
kube-apiserver-qe-reliability-45-24csh-master-1_062520.txt:0
kube-apiserver-qe-reliability-45-24csh-master-2_062520.txt:5

# oc get co
error: the server doesn't have a resource type "co"
# oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-rc.2   True        False         True       2d3h
cloud-credential                           4.5.0-rc.2   True        False         False      2d3h
cluster-autoscaler                         4.5.0-rc.2   True        False         False      2d3h
config-operator                            4.5.0-rc.2   True        False         False      2d3h
console                                    4.5.0-rc.2   True        False         False      2d3h
csi-snapshot-controller                    4.5.0-rc.2   True        False         False      2d3h
dns                                        4.5.0-rc.2   True        False         False      2d3h
etcd                                       4.5.0-rc.2   True        False         False      2d3h
image-registry                             4.5.0-rc.2   True        False         False      2d3h
ingress                                    4.5.0-rc.2   True        False         False      2d3h
insights                                   4.5.0-rc.2   True        False         False      2d3h
kube-apiserver                             4.5.0-rc.2   True        False         False      2d3h
kube-controller-manager                    4.5.0-rc.2   True        False         False      2d3h
kube-scheduler                             4.5.0-rc.2   True        False         False      2d3h
kube-storage-version-migrator              4.5.0-rc.2   True        False         False      2d3h
machine-api                                4.5.0-rc.2   True        False         False      2d3h
machine-approver                           4.5.0-rc.2   True        False         False      2d3h
machine-config                             4.5.0-rc.2   True        False         False      2d3h
marketplace                                4.5.0-rc.2   True        False         False      2d3h
monitoring                                 4.5.0-rc.2   False       False         True       121m
network                                    4.5.0-rc.2   True        False         False      2d3h
node-tuning                                4.5.0-rc.2   True        False         False      2d3h
openshift-apiserver                        4.5.0-rc.2   True        False         False      103m
openshift-controller-manager               4.5.0-rc.2   True        False         False      28h
openshift-samples                          4.5.0-rc.2   True        False         False      2d3h
operator-lifecycle-manager                 4.5.0-rc.2   True        False         False      2d3h
operator-lifecycle-manager-catalog         4.5.0-rc.2   True        False         False      2d3h
operator-lifecycle-manager-packageserver   4.5.0-rc.2   True        False         False      28h
service-ca                                 4.5.0-rc.2   True        False         False      2d3h
storage                                    4.5.0-rc.2   True        False         False      2d3h

# oc get nodes
NAME                                              STATUS   ROLES    AGE    VERSION
qe-reliability-45-24csh-master-0                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-1                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-2                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus1-j9hvv   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus2-52mnw   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus3-st8f2   Ready    worker   2d3h   v1.18.3+91d0edd

# oc get nodes
error: You must be logged in to the server (Unauthorized)

# oc get nodes
NAME                                              STATUS   ROLES    AGE    VERSION
qe-reliability-45-24csh-master-0                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-1                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-2                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus1-j9hvv   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus2-52mnw   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus3-st8f2   Ready    worker   2d3h   v1.18.3+91d0edd


# oc login -u testuser-47 -p <user_passwd> --loglevel=10
.
.
.
I0626 16:42:56.424874    2053 request.go:1068] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
I0626 16:42:56.425295    2053 helpers.go:216] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server is currently unable to handle the request (post subjectaccessreviews.authorization.openshift.io)",
  "reason": "ServiceUnavailable",
  "details": {
    "group": "authorization.openshift.io",
    "kind": "subjectaccessreviews",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "Error trying to reach service: 'net/http: TLS handshake timeout'"
      }
    ]
  },
  "code": 503
}]
F0626 16:42:56.425326    2053 helpers.go:115] Error from server (ServiceUnavailable): the server is currently unable to handle the request (post subjectaccessreviews.authorization.openshift.io)



Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.5.0-rc.2
Server Version: 4.5.0-rc.2
Kubernetes Version: v1.18.3+91d0edd

How reproducible:
Always from two separate jump hosts running the same version of oc client

Steps to Reproduce:
1.  IPI Install of OCP v4.5.0-rc.2
2.  Download and install the oc client from https://openshift-release-artifacts.svc.ci.openshift.org/4.5.0-rc.2/openshift-client-linux-4.5.0-rc.2.tar.gz
3.  Execute oc commands (oc login, oc get co) repeatedly, back to back
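
Step 3 can be scripted so that the failure rate is measured rather than eyeballed. The following is only a sketch: repeat_and_count is a hypothetical helper name, not part of oc or the SVT tooling, and it assumes that the command under test exits non-zero when it fails.

```shell
# Hypothetical helper (not part of oc): run a command N times back to back
# and print how many of the runs exited with a non-zero status.
# Usage: repeat_and_count <runs> <command> [args...]
repeat_and_count() {
  rc_runs=$1
  shift
  rc_failures=0
  rc_i=0
  while [ "$rc_i" -lt "$rc_runs" ]; do
    # Output is discarded; only the exit status is inspected.
    if ! "$@" >/dev/null 2>&1; then
      rc_failures=$((rc_failures + 1))
    fi
    rc_i=$((rc_i + 1))
  done
  echo "$rc_failures/$rc_runs failed"
}

# Example invocation (needs a logged-in oc session against the cluster):
#   repeat_and_count 50 oc get co
```

Because the command is passed as arguments, the same helper can wrap oc get nodes or oc login. It reports only a count, so the individual error messages still have to be captured separately if they are needed.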

Actual results:
oc commands intermittently fail with:
error: You must be logged in to the server (Unauthorized)
The API server logs show TLS handshake errors

Expected results:
No errors.  No TLS handshake errors in the api-server logs

Additional info:
Link to must-gather logs, openshift api-server, kube-api-server, and oc command outputs in next comment

Comment 2 Xingxing Xia 2020-06-28 04:00:48 UTC
Checked the must-gather in comment 1, found:
$ grep "https://10..*:8443" namespaces/openshift-apiserver-operator/pods/openshift-apiserver-operator-688cdf6c5-qmnfc/openshift-apiserver-operator/openshift-apiserver-operator/logs/current.log
...
2020-06-26T13:09:24.644499509Z I0626 13:09:24.644452       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"fce402b7-ae29-476d-a3cd-b201378f5181", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-apiserver changed: Available changed from True to False ("APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.129.0.64:8443/apis/apps.openshift.io/v1: Get https://10.129.0.64:8443/apis/apps.openshift.io/v1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)")
...

^ all grep results are from the pod with IP 10.129.0.64, which is on one master

$ grep "podIP.*10.129.0.64" namespaces/openshift-apiserver/pods/apiserver-65b875f65c-*/*.yaml
openshift-apiserver/pods/apiserver-65b875f65c-87cv4/apiserver-65b875f65c-87cv4.yaml:  podIP: 10.129.0.64
$ vi namespaces/openshift-apiserver/pods/apiserver-65b875f65c-87cv4/openshift-apiserver/openshift-apiserver/logs/current.log
...
2020-06-26T16:53:29.52567754Z I0626 16:53:29.525656       1 log.go:172] http: TLS handshake error from 10.128.0.1:52796: EOF
2020-06-26T16:53:30.611304762Z I0626 16:53:30.611244       1 log.go:172] http: TLS handshake error from 10.128.0.1:52804: EOF

^ this "TLS handshake error from 10.128.0.1:52804: EOF" is summarized in bug 1825219#c19.
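
To see which client addresses account for the handshake errors in a log such as current.log, the lines can be tallied per source IP. This is a minimal sketch that assumes the log format shown above ("TLS handshake error from <ip>:<port>: EOF"); tls_errors_by_source is a hypothetical function name.

```shell
# Hypothetical helper: read apiserver log lines on stdin and print
# "<count> <source-ip>" for every client IP that appears in a
# "TLS handshake error from <ip>:<port>: ..." line, highest count first.
tls_errors_by_source() {
  awk '
    /TLS handshake error from/ {
      for (i = 2; i < NF; i++) {
        if ($(i - 1) == "error" && $i == "from") {
          split($(i + 1), parts, ":")   # "10.128.0.1:52796:" -> "10.128.0.1"
          count[parts[1]]++
          break
        }
      }
    }
    END { for (ip in count) print count[ip], ip }
  ' | sort -k1,1nr -k2,2
}

# Example invocation against a must-gather log file:
#   tls_errors_by_source < current.log
```

A single dominant source address in the output points at one client or gateway, while errors spread evenly over many addresses suggest a problem on the serving side.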

Comment 27 errata-xmlrpc 2021-07-14 07:16:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.38 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2641

Comment 29 Red Hat Bugzilla 2023-09-15 00:33:16 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.