Bug 1851549 - OCP 4.5: oc commands fail intermittently with TLS Handshake timeout Errors on Azure IPI installed cluster
Keywords:
Status: NEW
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: mcambria@redhat.com
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On: 1825219
Blocks: 1836052
 
Reported: 2020-06-26 22:20 UTC by Walid A.
Modified: 2020-08-25 01:09 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:



Description Walid A. 2020-06-26 22:20:40 UTC
Description of problem:
On an IPI-installed OCP v4.5.0-rc.2 cluster on Azure, with 3 master and 3 worker nodes of type Standard_DS4_v2, oc commands run from more than one jump host fail intermittently with "error: You must be logged in to the server (Unauthorized)". During the SVT Reliability run (https://github.com/openshift/svt/tree/master/reliability), in which we create projects, run builds, deploy quickstart apps, delete projects, etc. for several days, the API server logs recorded TLS handshake errors:

# grep -c "TLS handshake error" oc_logs_apiserver*
oc_logs_apiserver-65b875f65c-87cv4_062520.txt:122543
oc_logs_apiserver-65b875f65c-pk8mc_062520.txt:0
oc_logs_apiserver-65b875f65c-wbx4w_062520.txt:0

# grep -c "TLS handshake error" kube*
kube-apiserver-qe-reliability-45-24csh-master-0_062520.txt:7
kube-apiserver-qe-reliability-45-24csh-master-1_062520.txt:0
kube-apiserver-qe-reliability-45-24csh-master-2_062520.txt:5

# oc get co
error: the server doesn't have a resource type "co"
# oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-rc.2   True        False         True       2d3h
cloud-credential                           4.5.0-rc.2   True        False         False      2d3h
cluster-autoscaler                         4.5.0-rc.2   True        False         False      2d3h
config-operator                            4.5.0-rc.2   True        False         False      2d3h
console                                    4.5.0-rc.2   True        False         False      2d3h
csi-snapshot-controller                    4.5.0-rc.2   True        False         False      2d3h
dns                                        4.5.0-rc.2   True        False         False      2d3h
etcd                                       4.5.0-rc.2   True        False         False      2d3h
image-registry                             4.5.0-rc.2   True        False         False      2d3h
ingress                                    4.5.0-rc.2   True        False         False      2d3h
insights                                   4.5.0-rc.2   True        False         False      2d3h
kube-apiserver                             4.5.0-rc.2   True        False         False      2d3h
kube-controller-manager                    4.5.0-rc.2   True        False         False      2d3h
kube-scheduler                             4.5.0-rc.2   True        False         False      2d3h
kube-storage-version-migrator              4.5.0-rc.2   True        False         False      2d3h
machine-api                                4.5.0-rc.2   True        False         False      2d3h
machine-approver                           4.5.0-rc.2   True        False         False      2d3h
machine-config                             4.5.0-rc.2   True        False         False      2d3h
marketplace                                4.5.0-rc.2   True        False         False      2d3h
monitoring                                 4.5.0-rc.2   False       False         True       121m
network                                    4.5.0-rc.2   True        False         False      2d3h
node-tuning                                4.5.0-rc.2   True        False         False      2d3h
openshift-apiserver                        4.5.0-rc.2   True        False         False      103m
openshift-controller-manager               4.5.0-rc.2   True        False         False      28h
openshift-samples                          4.5.0-rc.2   True        False         False      2d3h
operator-lifecycle-manager                 4.5.0-rc.2   True        False         False      2d3h
operator-lifecycle-manager-catalog         4.5.0-rc.2   True        False         False      2d3h
operator-lifecycle-manager-packageserver   4.5.0-rc.2   True        False         False      28h
service-ca                                 4.5.0-rc.2   True        False         False      2d3h
storage                                    4.5.0-rc.2   True        False         False      2d3h

# oc get nodes
NAME                                              STATUS   ROLES    AGE    VERSION
qe-reliability-45-24csh-master-0                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-1                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-2                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus1-j9hvv   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus2-52mnw   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus3-st8f2   Ready    worker   2d3h   v1.18.3+91d0edd

# oc get nodes
error: You must be logged in to the server (Unauthorized)

# oc get nodes
NAME                                              STATUS   ROLES    AGE    VERSION
qe-reliability-45-24csh-master-0                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-1                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-2                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus1-j9hvv   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus2-52mnw   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus3-st8f2   Ready    worker   2d3h   v1.18.3+91d0edd


# oc login -u testuser-47 -p <user_passwd> --loglevel=10
.
.
.
I0626 16:42:56.424874    2053 request.go:1068] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
I0626 16:42:56.425295    2053 helpers.go:216] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server is currently unable to handle the request (post subjectaccessreviews.authorization.openshift.io)",
  "reason": "ServiceUnavailable",
  "details": {
    "group": "authorization.openshift.io",
    "kind": "subjectaccessreviews",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "Error trying to reach service: 'net/http: TLS handshake timeout'"
      }
    ]
  },
  "code": 503
}]
F0626 16:42:56.425326    2053 helpers.go:115] Error from server (ServiceUnavailable): the server is currently unable to handle the request (post subjectaccessreviews.authorization.openshift.io)



Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.5.0-rc.2
Server Version: 4.5.0-rc.2
Kubernetes Version: v1.18.3+91d0edd

How reproducible:
Always from two separate jump hosts running the same version of oc client

Steps to Reproduce:
1.  IPI Install of OCP v4.5.0-rc.2
2.  Download and install the oc client from https://openshift-release-artifacts.svc.ci.openshift.org/4.5.0-rc.2/openshift-client-linux-4.5.0-rc.2.tar.gz
3.  Execute oc commands (oc login, oc get co) repeatedly, back to back
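
A minimal sketch of step 3, for anyone who wants a failure rate rather than eyeballing repeated runs. The helper name repeat_and_count is hypothetical (not part of oc or the SVT tooling), and the sketch assumes an active oc session against the affected cluster. It only counts non-zero exit codes; it does not distinguish "Unauthorized" from other failures.

```shell
# Hypothetical helper: run a command N times and report how many runs failed.
# Usage: repeat_and_count <attempts> <command> [args...]
repeat_and_count() {
  attempts=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Discard the command's output; only its exit status is counted.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "${failures}/${attempts} attempts failed"
}

# Example (assumes you are logged in to the cluster):
#   repeat_and_count 50 oc get co
```

Running the same call from both jump hosts should show whether the failure rate is similar on each.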

Actual results:
oc commands intermittently fail with:
error: You must be logged in to the server (Unauthorized)
api-server logs show TLS Handshake errors

Expected results:
No errors.  No TLS handshake errors in the api-server logs

Additional info:
Link to must-gather logs, openshift api-server, kube-api-server, and oc command outputs in next comment

Comment 2 Xingxing Xia 2020-06-28 04:00:48 UTC
Checked the must-gather in comment 1, found:
$ grep "https://10..*:8443" namespaces/openshift-apiserver-operator/pods/openshift-apiserver-operator-688cdf6c5-qmnfc/openshift-apiserver-operator/openshift-apiserver-operator/logs/current.log
...
2020-06-26T13:09:24.644499509Z I0626 13:09:24.644452       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"fce402b7-ae29-476d-a3cd-b201378f5181", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-apiserver changed: Available changed from True to False ("APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.129.0.64:8443/apis/apps.openshift.io/v1: Get https://10.129.0.64:8443/apis/apps.openshift.io/v1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)")
...

^ All grep results point to the pod with IP 10.129.0.64, which runs on one master.

$ grep "podIP.*10.129.0.64" namespaces/openshift-apiserver/pods/apiserver-65b875f65c-*/*.yaml
openshift-apiserver/pods/apiserver-65b875f65c-87cv4/apiserver-65b875f65c-87cv4.yaml:  podIP: 10.129.0.64
$ vi namespaces/openshift-apiserver/pods/apiserver-65b875f65c-87cv4/openshift-apiserver/openshift-apiserver/logs/current.log
...
2020-06-26T16:53:29.52567754Z I0626 16:53:29.525656       1 log.go:172] http: TLS handshake error from 10.128.0.1:52796: EOF
2020-06-26T16:53:30.611304762Z I0626 16:53:30.611244       1 log.go:172] http: TLS handshake error from 10.128.0.1:52804: EOF

^ This "TLS handshake error from 10.128.0.1:52804: EOF" is summarized in bug 1825219#c19.
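
To check whether the handshake errors all come from the same source address, the log lines can be grouped by client IP. This is a rough sketch that assumes the log line format shown above ("TLS handshake error from <ip>:<port>") and IPv4 addresses only; the function name tls_errors_by_source is made up for illustration.

```shell
# Hypothetical helper: print "<count> <source_ip>" for every TLS handshake
# error found in the given log files, most frequent source first.
tls_errors_by_source() {
  # -h drops file names, -o keeps only the matched part of each line.
  grep -h -o 'TLS handshake error from [0-9.]*:[0-9]*' "$@" \
    | sed -e 's/^TLS handshake error from //' -e 's/:[0-9]*$//' \
    | sort | uniq -c | sort -rn \
    | awk '{print $1, $2}'
}

# Example, using the log file naming from the description:
#   tls_errors_by_source oc_logs_apiserver*
```

If a single address such as 10.128.0.1 accounts for nearly all of the lines, that would be consistent with the pattern described in the linked bug, though this grouping alone does not confirm the cause.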

