1851549 – oc commands fail intermittently with TLS Handshake timeout Errors on Azure IPI installed cluster

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.5
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.6.z
Assignee:	mcambria@redhat.com
QA Contact:	Arti Sood
Docs Contact:
URL:
Whiteboard:
Depends On:	1967994
Blocks:	1836052

Reported:	2020-06-26 22:20 UTC by Walid A.
Modified:	2024-03-25 16:06 UTC
CC List:	15 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-14 07:16:34 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 1121	None	open	Bug 1851549: Backport daemonset to drop icmp frag needed packets received from other nodes in the cluster to Rel 4.6	2021-06-11 13:54:13 UTC
Red Hat Knowledge Base (Solution)	5252831	None	None	None	2021-07-14 08:14:09 UTC
Red Hat Product Errata	RHBA-2021:2641	None	None	None	2021-07-14 07:16:53 UTC

Internal Links: 1979312

Description Walid A. 2020-06-26 22:20:40 UTC

Description of problem:
On an IPI-installed OCP v4.5.0-rc.2 cluster on Azure with 3 master and 3 worker nodes of type Standard_DS4_v2, oc commands run from more than one jump host fail intermittently with "error: You must be logged in to the server (Unauthorized)". While running the SVT reliability test (https://github.com/openshift/svt/tree/master/reliability), which creates projects, runs builds, deploys quickstart apps, deletes projects, and so on over several days, the API server logs recorded many TLS handshake errors:

# grep -c "TLS handshake error" oc_logs_apiserver*
oc_logs_apiserver-65b875f65c-87cv4_062520.txt:122543
oc_logs_apiserver-65b875f65c-pk8mc_062520.txt:0
oc_logs_apiserver-65b875f65c-wbx4w_062520.txt:0

# grep -c "TLS handshake error" kube*
kube-apiserver-qe-reliability-45-24csh-master-0_062520.txt:7
kube-apiserver-qe-reliability-45-24csh-master-1_062520.txt:0
kube-apiserver-qe-reliability-45-24csh-master-2_062520.txt:5

# oc get co
error: the server doesn't have a resource type "co"
# oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-rc.2   True        False         True       2d3h
cloud-credential                           4.5.0-rc.2   True        False         False      2d3h
cluster-autoscaler                         4.5.0-rc.2   True        False         False      2d3h
config-operator                            4.5.0-rc.2   True        False         False      2d3h
console                                    4.5.0-rc.2   True        False         False      2d3h
csi-snapshot-controller                    4.5.0-rc.2   True        False         False      2d3h
dns                                        4.5.0-rc.2   True        False         False      2d3h
etcd                                       4.5.0-rc.2   True        False         False      2d3h
image-registry                             4.5.0-rc.2   True        False         False      2d3h
ingress                                    4.5.0-rc.2   True        False         False      2d3h
insights                                   4.5.0-rc.2   True        False         False      2d3h
kube-apiserver                             4.5.0-rc.2   True        False         False      2d3h
kube-controller-manager                    4.5.0-rc.2   True        False         False      2d3h
kube-scheduler                             4.5.0-rc.2   True        False         False      2d3h
kube-storage-version-migrator              4.5.0-rc.2   True        False         False      2d3h
machine-api                                4.5.0-rc.2   True        False         False      2d3h
machine-approver                           4.5.0-rc.2   True        False         False      2d3h
machine-config                             4.5.0-rc.2   True        False         False      2d3h
marketplace                                4.5.0-rc.2   True        False         False      2d3h
monitoring                                 4.5.0-rc.2   False       False         True       121m
network                                    4.5.0-rc.2   True        False         False      2d3h
node-tuning                                4.5.0-rc.2   True        False         False      2d3h
openshift-apiserver                        4.5.0-rc.2   True        False         False      103m
openshift-controller-manager               4.5.0-rc.2   True        False         False      28h
openshift-samples                          4.5.0-rc.2   True        False         False      2d3h
operator-lifecycle-manager                 4.5.0-rc.2   True        False         False      2d3h
operator-lifecycle-manager-catalog         4.5.0-rc.2   True        False         False      2d3h
operator-lifecycle-manager-packageserver   4.5.0-rc.2   True        False         False      28h
service-ca                                 4.5.0-rc.2   True        False         False      2d3h
storage                                    4.5.0-rc.2   True        False         False      2d3h

# oc get nodes
NAME                                              STATUS   ROLES    AGE    VERSION
qe-reliability-45-24csh-master-0                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-1                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-2                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus1-j9hvv   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus2-52mnw   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus3-st8f2   Ready    worker   2d3h   v1.18.3+91d0edd

# oc get nodes
error: You must be logged in to the server (Unauthorized)

# oc get nodes
NAME                                              STATUS   ROLES    AGE    VERSION
qe-reliability-45-24csh-master-0                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-1                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-2                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus1-j9hvv   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus2-52mnw   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus3-st8f2   Ready    worker   2d3h   v1.18.3+91d0edd


# oc login -u testuser-47 -p <user_passwd> --loglevel=10
.
.
.
I0626 16:42:56.424874    2053 request.go:1068] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
I0626 16:42:56.425295    2053 helpers.go:216] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server is currently unable to handle the request (post subjectaccessreviews.authorization.openshift.io)",
  "reason": "ServiceUnavailable",
  "details": {
    "group": "authorization.openshift.io",
    "kind": "subjectaccessreviews",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "Error trying to reach service: 'net/http: TLS handshake timeout'"
      }
    ]
  },
  "code": 503
}]
F0626 16:42:56.425326    2053 helpers.go:115] Error from server (ServiceUnavailable): the server is currently unable to handle the request (post subjectaccessreviews.authorization.openshift.io)



Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.5.0-rc.2
Server Version: 4.5.0-rc.2
Kubernetes Version: v1.18.3+91d0edd

How reproducible:
Always from two separate jump hosts running the same version of oc client

Steps to Reproduce:
1.  IPI Install of OCP v4.5.0-rc.2
2.  Download and install the oc client from https://openshift-release-artifacts.svc.ci.openshift.org/4.5.0-rc.2/openshift-client-linux-4.5.0-rc.2.tar.gz
3.  Execute oc commands (oc login, oc get co) repeatedly, back to back
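
Step 3 can be scripted to get a rough failure rate. The following is only a sketch: run_repeatedly is a hypothetical helper name, and it assumes oc exits non-zero when it prints errors such as "Unauthorized" or "the server doesn't have a resource type".

```shell
# Hypothetical helper (not part of oc): run a command N times back to back
# and report how many invocations exited non-zero.
run_repeatedly() {
  local attempts=$1
  shift
  local failures=0
  local i
  for ((i = 1; i <= attempts; i++)); do
    # Output is discarded; only the exit status is counted.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
  done
  echo "$failures/$attempts failed"
}

# Example, assuming an already logged-in oc session:
#   run_repeatedly 50 oc get co
```

The helper only counts exit codes, so the actual error text still has to be checked by hand or with --loglevel=10.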

Actual results:
oc command outputs intermittently fail with:
error: You must be logged in to the server (Unauthorized)
api-server logs show TLS Handshake errors

Expected results:
No errors.  No TLS handshake errors in the api-server logs

Additional info:
Link to must-gather logs, openshift api-server, kube-api-server, and oc command outputs in next comment

Comment 2 Xingxing Xia 2020-06-28 04:00:48 UTC

Checked the must-gather in comment 1, found:
$ grep "https://10..*:8443" namespaces/openshift-apiserver-operator/pods/openshift-apiserver-operator-688cdf6c5-qmnfc/openshift-apiserver-operator/openshift-apiserver-operator/logs/current.log
...
2020-06-26T13:09:24.644499509Z I0626 13:09:24.644452       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"fce402b7-ae29-476d-a3cd-b201378f5181", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-apiserver changed: Available changed from True to False ("APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.129.0.64:8443/apis/apps.openshift.io/v1: Get https://10.129.0.64:8443/apis/apps.openshift.io/v1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)")
...

^ all grep results are from the pod with IP 10.129.0.64, which is on one master

$ grep "podIP.*10.129.0.64" namespaces/openshift-apiserver/pods/apiserver-65b875f65c-*/*.yaml
openshift-apiserver/pods/apiserver-65b875f65c-87cv4/apiserver-65b875f65c-87cv4.yaml:  podIP: 10.129.0.64
$ vi namespaces/openshift-apiserver/pods/apiserver-65b875f65c-87cv4/openshift-apiserver/openshift-apiserver/logs/current.log
...
2020-06-26T16:53:29.52567754Z I0626 16:53:29.525656       1 log.go:172] http: TLS handshake error from 10.128.0.1:52796: EOF
2020-06-26T16:53:30.611304762Z I0626 16:53:30.611244       1 log.go:172] http: TLS handshake error from 10.128.0.1:52804: EOF

^ this "TLS handshake error from 10.128.0.1:52804: EOF" is summarized in bug 1825219#c19 .
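
To check whether the handshake errors come from one client address or several, the log lines can be tallied per source IP. This is a minimal sketch: tls_errors_by_source is a hypothetical function name, and the pattern assumes IPv4 sources in the "TLS handshake error from <ip>:<port>" format shown above.

```shell
# Hypothetical helper: count "TLS handshake error" lines per source IP
# across one or more log files, most frequent source first.
tls_errors_by_source() {
  # -o prints only the matched text, -h suppresses file name prefixes;
  # the match stops at the ':' before the port, so $NF is the bare IP.
  grep -ohE 'TLS handshake error from [0-9.]+' "$@" \
    | awk '{print $NF}' \
    | sort \
    | uniq -c \
    | sort -rn
}

# Example, using the log path from this comment:
#   tls_errors_by_source namespaces/openshift-apiserver/pods/apiserver-65b875f65c-87cv4/openshift-apiserver/openshift-apiserver/logs/current.log
```

Each output line is a count followed by the source address, so a single dominant address stands out at the top.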

Comment 27 errata-xmlrpc 2021-07-14 07:16:34 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.38 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2641

Comment 29 Red Hat Bugzilla 2023-09-15 00:33:16 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
