Bug 1851549 - oc commands fail intermittently with TLS Handshake timeout Errors on Azure IPI installed cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.z
Assignee: mcambria@redhat.com
QA Contact: Arti Sood
URL:
Whiteboard:
Depends On: 1967994
Blocks: 1836052
 
Reported: 2020-06-26 22:20 UTC by Walid A.
Modified: 2023-12-15 18:19 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-14 07:16:34 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-network-operator pull 1121 | 0 | None | open | Bug 1851549: Backport daemonset to drop icmp frag needed packets received from other nodes in the cluster to Rel 4.6 | 2021-06-11 13:54:13 UTC
Red Hat Knowledge Base (Solution) | 5252831 | 0 | None | None | None | 2021-07-14 08:14:09 UTC
Red Hat Product Errata | RHBA-2021:2641 | 0 | None | None | None | 2021-07-14 07:16:53 UTC

Internal Links: 1979312

Description Walid A. 2020-06-26 22:20:40 UTC
Description of problem:
On an IPI-installed OCP v4.5.0-rc.2 cluster on Azure with 3 master and 3 worker nodes of type Standard_DS4_v2, oc commands run from more than one jump host fail intermittently with "error: You must be logged in to the server (Unauthorized)". During the SVT Reliability run (https://github.com/openshift/svt/tree/master/reliability), in which we create projects, run builds, deploy quickstart apps, delete projects, and so on for several days, the apiserver logs recorded numerous TLS handshake errors:

# grep -c "TLS handshake error" oc_logs_apiserver*
oc_logs_apiserver-65b875f65c-87cv4_062520.txt:122543
oc_logs_apiserver-65b875f65c-pk8mc_062520.txt:0
oc_logs_apiserver-65b875f65c-wbx4w_062520.txt:0

# grep -c "TLS handshake error" kube*
kube-apiserver-qe-reliability-45-24csh-master-0_062520.txt:7
kube-apiserver-qe-reliability-45-24csh-master-1_062520.txt:0
kube-apiserver-qe-reliability-45-24csh-master-2_062520.txt:5

# oc get co
error: the server doesn't have a resource type "co"
# oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-rc.2   True        False         True       2d3h
cloud-credential                           4.5.0-rc.2   True        False         False      2d3h
cluster-autoscaler                         4.5.0-rc.2   True        False         False      2d3h
config-operator                            4.5.0-rc.2   True        False         False      2d3h
console                                    4.5.0-rc.2   True        False         False      2d3h
csi-snapshot-controller                    4.5.0-rc.2   True        False         False      2d3h
dns                                        4.5.0-rc.2   True        False         False      2d3h
etcd                                       4.5.0-rc.2   True        False         False      2d3h
image-registry                             4.5.0-rc.2   True        False         False      2d3h
ingress                                    4.5.0-rc.2   True        False         False      2d3h
insights                                   4.5.0-rc.2   True        False         False      2d3h
kube-apiserver                             4.5.0-rc.2   True        False         False      2d3h
kube-controller-manager                    4.5.0-rc.2   True        False         False      2d3h
kube-scheduler                             4.5.0-rc.2   True        False         False      2d3h
kube-storage-version-migrator              4.5.0-rc.2   True        False         False      2d3h
machine-api                                4.5.0-rc.2   True        False         False      2d3h
machine-approver                           4.5.0-rc.2   True        False         False      2d3h
machine-config                             4.5.0-rc.2   True        False         False      2d3h
marketplace                                4.5.0-rc.2   True        False         False      2d3h
monitoring                                 4.5.0-rc.2   False       False         True       121m
network                                    4.5.0-rc.2   True        False         False      2d3h
node-tuning                                4.5.0-rc.2   True        False         False      2d3h
openshift-apiserver                        4.5.0-rc.2   True        False         False      103m
openshift-controller-manager               4.5.0-rc.2   True        False         False      28h
openshift-samples                          4.5.0-rc.2   True        False         False      2d3h
operator-lifecycle-manager                 4.5.0-rc.2   True        False         False      2d3h
operator-lifecycle-manager-catalog         4.5.0-rc.2   True        False         False      2d3h
operator-lifecycle-manager-packageserver   4.5.0-rc.2   True        False         False      28h
service-ca                                 4.5.0-rc.2   True        False         False      2d3h
storage                                    4.5.0-rc.2   True        False         False      2d3h

# oc get nodes
NAME                                              STATUS   ROLES    AGE    VERSION
qe-reliability-45-24csh-master-0                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-1                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-2                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus1-j9hvv   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus2-52mnw   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus3-st8f2   Ready    worker   2d3h   v1.18.3+91d0edd

# oc get nodes
error: You must be logged in to the server (Unauthorized)

# oc get nodes
NAME                                              STATUS   ROLES    AGE    VERSION
qe-reliability-45-24csh-master-0                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-1                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-master-2                  Ready    master   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus1-j9hvv   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus2-52mnw   Ready    worker   2d3h   v1.18.3+91d0edd
qe-reliability-45-24csh-worker-centralus3-st8f2   Ready    worker   2d3h   v1.18.3+91d0edd


# oc login -u testuser-47 -p <user_passwd> --loglevel=10
.
.
.
I0626 16:42:56.424874    2053 request.go:1068] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
I0626 16:42:56.425295    2053 helpers.go:216] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server is currently unable to handle the request (post subjectaccessreviews.authorization.openshift.io)",
  "reason": "ServiceUnavailable",
  "details": {
    "group": "authorization.openshift.io",
    "kind": "subjectaccessreviews",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "Error trying to reach service: 'net/http: TLS handshake timeout'"
      }
    ]
  },
  "code": 503
}]
F0626 16:42:56.425326    2053 helpers.go:115] Error from server (ServiceUnavailable): the server is currently unable to handle the request (post subjectaccessreviews.authorization.openshift.io)



Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.5.0-rc.2
Server Version: 4.5.0-rc.2
Kubernetes Version: v1.18.3+91d0edd

How reproducible:
Always, from two separate jump hosts running the same version of the oc client

Steps to Reproduce:
1.  IPI Install of OCP v4.5.0-rc.2
2.  Download and install the oc client from https://openshift-release-artifacts.svc.ci.openshift.org/4.5.0-rc.2/openshift-client-linux-4.5.0-rc.2.tar.gz
3.  Run oc commands (oc login, oc get co) repeatedly, back to back
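
A minimal sketch of step 3 as a loop, assuming a POSIX shell and an oc client that is already logged in. The helper name count_failures and its arguments are hypothetical and not part of the original report; it only counts non-zero exit codes and does not distinguish "Unauthorized" from other failures.

```shell
# Hypothetical helper: run a command a fixed number of times and
# print how many of those runs exited with a non-zero status.
# Usage: count_failures <attempts> <command> [args...]
count_failures() {
  cf_attempts="$1"
  shift
  cf_failures=0
  cf_i=0
  while [ "$cf_i" -lt "$cf_attempts" ]; do
    # Discard the command's output; only the exit status matters here.
    if ! "$@" >/dev/null 2>&1; then
      cf_failures=$((cf_failures + 1))
    fi
    cf_i=$((cf_i + 1))
  done
  echo "$cf_failures"
}
```

For example, `count_failures 100 oc get co` would report how many of 100 back-to-back invocations failed; the attempt count of 100 is an arbitrary choice.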

Actual results:
oc commands intermittently fail with:
error: You must be logged in to the server (Unauthorized)
The api-server logs show TLS handshake errors

Expected results:
No errors.  No TLS handshake errors in the api-server logs

Additional info:
Link to must-gather logs, openshift api-server, kube-api-server, and oc command outputs in next comment

Comment 2 Xingxing Xia 2020-06-28 04:00:48 UTC
Checked the must-gather in comment 1, found:
$ grep "https://10..*:8443" namespaces/openshift-apiserver-operator/pods/openshift-apiserver-operator-688cdf6c5-qmnfc/openshift-apiserver-operator/openshift-apiserver-operator/logs/current.log
...
2020-06-26T13:09:24.644499509Z I0626 13:09:24.644452       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator", UID:"fce402b7-ae29-476d-a3cd-b201378f5181", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-apiserver changed: Available changed from True to False ("APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.129.0.64:8443/apis/apps.openshift.io/v1: Get https://10.129.0.64:8443/apis/apps.openshift.io/v1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)")
...

^ all grep results are from the pod with IP 10.129.0.64, which is on one master

$ grep "podIP.*10.129.0.64" namespaces/openshift-apiserver/pods/apiserver-65b875f65c-*/*.yaml
openshift-apiserver/pods/apiserver-65b875f65c-87cv4/apiserver-65b875f65c-87cv4.yaml:  podIP: 10.129.0.64
$ vi namespaces/openshift-apiserver/pods/apiserver-65b875f65c-87cv4/openshift-apiserver/openshift-apiserver/logs/current.log
...
2020-06-26T16:53:29.52567754Z I0626 16:53:29.525656       1 log.go:172] http: TLS handshake error from 10.128.0.1:52796: EOF
2020-06-26T16:53:30.611304762Z I0626 16:53:30.611244       1 log.go:172] http: TLS handshake error from 10.128.0.1:52804: EOF

^ this "TLS handshake error from 10.128.0.1:52804: EOF" is summarized in bug 1825219#c19.
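
To see at a glance which client addresses the handshake errors come from, the log lines above can be grouped by source IP. This is a rough sketch, not a tool used in the original analysis: the function name tls_errors_by_source is hypothetical, and the pattern assumes IPv4 sources in the "TLS handshake error from <ip>:<port>: ..." form shown in the log excerpt.

```shell
# Hypothetical helper: count "TLS handshake error" lines per source IP
# across one or more log files, most frequent source first.
# Assumes IPv4 addresses in the form "from <ip>:<port>: <reason>".
tls_errors_by_source() {
  grep -h 'TLS handshake error from' "$@" |
    sed -E 's/.*TLS handshake error from ([0-9.]+):[0-9]+:.*/\1/' |
    sort | uniq -c | sort -rn
}
```

Run against the openshift-apiserver current.log from the must-gather, each output line would be a count followed by a source IP.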

Comment 27 errata-xmlrpc 2021-07-14 07:16:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.38 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2641

Comment 29 Red Hat Bugzilla 2023-09-15 00:33:16 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

