Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1435261

Summary:	oc client TLS timeout
Product:	OpenShift Container Platform	Reporter:	Wolfram Richter <wrichter>
Component:	oc	Assignee:	Fabiano Franz <ffranz>
Status:	CLOSED ERRATA	QA Contact:	Xingxing Xia <xxia>
Severity:	medium	Docs Contact:
Priority:	high
Version:	3.4.1	CC:	aos-bugs, asherkho, ccoleman, deads, dfroehli, jliggitt, jmorales, jokerman, mmccomas, odockal, smunilla, spousty, vlaad, wrichter
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Mac OS
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: a Golang issue (happens up to 1.7.4) adds an overhead of around 4 seconds to TLS handshake on macOS. Consequence: 'oc' calls time out intermittently on macOS. Fix: backported the existing fix to 1.7.5 and upgraded Golang we use to build 'oc' to that version. Result: TLS handshake time reduced by about 4 seconds on macOS.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-08-10 05:18:47 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Comment 1 Fabiano Franz 2017-03-24 21:06:16 UTC

Can you specify if any of the *working* clients/laptops you mentioned are also running Mac OS? This *may* (although not yet investigated) be related to some known Go compiler issues with VPN and DNS resolution[1] on Macs.

[1] https://github.com/kubernetes/kubernetes/issues/23130

Comment 2 Wolfram Richter 2017-03-26 19:43:18 UTC

I have a colleague

Comment 3 Wolfram Richter 2017-03-26 19:44:48 UTC

I have a colleague who is using MacOS and is able to access the OCP cluster while I can't - he is on PTO this week. I need to check which OC version he is using.

Comment 5 Anton Sherkhonov 2017-03-27 14:37:54 UTC

Hi,

I've started having exact same problem.
Cluster is 3.4 (with self-signed cert). It's multi-master, the problem persists when connecting to any master, restart of atomic-openshift-master-api doesn't help.

Client:
$ oc version
oc v3.4.1.5
kubernetes v1.4.0+776c994
features: Basic-Auth

(also remote VPN)

Connecting:
$ oc login https://10.16.155.13:8443 --loglevel=10
I0327 10:06:26.240528    5617 loader.go:354] Config loaded from file /Users/asherkho-osx/.kube/config
I0327 10:06:26.240980    5617 round_trippers.go:299] curl -k -v -XHEAD  https://10.16.155.13:8443/
I0327 10:06:36.308045    5617 round_trippers.go:318] HEAD https://10.16.155.13:8443/  in 10067 milliseconds
I0327 10:06:36.308083    5617 round_trippers.go:324] Response Headers:
F0327 10:06:36.308140    5617 helpers.go:115] error: net/http: TLS handshake timeout

Everything was working fine ~1 week ago. 

I *didn't* update the client or the cluster. I can login to the cluster from fedora box successfully.

Possible changes I've done: reboot of laptop.. I don't see anything very suspicious in the log of updates that were installed on the Mac.

Comment 6 Wolfram Richter 2017-03-27 14:58:51 UTC

This is odd - after having verified late yesterday evening that I still cannot login (when I asked my co-workers to test), I can log in again today. To my knowledge, I haven't changed anything on my laptop and certainly not on the cluster.

Wolframs-MBP-6:~ wolfram$ oc login https://openshift.hailstorm3.coe.muc.redhat.com:8443 --loglevel=10
I0327 16:56:39.803710    6175 loader.go:354] Config loaded from file /Users/wolfram/.kube/config
I0327 16:56:39.804049    6175 round_trippers.go:299] curl -k -v -XHEAD  https://openshift.hailstorm3.coe.muc.redhat.com:8443/
I0327 16:56:48.388520    6175 round_trippers.go:318] HEAD https://openshift.hailstorm3.coe.muc.redhat.com:8443/  in 8584 milliseconds
I0327 16:56:48.388547    6175 round_trippers.go:324] Response Headers:
I0327 16:56:48.388723    6175 round_trippers.go:299] curl -k -v -XGET  -H "X-Csrf-Token: 1" https://openshift.hailstorm3.coe.muc.redhat.com:8443/oauth/authorize?response_type=token&client_id=openshift-challenging-client
I0327 16:56:48.681777    6175 round_trippers.go:318] GET https://openshift.hailstorm3.coe.muc.redhat.com:8443/oauth/authorize?response_type=token&client_id=openshift-challenging-client 401 Unauthorized in 293 milliseconds
I0327 16:56:48.681804    6175 round_trippers.go:324] Response Headers:
I0327 16:56:48.681808    6175 round_trippers.go:327]     Www-Authenticate: Basic realm="openshift"
I0327 16:56:48.681811    6175 round_trippers.go:327]     Date: Mon, 27 Mar 2017 14:56:48 GMT
I0327 16:56:48.681814    6175 round_trippers.go:327]     Content-Length: 0
I0327 16:56:48.681816    6175 round_trippers.go:327]     Content-Type: text/plain; charset=utf-8
Authentication required for https://openshift.hailstorm3.coe.muc.redhat.com:8443 (openshift)
Username:^C
Wolframs-MBP-6:~ wolfram$

Comment 8 Steven Citron-Pousty 2017-04-06 21:42:41 UTC

This just hit a student in one of my workshops on OSX and they even tried the 3.6 client and got the same error. 

Can we please raise this error to higher than medium as there is not even a workaround at this time.

Comment 9 Jorge Morales 2017-04-11 15:32:40 UTC

We've had another user with error during one of our trainings. Here's the info he could compile. If there needs to be more info compiled, please add here all that's required to diagnose.

https://gist.github.com/ghjm/14ce9be556405e09e114a883cbb8540d

Comment 10 Fabiano Franz 2017-04-12 20:27:02 UTC

Wolfram, can you provide the output of 'sw_vers' from your machine? I just had a script run 'oc login' against your server for 1100 times and couldn't reproduce the issue not even once. My machine is the following (note it's 2 point releases below the one provided by Jorge, I'll make mine upgrade later today and try again after that).

$ sw_vers
ProductName:	Mac OS X
ProductVersion:	10.12.2
BuildVersion:	16C67

Comment 11 Wolfram Richter 2017-04-12 20:29:29 UTC

My Laptop: 

$ sw_vers
ProductName:	Mac OS X
ProductVersion:	10.12.3
BuildVersion:	16D32

Comment 12 Fabiano Franz 2017-04-12 20:33:49 UTC

Thanks! Are you able to reproduce it ATM against your server?

Comment 13 Anton Sherkhonov 2017-04-12 20:37:52 UTC

FYI It's reproducible right now on my laptop.

I don't know what it depends upon - it is very intermittent (was working fine yesterday).

$ sw_vers
ProductName:   	Mac OS X
ProductVersion:	10.12.4
BuildVersion:  	16E195

Comment 14 Wolfram Richter 2017-04-12 20:38:33 UTC

even though I had already successfully logged in today, I cannot at this moment:

ovpn-116-18:~ wolfram$ oc login https://openshift.hailstorm1.coe.muc.redhat.com:8443 -u admin -p XXXXX --loglevel=10
I0412 22:36:14.954080   25516 loader.go:354] Config loaded from file /Users/wolfram/.kube/config
I0412 22:36:14.955285   25516 round_trippers.go:299] curl -k -v -XHEAD  https://openshift.hailstorm1.coe.muc.redhat.com:8443/
I0412 22:36:25.046263   25516 round_trippers.go:318] HEAD https://openshift.hailstorm1.coe.muc.redhat.com:8443/  in 10090 milliseconds
I0412 22:36:25.046321   25516 round_trippers.go:324] Response Headers:
F0412 22:36:25.047435   25516 helpers.go:115] error: net/http: TLS handshake timeout
ovpn-116-18:~ wolfram$

Comment 15 Fabiano Franz 2017-04-13 15:21:09 UTC

Could you please provide the contents of /Users/wolfram/.kube/config? Feel free to mask password and tokens, and send it directly to me if you prefer. Thanks!

Comment 16 Wolfram Richter 2017-04-13 17:11:00 UTC

Sent the file via email

Comment 17 Wolfram Richter 2017-04-13 20:29:47 UTC

I have the nagging suspicion that the issue is related to system load on my laptop…

Just a couple of minutes ago I had time machine preparing a backup, kaspersky antivirus consuming 150% CPU, mail another 50%, a dangling oc process 35%, some ansible scripts running, etc… and I consistently could NOT login. I killed mail, stopped the time machine backup, killed the oc process, etc…  and was able to login again. Restarted mail and during the initial load spike -> no access...

Cheers,
Wolfram

Comment 19 Fabiano Franz 2017-04-18 02:21:10 UTC

Clayton, PTAL.

This is related to an issue[1] and a couple related others[2][3] in Golang 1.7.4+ (the one we use to release) that affect users of macOS Sierra specifically. 

Basically the issue adds an overhead of ~3.5 seconds to the TLS handshake (more depending on the size of cert/key and networking conditions, like under VPN or proxy, we've seen it add around 6 seconds to the handshake). FYI our TLS handshake timeout is 10 seconds (Golang default and used by most tools).

Thankfully there's a fix and it was backported[4] and is part of 1.7.5[5][6].

We should consider switching the release to 1.7.5 since from the reports here this is affecting users. Sounds reasonable, are we still on time for such change?

[1] https://github.com/golang/go/issues/18688
[2] https://github.com/golang/go/issues/18141
[3] https://github.com/golang/go/issues/18203
[4] https://github.com/golang/go/commit/5c2b5ee3c8271c3d19d8f0c38804983950c752bf
[5] https://golang.org/doc/devel/release.html#go1.7.minor
[6] https://github.com/golang/go/issues?q=milestone%3AGo1.7.5

Comment 20 Wolfram Richter 2017-04-20 13:46:56 UTC

I can confirm that on my system, an OC client compiled the mentioned settings shaves about 8 seconds of the login sequence, completing in <500 msec.

Comment 21 Fabiano Franz 2017-04-20 19:46:14 UTC

Fixed in 

https://github.com/openshift/origin/pull/13803
https://github.com/openshift/origin/pull/13836

Comment 22 Fabiano Franz 2017-04-20 19:47:48 UTC

*** Bug 1444198 has been marked as a duplicate of this bug. ***

Comment 24 Xingxing Xia 2017-06-07 11:36:58 UTC

Configured ovpn in Mac Sierra laptop. Given the laptop only has network connection through VPN, I verified via 'for' loop using oc v3.6.94 against several different clusters. The oc can communicate with servers well without TLS timeout

Comment 26 errata-xmlrpc 2017-08-10 05:18:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716