1309881 – [DOCS] Logging improvement: Deploying pods TLS handshake timeout errors may be due to MTU sizes.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Documentation
Sub Component:
Version:	3.1.0
Hardware:	Unspecified
OS:	Linux
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Vikram Goyal
QA Contact:	Jordan Liggitt
Docs Contact:	Vikram Goyal
URL:
Whiteboard:
Depends On:
Blocks:	OSOPS_V3

Reported:	2016-02-18 21:40 UTC by Matt Woodson
Modified:	2016-08-06 23:57 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-07-15 12:58:05 UTC
Target Upstream Version:
Embargoed:

Description Matt Woodson 2016-02-18 21:40:07 UTC

Description of problem:

Deploying a pod failed. The deployer pod's Docker log, viewed with:

docker logs <deployer container ID>

showed the error:

================================
 1 deployer.go:65] couldn't get deployment <namespace>/cakephp-example-6: Get https://internal.api.clustername.openshift.com:443/api/v1/namespaces/<namespace>/replicationcontrollers/cakephp-example-6: net/http: TLS handshake timeout
================================

This ultimately turned out to be an MTU issue. The MTU of eth0 was 9000 (jumbo frames) while the MTU of tun0 (OVS) was 1500. A tcpdump showed that packets larger than 1500 bytes arrived on eth0 but did not go across tun0. They were dropped, which caused the error shown above.
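
The mismatch described above can be checked on a node by comparing the two MTU values. The following is a minimal Python sketch, not part of the original report: it assumes a Linux node that exposes interface MTUs under /sys/class/net/<interface>/mtu, and the helper names read_mtu and oversized_range are hypothetical.

```python
from pathlib import Path


def read_mtu(iface, sysfs_root="/sys/class/net"):
    """Return the MTU of a network interface, read from Linux sysfs.

    Assumption: the node exposes <sysfs_root>/<iface>/mtu as a text file
    containing a single integer.
    """
    return int(Path(sysfs_root, iface, "mtu").read_text().strip())


def oversized_range(uplink_mtu, sdn_mtu):
    """Return the inclusive range of sizes that fit the uplink but not the SDN device.

    Returns None when the uplink MTU does not exceed the SDN device MTU,
    i.e. when this particular kind of mismatch is absent.
    """
    if uplink_mtu <= sdn_mtu:
        return None
    return (sdn_mtu + 1, uplink_mtu)
```

For the values in this report, oversized_range(9000, 1500) returns (1501, 9000): sizes in that range fit eth0 but not tun0. On a node, the inputs would come from read_mtu("eth0") and read_mtu("tun0").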

Clayton suggested that this error could be bubbled up to tell the user that an MTU issue may be present. Finding this MTU bug was very involved; alerting the user earlier may help.


Version-Release number of selected component (if applicable):

atomic-openshift-node-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
openvswitch-2.4.0-1.el7.x86_64


Steps to Reproduce:
1.  Configure an OpenShift node so that eth0 has an MTU of 9000.
2.  Configure tun0 with an MTU of 1500.
3.  Attempt to deploy a pod to the node.
4.  If the deploy fails, run "docker logs <id of deploy container>" on the node where it failed.

Actual results:

TLS handshake timeout

Expected results:

The deployer should communicate with the master API without errors.

Additional info:

There is probably an underlying bug that prevents tun0 from passing traffic larger than its configured MTU. Frame fragmentation should happen and traffic should pass.

Comment 1 Clayton Coleman 2016-02-18 21:50:18 UTC

In general, when we report Golang connection errors via a client on the platform, we should probably suggest an MTU problem as a possible cause. That's going to be any client running on the cluster, plus maybe masters.
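
One way to read this suggestion is as a small wrapper that appends a hint whenever a client error contains the handshake timeout text. The sketch below is in Python for illustration only (the platform clients are written in Go). The function name with_mtu_hint and the hint wording are assumptions; only the matched string "TLS handshake timeout" comes from the log in the description.

```python
# Text that appears in the deployer log quoted in the description.
TLS_TIMEOUT_MARKER = "TLS handshake timeout"

# Hypothetical hint wording; not taken from any OpenShift release.
MTU_HINT = "hint: check for an MTU mismatch between node interfaces such as eth0 and tun0"


def with_mtu_hint(message):
    """Append an MTU hint to error messages that report a TLS handshake timeout.

    Messages without the marker, or that already carry the hint, are
    returned unchanged.
    """
    if TLS_TIMEOUT_MARKER in message and MTU_HINT not in message:
        return f"{message} ({MTU_HINT})"
    return message
```

Applied to the error in the description, the returned message keeps the original text and ends with the hint in parentheses; unrelated errors pass through untouched.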

Comment 3 Jordan Liggitt 2016-04-24 15:35:28 UTC

Yes, this should be a docs issue

Comment 4 Thien-Thi Nguyen 2016-05-17 12:46:23 UTC

Doc Plan:
- Add section to "Troubleshooting OpenShift SDN" that describes how MTU mismatch between tun0 and eth0 (for example) can be the cause of authentication (SSL handshake) errors.
- Link to the new section from various places, including "Master and Node Configuration" and "Configuring the SDN".

Comment 5 Thien-Thi Nguyen 2016-05-23 11:59:28 UTC

Matt, WDYT of the doc plan (comment #4)?

Comment 7 Thien-Thi Nguyen 2016-05-24 21:43:13 UTC

PR: https://github.com/openshift/openshift-docs/pull/2145

Comment 13 openshift-github-bot 2016-06-23 14:39:30 UTC

Commit pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/a344e65c4d3ef9b49a3a81aee4f03a6644188543
Merge pull request #2145 from tnguyen-rh/bz1309881

Add "TLS Handshake Timeout" section

Closes bug 1309881

Comment 14 Thien-Thi Nguyen 2016-06-23 14:41:36 UTC

Moving to RELEASE_PENDING.

Comment 16 Thien-Thi Nguyen 2016-07-15 12:58:05 UTC

Verified live:
https://docs.openshift.com/enterprise/3.2/admin_guide/sdn_troubleshooting.html#tls-handshake-timeout

Moving to CLOSED CURRENTRELEASE.
