Bug 1309881 - [DOCS] Logging improvement: Deploying pods TLS handshake timeout errors may be due to MTU sizes.
Status: CLOSED CURRENTRELEASE
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 3.1.0
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: low
Assigned To: Vikram Goyal
Jordan Liggitt
Vikram Goyal
Blocks: OSOPS_V3
Reported: 2016-02-18 16:40 EST by Matt Woodson
Modified: 2016-08-06 19:57 EDT

Doc Type: Bug Fix
Last Closed: 2016-07-15 08:58:05 EDT
Type: Bug


Description Matt Woodson 2016-02-18 16:40:07 EST
Description of problem:

When deploying a pod, we would get failures.  Looking in the deployer pod's Docker log:

docker logs <deployer container ID>

would show the error:

================================
 1 deployer.go:65] couldn't get deployment <namespace>/cakephp-example-6: Get https://internal.api.clustername.openshift.com:443/api/v1/namespaces/<namespace>/replicationcontrollers/cakephp-example-6: net/http: TLS handshake timeout
================================

This ultimately turned out to be an MTU issue.  The MTU of eth0 was 9000 (jumbo frames) while the MTU of tun0 (OVS) was 1500.  A tcpdump showed that packets larger than 1500 bytes were arriving on eth0 but were not going across tun0.  They were being dropped, which caused the error shown above.
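A mismatch like this can be spotted by comparing the MTUs the kernel reports for the two interfaces. The following is a minimal sketch, assuming a Linux node that exposes each interface's MTU under /sys/class/net/<iface>/mtu. The function names and the `sysfs_root` parameter are illustrative and are not part of OpenShift.

```python
from pathlib import Path


def read_mtu(iface, sysfs_root="/sys/class/net"):
    """Return the MTU of a network interface as reported by Linux sysfs."""
    return int((Path(sysfs_root) / iface / "mtu").read_text().strip())


def mtu_mismatch(outer, inner, sysfs_root="/sys/class/net"):
    """Return (outer_mtu, inner_mtu) if the outer MTU is larger, else None.

    A larger MTU on the physical interface (for example eth0) than on the
    SDN interface (for example tun0) is the situation described in this bug.
    """
    outer_mtu = read_mtu(outer, sysfs_root)
    inner_mtu = read_mtu(inner, sysfs_root)
    if outer_mtu > inner_mtu:
        return outer_mtu, inner_mtu
    return None
```

With the values from this report (eth0 at 9000, tun0 at 1500), `mtu_mismatch("eth0", "tun0")` would return `(9000, 1500)`.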

In conversation, Clayton suggested that this error could be bubbled up to tell the user that an MTU issue may be present.  Finding this MTU bug was very involved.  Alerting the user to the possibility earlier may help.
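To illustrate the kind of hint being proposed, here is a rough sketch that appends an MTU suggestion to any error message mentioning a TLS handshake timeout. It is purely hypothetical: `MTU_HINT` and `annotate_error` are invented names, and this is not how the OpenShift deployer reports errors.

```python
# Hypothetical hint text; the wording is an assumption, not an OpenShift message.
MTU_HINT = (
    "hint: TLS handshake timeouts can be caused by an MTU mismatch "
    "between the node's network interfaces (for example eth0 and tun0)"
)


def annotate_error(message):
    """Append the MTU hint to messages that mention a TLS handshake timeout."""
    if "TLS handshake timeout" in message:
        return f"{message}\n{MTU_HINT}"
    # Unrelated errors are passed through unchanged.
    return message
```

Matching on the message text keeps the sketch short; a real client would more likely inspect the error type than the string.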


Version-Release number of selected component (if applicable):

atomic-openshift-node-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
openvswitch-2.4.0-1.el7.x86_64


Steps to Reproduce:
1.  Configure an OpenShift node so that eth0 has an MTU of 9000.
2.  Configure tun0 to have an MTU of 1500.
3.  Attempt to deploy a pod to the node.
4.  If the deploy fails, run "docker logs <id of deploy container>" on the node where it failed.

Actual results:

TLS handshake timeout

Expected results:

The build should communicate with the master API without errors.

Additional info:

There is probably an underlying bug that doesn't allow tun0 to pass traffic larger than its configured MTU.  Frame fragmentation should happen and traffic should pass.
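For a sense of what fragmentation would involve here, the sketch below counts the IPv4 fragments needed to carry one packet across a link with a smaller MTU. It assumes a 20-byte IPv4 header without options, and the function name is illustrative. Note that TCP senders commonly set the Don't Fragment bit, in which case an oversized packet is dropped rather than fragmented, which would match the drops seen in the tcpdump.

```python
import math

IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options


def ipv4_fragment_count(packet_bytes, mtu):
    """Number of IPv4 fragments needed to carry a packet over a link with this MTU."""
    if packet_bytes <= mtu:
        return 1
    payload = packet_bytes - IPV4_HEADER_BYTES
    # Every fragment except the last carries a payload that is a multiple of 8 bytes.
    per_fragment = (mtu - IPV4_HEADER_BYTES) // 8 * 8
    return math.ceil(payload / per_fragment)
```

For a full 9000-byte packet arriving on eth0 and a tun0 MTU of 1500, `ipv4_fragment_count(9000, 1500)` gives 7.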
Comment 1 Clayton Coleman 2016-02-18 16:50:18 EST
In general, when we report Golang connection errors from a client on the platform, we should probably suggest this as a possible cause.  That's going to be any client running on the cluster, plus maybe the masters.
Comment 3 Jordan Liggitt 2016-04-24 11:35:28 EDT
Yes, this should be a docs issue
Comment 4 Thien-Thi Nguyen 2016-05-17 08:46:23 EDT
Doc Plan:
- Add section to "Troubleshooting OpenShift SDN" that describes how MTU mismatch between tun0 and eth0 (for example) can be the cause of authentication (SSL handshake) errors.
- Link to the new section from various places, including "Master and Node Configuration" and "Configuring the SDN".
Comment 5 Thien-Thi Nguyen 2016-05-23 07:59:28 EDT
Matt, WDYT of the doc plan (comment #4)?
Comment 7 Thien-Thi Nguyen 2016-05-24 17:43:13 EDT
PR: https://github.com/openshift/openshift-docs/pull/2145
Comment 13 openshift-github-bot 2016-06-23 10:39:30 EDT
Commit pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/a344e65c4d3ef9b49a3a81aee4f03a6644188543
Merge pull request #2145 from tnguyen-rh/bz1309881

Add "TLS Handshake Timeout" section

Closes bug 1309881
Comment 14 Thien-Thi Nguyen 2016-06-23 10:41:36 EDT
Moving to RELEASE_PENDING.
Comment 16 Thien-Thi Nguyen 2016-07-15 08:58:05 EDT
Verified live:
https://docs.openshift.com/enterprise/3.2/admin_guide/sdn_troubleshooting.html#tls-handshake-timeout

Moving to CLOSED CURRENTRELEASE.
