Description of problem:
When deploying a pod we would get failures. The deployer container's Docker log ("docker logs <deployer container ID>") would show the error:

================================
1 deployer.go:65] couldn't get deployment <namespace>/cakephp-example-6: Get https://internal.api.clustername.openshift.com:443/api/v1/namespaces/<namespace>/replicationcontrollers/cakephp-example-6: net/http: TLS handshake timeout
================================

This ultimately turned out to be an MTU issue. The MTU of eth0 was 9000 (jumbo frames) while the MTU of tun0 (ovs) was 1500. A tcpdump showed that anything bigger than 1500 bytes was coming in on eth0 but was not going across tun0. That traffic was dropped, which caused the error shown above.

When speaking with Clayton, he suggested that this error could be bubbled up to show the user that an MTU issue may be present. Finding this MTU bug was very involved. Alerting the user to it earlier may help.

Version-Release number of selected component (if applicable):
atomic-openshift-node-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
openvswitch-2.4.0-1.el7.x86_64

Steps to Reproduce:
1. Configure an OpenShift node so that eth0 has an MTU of 9000.
2. Configure tun0 to have an MTU of 1500.
3. Attempt to deploy a pod to the node.
4. If the deploy fails, run "docker logs <id of deploy container>" on the node where it failed.

Actual results:
TLS handshake timeout

Expected results:
The build should communicate with the master API without errors.

Additional info:
There is probably an underlying bug that prevents tun0 from passing traffic bigger than its configured MTU. Frame fragmentation should happen and the traffic should pass.
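For anyone trying to spot this condition on a node before digging into tcpdump, a minimal sketch of an MTU comparison follows. It is not part of the reported fix. It assumes the Linux sysfs layout /sys/class/net/<iface>/mtu, and the 50-byte allowance for overlay encapsulation is an assumption that does not come from this report. The function names are hypothetical.

```python
from pathlib import Path

# Assumption: Linux exposes each interface's MTU at /sys/class/net/<iface>/mtu.
SYSFS_NET = Path("/sys/class/net")

# Assumption: bytes the overlay is allowed to sit below the physical NIC.
# This value is not stated in the report; adjust it for your SDN setup.
ASSUMED_OVERLAY_OVERHEAD = 50


def read_mtu(iface, base=SYSFS_NET):
    """Return the MTU of iface as an int, read from <base>/<iface>/mtu."""
    return int((Path(base) / iface / "mtu").read_text().strip())


def mtu_gap(physical="eth0", overlay="tun0", base=SYSFS_NET):
    """Return the physical interface MTU minus the overlay interface MTU."""
    return read_mtu(physical, base) - read_mtu(overlay, base)


def looks_mismatched(gap, allowed=ASSUMED_OVERLAY_OVERHEAD):
    """True when the physical MTU exceeds the overlay MTU by more than allowed."""
    return gap > allowed
```

On the node from this report, mtu_gap() would return 7500 (9000 on eth0 minus 1500 on tun0), and looks_mismatched(7500) would flag it. The base argument exists only so the check can be pointed at a different directory when trying it out.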
In general, when we report Golang connection errors from a client on the platform, we should probably suggest an MTU problem as a possible cause. That applies to any client running on the cluster, plus maybe the masters.
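To make the idea concrete, here is a rough sketch of the hinting logic, written in Python for illustration only. The real clients are Go programs and this is not how they report errors today. The function name and the hint wording are hypothetical; only the "TLS handshake timeout" string comes from the log in the report.

```python
# Hypothetical hint text; the wording is an assumption, not an existing message.
MTU_HINT = (
    "hint: a TLS handshake timeout can be caused by an MTU mismatch "
    "between node interfaces (for example eth0 and tun0)"
)


def with_mtu_hint(message):
    """Append the MTU hint when the message reports a TLS handshake timeout."""
    if "TLS handshake timeout" in message:
        return f"{message}\n{MTU_HINT}"
    # Other errors are passed through unchanged.
    return message
```

Matching on the error text is crude, but it keeps the sketch independent of any particular client library. A real change would more likely inspect the error where the client surfaces it.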
Yes, this should be a docs issue
Doc Plan:
- Add section to "Troubleshooting OpenShift SDN" that describes how MTU mismatch between tun0 and eth0 (for example) can be the cause of authentication (SSL handshake) errors.
- Link to the new section from various places, including "Master and Node Configuration" and "Configuring the SDN".
Matt, WDYT of the doc plan (comment #4)?
PR: https://github.com/openshift/openshift-docs/pull/2145
Commit pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/a344e65c4d3ef9b49a3a81aee4f03a6644188543
Merge pull request #2145 from tnguyen-rh/bz1309881

Add "TLS Handshake Timeout" section
Closes bug 1309881
Moving to RELEASE_PENDING.
Verified live:
https://docs.openshift.com/enterprise/3.2/admin_guide/sdn_troubleshooting.html#tls-handshake-timeout

Moving to CLOSED CURRENTRELEASE.