Bug 1576798

Summary: request to kubernetes api doesn't work on bond interfaces
Product: OpenShift Container Platform
Component: Networking
Reporter: Vladislav Walek <vwalek>
Assignee: Ben Bennett <bbennett>
QA Contact: Meng Bo <bmeng>
Status: CLOSED NOTABUG
Severity: urgent
Priority: high
Version: 3.9.0
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
CC: aos-bugs, bbennett, ccallega, farandac, meggen, vwalek, weliang
Type: Bug
Last Closed: 2018-05-16 18:07:59 UTC

Description Vladislav Walek 2018-05-10 12:12:19 UTC
Description of problem:

The customer has a bond interface built from two network interfaces; the bond carries a single IP address (XXXX/24).
The Kubernetes API at https://172.30.0.1 is unreachable from within pods (so the router/registry cannot be deployed); requests fail with a timeout:

error: couldn't get deployment router: Get https://172.30.0.1:443/api/v1/namespaces/default/replicationcontrollers/router: dial tcp 172.30.0.1:443: i/o timeout

Also reproducible against the master API directly (there is only one master in the cluster).
The error occurs in every deploy pod on every node (tested on the master and on one node).
The API request works directly from the node (i.e. without going through the tun0 interface).
Endpoints are fine - the Kubernetes API and the master API both work from the node directly.
Ping between the nodes' tun0 interfaces does not work.
A tcpdump shows the connection leaving via tun0, but nothing appears on the bond interface (only ARP requests).
The environment is disconnected from the internet.
There is no firewall between the nodes.
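
As a rough sketch of how the above can be checked (the pod name and URL path below are placeholders, the pod image is assumed to ship curl, and the tcpdump commands run on the node hosting the pod):

 # from inside a pod on the affected node
 $ oc rsh <pod> curl -kv --connect-timeout 5 https://172.30.0.1:443/version

 # on the node, watch the SDN and bond interfaces in parallel
 $ tcpdump -ni tun0 port 443
 $ tcpdump -ni bond0 port 443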

Version-Release number of selected component (if applicable):
OpenShift Container Platform 3.9 (latest)
Containerized
openshift-ovs-multitenant and openshift-ovs-subnet (tested with both SDN plugins)

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
I will attach the logs in a private comment.
Possibly the bond configuration itself is wrong.

Comment 4 Ben Bennett 2018-05-11 14:21:23 UTC
Weibin: can you try to reproduce this please?

Comment 9 Ben Bennett 2018-05-14 17:48:35 UTC
What are the endpoints for the kubernetes service:
  oc get ep -n default kubernetes

And can they curl one of the endpoints directly?

e.g.:
 $ oc get ep -n default kubernetes
NAME         ENDPOINTS                                         AGE
kubernetes   172.17.0.2:8053,172.17.0.2:8443,172.17.0.2:8053   2h

 $ curl -k https://172.17.0.2:8443
...
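
(To separate the service VIP from the endpoint itself, the same checks can be run from inside a pod - the pod name and the /healthz path here are only illustrative:

 $ oc rsh <pod> curl -k --connect-timeout 5 https://172.30.0.1:443/healthz    # service IP
 $ oc rsh <pod> curl -k --connect-timeout 5 https://172.17.0.2:8443/healthz   # endpoint IP

If only the second one succeeds, the problem is in the path taken for the service VIP.)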

Comment 11 Weibin Liang 2018-05-15 14:36:24 UTC
(In reply to Ben Bennett from comment #4)
> Weibin: can you try to reproduce this please?

Ben,

Neither the Beijing nor the Westford OpenShift networking QE teams have the hardware and setup needed to reproduce the above issue.

Comment 12 Martin Eggen 2018-05-16 08:26:17 UTC
The linked support case has been resolved by the customer. 

Their hosts had been configured with routing entries in a separate routing table (1) in addition to the main table. The extra routes caused some traffic to be routed through the bond0 interface rather than tun0.

These routes are not shown by a plain "ip route list", which is, for example, what our sdn-debug script uses.
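
(For reference, routes in tables other than the main one can be inspected with commands along these lines; the table number 1 is the one mentioned above:

 $ ip rule show              # policy rules, i.e. which lookups are sent to table 1
 $ ip route show table 1     # the routes that plain "ip route list" does not display
 $ ip route show table all   # every table, including local and main
)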

Once this was discovered, the extra routing configuration was removed from the hosts, and deployments on nodes using a bond interface now run as expected.