Description of problem:
The customer has a bond interface built from 2 network interfaces; the bond interface has a single IP address (XXXX/24).
The Kubernetes API at https://172.30.0.1 is unreachable from within a pod, so the router and registry can't be deployed. The requests time out:
error: couldn't get deployment router: Get https://172.30.0.1:443/api/v1/namespaces/default/replicationcontrollers/router: dial tcp 172.30.0.1:443: i/o timeout
The timeout is also reproducible when contacting the master API directly (there is only 1 master in the cluster).
The error occurs in every deploy pod on every node tested (the master and one node).
The same API request succeeds when issued directly from the node, where it does not go through the tun0 interface.
The endpoints are OK: both the Kubernetes API and the master API work from the node directly.
Ping between the tun0 interfaces doesn't work.
tcpdump shows the connection leaving through tun0, but nothing appears on the bond interface except ARP requests.
The environment is disconnected from the internet.
There is no firewall between the nodes.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 3.9 (latest)
OVS multitenant/subnet (tested with both)
Steps to Reproduce:
I will attach the logs in a private comment.
This could possibly be caused by a wrong bond configuration.
Weibin: can you try to reproduce this please?
What are the endpoints for the kubernetes service?
oc get ep -n default kubernetes
And can they curl one of the endpoints directly?
$ oc get ep -n default kubernetes
NAME         ENDPOINTS                                         AGE
kubernetes   172.17.0.2:8053,172.17.0.2:8443,172.17.0.2:8053   2h
$ curl -k https://172.17.0.2:8443
(In reply to Ben Bennett from comment #4)
> Weibin: can you try to reproduce this please?
Neither the Beijing nor the Westford OpenShift networking QE team has the hardware and setup needed to reproduce the above issue.
The linked support case has been resolved by the customer.
Their hosts had been configured with routing entries in a separate routing table (1) in addition to the main table. The extra routes caused some traffic to be routed through the bond0 interface rather than tun0.
These routes are not shown by a plain "ip route list", which only prints the main table and is what, for example, our sdn-debug script uses.
Once this was discovered, the extra routing configuration was removed from their hosts, and deployment using the bond interface now runs as expected.
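
For anyone hitting the same symptom, here is a minimal sketch of how routes outside the main table could be surfaced on a node. It assumes iproute2 is installed. The helper names extra_tables and dump_extra_routes are made up for this example and are not part of the sdn-debug script.

```shell
# Hypothetical helpers for illustration only (not part of sdn-debug).

# Read `ip rule list` output on stdin and print every routing table that a
# policy rule refers to, except the built-in local, main and default tables.
extra_tables() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "lookup") print $(i + 1) }' |
    sort -u |
    awk '$0 != "local" && $0 != "main" && $0 != "default"'
}

# Print the routes held in each extra table referenced on this host.
dump_extra_routes() {
  ip rule list | extra_tables | while read -r table; do
    echo "== routes in table $table =="
    ip route show table "$table"
  done
}
```

Calling dump_extra_routes on an affected node should list the routes in the additional table (table 1 in this case). "ip route show table all" prints every table at once, and "ip route get <destination>" reports which interface the kernel would actually choose for a given address, with policy rules taken into account.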