Description of problem:
We are having trouble with openshift-sdn: HTTP requests from outside the cluster to a pod occasionally take a long time to get a response (tens of seconds), or get no response at all other than an HTTP 503. We have traced the problem down to the OVS Ethernet switch. With tcpdump we can see traffic arrive at OVS port1 (2 in the picture), and normally we also see it on vethX (1 in the picture) towards the pod. When traffic stalls, we see only SYNs at OVS port1 (2 in the picture) headed for the pod, and no traffic at vethX (1 in the picture).
The picture I am referencing is "SDN Flows Inside a Node" from: https://docs.openshift.com/container-platform/3.3/admin_guide/sdn_troubleshooting.html
This behaviour is easy to trigger: just curl an HTTP endpoint in a while loop. It shows up within a few minutes and keeps recurring.

Version-Release number of selected component (if applicable):
OCP 3.3 on Openstack using

How reproducible:
On customer side

Steps to Reproduce:
1. install using https://github.com/redhat-openstack/openshift-on-openstack
2.
3.

Actual results:

Expected results:

Additional info:
Just to make it clear, the workaround is to set nodeIP to the internal IP address of the instance.
PR for the fix: https://github.com/openshift/origin/pull/12107
Tested with latest origin build.
# openshift version
openshift v1.5.0-alpha.0+8a850ad-503
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
I found that the hostIP may still flip-flop when registering to master.
[root@ghuang-ocp-openshift-master-0 ~]# oc get hostsubnet
NAME                                             HOST                                             HOST IP        SUBNET
ghuang-ocp-openshift-infra-0.example.com         ghuang-ocp-openshift-infra-0.example.com         192.168.10.6   10.130.6.0/23
ghuang-ocp-openshift-master-0.example.com        ghuang-ocp-openshift-master-0.example.com        192.168.10.5   10.129.6.0/23
ghuang-ocp-openshift-node-rxch1oh4.example.com   ghuang-ocp-openshift-node-rxch1oh4.example.com   192.168.10.7   10.131.6.0/23
[root@ghuang-ocp-openshift-master-0 ~]# oc delete node --all
node "ghuang-ocp-openshift-infra-0.example.com" deleted
node "ghuang-ocp-openshift-master-0.example.com" deleted
node "ghuang-ocp-openshift-node-rxch1oh4.example.com" deleted
Restart all nodes.
# oc get hostsubnet
NAME                                             HOST                                             HOST IP        SUBNET
ghuang-ocp-openshift-infra-0.example.com         ghuang-ocp-openshift-infra-0.example.com         10.0.10.3      10.128.8.0/23
ghuang-ocp-openshift-master-0.example.com        ghuang-ocp-openshift-master-0.example.com        192.168.10.5   10.130.8.0/23
ghuang-ocp-openshift-node-rxch1oh4.example.com   ghuang-ocp-openshift-node-rxch1oh4.example.com   192.168.10.7   10.129.8.0/23
And the node with incorrect HOST IP cannot reach the other nodes through the cluster IP.
[root@ghuang-ocp-openshift-infra-0 ~]# ping 10.130.8.1
PING 10.130.8.1 (10.130.8.1) 56(84) bytes of data.
From 10.128.8.1 icmp_seq=1 Destination Host Unreachable
From 10.128.8.1 icmp_seq=2 Destination Host Unreachable
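For anyone repeating this check, the two `oc get hostsubnet` listings above can be compared mechanically rather than by eye. The following is only a rough sketch: the function names `parse_hostsubnets` and `changed_host_ips` are made up for illustration, and it assumes the four-column output shown in this comment (NAME, HOST, HOST IP, SUBNET).

```python
def parse_hostsubnets(text):
    """Map node name -> (host_ip, subnet) from `oc get hostsubnet` output."""
    result = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 4:
            continue  # not a NAME / HOST / HOST IP / SUBNET data row
        name, _host, host_ip, subnet = fields
        result[name] = (host_ip, subnet)
    return result


def changed_host_ips(before, after):
    """Return {name: (old_ip, new_ip)} for nodes whose HOST IP differs."""
    return {
        name: (before[name][0], after[name][0])
        for name in before.keys() & after.keys()
        if before[name][0] != after[name][0]
    }


# Listings copied from the comment above (whitespace collapsed).
BEFORE = """\
NAME HOST HOST IP SUBNET
ghuang-ocp-openshift-infra-0.example.com ghuang-ocp-openshift-infra-0.example.com 192.168.10.6 10.130.6.0/23
ghuang-ocp-openshift-master-0.example.com ghuang-ocp-openshift-master-0.example.com 192.168.10.5 10.129.6.0/23
ghuang-ocp-openshift-node-rxch1oh4.example.com ghuang-ocp-openshift-node-rxch1oh4.example.com 192.168.10.7 10.131.6.0/23
"""

AFTER = """\
NAME HOST HOST IP SUBNET
ghuang-ocp-openshift-infra-0.example.com ghuang-ocp-openshift-infra-0.example.com 10.0.10.3 10.128.8.0/23
ghuang-ocp-openshift-master-0.example.com ghuang-ocp-openshift-master-0.example.com 192.168.10.5 10.130.8.0/23
ghuang-ocp-openshift-node-rxch1oh4.example.com ghuang-ocp-openshift-node-rxch1oh4.example.com 192.168.10.7 10.129.8.0/23
"""

changed = changed_host_ips(parse_hostsubnets(BEFORE), parse_hostsubnets(AFTER))
for name, (old_ip, new_ip) in sorted(changed.items()):
    print(f"{name}: HOST IP {old_ip} -> {new_ip}")
```

On the data above this reports only the infra node, which matches the node that could no longer ping the others.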
The QE test for this bug is incorrect. We have to simulate a change in the IP address of a live node and observe that the change is reflected in the node status but not in the hostsubnet fields. So the 'oc delete node --all' step given above should not be done. Adding a new node will always pick up whatever address is reported.
Tested with origin branch origin/release-1.3, cherry-picking the changes in commit a5e26fff69b3f66cf56b182cc9b8994e37c39f87.
Before the commit was included:
# journalctl -lf | grep -i subnet
Dec 25 22:15:57 ghuang-origin-openshift-master-0.example.com origin-master[116023]: I1225 22:15:57.340624 116023 subnets.go:67] Updated HostSubnet ghuang-origin-openshift-master-0.example.com (host: "ghuang-origin-openshift-master-0.example.com", ip: "10.0.10.2", subnet: "10.129.0.0/23")
Dec 25 22:16:07 ghuang-origin-openshift-master-0.example.com origin-master[116023]: I1225 22:16:07.843053 116023 subnets.go:67] Updated HostSubnet ghuang-origin-openshift-master-0.example.com (host: "ghuang-origin-openshift-master-0.example.com", ip: "192.168.10.5", subnet: "10.129.0.0/23")
# while true; do time curl -s --resolve unsecure.example.com:80:10.19.114.135 http://unsecure.example.com --output /dev/null -w "status %{http_code}" ; sleep 1 ; done
status 200
real    0m1.603s
user    0m0.004s
sys     0m0.004s
status 200
real    0m0.597s
user    0m0.003s
sys     0m0.004s
status 200
real    0m0.599s
user    0m0.003s
sys     0m0.004s
status 200
real    0m0.598s
user    0m0.005s
sys     0m0.002s
...
...
status 503
real    0m20.608s
user    0m0.004s
sys     0m0.004s
After the change and rebuilding the binary: no host IP flip-flop in the master log, and no 503 return code from the access loop.
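When the curl loop runs for a long time, the bad samples are easy to miss in the scrollback. A small sketch like the one below could pull them out of captured output. It assumes the output format produced by the loop in this comment (`status NNN` followed by the `real` line from `time`); the names `parse_samples` and `failures` and the 5-second `slow_after` threshold are arbitrary choices for illustration, not anything from the product.

```python
import re

# Matches the "status NNN" written by curl -w, followed by the "real" line from `time`.
SAMPLE_RE = re.compile(r"status (\d{3})\s+real\s+(\d+)m([\d.]+)s")


def parse_samples(text):
    """Return a list of (http_status, elapsed_seconds) tuples."""
    return [
        (int(code), int(minutes) * 60 + float(seconds))
        for code, minutes, seconds in SAMPLE_RE.findall(text)
    ]


def failures(samples, slow_after=5.0):
    """Keep samples that are not HTTP 200 or took longer than slow_after seconds."""
    return [(code, secs) for code, secs in samples if code != 200 or secs > slow_after]


# Output copied from the loop in this comment (elided part left out).
OUTPUT = """\
status 200
real 0m1.603s
user 0m0.004s
sys 0m0.004s
status 200
real 0m0.597s
user 0m0.003s
sys 0m0.004s
status 200
real 0m0.599s
user 0m0.003s
sys 0m0.004s
status 200
real 0m0.598s
user 0m0.005s
sys 0m0.002s
status 503
real 0m20.608s
user 0m0.004s
sys 0m0.004s
"""

samples = parse_samples(OUTPUT)
for code, secs in failures(samples):
    print(f"status {code} after {secs:.3f}s")
```

Note that `time` writes to stderr, so the loop output would need to be captured with both streams merged for this to work.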
Oh, sorry, the fix was not applied to the OCP build. Assigning the bug back.
Tested with OCP 3.3.1.9; the issue can still be reproduced.
# journalctl -lf -u atomic-openshift-master | grep -i subnet
Jan 09 04:35:16 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:16.962750 15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-master-0.example.com"
Jan 09 04:35:16 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:16.973505 15782 subnets.go:67] Updated HostSubnet ghuang-33-openshift-master-0.example.com (host: "ghuang-33-openshift-master-0.example.com", ip: "10.0.10.3", subnet: "10.1.2.0/24")
Jan 09 04:35:26 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:26.061139 15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-node-o7hnjs9a.example.com"
Jan 09 04:35:26 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:26.092295 15782 subnets.go:67] Updated HostSubnet ghuang-33-openshift-node-o7hnjs9a.example.com (host: "ghuang-33-openshift-node-o7hnjs9a.example.com", ip: "192.168.10.7", subnet: "10.1.0.0/24")
Jan 09 04:35:37 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:37.366932 15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-master-0.example.com"
Jan 09 04:35:37 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:37.410237 15782 subnets.go:67] Updated HostSubnet ghuang-33-openshift-master-0.example.com (host: "ghuang-33-openshift-master-0.example.com", ip: "192.168.10.6", subnet: "10.1.2.0/24")
Jan 09 04:35:46 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:46.467806 15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-node-o7hnjs9a.example.com"
Jan 09 04:35:46 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:46.537907 15782 subnets.go:67] Updated HostSubnet ghuang-33-openshift-node-o7hnjs9a.example.com (host: "ghuang-33-openshift-node-o7hnjs9a.example.com", ip: "10.0.10.4", subnet: "10.1.0.0/24")
Jan 09 04:35:56 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:56.863929 15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-node-o7hnjs9a.example.com"
Jan 09 04:35:56 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:56.897706 15782 subnets.go:67] Updated HostSubnet ghuang-33-openshift-node-o7hnjs9a.example.com (host: "ghuang-33-openshift-node-o7hnjs9a.example.com", ip: "192.168.10.7", subnet: "10.1.0.0/24")
Jan 09 04:35:57 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:57.751319 15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-master-0.example.com"
@scott Can you help confirm that the fix was merged into 3.3.1.9? Thanks.
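The flip-flop in the master log above can also be detected with a short script instead of reading the journal by hand. This is a sketch only: it assumes the "Updated HostSubnet" message format seen in this thread, the names `ip_history` and `flapping_hosts` are hypothetical, and the sample lines have their journal prefix trimmed for brevity.

```python
import re
from collections import defaultdict

# Message format as it appears in the master log excerpts in this thread.
UPDATE_RE = re.compile(
    r'Updated HostSubnet (\S+) \(host: "[^"]+", ip: "([^"]+)", subnet: "[^"]+"\)'
)


def ip_history(log_text):
    """Return {host: [ip, ip, ...]} in the order the updates were logged."""
    history = defaultdict(list)
    for host, ip in UPDATE_RE.findall(log_text):
        history[host].append(ip)
    return dict(history)


def flapping_hosts(history):
    """Keep only hosts whose HostSubnet was updated with more than one distinct IP."""
    return {host: ips for host, ips in history.items() if len(set(ips)) > 1}


# "Updated HostSubnet" messages from the log above, journal prefix trimmed.
LOG = """\
subnets.go:67] Updated HostSubnet ghuang-33-openshift-master-0.example.com (host: "ghuang-33-openshift-master-0.example.com", ip: "10.0.10.3", subnet: "10.1.2.0/24")
subnets.go:67] Updated HostSubnet ghuang-33-openshift-node-o7hnjs9a.example.com (host: "ghuang-33-openshift-node-o7hnjs9a.example.com", ip: "192.168.10.7", subnet: "10.1.0.0/24")
subnets.go:67] Updated HostSubnet ghuang-33-openshift-master-0.example.com (host: "ghuang-33-openshift-master-0.example.com", ip: "192.168.10.6", subnet: "10.1.2.0/24")
subnets.go:67] Updated HostSubnet ghuang-33-openshift-node-o7hnjs9a.example.com (host: "ghuang-33-openshift-node-o7hnjs9a.example.com", ip: "10.0.10.4", subnet: "10.1.0.0/24")
subnets.go:67] Updated HostSubnet ghuang-33-openshift-node-o7hnjs9a.example.com (host: "ghuang-33-openshift-node-o7hnjs9a.example.com", ip: "192.168.10.7", subnet: "10.1.0.0/24")
"""

flapping = flapping_hosts(ip_history(LOG))
for host, ips in sorted(flapping.items()):
    print(f"{host}: {' -> '.join(ips)}")
```

Both the master and the node show up as flapping between a 10.0.10.x and a 192.168.10.x address in this excerpt. A fixed build would be expected to produce an empty result over the same observation window.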
Meng, I think we have installer work to do to fix this. For now, if you set nodeIP to the IP address of the interface and use 3.3.1.9, does the problem go away?
@scott Adding nodeIP to node-config.yaml does not fix the issue. I also tried on the origin env again: after rebuilding the openshift binary with the fix in, the issue cannot be reproduced.
My apologies, the fix was not included in v3.3.1.9. It will be included in the next 3.3 build. Moving to MODIFIED until such a build is created.
Tested with OCP build 3.3.1.11. The issue has been fixed: no IP flip-flop entries are found in the master log, and no 503 is returned when repeatedly accessing the route. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0199