Bug 1400609
Summary: | [3.3] http traffic failures when accessing pod from outside of the cluster | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Miheer Salunke <misalunk> | |
Component: | Networking | Assignee: | Rajat Chopra <rchopra> | |
Status: | CLOSED ERRATA | QA Contact: | Meng Bo <bmeng> | |
Severity: | urgent | Docs Contact: | ||
Priority: | urgent | |||
Version: | 3.3.1 | CC: | aos-bugs, bbennett, bmeng, clichybi, jokerman, mifiedle, mmccomas, myllynen, rchopra, sdodson | |
Target Milestone: | --- | |||
Target Release: | 3.3.1 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | atomic-openshift-3.3.1.11-1.git.0.cba037c.el7 | Doc Type: | Bug Fix | |
Doc Text: |
Cause: The IP addresses reported for a node were not sorted.
Consequence: Because the first address in the list was chosen, it could differ from the one used previously, so the node's IP address appeared to change. OpenShift then updated the node -> IP mapping, which caused problems as everything moved from one interface to another.
Fix: Sort the addresses.
Result: Traffic flows correctly and the addresses do not change.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1410128 | Environment: | ||
Last Closed: | 2017-01-26 20:42:38 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1410128 |
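The Cause and Fix in the Doc Text can be illustrated with a small sketch. This is not the code from the actual fix; the function names `pick_first_unsorted` and `pick_node_ip` are hypothetical, and the sketch assumes IPv4 addresses only. Which address the real fix ends up preferring is not stated in this report; the point is only that sorting makes the choice independent of the order in which the addresses happen to be reported.

```python
import ipaddress


def pick_first_unsorted(addresses):
    """Behaviour described under 'Cause': take whatever address comes first."""
    return addresses[0]


def pick_node_ip(addresses):
    """Hypothetical illustration of 'Fix': sort, then take the first entry.

    Sorting numerically (not as strings) means the result does not depend
    on the order in which the node's interfaces were enumerated.
    Assumes all entries are IPv4; mixing IPv4 and IPv6 objects would raise
    TypeError when they are compared.
    """
    if not addresses:
        raise ValueError("node reported no addresses")
    ordered = sorted(ipaddress.ip_address(a) for a in addresses)
    return str(ordered[0])


# The two addresses seen for the infra node in this report, in both orders.
report_a = ["192.168.10.6", "10.0.10.3"]
report_b = ["10.0.10.3", "192.168.10.6"]

# Unsorted selection flips with the reporting order ...
print(pick_first_unsorted(report_a), pick_first_unsorted(report_b))
# ... while sorted selection gives the same answer for both.
print(pick_node_ip(report_a), pick_node_ip(report_b))
```

With the unsorted variant the chosen address follows the reporting order, which matches the flip-flop seen in the HostSubnet updates in the comments below; with the sorted variant both orderings yield the same address.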
Description
Miheer Salunke
2016-12-01 15:26:53 UTC
Just to make it clear, the workaround is to set nodeIP to the internal IP address of the instance.

PR for the fix: https://github.com/openshift/origin/pull/12107

Tested with latest origin build.

    # openshift version
    openshift v1.5.0-alpha.0+8a850ad-503
    kubernetes v1.4.0+776c994
    etcd 3.1.0-rc.0

I found that the hostIP may still flip-flop when registering to master.

    [root@ghuang-ocp-openshift-master-0 ~]# oc get hostsubnet
    NAME                                             HOST                                             HOST IP        SUBNET
    ghuang-ocp-openshift-infra-0.example.com         ghuang-ocp-openshift-infra-0.example.com         192.168.10.6   10.130.6.0/23
    ghuang-ocp-openshift-master-0.example.com        ghuang-ocp-openshift-master-0.example.com        192.168.10.5   10.129.6.0/23
    ghuang-ocp-openshift-node-rxch1oh4.example.com   ghuang-ocp-openshift-node-rxch1oh4.example.com   192.168.10.7   10.131.6.0/23

    [root@ghuang-ocp-openshift-master-0 ~]# oc delete node --all
    node "ghuang-ocp-openshift-infra-0.example.com" deleted
    node "ghuang-ocp-openshift-master-0.example.com" deleted
    node "ghuang-ocp-openshift-node-rxch1oh4.example.com" deleted

Restart all nodes.

    # oc get hostsubnet
    NAME                                             HOST                                             HOST IP        SUBNET
    ghuang-ocp-openshift-infra-0.example.com         ghuang-ocp-openshift-infra-0.example.com         10.0.10.3      10.128.8.0/23
    ghuang-ocp-openshift-master-0.example.com        ghuang-ocp-openshift-master-0.example.com        192.168.10.5   10.130.8.0/23
    ghuang-ocp-openshift-node-rxch1oh4.example.com   ghuang-ocp-openshift-node-rxch1oh4.example.com   192.168.10.7   10.129.8.0/23

And the node with the incorrect HOST IP cannot reach the other nodes through the cluster IP.

    [root@ghuang-ocp-openshift-infra-0 ~]# ping 10.130.8.1
    PING 10.130.8.1 (10.130.8.1) 56(84) bytes of data.
    From 10.128.8.1 icmp_seq=1 Destination Host Unreachable
    From 10.128.8.1 icmp_seq=2 Destination Host Unreachable

The QE test for this bug is incorrect. We have to simulate a change in the IP address of a live node and check that the change is reflected in the node status, but not in the hostsubnet fields. So the 'oc delete node --all' step given above should not be done.
Adding a new node will always pick up whatever address is reported.

Tested with origin branch origin/release-1.3, cherry-picking the changes in commit a5e26fff69b3f66cf56b182cc9b8994e37c39f87.

Before the commit was included:

    # journalctl -lf | grep -i subnet
    Dec 25 22:15:57 ghuang-origin-openshift-master-0.example.com origin-master[116023]: I1225 22:15:57.340624 116023 subnets.go:67] Updated HostSubnet ghuang-origin-openshift-master-0.example.com (host: "ghuang-origin-openshift-master-0.example.com", ip: "10.0.10.2", subnet: "10.129.0.0/23")
    Dec 25 22:16:07 ghuang-origin-openshift-master-0.example.com origin-master[116023]: I1225 22:16:07.843053 116023 subnets.go:67] Updated HostSubnet ghuang-origin-openshift-master-0.example.com (host: "ghuang-origin-openshift-master-0.example.com", ip: "192.168.10.5", subnet: "10.129.0.0/23")

    # while true; do time curl -s --resolve unsecure.example.com:80:10.19.114.135 http://unsecure.example.com --output /dev/null -w "status %{http_code}" ; sleep 1 ; done
    status 200
    real 0m1.603s
    user 0m0.004s
    sys 0m0.004s
    status 200
    real 0m0.597s
    user 0m0.003s
    sys 0m0.004s
    status 200
    real 0m0.599s
    user 0m0.003s
    sys 0m0.004s
    status 200
    real 0m0.598s
    user 0m0.005s
    sys 0m0.002s
    ...
    ...
    status 503
    real 0m20.608s
    user 0m0.004s
    sys 0m0.004s

After applying the change and rebuilding the binary: no host IP flip-flop in the master log, and no 503 return code from the access loop.

Oh, sorry, the fix was not applied to the OCP build. Assigning the bug back.

Tested with OCP 3.3.1.9; the issue can still be reproduced.

    # journalctl -lf -u atomic-openshift-master | grep -i subnet
    Jan 09 04:35:16 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:16.962750 15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-master-0.example.com"
    Jan 09 04:35:16 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:16.973505 15782 subnets.go:67] Updated HostSubnet ghuang-33-openshift-master-0.example.com (host: "ghuang-33-openshift-master-0.example.com", ip: "10.0.10.3", subnet: "10.1.2.0/24")
    Jan 09 04:35:26 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:26.061139 15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-node-o7hnjs9a.example.com"
    Jan 09 04:35:26 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:26.092295 15782 subnets.go:67] Updated HostSubnet ghuang-33-openshift-node-o7hnjs9a.example.com (host: "ghuang-33-openshift-node-o7hnjs9a.example.com", ip: "192.168.10.7", subnet: "10.1.0.0/24")
    Jan 09 04:35:37 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:37.366932 15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-master-0.example.com"
    Jan 09 04:35:37 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:37.410237 15782 subnets.go:67] Updated HostSubnet ghuang-33-openshift-master-0.example.com (host: "ghuang-33-openshift-master-0.example.com", ip: "192.168.10.6", subnet: "10.1.2.0/24")
    Jan 09 04:35:46 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:46.467806 15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-node-o7hnjs9a.example.com"
    Jan 09 04:35:46 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:46.537907 15782 subnets.go:67] Updated HostSubnet ghuang-33-openshift-node-o7hnjs9a.example.com (host: "ghuang-33-openshift-node-o7hnjs9a.example.com", ip: "10.0.10.4", subnet: "10.1.0.0/24")
    Jan 09 04:35:56 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:56.863929 15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-node-o7hnjs9a.example.com"
    Jan 09 04:35:56 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:56.897706 15782 subnets.go:67] Updated HostSubnet ghuang-33-openshift-node-o7hnjs9a.example.com (host: "ghuang-33-openshift-node-o7hnjs9a.example.com", ip: "192.168.10.7", subnet: "10.1.0.0/24")
    Jan 09 04:35:57 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:57.751319 15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-master-0.example.com"

@scott Can you help confirm that the fix was merged into 3.3.1.9? Thanks.

Meng, I think we have installer work to do to fix this. For now, if you set nodeIP to the IP address of the interface and use 3.3.1.9, does the problem go away?

@scott Adding nodeIP to node-config.yaml does not fix the issue. I also tried the origin environment again after rebuilding the openshift binary with the fix included; there the issue cannot be reproduced.

My apologies, the fix was not included in v3.3.1.9. It will be included in the next 3.3 build. Moving to MODIFIED until such a build is created.

Tested with OCP build 3.3.1.11. The issue has been fixed: no IP flip-flop entries are found in the master log, and no 503 is returned while repeatedly accessing the route. Moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0199
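The verification in the comments above relies on reading the master log for "Updated HostSubnet" entries whose ip value changes for the same host. A minimal sketch of automating that check follows. It assumes the log message layout shown in this report (subnets.go:67); the helper name `find_ip_flip_flops` and the regular expression are illustrative and not part of OpenShift.

```python
import re
from collections import defaultdict

# Matches the message format seen in this report, e.g.
# Updated HostSubnet <name> (host: "<host>", ip: "<ip>", subnet: "<cidr>")
HOSTSUBNET_RE = re.compile(
    r'Updated HostSubnet \S+ \(host: "(?P<host>[^"]+)", '
    r'ip: "(?P<ip>[^"]+)", subnet: "(?P<subnet>[^"]+)"\)'
)


def find_ip_flip_flops(log_lines):
    """Return {host: [ip, ip, ...]} for hosts whose HostSubnet IP changed.

    Consecutive updates with the same IP are collapsed, so a host that
    keeps one address never appears in the result.
    """
    history = defaultdict(list)
    for line in log_lines:
        match = HOSTSUBNET_RE.search(line)
        if match is None:
            continue  # e.g. "Watch MODIFIED event" lines
        ips = history[match.group("host")]
        if not ips or ips[-1] != match.group("ip"):
            ips.append(match.group("ip"))
    return {host: ips for host, ips in history.items() if len(ips) > 1}


# Message parts copied from the 3.3.1.9 log excerpt in this report
# (syslog prefix omitted for brevity).
NODE = "ghuang-33-openshift-node-o7hnjs9a.example.com"
sample = [
    f'subnets.go:182] Watch MODIFIED event for Node "{NODE}"',
    f'subnets.go:67] Updated HostSubnet {NODE} (host: "{NODE}", ip: "192.168.10.7", subnet: "10.1.0.0/24")',
    f'subnets.go:67] Updated HostSubnet {NODE} (host: "{NODE}", ip: "10.0.10.4", subnet: "10.1.0.0/24")',
    f'subnets.go:67] Updated HostSubnet {NODE} (host: "{NODE}", ip: "192.168.10.7", subnet: "10.1.0.0/24")',
]

for host, ips in find_ip_flip_flops(sample).items():
    print(host, " -> ".join(ips))
```

On a fixed build, as in the 3.3.1.11 test above, the function would be expected to return an empty dictionary for the same log stream, since no host's recorded IP changes.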