Bug 1400609 - [3.3] http traffic failures when accessing pod from outside of the cluster
Summary: [3.3] http traffic failures when accessing pod from outside of the cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.3.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.3.1
Assignee: Rajat Chopra
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks: 1410128
 
Reported: 2016-12-01 15:26 UTC by Miheer Salunke
Modified: 2017-01-26 20:42 UTC (History)

Fixed In Version: atomic-openshift-3.3.1.11-1.git.0.cba037c.el7
Doc Type: Bug Fix
Doc Text:
Cause: The IP addresses reported for a node were not sorted.
Consequence: Because the first address in the list is chosen, it could differ from the one used previously, so the node's IP address appeared to have changed. OpenShift would then update the node-to-IP mapping, which caused problems as everything moved from one interface to another.
Fix: Sort the addresses.
Result: The chosen address no longer changes and traffic flows correctly.
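The cause and fix described in the Doc Text can be illustrated with a minimal sketch. This is not the code from the actual fix; the function names `pick_first` and `pick_sorted` and the sample address lists are hypothetical, and the sketch assumes IPv4 addresses only. It only shows why taking the first element of an unsorted list is unstable, while sorting first makes the choice independent of the reported order.

```python
import ipaddress


def pick_first(addresses):
    """Naive selection: take whichever address happens to be listed first."""
    return addresses[0]


def pick_sorted(addresses):
    """Order-independent selection: sort numerically, then take the first.

    Assumes all addresses belong to one IP family (here IPv4); mixing
    IPv4 and IPv6 objects in one sort would raise TypeError.
    """
    return sorted(addresses, key=ipaddress.ip_address)[0]


# Hypothetical example: the same two interface addresses, reported in a
# different order on two consecutive node status updates.
report_a = ["192.168.10.6", "10.0.10.3"]
report_b = ["10.0.10.3", "192.168.10.6"]

flips = pick_first(report_a) != pick_first(report_b)      # address appears to change
stable = pick_sorted(report_a) == pick_sorted(report_b)   # address stays the same
print(flips, stable)
```

Which address ends up first after sorting is not the point of the sketch; what matters is that the same input set always yields the same choice.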
Clone Of:
: 1410128 (view as bug list)
Environment:
Last Closed: 2017-01-26 20:42:38 UTC
Target Upstream Version:
Embargoed:



Links
System: Red Hat Product Errata
ID: RHBA-2017:0199
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: OpenShift Container Platform 3.3.1.11 and 3.2.1.23 bug fix update
Last Updated: 2017-01-27 01:41:56 UTC

Description Miheer Salunke 2016-12-01 15:26:53 UTC
Description of problem:

We are having trouble with openshift-sdn: HTTP requests from outside the cluster to a pod occasionally take a long time to get a response (tens of seconds), or get no response at all other than an HTTP 503. We have traced the problem down to the OVS Ethernet switch. With tcpdump we can see that traffic arrives at OVS port1 (2 in the picture), and normally we also see it on vethX (1 in the picture) on its way to the pod. When traffic halts, however, we see only SYNs for the pod on OVS port1 (2 in the picture) and no traffic on vethX (1 in the picture).

The picture I am referring to is "SDN Flows Inside a Node", from here: https://docs.openshift.com/container-platform/3.3/admin_guide/sdn_troubleshooting.html


This behaviour occurs frequently and is easy to trigger: just curl the HTTP endpoint in a while loop. It shows up within a few minutes and keeps recurring.


Version-Release number of selected component (if applicable):
OCP 3.3 on Openstack using 

How reproducible:
On customer side

Steps to Reproduce:
1. install using https://github.com/redhat-openstack/openshift-on-openstack

Actual results:


Expected results:


Additional info:

Comment 19 Scott Dodson 2016-12-09 13:27:55 UTC
Just to make it clear, the workaround is to set nodeIP to the internal IP address of the instance.

Comment 22 Rajat Chopra 2016-12-12 18:14:01 UTC
PR for the fix: https://github.com/openshift/origin/pull/12107

Comment 27 Meng Bo 2016-12-22 11:01:21 UTC
Tested with latest origin build.
# openshift version
openshift v1.5.0-alpha.0+8a850ad-503
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

I found that the hostIP may still flip-flop when registering to master.

[root@ghuang-ocp-openshift-master-0 ~]# oc get hostsubnet
NAME                                             HOST                                             HOST IP        SUBNET
ghuang-ocp-openshift-infra-0.example.com         ghuang-ocp-openshift-infra-0.example.com         192.168.10.6   10.130.6.0/23
ghuang-ocp-openshift-master-0.example.com        ghuang-ocp-openshift-master-0.example.com        192.168.10.5   10.129.6.0/23
ghuang-ocp-openshift-node-rxch1oh4.example.com   ghuang-ocp-openshift-node-rxch1oh4.example.com   192.168.10.7   10.131.6.0/23

[root@ghuang-ocp-openshift-master-0 ~]# oc delete node --all
node "ghuang-ocp-openshift-infra-0.example.com" deleted
node "ghuang-ocp-openshift-master-0.example.com" deleted
node "ghuang-ocp-openshift-node-rxch1oh4.example.com" deleted

Restart all nodes.

# oc get hostsubnet 
NAME                                             HOST                                             HOST IP        SUBNET
ghuang-ocp-openshift-infra-0.example.com         ghuang-ocp-openshift-infra-0.example.com         10.0.10.3      10.128.8.0/23
ghuang-ocp-openshift-master-0.example.com        ghuang-ocp-openshift-master-0.example.com        192.168.10.5   10.130.8.0/23
ghuang-ocp-openshift-node-rxch1oh4.example.com   ghuang-ocp-openshift-node-rxch1oh4.example.com   192.168.10.7   10.129.8.0/23


And the node with the incorrect HOST IP cannot reach the other nodes through the cluster IP.
[root@ghuang-ocp-openshift-infra-0 ~]# ping 10.130.8.1
PING 10.130.8.1 (10.130.8.1) 56(84) bytes of data.
From 10.128.8.1 icmp_seq=1 Destination Host Unreachable
From 10.128.8.1 icmp_seq=2 Destination Host Unreachable

Comment 29 Rajat Chopra 2016-12-25 18:17:58 UTC
The QE test for this bug is incorrect. We have to simulate a change in the IP address of a live node and check that the change is reflected in the node status, but not in the hostsubnet fields.

So the 'oc delete node --all' step given above should not be performed. Adding a new node will always pick up whatever address is reported.

Comment 30 Meng Bo 2016-12-26 04:19:47 UTC
Tested with origin branch origin/release-1.3, with the changes in commit a5e26fff69b3f66cf56b182cc9b8994e37c39f87 cherry-picked.


Before the commit was included:

# journalctl -lf | grep -i subnet
Dec 25 22:15:57 ghuang-origin-openshift-master-0.example.com origin-master[116023]: I1225 22:15:57.340624  116023 subnets.go:67] Updated HostSubnet ghuang-origin-openshift-master-0.example.com (host: "ghuang-origin-openshift-master-0.example.com", ip: "10.0.10.2", subnet: "10.129.0.0/23")
Dec 25 22:16:07 ghuang-origin-openshift-master-0.example.com origin-master[116023]: I1225 22:16:07.843053  116023 subnets.go:67] Updated HostSubnet ghuang-origin-openshift-master-0.example.com (host: "ghuang-origin-openshift-master-0.example.com", ip: "192.168.10.5", subnet: "10.129.0.0/23")

# while true; do time curl -s --resolve unsecure.example.com:80:10.19.114.135 http://unsecure.example.com --output /dev/null -w "status %{http_code}" ; sleep 1 ; done
status 200
real    0m1.603s
user    0m0.004s
sys     0m0.004s
status 200
real    0m0.597s
user    0m0.003s
sys     0m0.004s
status 200
real    0m0.599s
user    0m0.003s
sys     0m0.004s
status 200
real    0m0.598s
user    0m0.005s
sys     0m0.002s
...
...
status 503
real    0m20.608s
user    0m0.004s
sys     0m0.004s
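
When the loop runs for a long time, scanning its output by eye gets tedious. The following is a minimal sketch, assuming the output has been captured to a string in exactly the `status NNN` / `real XmY.YYYs` form shown above. The function name `summarize` and the `slow_threshold` parameter are hypothetical and not part of any OpenShift tooling.

```python
import re

STATUS_RE = re.compile(r"^status (\d{3})$")
REAL_RE = re.compile(r"^real\s+(\d+)m([\d.]+)s$")


def summarize(output, slow_threshold=5.0):
    """Pair each 'status NNN' line with the 'real' time that follows it and
    return (status, seconds) for requests that failed (non-200) or took
    longer than slow_threshold seconds (a hypothetical cut-off)."""
    problems = []
    status = None
    for line in output.splitlines():
        line = line.strip()
        m = STATUS_RE.match(line)
        if m:
            status = int(m.group(1))
            continue
        m = REAL_RE.match(line)
        if m and status is not None:
            seconds = int(m.group(1)) * 60 + float(m.group(2))
            if status != 200 or seconds > slow_threshold:
                problems.append((status, seconds))
            status = None
    return problems


# Excerpt of the loop output quoted in this comment.
sample = """status 200
real    0m0.597s
user    0m0.003s
sys     0m0.004s
status 503
real    0m20.608s
user    0m0.004s
sys     0m0.004s
"""
print(summarize(sample))
```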


After applying the change and rebuilding the binary:
No host IP flip-flop in the master log.
No 503 return code from the curl loop.

Comment 31 Meng Bo 2016-12-26 04:21:06 UTC
Oh, sorry, the fix was not applied to the OCP build. Assigning the bug back.

Comment 37 Meng Bo 2017-01-09 09:55:30 UTC
Tested with OCP 3.3.1.9; the issue can still be reproduced.

# journalctl -lf -u atomic-openshift-master | grep -i subnet
Jan 09 04:35:16 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:16.962750   15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-master-0.example.com"
Jan 09 04:35:16 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:16.973505   15782 subnets.go:67] Updated HostSubnet ghuang-33-openshift-master-0.example.com (host: "ghuang-33-openshift-master-0.example.com", ip: "10.0.10.3", subnet: "10.1.2.0/24")
Jan 09 04:35:26 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:26.061139   15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-node-o7hnjs9a.example.com"
Jan 09 04:35:26 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:26.092295   15782 subnets.go:67] Updated HostSubnet ghuang-33-openshift-node-o7hnjs9a.example.com (host: "ghuang-33-openshift-node-o7hnjs9a.example.com", ip: "192.168.10.7", subnet: "10.1.0.0/24")
Jan 09 04:35:37 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:37.366932   15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-master-0.example.com"
Jan 09 04:35:37 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:37.410237   15782 subnets.go:67] Updated HostSubnet ghuang-33-openshift-master-0.example.com (host: "ghuang-33-openshift-master-0.example.com", ip: "192.168.10.6", subnet: "10.1.2.0/24")
Jan 09 04:35:46 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:46.467806   15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-node-o7hnjs9a.example.com"
Jan 09 04:35:46 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:46.537907   15782 subnets.go:67] Updated HostSubnet ghuang-33-openshift-node-o7hnjs9a.example.com (host: "ghuang-33-openshift-node-o7hnjs9a.example.com", ip: "10.0.10.4", subnet: "10.1.0.0/24")
Jan 09 04:35:56 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:56.863929   15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-node-o7hnjs9a.example.com"
Jan 09 04:35:56 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:56.897706   15782 subnets.go:67] Updated HostSubnet ghuang-33-openshift-node-o7hnjs9a.example.com (host: "ghuang-33-openshift-node-o7hnjs9a.example.com", ip: "192.168.10.7", subnet: "10.1.0.0/24")
Jan 09 04:35:57 ghuang-33-openshift-master-0.example.com atomic-openshift-master[15782]: I0109 04:35:57.751319   15782 subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-master-0.example.com"
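
The flip-flop in the log above can be picked out mechanically. The following is a minimal sketch, assuming each log entry sits on a single line and uses the `Updated HostSubnet ... (host: "...", ip: "...", subnet: "...")` format shown above. The helper name `find_ip_flips` is hypothetical and not part of OpenShift.

```python
import re
from collections import defaultdict

HOSTSUBNET_RE = re.compile(
    r'Updated HostSubnet (?P<name>\S+) '
    r'\(host: "(?P<host>[^"]+)", ip: "(?P<ip>[^"]+)", subnet: "(?P<subnet>[^"]+)"\)'
)


def find_ip_flips(lines):
    """Return {host: [ip, ip, ...]} for hosts whose logged IP changed."""
    history = defaultdict(list)
    for line in lines:
        match = HOSTSUBNET_RE.search(line)
        if match is None:
            continue  # e.g. "Watch MODIFIED event" lines
        ips = history[match["host"]]
        # Record an address only when it differs from the previous one.
        if not ips or ips[-1] != match["ip"]:
            ips.append(match["ip"])
    return {host: ips for host, ips in history.items() if len(ips) > 1}


# Two entries for the master node, taken from the log above (prefix shortened).
sample_lines = [
    'subnets.go:67] Updated HostSubnet ghuang-33-openshift-master-0.example.com '
    '(host: "ghuang-33-openshift-master-0.example.com", ip: "10.0.10.3", subnet: "10.1.2.0/24")',
    'subnets.go:182] Watch MODIFIED event for Node "ghuang-33-openshift-master-0.example.com"',
    'subnets.go:67] Updated HostSubnet ghuang-33-openshift-master-0.example.com '
    '(host: "ghuang-33-openshift-master-0.example.com", ip: "192.168.10.6", subnet: "10.1.2.0/24")',
]
print(find_ip_flips(sample_lines))
```

In practice the lines could come from the `journalctl` output, for example by reading a saved log file line by line; a non-empty result means at least one host changed its registered IP.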


@scott Can you help confirm that the fix was merged into 3.3.1.9? Thanks.

Comment 38 Scott Dodson 2017-01-09 16:28:08 UTC
Meng,

I think we have installer work to do to fix this. For now, if you set nodeIP to the IP address of the interface and use 3.3.1.9, does the problem go away?

Comment 39 Meng Bo 2017-01-10 09:02:28 UTC
@scott

Adding nodeIP to node-config.yaml does not fix the issue.

I also tried the origin environment again, after rebuilding the openshift binary with the fix included. The issue could not be reproduced.

Comment 40 Scott Dodson 2017-01-10 14:05:52 UTC
My apologies, the fix was not included in v3.3.1.9. It will be included in the next 3.3 build. Moving to MODIFIED until such a build is created.

Comment 46 Meng Bo 2017-01-22 02:49:00 UTC
Tested with OCP build 3.3.1.11

The issue has been fixed: no IP flip-flop entries were found in the master log, and no 503 was returned while repeatedly accessing the route.

Moving to VERIFIED.

Comment 48 errata-xmlrpc 2017-01-26 20:42:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0199

