Bug 1607395
| Summary: | Need to update egress IPs when node changes IP | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Dan Winship <danw> |
| Component: | Networking | Assignee: | Dan Winship <danw> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Meng Bo <bmeng> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.11.0 | CC: | aos-bugs, cdc, dcbw, xtian |
| Target Milestone: | --- | | |
| Target Release: | 3.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-12-21 15:16:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Doc Text:

Cause: The Egress IP code did not handle node IPs changing.

Consequence: In some cloud environments, a node that is removed and later brought back may be given a different IP address. If egress IPs were hosted on that node, other nodes would not update their OVS flows to use the new node IP. (This bug was noticed during code review and may not have actually affected any customers.)

Fix: The Egress IP code now tracks changes to node IPs.

Result: If an egress node changes IP, other nodes update their rules accordingly.
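For context on the mechanism described in the Doc Text: in OpenShift 3.x, an egress IP is hosted by listing it in the `egressIPs` field of a node's HostSubnet, and a project routes its traffic through it by listing it in the `egressIPs` field of its NetNamespace. A minimal sketch of that setup follows; the node name, project name, and IP are placeholders, not values from this bug:

```
# Host the egress IP on the chosen egress node (placeholder node name and IP)
oc patch hostsubnet egress-node --type=merge -p '{"egressIPs": ["10.66.140.101"]}'

# Route egress traffic from the project through that IP (placeholder project name)
oc patch netnamespace egress-test --type=merge -p '{"egressIPs": ["10.66.140.101"]}'
```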
Is this a valid scenario? When the node IP changes, some other things will break too. For example, the certificate the master generated for the node will become invalid, causing TLS handshake errors in node-to-master communication.

Hm... HostSubnets definitely get renumbered sometimes. If you have a cloud deployment with nodes coming and going, and nodes being dynamically assigned IPs via DHCP or something, then you can get something where two nodes reboot, and then come back up with their IP addresses swapped. We've had bugs involving that before. I don't know what happens with certificates in that case...

(In reply to Dan Winship from comment #3)
> Hm... HostSubnets definitely get renumbered sometimes. If you have a cloud
> deployment with nodes coming and going, and nodes being dynamically assigned
> IPs via DHCP or something, then you can get something where two nodes
> reboot, and then come back up with their IP addresses swapped. We've had
> bugs involving that before. I don't know what happens with certificates in
> that case...

Great question. I know I've seen this happen in the past too, but I don't know about the cert issue. Could the issue be add/delete nodes via Ansible? Don't we only support Ansible as the mechanism for modifying your cluster, including adding/removing nodes?

Checked on v3.11.0-0.32.0. The issue has been fixed: the tun_dst field value is updated when the node IP changes.

Before the egress node IP changed:

```
# ovs-ofctl dump-flows br0 -O openflow13 | grep table=100
 cookie=0x0, duration=1507.103s, table=100, n_packets=0, n_bytes=0, priority=200,tcp,nw_dst=10.66.140.72,tp_dst=53 actions=output:2
 cookie=0x0, duration=1507.103s, table=100, n_packets=0, n_bytes=0, priority=200,udp,nw_dst=10.66.140.72,tp_dst=53 actions=output:2
 cookie=0x0, duration=170.514s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0x673eb6 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.66.140.77->tun_dst,output:1
 cookie=0x0, duration=1507.103s, table=100, n_packets=0, n_bytes=0, priority=0 actions=goto_table:101
```

After the egress node IP changed:

```
# ovs-ofctl dump-flows br0 -O openflow13 | grep table=100
 cookie=0x0, duration=2294.917s, table=100, n_packets=0, n_bytes=0, priority=200,tcp,nw_dst=10.66.140.72,tp_dst=53 actions=output:2
 cookie=0x0, duration=2294.917s, table=100, n_packets=0, n_bytes=0, priority=200,udp,nw_dst=10.66.140.72,tp_dst=53 actions=output:2
 cookie=0x0, duration=190.496s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0x673eb6 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.66.140.211->tun_dst,output:1
 cookie=0x0, duration=2294.917s, table=100, n_packets=0, n_bytes=0, priority=0 actions=goto_table:101
```

Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.
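For anyone re-verifying, a narrower form of the check above watches only the egress flow's tunnel destination; this is a sketch assuming the same br0 bridge and table 100 layout shown in the dumps:

```
# On the app node: show only the egress flow's tun_dst in table 100
ovs-ofctl dump-flows br0 -O openflow13 'table=100' | grep tun_dst

# Or watch it update live while the egress node's IP changes
watch -n 2 "ovs-ofctl dump-flows br0 -O openflow13 'table=100' | grep tun_dst"
```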
Description of problem:
If a node with egress IPs reboots and comes up with a different IP address, other nodes will not update their OVS flows for the egress IP to point to the new node IP.

Steps to Reproduce:
1. Create two nodes ("egress node" and "app node") and assign an egress IP to the egress node.
2. Create a namespace, assign the egress IP to that namespace, create a pod in that namespace on the app node, and confirm that egress traffic from the pod is sent to the egress node so that it uses the egress IP.
3. Stop the egress node's atomic-openshift-node service, change the node's IP address, and reboot it.

Actual results:
The app node still has OVS flows in table 100 pointing to the original egress node IP. The pod can no longer send egress traffic.

Expected results:
The app node updates itself to reflect the node IP change, and pod egress traffic keeps working.
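A command-level sketch of these reproduction steps, assuming an OpenShift 3.11 cluster; the node names, project name, egress IP, image choice, and external host are placeholders, not values from this bug:

```
# Step 1: assign an egress IP to the egress node and to a test project
oc patch hostsubnet egress-node --type=merge -p '{"egressIPs": ["10.66.140.101"]}'
oc new-project egress-test
oc patch netnamespace egress-test --type=merge -p '{"egressIPs": ["10.66.140.101"]}'

# Step 2: run a pod pinned to the app node and confirm its external traffic
# leaves via the egress IP (check the source IP seen by the external host)
oc run test-pod --image=registry.access.redhat.com/rhel7/rhel-tools --restart=Never \
    --overrides='{"apiVersion": "v1", "spec": {"nodeName": "app-node"}}'
oc exec test-pod -- curl -s http://external-host.example.com/

# Step 3: on the egress node, stop the node service, change the node's IP
# address (via your platform's tooling), and reboot
systemctl stop atomic-openshift-node

# Check: on the app node, before the fix, tun_dst in table 100 keeps
# pointing at the old egress node IP
ovs-ofctl dump-flows br0 -O openflow13 'table=100' | grep tun_dst
```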