Description of problem:
If a node with EgressIPs reboots and comes up with a different IP address, other nodes will not update their OVS flows for the egress IP to point to the new node IP.

Steps to Reproduce:
1. Create two nodes ("egress node" and "app node") and assign an egress IP to the egress node.
2. Create a namespace, assign the egress IP to that namespace, create a pod in that namespace on the app node, and confirm that egress traffic from the pod is sent to the egress node so that it uses the egress IP.
3. Stop the egress node's atomic-openshift-node service, change the node's IP address, and reboot it.

Actual results:
The app node will still have OVS flows in table 100 pointing to the original egress node IP. The pod will no longer be able to send egress traffic.

Expected results:
The app node updates its flows to reflect the node IP change, and pod egress traffic keeps working.
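For anyone reproducing this, a rough outline of steps 1 and 2 using oc (the node name, namespace, pod name, egress IP, and external host below are placeholders, not values from this report):

# oc patch hostsubnet <egress-node> --type=merge -p '{"egressIPs": ["<egress-ip>"]}'
# oc patch netnamespace <test-namespace> --type=merge -p '{"egressIPs": ["<egress-ip>"]}'
# oc exec <test-pod> -n <test-namespace> -- curl -s http://<external-host>/

The external host should see the connection coming from the egress IP (i.e. via the egress node); after step 3, with the bug present, that traffic stops working.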
https://github.com/openshift/origin/pull/20393
Is this a valid scenario? When the node IP changes, some other things will break too. E.g., the certificate the master generated for the node will no longer be valid, which will cause TLS handshake errors in node-to-master communication.
Hm... HostSubnets definitely get renumbered sometimes. If you have a cloud deployment with nodes coming and going, and nodes being dynamically assigned IPs via DHCP or something, then you can get something where two nodes reboot, and then come back up with their IP addresses swapped. We've had bugs involving that before. I don't know what happens with certificates in that case...
(In reply to Dan Winship from comment #3)
> Hm... HostSubnets definitely get renumbered sometimes. If you have a cloud
> deployment with nodes coming and going, and nodes being dynamically assigned
> IPs via DHCP or something, then you can get something where two nodes
> reboot, and then come back up with their IP addresses swapped. We've had
> bugs involving that before. I don't know what happens with certificates in
> that case...

Great question. I know I've seen this happen in the past too, but I don't know about the cert issue. Could the issue be adding/deleting nodes via Ansible? Don't we only support Ansible as the mechanism for modifying your cluster, including adding/removing nodes?
Checked on v3.11.0-0.32.0. The issue has been fixed: the tun_dst field in the table 100 flow is updated when the egress node's IP changes.

Before the egress node IP changed:
# ovs-ofctl dump-flows br0 -O openflow13 | grep table=100
 cookie=0x0, duration=1507.103s, table=100, n_packets=0, n_bytes=0, priority=200,tcp,nw_dst=10.66.140.72,tp_dst=53 actions=output:2
 cookie=0x0, duration=1507.103s, table=100, n_packets=0, n_bytes=0, priority=200,udp,nw_dst=10.66.140.72,tp_dst=53 actions=output:2
 cookie=0x0, duration=170.514s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0x673eb6 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.66.140.77->tun_dst,output:1
 cookie=0x0, duration=1507.103s, table=100, n_packets=0, n_bytes=0, priority=0 actions=goto_table:101

After the egress node IP changed:
# ovs-ofctl dump-flows br0 -O openflow13 | grep table=100
 cookie=0x0, duration=2294.917s, table=100, n_packets=0, n_bytes=0, priority=200,tcp,nw_dst=10.66.140.72,tp_dst=53 actions=output:2
 cookie=0x0, duration=2294.917s, table=100, n_packets=0, n_bytes=0, priority=200,udp,nw_dst=10.66.140.72,tp_dst=53 actions=output:2
 cookie=0x0, duration=190.496s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0x673eb6 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.66.140.211->tun_dst,output:1
 cookie=0x0, duration=2294.917s, table=100, n_packets=0, n_bytes=0, priority=0 actions=goto_table:101
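For context (my understanding, not stated in the verification above): the reg0=0x673eb6 match in the priority=100 flow is the VNID of the namespace that owns the egress IP, so it can be cross-checked against the NETID of that project's NetNamespace; 0x673eb6 is 6766262 in decimal. The namespace name below is a placeholder:

# printf '0x%x\n' "$(oc get netnamespace <test-namespace> -o jsonpath='{.netid}')"

If this prints 0x673eb6, it matches the reg0 value in the flows shown above.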
Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.