Description of problem:

$ cat oc-hostsubnet.txt | grep infra0[12]
infra01-<removed>   infra01-<removed>   192.168.52.153   10.129.4.0/23   []   [192.168.52.166, 192.168.52.168, 192.168.52.233]
infra02-<removed>   infra02-<removed>   192.168.52.154   10.129.6.0/23   []   [192.168.52.167, 192.168.52.174, 192.168.52.234]

$ grep <project> oc-netnamespace.txt
<project>   1716405   [192.168.52.168, 192.168.52.174]

1716405 = 0x1a30b5

$ cat ovs.dump-flows.txt | grep 0x1a30b5 | grep table=100
 cookie=0x0, duration=42740.175s, table=100, n_packets=65820, n_bytes=4911204, priority=100,ip,reg0=0x1a30b5 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:192.168.52.154->tun_dst,output:1

Traffic is sent through the other node (infra02, 192.168.52.154). Checking the logs, the node detects itself as offline:

$ grep egress -i sdn.logs.txt | grep -v 'firewall egress network policy' | fgrep -v 'Watch close - *v1.EgressNetworkPolicy' | tail -5
I0130 09:06:22.339162   12757 egressip.go:419] VNID 420850 cannot use egress IP 192.168.52.233 on offline node 192.168.52.153
I0130 09:06:22.339190   12757 egressip.go:419] VNID 12988830 cannot use egress IP 192.168.52.166 on offline node 192.168.52.153
I0130 09:36:22.338826   12757 egressip.go:419] VNID 1716405 cannot use egress IP 192.168.52.168 on offline node 192.168.52.153
I0130 09:36:22.338874   12757 egressip.go:419] VNID 420850 cannot use egress IP 192.168.52.233 on offline node 192.168.52.153
I0130 09:36:22.338959   12757 egressip.go:419] VNID 12988830 cannot use egress IP 192.168.52.166 on offline node 192.168.52.153

Version-Release number of selected component (if applicable):
3.11.129
I still have to check whether this is fixed in OCP 4 and in newer 3.11 releases.

How reproducible:
Intermittent; the trigger is unknown at this point.

Steps to Reproduce:
Unknown.

Actual results:
infra01 detects itself as offline.

Expected results:
infra01 must always detect itself as online.
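The lookup done by hand in the description (NETID to hex, then finding the table=100 flow and its tunnel destination) can be sketched in Go as follows. This is only an illustration: vnidHex and egressTunnelDst are hypothetical helper names, not part of openshift-sdn, and the parsing assumes the flow dump format shown in this report.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// vnidHex formats a NetNamespace NETID the way it appears in the
// reg0 match of the OVS flow dump (lowercase hex with a 0x prefix).
func vnidHex(vnid uint32) string {
	return fmt.Sprintf("0x%x", vnid)
}

// tunDstRe extracts the tunnel destination from a flow's actions.
var tunDstRe = regexp.MustCompile(`set_field:([0-9.]+)->tun_dst`)

// egressTunnelDst scans a flow dump for a table=100 flow that matches
// the given VNID in reg0 and returns the tun_dst it forwards to.
// Hypothetical helper; it assumes the dump format quoted in this report.
func egressTunnelDst(dump string, vnid uint32) (string, bool) {
	want := "reg0=" + vnidHex(vnid)
	for _, line := range strings.Split(dump, "\n") {
		// Split on commas and spaces so that matches are exact fields,
		// not prefixes of a longer VNID or table number.
		fields := strings.FieldsFunc(line, func(r rune) bool {
			return r == ',' || r == ' '
		})
		inTable, hasVNID := false, false
		for _, f := range fields {
			if f == "table=100" {
				inTable = true
			}
			if f == want {
				hasVNID = true
			}
		}
		if !inTable || !hasVNID {
			continue
		}
		if m := tunDstRe.FindStringSubmatch(line); m != nil {
			return m[1], true
		}
	}
	return "", false
}

func main() {
	dump := ` cookie=0x0, duration=42740.175s, table=100, n_packets=65820, n_bytes=4911204, priority=100,ip,reg0=0x1a30b5 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:192.168.52.154->tun_dst,output:1`

	fmt.Println(vnidHex(1716405)) // 0x1a30b5
	dst, ok := egressTunnelDst(dump, 1716405)
	fmt.Println(dst, ok) // 192.168.52.154 true
}
```

A tun_dst equal to another node's host IP (here infra02) means egress traffic for that namespace is being tunnelled away from the node that produced the dump.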
The egress IP code only monitors the health of *other* nodes (pkg/network/node/egressip.go:ClaimEgressIP(); a node is only added to the vxlanMonitor if its IP is not eip.localIP). So the node is failing to recognize its own IP here and is treating its own egress IPs as foreign. They are probably doing something slightly unusual with internal vs. external node IPs that confuses the egressip code. (I'm not sure whether they can fix this by changing their configuration or whether it will require a bugfix to the egress IP code.)
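To illustrate the reasoning above, here is a minimal model of the "only monitor other nodes" rule. It is not the actual openshift-sdn code: the tracker and monitor types, the claim method, and the misdetected address 10.0.0.5 are all hypothetical and exist only to show how a wrong local IP puts the node itself under health monitoring.

```go
package main

import "fmt"

// monitor is a hypothetical stand-in for the VXLAN health monitor:
// it only records which node IPs are being health-checked.
type monitor struct {
	watched map[string]bool
}

// tracker is a hypothetical, simplified model of the egress IP tracker.
type tracker struct {
	localIP string
	mon     *monitor
}

func newTracker(localIP string) *tracker {
	return &tracker{
		localIP: localIP,
		mon:     &monitor{watched: map[string]bool{}},
	}
}

// claim records that nodeIP hosts an egress IP. Only nodes whose IP
// differs from localIP are handed to the monitor, mirroring the rule
// described in the comment above.
func (t *tracker) claim(nodeIP string) {
	if nodeIP != t.localIP {
		t.mon.watched[nodeIP] = true
	}
}

func main() {
	// Local IP detected correctly: only infra02 is monitored.
	good := newTracker("192.168.52.153")
	good.claim("192.168.52.153")
	good.claim("192.168.52.154")
	fmt.Println(good.mon.watched["192.168.52.153"]) // false

	// Local IP misdetected (10.0.0.5 is an assumed value): the node
	// now monitors itself like any remote node, so it can end up
	// marking its own egress IPs as belonging to an offline node.
	bad := newTracker("10.0.0.5")
	bad.claim("192.168.52.153")
	bad.claim("192.168.52.154")
	fmt.Println(bad.mon.watched["192.168.52.153"]) // true
}
```

In this model the fix is either to make localIP match the host IP recorded in the HostSubnet, or to identify the local node by something other than a single IP comparison.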
New logs with the nodeName and nodeIP:

$ oc logs sdn-prw9k -n openshift-sdn
I0323 07:29:22.272826  292779 node.go:147] Initializing SDN node "ip-10-0-173-216.us-east-2.compute.internal" (10.0.173.216) of type "redhat/openshift-ovs-networkpolicy"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409