Bug 1802557

Summary: EgressIP multiple static IPs, node with the egressIP will detect itself as offline
Product: OpenShift Container Platform Reporter: Juan Luis de Sousa-Valadas <jdesousa>
Component: Networking Assignee: Juan Luis de Sousa-Valadas <jdesousa>
Networking sub component: openshift-sdn QA Contact: huirwang
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: medium CC: bbennett, danw, gparente
Version: 3.11.0   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The node detects its own IP incorrectly. Consequence: The node does not claim the egressIP assigned to it. Fix: Get the nodeIP from the K8S API instead. Result: The problem is fixed in 4.5.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-13 17:15:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Juan Luis de Sousa-Valadas 2020-02-13 11:58:01 UTC
Description of problem:
$ cat oc-hostsubnet.txt | grep infra0[12]
infra01-<removed>   infra01-<removed>    []             [,,]
infra02-<removed>   infra02-<removed>    []             [,,]

$ grep <project> oc-netnamespace.txt 
<project>   1716405    [,]

1716405 = 0x1a30b5
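
The conversion above is what links the NetNamespace NETID to the reg0 match in the OVS flow dump. A minimal sketch of that step, assuming only that the flow dump prints reg0 as lowercase hex with a 0x prefix (as in the dump in this report); the helper name is made up for illustration:

```python
def vnid_to_reg0(vnid: int) -> str:
    """Format a decimal VNID the way it appears in 'reg0=0x...' flow matches."""
    return hex(vnid)


# The VNID of the affected project in this report.
print(vnid_to_reg0(1716405))  # 0x1a30b5
```

The resulting string can be used directly as the grep pattern against the dump-flows output.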

$ cat ovs.dump-flows.txt | grep 0x1a30b5 | grep table=100
 cookie=0x0, duration=42740.175s, table=100, n_packets=65820, n_bytes=4911204, priority=100,ip,reg0=0x1a30b5 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:>tun_dst,output:1

Traffic is sent through the other node.

Checking the logs, the node detects itself as offline:

$ grep egress -i sdn.logs.txt | grep -v 'firewall egress network policy' | fgrep -v 'Watch close - *v1.EgressNetworkPolicy' | tail -5
I0130 09:06:22.339162   12757 egressip.go:419] VNID 420850 cannot use egress IP on offline node
I0130 09:06:22.339190   12757 egressip.go:419] VNID 12988830 cannot use egress IP on offline node
I0130 09:36:22.338826   12757 egressip.go:419] VNID 1716405 cannot use egress IP on offline node
I0130 09:36:22.338874   12757 egressip.go:419] VNID 420850 cannot use egress IP on offline node
I0130 09:36:22.338959   12757 egressip.go:419] VNID 12988830 cannot use egress IP on offline node
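
To see which namespaces are affected without reading the log by eye, the VNIDs can be pulled out of these messages. This is only a sketch: the regular expression is written against the message text shown above and is an assumption about its stability across versions, and the function name is hypothetical.

```python
import re

# Matches the message emitted at egressip.go:419 in the log excerpt above.
OFFLINE_RE = re.compile(r"VNID (\d+) cannot use egress IP on offline node")


def offline_vnids(log_lines):
    """Return the sorted, de-duplicated VNIDs reported against an offline node."""
    vnids = set()
    for line in log_lines:
        match = OFFLINE_RE.search(line)
        if match:
            vnids.add(int(match.group(1)))
    return sorted(vnids)


sample = [
    "I0130 09:36:22.338826   12757 egressip.go:419] VNID 1716405 cannot use egress IP on offline node",
    "I0130 09:36:22.338874   12757 egressip.go:419] VNID 420850 cannot use egress IP on offline node",
    "I0130 09:36:22.338959   12757 egressip.go:419] VNID 12988830 cannot use egress IP on offline node",
]
print(offline_vnids(sample))  # [420850, 1716405, 12988830]
```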

Version-Release number of selected component (if applicable):
3.11.129. I still have to check whether there are fixes for this in OCP 4 and newer 3.11 releases.

How reproducible:
Happens every now and then; the trigger is unknown at this point.

Steps to Reproduce:

Actual results:
infra01 detects itself as offline.

Expected results:
infra01 must always detect itself as online.

Comment 6 Dan Winship 2020-03-03 21:07:48 UTC
The egress IP code only monitors the health of *other* nodes (pkg/network/node/egressip.go:ClaimEgressIP(); a node is only added to the vxlanMonitor if its IP is not eip.localIP). So... the node is failing to recognize its own IP here and considering its own egressIPs to be foreign...

They're probably doing something slightly unusual with internal vs external node IPs or something which is confusing the egressip code. (Not sure if this is going to be something they can fix by changing their configuration or if this will require a bugfix to the egressIP code.)
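
The reasoning in this comment can be illustrated with a small model. This is not the openshift-sdn code (which is Go); the function names, the IP addresses (taken from the 192.0.2.0/24 documentation range), and the probe result are all assumptions made for the sketch. It only shows why a wrong local IP makes a node treat its own egress IPs as foreign.

```python
def should_monitor(node_ip: str, local_ip: str) -> bool:
    """Only nodes whose IP differs from the detected local IP are health-checked."""
    return node_ip != local_ip


def egress_node_state(node_ip: str, local_ip: str, probe_ok: bool) -> str:
    """Classify the node hosting an egress IP from this node's point of view."""
    if not should_monitor(node_ip, local_ip):
        return "local"  # never probed, so it can never be marked offline
    return "online" if probe_ok else "offline"


real_ip = "192.0.2.10"   # hypothetical address the cluster knows this node by
wrong_ip = "192.0.2.99"  # hypothetical address the node wrongly detects as its own

# Correct detection: the node recognizes the egress IP host as itself.
print(egress_node_state(real_ip, local_ip=real_ip, probe_ok=False))   # local

# Wrong detection: the node monitors itself as if it were a remote node,
# and a failed probe (assumed here) marks it offline.
print(egress_node_state(real_ip, local_ip=wrong_ip, probe_ok=False))  # offline
```

Under this model the "cannot use egress IP on offline node" message can only appear for the local node when the local IP comparison fails.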

Comment 12 huirwang 2020-03-23 07:56:37 UTC
New logs with the nodeName and nodeIP
oc logs sdn-prw9k -n openshift-sdn
I0323 07:29:22.272826  292779 node.go:147] Initializing SDN node "ip-10-0-173-216.us-east-2.compute.internal" ( of type "redhat/openshift-ovs-networkpolicy"

Comment 19 errata-xmlrpc 2020-07-13 17:15:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.