Bug 1972481

Summary: [OVN] ARP Response from multiple Nodes for single EgressIP
Product: OpenShift Container Platform Reporter: Michael Washer <mwasher>
Component: Networking Assignee: Alexander Constantinescu <aconstan>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE Docs Contact:
Severity: medium    
Priority: medium CC: aconstan, andreas.weise, cldavey, openshift-bugs-escalate, rjamadar, tidawson
Version: 4.6   
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-02 14:11:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Michael Washer 2021-06-16 03:10:57 UTC
Description of problem:
We have two egress IPs assigned to a project and three nodes in the cluster labelled with the egress-assignable label. We expect the MAC address for any given egress IP to be that of the assigned node. However, we have observed multiple nodes responding to ARP requests for the same egress IP address. This is causing ‘flapping’ in the ARP tables.

Version-Release number of selected component (if applicable):
OpenShift 4.6.30
The OCP cluster is UPI, with VMware VMs provisioned as the OCP nodes. The cluster is using OVN-Kubernetes.
The VMware version is 6.7 and the network layer is NSX-T.
NSX-T is a tenant of Cisco ACI (version 4.2(3j)) environment.

How reproducible:
The problem occurs intermittently. We have noted that this happens more frequently after a node crashes.

Steps to Reproduce:
1) Install cluster with OVN-Kubernetes matching the environment described above
2) Create a number of Pods and allocate EgressIPs according to the description
3) Crash a Node
4) Inspect the Northbound DB: it contains excess rules for the EgressIP that do not align with the OpenShift state

Actual results:
Multiple nodes are responding to ARP requests

Expected results:
Only the nodes with current ownership of EgressIPs should respond to ARP requests for the given IP 
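
A minimal sketch of how the flapping could be checked from captured ARP replies, assuming the replies have already been collected as (IP, MAC) pairs by some external tool. The function name and the sample addresses are hypothetical and are not taken from the affected cluster.

```python
from collections import defaultdict


def find_flapping_ips(arp_replies):
    """Return {ip: sorted MACs} for every IP answered by more than one MAC.

    arp_replies is an iterable of (ip, mac) tuples, one per observed ARP reply.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in arp_replies:
        # Normalise case so the same MAC is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical observations (documentation addresses, not from this report).
observed = [
    ("192.0.2.10", "02:00:00:00:00:01"),
    ("192.0.2.10", "02:00:00:00:00:02"),
    ("192.0.2.11", "02:00:00:00:00:03"),
]

for ip, macs in find_flapping_ips(observed).items():
    print(f"{ip} answered by {len(macs)} MACs: {', '.join(macs)}")
```

An egress IP that shows up in the result is being answered by more than one node, which is the behaviour described under "Actual results".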

Additional info:
We can see the following rules in the NBDB database dump, where the same external_ip has two snat entries whose logical_port values show attachment to two different logical routers. This was reproduced in a lab environment.
```
NAT table
_uuid                                external_ids          external_ip      external_mac        external_port_range logical_ip    logical_port               options             type
------------------------------------ --------------------- ---------------- ------------------- ------------------- ------------- -------------------------- ------------------- -------------
fabe9b46-672c-48ee-ab36-f2e612710290 {name=egressips-prod} "172.21.104.123" []                  ""                  "10.128.2.22" k8s-uat-tjp8f-worker-9bh8w {}                  snat
3a09e5bc-f0cd-4587-a432-99316cc813d9 {name=egressips-prod} "172.21.104.123" []                  ""                  "10.128.2.5"  k8s-uat-tjp8f-worker-r4qbl {}                  snat
```
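
The same check can be sketched in code, assuming the NAT rows have already been exported from the NBDB into dictionaries. The rows below are copied from the dump above and reduced to three columns; the function name is hypothetical and this is not part of OVN-Kubernetes.

```python
from collections import defaultdict

# Rows copied from the NAT table dump above (three columns of interest only).
NAT_ROWS = [
    {"external_ip": "172.21.104.123", "logical_ip": "10.128.2.22",
     "logical_port": "k8s-uat-tjp8f-worker-9bh8w"},
    {"external_ip": "172.21.104.123", "logical_ip": "10.128.2.5",
     "logical_port": "k8s-uat-tjp8f-worker-r4qbl"},
]


def find_conflicting_egress_ips(rows):
    """Return {external_ip: sorted logical_ports} for IPs bound to more than one port."""
    ports_by_ip = defaultdict(set)
    for row in rows:
        ports_by_ip[row["external_ip"]].add(row["logical_port"])
    return {ip: sorted(ports) for ip, ports in ports_by_ip.items() if len(ports) > 1}


for ip, ports in find_conflicting_egress_ips(NAT_ROWS).items():
    print(f"{ip} has snat entries on {len(ports)} logical ports: {', '.join(ports)}")
```

For the dump above, this flags 172.21.104.123 as having snat entries on the logical ports of two different worker nodes.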

Comment 8 Alexander Constantinescu 2021-07-02 14:11:54 UTC

*** This bug has been marked as a duplicate of bug 1976215 ***