Bug 1966969

Summary: [OVN] Race condition when updating virtual-port related openflows upon failover.
Product: Red Hat Enterprise Linux Fast Datapath
Component: OVN
Assignee: OVN Team <ovnteam>
Reporter: Dumitru Ceara <dceara>
QA Contact: ying xu <yinxu>
Status: NEW
Severity: high
Priority: medium
Version: FDP 20.H
Type: Bug
Hardware: Unspecified
OS: Unspecified
CC: ctrautma, ihrachys, jiji, jlibosva, kforde, mmichels

Description Dumitru Ceara 2021-06-02 09:19:41 UTC
Description of problem:

When a virtual port changes ownership because a GARP for the virtual port's IP is seen on a different hypervisor, there is a window of time during which both the ovn-controller on the old "owner" of the virtual port and the ovn-controller on the new "owner" have flows installed for that virtual port.

Depending on other configuration, this may lead to both hypervisors handling traffic destined for the virtual port.

One example is when a Floating IP (dnat-and-snat) entry is defined on the virtual port. In such cases OVN installs flows to reply to ARP requests on the hypervisor owning the virtual port. During the time window mentioned above, both the old hypervisor and the new one will have such flows installed and will both reply (with the same source MAC) to ARP requests. This causes the corresponding FDB entry on the ToR switch to flap between ports and, depending on configuration, the MAC may even end up blocked on the wrong ToR switch port.
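
For reference, a minimal sketch of the kind of configuration involved (all names, addresses, and the MAC below are made up for illustration):

  # Virtual port; ownership follows whichever parent last sent a GARP for the VIP:
  ovn-nbctl lsp-add sw0 vp0
  ovn-nbctl lsp-set-type vp0 virtual
  ovn-nbctl lsp-set-options vp0 virtual-ip=10.0.0.10 virtual-parents=vm1,vm2
  # Floating IP (dnat_and_snat) bound to the virtual port; the external MAC is
  # the source MAC both hypervisors answer ARP with during the race window:
  ovn-nbctl lr-nat-add lr0 dnat_and_snat 172.16.0.100 10.0.0.10 vp0 00:00:00:00:01:00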

Version-Release number of selected component (if applicable):

How reproducible:
I haven't reproduced this locally yet, but the logs on the problematic cluster seem to indicate the following sequence of events:

T0: virtual port VP is claimed by ovn-controller on HV0.
T1: flows are added on HV0 to reply to ARP requests for VP's floating IP.
T2: the VIP behind the virtual port fails over.
T3: VP is claimed by ovn-controller on HV1 (a transaction is initiated to the SB to update the 'chassis' column of VP's Port_Binding record) and flows are added on HV1 to reply to ARP requests for VP's floating IP.
T4: the Southbound receives the transaction from HV1 but is relatively busy ("unreasonably long poll interval ~3s"), delaying the update to HV0 by a few seconds.
T5 (>T4+3s): HV0 receives the SB update and removes the flows that reply to ARP requests for VP's floating IP.

During the T3-T5 interval both hypervisors will reply to ARP requests for the virtual port's floating IP.
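
An untested sketch of how the overlap could be observed, reusing the hypothetical names from the configuration sketch above (the namespace and interface names are also assumptions):

  # Trigger the failover: send a GARP for the VIP from the standby VM behind HV1:
  ip netns exec vm2 arping -U -I eth0 -c 1 10.0.0.10
  # During T3-T5, run this on both HV0 and HV1; both still show ARP responder
  # flows for the floating IP:
  ovs-ofctl dump-flows br-int | grep 172.16.0.100
  # Meanwhile the SB already points at the new owner only:
  ovn-sbctl --columns=logical_port,chassis,virtual_parent find Port_Binding logical_port=vp0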

We need to investigate whether there is a way to implement a synchronization mechanism between the old owner of the virtual port and the new one: installation of OpenFlow flows on the new owner should be delayed until the old owner has cleared the flows corresponding to the virtual port.

Comment 2 Mark Michelson 2022-10-05 19:53:33 UTC
Hi Dumitru, I'm going through old issues and trying to determine their relevance. In this case, there are a few things I can think of that may have lessened the window for this race condition:

1) All OVN components have seen massive performance boosts since this issue was opened. Long poll intervals on the southbound database are now rare except in very large clusters.
2) Ihar's additions to allow for multiple requested chassis may have some relevance here, although I don't think that was targeting virtual ports.
3) ARP responder flows have gone through a lot of changes.

My assumption is that this race condition is still present and relevant, though much less likely to be seen, and for much shorter windows, than it was back in June 2021. So my questions are:

* Given the changes to OVN in the past year, do you suspect this race condition is still present?
* If so, then do you think the priority on this issue could be decreased due to the smaller window for the race condition to occur?

Comment 3 Dumitru Ceara 2022-10-06 10:10:11 UTC
(In reply to Mark Michelson from comment #2)
> Hi Dumitru, I'm going through old issues and trying to determine their

Hi Mark,

> relevance. In this case, there are a few things I can think of that may have
> lessened the window for this race condition:
> 
> 1) All OVN components have seen massive performance boosts since this issue
> was opened. Long poll intervals on the southbound database are now rare
> except in very large clusters.

That's likely what triggered the issue in the first place: the SB being busy.

> 2) Ihar's additions to allow for multiple requested chassis may have some
> relevance here, although I don't think that was targeting virtual ports.
> 3) ARP responder flows have gone through a lot of changes.
> 
> My assumption is that this race condition is still present and relevant,
> though much less likely to be seen, and for much shorter windows, than it
> was back in June 2021. So my questions are:
> 
> * Given the changes to OVN in the past year, do you suspect this race
> condition is still present?

I think that on the environment where this was seen a lot of changes
were made to reduce load on the SB, so we probably haven't hit the
issue again since.  But I think the race condition is still there.

> * If so, then do you think the priority on this issue could be decreased due
> to the smaller window for the race condition to occur?

Sounds good to me. @jlibosva, what do you think?

Comment 4 Jakub Libosvar 2023-05-02 14:13:54 UTC
(In reply to Dumitru Ceara from comment #3)
> > * If so, then do you think the priority on this issue could be decreased due
> > to the smaller window for the race condition to occur?
> 
> Sounds good to me. @jlibosva, what do you think?

I agree; we haven't heard from the team about this issue happening since OVN performance was improved and the environment was tweaked.

Comment 5 Ihar Hrachyshka 2023-05-03 11:04:43 UTC
(I will comment on a very specific point, perhaps irrelevant)

> 2) Ihar's additions to allow for multiple requested chassis may have some relevance here, although I don't think that was targeting virtual ports.

I don't think that the multi-chassis port feature has any relevance here: it doesn't introduce any synchronization mechanism (except the rarp activation strategy, which has to be actively opted into anyway), and it didn't touch any port types other than "regular" VIF ports.
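
For completeness, that opt-in looks roughly as follows on a multi-chassis port (a sketch with made-up names; it does not apply to type=virtual ports):

  # Request the port on two chassis; flows on the migration target stay blocked
  # until a RARP from the port is seen (explicit opt-in):
  ovn-nbctl lsp-set-options lsp0 requested-chassis=hv0,hv1 activation-strategy=rarp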