Bug 1966969
| Summary: | [OVN] Race condition when updating virtual-port related openflows upon failover. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Dumitru Ceara <dceara> |
| Component: | OVN | Assignee: | OVN Team <ovnteam> |
| Status: | NEW --- | QA Contact: | ying xu <yinxu> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | FDP 20.H | CC: | ctrautma, ihrachys, jiji, jlibosva, kforde, mmichels |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Hi Dumitru,

I'm going through old issues and trying to determine their relevance. In this case, there are a few things I can think of that may have lessened the window for this race condition:

1) All OVN components have seen massive performance boosts since this issue was opened. It is more rare now to see the long poll intervals on the southbound database except in very large clusters.
2) Ihar's additions to allow for multiple requested chassis may have some relevance here, although I don't think that was targeting virtual ports.
3) ARP responder flows have gone through a lot of changes.

My assumption is that this race condition is still present and relevant, though much less likely to be seen for as long as it was back in June 2021. So my questions are:

* Given the changes to OVN in the past year, do you suspect this race condition is still present?
* If so, then do you think the priority on this issue could be decreased due to the smaller window for the race condition to occur?

(In reply to Mark Michelson from comment #2)
> Hi Dumitru, I'm going through old issues and trying to determine their

Hi Mark,

> relevance. In this case, there are a few things I can think of that may have
> lessened the window for this race condition:
>
> 1) All OVN components have seen massive performance boosts since this issue
> was opened. It is more rare now to see the long poll intervals on the
> southbound database except in very large clusters.

That's likely what triggers the issue in the first place, SB busy.

> 2) Ihar's additions to allow for multiple requested chassis may have some
> relevance here, although I don't think that was targeting virtual ports.
> 3) ARP responder flows have gone through a lot of changes.
>
> My assumption is that this race condition is still present and relevant,
> though much less likely to be seen for as long as it was back in June 2021.
> So my questions are:
>
> * Given the changes to OVN in the past year, do you suspect this race
> condition is still present?

I think on the environment where this was seen there were lots of changes made to reduce load on the SB so we probably didn't hit the issue again since. But I think the race condition is still there.

> * If so, then do you think the priority on this issue could be decreased due
> to the smaller window for the race condition to occur?

Sounds good to me, @jlibosva what do you think?

(In reply to Dumitru Ceara from comment #3)
> (In reply to Mark Michelson from comment #2)
> > Hi Dumitru, I'm going through old issues and trying to determine their
>
> Hi Mark,
>
> > relevance. In this case, there are a few things I can think of that may have
> > lessened the window for this race condition:
> >
> > 1) All OVN components have seen massive performance boosts since this issue
> > was opened. It is more rare now to see the long poll intervals on the
> > southbound database except in very large clusters.
>
> That's likely what triggers the issue in the first place, SB busy.
>
> > 2) Ihar's additions to allow for multiple requested chassis may have some
> > relevance here, although I don't think that was targeting virtual ports.
> > 3) ARP responder flows have gone through a lot of changes.
> >
> > My assumption is that this race condition is still present and relevant,
> > though much less likely to be seen for as long as it was back in June 2021.
> > So my questions are:
> >
> > * Given the changes to OVN in the past year, do you suspect this race
> > condition is still present?
>
> I think on the environment where this was seen there were lots of
> changes made to reduce load on the SB so we probably didn't hit the
> issue again since. But I think the race condition is still there.
>
> > * If so, then do you think the priority on this issue could be decreased due
> > to the smaller window for the race condition to occur?
>
> Sounds good to me, @jlibosva what do you think?

I agree, we haven't heard from the team about this issue happening after OVN perf was improved and environment was tweaked.

(I will comment on a very specific point, perhaps irrelevant)
> 2) Ihar's additions to allow for multiple requested chassis may have some relevance here, although I don't think that was targeting virtual ports.
I don't think the multi-chassis port feature has any relevance here: it doesn't introduce any synchronization mechanism (except the rarp activation strategy, which has to be actively opted into anyway), and it didn't touch any port types other than "regular" VIF ports.
Description of problem:

When a virtual port changes ownership because a GARP for the virtual port IP is seen on a different hypervisor, there is a window of time in which both the ovn-controller on the old "owner" of the virtual port and the ovn-controller on the new "owner" have flows installed for that virtual port. Depending on other configuration, this may lead to both hypervisors handling traffic destined to the virtual port.

One example is when a Floating IP (dnat_and_snat) entry is defined on the virtual port. In such cases OVN installs flows to reply to ARP requests on the hypervisor owning the virtual port. During the time window mentioned above, both the old hypervisor and the new one will have such flows installed and will both reply (with the same source MAC) to ARP requests. This causes the FDB entry on the ToR switch to move around and, depending on configuration, the MAC may even get blocked on the wrong ToR switch port.

Version-Release number of selected component (if applicable):

How reproducible:

I didn't reproduce this locally yet, but the logs on the problematic cluster seem to indicate the following sequence of events:

T0: virtual port VP is claimed by ovn-controller on HV0.
T1: flows are added to reply to ARP requests for VP's floating IP on HV0.
T2: the VIP behind the virtual port fails over.
T3: VP is claimed by ovn-controller on HV1 (a transaction is initiated to the SB to update the 'chassis' column of the VP Port_Binding record) and flows are added to reply to ARP requests for VP's floating IP on HV1.
T4: the Southbound receives the transaction from HV1 but is relatively busy ("unreasonably long poll interval ~3s"), delaying sending the update to HV0 for a few seconds.
T5 (>T4+3s): HV0 receives the SB update and removes the flows that reply to ARP requests for VP's floating IP.

During the T3-T5 interval both hypervisors will reply to ARP requests for the virtual port's floating IP.

We need to investigate whether there is a way to implement a synchronization mechanism between the old owner of the virtual port and the new one: installation of openflows on the new owner should be delayed until the old owner has cleared the openflows corresponding to the virtual port.
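Purely for illustration (plain Python, not OVN code), the toy model below restates the T0-T5 timeline to make the exposure explicit: HV1's ARP-responder flows go in at T3, HV0's only come out at T5, and everything in between is the window where both chassis answer ARP with the same source MAC. All names and timings are assumptions made up for the example.

```python
# Toy model of the T0-T5 timeline described above (illustrative only, not
# ovn-controller code).  The function names and timings are assumptions.

def dual_arp_responder_window(t_claim_on_hv1: float,
                              sb_propagation_delay: float):
    """Return (start, end) of the interval where both HV0 and HV1 reply to ARP.

    HV1 installs its ARP-responder flows as soon as it claims the virtual
    port (T3), while HV0 only removes its flows once the Port_Binding update
    reaches it (T5 = T3 + SB propagation delay).
    """
    t_hv1_installs = t_claim_on_hv1                        # T3
    t_hv0_removes = t_claim_on_hv1 + sb_propagation_delay  # T5
    return t_hv1_installs, t_hv0_removes


if __name__ == "__main__":
    # With the ~3s poll interval reported in the cluster logs, both chassis
    # reply to ARP for roughly 3 seconds after the failover.
    start, end = dual_arp_responder_window(t_claim_on_hv1=0.0,
                                           sb_propagation_delay=3.0)
    print(f"both chassis answer ARP during [{start:.1f}s, {end:.1f}s), "
          f"i.e. for {end - start:.1f}s")
```

Any synchronization mechanism along the lines suggested above would, in effect, shrink this interval by delaying the T3 flow installation on the new owner until the old owner has completed its T5 flow removal.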