
Bug 2066413

Summary: Port binding chassis change messes up multicast group tunnel endpoints
Product: Red Hat Enterprise Linux Fast Datapath
Component: ovn-2021
Version: FDP 21.H
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
Keywords: Regression
Reporter: Jakub Libosvar <jlibosva>
Assignee: OVN Team <ovnteam>
QA Contact: Jianlin Shi <jishi>
CC: bdobreli, ctrautma, dceara, enothen, jiji, ykarel
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: ---
Last Closed: 2022-03-22 16:01:35 UTC
Type: Bug

Description Jakub Libosvar 2022-03-21 17:08:58 UTC
Description of problem:
If a port binding moves to a different chassis, ovn-controller updates the OpenFlow rules to use a different tunnel OVS port based on the destination. For example:
2022-03-21T15:09:12.566Z|1910624|ofctrl|DBG|ofctrl_remove_flow flow: cookie=2170a5f0, table_id=37, priority=100, reg15=0x2,metadata=0x6, actions=set_field:0x6/0xffffff->tun_id,set_field:0x2/0xffffffff->tun_metadata0,move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30],output:10
2022-03-21T15:09:12.566Z|1910625|ofctrl|DBG|ofctrl_add_flow flow: cookie=2170a5f0, table_id=37, priority=100, reg15=0x2,metadata=0x6, actions=set_field:0x6/0xffffff->tun_id,set_field:0x2/0xffffffff->tun_metadata0,move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30],output:6

In the example above it changes output:10 to output:6, which is correct since the port moved. However, when recalculating the flows for a multicast group the port is part of, ovn-controller does not take into account the other ports in the multicast group and updates the output based only on the destination of the updated port binding. Here is an example of a multicast group with tunnel key 0x8000:

2022-03-21T15:09:12.566Z|1910633|ofctrl|DBG|ofctrl_remove_flow flow: cookie=f2a7aec6, table_id=37, priority=100, reg15=0x8000,metadata=0x6, actions=set_field:0x1->reg15,resubmit(,39),set_field:0x3->reg15,resubmit(,39),set_field:0x8000->reg15,set_field:0x6/0xffffff->tun_id,set_field:0x8000/0xffffffff->tun_metadata0,move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30],output:10,resubmit(,38)
2022-03-21T15:09:12.566Z|1910635|ofctrl|DBG|ofctrl_add_flow flow: cookie=f2a7aec6, table_id=37, priority=100, reg15=0x8000,metadata=0x6, actions=set_field:0x1->reg15,resubmit(,39),set_field:0x3->reg15,resubmit(,39),set_field:0x8000->reg15,set_field:0x6/0xffffff->tun_id,set_field:0x8000/0xffffffff->tun_metadata0,move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30],output:6,resubmit(,38)

The example is taken from OCP on OSP, when a VIP moved from one master node to another hosted on a different OSP compute node. The environment has 3 compute nodes, each hosting one OCP master node. The ports of the OCP master node VMs are bound across all three chassis and are attached to the same logical switch as the virtual ports. This means the correct flow should have output:10,output:6 and not just output:6.
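
For reference, a few commands that can be used to cross-check this on an affected chassis. This is a sketch, assuming the standard integration bridge br-int and geneve tunnels; the ofport numbers 6 and 10 are the ones from the flow dumps above:

  # Map the ofport numbers from the flow dumps (10 and 6) to their
  # tunnel interfaces and remote chassis IPs.
  ovs-ofctl show br-int | grep -E '^[[:space:]]*(6|10)\('
  ovs-vsctl --columns=name,ofport,options find Interface type=geneve

  # List the multicast group (tunnel_key 0x8000) and the chassis each of
  # its member ports is bound to in the southbound DB.
  ovn-sbctl --columns=name,tunnel_key,ports list Multicast_Group
  ovn-sbctl --columns=logical_port,chassis list Port_Binding

  # Dump the flood flow for the group and count its tunnel outputs; with
  # the bug, only a single output:<ofport> action is present.
  ovs-ofctl dump-flows br-int table=37 | grep reg15=0x8000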


Version-Release number of selected component (if applicable):
ovn-2021-21.09.1-23.el8fdp.x86_64


How reproducible:
Always

Steps to Reproduce:
1. Install OCP on OSP
2. Bind a port on a chassis (either by failing over a VIP or just by creating a new VM)
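
A minimal sketch of step 2 with plain OVN, assuming an existing logical switch ls0 that already has ports bound on the other chassis; the names lp-new and p-new are hypothetical:

  # On the node running ovn-nbctl: add a new logical switch port.
  ovn-nbctl lsp-add ls0 lp-new
  ovn-nbctl lsp-set-addresses lp-new "00:00:00:00:00:10 10.0.0.10"

  # On the target chassis: bind the port by creating an OVS interface
  # whose iface-id matches the logical port name.
  ovs-vsctl add-port br-int p-new -- \
      set Interface p-new type=internal external_ids:iface-id=lp-new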


Actual results:
Because keepalived in OCP uses a multicast address, once the port binding is bound to a chassis, the multicast group gets tunneled only to that one particular chassis. That means VRRP advertisements from the master are delivered to only a single node, which causes a VIP failover because the nodes that did not receive the advertisement start a new election. The failover moves the VIP port binding to another chassis, which re-triggers the issue, since the port binding change is itself the trigger. The whole OCP cluster falls apart.
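
One way to observe the symptom is to watch for VRRP advertisements on each master node; keepalived uses the VRRP multicast group 224.0.0.18 (IP protocol 112). A sketch, where the interface name eth0 is an assumption:

  # With the bug, only one of the other masters keeps receiving the
  # advertisements; the node that stops seeing them starts an election.
  tcpdump -ni eth0 'host 224.0.0.18 and ip proto 112'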

Expected results:
Tunnel endpoints of the multicast group should account for the chassis of all ports that are part of the group.


Additional info:
This is a regression from ovn-2021-21.06 and I suspect this is the patch that introduced the regression

Comment 1 Jakub Libosvar 2022-03-21 18:37:11 UTC
(In reply to Jakub Libosvar from comment #0)
> Additional info:
> This is a regression from ovn-2021-21.06 and I suspect this is the patch
> that introduced the regression

https://github.com/ovn-org/ovn/commit/3d2bea7ab4b74ba61575e639008bab7229c07172

Comment 2 Dumitru Ceara 2022-03-22 11:32:20 UTC
(In reply to Jakub Libosvar from comment #0)
> 
> Version-Release number of selected component (if applicable):
> ovn-2021-21.09.1-23.el8fdp.x86_64
> 
> 
> How reproducible:
> Always
> 

I don't have an OCP on OSP installation at hand, but I tried to set up something similar with plain OVN and I'm not seeing the issue (neither on upstream main code nor on the version against which the BZ was reported).  I'm probably doing something different from what's happening in the OSP scenario.

> Steps to Reproduce:
> 1. Install OCP on OSP
> 2. Bind port on a chassis (either by failing over a VIP or just by creating
> a new VM)
> 

Do you mean trigger a GARP to move the virtual port to a new chassis?

Also, "or just by creating a new VM", do you mean any random VM attached to the same logical switch?

Comment 5 Dumitru Ceara 2022-03-22 16:01:35 UTC
After loading the NB/SB DBs in a local sandbox and investigating the resulting OpenFlow flows, a git bisect pointed to this fix:

https://github.com/ovn-org/ovn/commit/e101e45f355a91e277630243e64897f91f13f8bc

This patch is the fix for bug 2036970 and is available downstream starting with ovn-2021-21.12.0-11.el8fdp.
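
For reference, a sketch of such a bisect against an upstream OVN checkout. Since the goal is to find the commit that makes the flood flow correct again, the usual terms are swapped: the older, still-broken tag is marked "good" and the newer, working code is marked "bad", so bisect reports the first "bad" commit, i.e. the fix. The tag names and the reproduce.sh helper are assumptions:

  git bisect start
  git bisect good v21.09.0
  git bisect bad main
  # reproduce.sh (hypothetical) rebuilds ovn-controller, loads the saved
  # NB/SB DBs into a local sandbox and exits non-zero ("bad") when the
  # table 37 flood flow carries all expected tunnel outputs.
  git bisect run ./reproduce.sh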

*** This bug has been marked as a duplicate of bug 2036970 ***

Comment 6 Daniel Alvarez Sanchez 2022-03-29 12:52:51 UTC
*** Bug 2069668 has been marked as a duplicate of this bug. ***