
Bug 2066413

Summary: Port binding chassis change messes up multicast group tunnel endpoints
Product: Red Hat Enterprise Linux Fast Datapath
Component: ovn-2021
Version: FDP 21.H
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
Keywords: Regression
Reporter: Jakub Libosvar <jlibosva>
Assignee: OVN Team <ovnteam>
QA Contact: Jianlin Shi <jishi>
CC: bdobreli, ctrautma, dceara, enothen, jiji, ykarel
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: ---
Last Closed: 2022-03-22 16:01:35 UTC
Type: Bug

Description Jakub Libosvar 2022-03-21 17:08:58 UTC
Description of problem:
If a port binding moves to a different chassis, ovn-controller updates the OpenFlow rules to use a different tunnel OVS port based on the destination. For example:
2022-03-21T15:09:12.566Z|1910624|ofctrl|DBG|ofctrl_remove_flow flow: cookie=2170a5f0, table_id=37, priority=100, reg15=0x2,metadata=0x6, actions=set_field:0x6/0xffffff->tun_id,set_field:0x2/0xffffffff->tun_metadata0,move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30],output:10
2022-03-21T15:09:12.566Z|1910625|ofctrl|DBG|ofctrl_add_flow flow: cookie=2170a5f0, table_id=37, priority=100, reg15=0x2,metadata=0x6, actions=set_field:0x6/0xffffff->tun_id,set_field:0x2/0xffffffff->tun_metadata0,move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30],output:6

In the example above it changes output:10 to output:6, which is correct since the port moved. However, when recalculating the flows for a multicast group the port is part of, ovn-controller does not take into account the other ports in the multicast group and updates the output based only on the destination of the updated port binding. Here is an example of a multicast group with tunnel key 0x8000:

2022-03-21T15:09:12.566Z|1910633|ofctrl|DBG|ofctrl_remove_flow flow: cookie=f2a7aec6, table_id=37, priority=100, reg15=0x8000,metadata=0x6, actions=set_field:0x1->reg15,resubmit(,39),set_field:0x3->reg15,resubmit(,39),set_field:0x8000->reg15,set_field:0x6/0xffffff->tun_id,set_field:0x8000/0xffffffff->tun_metadata0,move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30],output:10,resubmit(,38)
2022-03-21T15:09:12.566Z|1910635|ofctrl|DBG|ofctrl_add_flow flow: cookie=f2a7aec6, table_id=37, priority=100, reg15=0x8000,metadata=0x6, actions=set_field:0x1->reg15,resubmit(,39),set_field:0x3->reg15,resubmit(,39),set_field:0x8000->reg15,set_field:0x6/0xffffff->tun_id,set_field:0x8000/0xffffffff->tun_metadata0,move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30],output:6,resubmit(,38)

The example is taken from OCP on OSP, when a VIP moved from one master node to another hosted on a different OSP compute node. The environment has 3 compute nodes, each hosting one OCP master node. The ports of the OCP master node VMs are bound across all three chassis and are attached to the same logical switch as the virtual ports. This means the correct flow should have output:10,output:6 and not just output:6.
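
For reference, a few commands that can be used to cross-check this on an affected chassis. This is a sketch, assuming the standard integration bridge br-int and geneve tunnels; the ofport numbers 6 and 10 are the ones from the flow dumps above:

  # Map the ofport numbers from the flow dumps (10 and 6) to their
  # tunnel interfaces and remote chassis IPs.
  ovs-ofctl show br-int | grep -E '^[[:space:]]*(6|10)\('
  ovs-vsctl --columns=name,ofport,options find Interface type=geneve

  # List the multicast group (tunnel_key 0x8000) and the chassis each of
  # its member ports is bound to in the southbound DB.
  ovn-sbctl --columns=name,tunnel_key,ports list Multicast_Group
  ovn-sbctl --columns=logical_port,chassis list Port_Binding

  # Dump the flood flow for the group and count its tunnel outputs; with
  # the bug, only a single output:<ofport> action is present.
  ovs-ofctl dump-flows br-int table=37 | grep reg15=0x8000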


Version-Release number of selected component (if applicable):
ovn-2021-21.09.1-23.el8fdp.x86_64


How reproducible:
Always

Steps to Reproduce:
1. Install OCP on OSP
2. Bind a port on a chassis (either by failing over a VIP or just by creating a new VM)
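
A minimal sketch of step 2 with plain OVN, assuming an existing logical switch ls0 that already has ports bound on the other chassis; the names lp-new and p-new are hypothetical:

  # On the node running ovn-nbctl: add a new logical switch port.
  ovn-nbctl lsp-add ls0 lp-new
  ovn-nbctl lsp-set-addresses lp-new "00:00:00:00:00:10 10.0.0.10"

  # On the target chassis: bind the port by creating an OVS interface
  # whose iface-id matches the logical port name.
  ovs-vsctl add-port br-int p-new -- \
      set Interface p-new type=internal external_ids:iface-id=lp-new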


Actual results:
Because keepalived in OCP uses a multicast address, once the port binding is bound to a chassis, the multicast group gets tunneled only to that one particular chassis. That means VRRP advertisements from the master are delivered to only a single node, which causes a VIP failover because the nodes that did not receive the advertisement start a new election. The failover moves the VIP port binding to another chassis, which re-triggers the issue, since the port binding change is itself the trigger. The whole OCP cluster falls apart.
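
One way to observe the symptom is to watch for VRRP advertisements on each master node; keepalived uses the VRRP multicast group 224.0.0.18 (IP protocol 112). A sketch, where the interface name eth0 is an assumption:

  # With the bug, only one of the other masters keeps receiving the
  # advertisements; the node that stops seeing them starts an election.
  tcpdump -ni eth0 'host 224.0.0.18 and ip proto 112'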

Expected results:
Tunnel endpoints of the multicast group should account for the chassis of all ports that are part of the group.


Additional info:
This is a regression from ovn-2021-21.06 and I suspect this is the patch that introduced the regression

Comment 1 Jakub Libosvar 2022-03-21 18:37:11 UTC
(In reply to Jakub Libosvar from comment #0)
> Additional info:
> This is a regression from ovn-2021-21.06 and I suspect this is the patch
> that introduced the regression

https://github.com/ovn-org/ovn/commit/3d2bea7ab4b74ba61575e639008bab7229c07172

Comment 2 Dumitru Ceara 2022-03-22 11:32:20 UTC
(In reply to Jakub Libosvar from comment #0)
> 
> Version-Release number of selected component (if applicable):
> ovn-2021-21.09.1-23.el8fdp.x86_64
> 
> 
> How reproducible:
> Always
> 

I don't have an OCP on OSP installation at hand, but I tried to set up something similar with plain OVN and I'm not seeing the issue (neither on upstream main code nor on the version against which the BZ was reported).  I'm probably doing something different from what's happening in the OSP scenario.

> Steps to Reproduce:
> 1. Install OCP on OSP
> 2. Bind port on a chassis (either by failing over a VIP or just by creating
> a new VM)
> 

Do you mean trigger a GARP to move the virtual port to a new chassis?

Also, "or just by creating a new VM", do you mean any random VM attached to the same logical switch?

Comment 5 Dumitru Ceara 2022-03-22 16:01:35 UTC
After loading the NB/SB DBs in a local sandbox and investigating the resulting OpenFlow flows, a git bisect pointed to this fix:

https://github.com/ovn-org/ovn/commit/e101e45f355a91e277630243e64897f91f13f8bc

This patch is the fix for bug 2036970 and is available downstream starting with ovn-2021-21.12.0-11.el8fdp.
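
For reference, a sketch of such a bisect against an upstream OVN checkout. Since the goal is to find the commit that makes the flood flow correct again, the usual terms are swapped: the older, still-broken tag is marked "good" and the newer, working code is marked "bad", so bisect reports the first "bad" commit, i.e. the fix. The tag names and the reproduce.sh helper are assumptions:

  git bisect start
  git bisect good v21.09.0
  git bisect bad main
  # reproduce.sh (hypothetical) rebuilds ovn-controller, loads the saved
  # NB/SB DBs into a local sandbox and exits non-zero ("bad") when the
  # table 37 flood flow carries all expected tunnel outputs.
  git bisect run ./reproduce.sh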

*** This bug has been marked as a duplicate of bug 2036970 ***

Comment 6 Daniel Alvarez Sanchez 2022-03-29 12:52:51 UTC
*** Bug 2069668 has been marked as a duplicate of this bug. ***