Description of problem:

If a port binding moves to a different chassis, ovn-controller updates the OpenFlow rules to use a different OVS tunnel port based on the new destination:

2022-03-21T15:09:12.566Z|1910624|ofctrl|DBG|ofctrl_remove_flow flow: cookie=2170a5f0, table_id=37, priority=100, reg15=0x2,metadata=0x6, actions=set_field:0x6/0xffffff->tun_id,set_field:0x2/0xffffffff->tun_metadata0,move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30],output:10
2022-03-21T15:09:12.566Z|1910625|ofctrl|DBG|ofctrl_add_flow flow: cookie=2170a5f0, table_id=37, priority=100, reg15=0x2,metadata=0x6, actions=set_field:0x6/0xffffff->tun_id,set_field:0x2/0xffffffff->tun_metadata0,move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30],output:6

In the example above it changes output:10 to output:6, which is correct since the port moved. However, when recalculating the flows for a multicast group the port is part of, ovn-controller does not take the other ports of the multicast group into account and updates the flow based only on the destination of the updated port binding. Here is an example for the multicast group with tunnel key 0x8000:

2022-03-21T15:09:12.566Z|1910633|ofctrl|DBG|ofctrl_remove_flow flow: cookie=f2a7aec6, table_id=37, priority=100, reg15=0x8000,metadata=0x6, actions=set_field:0x1->reg15,resubmit(,39),set_field:0x3->reg15,resubmit(,39),set_field:0x8000->reg15,set_field:0x6/0xffffff->tun_id,set_field:0x8000/0xffffffff->tun_metadata0,move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30],output:10,resubmit(,38)
2022-03-21T15:09:12.566Z|1910635|ofctrl|DBG|ofctrl_add_flow flow: cookie=f2a7aec6, table_id=37, priority=100, reg15=0x8000,metadata=0x6, actions=set_field:0x1->reg15,resubmit(,39),set_field:0x3->reg15,resubmit(,39),set_field:0x8000->reg15,set_field:0x6/0xffffff->tun_id,set_field:0x8000/0xffffffff->tun_metadata0,move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30],output:6,resubmit(,38)

The example is taken from OCP on OSP when a VIP moved from one master node to another one hosted on a different OSP compute node. The environment has 3 compute nodes, each hosting one OCP master node. The ports of the OCP master node VMs are bound to all three chassis and are attached to the same logical switch as the virtual ports. This means the correct flow should have output:10,output:6 and not just output:6.

Version-Release number of selected component (if applicable):
ovn-2021-21.09.1-23.el8fdp.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install OCP on OSP
2. Bind a port on a chassis (either by failing over a VIP or just by creating a new VM)

Actual results:
Because keepalived in OCP uses a multicast address, once the port binding is bound to a chassis, the multicast group gets tunneled only to that one particular chassis. VRRP advertisements from the master are therefore delivered to only a single node, which causes a VIP failover because the nodes that did not receive the advertisement start a new election. The failover moves the VIP port binding to another chassis, and that port-binding change triggers the issue again. The whole OCP cluster falls apart.

Expected results:
Tunnel endpoints of the multicast group should account for the chassis of all ports that are part of the group.

Additional info:
This is a regression from ovn-2021-21.06 and I suspect this is the patch that introduced the regression:
https://github.com/ovn-org/ovn/commit/3d2bea7ab4b74ba61575e639008bab7229c07172
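For illustration, below is a minimal standalone C sketch of the expected behaviour: when (re)computing the actions of a multicast-group flow, walk every port of the group and collect the tunnel ofport of each remote chassis, instead of using only the chassis of the port binding that triggered the update. All structures and names here (port_binding, multicast_group, collect_mc_tunnels) are hypothetical simplifications and are not the actual ovn-controller code.

    /* Sketch only: hypothetical, simplified data model, not ovn-controller code. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_PORTS   16
    #define MAX_TUNNELS 16

    struct port_binding {
        const char *name;
        int chassis_tunnel_ofport;  /* 0 if the port is bound to the local chassis */
    };

    struct multicast_group {
        int tunnel_key;
        struct port_binding ports[MAX_PORTS];
        int n_ports;
    };

    /* Collect the deduplicated set of tunnel ofports needed to reach every
     * remote chassis that hosts at least one port of the group. */
    static int
    collect_mc_tunnels(const struct multicast_group *mc, int tunnels[MAX_TUNNELS])
    {
        int n = 0;
        for (int i = 0; i < mc->n_ports; i++) {
            int ofport = mc->ports[i].chassis_tunnel_ofport;
            if (!ofport) {
                continue;                 /* local port: no tunnel output needed */
            }
            bool seen = false;
            for (int j = 0; j < n; j++) {
                if (tunnels[j] == ofport) {
                    seen = true;
                    break;
                }
            }
            if (!seen && n < MAX_TUNNELS) {
                tunnels[n++] = ofport;
            }
        }
        return n;
    }

    int
    main(void)
    {
        /* Mirrors the report: ports of the group sit on remote chassis
         * reachable through tunnel ofports 10 and 6. */
        struct multicast_group mc = {
            .tunnel_key = 0x8000,
            .ports = {
                { "master-0", 10 },
                { "master-1", 6 },
                { "vip-virtual-port", 6 },   /* the port binding that just moved */
            },
            .n_ports = 3,
        };

        int tunnels[MAX_TUNNELS];
        int n = collect_mc_tunnels(&mc, tunnels);

        printf("actions for reg15=0x%x:", mc.tunnel_key);
        for (int i = 0; i < n; i++) {
            printf(" output:%d", tunnels[i]);
        }
        printf("\n");
        return 0;
    }

With the numbers from the report, this prints "output:10 output:6", matching the expected flow actions, whereas updating based only on the moved port binding yields just output:6.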
(In reply to Jakub Libosvar from comment #0)
> Additional info:
> This is a regression from ovn-2021-21.06 and I suspect this is the patch
> that introduced the regression

https://github.com/ovn-org/ovn/commit/3d2bea7ab4b74ba61575e639008bab7229c07172
(In reply to Jakub Libosvar from comment #0)
> Version-Release number of selected component (if applicable):
> ovn-2021-21.09.1-23.el8fdp.x86_64
>
> How reproducible:
> Always

I don't have an OCP on OSP installation at hand, but I tried to set up something similar with plain OVN and I'm not seeing the issue (neither on upstream main nor on the version this BZ was reported against). I'm probably doing something different from what happens in the OSP scenario.

> Steps to Reproduce:
> 1. Install OCP on OSP
> 2. Bind port on a chassis (either by failing over a VIP or just by creating
> a new VM)

Do you mean triggering a GARP to move the virtual port to a new chassis? Also, regarding "or just by creating a new VM", do you mean any random VM attached to the same logical switch?
After loading the NB/SB DBs in a local sandbox and investigating the resulting OpenFlow rules, a git bisect pointed to this fix:

https://github.com/ovn-org/ovn/commit/e101e45f355a91e277630243e64897f91f13f8bc

This patch is the fix for bug 2036970 and is available downstream starting with ovn-2021-21.12.0-11.el8fdp.

*** This bug has been marked as a duplicate of bug 2036970 ***
*** Bug 2069668 has been marked as a duplicate of this bug. ***