Bug 1927369

Summary: Removing ha_chassis_group and external port from LSP leaves the OVN cluster stuck
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Jakub Libosvar <jlibosva>
Component: ovn2.13
Assignee: Numan Siddique <nusiddiq>
Status: CLOSED ERRATA
QA Contact: Jianlin Shi <jishi>
Severity: high
Docs Contact:
Priority: unspecified
Version: FDP 20.I
CC: amuller, apevec, chrisw, ctrautma, ekuris, irichart, jishi, jlibosva, ldenny, lhh, majopela, mmichels, nusiddiq, ralongi, rcernin, scohen, twilson, yrachman
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1927348 Environment:
Last Closed: 2021-04-12 18:09:26 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1927348    
Attachments:
Description    Flags
SB DB          none
NB DB          none

Description Jakub Libosvar 2021-02-10 15:50:10 UTC
Created attachment 1756245 [details]
SB DB

When an external port has an ha_chassis_group, it is bound to some chassis. Removing the ha_chassis_group and the "external" type from the LSP causes northd to get stuck while holding a lock on the SB DB, so no ovn-controller can write to it.

Example:
 $ sudo ovn-nbctl set logical_switch_port b221ff12-3b3d-4935-a2cf-1d63485407d6 "ha_chassis_group=[]" -- set logical_switch_port b221ff12-3b3d-4935-a2cf-1d63485407d6 'options:requested-chassis=""' -- set logical_switch_port b221ff12-3b3d-4935-a2cf-1d63485407d6 'type=""'

northd logs:
2021-02-10T14:44:26.520Z|00065|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 112b6224-9ae1-4e42-8697-ae5bb4f38a04 because of 1 remaining reference(s)","error":"referential integrity violation"}
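To see which HA_Chassis_Group rows northd keeps failing to delete, the UUIDs can be pulled out of the log. This is a minimal sketch, assuming the log lines have the format shown above; the function name stuck_ha_chassis_groups is hypothetical and not part of OVN.

```shell
#!/bin/sh
# Hypothetical helper (not an OVN tool): print the unique UUIDs of
# HA_Chassis_Group rows that northd could not delete because of a
# referential integrity violation. Reads the files given as arguments,
# or stdin when called without arguments.
stuck_ha_chassis_groups() {
    grep 'referential integrity violation' "$@" |
        sed -n 's/.*cannot delete HA_Chassis_Group row \([0-9a-f-]\{36\}\) because.*/\1/p' |
        sort -u
}
```

For example, `stuck_ha_chassis_groups /var/log/ovn/ovn-northd.log` prints one UUID per stuck group, no matter how many times the error was retried.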

I will attach NB and SB DBs so it's easier to reproduce outside of OpenStack.


This may be used in OpenStack:
  openstack port create  --network hwoffload_net_nic1_129 --vnic-type direct port_name

networking-ovn treats this as an SR-IOV port, and because such a port is not plugged through br-int, we create an external HA port on the controllers for DHCP and metadata services. Hence the corresponding logical switch port in the OVN NB DB has type: external and an associated ha_chassis_group. The port is also bound to the controller with the highest priority, which means there is a port_binding entry in the SB DB that also references the ha_chassis_group.

  openstack port set --binding-profile "capabilities=['switchdev']" port_name

This means the port will be plugged through br-int and will have a representor port. Because of that, the external port is no longer needed: DHCP and metadata can be served directly on the hosting hypervisor. The update triggers a call to the NB DB that removes type: external from the port and clears its ha_chassis_group, because neither is needed anymore. If this was the last logical switch port associated with the default_ha_chassis_group, OVN northd tries to delete the group from the SB DB because it is no longer used. However, the port_binding for the external port still references that ha_chassis_group, so northd's attempt to delete the group fails and is retried.

While northd is stuck retrying the deletion, the port binding for the external port can't be removed and northd holds a lock over the SB DB. As a result, no ovn-controller in the whole cluster can create new entries in the database, such as port_binding or mac_binding entries. The cluster is unusable until someone deletes the port_binding for the external logical switch port that no longer exists.
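The stale bindings described above can be located by comparing the two databases. The following is a rough sketch, assuming the generic `find` database command with `--bare --columns=...` is available in both ovn-sbctl and ovn-nbctl; the function name stale_external_ports and the SBCTL/NBCTL override variables are hypothetical. It only lists candidates and deletes nothing.

```shell
#!/bin/sh
# Hypothetical helper (not an OVN tool). SBCTL/NBCTL may be overridden,
# for example to point at a wrapper that adds --db=... options.
SBCTL=${SBCTL:-ovn-sbctl}
NBCTL=${NBCTL:-ovn-nbctl}

# Print logical ports that are still type=external in the SB DB
# although no NB logical switch port of that name is external anymore.
stale_external_ports() {
    sep_sb=$(mktemp) || return 1
    sep_nb=$(mktemp) || { rm -f "$sep_sb"; return 1; }
    # --bare output may contain blank separator lines; drop them.
    $SBCTL --bare --columns=logical_port find Port_Binding type=external |
        sed '/^$/d' | sort -u > "$sep_sb"
    $NBCTL --bare --columns=name find Logical_Switch_Port type=external |
        sed '/^$/d' | sort -u > "$sep_nb"
    # comm -23 keeps lines that appear only in the first sorted file.
    comm -23 "$sep_sb" "$sep_nb"
    rm -f "$sep_sb" "$sep_nb"
}
```

Each name printed is a candidate for the manual port_binding cleanup mentioned above; review the list before removing anything.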

Comment 1 Jakub Libosvar 2021-02-10 15:50:35 UTC
Created attachment 1756246 [details]
NB DB

Comment 2 Terry Wilson 2021-03-11 23:22:54 UTC
*** Bug 1937872 has been marked as a duplicate of this bug. ***

Comment 7 Jianlin Shi 2021-03-18 03:15:24 UTC
Tested with the following script:

#!/bin/bash

# Start OVS and OVN, and expose the NB and SB databases over TCP.
systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
# Register this host as chassis hv1.
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.174.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.174.25
systemctl restart ovn-controller

# Create a logical switch with one external port.
ovn-nbctl ls-add sw0
ovn-nbctl lsp-add sw0 sw0-p1
ovn-nbctl lsp-set-type sw0-p1 external

# Create an HA chassis group containing hv1 with priority 20.
ovn-nbctl ha-chassis-group-add hagrp1
ovn-nbctl ha-chassis-group-add-chassis hagrp1 hv1 20

ha_grp1_uuid=$(ovn-nbctl find ha_chassis_group name=hagrp1 | awk '/_uuid/{print $3}')

# Attach the group to the external port.
ovn-nbctl set logical_switch_port sw0-p1 ha_chassis_group=$ha_grp1_uuid
ovn-nbctl list logical_switch_port sw0-p1

# Detach and reattach the group, then detach it and clear the port type
# in a single transaction, which is the step that triggers the bug.
ovn-nbctl clear logical_switch_port sw0-p1 ha_chassis_group
ovn-nbctl set logical_switch_port sw0-p1 ha_chassis_group=$ha_grp1_uuid
ovn-nbctl clear logical_switch_port sw0-p1 ha_chassis_group -- set logical_switch_port sw0-p1 'type=""'

ovn-nbctl list logical_switch_port sw0-p1
ovn-nbctl list ha_chassis_group

# Any "transaction error" line in the northd log means the bug was hit.
grep "transaction error" /var/log/ovn/ovn-northd.log

Reproduced on 20.12.0-24:

[root@wsfd-advnetlab18 bz1927369]# rpm -qa | grep -E "openvswitch2.13|ovn2.13"
python3-openvswitch2.13-2.13.0-85.el7fdp.x86_64
ovn2.13-central-20.12.0-24.el7fdp.x86_64
openvswitch2.13-2.13.0-85.el7fdp.x86_64
ovn2.13-20.12.0-24.el7fdp.x86_64
ovn2.13-host-20.12.0-24.el7fdp.x86_64

+ grep 'transaction error' /var/log/ovn/ovn-northd.log
2021-03-18T03:14:24.688Z|00008|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 9a25a1c3-bda5-49fd-8a7c-7eaacdef9451 because of 1 remaining reference(s)","error":"referential integrity violation"}
2021-03-18T03:14:24.689Z|00009|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 9a25a1c3-bda5-49fd-8a7c-7eaacdef9451 because of 1 remaining reference(s)","error":"referential integrity violation"}
2021-03-18T03:14:24.689Z|00010|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 9a25a1c3-bda5-49fd-8a7c-7eaacdef9451 because of 1 remaining reference(s)","error":"referential integrity violation"}
2021-03-18T03:14:24.690Z|00011|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 9a25a1c3-bda5-49fd-8a7c-7eaacdef9451 because of 1 remaining reference(s)","error":"referential integrity violation"}
2021-03-18T03:14:24.690Z|00012|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 9a25a1c3-bda5-49fd-8a7c-7eaacdef9451 because of 1 remaining reference(s)","error":"referential integrity violation"}

Verified on 20.12.0-85:

[root@wsfd-advnetlab18 bz1927369]# rpm -qa | grep -E "openvswitch2.13|ovn2.13"
python3-openvswitch2.13-2.13.0-85.el7fdp.x86_64
ovn2.13-host-20.12.0-85.el7fdp.x86_64
openvswitch2.13-2.13.0-85.el7fdp.x86_64
ovn2.13-central-20.12.0-85.el7fdp.x86_64
ovn2.13-20.12.0-85.el7fdp.x86_64

+ grep 'transaction error' /var/log/ovn/ovn-northd.log

<=== no error

Comment 9 errata-xmlrpc 2021-04-12 18:09:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1164

Comment 10 ldenny 2022-03-21 04:21:32 UTC
*** Bug 2065897 has been marked as a duplicate of this bug. ***