Created attachment 1756245 [details]
SB DB

When there is an external port with an ha_chassis_group, it is bound to some chassis. Removing the ha_chassis_group from an LSP of type "external" causes northd to get stuck while holding a lock over the SB DB, so no other ovn-controller can write to it.

Example:

$ sudo ovn-nbctl set logical_switch_port b221ff12-3b3d-4935-a2cf-1d63485407d6 "ha_chassis_group=[]" \
    -- set logical_switch_port b221ff12-3b3d-4935-a2cf-1d63485407d6 'options:requested-chassis=""' \
    -- set logical_switch_port b221ff12-3b3d-4935-a2cf-1d63485407d6 'type=""'

northd logs:

2021-02-10T14:44:26.520Z|00065|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 112b6224-9ae1-4e42-8697-ae5bb4f38a04 because of 1 remaining reference(s)","error":"referential integrity violation"}

I will attach the NB and SB DBs so it is easier to reproduce outside of OpenStack.

How this can be triggered in OpenStack:

1) openstack port create --network hwoffload_net_nic1_129 --vnic-type direct port_name

   networking-ovn treats this as an SR-IOV port. Because such a port is not plugged through br-int, an external HA port is created on the controllers for DHCP and metadata services. Hence the corresponding logical switch port in the OVN NB DB has type "external" and an associated ha_chassis_group. The port is also bound to the controller with the highest priority, which means there is a Port_Binding entry in the SB DB, also with the ha_chassis_group.

2) openstack port set --binding-profile "capabilities=['switchdev']" port_name

   This means the port will be plugged through br-int via its representor port. Because of that, the external port is no longer needed: DHCP and metadata can be served directly on the hosting hypervisor. The update triggers a call to the NB DB that removes the "external" type from the port and deletes the ha_chassis_group, since neither is needed anymore.

If this was the last logical switch port associated with the default_ha_chassis_group, ovn-northd tries to delete the group from the SB DB because it is no longer used. However, the Port_Binding for the external port still exists and references the HA_Chassis_Group, so northd's attempt to delete the HA_Chassis_Group fails and is retried.

Because northd is stuck retrying the HA_Chassis_Group deletion, the Port_Binding for the external port cannot be removed, and northd keeps holding a lock over the SB DB. This leaves every ovn-controller in the cluster unable to create new entries in the database, such as Port_Binding or MAC_Binding rows. The cluster is unusable until someone deletes the Port_Binding of the no-longer-existing external logical switch port.
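A possible manual recovery while northd is stuck in the retry loop (a sketch, not part of the original report; the logical_port name below is the example port from above, and the UUID placeholder must be replaced with whatever the first command prints):

# Find the stale Port_Binding that still references the HA_Chassis_Group.
ovn-sbctl --columns=_uuid find port_binding logical_port=b221ff12-3b3d-4935-a2cf-1d63485407d6
# Delete it so northd's pending HA_Chassis_Group deletion can finally commit.
ovn-sbctl destroy port_binding <uuid-printed-above>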
Created attachment 1756246 [details]
NB DB
*** Bug 1937872 has been marked as a duplicate of this bug. ***
Tested with the following script:

#!/bin/bash
systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 \
    external_ids:ovn-remote=tcp:20.0.174.25:6642 \
    external_ids:ovn-encap-type=geneve \
    external_ids:ovn-encap-ip=20.0.174.25
systemctl restart ovn-controller

ovn-nbctl ls-add sw0
ovn-nbctl lsp-add sw0 sw0-p1
ovn-nbctl lsp-set-type sw0-p1 external
ovn-nbctl ha-chassis-group-add hagrp1
ovn-nbctl ha-chassis-group-add-chassis hagrp1 hv1 20
ha_grp1_uuid=$(ovn-nbctl find ha_chassis_group name=hagrp1 | awk '/_uuid/{print $3}')
ovn-nbctl set logical_switch_port sw0-p1 ha_chassis_group=$ha_grp1_uuid
ovn-nbctl list logical_switch_port sw0-p1
ovn-nbctl clear logical_switch_port sw0-p1 ha_chassis_group
ovn-nbctl set logical_switch_port sw0-p1 ha_chassis_group=$ha_grp1_uuid
ovn-nbctl clear logical_switch_port sw0-p1 ha_chassis_group -- set logical_switch_port sw0-p1 'type=""'
ovn-nbctl list logical_switch_port sw0-p1
ovn-nbctl list ha_chassis_group
grep "transaction error" /var/log/ovn/ovn-northd.log

Reproduced on 20.12.0-24:

[root@wsfd-advnetlab18 bz1927369]# rpm -qa | grep -E "openvswitch2.13|ovn2.13"
python3-openvswitch2.13-2.13.0-85.el7fdp.x86_64
ovn2.13-central-20.12.0-24.el7fdp.x86_64
openvswitch2.13-2.13.0-85.el7fdp.x86_64
ovn2.13-20.12.0-24.el7fdp.x86_64
ovn2.13-host-20.12.0-24.el7fdp.x86_64

+ grep 'transaction error' /var/log/ovn/ovn-northd.log
2021-03-18T03:14:24.688Z|00008|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 9a25a1c3-bda5-49fd-8a7c-7eaacdef9451 because of 1 remaining reference(s)","error":"referential integrity violation"}
2021-03-18T03:14:24.689Z|00009|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 9a25a1c3-bda5-49fd-8a7c-7eaacdef9451 because of 1 remaining reference(s)","error":"referential integrity violation"}
2021-03-18T03:14:24.689Z|00010|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 9a25a1c3-bda5-49fd-8a7c-7eaacdef9451 because of 1 remaining reference(s)","error":"referential integrity violation"}
2021-03-18T03:14:24.690Z|00011|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 9a25a1c3-bda5-49fd-8a7c-7eaacdef9451 because of 1 remaining reference(s)","error":"referential integrity violation"}
2021-03-18T03:14:24.690Z|00012|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 9a25a1c3-bda5-49fd-8a7c-7eaacdef9451 because of 1 remaining reference(s)","error":"referential integrity violation"}

Verified on 20.12.0-85:

[root@wsfd-advnetlab18 bz1927369]# rpm -qa | grep -E "openvswitch2.13|ovn2.13"
python3-openvswitch2.13-2.13.0-85.el7fdp.x86_64
ovn2.13-host-20.12.0-85.el7fdp.x86_64
openvswitch2.13-2.13.0-85.el7fdp.x86_64
ovn2.13-central-20.12.0-85.el7fdp.x86_64
ovn2.13-20.12.0-85.el7fdp.x86_64

+ grep 'transaction error' /var/log/ovn/ovn-northd.log
<=== no error
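An optional extra sanity check one could run after the final "clear" in the script above (a sketch beyond the original verification steps; sw0-p1 and hagrp1 are the names used in the script):

# On a fixed build, the HA chassis group should be gone from both databases...
ovn-nbctl list ha_chassis_group
ovn-sbctl list ha_chassis_group
# ...and the SB Port_Binding for sw0-p1 should no longer reference it.
ovn-sbctl find port_binding logical_port=sw0-p1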
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:1164
*** Bug 2065897 has been marked as a duplicate of this bug. ***