Bug 1927369 - Removing ha_chassis_group and external port from LSP stucks OVN cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn2.13
Version: FDP 20.I
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Numan Siddique
QA Contact: Jianlin Shi
URL:
Whiteboard:
Duplicates: 1937872 2065897
Depends On:
Blocks: 1927348
Reported: 2021-02-10 15:50 UTC by Jakub Libosvar
Modified: 2022-03-22 13:21 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1927348
Environment:
Last Closed: 2021-04-12 18:09:26 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
SB DB (295.54 KB, text/plain)
2021-02-10 15:50 UTC, Jakub Libosvar
NB DB (60.95 KB, text/plain)
2021-02-10 15:50 UTC, Jakub Libosvar


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FD-1079 0 None None None 2022-03-19 19:41:24 UTC
Red Hat Knowledge Base (Solution) 6830491 0 None None None 2022-03-22 13:21:15 UTC
Red Hat Product Errata RHBA-2021:1164 0 None None None 2021-04-12 18:09:30 UTC

Description Jakub Libosvar 2021-02-10 15:50:10 UTC
Created attachment 1756245 [details]
SB DB

When an external port has an ha_chassis_group, it is bound to some chassis. Removing the ha_chassis_group and the "external" type from the LSP causes northd to get stuck while holding the lock on the SB DB, so no ovn-controller can write to it.

Example:
 $ sudo ovn-nbctl set logical_switch_port b221ff12-3b3d-4935-a2cf-1d63485407d6 "ha_chassis_group=[]" -- set logical_switch_port b221ff12-3b3d-4935-a2cf-1d63485407d6 'options:requested-chassis=""' -- set logical_switch_port b221ff12-3b3d-4935-a2cf-1d63485407d6 'type=""'

northd logs:
2021-02-10T14:44:26.520Z|00065|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 112b6224-9ae1-4e42-8697-ae5bb4f38a04 because of 1 remaining reference(s)","error":"referential integrity violation"}

I will attach NB and SB DBs so it's easier to reproduce outside of OpenStack.


This may be used in OpenStack:
  openstack port create  --network hwoffload_net_nic1_129 --vnic-type direct port_name

networking-ovn treats this as an SR-IOV port, and because such a port is not plugged through br-int, we create an external HA port on the controllers for DHCP and metadata services. Hence the corresponding logical switch port in the OVN NB DB has type: external and an associated ha_chassis_group. Such a port is also bound to the controller with the highest priority, which means there is a port_binding entry in the SB DB that also has the ha_chassis_group.

  openstack port set --binding-profile "capabilities=['switchdev']" port_name

This means the port will be plugged through br-int and will have a representor port. We therefore no longer need the external port, because DHCP and metadata can be handled directly on the hosting hypervisor. The update triggers a call to the NB DB that removes the external type from the port and drops its ha_chassis_group, because neither is needed any more. If this was the last logical switch port associated with the default_ha_chassis group, OVN northd tries to delete the group from the SB DB because it is no longer used. However, the port_binding for the external port still references that ha_chassis_group, so northd's attempt to delete the ha_chassis_group fails and is retried.

Because northd is stuck retrying the deletion, the port binding for the external port cannot be removed, and northd holds a lock on the SB DB. As a result, no ovn-controller in the cluster can create new entries in the database, such as port_binding or mac_binding rows. The cluster is unusable until someone deletes the port_binding of the external logical switch port that no longer exists.
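
The manual recovery mentioned above first requires finding the leftover port_binding. The following is only a sketch of how that lookup might be done, assuming a stale row can be recognised as an SB port_binding of type external whose logical port is no longer an external port in the NB DB. The function name and the OVN_SBCTL / OVN_NBCTL variables are hypothetical and not part of OVN; the function only prints candidates and deletes nothing.

```shell
# Hypothetical helper, not part of OVN: print the logical_port of every SB
# port_binding of type "external" that has no matching external LSP in the NB DB.
# OVN_SBCTL / OVN_NBCTL are assumed overrides so the commands can be wrapped
# (for example with sudo or a --db=... option).
OVN_SBCTL=${OVN_SBCTL:-ovn-sbctl}
OVN_NBCTL=${OVN_NBCTL:-ovn-nbctl}

list_stale_external_bindings() {
    # --bare --columns prints one value per row, with blank lines between rows.
    $OVN_SBCTL --bare --columns=logical_port find port_binding type=external |
    while read -r lport; do
        # Skip the blank separator lines.
        [ -n "$lport" ] || continue
        # Empty output means the LSP is gone or is no longer type external.
        nb_type=$($OVN_NBCTL --bare --columns=type find logical_switch_port \
                  name="$lport" type=external)
        [ -n "$nb_type" ] || echo "$lport"
    done
}
```

Calling list_stale_external_bindings on a node with access to both databases prints one logical port name per line. A row confirmed as stale could then be removed with the generic database command "ovn-sbctl destroy port_binding <uuid>"; whether that is safe depends on the deployment.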

Comment 1 Jakub Libosvar 2021-02-10 15:50:35 UTC
Created attachment 1756246 [details]
NB DB

Comment 2 Terry Wilson 2021-03-11 23:22:54 UTC
*** Bug 1937872 has been marked as a duplicate of this bug. ***

Comment 7 Jianlin Shi 2021-03-18 03:15:24 UTC
Tested with the following script:

#!/bin/bash

# Start OVS and northd, expose the NB/SB DBs over TCP and register this host as chassis hv1.
systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.174.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.174.25
systemctl restart ovn-controller

# Create a logical switch with one port of type external.
ovn-nbctl ls-add sw0
ovn-nbctl lsp-add sw0 sw0-p1
ovn-nbctl lsp-set-type sw0-p1 external

# Create an HA chassis group containing hv1 with priority 20.
ovn-nbctl ha-chassis-group-add hagrp1
ovn-nbctl ha-chassis-group-add-chassis hagrp1 hv1 20

# Look up the group's UUID and attach the group to the external port.
ha_grp1_uuid=$(ovn-nbctl find ha_chassis_group name=hagrp1 | awk '/_uuid/{print $3}')

ovn-nbctl set logical_switch_port sw0-p1 ha_chassis_group="$ha_grp1_uuid"
ovn-nbctl list logical_switch_port sw0-p1

# Detach and re-attach the group, then clear the group and the external type
# in a single transaction, which is the step that triggers the bug.
ovn-nbctl clear logical_switch_port sw0-p1 ha_chassis_group
ovn-nbctl set logical_switch_port sw0-p1 ha_chassis_group="$ha_grp1_uuid"
ovn-nbctl clear logical_switch_port sw0-p1 ha_chassis_group -- set logical_switch_port sw0-p1 'type=""'

ovn-nbctl list logical_switch_port sw0-p1
ovn-nbctl list ha_chassis_group

# On an affected build northd logs repeated referential integrity errors.
grep "transaction error" /var/log/ovn/ovn-northd.log

reproduced on 20.12.0-24:

[root@wsfd-advnetlab18 bz1927369]# rpm -qa | grep -E "openvswitch2.13|ovn2.13"
python3-openvswitch2.13-2.13.0-85.el7fdp.x86_64
ovn2.13-central-20.12.0-24.el7fdp.x86_64
openvswitch2.13-2.13.0-85.el7fdp.x86_64
ovn2.13-20.12.0-24.el7fdp.x86_64
ovn2.13-host-20.12.0-24.el7fdp.x86_64

+ grep 'transaction error' /var/log/ovn/ovn-northd.log
2021-03-18T03:14:24.688Z|00008|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 9a25a1c3-bda5-49fd-8a7c-7eaacdef9451 because of 1 remaining reference(s)","error":"referential integrity violation"}
2021-03-18T03:14:24.689Z|00009|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 9a25a1c3-bda5-49fd-8a7c-7eaacdef9451 because of 1 remaining reference(s)","error":"referential integrity violation"}
2021-03-18T03:14:24.689Z|00010|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 9a25a1c3-bda5-49fd-8a7c-7eaacdef9451 because of 1 remaining reference(s)","error":"referential integrity violation"}
2021-03-18T03:14:24.690Z|00011|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 9a25a1c3-bda5-49fd-8a7c-7eaacdef9451 because of 1 remaining reference(s)","error":"referential integrity violation"}
2021-03-18T03:14:24.690Z|00012|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 9a25a1c3-bda5-49fd-8a7c-7eaacdef9451 because of 1 remaining reference(s)","error":"referential integrity violation"}
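
Since northd retries the failing delete continuously, the log fills up with identical lines. As a rough sketch, the errors could be summarised per HA_Chassis_Group row UUID as follows; the function name count_ha_group_delete_errors is hypothetical, and the pattern assumes the message format shown in the log lines above.

```shell
# Hypothetical helper: count "cannot delete HA_Chassis_Group row" errors per
# row UUID in an ovn-northd log. Prints "<uuid> <count>" for each UUID.
count_ha_group_delete_errors() {
    # $1: path to the ovn-northd log file
    grep 'cannot delete HA_Chassis_Group row' "$1" |
    # Keep only the 36-character row UUID from each matching line.
    sed -E 's/.*HA_Chassis_Group row ([0-9a-f-]{36}) .*/\1/' |
    sort | uniq -c | awk '{print $2, $1}'
}
```

For example, count_ha_group_delete_errors /var/log/ovn/ovn-northd.log prints one line per affected row; a steadily growing count for the same UUID suggests northd is stuck in the retry loop described in this bug.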

Verified on 20.12.0-85:

[root@wsfd-advnetlab18 bz1927369]# rpm -qa | grep -E "openvswitch2.13|ovn2.13"
python3-openvswitch2.13-2.13.0-85.el7fdp.x86_64
ovn2.13-host-20.12.0-85.el7fdp.x86_64
openvswitch2.13-2.13.0-85.el7fdp.x86_64
ovn2.13-central-20.12.0-85.el7fdp.x86_64
ovn2.13-20.12.0-85.el7fdp.x86_64

+ grep 'transaction error' /var/log/ovn/ovn-northd.log

<=== no error

Comment 9 errata-xmlrpc 2021-04-12 18:09:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1164

Comment 10 ldenny 2022-03-21 04:21:32 UTC
*** Bug 2065897 has been marked as a duplicate of this bug. ***

