Description of problem:
Updating a direct port with a binding profile causes the following error:

2021-02-10T09:13:23.817Z|01290|ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 24f09c3b-3031-419e-92e0-d57843ca684e because of 1 remaining reference(s)","error":"referential integrity violation"}

After that, the VM is active but acquires no IPs.

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20210129.n.0 (overcloud)
ovn2.13-20.09.0-17.el8fdp.x86_64

How reproducible:
Permanent

Steps to Reproduce:
1. Create a direct port:
   openstack port create --network hwoffload_net_nic1_129 --vnic-type direct port_name
2. Set the binding profile:
   openstack port set --binding-profile "capabilities=['switchdev']" port_name
   The error pops up.
3. Delete the port:
   openstack port delete port_name
   The error pops up.
4. Create a server:
   openstack server create ... --nic port-id=port_name test server

Actual results:
The VM is up but has no IPs.

Expected results:
No errors; the VM is up with IPs.

Additional info:
When the port is created in the regular way, the VM comes up with IPs:
openstack port create --network network_id --vnic-type direct --binding-profile '{"capabilities": ["switchdev"]}' port_name
The reason this happens is as follows:

openstack port create --network hwoffload_net_nic1_129 --vnic-type direct port_name

networking-ovn thinks this is an SR-IOV port, and because such a port is not plugged through br-int, we create an external HA port on the controllers for DHCP and metadata services. Hence the corresponding logical switch port in the OVN NB DB has type: external and an associated ha_chassis_group. Such a port is also bound to the controller with the highest priority, which means there is a port_binding entry in the SB DB that also references an ha_chassis_group.

openstack port set --binding-profile "capabilities=['switchdev']" port_name

This means the port will be plugged through br-int and will have a representor port. Because of that, we no longer need the external port: DHCP and metadata can be served directly on the hosting hypervisor. The update triggers a call to the NB DB that removes the type: external port and deletes the ha_chassis_group, because neither is needed any more.

If this is the last logical switch port associated with the default_ha_chassis group, OVN northd tries to delete the group from the SB DB because it is no longer used. However, the port_binding for the external port still exists and still references that ha_chassis_group, so northd's attempt to delete the ha_chassis_group fails and is retried.

Because northd is stuck retrying the deletion of the ha_chassis_group, the port binding for the external port can't be removed, and northd holds a lock over the SB DB. This leaves all ovn-controllers in the whole cluster unable to create new entries in the database, such as port_binding or mac_binding entries. The cluster is unusable until someone deletes the port_binding for the external logical switch port that no longer exists.
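The ordering problem described above can be illustrated with a small toy model. This is only a sketch of the referential integrity rule, not OVN or OVSDB code: the ToyDb class, its methods, and the row names "ha_group_1" and "external_port" are hypothetical and exist only for illustration. It shows that deleting a group fails while a port binding still references it, and succeeds once the stale port binding is removed first.

```python
class ReferentialIntegrityError(Exception):
    """Raised when a row is deleted while other rows still reference it."""


class ToyDb:
    """Hypothetical in-memory stand-in for two SB DB tables (illustration only)."""

    def __init__(self):
        self.ha_chassis_groups = set()
        # Maps port binding name -> name of the referenced group (or None).
        self.port_bindings = {}

    def add_group(self, name):
        self.ha_chassis_groups.add(name)

    def add_port_binding(self, port, group=None):
        if group is not None and group not in self.ha_chassis_groups:
            raise KeyError(f"unknown group {group}")
        self.port_bindings[port] = group

    def delete_port_binding(self, port):
        del self.port_bindings[port]

    def delete_group(self, name):
        # Refuse the delete while any port binding still points at the group.
        refs = [p for p, g in self.port_bindings.items() if g == name]
        if refs:
            raise ReferentialIntegrityError(
                f"cannot delete HA_Chassis_Group row {name} "
                f"because of {len(refs)} remaining reference(s)"
            )
        self.ha_chassis_groups.remove(name)


def demo():
    db = ToyDb()
    db.add_group("ha_group_1")
    db.add_port_binding("external_port", group="ha_group_1")

    # Deleting the group first fails, mirroring the failing transaction.
    try:
        db.delete_group("ha_group_1")
    except ReferentialIntegrityError as err:
        print("delete failed:", err)

    # Removing the stale port binding first lets the group delete go through.
    db.delete_port_binding("external_port")
    db.delete_group("ha_group_1")
    print("groups left:", db.ha_chassis_groups)


if __name__ == "__main__":
    demo()
```

In this model, retrying delete_group without first removing the port binding fails every time, which is the loop northd is described as being stuck in.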
Fixed in ovn2.13-20.12.0-85
Hi, the NFV team has verified this on compose 'RHOS-16.1-RHEL-8-20211126.n.1' with OVN 'ovn2.13-20.12.0-189.el8fdp.x86_64'.
According to our records, this should be resolved by python-networking-ovn-7.3.1-1.20220113183502.el8ost. This build is available now.