Description of problem:

Same symptom as described in BZ-1655177. OSP 13, neutron ML2/OVS, DPDK: stop/wait/start of a nova instance sometimes deletes the trunk bridge (tbr-*) and yields:

Error executing command: RowNotFound: Cannot find Bridge with name=tbr-d7213964-f

Deployment is OSP 13 with ML2/OVS and DPDK. Instance ports are neutron trunk ports with DPDK type=dpdkvhostuserclient.

Error message:
~~~
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command [-] Error executing command: RowNotFound: Cannot find Bridge with name=tbr-d7213964-f
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command Traceback (most recent call last):
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/command.py", line 37, in execute
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command     self.run_idl(None)
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/schema/open_vswitch/commands.py", line 335, in run_idl
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command     br = idlutils.row_by_value(self.api.idl, 'Bridge', 'name', self.bridge)
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/idlutils.py", line 63, in row_by_value
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command     raise RowNotFound(table=table, col=column, match=match)
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command RowNotFound: Cannot find Bridge with name=tbr-d7213964-f
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command
2020-03-02 10:37:45.932 6278 ERROR neutron.services.trunk.drivers.openvswitch.agent.ovsdb_handler [-] Cannot obtain interface list for bridge tbr-d7213964-f: Cannot find Bridge with name=tbr-d7213964-f: RowNotFound: Cannot find Bridge with name=tbr-d7213964-f
~~~

Version-Release number of selected component (if applicable):

[igarciam@supportshell sosreport-compute-5-02598174-2020-03-02-dbknkkg]$ cat installed-rpms | grep neutron
openstack-neutron-12.1.0-2.el7ost.noarch                                  Fri Dec 6 17:39:59 2019
openstack-neutron-common-12.1.0-2.el7ost.noarch                           Fri Dec 6 17:39:58 2019
openstack-neutron-l2gw-agent-12.0.2-0.20190420004620.270972f.el7ost.noarch  Fri Dec 6 17:40:02 2019
openstack-neutron-lbaas-12.0.1-0.20190803015156.b86fcef.el7ost.noarch     Fri Dec 6 17:40:01 2019
openstack-neutron-lbaas-ui-4.0.1-0.20190723082436.ccf8621.el7ost.noarch   Fri Dec 6 17:40:20 2019
openstack-neutron-linuxbridge-12.1.0-2.el7ost.noarch                      Fri Dec 6 17:40:02 2019
openstack-neutron-metering-agent-12.1.0-2.el7ost.noarch                   Fri Dec 6 17:40:03 2019
openstack-neutron-ml2-12.1.0-2.el7ost.noarch                              Fri Dec 6 17:40:00 2019
openstack-neutron-openvswitch-12.1.0-2.el7ost.noarch                      Fri Dec 6 17:40:03 2019
openstack-neutron-sriov-nic-agent-12.1.0-2.el7ost.noarch                  Fri Dec 6 17:40:03 2019
puppet-neutron-12.4.1-8.el7ost.noarch                                     Fri Dec 6 17:43:51 2019
python2-neutronclient-6.7.0-1.el7ost.noarch                               Fri Dec 6 17:38:35 2019
python2-neutron-lib-1.13.0-1.el7ost.noarch                                Fri Dec 6 17:38:38 2019
python-neutron-12.1.0-2.el7ost.noarch                                     Fri Dec 6 17:39:57 2019

How reproducible:
Customer runs a robustness test in which the issue is reproducible.

Additional info:
sosreports available in supportshell
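For reference, the RowNotFound in the traceback comes from ovsdbapp's Bridge-by-name lookup, not from anything DPDK-specific. A minimal standalone sketch along these lines (assumptions: the default ovsdb socket path on the compute node and the bridge name taken from the log; this is not the neutron agent code itself) hits the same lookup path and raises the same exception once the trunk bridge has been removed:

~~~
# Minimal sketch (not the neutron agent code): query the port list of a
# trunk bridge via ovsdbapp, exercising the same Bridge-by-name lookup
# (idlutils.row_by_value) that fails in the traceback above.
from ovsdbapp.backend.ovs_idl import connection, idlutils
from ovsdbapp.schema.open_vswitch import impl_idl

OVSDB = 'unix:/run/openvswitch/db.sock'  # assumed default socket path
BRIDGE = 'tbr-d7213964-f'                # bridge name from the log

idl = connection.OvsdbIdl.from_server(OVSDB, 'Open_vSwitch')
api = impl_idl.OvsdbIdl(connection.Connection(idl=idl, timeout=10))

try:
    # If the trunk bridge was deleted concurrently, resolving the Bridge row
    # by name raises RowNotFound exactly as seen in the agent log.
    ports = api.list_ports(BRIDGE).execute(check_error=True)
    print(ports)
except idlutils.RowNotFound:
    print('bridge %s no longer exists' % BRIDGE)
~~~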
What is happening here is an artifact of the interplay between Neutron trunking and DPDK vhostuser mode. In DPDK/vhu mode the port is deleted when an instance is powered off and created again when the instance is powered on, so a reboot is functionally a super fast delete-then-create. Neutron trunking in combination with DPDK/vhu implements a trunk bridge for each tenant, and the ports for the instances are created as subports of that bridge. The standard way a trunk bridge works is that when all the subports are deleted, a thread is spawned to delete the trunk bridge, because that is an expensive and time-consuming operation. That means that if the port in question is the only port on the trunk on that compute node, the following happens:

1. The port is deleted (Mar 24 11:33:13)
2. A thread is spawned to delete the trunk
3. The port is recreated (Mar 24 11:33:15)

If the trunk is deleted after step 3 happens, the instance has no networking and is inaccessible; that is the scenario that was dealt with in the previous bug, BZ-1655177. What is happening here is that the trunk is being deleted in the middle of step 3, so it stops existing partway through the port creation logic but before the port is actually created. Since this is a timing issue between two different threads it is difficult to stamp out entirely, but I think the best way to do it is to add a slight delay in the trunk deletion thread, just a second or two. That will give the port time to come back online and avoid the trunk deletion entirely (a rough sketch of the idea follows below).
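To make that concrete, here is a hypothetical sketch of the shape of such a delayed deletion. The helper names `bridge_has_ports` and `delete_bridge` are placeholders, not the actual neutron trunk driver API, and the grace period value is an assumption; the real change would live in the OVS trunk agent's deletion path.

~~~
# Hypothetical sketch of a delayed trunk-bridge deletion; not the actual
# neutron patch. Idea: wait a short grace period, then only delete the
# trunk bridge if no subport has reappeared on it in the meantime.
import time

TRUNK_DELETE_GRACE = 2  # seconds; "just a second or two" as proposed above


def delayed_trunk_bridge_delete(bridge_name, bridge_has_ports, delete_bridge):
    """Delete a trunk bridge only if it is still empty after a short delay.

    bridge_has_ports(name) -> bool and delete_bridge(name) are placeholders
    for whatever the agent uses to inspect and remove the OVS bridge.
    """
    time.sleep(TRUNK_DELETE_GRACE)
    if bridge_has_ports(bridge_name):
        # A subport came back (e.g. the instance was started again while we
        # slept), so the trunk bridge is in use again: skip the deletion.
        return False
    delete_bridge(bridge_name)
    return True
~~~

Note that the sketch re-checks the bridge after the delay rather than relying on the delay alone; otherwise the deletion would still race with a recreate that takes slightly longer than the grace period.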
Filed launchpad bug [1] per review feedback [2]. [1] https://bugs.launchpad.net/neutron/+bug/1869244 [2] https://review.opendev.org/#/c/714783/2/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2724