Bug 1810451 - OSP 13, neutron ml2-ovs, DPDK - stop/wait/start of nova instance sometimes deletes tbr and yields Error executing command: RowNotFound: Cannot find Bridge with name=
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: z12
Target Release: 13.0 (Queens)
Assignee: Nate Johnston
QA Contact: Yariv
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-03-05 09:53 UTC by Ignacio
Modified: 2023-10-06 19:23 UTC (History)

Fixed In Version: openstack-neutron-12.1.1-15.el7ost
Doc Type: No Doc Update
Doc Text:
Clone Of: 1655177
Environment:
Last Closed: 2020-06-24 11:53:08 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1869244 0 None None None 2020-03-26 18:00:02 UTC
OpenStack gerrit 714783 0 None MERGED Wait before deleting trunk bridges for DPDK vhu 2021-02-08 12:12:41 UTC
OpenStack gerrit 717394 0 None MERGED Wait before deleting trunk bridges for DPDK vhu 2021-02-08 12:12:41 UTC
OpenStack gerrit 769714 0 None MERGED New test, extends test_subport_connectivity 2021-03-25 11:05:58 UTC
Red Hat Issue Tracker OSP-29409 0 None None None 2023-10-06 19:23:57 UTC
Red Hat Product Errata RHBA-2020:2724 0 None None None 2020-06-24 11:53:32 UTC

Description Ignacio 2020-03-05 09:53:13 UTC
Description of problem:
Same symptom as described in BZ-1655177.

OSP 13, neutron ml2-ovs, DPDK - stop/wait/start of nova instance sometimes deletes tbr and yields Error executing command: RowNotFound: Cannot find Bridge with name=tbr-d7213964-f

Deployment is OSP 13 with ML2/OVS and DPDK. Instance ports are neutron trunk ports with DPDK type=dpdkvhostuserclient.

Error message:
~~~
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command [-] Error executing command: RowNotFound: Cannot find Bridge with name=tbr-d7213964-f
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command Traceback (most recent call last):
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/command.py", line 37, in execute
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command     self.run_idl(None)
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/schema/open_vswitch/commands.py", line 335, in run_idl
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command     br = idlutils.row_by_value(self.api.idl, 'Bridge', 'name', self.bridge)
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/idlutils.py", line 63, in row_by_value
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command     raise RowNotFound(table=table, col=column, match=match)
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command RowNotFound: Cannot find Bridge with name=tbr-d7213964-f
2020-03-02 10:37:45.929 6278 ERROR ovsdbapp.backend.ovs_idl.command 
2020-03-02 10:37:45.932 6278 ERROR neutron.services.trunk.drivers.openvswitch.agent.ovsdb_handler [-] Cannot obtain interface list for bridge tbr-d7213964-f: Cannot find Bridge with name=tbr-d7213964-f: RowNotFound: Cannot find Bridge with name=tbr-d7213964-f
~~~

Version-Release number of selected component (if applicable):
[igarciam@supportshell sosreport-compute-5-02598174-2020-03-02-dbknkkg]$ cat installed-rpms | grep neutron
openstack-neutron-12.1.0-2.el7ost.noarch                    Fri Dec  6 17:39:59 2019
openstack-neutron-common-12.1.0-2.el7ost.noarch             Fri Dec  6 17:39:58 2019
openstack-neutron-l2gw-agent-12.0.2-0.20190420004620.270972f.el7ost.noarch Fri Dec  6 17:40:02 2019
openstack-neutron-lbaas-12.0.1-0.20190803015156.b86fcef.el7ost.noarch Fri Dec  6 17:40:01 2019
openstack-neutron-lbaas-ui-4.0.1-0.20190723082436.ccf8621.el7ost.noarch Fri Dec  6 17:40:20 2019
openstack-neutron-linuxbridge-12.1.0-2.el7ost.noarch        Fri Dec  6 17:40:02 2019
openstack-neutron-metering-agent-12.1.0-2.el7ost.noarch     Fri Dec  6 17:40:03 2019
openstack-neutron-ml2-12.1.0-2.el7ost.noarch                Fri Dec  6 17:40:00 2019
openstack-neutron-openvswitch-12.1.0-2.el7ost.noarch        Fri Dec  6 17:40:03 2019
openstack-neutron-sriov-nic-agent-12.1.0-2.el7ost.noarch    Fri Dec  6 17:40:03 2019
puppet-neutron-12.4.1-8.el7ost.noarch                       Fri Dec  6 17:43:51 2019
python2-neutronclient-6.7.0-1.el7ost.noarch                 Fri Dec  6 17:38:35 2019
python2-neutron-lib-1.13.0-1.el7ost.noarch                  Fri Dec  6 17:38:38 2019
python-neutron-12.1.0-2.el7ost.noarch                       Fri Dec  6 17:39:57 2019


How reproducible:
The customer runs a robustness test in which the issue is reproducible.

Additional info:
sosreports available in supportshell

Comment 6 Nate Johnston 2020-03-24 22:09:53 UTC
What is happening here is an artifact of the interplay between Neutron trunking and DPDK vhostuser mode.  DPDK/vhu mode means that when an instance is powered off the port is deleted, and when an instance is powered on a port is created.  This means a reboot is functionally a super fast delete-then-create.  Neutron trunking mode in combination with DPDK/vhu implements a trunk bridge for each tenant, and the ports for the instances are created as subports of that bridge.  The standard way a trunk bridge works is that when all the subports are deleted, a thread is spawned to delete the trunk bridge, because that is an expensive and time-consuming operation.  That means that if the port in question is the only port on the trunk on that compute node, this happens:

1. The port is deleted (Mar 24 11:33:13)
2. A thread is spawned to delete the trunk
3. The port is recreated (Mar 24 11:33:15)

If the trunk is deleted after #3 happens then the instance has no networking and is inaccessible; this is the scenario that was dealt with in the previous bug, BZ-1655177.  What is happening here is that the trunk is being deleted in the middle of the execution of #3, so that it stops existing in the middle of the port creation logic but before the port is actually created.
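To make the two orderings concrete, here is a minimal toy model of the sequence above. It is only a sketch: `ToyOvsdb`, `replay`, the step names, and the `tbr-demo` / `vhu-demo` names are all hypothetical, and the `RowNotFound` class is a local stand-in rather than the ovsdbapp exception. It replays the steps in a fixed order instead of using real threads, so the outcome of each ordering is deterministic.

```python
class RowNotFound(Exception):
    """Toy stand-in for the lookup error seen in the log (not ovsdbapp's class)."""


class ToyOvsdb:
    """Hypothetical in-memory model: bridge name -> set of port names."""

    def __init__(self):
        self.bridges = {}

    def add_bridge(self, name):
        self.bridges.setdefault(name, set())

    def del_bridge(self, name):
        self.bridges.pop(name, None)

    def list_ports(self, name):
        if name not in self.bridges:
            raise RowNotFound("Cannot find Bridge with name=%s" % name)
        return sorted(self.bridges[name])

    def add_port(self, bridge, port):
        if bridge not in self.bridges:
            raise RowNotFound("Cannot find Bridge with name=%s" % bridge)
        self.bridges[bridge].add(port)

    def del_port(self, bridge, port):
        self.bridges.get(bridge, set()).discard(port)


def replay(order):
    """Replay an instance restart with the steps in the given order.

    Returns "ok", "bridge gone, no networking", or "error: <message>".
    """
    db = ToyOvsdb()
    db.add_bridge("tbr-demo")
    db.add_port("tbr-demo", "vhu-demo")
    steps = {
        "delete_port": lambda: db.del_port("tbr-demo", "vhu-demo"),
        "delete_bridge": lambda: db.del_bridge("tbr-demo"),
        "lookup_bridge": lambda: db.list_ports("tbr-demo"),
        "create_port": lambda: db.add_port("tbr-demo", "vhu-demo"),
    }
    try:
        for step in order:
            steps[step]()
    except RowNotFound as exc:
        return "error: %s" % exc
    return "ok" if "tbr-demo" in db.bridges else "bridge gone, no networking"


# Deletion lands after the port is back: the BZ-1655177 scenario.
previous_bug = replay(["delete_port", "lookup_bridge", "create_port", "delete_bridge"])
# Deletion lands in the middle of port creation: the scenario in this bug.
this_bug = replay(["delete_port", "lookup_bridge", "delete_bridge", "create_port"])
```

In this model `previous_bug` ends with the bridge missing and no error raised, while `this_bug` fails partway through port creation with a "Cannot find Bridge" error.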

Since this is a timing issue between two different threads it's difficult to stamp out entirely, but I think the best way to do it is to add a slight delay in the trunk deletion thread, just a second or two.  That will give the port time to come back online and avoid the trunk deletion entirely.
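A rough sketch of the delay idea, under stated assumptions: the deletion thread sleeps briefly, then re-checks whether a port has come back and skips the deletion if so. The re-check and the lock are assumptions of this toy, not something stated above, and this is not the merged patch or Neutron's real code. `ToyTrunkBridge`, `delete_bridge_later`, and `restart_instance` are hypothetical names, and the delay values are illustrative only.

```python
import threading
import time


class ToyTrunkBridge:
    """Hypothetical in-memory trunk bridge; not Neutron's real class."""

    def __init__(self, name):
        self.name = name
        self.ports = set()
        self.exists = True
        self.lock = threading.Lock()

    def add_port(self, port):
        with self.lock:
            if not self.exists:
                raise LookupError("Cannot find Bridge with name=%s" % self.name)
            self.ports.add(port)

    def remove_port(self, port):
        with self.lock:
            self.ports.discard(port)


def delete_bridge_later(bridge, delay):
    """Wait `delay` seconds, then delete the bridge only if it is still empty."""

    def worker():
        time.sleep(delay)
        with bridge.lock:
            if bridge.ports:
                return  # a port came back during the wait; keep the bridge
            bridge.exists = False

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def restart_instance(delay, recreate_after):
    """Simulate stop/start; return True if the bridge survives the restart."""
    bridge = ToyTrunkBridge("tbr-demo")
    bridge.add_port("vhu-demo")
    bridge.remove_port("vhu-demo")        # instance stops, last port removed
    deleter = delete_bridge_later(bridge, delay)
    time.sleep(recreate_after)            # instance starts again shortly after
    try:
        bridge.add_port("vhu-demo")
    except LookupError:
        pass                              # bridge was already deleted
    deleter.join()
    return bridge.exists
```

With a delay longer than the restart gap, for example `restart_instance(delay=0.5, recreate_after=0.05)`, the port is back before the deletion thread wakes up and the bridge is kept. With no delay and a slower restart, for example `restart_instance(delay=0.0, recreate_after=0.3)`, the bridge is gone by the time the port is recreated. The lock makes the check-and-delete step atomic in this toy; the real agent threads are not necessarily serialized this way.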

Comment 8 Nate Johnston 2020-03-26 18:00:02 UTC
Filed launchpad bug [1] per review feedback [2].

[1] https://bugs.launchpad.net/neutron/+bug/1869244
[2] https://review.opendev.org/#/c/714783/2/

Comment 21 errata-xmlrpc 2020-06-24 11:53:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2724

