Bug 1655177 - OSP 13, neutron ml2-ovs, DPDK - stop/wait/start of nova instance sometimes deletes tbr and yields Error executing command: RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
Summary: OSP 13, neutron ml2-ovs, DPDK - stop/wait/start of nova instance sometimes deletes tbr and yields Error executing command: RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z4
Target Release: 13.0 (Queens)
Assignee: Nate Johnston
QA Contact: Yariv
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-30 21:08 UTC by Andreas Karis
Modified: 2019-10-22 08:16 UTC (History)
17 users

Fixed In Version: openstack-neutron-12.0.5-3.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-14 13:51:31 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Launchpad 1807239 None None None 2018-12-06 19:00:39 UTC
OpenStack gerrit 623275 None MERGED Do not delete trunk bridges if service port attached 2020-02-14 14:38:15 UTC
OpenStack gerrit 625578 None MERGED Do not delete trunk bridges if service port attached 2020-02-14 14:38:15 UTC
Red Hat Product Errata RHBA-2019:0555 None None None 2019-03-14 13:51:40 UTC

Description Andreas Karis 2018-11-30 21:08:53 UTC
Description of problem:
OSP 13, neutron ml2-ovs, DPDK - stop/wait/start of nova instance sometimes deletes tbr and yields Error executing command: RowNotFound: Cannot find Bridge with name=tbr-f4554113-b

Deployment is OSP 13 with ML2/OVS and DPDK. Instance ports are neutron trunk ports with DPDK vhu in vhostuserclient mode.

The customer runs into an issue that occurs seemingly at random. He has 20 instances with this setup. He stops all of the instances and then verifies with `ovs-vsctl show | grep tbr`. The expected behavior at this point is that the trunk bridge (tbr) is kept in OVS and is not cleaned up/deleted. However, for some instances in this test (about 2 out of 20), the tbr is deleted. These instances do not come up the next time he runs `nova start`: the instance shows as paused in virsh and as interruptible sleep in `ps aux`.

In fact, I verified that the instance is waiting on a connection to the vhu port attached to the tbr. After I manually re-created the tbr and the vhu port, the instance came up in both virsh and nova.
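For reference, the manual recovery can be sketched roughly as below. This is a hedged sketch, not an exact transcript of what I ran: the `netdev` datapath type and the vhost socket path are assumptions (take the real `vhost-server-path` from the instance's libvirt domain XML).

```python
# Hedged sketch of the manual workaround: build the ovs-vsctl commands that
# re-create the deleted trunk bridge and its vhostuser port. The socket path
# argument is an assumption; take the real one from the instance's libvirt XML.
def rebuild_trunk_bridge_cmds(bridge, vhu_port, vhost_sock_path):
    return [
        # Re-create the trunk bridge with the userspace (DPDK) datapath.
        f"ovs-vsctl --may-exist add-br {bridge} "
        f"-- set bridge {bridge} datapath_type=netdev",
        # Re-attach the vhostuserclient port the instance is waiting on.
        f"ovs-vsctl --may-exist add-port {bridge} {vhu_port} "
        f"-- set Interface {vhu_port} type=dpdkvhostuserclient "
        f"options:vhost-server-path={vhost_sock_path}",
    ]
```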

I then stopped and started the same instance several times. On one of these tries, I could reproduce the issue; however, the tbr was *not* deleted before I started the instance. Instead, the OVS logs show that the tbr and vhu were deleted right when the instance came up, but were never re-created, producing the same issue and ERROR message.

We also tried deliberately stopping the instance, executing `ovs-vsctl del-br tbr<...>`, and then starting the instance with nova. In that case, everything works without issue.

I suspect some race condition, perhaps with neutron's cleanup service, but currently have no proof of this.
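The merged upstream fix linked below ("Do not delete trunk bridges if service port attached", gerrit change 623275) guards against exactly this kind of race. The following is a minimal illustrative sketch of the guard's idea only, not neutron's actual code; the port-name prefixes are taken from the logs in this report.

```python
# Illustrative sketch only -- not neutron's actual implementation.
# Idea of the merged fix: before deleting a trunk bridge, check whether an
# instance-facing (service) port such as a vhostuser port is still attached;
# if so, the instance may be starting and the bridge must be left alone.

# Prefixes of agent-owned ports seen on trunk bridges in this report's logs:
# the bridge's internal port (tbr-), trunk patch ports (tpt-) and
# subport patch ports (spt-).
AGENT_PORT_PREFIXES = ("tbr-", "tpt-", "spt-")

def should_delete_trunk_bridge(bridge_ports):
    """Return True only when no service (instance) port remains attached."""
    return all(p.startswith(AGENT_PORT_PREFIXES) for p in bridge_ports)
```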

Error message:
~~~
[root@compute-1 ~]# grep ERROR /var/log/containers/neutron/openvswitch-agent.log | tail -n10
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     self.gen.next()
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/api.py", line 94, in transaction
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     self._nested_txn = None
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/api.py", line 54, in __exit__
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     self.result = self.commit()
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 62, in commit
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     raise result.ex
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command
2018-11-30 21:03:20.467 61270 ERROR neutron.services.trunk.drivers.openvswitch.agent.ovsdb_handler [-] Failed to store metadata for trunk f4554113-b695-4e7d-b550-6927765c6679: Cannot find Bridge with name=tbr-f4554113-b: RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
~~~

And (when grepping for tbr-f4554113-b):
~~~
2018-11-30 21:03:20.465 61270 DEBUG oslo_concurrency.lockutils [-] Acquired semaphore "tbr-f4554113-b" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
2018-11-30 21:03:20.466 61270 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): ListPortsCommand(bridge=tbr-f4554113-b) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command [-] Error executing command: RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
2018-11-30 21:03:20.467 61270 ERROR neutron.services.trunk.drivers.openvswitch.agent.ovsdb_handler [-] Failed to store metadata for trunk f4554113-b695-4e7d-b550-6927765c6679: Cannot find Bridge with name=tbr-f4554113-b: RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
2018-11-30 21:03:20.469 61270 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): BridgeExistsCommand(name=tbr-f4554113-b) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-11-30 21:03:20.470 61270 DEBUG oslo_concurrency.lockutils [-] Releasing semaphore "tbr-f4554113-b" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:228
~~~


Version-Release number of selected component (if applicable):


Additional info:
Adding more data in private comments

Comment 1 Andreas Karis 2018-11-30 21:10:12 UTC
[akaris@collab-shell compute-1]$ grep neutron installed-rpms 
openstack-neutron-12.0.4-2.el7ost.noarch                    Wed Nov  7 22:04:28 2018
openstack-neutron-common-12.0.4-2.el7ost.noarch             Wed Nov  7 22:04:20 2018
openstack-neutron-l2gw-agent-12.0.2-0.20180412115803.a9f8009.el7ost.noarch Wed Nov  7 22:04:31 2018
openstack-neutron-lbaas-12.0.1-0.20180825022802.ed663f3.el7ost.noarch Wed Nov  7 22:04:29 2018
openstack-neutron-lbaas-ui-4.0.1-0.20180914200335.dc7d9e7.el7ost.noarch Wed Nov  7 22:04:50 2018
openstack-neutron-linuxbridge-12.0.4-2.el7ost.noarch        Wed Nov  7 22:04:33 2018
openstack-neutron-metering-agent-12.0.4-2.el7ost.noarch     Wed Nov  7 22:04:32 2018
openstack-neutron-ml2-12.0.4-2.el7ost.noarch                Wed Nov  7 22:04:21 2018
openstack-neutron-openvswitch-12.0.4-2.el7ost.noarch        Wed Nov  7 22:04:33 2018
openstack-neutron-sriov-nic-agent-12.0.4-2.el7ost.noarch    Wed Nov  7 22:04:32 2018
puppet-neutron-12.4.1-1.3aa3109git.el7ost.noarch            Wed Nov  7 22:07:52 2018
python2-neutronclient-6.7.0-1.el7ost.noarch                 Wed Nov  7 22:03:25 2018
python2-neutron-lib-1.13.0-1.el7ost.noarch                  Wed Nov  7 22:03:28 2018
python-neutron-12.0.4-2.el7ost.noarch                       Wed Nov  7 22:04:19 2018
python-neutron-lbaas-12.0.1-0.20180825022802.ed663f3.el7ost.noarch Wed Nov  7 22:04:22 2018
[akaris@collab-shell compute-1]$

Comment 3 Andreas Karis 2018-11-30 21:21:50 UTC
Test 1 - reproduce issue:
~~~
nova stop vsbg_1_PL-1
~~~

~~~
[root@compute-1 ~]# date
Fri Nov 30 21:02:23 CET 2018
[root@compute-1 ~]# ovs-vsctl show | grep tbr
    Bridge "tbr-c7d83e51-b"
        Port "tbr-c7d83e51-b"
            Interface "tbr-c7d83e51-b"
    Bridge "tbr-f4554113-b"
        Port "tbr-f4554113-b"
            Interface "tbr-f4554113-b"
[root@compute-1 ~]# ovs-vsctl show | grep tbr
    Bridge "tbr-c7d83e51-b"
        Port "tbr-c7d83e51-b"
            Interface "tbr-c7d83e51-b"
    Bridge "tbr-f4554113-b"
        Port "tbr-f4554113-b"
            Interface "tbr-f4554113-b"
[root@compute-1 ~]# grep "tbr-f4554113-b" /var/log/openvswitch/ovs* | tail -n 5
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:01:51.679Z|02319|bridge|INFO|bridge tbr-f4554113-b: added interface vhu44d242e1-30 on port 1
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:01:53.028Z|02320|bridge|INFO|bridge tbr-f4554113-b: added interface tpt-44d242e1-30 on port 2
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:01:54.324Z|02322|bridge|INFO|bridge tbr-f4554113-b: added interface spt-7df8776f-4e on port 3
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:01:54.342Z|02323|bridge|INFO|bridge tbr-f4554113-b: added interface spt-e7476312-fc on port 4
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:01:54.350Z|02324|bridge|INFO|bridge tbr-f4554113-b: added interface spt-605ebd34-0e on port 5
~~~

~~~
nova stop vsbg_1_PL-1
~~~

This deletion here is not normal:
~~~
[root@compute-1 ~]# date
Fri Nov 30 21:02:54 CET 2018
[root@compute-1 ~]# grep "tbr-f4554113-b" /var/log/openvswitch/ovs* | tail -n 5
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:18.986Z|02529|bridge|INFO|bridge tbr-f4554113-b: deleted interface spt-7df8776f-4e on port 3
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.021Z|02531|bridge|INFO|bridge tbr-f4554113-b: deleted interface spt-e7476312-fc on port 4
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.113Z|02533|bridge|INFO|bridge tbr-f4554113-b: deleted interface tpt-44d242e1-30 on port 2
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.114Z|02534|bridge|INFO|bridge tbr-f4554113-b: deleted interface tbr-f4554113-b on port 65534
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.114Z|02535|bridge|INFO|bridge tbr-f4554113-b: deleted interface vhu44d242e1-30 on port 6
[root@compute-1 ~]# date
Fri Nov 30 21:03:31 CET 2018
[root@compute-1 ~]# grep "tbr-f4554113-b" /var/log/openvswitch/ovs* | tail -n 5
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:18.986Z|02529|bridge|INFO|bridge tbr-f4554113-b: deleted interface spt-7df8776f-4e on port 3
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.021Z|02531|bridge|INFO|bridge tbr-f4554113-b: deleted interface spt-e7476312-fc on port 4
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.113Z|02533|bridge|INFO|bridge tbr-f4554113-b: deleted interface tpt-44d242e1-30 on port 2
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.114Z|02534|bridge|INFO|bridge tbr-f4554113-b: deleted interface tbr-f4554113-b on port 65534
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.114Z|02535|bridge|INFO|bridge tbr-f4554113-b: deleted interface vhu44d242e1-30 on port 6
[root@compute-1 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 4     instance-00000035              running
 9     instance-00000038              paused

[root@compute-1 ~]# grep "tbr-f4554113-b" /var/log/openvswitch/ovs* | tail -n 5
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:18.986Z|02529|bridge|INFO|bridge tbr-f4554113-b: deleted interface spt-7df8776f-4e on port 3
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.021Z|02531|bridge|INFO|bridge tbr-f4554113-b: deleted interface spt-e7476312-fc on port 4
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.113Z|02533|bridge|INFO|bridge tbr-f4554113-b: deleted interface tpt-44d242e1-30 on port 2
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.114Z|02534|bridge|INFO|bridge tbr-f4554113-b: deleted interface tbr-f4554113-b on port 65534
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.114Z|02535|bridge|INFO|bridge tbr-f4554113-b: deleted interface vhu44d242e1-30 on port 6
[root@compute-1 ~]# grep "tbr-f4554113-b" /var/log/openvswitch/ovs* | tail -n 5
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:18.986Z|02529|bridge|INFO|bridge tbr-f4554113-b: deleted interface spt-7df8776f-4e on port 3
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.021Z|02531|bridge|INFO|bridge tbr-f4554113-b: deleted interface spt-e7476312-fc on port 4
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.113Z|02533|bridge|INFO|bridge tbr-f4554113-b: deleted interface tpt-44d242e1-30 on port 2
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.114Z|02534|bridge|INFO|bridge tbr-f4554113-b: deleted interface tbr-f4554113-b on port 65534
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.114Z|02535|bridge|INFO|bridge tbr-f4554113-b: deleted interface vhu44d242e1-30 on port 6


[root@compute-1 ~]# grep ERROR /var/log/containers/neutron/openvswitch-agent.log | tail -n10
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     self.gen.next()
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/api.py", line 94, in transaction
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     self._nested_txn = None
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/api.py", line 54, in __exit__
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     self.result = self.commit()
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 62, in commit
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     raise result.ex
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command
2018-11-30 21:03:20.467 61270 ERROR neutron.services.trunk.drivers.openvswitch.agent.ovsdb_handler [-] Failed to store metadata for trunk f4554113-b695-4e7d-b550-6927765c6679: Cannot find Bridge with name=tbr-f4554113-b: RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
[root@compute-1 ~]# date
Fri Nov 30 21:06:21 CET 2018
[root@compute-1 ~]#
~~~

Comment 4 Andreas Karis 2018-11-30 21:24:32 UTC
Test 2 - do not reproduce issue, expected behavior:

[root@controller-0 EnvironmentFiles]# date
Fri Nov 30 21:14:58 CET 2018
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# nova stop vsbg_1_PL-1
Request to stop server vsbg_1_PL-1 has been accepted.
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# nova list | grep PL-1
| 1f23fafb-e370-4e53-9bf1-8d5b1d8f7387 | vsbg_1_PL-1  | ACTIVE  | powering-off | Running     | vsbg_conf_sp=10.2.0.26; vsbg_sig_sp=10.1.0.12; vsbg_om_sp=192.168.158.92; vsbg_1_vsbg_int_sp=192.168.0.29 |
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# nova list | grep PL-1
| 1f23fafb-e370-4e53-9bf1-8d5b1d8f7387 | vsbg_1_PL-1  | SHUTOFF | -           | Shutdown    | vsbg_conf_sp=10.2.0.26; vsbg_sig_sp=10.1.0.12; vsbg_om_sp=192.168.158.92; vsbg_1_vsbg_int_sp=192.168.0.29 |
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# date
Fri Nov 30 21:15:19 CET 2018
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# nova start vsbg_1_PL-1
Request to start server vsbg_1_PL-1 has been accepted.
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# nova list | grep PL-1
| 1f23fafb-e370-4e53-9bf1-8d5b1d8f7387 | vsbg_1_PL-1  | SHUTOFF | powering-on | Shutdown    | vsbg_conf_sp=10.2.0.26; vsbg_sig_sp=10.1.0.12; vsbg_om_sp=192.168.158.92; vsbg_1_vsbg_int_sp=192.168.0.29 |
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# nova list | grep PL-1
| 1f23fafb-e370-4e53-9bf1-8d5b1d8f7387 | vsbg_1_PL-1  | ACTIVE  | -           | Running     | vsbg_conf_sp=10.2.0.26; vsbg_sig_sp=10.1.0.12; vsbg_om_sp=192.168.158.92; vsbg_1_vsbg_int_sp=192.168.0.29 |
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# date
Fri Nov 30 21:16:29 CET 2018
(overcloud-Queens) [root@controller-0 EnvironmentFiles]#

Comment 7 Andreas Karis 2018-11-30 22:40:43 UTC
A correction to this issue description. 

The customer brought all VMs up and running with trunk ports attached, then stopped all VMs and checked the tbrs: all tbr bridges were there.
He waited for 15 minutes and checked again: all tbr bridges were still there.
He then started all VMs with nova and checked again: 1 tbr had disappeared, which caused 1 PL VM to be stuck in the "powering-on" status.

Based on the above, we think that the tbr only disappears when nova starts the VMs.
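The repeated `ovs-vsctl show | grep tbr` checks described above can be scripted to detect a disappearing trunk bridge automatically. A small sketch, assuming bridge names keep the tbr- prefix seen in this report:

```python
import re

def trunk_bridges(ovs_vsctl_show_output):
    """Return the set of tbr-* bridge names found in `ovs-vsctl show` output."""
    # Bridge names may or may not be quoted depending on the OVS version.
    return set(re.findall(r'Bridge "?(tbr-[\w-]+)"?', ovs_vsctl_show_output))

def disappeared(before_output, after_output):
    """Compare two `ovs-vsctl show` snapshots (e.g. before and after
    `nova start`) and return the trunk bridges deleted in between."""
    return trunk_bridges(before_output) - trunk_bridges(after_output)
```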

Comment 16 Nate Johnston 2018-12-06 19:00:40 UTC
Determined the right codepath to mitigate this issue. Filed upstream bug #1807239 to track and fix this issue upstream.

Comment 17 Nate Johnston 2018-12-06 19:01:39 UTC
Filed change 623275 against neutron upstream master.

Comment 36 Nate Johnston 2019-01-11 17:00:04 UTC
Can you provide the steps that are taken to create the trunk bridge and then create VMs on it? I want to make sure that when I verify this, I am testing the scenario the customer is facing as closely as possible.

Comment 49 Andreas Karis 2019-02-15 16:21:20 UTC
Hi,

What's the status wrt this bug?

Thanks,

Andreas

Comment 57 Nate Johnston 2019-03-11 14:08:44 UTC
At this point should this bug go into RELEASE_PENDING or CLOSED?  Apologies if I betray my ignorance of the post-hotfix part of the process.

Comment 58 Andreas Karis 2019-03-12 15:20:38 UTC
Hi,

I messed up the hotfix part anyway: this is a containerized environment, and hotfixes need to be delivered with a special method. However, the official release is coming soon and the customer agreed to wait for it. So please feel free to proceed as if the hotfix had not been issued and as if this were a normal ticket.

- Andreas

Comment 60 errata-xmlrpc 2019-03-14 13:51:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0555

