Description of problem: OSP 13, neutron ML2/OVS, DPDK - stop/wait/start of a nova instance sometimes deletes the trunk bridge (tbr) and yields: Error executing command: RowNotFound: Cannot find Bridge with name=tbr-f4554113-b

The deployment is OSP 13 with ML2/OVS and DPDK. Instance ports are neutron trunk ports with DPDK vhu ports in vhostuserclient mode.

The customer runs into an issue that occurs seemingly at random. He has 20 instances with this setup. He stops all of the instances and then verifies with `ovs-vsctl show | grep tbr`. The expected behavior at this point is that the tbr is kept in OVS and is not cleaned up/deleted. However, for some instances in this test (about 2 out of 20), the tbr is deleted. These instances do not come up the next time he runs nova start: the instance shows in virsh as paused and in ps aux as interruptible sleep. In fact, I verified that the instance is waiting on a connection to the vhu port that is attached to the tbr. I manually created the tbr and the vhu port, and the instance came up in virsh and in nova.

I then stopped and started the same instance several times. On one of these tries I could reproduce the issue; however, the tbr was *not* deleted before I started the instance. Instead, the OVS logs show that the tbr and vhu were deleted right when the instance came up, but were not recreated, producing the same issue and ERROR message.

We also tried to deliberately stop the instance, then execute `ovs-vsctl del-br tbr<...>`, and nova start the instance. However, this works like a charm. I suspect a race condition, perhaps with neutron's cleanup service, but currently have no proof of this.
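The manual recovery mentioned above (creating the tbr and the vhu port by hand) can be sketched as a dry run. This is not the exact procedure from the case: the bridge name, port name, and vhost-server-path below are illustrative values that must be taken from the actual neutron trunk and the instance's libvirt definition.

```shell
#!/bin/sh
# Dry-run sketch of the manual recovery: recreate the trunk bridge and a
# vhostuserclient port so the paused instance can connect its vhu socket.
# TRUNK_BR, VHU_PORT and SOCK_PATH are illustrative; SOCK_PATH in
# particular is a hypothetical location and must match what libvirt uses.
TRUNK_BR="tbr-f4554113-b"
VHU_PORT="vhu44d242e1-30"
SOCK_PATH="/var/lib/vhost_sockets/${VHU_PORT}"

recreate_trunk_bridge() {
    # Print the ovs-vsctl commands instead of executing them (dry run).
    echo "ovs-vsctl --may-exist add-br ${TRUNK_BR}"
    echo "ovs-vsctl --may-exist add-port ${TRUNK_BR} ${VHU_PORT} -- set Interface ${VHU_PORT} type=dpdkvhostuserclient options:vhost-server-path=${SOCK_PATH}"
}

recreate_trunk_bridge
```

Remove the `echo`s to actually run the commands; `--may-exist` keeps the script idempotent if the agent recreates the bridge concurrently.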
Error message:
~~~
[root@compute-1 ~]# grep ERROR /var/log/containers/neutron/openvswitch-agent.log | tail -n10
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     self.gen.next()
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/api.py", line 94, in transaction
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     self._nested_txn = None
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/api.py", line 54, in __exit__
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     self.result = self.commit()
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 62, in commit
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     raise result.ex
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command
2018-11-30 21:03:20.467 61270 ERROR neutron.services.trunk.drivers.openvswitch.agent.ovsdb_handler [-] Failed to store metadata for trunk f4554113-b695-4e7d-b550-6927765c6679: Cannot find Bridge with name=tbr-f4554113-b: RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
~~~
And (when grepping for tbr-f4554113-b):
~~~
2018-11-30 21:03:20.465 61270 DEBUG oslo_concurrency.lockutils [-] Acquired semaphore "tbr-f4554113-b" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
2018-11-30 21:03:20.466 61270 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): ListPortsCommand(bridge=tbr-f4554113-b) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command [-] Error executing command: RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
2018-11-30 21:03:20.467 61270 ERROR neutron.services.trunk.drivers.openvswitch.agent.ovsdb_handler [-] Failed to store metadata for trunk f4554113-b695-4e7d-b550-6927765c6679: Cannot find Bridge with name=tbr-f4554113-b: RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
2018-11-30 21:03:20.469 61270 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): BridgeExistsCommand(name=tbr-f4554113-b) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-11-30 21:03:20.470 61270 DEBUG oslo_concurrency.lockutils [-] Releasing semaphore "tbr-f4554113-b" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:228
~~~

Version-Release number of selected component (if applicable):

Additional info: Adding more data in private comments
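Note that the bridge name in the error is a truncated form of the trunk ID (`tbr-f4554113-b` vs. trunk `f4554113-b695-4e7d-b550-6927765c6679`). A minimal sketch of the mapping, assuming neutron truncates `tbr-<trunk-id>` to 14 characters, which matches the name length observed in the logs:

```shell
# Sketch: derive the trunk bridge name from a trunk UUID, assuming the
# "tbr-<uuid>" string is truncated to 14 characters (Linux device name
# limits). Useful for mapping a RowNotFound error back to its trunk.
trunk_br_name() {
    printf 'tbr-%s' "$1" | cut -c1-14
}

trunk_br_name f4554113-b695-4e7d-b550-6927765c6679   # -> tbr-f4554113-b
```

This is why a simple `ovs-vsctl list-br | grep <full trunk UUID>` finds nothing; grep for the truncated name instead.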
~~~
[akaris@collab-shell compute-1]$ grep neutron installed-rpms
openstack-neutron-12.0.4-2.el7ost.noarch                                    Wed Nov  7 22:04:28 2018
openstack-neutron-common-12.0.4-2.el7ost.noarch                             Wed Nov  7 22:04:20 2018
openstack-neutron-l2gw-agent-12.0.2-0.20180412115803.a9f8009.el7ost.noarch  Wed Nov  7 22:04:31 2018
openstack-neutron-lbaas-12.0.1-0.20180825022802.ed663f3.el7ost.noarch       Wed Nov  7 22:04:29 2018
openstack-neutron-lbaas-ui-4.0.1-0.20180914200335.dc7d9e7.el7ost.noarch     Wed Nov  7 22:04:50 2018
openstack-neutron-linuxbridge-12.0.4-2.el7ost.noarch                        Wed Nov  7 22:04:33 2018
openstack-neutron-metering-agent-12.0.4-2.el7ost.noarch                     Wed Nov  7 22:04:32 2018
openstack-neutron-ml2-12.0.4-2.el7ost.noarch                                Wed Nov  7 22:04:21 2018
openstack-neutron-openvswitch-12.0.4-2.el7ost.noarch                        Wed Nov  7 22:04:33 2018
openstack-neutron-sriov-nic-agent-12.0.4-2.el7ost.noarch                    Wed Nov  7 22:04:32 2018
puppet-neutron-12.4.1-1.3aa3109git.el7ost.noarch                            Wed Nov  7 22:07:52 2018
python2-neutronclient-6.7.0-1.el7ost.noarch                                 Wed Nov  7 22:03:25 2018
python2-neutron-lib-1.13.0-1.el7ost.noarch                                  Wed Nov  7 22:03:28 2018
python-neutron-12.0.4-2.el7ost.noarch                                       Wed Nov  7 22:04:19 2018
python-neutron-lbaas-12.0.1-0.20180825022802.ed663f3.el7ost.noarch          Wed Nov  7 22:04:22 2018
[akaris@collab-shell compute-1]$
~~~
Test 1 - reproduce issue:
~~~
nova stop vsbg_1_PL-1
~~~
~~~
[root@compute-1 ~]# date
Fri Nov 30 21:02:23 CET 2018
[root@compute-1 ~]# ovs-vsctl show | grep tbr
    Bridge "tbr-c7d83e51-b"
        Port "tbr-c7d83e51-b"
            Interface "tbr-c7d83e51-b"
    Bridge "tbr-f4554113-b"
        Port "tbr-f4554113-b"
            Interface "tbr-f4554113-b"
[root@compute-1 ~]# ovs-vsctl show | grep tbr
    Bridge "tbr-c7d83e51-b"
        Port "tbr-c7d83e51-b"
            Interface "tbr-c7d83e51-b"
    Bridge "tbr-f4554113-b"
        Port "tbr-f4554113-b"
            Interface "tbr-f4554113-b"
[root@compute-1 ~]# grep "tbr-f4554113-b" /var/log/openvswitch/ovs* | tail -n 5
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:01:51.679Z|02319|bridge|INFO|bridge tbr-f4554113-b: added interface vhu44d242e1-30 on port 1
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:01:53.028Z|02320|bridge|INFO|bridge tbr-f4554113-b: added interface tpt-44d242e1-30 on port 2
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:01:54.324Z|02322|bridge|INFO|bridge tbr-f4554113-b: added interface spt-7df8776f-4e on port 3
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:01:54.342Z|02323|bridge|INFO|bridge tbr-f4554113-b: added interface spt-e7476312-fc on port 4
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:01:54.350Z|02324|bridge|INFO|bridge tbr-f4554113-b: added interface spt-605ebd34-0e on port 5
~~~
~~~
nova stop vsbg_1_PL-1
~~~
This deletion here is not normal:
~~~
[root@compute-1 ~]# date
Fri Nov 30 21:02:54 CET 2018
[root@compute-1 ~]# grep "tbr-f4554113-b" /var/log/openvswitch/ovs* | tail -n 5
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:18.986Z|02529|bridge|INFO|bridge tbr-f4554113-b: deleted interface spt-7df8776f-4e on port 3
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.021Z|02531|bridge|INFO|bridge tbr-f4554113-b: deleted interface spt-e7476312-fc on port 4
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.113Z|02533|bridge|INFO|bridge tbr-f4554113-b: deleted interface tpt-44d242e1-30 on port 2
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.114Z|02534|bridge|INFO|bridge tbr-f4554113-b: deleted interface tbr-f4554113-b on port 65534
/var/log/openvswitch/ovs-vswitchd.log:2018-11-30T20:03:19.114Z|02535|bridge|INFO|bridge tbr-f4554113-b: deleted interface vhu44d242e1-30 on port 6
[root@compute-1 ~]# date
Fri Nov 30 21:03:31 CET 2018
[root@compute-1 ~]# grep "tbr-f4554113-b" /var/log/openvswitch/ovs* | tail -n 5
[... same five "deleted interface" lines as above ...]
[root@compute-1 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 4     instance-00000035              running
 9     instance-00000038              paused
[root@compute-1 ~]# grep "tbr-f4554113-b" /var/log/openvswitch/ovs* | tail -n 5
[... same five "deleted interface" lines as above ...]
[root@compute-1 ~]# grep "tbr-f4554113-b" /var/log/openvswitch/ovs* | tail -n 5
[... same five "deleted interface" lines as above ...]
[root@compute-1 ~]# grep ERROR /var/log/containers/neutron/openvswitch-agent.log | tail -n10
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     self.gen.next()
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/api.py", line 94, in transaction
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     self._nested_txn = None
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/api.py", line 54, in __exit__
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     self.result = self.commit()
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 62, in commit
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command     raise result.ex
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
2018-11-30 21:03:20.466 61270 ERROR ovsdbapp.backend.ovs_idl.command
2018-11-30 21:03:20.467 61270 ERROR neutron.services.trunk.drivers.openvswitch.agent.ovsdb_handler [-] Failed to store metadata for trunk f4554113-b695-4e7d-b550-6927765c6679: Cannot find Bridge with name=tbr-f4554113-b: RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
[root@compute-1 ~]# date
Fri Nov 30 21:06:21 CET 2018
[root@compute-1 ~]#
~~~
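A quick way to spot this state on a compute node is to compare the trunk bridges that should exist against the output of `ovs-vsctl list-br`. A sketch (the helper name is ours; it reads the bridge list from stdin so the logic can be checked offline):

```shell
# Diagnostic sketch: given the output of `ovs-vsctl list-br` on stdin,
# report which of the expected trunk bridges are missing. On a compute
# node this would be invoked e.g. as:
#   ovs-vsctl list-br | check_missing_tbrs tbr-c7d83e51-b tbr-f4554113-b
check_missing_tbrs() {
    present=$(cat)                       # capture the bridge list once
    for br in "$@"; do
        # -x: match the whole line, so tbr-f4554113-b does not match
        # a longer, unrelated bridge name by prefix.
        if ! printf '%s\n' "$present" | grep -qx "$br"; then
            echo "MISSING: $br"
        fi
    done
}
```

Any `MISSING:` line after `nova start` corresponds to an instance that will hang in paused state waiting for its vhu connection.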
Test 2 - do not reproduce issue, expected behavior:
~~~
[root@controller-0 EnvironmentFiles]# date
Fri Nov 30 21:14:58 CET 2018
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# nova stop vsbg_1_PL-1
Request to stop server vsbg_1_PL-1 has been accepted.
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# nova list | grep PL-1
| 1f23fafb-e370-4e53-9bf1-8d5b1d8f7387 | vsbg_1_PL-1 | ACTIVE  | powering-off | Running  | vsbg_conf_sp=10.2.0.26; vsbg_sig_sp=10.1.0.12; vsbg_om_sp=192.168.158.92; vsbg_1_vsbg_int_sp=192.168.0.29 |
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# nova list | grep PL-1
| 1f23fafb-e370-4e53-9bf1-8d5b1d8f7387 | vsbg_1_PL-1 | SHUTOFF | -            | Shutdown | vsbg_conf_sp=10.2.0.26; vsbg_sig_sp=10.1.0.12; vsbg_om_sp=192.168.158.92; vsbg_1_vsbg_int_sp=192.168.0.29 |
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# date
Fri Nov 30 21:15:19 CET 2018
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# nova start vsbg_1_PL-1
Request to start server vsbg_1_PL-1 has been accepted.
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# nova list | grep PL-1
| 1f23fafb-e370-4e53-9bf1-8d5b1d8f7387 | vsbg_1_PL-1 | SHUTOFF | powering-on  | Shutdown | vsbg_conf_sp=10.2.0.26; vsbg_sig_sp=10.1.0.12; vsbg_om_sp=192.168.158.92; vsbg_1_vsbg_int_sp=192.168.0.29 |
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# nova list | grep PL-1
| 1f23fafb-e370-4e53-9bf1-8d5b1d8f7387 | vsbg_1_PL-1 | ACTIVE  | -            | Running  | vsbg_conf_sp=10.2.0.26; vsbg_sig_sp=10.1.0.12; vsbg_om_sp=192.168.158.92; vsbg_1_vsbg_int_sp=192.168.0.29 |
(overcloud-Queens) [root@controller-0 EnvironmentFiles]# date
Fri Nov 30 21:16:29 CET 2018
(overcloud-Queens) [root@controller-0 EnvironmentFiles]#
~~~
A correction to this issue description. The customer brought all VMs up and running with trunk ports attached, then stopped all VMs and checked the tbrs: all tbr bridges were there. He waited for 15 minutes and checked again: all tbr bridges were still there. He then started all VMs with nova and checked again: one tbr had disappeared, which caused one PL VM to be stuck in "powering-on" status. Based on the above, we think that the tbr disappearance only happens when nova starts the VMs.
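The verification sequence described above can be sketched as a dry run (the VM names are illustrative, and the commands are printed rather than executed):

```shell
# Dry-run sketch of the customer's verification sequence: stop all VMs,
# wait, confirm the trunk bridges survive, then start the VMs and
# re-check. Remove the echoes to run it for real on a deployment.
verify_tbr_survival() {
    for vm in "$@"; do
        echo "nova stop ${vm}"
    done
    echo "sleep 900   # wait 15 minutes, then confirm tbrs still exist"
    echo "ovs-vsctl list-br | grep '^tbr-'"
    for vm in "$@"; do
        echo "nova start ${vm}"
    done
    echo "ovs-vsctl list-br | grep '^tbr-'   # any tbr gone now?"
}

verify_tbr_survival vsbg_1_PL-1 vsbg_1_PL-2
```

A tbr present after the stop/wait phase but absent after the start phase reproduces the reported behavior.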
Determined the right codepath to mitigate this issue. Filed upstream bug #1807239 for working this issue upstream.
Filed change 623275 against neutron upstream master.
Can you provide the steps that are taken to create the trunk bridge and then create VMs on it? I want to make sure that when I verify this, I am testing the scenario the customer is facing as closely as possible.
Hi, What's the status wrt this bug? Thanks, Andreas
At this point should this bug go into RELEASE_PENDING or CLOSED? Apologies if I betray my ignorance of the post-hotfix part of the process.
Hi, I messed up the hotfix part anyway. This is a containerized environment, and hotfixes have to be delivered with a special method. However, the official release is coming soon and the customer agreed to wait for it. So please feel free to proceed as if the hotfix had never been issued and treat this as a normal ticket. - Andreas
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0555