Bug 1806623

Summary: [OVN][openvswitch 2.9->2.11 update, 13z8->13z11] Multiple port binding errors after overcloud update
Product: Red Hat OpenStack
Component: python-networking-ovn
Version: 13.0 (Queens)
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: unspecified
Reporter: Roman Safronov <rsafrono>
Assignee: Jakub Libosvar <jlibosva>
QA Contact: Eran Kuris <ekuris>
CC: apevec, jlibosva, lhh, majopela, scohen
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2022-01-11 14:14:53 UTC

Description Roman Safronov 2020-02-24 16:20:08 UTC
Description of problem:

After the overcloud update (but before the overcloud reboot) many tempest tests fail, complaining that valid neutron API calls returned 'Server error'.

Link to the job:
http://staging-jenkins2-qe-playground.usersys.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/2/


There are multiple errors like "Logical port XXXX is not bound to a chassis: MechanismDriverError" in the neutron server log; see below.


2020-02-23 20:02:19.649 36 ERROR neutron.plugins.ml2.managers 
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event [req-cbcf2731-9877-4d3c-bd36-5d612bd019ac - - - - -] Unexpected exception in notify_loop: MechanismDriverError
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event Traceback (most recent call last):
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/ovsdbapp/event.py", line 117, in notify_loop
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     match.run(event, row, updates)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py", line 211, in run
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     self.driver.set_port_status_up(row.name)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/networking_ovn/ml2/mech_driver.py", line 775, in set_port_status_up
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     const.PORT_STATUS_ACTIVE)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/plugin.py", line 1855, in update_port_status
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     context, {port_id: status}, host)[port_id]
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 634, in inner
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     return f(self, context, *args, **kwargs)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/neutron/db/api.py", line 161, in wrapped
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     return method(*args, **kwargs)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/neutron/db/api.py", line 91, in wrapped
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     setattr(e, '_RETRY_EXCEEDED', True)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     self.force_reraise()
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     six.reraise(self.type_, self.value, self.tb)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/neutron/db/api.py", line 87, in wrapped
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     return f(*args, **kwargs)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 147, in wrapper
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     ectxt.value = e.inner_exc
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     self.force_reraise()
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     six.reraise(self.type_, self.value, self.tb)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 135, in wrapper
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     return f(*args, **kwargs)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/neutron/db/api.py", line 126, in wrapped
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     LOG.debug("Retry wrapper got retriable exception: %s", e)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     self.force_reraise()
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     six.reraise(self.type_, self.value, self.tb)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/neutron/db/api.py", line 122, in wrapped
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     return f(*dup_args, **dup_kwargs)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/plugin.py", line 1870, in update_port_statuses
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     context, port_dbs_by_id[port_id], status, host)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/plugin.py", line 1882, in _safe_update_individual_port_db_status
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     ectx.reraise = bool(db.get_port(context, port_id))
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     self.force_reraise()
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     six.reraise(self.type_, self.value, self.tb)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/plugin.py", line 1878, in _safe_update_individual_port_db_status
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     context, port, status, host)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/plugin.py", line 1946, in _update_individual_port_db_status
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     self.mechanism_manager.update_port_postcommit(mech_context)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/managers.py", line 723, in update_port_postcommit
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     continue_on_failure=True)
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/managers.py", line 473, in _call_on_drivers
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event     errors=errors
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event MechanismDriverError
2020-02-23 20:02:19.700 36 ERROR ovsdbapp.event 
2020-02-23 20:02:21.789 36 WARNING networking_ovn.ml2.mech_driver [req-cbcf2731-9877-4d3c-bd36-5d612bd019ac - - - - -] Logical port 82d73d17-f93d-4a0c-9619-234a162dc564 is not bound to a chassis: MechanismDriverError
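
For anyone triaging a similar run, here is a hedged diagnostic sketch (not part of the report; it assumes ovn-sbctl is available on a controller and reuses the port UUID from the WARNING above) to check in the OVN Southbound DB whether the logical port ever got bound:

    # Check whether the logical port from the WARNING above has a chassis
    # binding in the OVN Southbound DB. Port_Binding.chassis stays empty
    # until ovn-controller on some chassis claims the port.
    import subprocess

    PORT = '82d73d17-f93d-4a0c-9619-234a162dc564'

    out = subprocess.check_output(
        ['ovn-sbctl', '--bare', '--columns=chassis',
         'find', 'Port_Binding', 'logical_port=%s' % PORT]).decode().strip()
    print('bound to chassis: %s' % (out or '<none>'))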



Note: The minor update job consists of the following steps:
1. Overcloud deployment
2. Tempest run - no failed tests
3. Undercloud update
4. Overcloud update
5. 2nd tempest run - 74 failed tests; customers may experience the same issues
6. Overcloud reboot
7. 3rd tempest run - no failed tests

Version-Release number of selected component (if applicable):

Minor update from 13z8 (2019-08-27.2, the last release with openvswitch 2.9) to 13z11 (2020-02-20.4, the latest puddle, with openvswitch 2.11)
python-networking-ovn-4.0.4-2.el7ost.noarch
openvswitch2.11-2.11.0-35.el7fdp.x86_64


How reproducible:
100%

Steps to Reproduce:
Run this job:
http://staging-jenkins2-qe-playground.usersys.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/

Note: this is a staging Jenkins instance; the job is identical to https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/
However, I had to run the job with ir_tripleo_upgrade_floatingip_check: 'no' in order to avoid a failure due to https://bugzilla.redhat.com/show_bug.cgi?id=1786468


Actual results:
74 tests failed on the 2nd tempest run

Expected results:
No failed tests on the 2nd tempest run

Additional info:

Logs can be found at http://file.tlv.redhat.com/~rsafrono/bug_sosreport/osp13_minor_update/

Comment 4 Jakub Libosvar 2020-08-31 16:07:56 UTC
Binding fails because the VMs' security groups try to use Port Groups while the maintenance task has not yet successfully migrated them from Address Sets:

2020-02-23 20:27:18.459 [c2] 37 DEBUG networking_ovn.common.maintenance [req-fe795db1-bb36-4e90-bfef-b35a9cff0f7b - - - - -] Maintenance task: Fixing resource a940c8ac-f657-45b5-a849-42730a6b41be (type: security_groups) at delete check_for_inconsistencies /usr/lib/python2.7/site-packages/networking_ovn/common/maintenance.py:325
2020-02-23 20:27:18.461 [c2] 37 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): DelAddrSetCommand(if_exists=True, name=as_ip4_a940c8ac_f657_45b5_a849_42730a6b41be) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2020-02-23 20:27:18.461 [c2] 37 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=1): DelAddrSetCommand(if_exists=True, name=as_ip6_a940c8ac_f657_45b5_a849_42730a6b41be) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2020-02-23 20:27:18.463 [c2] 37 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: The transaction failed because the IDL has been configured to require a database lock but didn't get it yet or has already lost it: RuntimeError: OVSDB Error: The transaction failed because the IDL has been configured to require a database lock but didn't get it yet or has already lost it
2020-02-23 20:27:18.463 [c2] 37 ERROR ovsdbapp.backend.ovs_idl.transaction [req-fe795db1-bb36-4e90-bfef-b35a9cff0f7b - - - - -] Traceback (most recent call last):
                                          File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 122, in run
                                            txn.results.put(txn.do_commit())
                                          File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 115, in do_commit
                                            raise RuntimeError(msg)
                                        RuntimeError: OVSDB Error: The transaction failed because the IDL has been configured to require a database lock but didn't get it yet or has already lost it
                                        : RuntimeError: OVSDB Error: The transaction failed because the IDL has been configured to require a database lock but didn't get it yet or has already lost it
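
A quick way to see whether the Address Set -> Port Group migration ever completed is to compare both NB tables. This is a sketch, assuming ovn-nbctl can reach the NB DB; the 'as_ip4_'/'as_ip6_' naming convention is taken from the DelAddrSetCommand entries above. After a complete migration, no old-style security-group Address Sets should remain alongside the Port Groups:

    # Count leftover neutron security-group address sets vs. port groups
    # in the OVN Northbound DB.
    import subprocess

    def names(table):
        # --bare prints one value per line with no headers
        out = subprocess.check_output(
            ['ovn-nbctl', '--bare', '--columns=name', 'list', table])
        return out.decode().split()

    leftover = [n for n in names('Address_Set') if n.startswith('as_ip')]
    print('old-style security-group address sets: %d' % len(leftover))
    print('port groups: %d' % len(names('Port_Group')))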

I need to investigate why the DB lock has been lost.
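
For context on that lock, below is a minimal sketch of the python-ovs IDL locking model that produces exactly the "didn't get it yet or has already lost it" failure. The lock name and schema path are illustrative, not taken from the bug:

    import ovs.db.idl

    # Schema path varies by build; this is the pre-OVN-split location.
    helper = ovs.db.idl.SchemaHelper('/usr/share/openvswitch/ovn-nb.ovsschema')
    helper.register_all()

    idl = ovs.db.idl.Idl('tcp:127.0.0.1:6641', helper)
    # Once a lock is set, every transaction on this IDL requires it; only
    # one client may hold a given named lock at a time.
    idl.set_lock('example_maintenance_lock')  # illustrative name

    idl.run()  # processes server messages; updates lock state
    if idl.has_lock:
        txn = ovs.db.idl.Transaction(idl)
        # ... build row changes here ...
        print(txn.commit_block())  # succeeds while the lock is held
    elif idl.is_lock_contended:
        # Another client holds the lock. If the server connection drops
        # and a peer grabs the lock during the reconnect, commits fail
        # with NOT_LOCKED -- the RuntimeError seen above.
        print('lock contended; must not write')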

Comment 5 Jakub Libosvar 2022-01-11 14:14:53 UTC
I never found the reason why the lock got lost. Since OSP 13 is in ELS, I'm going to close this, but feel free to re-open if the problem is hit again.