
Bug 2249079

Summary: Overloaded OVN DB can break port binding process, leave port in inconsistent state and break other operations
Product: Red Hat OpenStack
Component: openstack-neutron
Status: CLOSED ERRATA
Severity: high
Priority: medium
Version: 16.1 (Train)
Target Milestone: z4
Target Release: 17.1
Hardware: All
OS: All
Reporter: Alex Stupnikov <astupnik>
Assignee: OSP Team <rhos-maint>
QA Contact: Bharath M V <bmv>
CC: apevec, bmv, chrisw, ihrachys, jlibosva, lhh, majopela, mariel, scohen, skaplons, yocha
Keywords: Triaged
Fixed In Version: openstack-neutron-18.6.1-17.1.20240822200817.85ff760.el9ost
Doc Type: No Doc Update
Type: Bug
Last Closed: 2024-11-21 09:39:01 UTC
Bug Blocks: 2252218

Description Alex Stupnikov 2023-11-10 15:13:11 UTC
Description of problem:

A Nova resize operation failed with "nova.exception.InternalError: Unexpected vif_type=binding_failed". Nova got the unexpected vif_type from Neutron Server, which returned "'binding:vif_type': 'binding_failed'".

According to the Neutron Server logs, the server tried to bind the port while OVN was unresponsive, and the following message was logged 10 times in a row:

2023-11-07 19:40:39.481 39 DEBUG networking_ovn.ml2.mech_driver [req-d40a1acc-8016-4606-8b53-702571b161dd ] Refusing to bind port PORT_ID due to no OVN chassis for host: HYPERVISOR bind_port /usr/lib/python3.6/site-packages/networking_ovn/ml2/mech_driver.py:740

Judging by the OVN mech_driver code, the following sequence of calls caused this situation:
https://github.com/openstack/networking-ovn/blob/stable/train/networking_ovn/ml2/mech_driver.py#L852-L860


I am not 100% sure, but newer branches do not appear to be affected by similar problems because they have a caching mechanism for agents (the AgentCache class), so the information is no longer fetched from OVN on every call. Example: https://github.com/openstack/neutron/blob/stable/yoga/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L959
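
To illustrate why a cache would help here, below is a minimal Python sketch of the two lookup strategies. It is not the Neutron or networking-ovn code: FakeSouthboundDB, ChassisCache, bind_port_uncached and bind_port_cached are hypothetical names invented for this example, the "ovs" result is only an illustrative success value, and the real AgentCache may be populated quite differently. The sketch assumes that a timed-out lookup is treated the same way as "no chassis for host".

```python
class SlowDatabaseError(Exception):
    """Raised by the stand-in database when a query does not finish in time."""


class FakeSouthboundDB:
    """Hypothetical stand-in for the OVN DB; not a real OVN or Neutron API."""

    def __init__(self, chassis_by_host):
        self._chassis_by_host = dict(chassis_by_host)
        self.overloaded = False

    def get_chassis(self, host):
        # Simulate an overloaded DB that cannot answer in time.
        if self.overloaded:
            raise SlowDatabaseError("query timed out")
        return self._chassis_by_host.get(host)


def bind_port_uncached(db, host):
    """Query the DB on every bind attempt (assumed old behaviour)."""
    try:
        chassis = db.get_chassis(host)
    except SlowDatabaseError:
        # Assumption: a failed lookup looks like "no chassis for host".
        chassis = None
    return "ovs" if chassis else "binding_failed"


class ChassisCache:
    """Hypothetical in-memory cache filled while the DB is reachable."""

    def __init__(self, db):
        self._db = db
        self._entries = {}

    def refresh(self, hosts):
        for host in hosts:
            chassis = self._db.get_chassis(host)
            if chassis:
                self._entries[host] = chassis

    def get(self, host):
        # Served from memory; the DB is not contacted here.
        return self._entries.get(host)


def bind_port_cached(cache, host):
    """Bind using cached chassis data instead of a live DB query."""
    return "ovs" if cache.get(host) else "binding_failed"


if __name__ == "__main__":
    db = FakeSouthboundDB({"HYPERVISOR": "chassis-1"})
    cache = ChassisCache(db)
    cache.refresh(["HYPERVISOR"])  # done while the DB is healthy

    db.overloaded = True  # the DB stops answering in time
    print(bind_port_uncached(db, "HYPERVISOR"))
    print(bind_port_cached(cache, "HYPERVISOR"))
```

In this sketch the uncached path reports binding_failed as soon as the DB stops answering, while the cached path keeps working from data collected earlier. The trade-off is that cached data can become stale, so a real implementation needs some way to keep it in sync with the DB.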

I understand that it is unrealistic to expect such a massive backport for RHOSP 16.1, but 17.1 looks like a good target and 16.2 would also benefit from it.


Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.1.3 GA (Train)


How reproducible:
Initiate an instance resize operation while the OVN DB is overloaded and cannot process some requests from Neutron Server in time.

Actual results:
The resize operation fails with "nova.exception.InternalError: Unexpected vif_type=binding_failed"

Expected results:
Resize operation is successful

Additional info: information about collected data and log extracts will be provided privately

Comment 24 errata-xmlrpc 2024-11-21 09:39:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHOSP 17.1.4 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:9974