Bug 2249079 - Overloaded OVN DB can break port binding process, leave port in inconsistent state and break other operations
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 16.1 (Train)
Hardware: All
OS: All
Priority: medium
Severity: high
Target Milestone: z4
Target Release: 17.1
Assignee: OSP Team
QA Contact: Bharath M V
URL:
Whiteboard:
Depends On:
Blocks: 2252218
 
Reported: 2023-11-10 15:13 UTC by Alex Stupnikov
Modified: 2024-11-21 09:39 UTC (History)

Fixed In Version: openstack-neutron-18.6.1-17.1.20240822200817.85ff760.el9ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2252218 (view as bug list)
Environment:
Last Closed: 2024-11-21 09:39:01 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 825428 0 None MERGED [ovn]Refusing to bind port to dead agent 2023-11-15 18:24:26 UTC
OpenStack gerrit 853479 0 None MERGED [OVN] Try to bind ports only to the ovn-controller agents 2023-11-15 18:24:25 UTC
OpenStack gerrit 901025 0 None NEW [ovn]Refusing to bind port to dead agent 2023-11-15 18:24:24 UTC
OpenStack gerrit 901026 0 None NEW [OVN] Try to bind ports only to the ovn-controller agents 2023-11-15 18:24:24 UTC
Red Hat Issue Tracker OSP-30358 0 None None None 2023-11-10 15:13:55 UTC
Red Hat Product Errata RHBA-2024:9974 0 None None None 2024-11-21 09:39:07 UTC

Description Alex Stupnikov 2023-11-10 15:13:11 UTC
Description of problem:

A Nova resize operation failed with "nova.exception.InternalError: Unexpected vif_type=binding_failed". The unexpected vif_type came from Neutron Server, which returned "'binding:vif_type': 'binding_failed'" for the port.

The Neutron Server logs suggest that the server tried to bind the port while OVN was unresponsive. The following message was logged 10 times in a row:

2023-11-07 19:40:39.481 39 DEBUG networking_ovn.ml2.mech_driver [req-d40a1acc-8016-4606-8b53-702571b161dd ] Refusing to bind port PORT_ID due to no OVN chassis for host: HYPERVISOR bind_port /usr/lib/python3.6/site-packages/networking_ovn/ml2/mech_driver.py:740

Judging by the OVN mech_driver code, the following sequence of calls appears to have caused this situation:
https://github.com/openstack/networking-ovn/blob/stable/train/networking_ovn/ml2/mech_driver.py#L852-L860
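The failure mode described above can be sketched roughly as follows. This is a minimal, self-contained model and not the networking-ovn code: `FakeOvnDb`, `OvnTimeout`, `get_chassis_for_host` and `bind_with_retries` are hypothetical names, and the retry count of 10 is taken only from the ten repeated log messages.

```python
class OvnTimeout(Exception):
    """Hypothetical stand-in for an OVN DB request that did not finish in time."""


class FakeOvnDb:
    """Toy OVN DB: maps hypervisor host names to chassis names."""

    def __init__(self, chassis_by_host, overloaded=False):
        self.chassis_by_host = chassis_by_host
        self.overloaded = overloaded
        self.lookups = 0

    def get_chassis_for_host(self, host):
        # Every bind attempt costs one round trip to the DB.
        self.lookups += 1
        if self.overloaded:
            raise OvnTimeout(host)
        return self.chassis_by_host.get(host)


def bind_port(db, host):
    """Return a vif_type on success, or None when binding is refused."""
    try:
        chassis = db.get_chassis_for_host(host)
    except OvnTimeout:
        # An overloaded DB is indistinguishable from "no chassis for host".
        chassis = None
    if chassis is None:
        return None
    return "ovs"


def bind_with_retries(db, host, max_tries=10):
    """Retry binding a bounded number of times, then give up (assumed count)."""
    for _ in range(max_tries):
        vif_type = bind_port(db, host)
        if vif_type is not None:
            return vif_type
    return "binding_failed"


healthy = FakeOvnDb({"hv1": "chassis-1"})
overloaded = FakeOvnDb({"hv1": "chassis-1"}, overloaded=True)
print(bind_with_retries(healthy, "hv1"))
print(bind_with_retries(overloaded, "hv1"))
```

In this model the chassis exists in both cases. The overloaded run fails only because each attempt depends on a live DB query, which is the behaviour the report attributes to the Train code path.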


I am not 100% sure, but newer branches do not appear to be affected by similar problems: they have a caching mechanism for agents (the AgentCache class), so the information is no longer fetched from OVN on every call. Example: https://github.com/openstack/neutron/blob/stable/yoga/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L959
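A rough illustration of why a cache helps is shown here. It is a sketch under stated assumptions, not the upstream AgentCache implementation: `SimpleAgentCache`, `record_heartbeat`, `is_alive` and the 60-second timeout are all hypothetical. The idea is that agent state is recorded when updates arrive, and the bind path reads only local memory.

```python
import time


class SimpleAgentCache:
    """Hypothetical in-memory record of when each host's agent was last seen."""

    def __init__(self, alive_timeout=60.0, clock=time.monotonic):
        self._alive_timeout = alive_timeout  # assumed value, for illustration
        self._clock = clock
        self._last_seen = {}

    def record_heartbeat(self, host):
        # Called when an update for the agent arrives, not at bind time.
        self._last_seen[host] = self._clock()

    def is_alive(self, host):
        seen = self._last_seen.get(host)
        if seen is None:
            return False
        return self._clock() - seen <= self._alive_timeout


def bind_port(cache, host):
    """Bind using cached agent state only; no DB round trip on this path."""
    return "ovs" if cache.is_alive(host) else None


now = [0.0]  # controllable clock, so the example is deterministic
cache = SimpleAgentCache(alive_timeout=60.0, clock=lambda: now[0])
cache.record_heartbeat("hv1")
now[0] = 30.0
print(bind_port(cache, "hv1"))
```

With this design a slow DB delays cache updates rather than failing individual bind requests. The trade-off is that the cache can be stale for up to the timeout.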

I understand that it is unrealistic to expect such a massive backport for RHOSP 16.1, but 17.1 looks like a good target and 16.2 would also benefit from it.


Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.1.3 GA (Train)


How reproducible:
Initiate an instance resize operation while the OVN DB is overloaded and cannot process some requests from Neutron Server in time.

Actual results:
The resize operation fails with "nova.exception.InternalError: Unexpected vif_type=binding_failed".

Expected results:
Resize operation is successful

Additional info: information about collected data and log extracts will be provided privately

Comment 24 errata-xmlrpc 2024-11-21 09:39:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHOSP 17.1.4 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:9974

