Bug 2221927 - For some scenarios and deployments OVN DB sync script causes outages when executed in "repair" mode
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 16.2 (Train)
Hardware: All
OS: All
Priority: medium
Severity: high
Target Milestone: z6
Target Release: ---
Assignee: Rodolfo Alonso
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-07-11 10:09 UTC by Alex Stupnikov
Modified: 2023-07-21 09:57 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-21 09:57:32 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-26497 0 None None None 2023-07-11 10:13:07 UTC

Description Alex Stupnikov 2023-07-11 10:09:13 UTC
Description of problem:

While investigating OVN and Neutron DB in bug #2216702, the following problems were isolated (https://bugzilla.redhat.com/show_bug.cgi?id=2216702#c21):

* Ports found in Neutron but not in OVN: these ports can be deleted and recreated (if possible). This is an easy manual fix.
* Ports found in OVN but not in Neutron: these ports need to be deleted from the OVN DB (one of them is causing the issue reported in that BZ). To fix this manually, the customer can destroy the records:
    $ ovn-nbctl destroy Logical_Switch_Port <id>
* Router ports reported as "needs to be updated for networks changed": the DB sync tool must be executed in "repair" mode.
* Routers with static routes in OVN but not in Neutron: these routes can be deleted from the OVN database manually, but executing the DB sync tool in "repair" mode is highly recommended.
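
The first two categories above amount to set differences between the Neutron port IDs and the OVN Logical_Switch_Port names. A minimal sketch of that comparison follows. It assumes that networking-ovn names each Logical_Switch_Port after its Neutron port UUID, and that the two lists were exported beforehand, for example with `openstack port list -f value -c ID` and `ovn-nbctl --bare --columns=name list Logical_Switch_Port`. The helper name `only_in_first` and the file names are hypothetical. OVN also holds ports that have no Neutron counterpart (for example `provnet-...` localnet ports), so any candidate should be reviewed before it is destroyed.

```shell
# Sketch only: compare two files that hold one port ID per line.
# Hypothetical inputs:
#   neutron_ports.txt  port IDs known to Neutron
#   ovn_ports.txt      Logical_Switch_Port names known to OVN

only_in_first() {
    # Print every line of file $1 that does not appear, as a whole line,
    # in file $2. grep exits with status 1 when it prints nothing, which
    # is not an error here, so that status is mapped to success.
    grep -Fxv -f "$2" "$1" || [ "$?" -eq 1 ]
}

# Example usage (not executed here):
#   only_in_first neutron_ports.txt ovn_ports.txt   # in Neutron, not in OVN
#   only_in_first ovn_ports.txt neutron_ports.txt   # in OVN, not in Neutron
```

The output is only a list of candidates; the DB sync tool remains the reference for what it considers inconsistent.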

The customer addressed the first two issues, and the original bug was closed because the root cause was addressed. However, the customer remained concerned about the last two points, because the OVN DB sync script caused serious outages in their production deployment when executed in "repair" mode.

At this point we understand that there is no option other than running the OVN DB sync script in "repair" mode (possibly during a maintenance window). Even so, some improvements are needed to reduce the downtime.


Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.2.4 (Train)

How reproducible:
Use the OVN DB sync script in "repair" mode to address DB discrepancies
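
For reference, a hedged sketch of how the sync tool is usually invoked. The command name `neutron-ovn-db-sync-util` and the `--ovn-neutron_sync_mode` option come from the upstream networking-ovn documentation. The configuration file paths are the upstream defaults and are an assumption for this deployment, where the tool may also need to run inside the Neutron API container. The helper `sync_cmd` is hypothetical and only prints the command line, so that a "log" run (report only) can be reviewed before a "repair" run.

```shell
# Sketch only: print the sync tool command line for a given mode.
# The config file paths below are assumed upstream defaults.

sync_cmd() {
    # $1 is the sync mode: "log" reports discrepancies without changing
    # anything, "repair" modifies the OVN DB to match Neutron.
    case "$1" in
        log|repair) ;;
        *) echo "usage: sync_cmd log|repair" >&2; return 2 ;;
    esac
    echo "neutron-ovn-db-sync-util" \
        "--config-file /etc/neutron/neutron.conf" \
        "--config-file /etc/neutron/plugins/ml2/ml2_conf.ini" \
        "--ovn-neutron_sync_mode $1"
}

# Example usage (not executed here):
#   sync_cmd log      # print the report-only invocation
#   sync_cmd repair   # print the invocation that changes the OVN DB
```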

Actual results:
A severe outage is possible; extra steps may be required to restore tenant networking

Expected results:
Short outage or no outage at all

Comment 2 Rodolfo Alonso 2023-07-21 09:57:32 UTC
Hello Alex:

After discussing this with the Networking team, the conclusion is that the DB sync tool is a recovery tool that should be executed only when there are DB discrepancies. The tool should be run with minimal or no API activity: the Neutron API should not create or destroy resources while the DB sync tool is running. An article [1] explains how to block the Neutron API ports (using iptables) to prevent any interaction with the Neutron DB.

The recommendation is to execute the DB sync tool **only during a maintenance window** and never during normal operation. It is also advisable to follow the procedure in article [1] to block and unblock the Neutron API ports.

Regards.

[1] https://access.redhat.com/solutions/6775251
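
The sequence recommended in the comment above (block the Neutron API, run the sync tool in "repair" mode, unblock the API) could be sketched as follows. This is not the procedure from article [1], which may differ. It assumes that the Neutron API listens on TCP port 9696 (the Neutron default), that a plain INPUT DROP rule on each API node is enough to stop API traffic, and that the config file paths are the upstream defaults. The helper names and the DRY_RUN switch are hypothetical; with DRY_RUN=1 (the default here) the commands are only printed.

```shell
# Sketch only: wrap a "repair" run between blocking and unblocking the
# Neutron API port. The port, the iptables rule and the paths are assumptions.
NEUTRON_API_PORT="${NEUTRON_API_PORT:-9696}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

block_api()   { run iptables -I INPUT -p tcp --dport "$NEUTRON_API_PORT" -j DROP; }
unblock_api() { run iptables -D INPUT -p tcp --dport "$NEUTRON_API_PORT" -j DROP; }

repair_in_window() {
    # Stop here if the API could not be blocked.
    block_api || return 1
    run neutron-ovn-db-sync-util \
        --config-file /etc/neutron/neutron.conf \
        --config-file /etc/neutron/plugins/ml2/ml2_conf.ini \
        --ovn-neutron_sync_mode repair
    rc=$?
    # Always remove the blocking rule, even if the sync tool failed.
    unblock_api
    return "$rc"
}

# Example usage (not executed here):
#   repair_in_window             # dry run, prints the three commands
#   DRY_RUN=0 repair_in_window   # real run, requires root on the API node
```

Removing the rule regardless of the sync tool's exit status keeps a failed run from leaving the API blocked after the maintenance window.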

