Bug 2229761

Summary: [OSP16.2 -> 17.1] Packet loss during controller upgrade for OVN
Product: Red Hat OpenStack
Reporter: Khomesh Thakre <kthakre>
Component: python-networking-ovn
Assignee: Lukas Bezdicka <lbezdick>
Status: ON_DEV ---
QA Contact: Khomesh Thakre <kthakre>
Severity: medium
Docs Contact:
Priority: urgent
Version: 17.1 (Wallaby)
CC: apevec, ekuris, jelynch, jpretori, lbezdick, lhh, lsvaty, majopela, mburns, pgrist, scohen
Target Milestone: z1
Keywords: Triaged
Target Release: 17.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
There is currently a known issue with a race condition in the deployment steps for `ovn_controller` and `ovn_dbs`, which causes `ovn_dbs` to be upgraded before `ovn_controller`. If `ovn_controller` is not upgraded before `ovn_dbs`, an error before the restart to the new version causes packet loss. There is an estimated one-minute network outage if the race condition occurs during the Open Virtual Network (OVN) upgrade. A fix is expected in a later RHOSP release.
Type: Bug

Description Khomesh Thakre 2023-08-07 15:05:14 UTC
Description of problem:

During the controller upgrade, when the OVN services start at step 3, ovn_dbs sometimes starts before ovn_controller, causing packet loss.

~~~
2023-08-03 16:30:38 | 2023-08-03 16:30:38.457 340702 INFO tripleoclient.v1.overcloud_upgrade.UpgradeRun [-] Completed Overcloud Major Upgrade Run.
2023-08-03 16:30:38 | 2023-08-03 16:30:38.457 340702 INFO osc_lib.shell [-] END return value: None
2023-08-03 16:30:38 | [Thu Aug  3 16:30:38 UTC 2023] Finished major upgrade for computehci-0,computehci-1,computehci-2,controller-0,controller-1,controller-2,database-0,database-1,database-2,messaging-0,messaging-1,messaging-2,networker-0,networker-1,undercloud hosts
2023-08-03 16:30:38 | 3120 packets transmitted, 3066 received, +15 errors, 1.73077% packet loss, time 3124473ms
2023-08-03 16:30:38 | rtt min/avg/max/mdev = 0.689/2.618/2077.599/41.846 ms, pipe 4
2023-08-03 16:30:38 | Ping loss higher than 1 % detected (2 %) 
~~~
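
For reference, the loss figure in the log above can be recomputed from the transmitted and received counts. The following is a minimal Python sketch, not the tooling that produced the log; the function name `ping_loss_percent` and the `THRESHOLD` constant are hypothetical, and the 1 % limit is taken from the "Ping loss higher than 1 %" message.

```python
import re


def ping_loss_percent(summary: str) -> float:
    """Return packet loss in percent, computed from a ping summary line."""
    match = re.search(r"(\d+) packets transmitted, (\d+) received", summary)
    if match is None:
        raise ValueError("not a ping summary line")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    return 100.0 * (sent - received) / sent


# Summary line copied from the log excerpt above.
summary = (
    "3120 packets transmitted, 3066 received, +15 errors, "
    "1.73077% packet loss, time 3124473ms"
)

THRESHOLD = 1.0  # percent; assumed from the "higher than 1 %" log message

loss = ping_loss_percent(summary)
print(f"{loss:.5f}% loss, exceeds threshold: {loss > THRESHOLD}")
```

With 3120 packets sent and 3066 received, 54 packets were lost, which matches the 1.73077% reported by ping and rounds to the 2 % shown in the final log line.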

Version-Release number of selected component (if applicable):
RHOSP 17.1 on RHEL 8 (Puddle RHOS-17.1-RHEL-8-20230802.n.1)

How reproducible:
Intermittent. The issue occurs whenever ovn_dbs starts before ovn_controller.