
Bug 1869295

Summary: Restarting ovn-controller should not interrupt connectivity
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Casey Callendrello <cdc>
Component: OVN
Assignee: Numan Siddique <nusiddiq>
Status: CLOSED NOTABUG
QA Contact: Jianlin Shi <jishi>
Severity: medium
Docs Contact:
Priority: medium
Version: RHEL 8.0
CC: ctrautma, dcbw, mmichels, nusiddiq
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-10-05 15:22:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
    Description          Flags
    ovn-northd backup    none
    ovn-northd backup    none

Description Casey Callendrello 2020-08-17 12:41:21 UTC
Description of problem: Restarting ovn-controller when there is an appreciable number of flows causes (new) connections to be interrupted.


Version-Release number of selected component (if applicable): ovn2.13-2.13.0-39.el7fdp.x86_64


How reproducible: Very reproducible; scale dependent


Steps to Reproduce:
1. Create a lot of flows. I did this by creating 100 pods, then creating 20 services that all referenced those pods. If you like, I can share a copy-and-paste reproducer.

2. In a new pod, run a simple loop. Something like
    while true; do curl http://<service ip>; sleep 0.5; done


3. Restart ovn-controller on the node hosting the curl loop, e.g.
    oc -n openshift-ovn-kubernetes delete pod ovnkube-node-dw9km

Actual results:

When ovn-controller restarts, new connections are interrupted for about 5 seconds in my test. And this is a small cluster.
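
A rough way to put a number on the interruption is to timestamp each probe and report the runs of consecutive failures. The following is only a minimal sketch under stated assumptions, not the tooling used for this report: the function names (`probe`, `outage_windows`, `run`) are hypothetical, the service is assumed to answer plain HTTP, and the 0.5 s interval simply mirrors the curl loop in the steps above.

```python
import http.client
import time
import urllib.request


def probe(url, timeout=1.0):
    """Return True if a fresh HTTP request to `url` succeeds within `timeout`."""
    try:
        # Each call opens a new connection, like each curl invocation does.
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, timeouts and connection resets are all OSError.
        return False


def outage_windows(samples):
    """Collapse time-ordered (timestamp, ok) samples into failure windows.

    A window starts at the first failed probe of a run and ends at the next
    successful probe (or at the last sample if the run never recovers).
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, samples[-1][0]))
    return windows


def run(url, interval=0.5, duration=60.0):
    """Probe `url` every `interval` seconds for `duration` seconds."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), probe(url)))
        time.sleep(interval)
    return outage_windows(samples)
```

Calling `run()` with the service URL from the pod while restarting ovn-controller returns a list of `(start, end)` pairs; `end - start` approximates each outage, with a resolution no better than the probe interval plus the timeout.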


Expected results: New connections (almost) always succeed.


Additional info:

Users at higher scale are punished much more by this, and can experience outages lasting tens of seconds. There is a thread about it on the ovs-devel / ovn-devel mailing lists.

Comment 1 Numan Siddique 2020-08-17 13:47:10 UTC
Hi Casey,

Can you please attach the OVN north db to the BZ?

I think that would be helpful both for reproducing the issue and for testing the fix.

Thanks

Comment 2 Casey Callendrello 2020-08-17 15:29:48 UTC
Created attachment 1711630
ovn-northd backup

Comment 3 Casey Callendrello 2020-08-17 15:40:24 UTC
Created attachment 1711634
ovn-northd backup

Comment 4 Dan Williams 2020-08-24 20:46:37 UTC
Wouldn't the recent discussions here https://mail.openvswitch.org/pipermail/ovs-discuss/2020-August/050520.html be relevant?

Maybe Han's recent patches for incremental flow installation are also relevant? http://patchwork.ozlabs.org/project/openvswitch/list/?series=197009

Comment 5 Mark Michelson 2023-10-05 15:22:14 UTC
I'm closing this since we have made great efforts since this issue was opened to ensure that delays on the dataplane are reduced as much as possible. In particular, Han Zhou has done some great things to reduce dataplane downtime, including:

* Using bundles for OpenFlow operations.
* Adding a delay before clearing flows on ovn-controller startup to ensure that the database state stabilizes.

As such, we have minimal downtime from upgrades and restarts compared to when this issue was opened.
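
To illustrate why bundling helps, here is a toy model (not OVS or ovn-controller code) contrasting a clear-then-reinstall sequence with an all-or-nothing replacement. `FlowTable`, `reinstall_naive` and `reinstall_bundled` are hypothetical names invented for this sketch; the only assumption taken from OpenFlow is that a bundle's changes become visible together when it is committed.

```python
class FlowTable:
    """Toy flow table that records the smallest size it ever had."""

    def __init__(self, flows=()):
        self.flows = set(flows)
        self.min_size_seen = len(self.flows)

    def _observe(self):
        # Stand-in for "what a packet arriving right now would see".
        self.min_size_seen = min(self.min_size_seen, len(self.flows))

    def clear(self):
        self.flows.clear()
        self._observe()

    def add(self, flow):
        self.flows.add(flow)
        self._observe()

    def apply_bundle(self, desired):
        # All changes take effect in one step: only the old or the
        # new table is ever observable, never an intermediate one.
        self.flows = set(desired)
        self._observe()


def reinstall_naive(table, desired):
    """Clear everything, then add flows one by one (leaves an empty window)."""
    table.clear()
    for flow in desired:
        table.add(flow)


def reinstall_bundled(table, desired):
    """Replace the table contents in a single atomic step."""
    table.apply_bundle(desired)
```

In this model the naive path passes through an empty table, which is the window in which new connections fail, while the bundled path never drops below the installed flow count. The startup delay mentioned above addresses a different part of the problem: it postpones the clear until the desired flows are known.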