Bug 1869295

Summary: Restarting ovn-controller should not interrupt connectivity
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Casey Callendrello <cdc>
Component: OVN
Assignee: Numan Siddique <nusiddiq>
Status: NEW
QA Contact: Jianlin Shi <jishi>
Severity: medium
Priority: medium
Version: RHEL 8.0
CC: ctrautma, dcbw, mmichels, nusiddiq
Hardware: Unspecified
OS: Unspecified
Type: Bug
Attachments:
Description          Flags
ovn-northd backup    none
ovn-northd backup    none

Description Casey Callendrello 2020-08-17 12:41:21 UTC
Description of problem: Restarting ovn-controller when an appreciable number of flows are installed causes new connections to be interrupted.


Version-Release number of selected component (if applicable): ovn2.13-2.13.0-39.el7fdp.x86_64


How reproducible: Very reproducible; scale dependent


Steps to Reproduce:
1. Create a lot of flows. I did this by creating 100 pods, then creating 20 services that all referenced those pods. If you like, I can share a copy-and-paste reproducer.

2. In a new pod, run a simple loop. Something like
    while true; do curl http://<service ip>; sleep 0.5; done


3. Restart ovn-controller on the node hosting the curl pod, e.g.
    oc -n openshift-ovn-kubernetes delete pod ovnkube-node-dw9km

Actual results:

When ovn-controller restarts, new connections are interrupted for, in my test, about 5 seconds. And this is a small cluster.
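
To put a number on the interruption, the curl loop from step 2 can log a timestamped result per attempt, and the log can then be reduced to the longest outage. The following is only a sketch under stated assumptions, not the reproducer used for this report: the function names `probe` and `longest_outage`, the log file name, and the 1-second curl timeout are all hypothetical choices, and the resolution is limited to whole seconds plus the 0.5 s sleep between attempts.

```shell
#!/bin/sh
# Hypothetical helpers for timing the connectivity gap; names are illustrative.

# probe CMD...: run CMD once and print "<epoch-seconds> ok" or "<epoch-seconds> FAIL".
probe() {
    if "$@" >/dev/null 2>&1; then
        echo "$(date +%s) ok"
    else
        echo "$(date +%s) FAIL"
    fi
}

# longest_outage: read probe lines on stdin and print the longest outage in
# seconds, measured from the first FAIL of a run to the next ok (or to the
# last FAIL if the log ends while still failing).
longest_outage() {
    awk '
        $2 == "FAIL" { if (start == "") start = $1; last = $1; next }
        $2 == "ok"   { if (start != "") { d = $1 - start; if (d > max) max = d }
                       start = "" }
        END          { if (start != "") { d = last - start; if (d > max) max = d }
                       print max + 0 }
    '
}

# Example usage (assumed timeout of 1 second; substitute the real service IP):
#   while true; do probe curl -s -m 1 http://<service ip>; sleep 0.5; done | tee probe.log
#   longest_outage < probe.log
```

Run the loop before restarting ovn-controller, stop it once connections succeed again, and feed the saved log to `longest_outage`.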


Expected results: New connections (almost) always succeed.


Additional info:

Users at higher scale are punished much more by this and can experience outages lasting tens of seconds. There is a thread about it on the ovs-devel / ovn-devel mailing lists.

Comment 1 Numan Siddique 2020-08-17 13:47:10 UTC
Hi Casey,

Can you please attach the OVN north db to the BZ?

I think that would be helpful in reproducing the issue and while testing the fix.

Thanks

Comment 2 Casey Callendrello 2020-08-17 15:29:48 UTC
Created attachment 1711630
ovn-northd backup

Comment 3 Casey Callendrello 2020-08-17 15:40:24 UTC
Created attachment 1711634
ovn-northd backup

Comment 4 Dan Williams 2020-08-24 20:46:37 UTC
Wouldn't the recent discussions here https://mail.openvswitch.org/pipermail/ovs-discuss/2020-August/050520.html be relevant?

Maybe Han's recent patches for incremental flow installation are also relevant? http://patchwork.ozlabs.org/project/openvswitch/list/?series=197009