Bug 1869295
| Summary: | Restarting ovn-controller should not interrupt connectivity | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Casey Callendrello <cdc> |
| Component: | OVN | Assignee: | Numan Siddique <nusiddiq> |
| Status: | CLOSED NOTABUG | QA Contact: | Jianlin Shi <jishi> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | RHEL 8.0 | CC: | ctrautma, dcbw, mmichels, nusiddiq |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-10-05 15:22:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Hi Casey,

Can you please attach the OVN north db to the BZ? I think that would be helpful in reproducing the issue and while testing the fix.

Thanks

Created attachment 1711630 [details]
ovn-northd backup
Created attachment 1711634 [details]
ovn-northd backup
Wouldn't the recent discussions here https://mail.openvswitch.org/pipermail/ovs-discuss/2020-August/050520.html be relevant?

Also maybe Han's recent patches for incremental flow installation are also relevant? http://patchwork.ozlabs.org/project/openvswitch/list/?series=197009

I'm closing this since we have made great efforts since this issue was opened to ensure that delays on the dataplane are reduced as much as possible. In particular, Han Zhou has done some great things to reduce dataplane downtime, including

* Using bundles for openflow operations.
* Adding a delay before clearing flows on ovn-controller startup to ensure that the database state stabilizes.

As such, we have minimal downtime from upgrades and restarts compared to when this issue was opened.
Description of problem:
Restarting ovn-controller when there are an appreciable number of flows causes (new) connections to be interrupted.

Version-Release number of selected component (if applicable):
ovn2.13-2.13.0-39.el7fdp.x86_64

How reproducible:
Very reproducible; scale dependent

Steps to Reproduce:
1. Create a lot of flows. I did this by creating 100 pods, then creating 20 services that all referenced those pods. If you like, I can share a copy-and-paste reproducer.
2. In a new pod, run a simple loop. Something like
   while true; do curl http://<service ip>; sleep 0.5; done
3. Restart ovn-controller on the node hosting the curl, e.g.
   oc -n openshift-ovn-kubernetes delete pod ovnkube-node-dw9km

Actual results:
When ovn-controller restarts, new connections are interrupted for, in my test, about 5 seconds. And this is a small cluster.

Expected results:
New connections (almost) always succeed.

Additional info:
Users at higher scale are punished much more by this, and can experience outages in the tens of seconds. There is a thread about it on the ovs-devel / ovn-devel mailing lists.
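
To put a number on the interruption rather than eyeballing curl output, the loop from step 2 could be replaced with a small probe along these lines. This is only a sketch and not part of the original reproducer: the names `probe_once`, `collect` and `longest_outage` are made up for illustration, and the 0.5 second interval and 1 second timeout are assumptions chosen to mirror the curl loop.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import List, Tuple

# (monotonic timestamp in seconds, whether the probe succeeded)
Sample = Tuple[float, bool]


def probe_once(url: str, timeout: float = 1.0) -> bool:
    """Open one new connection to url; True if an HTTP response came back."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with 4xx/5xx, so the connection itself worked.
        return True
    except (OSError, http.client.HTTPException):
        # Refused, reset, timed out, or otherwise failed.
        return False


def collect(url: str, duration: float = 60.0, interval: float = 0.5) -> List[Sample]:
    """Probe url every `interval` seconds for roughly `duration` seconds."""
    samples: List[Sample] = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), probe_once(url)))
        time.sleep(interval)
    return samples


def longest_outage(samples: List[Sample]) -> float:
    """Longest time from the first failed probe of a run to the next success.

    A run of failures still open at the end of the samples is measured
    up to the last sample.
    """
    longest = 0.0
    start = None  # timestamp of the first failure in the current run
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            longest = max(longest, ts - start)
            start = None
    if start is not None:
        longest = max(longest, samples[-1][0] - start)
    return longest
```

Run `collect()` against the service IP from inside the pod, restart ovn-controller as in step 3 while it is running, then pass the result to `longest_outage()`. The resolution is limited by the probe interval plus the timeout, so treat the result as approximate.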