Bug 2226874

Summary: [OSP 16.2][OVN]Intermittent UDP latency on OSP site running ovn-2021-21.12.0-116.el8fdp.x86_64
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Matt Flusche <mflusche>
Component: ovn-2021
Assignee: Ales Musil <amusil>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Jianlin Shi <jishi>
Severity: high
Docs Contact:
Priority: high
Version: RHEL 8.0
CC: aasmith, anbs, aslaught, bcafarel, ctrautma, dabrown, dalvarez, dcbw, dceara, dhellard, ealcaniz, eshulman, gurpsing, hakhande, ihrachys, imahmed, jiji, johender, mmichels, vkhitrin
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2227901
Environment:
Last Closed: 2023-08-17 14:10:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2227901    

Description Matt Flusche 2023-07-26 18:26:30 UTC
Description of problem:

This customer is running multiple sites (separate OSP deployments). The existing sites are operating normally and satisfy the 15ms latency requirements.  In this new deployment, there is an intermittent issue where ~1% of traffic is seeing higher latency - upwards of 250ms.

Originally it was suspected the issue was isolated to a single hypervisor; however, it is now reported to occur on multiple hosts and even between VMs on a single host.

It was expected that the sites were deployed with the same OSP versions; however, this new site is running a newer OVN version.

The existing deployments without issues are running: ovn-2021-21.12.0-103.el8fdp.x86_64

It is believed that the sites are running the same openvswitch versions (to be confirmed).

The offending traffic is on the same L2 network.  These are VLAN provider networks.

In Progress:

- The consultant working on this deployment is confirming all the differences between these sites.
- I've recommended rolling the new site back to the same version as the existing sites to verify whether the issue persists.
- The customer has a script that tests and reports these latency issues. I've asked for tcpdumps at the source and destination hypervisors, on the VM tap interfaces and the external physical interfaces, while reproducing the issue.
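
The customer's test script is not attached to this report. As a purely illustrative sketch of how such results could be summarized against the 15ms requirement, the Python below assumes round-trip times have already been collected in milliseconds. The function name `summarize_latency` and the sample values are hypothetical and are not taken from the customer's tooling.

```python
from typing import Dict, Sequence


def summarize_latency(samples_ms: Sequence[float],
                      threshold_ms: float = 15.0) -> Dict[str, float]:
    """Summarize round-trip samples against a latency requirement.

    threshold_ms defaults to the 15ms requirement mentioned in this report.
    """
    if not samples_ms:
        raise ValueError("no samples to summarize")
    # A sample counts as slow only if it exceeds the requirement.
    slow = [s for s in samples_ms if s > threshold_ms]
    return {
        "count": len(samples_ms),
        "slow_count": len(slow),
        "slow_percent": 100.0 * len(slow) / len(samples_ms),
        "max_ms": max(samples_ms),
    }


if __name__ == "__main__":
    # Hypothetical data: 99 fast probes and one 250ms outlier.
    samples = [2.0] * 99 + [250.0]
    print(summarize_latency(samples))
```

Reporting the slow percentage alongside the maximum makes it easier to compare runs before and after a rollback, since the issue is described as affecting only a small share of traffic.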


This is an escalation to involve Neutron and OVN dev teams in this issue.


Version-Release number of selected component (if applicable):
OSP 16.2.4
ovn-2021-21.12.0-116.el8fdp.x86_64


How reproducible:
This specific deployment.


Steps to Reproduce:
1. As described above

Additional info:
Will add in private comments

Comment 24 Dan Williams 2023-08-01 15:37:00 UTC
I believe he means using the `coverage/show` request to ovn-appctl and ovs-appctl for various components.

ovn-appctl -t <pidfile of ovn-controller> coverage/show
ovs-appctl -t <pidfile of ovs-vswitchd> coverage/show
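
To compare counters before and after reproducing the latency, two saved `coverage/show` outputs can be diffed. The sketch below is an assumption-laden example, not part of the request above: it assumes each counter line ends in `total: <n>`, and the counter names in the usage section are made up for illustration. The helper names `parse_totals` and `diff_totals` are hypothetical.

```python
import re
from typing import Dict

# Assumed line shape: "<counter> <rate>/sec <rate>/sec <rate>/sec total: <n>".
# Lines that do not end in "total: <n>" (headers, summaries) are ignored.
_LINE = re.compile(r"^(\S+)\s+.*\btotal:\s*(\d+)\s*$")


def parse_totals(text: str) -> Dict[str, int]:
    """Extract {counter_name: total} from saved coverage/show output."""
    totals: Dict[str, int] = {}
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            totals[match.group(1)] = int(match.group(2))
    return totals


def diff_totals(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return only the counters whose total changed between two snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] != before.get(name, 0)
    }


if __name__ == "__main__":
    # Made-up snapshots; real output would come from the commands above.
    before = "example_a  0.0/sec 0.000/sec 0.0000/sec   total: 4\n"
    after = "example_a  1.0/sec 0.100/sec 0.0100/sec   total: 9\n"
    print(diff_totals(parse_totals(before), parse_totals(after)))
```

Diffing totals rather than reading the rate columns avoids depending on when each snapshot was taken relative to the averaging windows.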