Description of problem:

This customer is running multiple sites (separate OSP deployments). The existing sites are operating normally and satisfy the 15ms latency requirements. In this new deployment, there is an intermittent issue where ~1% of traffic sees higher latency, upwards of 250ms.

Originally the issue was suspected to be isolated to a single hypervisor; however, it is now reported to occur on multiple hosts and even between VMs on a single host.

The sites were expected to be deployed with the same OSP versions; however, this new site is running a newer OVN version. The existing deployments without issues are running:

ovn-2021-21.12.0-103.el8fdp.x86_64

It is believed the sites are running the same openvswitch versions (to be confirmed).

The offending traffic is on the same L2 network. These are VLAN provider networks.

In Progress:
- The consultant working on this deployment is confirming all the deltas between these sites.
- I've recommended that the new site be rolled back to the same version as the existing sites to verify the issue.
- The customer has a script that tests and reports these latency issues. I've asked for tcpdumps at the source and destination hypervisors on the VM tap interfaces and external physical interfaces while reproducing the issue.

This is an escalation to involve the Neutron and OVN dev teams in this issue.

Version-Release number of selected component (if applicable):
OSP 16.2.4
ovn-2021-21.12.0-116.el8fdp.x86_64

How reproducible:
This specific deployment.

Steps to Reproduce:
1. As described above

Additional info:
Will add in private comments
I believe he means using the `coverage/show` request to ovn-appctl and ovs-appctl for the various components:

ovn-appctl -t <pidfile of ovn-controller> coverage/show
ovs-appctl -t <pidfile of ovs-vswitchd> coverage/show
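
If it helps when comparing output taken before and after a latency spike, here is a minimal Python sketch that diffs the `total:` values of two saved `coverage/show` snapshots. It assumes each counter line has the shape `<name> <rate>/sec <rate>/sec <rate>/sec total: <count>`, which may vary between OVS/OVN releases. The function names and the sample counters are hypothetical and are not taken from this deployment.

```python
import re

# Assumed shape of a counter line in `coverage/show` output; adjust the
# pattern if the release in use prints something different:
#   <name>   <rate>/sec   <rate>/sec   <rate>/sec   total: <count>
COUNTER_RE = re.compile(
    r"^(\S+)\s+\S+/sec\s+\S+/sec\s+\S+/sec\s+total:\s+(\d+)\s*$"
)


def parse_coverage(text):
    """Return {counter_name: total} for every counter line found in text."""
    totals = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line.strip())
        if match:
            totals[match.group(1)] = int(match.group(2))
    return totals


def coverage_delta(before, after):
    """Return {counter_name: increase} for counters that grew between snapshots.

    A counter missing from the first snapshot is treated as starting at 0.
    """
    old = parse_coverage(before)
    new = parse_coverage(after)
    return {
        name: total - old.get(name, 0)
        for name, total in new.items()
        if total > old.get(name, 0)
    }


# Made-up snapshots for illustration only (hypothetical counter names).
BEFORE = """\
Event coverage, avg rate over last: 5 seconds, last minute, last hour:
example_counter_a          0.0/sec     0.000/sec        0.0000/sec   total: 5
example_counter_b          2.0/sec     1.500/sec        0.9000/sec   total: 100
"""

AFTER = """\
Event coverage, avg rate over last: 5 seconds, last minute, last hour:
example_counter_a          0.0/sec     0.000/sec        0.0000/sec   total: 5
example_counter_b          8.0/sec     2.100/sec        0.9100/sec   total: 140
"""

if __name__ == "__main__":
    for name, increase in sorted(coverage_delta(BEFORE, AFTER).items()):
        print(f"{name}: +{increase}")
```

Saving the output of each command to a file right before and right after the customer's latency script reports a spike, then passing both files' contents to `coverage_delta`, would show which counters moved during that window.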