Bug 2276671 - OVS introduces significant delay for some packets forwarded via ML2/OVN DPDK data path
Summary: OVS introduces significant delay for some packets forwarded via ML2/OVN DPDK data path
Keywords:
Status: CLOSED COMPLETED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 17.1 (Wallaby)
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: z4
Target Release: 17.1
Assignee: Robin Jarry
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On: 2293368 2293382 2317182
Blocks:
 
Reported: 2024-04-23 15:54 UTC by Alex Stupnikov
Modified: 2024-12-11 14:38 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-12-11 14:37:26 UTC
Target Upstream Version:
Embargoed:




Links
System                 ID         Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker  OSP-31957  0        None      None    None     2024-04-23 15:54:27 UTC

Description Alex Stupnikov 2024-04-23 15:54:11 UTC
Description of problem:
The customer deployed a fresh RHOSP 17.1 ML2/OVN NFV environment and hit an unusual problem during testing: a small subset of packets is forwarded with much higher delays (hundreds of milliseconds instead of microseconds).

Originally the problem was reproduced between VMs hosted on different computes, but the setup was later simplified: the customer connected both VMs to provider networks, used the network switch attached to the computes as the router, and shut down the second DPDK slave of the bond. tcpdump was then collected on both VMs' VHU interfaces AND on the active DPDK slave.

From the tcpdumps we can clearly see that on the DPDK slave the ICMP echo reply arrives from the switch roughly 35 us after the ICMP echo request. But in the tcpdump taken on the VHU interface of the VM running ping, the delay between the ICMP echo request and the ICMP echo reply is sometimes as high as 177 ms.
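
For anyone re-checking the pcaps, a minimal way to extract the request/reply deltas (the file name is a placeholder; the fields are the standard Wireshark/tshark ICMP ones):

# tshark -r vhu-vm1.pcap -Y icmp -T fields -e frame.time_epoch -e icmp.type -e icmp.seq

Pairing type 8 (echo request) with type 0 (echo reply) on the same icmp.seq and subtracting the epoch timestamps should reproduce the numbers above: ~35 us on the DPDK slave, with outliers up to ~177 ms on the VHU side.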

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 17.1.2 (Wallaby)


How reproducible:
Please see description.

Actual results:
It looks like openvswitch introduces significant delays when transmitting ICMP echo replies.

Expected results:
openvswitch doesn't introduce random significant delays for a subset of forwarded packets.

Additional info:
Will be provided privately

Comment 3 David Hill 2024-04-25 12:25:15 UTC
Was this reproduced with port security disabled?

Comment 33 Flavio Leitner 2024-05-09 17:44:01 UTC
Hello,

I've reviewed the data provided and identified a couple of things during the data capture. These are the worst scheduling latency times in milliseconds:
68.919, 66.702, 20.593, 17.560, 12.640, 5.401, 4.884, 4.295, 3.792, 3.685.

Focusing on the two worst cases, I can see this:
First case:
           time    cpu  task name                       wait time  sch delay   run time
7304045.209460 [0059]  pmd-c59/id:16[6162/3848]           68.919      0.000      0.168

$ perf script
   pmd-c59/id:16    6162 [059] 7304045.140372:       sched:sched_switch: pmd-c59/id:16:6162 [110] S ==> swapper/59:0 [120]
         swapper       0 [059] 7304045.209292:       sched:sched_switch: swapper/59:0 [120] R ==> pmd-c59/id:16:6162 [110]
   pmd-c59/id:16    6162 [059] 7304045.209460:       sched:sched_switch: pmd-c59/id:16:6162 [110] S ==> swapper/59:0 [120]

>>> 7304045.209292 - 7304045.140372
0.06892000045627356

and then the second case:
$ perf sched timehist
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
 7304045.209319 [0002]  pmd-c02/id:21[6158/3848]           66.702      0.005      0.016


$ perf script
   pmd-c02/id:21    6158 [002] 7304045.142599:       sched:sched_switch: pmd-c02/id:21:6158 [110] S ==> swapper/2:0 [120]
         swapper       0 [002] 7304045.209302:       sched:sched_switch: swapper/2:0 [120] R ==> pmd-c02/id:21:6158 [110]
   pmd-c02/id:21    6158 [002] 7304045.209319:       sched:sched_switch: pmd-c02/id:21:6158 [110] S ==> swapper/2:0 [120]


>>> 7304045.209302 - 7304045.142599
0.06670300010591745

What I can see is that the PMD thread processing the traffic is getting preempted and it takes that amount of time to run again.
We may need to trace some other parts to understand better what is causing the preemption and why it is taking that long to run it again.
I don't see a sched_waking event indicating the PMD thread was blocked on something.
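
To see what else ran on those CPUs during the gaps, one option (a sketch only; the timestamps are copied from the first case above, and -C/--time are standard perf script filters) is:

# perf script -i perf.data -C 59 --time 7304045.140372,7304045.209292

That should list any other task or tool scheduled on CPU 59 while the PMD thread was off-CPU.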

I also noted that sometimes there is a delay between the wake-up event and the PMD thread actually starting to run, and this additional latency is caused by a tool called "turbostat" running and monitoring the same CPU. Can we stop this tool to reduce the noise while we are trying to find other sources of latency?
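
A quick way to check whether turbostat is still running and which CPUs it is allowed to use (plain pgrep/taskset; -n just picks the newest matching PID):

# pgrep -a turbostat
# taskset -cp $(pgrep -n turbostat)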

The previous comment #32 suggests this could be FW-related, and that could very well be the case.
We should rule that out too before we continue with traces.

Once turbostat and the NIC FW are ruled out, please share fresh ping output so we can see how the issue manifests over time (pattern, spikes, averages, etc.).
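
For that capture, something like the following is convenient because -D prefixes every reply with a timestamp, which makes it easy to correlate with the perf data (destination and interval are placeholders):

# ping -D -i 0.2 <peer-VM-IP> | tee ping-with-timestamps.log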

Thanks,
fbl

Comment 38 Flavio Leitner 2024-05-10 14:37:27 UTC
Hi,

We need confirmation on whether LSC mode is being used and
that it is not blocking. The idea is to re-run the previous
debugging steps. Then we need a fresh latency capture and the
ping output so we can correlate them. Same with the CPU traces.
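
To check the LSC setting, the per-port option can be read from OVSDB; on OVS-DPDK the relevant knob should be options:dpdk-lsc-interrupt (an absent or false value means the default polling mode):

# ovs-vsctl --columns=name,options list Interface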

Please follow the debugging recipe below.

There are 3 tests, so keep the data separate and identified
so we can match which file/output belongs to each test.

1) Run the following commands before starting the reproducer
   to add all the probe points (a quick way to verify them is
   shown after step 8).

# perf probe -x $(which ovs-vswitchd) bond_run_enter=bond_run
# perf probe -x $(which ovs-vswitchd) bond_run_lock=bond_run:6
# perf probe -x $(which ovs-vswitchd) bond_run_unlock=bond_run:43
# perf probe -x $(which ovs-vswitchd) bond_check_admissibility_enter=bond_check_admissibility
# perf probe -x $(which ovs-vswitchd) bond_check_admissibility_lock=bond_check_admissibility:8
# perf probe -x $(which ovs-vswitchd) bond_check_admissibility_unlock=bond_check_admissibility:104
# perf probe -x $(which ovs-vswitchd) mlx5_link_enter=mlx5_link_update
# perf probe -x $(which ovs-vswitchd) mlx5_link_exit=mlx5_link_update%return
# perf probe -x $(which ovs-vswitchd) rte_eth_link_get_nowait_enter=rte_eth_link_get_nowait
# perf probe -x $(which ovs-vswitchd) rte_eth_link_get_nowait_update=rte_eth_link_get_nowait:19
# perf probe -x $(which ovs-vswitchd) rte_eth_link_get_nowait_exit=rte_eth_link_get_nowait%return

2) Capture a sosreport and take note that this sosreport is before the capture w/ probes.

3) The command below will start recording the events for 10min (600 secs).
   Ideally, we should see a few samples during that time, otherwise
   we may need to increase that. However, recording data for too long may
   capture a huge amount of data and then it is better to stop the
   current capture (CTRL+C) and start a new one:

# perf record -e 'probe_ovs:*' -aR sleep 600

4) Reproduce the issue at least one time. The more events, the better.
   Please record the ping output as well.

5) Stop the recording from step (#3) using CTRL+C

6) The command below restores the system by deleting all the probes from step #1
# perf probe -d 'probe_ovs:*'

7) Capture a fresh sosreport. Take note this sosreport is after the capture w/ probes.

8) Upload the 2 sosreports, the perf data, and the ping output.
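
As noted in step 1, the probes can be verified before recording, and the resulting data sanity-checked afterwards (plain perf commands, nothing specific to this setup):

# perf probe -l
# perf script -i perf.data | head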


Second debugging session
------------------------
1) Start capturing the scheduler latency stats.
   Same as before, this will run for 10min (600 secs).

# perf sched record -a sleep 600

2) Reproduce the issue at least one time. The more events, the better.
   Please record the ping output as well.

3) Stop the recording from step (#1) using CTRL+C

4) Upload the perf data file and the ping output.
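
For reference, the latency numbers in comment 33 came from perf sched timehist, so the same post-processing will be applied to the new capture; a quick way to pull out the worst wait times (column 4 in the default layout shown earlier) is:

# perf sched timehist -i perf.data | sort -k 4 -nr | head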


Third debugging session
-----------------------
1) Start capturing the generic traces on the CPUs running the PMDs.
   (I am assuming it is the same system as previously where we
    have PMD threads running on CPUs 2, 3, 58, and 59)

# perf record -C 2,3,58,59

2) Reproduce the issue at least one time. The more events, the better.
   Please record the ping output as well.

3) Stop the recording from step (#1) using CTRL+C

4) Upload the perf data file and the ping output.
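
Once uploaded, the main question for this data set is simply what was running on those cores; a typical first pass (standard perf report sort keys) is:

# perf report -i perf.data --sort=cpu,comm,dso,symbol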

Thanks
fbl

