Bug 2276671
| Summary: | OVS introduces significant delay for some packets forwarded via ML2/OVN DPDK data path | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Alex Stupnikov <astupnik> |
| Component: | openvswitch | Assignee: | Robin Jarry <rjarry> |
| Status: | CLOSED COMPLETED | QA Contact: | Eran Kuris <ekuris> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 17.1 (Wallaby) | CC: | apevec, cfontain, chrisbro, chrisw, dhill, dmarchan, echaudro, fleitner, hakhande, jhardee, jmarti, kurathod, njohnston, parthee, riramos, rjarry, sapaul, ssigwald, sukar, xili |
| Target Milestone: | z4 | Keywords: | Triaged |
| Target Release: | 17.1 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-12-11 14:37:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2293368, 2293382, 2317182 | | |
| Bug Blocks: | | | |
Description
Alex Stupnikov
2024-04-23 15:54:11 UTC
Was this reproduced with port security disabled?
Hello,
I've reviewed the data provided and identified a couple of things in the capture. These are the worst scheduling latency times, in milliseconds:
68.919, 66.702, 20.593, 17.560, 12.640, 5.401, 4.884, 4.295, 3.792, 3.685.
Focusing on the two worst cases, I can see this:
First case:
$ perf sched timehist
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
7304045.209460 [0059] pmd-c59/id:16[6162/3848] 68.919 0.000 0.168
$ perf script
pmd-c59/id:16 6162 [059] 7304045.140372: sched:sched_switch: pmd-c59/id:16:6162 [110] S ==> swapper/59:0 [120]
swapper 0 [059] 7304045.209292: sched:sched_switch: swapper/59:0 [120] R ==> pmd-c59/id:16:6162 [110]
pmd-c59/id:16 6162 [059] 7304045.209460: sched:sched_switch: pmd-c59/id:16:6162 [110] S ==> swapper/59:0 [120]
>>> 7304045.209292 - 7304045.140372
0.06892000045627356
and then the second case:
$ perf sched timehist
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
7304045.209319 [0002] pmd-c02/id:21[6158/3848] 66.702 0.005 0.016
$ perf script
pmd-c02/id:21 6158 [002] 7304045.142599: sched:sched_switch: pmd-c02/id:21:6158 [110] S ==> swapper/2:0 [120]
swapper 0 [002] 7304045.209302: sched:sched_switch: swapper/2:0 [120] R ==> pmd-c02/id:21:6158 [110]
pmd-c02/id:21 6158 [002] 7304045.209319: sched:sched_switch: pmd-c02/id:21:6158 [110] S ==> swapper/2:0 [120]
>>> 7304045.209302 - 7304045.142599
0.06670300010591745
What I can see is that the PMD thread processing the traffic is getting preempted and it takes that amount of time to run again.
We may need to trace some other parts to understand better what is causing the preemption and why it is taking that long to run it again.
I don't see a sched_waking event indicating the PMD thread was blocked on something.
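Just for reference, a quick way to re-check that from the same capture is
something like the lines below (a rough sketch on my side, assuming the
sched capture is in perf.data in the current directory; depending on the
kernel the event may show up as sched_waking or sched_wakeup):
# perf script -i perf.data | grep -E 'sched_wak(ing|eup).*pmd-c59' | head
# perf script -i perf.data | grep -E 'sched_wak(ing|eup).*pmd-c02' | head
If those come back empty around the long gaps, it matches what I see here:
no wake-up event, so the threads were not blocked waiting on something.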
I also noted that sometimes there is a delay between the wake-up event and the PMD thread actually starting to run. This additional latency is caused by a tool called "turbostat" that is running and monitoring the same CPU. Can we stop this tool to reduce the noise while we are trying to find other sources of latency?
The previous comment #32 suggests this could be FW related, and that could very well be the case.
We should rule that out too before we continue with traces.
Once turbostat and the NIC FW are ruled out, please share a fresh ping output so we can see how the issue manifests over time (pattern, spikes, averages, etc.).
Thanks,
fbl
Hi,
We need confirmation of whether LSC mode is being used and
that it is not blocking. The idea is to re-run the previous
debugging steps. Then we need a fresh latency capture and the
ping output so we can correlate them, and the same for the CPU traces.
Please follow the debugging recipe below.
There are 3 tests, so keep the data separate and identified
so we can match which file/output belongs to each test.
1) Run the following commands before starting the reproducer
to add all the probe points.
# perf probe -x $(which ovs-vswitchd) bond_run_enter=bond_run
# perf probe -x $(which ovs-vswitchd) bond_run_lock=bond_run:6
# perf probe -x $(which ovs-vswitchd) bond_run_unlock=bond_run:43
# perf probe -x $(which ovs-vswitchd) bond_check_admissibility_enter=bond_check_admissibility
# perf probe -x $(which ovs-vswitchd) bond_check_admissibility_lock=bond_check_admissibility:8
# perf probe -x $(which ovs-vswitchd) bond_check_admissibility_unlock=bond_check_admissibility:104
# perf probe -x $(which ovs-vswitchd) mlx5_link_enter=mlx5_link_update
# perf probe -x $(which ovs-vswitchd) mlx5_link_exit=mlx5_link_update%return
# perf probe -x $(which ovs-vswitchd) rte_eth_link_get_nowait_enter=rte_eth_link_get_nowait
# perf probe -x $(which ovs-vswitchd) rte_eth_link_get_nowait_update=rte_eth_link_get_nowait:19
# perf probe -x $(which ovs-vswitchd) rte_eth_link_get_nowait_exit=rte_eth_link_get_nowait%return
2) Capture a sosreport and take note that this sosreport is before the capture w/ probes.
3) The command below will start recording the events for 10min (600 secs).
Ideally, we should see a few samples during that time; otherwise
we may need to increase it. However, recording for too long may
capture a huge amount of data, in which case it is better to stop the
current capture (CTRL+C) and start a new one:
# perf record -e 'probe_ovs:*' -aR sleep 600
4) Reproduce the issue at least one time. The more events, the better.
Please record the ping output as well.
5) Stop the recording from step (#3) using CTRL+C
6) The command below restores the system by deleting all the probes from step #1
# perf probe -d 'probe_ovs:*'
7) Capture a fresh sosreport. Take note this sosreport is after the capture w/ probes.
8) Upload the two sosreports, the perf data, and the ping output.
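As an optional sanity check (a sketch on my side, assuming the capture file
is the default perf.data in the current directory), the probes can be listed
right after step 1 and the recorded events skimmed before uploading, to
confirm the probes were added and actually fired:
# perf probe -l
# perf script -i perf.data | grep 'probe_ovs:' | head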
Second debugging session
------------------------
1) Start capturing the scheduler latency stats.
Same as before, this will run for 10min (600 secs).
# perf sched record -a sleep 600
2) Reproduce the issue at least one time. The more events, the better.
Please record the ping output as well.
3) Stop the recording from step (#1) using CTRL+C
4) Upload the perf data file and the ping output.
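Once we have that data, as a first pass (a sketch; the wait time is the 4th
column in the timehist output quoted earlier, but the exact layout may vary
between perf versions), the largest PMD wait times can be pulled out with:
# perf sched timehist -i perf.data | grep 'pmd-c' | sort -rn -k4 | head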
Third debugging session
-----------------------
1) Start capturing the generic traces on the CPUs running the PMDs.
(I am assuming it is the same system as previously where we
have PMD threads running on CPUs 2, 3, 58, and 59)
# perf record -C 2,3,58,59
2) Reproduce the issue at least one time. The more events, the better.
Please record the ping output as well.
3) Stop the recording from step (#1) using CTRL+C
4) Upload the perf data file and the ping output.
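If it helps, a quick way to see what else ran on those CPUs during the
capture (a sketch, assuming the default perf.data output file and a
reasonably recent perf) is:
# perf report -i perf.data --sort cpu,comm,dso --stdio | head -n 60
# perf script -i perf.data -C 59 | less
That should show whether something other than the PMD threads is taking
CPU time there.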
Thanks
fbl