Bug 2270408

Summary: 20% performance regression on XDP drop with mlx5 in ELN kernels (compared to RHEL 9 candidates)
Product: [Fedora] Fedora
Reporter: Samuel Dobroň <sdobron>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED MIGRATED
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: unspecified
Version: rawhide
CC: acaringi, adscvr, airlied, alciregi, aokuliar, atomasov, bskeggs, dzickus, hdegoede, hladky.jiri, hpa, jarod, jhladky, josef, kernel-maint, linville, masami256, mchehab, ptalbert, scweaver, steved
Target Milestone: ---
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Linux
Last Closed: 2024-07-22 08:37:03 UTC

Description Samuel Dobroň 2024-03-20 07:51:18 UTC
1. Please describe the problem:
We compared the performance of ELN and RHEL 9 candidate kernels and noticed a significant regression in XDP drop [1] performance on mlx5 (25G).

On any RHEL 9 candidate kernel we are able to drop 19-20M pkts/sec, but on ELN kernels we reach just 15M pkts/sec (CPU utilization remains the same, around 100%).

We don't see this regression on ixgbe or i40e.


[1] https://github.com/xdp-project/xdp-tools/tree/master/xdp-bench#the-drop-command
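
For reference, the size of the regression can be worked out from the rates quoted above. This is only a sketch: the helper name regression_pct is made up for illustration, and the inputs are the rounded figures from this report (19-20M pkts/sec on RHEL 9 candidates, 15M pkts/sec on ELN), not raw measurements.

```shell
# Relative drop in packet rate, in percent, rounded to one decimal place.
# Arguments: baseline rate, regressed rate (same unit, here Mpps).
# regression_pct is a hypothetical helper, not part of xdp-tools.
regression_pct() {
    awk -v base="$1" -v now="$2" \
        'BEGIN { printf "%.1f\n", (base - now) / base * 100 }'
}

regression_pct 19 15   # low end of the RHEL 9 range  -> 21.1
regression_pct 20 15   # high end of the RHEL 9 range -> 25.0
```

So the drop is roughly 21-25% relative to the RHEL 9 candidates, in line with the "20%" in the summary.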

2. What is the Version-Release number of the kernel:

kernel-6.4.0-0.rc6.20230616git40f71e7cd3c6.50.eln126
  https://koji.fedoraproject.org/koji/buildinfo?buildID=2217297

We tested only x86_64; other architectures were not checked.

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Yes. The regression most likely comes from a patch in the kernel mentioned above; the previous build (kernel-6.4.0-0.rc6.20230614gitb6dad5178cea.49.eln126 - https://koji.fedoraproject.org/koji/taskinfo?taskID=102148156) is OK.
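
One possible way to narrow this down, sketched under assumptions: the "git<hash>" parts of the two package versions are taken to be upstream commit IDs in Linus' tree, and the mlx5 driver path below is the upstream location of the driver. The helper name candidate_commits_cmd is made up for illustration. The cause may also lie outside the driver (or in ELN packaging/config changes), so an empty list would not rule anything out.

```shell
# Assumption: these hashes, taken from the package versions, are upstream commits.
GOOD=b6dad5178cea   # kernel-6.4.0-0.rc6.20230614gitb6dad5178cea.49.eln126
BAD=40f71e7cd3c6    # kernel-6.4.0-0.rc6.20230616git40f71e7cd3c6.50.eln126

# Prints the git command that lists candidate commits between the two
# snapshots, limited to the given paths (default: the mlx5 driver directory).
# candidate_commits_cmd is a hypothetical helper for this report only.
candidate_commits_cmd() {
    echo "git log --oneline ${GOOD}..${BAD} -- ${*:-drivers/net/ethernet/mellanox/mlx5}"
}

candidate_commits_cmd
# Run the printed command inside a clone of the kernel tree, e.g.:
#   cd linux && eval "$(candidate_commits_cmd)"
```

If the path-limited list is not conclusive, `git bisect start "$BAD" "$GOOD"` over the same range would be the more thorough option, at the cost of building and benchmarking each step.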



4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Reproducible always.

Steps:
- Install the affected kernel together with the kernel-modules-extra and kernel-modules-internal packages (needed for pktgen).
Traffic generator machine (wsfd-advnetlab65.anl.eng.rdu2.dc.redhat.com):
git clone https://github.com/torvalds/linux.git
cd linux/samples/pktgen/
./pktgen_sample03_burst_single_flow.sh -m MAC -d IP -i INF

DUT machine (receiver) (wsfd-advnetlab66.anl.eng.rdu2.dc.redhat.com):
dnf install -y git nano cmake clang llvm elfutils-libelf-devel libpcap-devel perf bpftool m4
git clone https://github.com/xdp-project/xdp-tools.git
cd xdp-tools/
./configure
make
./xdp-bench drop INF


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Not sure, I have not checked.


6. Are you running any modules that are not shipped directly with Fedora's kernel?:
No.


7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
-

Reproducible: Always

Comment 5 Samuel Dobroň 2024-07-22 08:37:03 UTC
Migrated to Jira - https://issues.redhat.com/browse/FC-1196

Closing.