Bug 1945040 - 18% rx pps performance regression with rhel9 guest compared with rhel8 guest
Summary: 18% rx pps performance regression with rhel9 guest compared with rhel8 guest
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: kernel
Version: 9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: beta
Target Release: ---
Assignee: Laurent Vivier
QA Contact: Quan Wenli
Docs Contact: Daniel Vozenilek
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-31 09:42 UTC by Quan Wenli
Modified: 2023-03-14 15:17 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.Network traffic performance in virtual machines is no longer reduced when under heavy load
Previously, RHEL virtual machines had, in some cases, decreased performance when handling high levels of network traffic. The underlying code has been fixed and network traffic performance now works as expected in the described circumstances.
Clone Of:
Environment:
Last Closed: 2022-08-23 10:34:08 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Comment 1 Quan Wenli 2021-03-31 09:55:01 UTC
The regression happens between guest kernel 4.18.0-300.el8.x86_64 and guest kernel 5.11.0-2.el9.x86_64.

Comment 2 Quan Wenli 2021-04-15 08:04:45 UTC
After bisecting, I found the first bad commit.

Bad commit: 

3226b158e67cfaa677fd180152bfb28989cb2fac is the first bad commit
commit 3226b158e67cfaa677fd180152bfb28989cb2fac
Author: Eric Dumazet <edumazet>
Date:   Wed Jan 13 08:18:19 2021 -0800

    net: avoid 32 x truesize under-estimation for tiny skbs

    Both virtio net and napi_get_frags() allocate skbs
    with a very small skb->head

    While using page fragments instead of a kmalloc backed skb->head might give
    a small performance improvement in some cases, there is a huge risk of
    under estimating memory usage.

    For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
    per page (order-3 page in x86), or even 64 on PowerPC

    We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
    but consuming far more memory for TCP buffers than instructed in tcp_mem[2]

    Even if we force napi_alloc_skb() to only use order-0 pages, the issue
    would still be there on arches with PAGE_SIZE >= 32768

    This patch makes sure that small skb head are kmalloc backed, so that
    other objects in the slab page can be reused instead of being held as long
    as skbs are sitting in socket queues.

    Note that we might in the future use the sk_buff napi cache,
    instead of going through a more expensive __alloc_skb()

    Another idea would be to use separate page sizes depending
    on the allocated length (to never have more than 4 frags per page)

    I would like to thank Greg Thelen for his precious help on this matter,
    analysing crash dumps is always a time consuming task.

    Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
    Signed-off-by: Eric Dumazet <edumazet>
    Cc: Paolo Abeni <pabeni>
    Cc: Greg Thelen <gthelen>
    Reviewed-by: Alexander Duyck <alexanderduyck>
    Acked-by: Michael S. Tsirkin <mst>
    Link: https://lore.kernel.org/r/20210113161819.1155526-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba>

 net/core/skbuff.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

With commit 3226b158e67cfaa677fd180152bfb28989cb2fac (the first bad commit): 1.84 mpps on rx
With commit 7da17624e7948d5d9660b910f8079d26d26ce453 (without it): 2.37 mpps on rx
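(For reference, a bisect of this kind typically runs along the lines of the sketch below. The tag names and the measurement step are assumptions for illustration, not the exact commands used here.)

# Illustrative bisect workflow (assumed; the good/bad tags are placeholders
# standing in for the good and bad guest kernels).
git bisect start
git bisect bad  v5.11        # guest kernel showing ~1.84 mpps rx
git bisect good v4.18        # guest kernel showing ~2.37 mpps rx
# at each step: build the guest kernel, boot the guest, measure rx pps,
# then mark the result and let git pick the next commit to test:
git bisect good              # or: git bisect bad
# repeat until git prints "<sha> is the first bad commit"
git bisect reset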


@Ariel, could you look at this? Thanks, wenli

Comment 5 Quan Wenli 2021-04-20 08:12:42 UTC
(In reply to jason wang from comment #3)
> Note that this has been fixed with the following commits upstream:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/
> ?id=0f6925b3e8da0dbbb52447ca8a8b42b371aac7db
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/
> ?id=38ec4944b593fd90c5ef42aaaa53e66ae5769d04
> 
> Thanks

I applied the above two patches to our latest downstream kernel (5.12.0-rc5); the rx pps went from 1.84 up to 1.93 mpps, but that is still not as good as the 2.37 mpps in comment #2.
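(For reference, pulling the two upstream commits into a local build tree is typically done along these lines; the remote name below is a placeholder and conflict handling on a downstream tree is omitted.)

# Sketch only: fetch the upstream net tree and cherry-pick the two fixes
# onto the local 5.12.0-rc5 based branch ("net" is an assumed remote name).
git remote add net https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
git fetch net
git cherry-pick 0f6925b3e8da 38ec4944b593
# then rebuild the guest kernel and re-run the rx pps measurement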

Comment 6 jason wang 2021-04-20 08:30:27 UTC
(In reply to Quan Wenli from comment #5)
> (In reply to jason wang from comment #3)
> > Note that this has been fixed with the following commits upstream:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/
> > ?id=0f6925b3e8da0dbbb52447ca8a8b42b371aac7db
> > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/
> > ?id=38ec4944b593fd90c5ef42aaaa53e66ae5769d04
> > 
> > Thanks
> 
> I apply above two patches in our latest downstream(5.12.0-rc5), the rx pps
> from 1.84 up to 1.93 mpps, but still not good as 2.37 mpps in comment#2.

Thanks for the testing.

I've proposed another idea to increase the performance. An engineer from Ali Cloud is working on that.

I will give you the commit ID once it is applied. (Actually, the patch has been applied but has bugs; we're working on solving them.)

Thanks

Comment 8 Quan Wenli 2021-04-23 03:35:46 UTC
(In reply to jason wang from comment #7)
> Here're the patches:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/
> ?id=fb32856b16ad9d5bcd75b76a274e2c515ac7b9d7
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/
> ?id=f5d7872a8b8a3176e65dc6f7f0705ce7e9a699e6
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/
> ?id=af39c8f72301b268ad8b04bae646b6025918b82b
> 
> Thanks

Applying the above 3 patches on top of the kernel in comment #5, the rx pps goes back down to 1.84 mpps.

Comment 9 jason wang 2021-04-23 04:10:00 UTC
Can you remove the check:

len > GOOD_COPY_LEN 

in page_to_skb() and retry?

This check basically suppresses the optimization for small packets (e.g. 64B).

Thanks

Comment 10 Quan Wenli 2021-04-25 06:47:31 UTC
(In reply to jason wang from comment #9)
> Can you remove the check:
> 
> len > GOOD_COPY_LEN 
> 
> in page_to_skb() and retry?
> 
> This check bascially suppresses the optimization for small packet (e.g 64B).
> 
> Thanks

Cool, after removing it and rebuilding the kernel, the performance is back up to 2.33 mpps.

Comment 23 Laurent Vivier 2022-07-11 19:16:59 UTC
According to BZ 2069047 comment 10,

RHEL 9.0 guest on RHEL 8.6 host works well
RHEL 9.0 guest on RHEL 9.0 host hits the regression

      host               guest                    rx pps 

4.18.0-353              5.14.0-70.2.1.el9_0       2.39 mpps (8.6 host with 9.0 guest)
5.14.0-70.2.1.el9_0     5.14.0-70.2.1.el9_0       1.74 mpps  (9.0 host with 9.0 guest)

As the ITR is 9.1, could you test:

      host               guest

4.18.0-402.el8          5.14.0-127.el9
5.14.0-127.el9          5.14.0-127.el9

Thanks

Comment 24 Quan Wenli 2022-07-14 06:02:04 UTC
(In reply to Laurent Vivier from comment #23)
> According to BZ 2069047 comment 10,
> 
> RHEL 9.0 guest on RHEL 8.6 guest works well
> RHEL 9.0 guest on RHEL 9.0 meets the regression
> 
>       host               guest                    rx pps 
> 
> 4.18.0-353              5.14.0-70.2.1.el9_0       2.39 mpps (8.6 host with
> 9.0 guest)
> 5.14.0-70.2.1.el9_0     5.14.0-70.2.1.el9_0       1.74 mpps  (9.0 host with
> 9.0 guest)
> 
> As the ITR is 9.1, could you test:
> 
>       host               guest
> 
> 4.18.0-402.el8          5.14.0-127.el9
> 5.14.0-127.el9          5.14.0-127.el9
> 
> Thanks

I will update the results when I get them.

Currently the ITM is 18. Could you review it and reset the ITM?

Thanks, wenli

Comment 27 Quan Wenli 2022-07-19 09:55:49 UTC
(In reply to Laurent Vivier from comment #23)
> According to BZ 2069047 comment 10,
> 
> RHEL 9.0 guest on RHEL 8.6 guest works well
> RHEL 9.0 guest on RHEL 9.0 meets the regression
> 
>       host               guest                    rx pps 
> 
> 4.18.0-353              5.14.0-70.2.1.el9_0       2.39 mpps (8.6 host with
> 9.0 guest)
> 5.14.0-70.2.1.el9_0     5.14.0-70.2.1.el9_0       1.74 mpps  (9.0 host with
> 9.0 guest)
> 
> As the ITR is 9.1, could you test:
> 
>       host               guest
> 
> 4.18.0-402.el8          5.14.0-127.el9
> 5.14.0-127.el9          5.14.0-127.el9


       host               guest                 rx results

 4.18.0-402.el8          5.14.0-127.el9         1.97 mpps
 5.14.0-127.el9          5.14.0-127.el9         1.71 mpps

Detailed results:
http://10.73.60.69/results/request/Bug1945040/rhel9.0host/kernel-5.14.0-127/pktgen_perf.html




> 
> Thanks

Comment 29 Laurent Vivier 2022-08-05 18:28:21 UTC
I've not been able to reproduce the problem on my system: I get better performance with the latest upstream kernel (v5.19+, b2a88c212e65) than with a kernel without 3226b158e67c (5.11.0-rc2+, 7da17624e794).

My command line is:

/usr/libexec/qemu-kvm \
-nodefaults \
-nographic \
-machine q35 \
-m 4066  \
-smp 4 \
-blockdev node-name=file_image1,driver=file,filename=$IMAGE \
-blockdev node-name=drive_image1,driver=qcow2,file=file_image1 \
-device virtio-blk,id=virtioblk0,drive=drive_image1 \
-enable-kvm \
-cpu host \
-serial mon:stdio \
-device virtio-net,mac=52:54:00:7b:3f:6b,id=virtionet0,netdev=tap0 \
-netdev tap,id=tap0,vhost=on

My results are:

HOST 5.11.0-rc2+

rhel870 4.18.0-411.el8.x86_64   TX tap0: 0 pkts/s RX tap0: 928046 pkts/s
rhel910 5.14.0-136.el9.x86_64   TX tap0: 0 pkts/s RX tap0: 927145 pkts/s

HOST 5.19.0+

rhel870 4.18.0-411.el8.x86_64   TX tap0: 1 pkts/s RX tap0: 1106153 pkts/s
rhel910 5.14.0-136.el9.x86_64   TX tap0: 1 pkts/s RX tap0: 1088796 pkts/s

What did I miss?

Comment 30 Quan Wenli 2022-08-10 08:08:14 UTC
(In reply to Laurent Vivier from comment #29)
> I've not been able to reproduce the problem on my system, I have a better
> performance with latest upstream kernel (v5.19+, b2a88c212e65) than with
> kernel without 3226b158e67c (5.11.0-rc2+, 7da17624e794).
> 
> My command line is:
> 
> /usr/libexec/qemu-kvm \
> -nodefaults \
> -nographic \
> -machine q35 \
> -m 4066  \
> -smp 4 \
> -blockdev node-name=file_image1,driver=file,filename=$IMAGE \
> -blockdev node-name=drive_image1,driver=qcow2,file=file_image1 \
> -device virtio-blk,id=virtioblk0,drive=drive_image1 \
> -enable-kvm \
> -cpu host \
> -serial mon:stdio \
> -device virtio-net,mac=52:54:00:7b:3f:6b,id=virtionet0,netdev=tap0 \
> -netdev tap,id=tap0,vhost=on
> 
> My results are:
> 
> HOST 5.11.0-rc2+
> 
> rhel870 4.18.0-411.el8.x86_64   TX tap0: 0 pkts/s RX tap0: 928046 pkts/s
> rhel910 5.14.0-136.el9.x86_64   TX tap0: 0 pkts/s RX tap0: 927145 pkts/s

Your numbers are around 0.9 mpps; maybe the rx performance issue cannot be reproduced at such a low pps rate?


> 
> HOST 5.19.0+
> 
> rhel870 4.18.0-411.el8.x86_64   TX tap0: 1 pkts/s RX tap0: 1106153 pkts/s
> rhel910 5.14.0-136.el9.x86_64   TX tap0: 1 pkts/s RX tap0: 1088796 pkts/s
> 
> What did I miss?

Comment 31 Laurent Vivier 2022-08-10 15:42:21 UTC
(In reply to Quan Wenli from comment #30)
> (In reply to Laurent Vivier from comment #29)
> > I've not been able to reproduce the problem on my system, I have a better
> > performance with latest upstream kernel (v5.19+, b2a88c212e65) than with
> > kernel without 3226b158e67c (5.11.0-rc2+, 7da17624e794).
> > 
> > My command line is:
> > 
> > /usr/libexec/qemu-kvm \
> > -nodefaults \
> > -nographic \
> > -machine q35 \
> > -m 4066  \
> > -smp 4 \
> > -blockdev node-name=file_image1,driver=file,filename=$IMAGE \
> > -blockdev node-name=drive_image1,driver=qcow2,file=file_image1 \
> > -device virtio-blk,id=virtioblk0,drive=drive_image1 \
> > -enable-kvm \
> > -cpu host \
> > -serial mon:stdio \
> > -device virtio-net,mac=52:54:00:7b:3f:6b,id=virtionet0,netdev=tap0 \
> > -netdev tap,id=tap0,vhost=on
> > 
> > My results are:
> > 
> > HOST 5.11.0-rc2+
> > 
> > rhel870 4.18.0-411.el8.x86_64   TX tap0: 0 pkts/s RX tap0: 928046 pkts/s
> > rhel910 5.14.0-136.el9.x86_64   TX tap0: 0 pkts/s RX tap0: 927145 pkts/s
> 
> your data are around 0.9 mpps, it maybe the rx performance issue can not
> reproduced with slowly pps rate ?  
>

So you mean it depends on the machine performance?
Could you try to reproduce the problem with the QEMU command line above?
I'm trying to have a simplified reproducer (no libvirt, minimal devices).

Comment 32 lulu@redhat.com 2022-08-22 09:13:47 UTC
(In reply to Laurent Vivier from comment #31)
> (In reply to Quan Wenli from comment #30)
> > (In reply to Laurent Vivier from comment #29)
> > > I've not been able to reproduce the problem on my system, I have a better
> > > performance with latest upstream kernel (v5.19+, b2a88c212e65) than with
> > > kernel without 3226b158e67c (5.11.0-rc2+, 7da17624e794).
> > > 
> > > My command line is:
> > > 
> > > /usr/libexec/qemu-kvm \
> > > -nodefaults \
> > > -nographic \
> > > -machine q35 \
> > > -m 4066  \
> > > -smp 4 \
> > > -blockdev node-name=file_image1,driver=file,filename=$IMAGE \
> > > -blockdev node-name=drive_image1,driver=qcow2,file=file_image1 \
> > > -device virtio-blk,id=virtioblk0,drive=drive_image1 \
> > > -enable-kvm \
> > > -cpu host \
> > > -serial mon:stdio \
> > > -device virtio-net,mac=52:54:00:7b:3f:6b,id=virtionet0,netdev=tap0 \
> > > -netdev tap,id=tap0,vhost=on
> > > 
> > > My results are:
> > > 
> > > HOST 5.11.0-rc2+
> > > 
> > > rhel870 4.18.0-411.el8.x86_64   TX tap0: 0 pkts/s RX tap0: 928046 pkts/s
> > > rhel910 5.14.0-136.el9.x86_64   TX tap0: 0 pkts/s RX tap0: 927145 pkts/s
> > 
> > your data are around 0.9 mpps, it maybe the rx performance issue can not
> > reproduced with slowly pps rate ?  
> >
> 
> So you mean it depends on the machine performance?
> Could you try to reproduce the problem with QEMU command line above?
> I try to have a simplified reproducer (no libvirt, minimum devices).

Hi wenli,
I have tried the same steps, but I ran into the same problem: I cannot reproduce the issue.
Would you help verify this? Also, would you help verify this on the latest el9 kernel?
Since the code sync, the commits mentioned in this bz are all included in the rhel9 source code.

Thanks
cindy
host 5.11 / guest 5.11 (without the commits)
[root@localhost ~]# ./pps.sh eth1
TX eth1: 182062 pkts/s RX eth1: 0 pkts/s
TX eth1: 185255 pkts/s RX eth1: 0 pkts/s
TX eth1: 181486 pkts/s RX eth1: 0 pkts/s
TX eth1: 182976 pkts/s RX eth1: 0 pkts/s
TX eth1: 182016 pkts/s RX eth1: 0 pkts/s
TX eth1: 181440 pkts/s RX eth1: 0 pkts/s
TX eth1: 181858 pkts/s RX eth1: 0 pkts/s
TX eth1: 182922 pkts/s RX eth1: 0 pkts/s
TX eth1: 173002 pkts/s RX eth1: 0 pkts/s
TX eth1: 183056 pkts/s RX eth1: 0 pkts/s
TX eth1: 184366 pkts/s RX eth1: 0 pkts/s
TX eth1: 183168 pkts/s RX eth1: 0 pkts/s
TX eth1: 183045 pkts/s RX eth1: 0 pkts/s

host 5.12+ / guest 5.11 (after the commits were merged)
TX eth1: 203760 pkts/s RX eth1: 0 pkts/s
TX eth1: 203136 pkts/s RX eth1: 0 pkts/s
TX eth1: 186802 pkts/s RX eth1: 0 pkts/s
TX eth1: 182638 pkts/s RX eth1: 0 pkts/s
TX eth1: 188829 pkts/s RX eth1: 0 pkts/s
TX eth1: 193052 pkts/s RX eth1: 0 pkts/s
TX eth1: 204710 pkts/s RX eth1: 0 pkts/s
TX eth1: 202590 pkts/s RX eth1: 0 pkts/s
TX eth1: 203758 pkts/s RX eth1: 0 pkts/s
TX eth1: 203273 pkts/s RX eth1: 0 pkts/s
TX eth1: 199357 pkts/s RX eth1: 0 pkts/s
TX eth1: 190603 pkts/s RX eth1: 0 pkts/s
TX eth1: 199098 pkts/s RX eth1: 0 pkts/s
TX eth1: 197656 pkts/s RX eth1: 0 pkts/s
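
(pps.sh itself is not attached to this bug; a minimal equivalent that samples the interface packet counters once per second might look like the sketch below. The actual script used may differ.)

#!/bin/bash
# Minimal pps sampler (assumed stand-in for pps.sh): prints TX/RX
# packets per second for the given interface, sampled once per second.
IF=${1:?usage: pps.sh <interface>}
while true; do
    tx1=$(cat /sys/class/net/"$IF"/statistics/tx_packets)
    rx1=$(cat /sys/class/net/"$IF"/statistics/rx_packets)
    sleep 1
    tx2=$(cat /sys/class/net/"$IF"/statistics/tx_packets)
    rx2=$(cat /sys/class/net/"$IF"/statistics/rx_packets)
    echo "TX $IF: $((tx2 - tx1)) pkts/s RX $IF: $((rx2 - rx1)) pkts/s"
done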

