Bug 1401433 - Vhost tx batching
Summary: Vhost tx batching
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: rc
Target Release: 7.4
Assignee: Wei
QA Contact: Quan Wenli
Docs Contact: Yehuda Zimmerman
URL:
Whiteboard:
Depends On: 1283257 1352741
Blocks: 1395265 1414627 1445257
 
Reported: 2016-12-05 09:10 UTC by jason wang
Modified: 2017-08-02 04:53 UTC (History)
9 users

Fixed In Version: kernel-3.10.0-670.el7
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-02 04:53:19 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:1842 normal SHIPPED_LIVE Important: kernel security, bug fix, and enhancement update 2017-08-01 18:22:09 UTC

Description jason wang 2016-12-05 09:10:15 UTC
Description of problem:

Upstream will support vhost tx batching, which can batch several tx packets before submitting them to the host stack.

For testing:
- modprobe tun rx_batched=0
- run pktgen/l2fwd in the guest and measure pps
- modprobe tun rx_batched=16
- run pktgen/l2fwd in the guest and measure pps

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Wei 2017-03-15 09:58:34 UTC
Downstream test result on my laptop:
Before:
tap2 RX  1564831 pkts/s RX Dropped: 0 pkts/s
tap1 TX  2180650 pkts/s TX Dropped: 1677842 pkts/s

After:
tap2 RX  1582509 pkts/s RX Dropped: 0 pkts/s
tap1 TX  2232357 pkts/s TX Dropped: 1702915 pkts/s

Comment 4 Wei 2017-05-08 04:24:01 UTC
It is a bit complicated: I posted 2 versions, and v2 changed nothing except the
comments, which disturbed the maintainer quite a bit given the feedback on other
BZs I had done; usually we only need to tweak v1. I commented, probably last week,
asking to skip v2 and go back to v1, but I haven't gotten any feedback so far.

I will ping the maintainer to make sure the process is acceptable, and see whether I should ask a reviewer to review it or post a new series.

Comment 6 Wei 2017-05-19 13:54:39 UTC
This is a performance improvement, which doesn't need a specific documentation update.

Comment 7 Rafael Aquini 2017-05-19 23:22:26 UTC
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Comment 9 Rafael Aquini 2017-05-22 13:54:12 UTC
Patch(es) available on kernel-3.10.0-670.el7

Comment 11 xiywang 2017-05-23 02:30:59 UTC
Hi Wenli,

Could you help to do performance test?

Thanks,
Xiyue

Comment 12 Quan Wenli 2017-05-23 03:27:58 UTC
(In reply to xiywang from comment #11)
> Hi Wenli,
> 
> Could you help to do performance test?
> 
> Thanks,
> Xiyue

OK, I will test it tomorrow.

Comment 13 Quan Wenli 2017-05-23 09:54:23 UTC
Hi, Jason,

There is no param named rx_batched with 3.10.0-670.el7.x86_64. How can I verify that vhost tx batching is working? Actually, I did not see any tx pps difference between rx_batched=0 and rx_batched=16.

# modinfo tun
filename:       /lib/modules/3.10.0-670.el7.x86_64/kernel/drivers/net/tun.ko.xz
alias:          devname:net/tun
alias:          char-major-10-200
license:        GPL
author:         (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
description:    Universal TUN/TAP device driver
rhelversion:    7.4
srcversion:     E0353EFA774E5AFD2FFCFD1
depends:        
intree:         Y
vermagic:       3.10.0-670.el7.x86_64 SMP mod_unload modversions 
signer:         Red Hat Enterprise Linux kernel signing key
sig_key:        69:FC:97:DA:41:C9:5D:8E:B0:F5:C4:10:8F:59:71:A9:DC:53:14:E9
sig_hashalgo:   sha256

Comment 14 jason wang 2017-05-24 03:26:56 UTC
(In reply to Quan Wenli from comment #13)
> Hi, jason
> 
> There is no parm named rx_batched with 3.10.0-670.el7.x86_64, how to check
> vhost tx batching valid, actually I did not see any tx pps difference
> between rx_batched=0 and rx_batched=16. 
> 
> # modinfo tun
> filename:      
> /lib/modules/3.10.0-670.el7.x86_64/kernel/drivers/net/tun.ko.xz
> alias:          devname:net/tun
> alias:          char-major-10-200
> license:        GPL
> author:         (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
> description:    Universal TUN/TAP device driver
> rhelversion:    7.4
> srcversion:     E0353EFA774E5AFD2FFCFD1
> depends:        
> intree:         Y
> vermagic:       3.10.0-670.el7.x86_64 SMP mod_unload modversions 
> signer:         Red Hat Enterprise Linux kernel signing key
> sig_key:        69:FC:97:DA:41:C9:5D:8E:B0:F5:C4:10:8F:59:71:A9:DC:53:14:E9
> sig_hashalgo:   sha256

You need to enable it through:

ethtool -C tap0 rx-frames N

Thanks

And it's better to test it in the VM-to-VM case.
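A minimal sketch of the enable-and-verify step, assuming a tap device named tap0 backing the guest NIC and root privileges; the set_batching and pct_gain helpers and the device name are hypothetical additions for illustration:

```shell
#!/bin/sh
# Sketch: enable tx batching on a tap device via the rx-frames coalescing
# parameter, then read it back to confirm it took effect. "tap0" is a
# placeholder for the tap device backing the guest's NIC; requires root.

set_batching() {
    dev="$1"; frames="$2"
    ethtool -C "$dev" rx-frames "$frames"      # set batch size (0 disables)
    ethtool -c "$dev" | grep '^rx-frames:'     # read it back to confirm
}

# Pure helper: percentage gain of "after" pps over "before" pps, one decimal.
pct_gain() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f", (after - before) / before * 100 }'
}

# Only touch the device when explicitly asked, so sourcing this file is safe.
if [ "${1:-}" = "run" ]; then
    set_batching tap0 64
fi
```

pct_gain is just a convenience for quantifying the before/after pps readings from pktgen runs.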

Comment 15 Quan Wenli 2017-05-24 07:25:04 UTC
Hi, Jason, Wei,

Please check the following performance results: pps increases from rx-frames=1 up to rx-frames=128, but drops with rx-frames=256. Is that expected?



Steps:
1. Boot 2 VMs on the same bridge.
2. Run pktgen.sh on device eth0 in vm1; make sure eth0's MAC address on vm2 is assigned in the pktgen.sh script.
3. Gather the pps result on vm2.

 rx-frames      pkts/s
-----------+--------------+
     0         311290
-----------+--------------+
     1         311195
-----------+--------------+
     4         313300
-----------+--------------+
    16         315542
-----------+--------------+
    64         328584
-----------+--------------+
   128         329697
-----------+--------------+
   256         312774      ----------> drop compared to rx-frames=128
-----------+--------------+
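As a quick cross-check, the relative gain of each row over the rx-frames=0 baseline can be computed from the pkts/s column above (a sketch; the values just restate the table):

```shell
#!/bin/sh
# Cross-check: relative gain of each rx-frames setting versus the
# rx-frames=0 baseline (311290 pkts/s), from the table above.

gain_vs_baseline() {
    awk -v base="$1" -v pps="$2" \
        'BEGIN { printf "%+.1f%%", (pps - base) / base * 100 }'
}

for row in "1 311195" "4 313300" "16 315542" \
           "64 328584" "128 329697" "256 312774"; do
    set -- $row
    echo "rx-frames=$1: $2 pkts/s ($(gain_vs_baseline 311290 $2))"
done
```

This shows a peak of about +5.9% at rx-frames=128, while rx-frames=256 falls back to roughly +0.5% over the baseline, which is the drop being asked about.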

Comment 16 jason wang 2017-05-24 10:51:44 UTC
(In reply to Quan Wenli from comment #15)
> Hi, jason, wei 
> 
> Please check following performance results, pps is increased with 1
> rx-frames to 128 rx-frames, but with performance drop with 256 rx-frames. is
> it expected?
> 
> 
> 
> Steps: 
> 1. boot 2 vms in same bridge. 
> 2. run pktgen.sh on device eth0 on vm1. make sure the eth0's mac address on
> vm2 assigned in pktgen.sh script. 
> 3. gather pps result on vm2. 
> 
>  rx-frames      pkts/s
> -----------+--------------+
>      0         311290
> -----------+--------------+
>      1         311195
> -----------+--------------+
>      4         313300
> -----------+--------------+
>     16         315542
> -----------+--------------+
>     64         328584
> -----------+--------------+
>    128         329697
> -----------+--------------+
>    256         312774      ----------> drop compared rx-frames=128
> -----------+--------------+

Interesting. In my setup with 3.10.0-671.el7.x86_64:

rx-frames 0,   0.63Mpps
rx-frames 64,  0.99Mpps (+57%)
rx-frames 256, 0.99Mpps (+57%)

Have you pinned all threads to one NUMA node during testing?

Thanks

Comment 17 Quan Wenli 2017-05-25 09:03:30 UTC
(In reply to jason wang from comment #16)
> Interesting, in my setup with 3.10.0-671.el7.x86_64.
> 
> rx-frames 0,   0.63Mpps
> rx-frames 64,  0.99Mpps (+57%)
> rx-frames 256, 0.99Mpps (+57%)
> 
> Have you pinned all threads in one numa nodes during testing?

I pinned all threads to one NUMA node; there is only a slight, not obvious, improvement, and no pps drop with rx-frames=256. I used "ethtool -c tap0" to check every time; rx-frames was indeed in effect.

rx-frames 0,    330543
rx-frames 64,   334737
rx-frames 256,  334277



Comment 18 Quan Wenli 2017-06-01 03:33:39 UTC
(In reply to jason wang from comment #16)
> Interesting, in my setup with 3.10.0-671.el7.x86_64.
> 
> rx-frames 0,   0.63Mpps
> rx-frames 64,  0.99Mpps (+57%)
> rx-frames 256, 0.99Mpps (+57%)

I tried your image, whose guest is running a 4.10.0+ kernel; the performance goes up to 0.5 Mpps. I then also tried the latest upstream 4.11.0-rc5+; the performance is still 0.33 Mpps.

So there is no performance difference between the RHEL 7.4 guest and the latest upstream guest, but there seems to be an existing regression between 4.10.0+ and 4.11.0-rc5+ upstream.

Comment 20 jason wang 2017-06-08 04:09:40 UTC
(In reply to Quan Wenli from comment #18)
> I tried with your image which guest is using 4.10.0+ kernel, the performance
> is up to 0.5Mpps, then I also tried with latest upstream 4.11.0-rc5+, the
> performance is still 0.33Mpps.
> 
> So no performance difference between rhel7.4 guest and latest upstream
> guest, but it seems an existed regression issue between 4.10.0+ and
> 4.11.0-rc5+ in upstream.

Can you try net.git or linux.git? My image uses net-next, which is in fact a development tree.

Thanks

Comment 21 Quan Wenli 2017-06-12 06:31:31 UTC
(In reply to jason wang from comment #20)
> Can you try net.git or linux.git. My image use net-next which is in fact a
> development tree.

Tried again with a guest kernel 4.11.0-rc5+ from git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git; the result is still bad, 0.33 Mpps.

So is it an upstream bug? May I open a bug to track it and close this one?

Comment 22 Quan Wenli 2017-06-15 06:51:57 UTC
(In reply to Quan Wenli from comment #21)
> Tried again with guest kernel-4.11.0-rc5+ from
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git, the result
> is still bad which is 0.33Mpps. 
> 
> So it's a upstream bug ? May I open one bug for tracking it and close this
> bug ?

After checking again, I found there is no regression upstream; the root cause of the pps difference between 4.10 and 4.11 was a different parameter in pktgen.sh.

1. With both dst (IP) and dst_mac enabled, the pps performance was the minimum, 0.25 Mpps.

2. With only dst_mac enabled, the pps performance was in the middle, 0.32 Mpps, which is what I got with the 4.11-rc5+ kernel.

3. With only dst (IP) enabled, the pps performance was the maximum, 0.5 Mpps, which is what I got with the 4.10 kernel.

So there is no regression upstream.

And for this bug, with only dst (IP), the pps performance was indeed improved by enlarging rx-frames:

rx-frames 0,    0.50
rx-frames 1,    0.53
rx-frames 4,    0.56
rx-frames 64,   0.64


Based on the above, changing this to verified.
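The three pktgen.sh variants compared above can be sketched as follows; the device path, IP, and MAC are hypothetical placeholders, and pgset is the conventional helper from the kernel's pktgen sample scripts for writing commands into /proc/net/pktgen:

```shell
#!/bin/sh
# Sketch: the three pktgen configurations compared above. "eth0" and the
# addresses below are placeholders; pktgen is driven by writing command
# strings into its per-device /proc file.

PGDEV=/proc/net/pktgen/eth0

pgset() {
    echo "$1" > "$PGDEV"
}

config_dst_ip_only() {       # fastest case above (~0.5 Mpps)
    pgset "dst 192.168.1.2"
}

config_dst_mac_only() {      # middle case above (~0.32 Mpps)
    pgset "dst_mac 52:54:00:12:34:56"
}

config_both() {              # slowest case above (~0.25 Mpps)
    pgset "dst 192.168.1.2"
    pgset "dst_mac 52:54:00:12:34:56"
}
```

Only the functions are defined here; nothing is written until one of them is called against a real pktgen device.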

Comment 23 Quan Wenli 2017-06-19 05:17:18 UTC
> 
> After check again, I found there is no regression in upstream, the root
> cause for regression pps between 4.10 to 4.11 is the different param in
> pktgen.sh. 
> 
> 1. Both enabled dst (IP) and dst_mac, the pps performance was minium with
> 0.25. 
> 
> 2. Only enabled dst_mac, the pps performance was middle with 0.32 which I
> got with 4.11-rc5+ kernel.
> 
> 3. Only enabled dst(IP), the pps performance was maxium with 0.5 which I got
> with 4.10 kernel.
> 
> So there is regression in upstream.

Correction: there should be no regression upstream.

Comment 25 errata-xmlrpc 2017-08-02 04:53:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842

