Bug 1401433
| Summary: | Vhost tx batching | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | jason wang <jasowang> |
| Component: | kernel | Assignee: | Wei <wexu> |
| kernel sub component: | Virtualization | QA Contact: | Quan Wenli <wquan> |
| Status: | CLOSED ERRATA | Docs Contact: | Yehuda Zimmerman <yzimmerm> |
| Severity: | unspecified | | |
| Priority: | high | CC: | ailan, chayang, juzhang, michen, mtessun, pezhang, weliao, wexu, wquan |
| Version: | 7.4 | Keywords: | FutureFeature |
| Target Milestone: | rc | | |
| Target Release: | 7.4 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-3.10.0-670.el7 | Doc Type: | No Doc Update |
| Doc Text: | | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-02 04:53:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1283257, 1352741 | | |
| Bug Blocks: | 1395265, 1414627, 1445257 | | |
Description
jason wang
2016-12-05 09:10:15 UTC
In net-next: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=e3e37e701713731b22f8cebfa1f5deed455cad8a

Downstream test result on my laptop:

Before:
tap2 RX 1564831 pkts/s, RX Dropped: 0 pkts/s
tap1 TX 2180650 pkts/s, TX Dropped: 1677842 pkts/s

After:
tap2 RX 1582509 pkts/s, RX Dropped: 0 pkts/s
tap1 TX 2232357 pkts/s, TX Dropped: 1702915 pkts/s

It is a bit complicated: I posted two versions, and v2 changed nothing except the comments, which disturbed the maintainer quite a bit given the feedback for the other BZs I had done (usually we only need to tweak v1). I asked, probably last week, to skip v2 and go back to v1, but have not had any feedback so far. I will ping the maintainer to confirm whether that process is acceptable, and see whether I should ask a reviewer to review it or post a new series.

This is a performance improvement and does not need specific documentation.

Patch(es) committed on kernel repository and an interim kernel build is undergoing testing.

Patch(es) available on kernel-3.10.0-670.el7.

Hi Wenli,

Could you help to do a performance test?

Thanks,
Xiyue

(In reply to xiywang from comment #11)
> Hi Wenli,
>
> Could you help to do a performance test?
>
> Thanks,
> Xiyue

OK, I will test it tomorrow.

Hi jason,

There is no module parameter named rx_batched in 3.10.0-670.el7.x86_64, so how can I check that vhost tx batching is in effect? I did not see any tx pps difference between rx_batched=0 and rx_batched=16.

# modinfo tun
filename:       /lib/modules/3.10.0-670.el7.x86_64/kernel/drivers/net/tun.ko.xz
alias:          devname:net/tun
alias:          char-major-10-200
license:        GPL
author:         (C) 1999-2004 Max Krasnyansky <maxk>
description:    Universal TUN/TAP device driver
rhelversion:    7.4
srcversion:     E0353EFA774E5AFD2FFCFD1
depends:
intree:         Y
vermagic:       3.10.0-670.el7.x86_64 SMP mod_unload modversions
signer:         Red Hat Enterprise Linux kernel signing key
sig_key:        69:FC:97:DA:41:C9:5D:8E:B0:F5:C4:10:8F:59:71:A9:DC:53:14:E9
sig_hashalgo:   sha256

(In reply to Quan Wenli from comment #13)
> There is no module parameter named rx_batched in 3.10.0-670.el7.x86_64, so how
> can I check that vhost tx batching is in effect? I did not see any tx pps
> difference between rx_batched=0 and rx_batched=16.

You need to enable it through:

ethtool -C tap0 rx-frames N

Thanks

And it is better to test the VM2VM case.
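As a side note, a minimal sketch of enabling and checking the setting from the host follows; tap0 and the value 64 are only example values, and the tap device name depends on the actual setup:

# Enable tx batching by setting the rx-frames coalescing parameter on the tap device.
ethtool -C tap0 rx-frames 64
# Setting rx-frames back to 0 disables batching again.
# Confirm the value actually took effect:
ethtool -c tap0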
Hi jason, wei,

Please check the following performance results: pps increases from rx-frames=1 up to rx-frames=128, but drops again with rx-frames=256. Is that expected?

Steps:
1. Boot 2 VMs on the same bridge.
2. Run pktgen.sh on device eth0 on vm1; make sure eth0's MAC address of vm2 is assigned in the pktgen.sh script.
3. Gather the pps result on vm2.

rx-frames   pkts/s
---------+---------
0          311290
1          311195
4          313300
16         315542
64         328584
128        329697
256        312774   ----> drop compared to rx-frames=128

(In reply to Quan Wenli from comment #15)
> Please check the following performance results: pps increases from rx-frames=1
> up to rx-frames=128, but drops again with rx-frames=256. Is that expected?

Interesting. In my setup with 3.10.0-671.el7.x86_64:

rx-frames 0,   0.63 Mpps
rx-frames 64,  0.99 Mpps (+57%)
rx-frames 256, 0.99 Mpps (+57%)

Have you pinned all threads to one NUMA node during testing?

Thanks
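The bug does not record how the threads were pinned; the lines below are only a sketch of one common way to keep the guest and its vhost thread on a single NUMA node. The node number, the CPU list 0-7, and the qemu PID 12345 are placeholders, not values from this bug:

# Start the guest with vCPUs and memory bound to NUMA node 0 (placeholder node).
numactl --cpunodebind=0 --membind=0 qemu-kvm ...
# Pin the matching vhost worker thread (a kernel thread named vhost-<qemu pid>)
# to the CPUs of the same node; 0-7 and 12345 are placeholders.
taskset -pc 0-7 $(pgrep vhost-12345)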
(In reply to jason wang from comment #16)
> Interesting. In my setup with 3.10.0-671.el7.x86_64:
>
> rx-frames 0,   0.63 Mpps
> rx-frames 64,  0.99 Mpps (+57%)
> rx-frames 256, 0.99 Mpps (+57%)
>
> Have you pinned all threads to one NUMA node during testing?
>
> Thanks

With all threads pinned to one NUMA node there is only a slight improvement, nothing obvious, and no pps drop with 256 rx-frames. I checked with "ethtool -c tap0" every time; the rx-frames setting is indeed in effect.

rx-frames 0,   330543
rx-frames 64,  334737
rx-frames 256, 334277
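The thread does not show how the pps numbers on vm2 were gathered; one minimal way to sample them on the receiving guest is sketched below, assuming the receiving interface is eth0:

# Print received packets per second on eth0 once a second (run inside vm2).
while true; do
    rx1=$(awk '/eth0:/ {print $3}' /proc/net/dev)
    sleep 1
    rx2=$(awk '/eth0:/ {print $3}' /proc/net/dev)
    echo "RX $((rx2 - rx1)) pkts/s"
done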
(In reply to jason wang from comment #16)
> Interesting. In my setup with 3.10.0-671.el7.x86_64:
>
> rx-frames 0,   0.63 Mpps
> rx-frames 64,  0.99 Mpps (+57%)
> rx-frames 256, 0.99 Mpps (+57%)

I tried with your image, where the guest runs a 4.10.0+ kernel, and the performance goes up to 0.5 Mpps; I then also tried the latest upstream 4.11.0-rc5+, and the performance is still 0.33 Mpps.

So there is no performance difference between a rhel7.4 guest and the latest upstream guest, but there seems to be an existing regression between 4.10.0+ and 4.11.0-rc5+ upstream.

(In reply to Quan Wenli from comment #18)
> I tried with your image, where the guest runs a 4.10.0+ kernel, and the
> performance goes up to 0.5 Mpps; I then also tried the latest upstream
> 4.11.0-rc5+, and the performance is still 0.33 Mpps.
>
> So there is no performance difference between a rhel7.4 guest and the latest
> upstream guest, but there seems to be an existing regression between 4.10.0+
> and 4.11.0-rc5+ upstream.

Can you try net.git or linux.git? My image uses net-next, which is in fact a development tree.

Thanks
(In reply to jason wang from comment #20)
> Can you try net.git or linux.git? My image uses net-next, which is in fact a
> development tree.
>
> Thanks

Tried again with a guest kernel-4.11.0-rc5+ from git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git; the result is still bad, 0.33 Mpps.

So is it an upstream bug? May I open a bug for tracking it and close this bug?

After checking again, I found there is no regression upstream; the root cause of the pps difference between 4.10 and 4.11 is a different parameter in pktgen.sh.

1. With both dst (IP) and dst_mac enabled, the pps performance was the minimum, 0.25.

2. With only dst_mac enabled, the pps performance was in the middle, 0.32, which is what I got with the 4.11-rc5+ kernel.

3. With only dst (IP) enabled, the pps performance was the maximum, 0.5, which is what I got with the 4.10 kernel.

So there is regression in upstream.

And for this bug, with only dst (IP), the pps performance is indeed improved by enlarging rx-frames:

rx-frames 0,  0.50
rx-frames 1,  0.53
rx-frames 4,  0.56
rx-frames 64, 0.64

Based on the above, changing the status to verified.
> After checking again, I found there is no regression upstream; the root
> cause of the pps difference between 4.10 and 4.11 is a different parameter
> in pktgen.sh.
>
> 1. With both dst (IP) and dst_mac enabled, the pps performance was the
> minimum, 0.25.
>
> 2. With only dst_mac enabled, the pps performance was in the middle, 0.32,
> which is what I got with the 4.11-rc5+ kernel.
>
> 3. With only dst (IP) enabled, the pps performance was the maximum, 0.5,
> which is what I got with the 4.10 kernel.
>
> So there is regression in upstream.

Should be: there is no regression in upstream.
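The pktgen.sh script itself is not attached to this bug, so the fragment below is only a sketch of how the dst and dst_mac options discussed above are usually set through the kernel pktgen /proc interface; the interface name, thread number, IP address, MAC address, and packet count are placeholders:

# Sketch of a pktgen run on vm1; eth0, kpktgend_0, 192.168.1.2 and
# 52:54:00:00:00:02 are placeholder values.
modprobe pktgen

pgthread=/proc/net/pktgen/kpktgend_0
echo "rem_device_all" > $pgthread
echo "add_device eth0" > $pgthread

pgdev=/proc/net/pktgen/eth0
echo "count 10000000" > $pgdev
echo "pkt_size 60" > $pgdev
echo "delay 0" > $pgdev
# The comparison above found the highest pps with only "dst" set;
# setting "dst_mac" as well lowered it in that test.
echo "dst 192.168.1.2" > $pgdev
echo "dst_mac 52:54:00:00:00:02" > $pgdev

echo "start" > /proc/net/pktgen/pgctrl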
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842