Bug 488882

Summary: cxgb3 driver very slow under Xen with HW acceleration enabled
Product: Red Hat Enterprise Linux 5 Reporter: Mark Wagner <mwagner>
Component: kernel-xen Assignee: Paolo Bonzini <pbonzini>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.3 CC: agospoda, clalance, divy, herbert.xu, indranil, leiwang, mjenner, mwagner, pbonzini, peterm, xen-maint, yuzhang
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2011-01-13 20:46:36 UTC Type: ---
Bug Depends On:    
Bug Blocks: 514490    
Attachments:
Description                                  Flags
Tx credit return management for Dom0's Xen   none

Description Mark Wagner 2009-03-06 02:37:59 UTC
Description of problem:

When a Xen guest is bridged through a Chelsio 10GbE card using the cxgb3 driver, transmit performance is degraded by roughly two orders of magnitude (44 Mb/sec instead of about 5000 Mb/sec) unless the TX assist is disabled on the host.
 
Version-Release number of selected component (if applicable):


How reproducible:

Every time

Steps to Reproduce:
1. Build a RHEL 5.3 Xen guest on a system with the Chelsio card and create a bridge to the 10GbE network.

2. From the guest, run netperf to an external box. Observe the results.

3. On the host, disable the TX assist using ethtool (ethtool -K ethX tx off).

4. Repeat step 2 and compare the results of step 2 and step 4.
  
Actual results:

44 Mb/sec

Expected results:
~5000 Mb/sec

Additional info:

[root@perf10 ~]# ethtool -i peth2
driver: cxgb3
version: 1.0-ko
firmware-version: T 6.0.0 TP 1.1.0
bus-info: 0000:01:00.0

[root@perf10 ~]# ethtool -k peth2
Offload parameters for peth2:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off

From guest
[root@dhcp47-134 np2.4]# ./netperf -l 15 -H 172.17.10.15
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.10.15 (172.17.10.15) port 0 AF_INET : spin interval : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    16.16      44.70   

Now use ethtool to turn the assist off.
[root@perf10 ~]# ethtool -K peth2 tx off
[root@perf10 ~]# ethtool -k peth2
Offload parameters for peth2:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

[root@dhcp47-134 np2.4]# ./netperf -l 15 -H 172.17.10.15
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.10.15 (172.17.10.15) port 0 AF_INET : spin interval : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    15.00    5079.34
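
For comparing the two runs, the throughput figure can be pulled out of the netperf output programmatically. This is only a minimal sketch: it assumes the result row is the only line made up of exactly five numeric fields, as in the output above, and `netperf_throughput` is a hypothetical helper, not something netperf provides.

```python
def netperf_throughput(output: str) -> float:
    """Return the throughput (10^6 bits/sec) from netperf TCP STREAM output.

    Assumption: the result row is the only line with exactly five numeric
    fields (recv socket size, send socket size, message size, elapsed
    seconds, throughput).
    """
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue
        try:
            values = [float(f) for f in fields]
        except ValueError:
            continue  # header row such as "bytes  bytes  bytes  secs.  ..."
        return values[-1]
    raise ValueError("no netperf result row found")


HEADER = (
    "Recv   Send    Send\n"
    "Socket Socket  Message  Elapsed\n"
    "Size   Size    Size     Time     Throughput\n"
    "bytes  bytes   bytes    secs.    10^6bits/sec\n"
    "\n"
)

# Result rows copied from the two runs in this report.
tx_on = netperf_throughput(HEADER + " 87380  16384  16384    16.16      44.70\n")
tx_off = netperf_throughput(HEADER + " 87380  16384  16384    15.00    5079.34\n")

print(f"tx on : {tx_on:.2f} Mb/sec")
print(f"tx off: {tx_off:.2f} Mb/sec")
print(f"slowdown with TX offload enabled: about {tx_off / tx_on:.0f}x")
```

The ratio between the two runs is a little over 100x, i.e. about two orders of magnitude.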

Comment 1 Andy Gospodarek 2009-03-06 13:32:20 UTC
Any statistics about drops or invalid checksums after this run?  I'm guessing that if you look at the stats for the device before and after the run you would see a lot more than 87380 bytes received.  Do you have this setup somewhere where I can take a look?

Comment 2 Mark Wagner 2009-03-06 17:39:37 UTC
A before and after

[root@perf10 ~]# ethtool -k peth3
Offload parameters for peth3:
Cannot get device rx csum settings: No such device
Cannot get device tx csum settings: No such device
Cannot get device scatter-gather settings: No such device
Cannot get device tcp segmentation offload settings: No such device
Cannot get device udp large send offload settings: No such device
Cannot get device generic segmentation offload settings: No such device
no offload info available
[root@perf10 ~]# ethtool -k peth2
Offload parameters for peth2:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
[root@perf10 ~]# ethtool -S peth2
NIC statistics:
     TxOctetsOK         : 134044256554
     TxFramesOK         : 107255105
     TxMulticastFramesOK: 262
     TxBroadcastFramesOK: 56
     TxPauseFrames      : 0
     TxUnderrun         : 0
     TxExtUnderrun      : 0
     TxFrames64         : 87
     TxFrames65To127    : 19544570
     TxFrames128To255   : 17101
     TxFrames256To511   : 59350
     TxFrames512To1023  : 427272
     TxFrames1024To1518 : 87206725
     TxFrames1519ToMax  : 0
     RxOctetsOK         : 79921913968
     RxFramesOK         : 79557155
     RxMulticastFramesOK: 35673
     RxBroadcastFramesOK: 1184
     RxPauseFrames      : 0
     RxFCSErrors        : 0
     RxSymbolErrors     : 0
     RxShortErrors      : 0
     RxJabberErrors     : 0
     RxLengthErrors     : 0
     RxFIFOoverflow     : 0
     RxFrames64         : 91
     RxFrames65To127    : 28118586
     RxFrames128To255   : 1410
     RxFrames256To511   : 5334
     RxFrames512To1023  : 175308
     RxFrames1024To1518 : 51256426
     RxFrames1519ToMax  : 0
     PhyFIFOErrors      : 0
     TSO                : 7374
     VLANextractions    : 0
     VLANinsertions     : 0
     TxCsumOffload      : 7707
     RxCsumGood         : 79520496
     LroAggregated      : 0
     LroFlushed         : 0
     LroNoDesc          : 0
     RxDrops            : 0
     CheckTXEnToggled   : 0
     CheckResets        : 0
[root@perf10 ~]# ethtool -K peth2 tx on sg on tso on

Jump on guest, run netperf
[root@dhcp47-134 np2.4]# ./netperf -l 15 -H 172.17.10.15
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.10.15 (172.17.10.15) port 0 AF_INET : spin interval : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    15.17     198.53   


[root@perf10 ~]# ethtool -S peth2
NIC statistics:
     TxOctetsOK         : 134438894870
     TxFramesOK         : 107515094
     TxMulticastFramesOK: 262
     TxBroadcastFramesOK: 57
     TxPauseFrames      : 0
     TxUnderrun         : 0
     TxExtUnderrun      : 0
     TxFrames64         : 88
     TxFrames65To127    : 19544582
     TxFrames128To255   : 17102
     TxFrames256To511   : 59352
     TxFrames512To1023  : 427275
     TxFrames1024To1518 : 87466695
     TxFrames1519ToMax  : 0
     RxOctetsOK         : 79924285765
     RxFramesOK         : 79590973
     RxMulticastFramesOK: 35726
     RxBroadcastFramesOK: 1184
     RxPauseFrames      : 0
     RxFCSErrors        : 0
     RxSymbolErrors     : 0
     RxShortErrors      : 0
     RxJabberErrors     : 0
     RxLengthErrors     : 0
     RxFIFOoverflow     : 0
     RxFrames64         : 93
     RxFrames65To127    : 28152397
     RxFrames128To255   : 1412
     RxFrames256To511   : 5337
     RxFrames512To1023  : 175308
     RxFrames1024To1518 : 51256426
     RxFrames1519ToMax  : 0
     PhyFIFOErrors      : 0
     TSO                : 13380
     VLANextractions    : 0
     VLANinsertions     : 0
     TxCsumOffload      : 13875
     RxCsumGood         : 79554260
     LroAggregated      : 0
     LroFlushed         : 0
     LroNoDesc          : 0
     RxDrops            : 0
     CheckTXEnToggled   : 0
     CheckResets        : 0
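
The two `ethtool -S` snapshots above can also be compared counter by counter. The following is a minimal sketch, assuming the `name : value` layout shown above; `parse_stats` and `delta` are hypothetical helpers, and only a handful of counters from the snapshots are reproduced. The derived rate is a rough cross-check only, because the counters cover all traffic on peth2 between the two snapshots, not just the netperf run.

```python
def parse_stats(text: str) -> dict:
    """Parse `ethtool -S` style output into {counter_name: int}."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():  # skips the "NIC statistics:" header
            stats[name.strip()] = int(value)
    return stats


def delta(before: dict, after: dict) -> dict:
    """Return only the counters whose value changed between snapshots."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] != before[name]
    }


# A few counters copied from the two snapshots in this comment.
BEFORE = """NIC statistics:
     TxOctetsOK         : 134044256554
     TxFramesOK         : 107255105
     TSO                : 7374
     TxCsumOffload      : 7707
     RxDrops            : 0
"""
AFTER = """NIC statistics:
     TxOctetsOK         : 134438894870
     TxFramesOK         : 107515094
     TSO                : 13380
     TxCsumOffload      : 13875
     RxDrops            : 0
"""

changed = delta(parse_stats(BEFORE), parse_stats(AFTER))
for name, diff in changed.items():
    print(f"{name:<20} +{diff}")

elapsed = 15.17  # seconds, elapsed time reported by netperf for this run
wire_mbps = changed["TxOctetsOK"] * 8 / elapsed / 1e6
print(f"approximate wire rate: {wire_mbps:.1f} Mb/sec")
```

For this run the TxOctetsOK delta works out to roughly 208 Mb/sec on the wire, which is in line with the 198.53 Mb/sec that netperf reported, and RxDrops does not move.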


Here is a diff of the two outputs from a second, similar run:
 [root@perf10 ~]# diff -w bz1.txt bz2.txt
2,3c2,3
<      TxOctetsOK         : 134438894870
<      TxFramesOK         : 107515094
---
>      TxOctetsOK         : 134770288502
>      TxFramesOK         : 107733415
5c5
<      TxBroadcastFramesOK: 57
---
>      TxBroadcastFramesOK: 58
9,10c9,10
<      TxFrames64         : 88
<      TxFrames65To127    : 19544582
---
>      TxFrames64         : 89
>      TxFrames65To127    : 19544591
12,14c12,14
<      TxFrames256To511   : 59352
<      TxFrames512To1023  : 427275
<      TxFrames1024To1518 : 87466695
---
>      TxFrames256To511   : 59353
>      TxFrames512To1023  : 427277
>      TxFrames1024To1518 : 87685003
16,18c16,18
<      RxOctetsOK         : 79924299370
<      RxFramesOK         : 79591070
<      RxMulticastFramesOK: 35817
---
>      RxOctetsOK         : 79926309990
>      RxFramesOK         : 79619767
>      RxMulticastFramesOK: 35829
27,28c27,28
<      RxFrames64         : 93
<      RxFrames65To127    : 28152485
---
>      RxFrames64         : 94
>      RxFrames65To127    : 28181179
30c30
<      RxFrames256To511   : 5343
---
>      RxFrames256To511   : 5345
35c35
<      TSO                : 13380
---
>      TSO                : 18338
38,39c38,39
<      TxCsumOffload      : 13875
<      RxCsumGood         : 79554260
---
>      TxCsumOffload      : 18959
>      RxCsumGood         : 79582944

Comment 3 Herbert Xu 2009-03-27 03:37:58 UTC
Mark, can you try just turning TSO off without turning tx checksum offload off too? Thanks!

Comment 4 Mark Wagner 2009-04-14 13:47:09 UTC
Herbert, I tried with TSO off and there were "spurts" of traffic but not a decent flow. The average throughput was less than half of what I was able to get with tx turned off as well.

Comment 5 Herbert Xu 2009-04-15 11:25:31 UTC
OK, please set up a machine with a Xen guest running and give me remote access so I can try to debug this.  Thanks!

Comment 7 Mark Wagner 2009-07-01 13:45:24 UTC
I think that giving Herbert access to the machine fulfilled the need for the needinfo flag.

Comment 10 Andy Gospodarek 2009-07-06 20:49:50 UTC
Divy, have you done much testing with Xen?

One opinion is that the way in which you free skbs (noted in the large comment in t3_eth_xmit) means that bursts like this will be expected when used in conjunction with virtualization (due to the limited number of pages available that need to be re-used quickly).  Do you have any thoughts about trying to add a tx-completion interrupt to try and address this?

Comment 11 Divy Le Ray 2009-07-08 04:48:55 UTC
Created attachment 350893 [details]
Tx credit return management for Dom0's Xen

Hi Andy,

We've not done much Xen testing in a RHEL context. However, we ship our driver in both Citrix's XenServer and VMware's ESX, and we have hit this kind of performance degradation there. We did not correlate it with tx hw assist, but we've root caused it. It points to the opinion you mention :)

In all the virtualized environments we have tested, the VM application's send buffer is released only when the hypervisor's driver frees the corresponding skb.
However, cxgb3 does not free a TX skb on DMA completion.
The driver relies on FW-generated credit returns posted on the receive control queues.

In non-virtualized environments, the driver programs the HW to coalesce these credit returns to minimize the FW management load, and relies on skb_orphan() to free up space in the application's send buffer. skbs are freed when credit returns are received.
This does not work for VMs: skb_orphan() won't free up space in the virtualized application's send buffer.

The attached patch provides a much more aggressive credit return policy, and has solved our perf issues on other virtualized platforms.

Cheers,
Divy
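
Illustration (not part of the attached patch): the stall described in this comment can be reproduced with a toy model. This is a sketch under stated assumptions, not cxgb3 code. The names `window`, `coalesce` and `flush_ticks` are hypothetical, the NIC is assumed to complete one packet per tick, and a periodic flush is assumed so that the model cannot deadlock.

```python
def simulate(window: int, coalesce: int, flush_ticks: int, ticks: int) -> int:
    """Count packets whose buffers were released back to the sender.

    window      -- packets the guest may have outstanding (hypothetical)
    coalesce    -- completions gathered before a credit return is posted
    flush_ticks -- assumed periodic flush of pending credit returns
    """
    in_flight = 0    # handed to the NIC, DMA not yet complete
    unreturned = 0   # DMA complete, skb still held (no credit return yet)
    released = 0
    since_return = 0
    for _ in range(ticks):
        # The guest refills its window; a buffer stays accounted for
        # until the host driver frees the corresponding skb.
        while in_flight + unreturned < window:
            in_flight += 1
        # The NIC completes at most one packet per tick.
        if in_flight:
            in_flight -= 1
            unreturned += 1
        since_return += 1
        # A credit return frees every completed skb at once.
        if unreturned >= coalesce or since_return >= flush_ticks:
            released += unreturned
            unreturned = 0
            since_return = 0
    return released


# Small guest window, coalesced credit returns: the guest stalls.
coalesced = simulate(window=4, coalesce=32, flush_ticks=100, ticks=1000)
# Same window, one credit return per packet: the guest keeps sending.
per_packet = simulate(window=4, coalesce=1, flush_ticks=100, ticks=1000)

print(f"coalesced credit returns : {coalesced} packets released")
print(f"per-packet credit returns: {per_packet} packets released")
```

With these illustrative numbers the small window never reaches the coalescing threshold, so buffers come back only on the periodic flush, while per-packet credit returns keep the window moving on every tick.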

Comment 12 Andy Gospodarek 2009-07-08 12:31:47 UTC
(In reply to comment #11)

Divy, thanks for the quick response.  I think this could be a nice solution, but I'm curious what the impact would be if we just removed the dependency on CONFIG_XEN and made those changes permanent.  What would the drop in performance be when using a bare-metal kernel?  And what about using KVM?  This would certainly still be a problem in that environment.

Comment 13 Herbert Xu 2009-07-08 12:49:12 UTC
Andy, it shouldn't be a problem for KVM because KVM doesn't do per-page tracking which is a Xen-specific hack.

Comment 14 Andy Gospodarek 2009-07-08 13:58:22 UTC
(In reply to comment #13)
> Andy, it shouldn't be a problem for KVM because KVM doesn't do per-page
> tracking which is a Xen-specific hack.  

Good to know, Herbert.

/me needs some Xen and KVM lessons.  :-)

Comment 15 Divy Le Ray 2009-07-09 04:36:55 UTC
Hi Andy,

Making these changes permanent would have an impact on performance on bare-metal kernels.
Instead of receiving one control packet carrying coalesced credit returns,
you'll receive one per sent packet. That means more pressure on the PCI bus and on the FW.
It is better not to use this configuration if you do not need it.

I'll ask our QA team to start testing KVM on RHEL5.4. I also need to get up to speed on KVM.

Cheers,
Divy
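
Illustration of the trade-off described in this comment: a back-of-the-envelope count of credit-return control messages for a given number of transmitted frames. This is a sketch only. `credit_return_messages` is a hypothetical helper, and the coalescing factor of 32 is an illustrative value, not the driver's actual setting.

```python
import math


def credit_return_messages(packets: int, coalesce: int) -> int:
    """Control messages needed if one is posted per `coalesce` packets."""
    return math.ceil(packets / coalesce)


frames = 259_989  # TxFramesOK delta between the two snapshots in comment 2

for coalesce in (32, 1):  # 32 is an assumed value, 1 means one per packet
    count = credit_return_messages(frames, coalesce)
    print(f"coalesce={coalesce:>2}: {count} credit-return messages")
```

The per-packet policy multiplies the control traffic by the coalescing factor, which is the extra load on the PCI bus and the FW mentioned above.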

Comment 16 Herbert Xu 2009-10-29 18:24:19 UTC
FWIW I'm experimenting with a new TX interrupt mitigation mechanism that will hopefully resolve this problem without creating a different path for Xen.

Comment 18 Paolo Bonzini 2010-06-23 16:40:31 UTC
Herbert, should we go with Divy's patch or wait for your stuff to be complete? Do you have a BZ for your work?

Comment 21 Paolo Bonzini 2010-06-23 22:55:09 UTC
Just as it should not be a problem for KVM, it should not be a problem for PCI passthrough to Xen HVM guests either.

However, I have no idea about PCI passthrough to PV guests.  Those run under CONFIG_XEN so they would use the fix.  Herbert/Divy, would they need it?  My guess is "yes", but I'd like a confirmation.

Comment 22 Herbert Xu 2010-06-23 23:02:51 UTC
PV passthrough doesn't need the fix but I think it should still work, albeit with the same effect as if you'd applied the fix to a normal kernel.  Divy, can you confirm?

Comment 23 Divy Le Ray 2010-06-24 02:24:25 UTC
Yes, the fix does not change the driver's overall behavior; it just requests the HW to indicate TX completions more often than would otherwise be needed.

Comment 24 Paolo Bonzini 2010-06-24 11:37:50 UTC
I think I'll make the change only for dom0 then.

Comment 25 RHEL Program Management 2010-08-04 12:09:50 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 26 Paolo Bonzini 2010-08-04 17:40:50 UTC
Do we know if the bad performance is also visible under KVM?

Comment 27 Paolo Bonzini 2010-08-04 17:41:56 UTC
Herbert mentioned it is not a problem in comment #13.  Still some numbers would be nice to have...

Comment 29 Jarod Wilson 2010-09-21 20:58:56 UTC
in kernel-2.6.18-223.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 31 Lei Wang 2010-12-22 08:49:16 UTC
Hi, Mark

As we do not have a Chelsio 10GbE card on hand, would you please help verify this bug if convenient? Thanks a lot :)

Lei Wang

Comment 32 Mark Wagner 2010-12-23 15:58:38 UTC
I can't help at this point in time.

Comment 36 errata-xmlrpc 2011-01-13 20:46:36 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html