Description of problem:
When running netperf to multiple KVM guests (RHEL5.3 guests) over a bridged 10GbE connection, I see a huge performance drop when I run to four different guests from four different drivers (one driver per guest). This is compared to running four netperfs to four guests from a single external box. The issue is lessened by turning TSO off on the external driver machines; however, the aggregates are equal.

Version-Release number of selected component (if applicable):

How reproducible:
Every time

Steps to Reproduce:
1. Create and start four KVM guests on a single host, bridge a 10GbE network to each guest, and start netserver in each guest.
2. Run concurrent netperfs to each guest from a single external box and note the aggregate throughput (a sketch of such a script is included under Additional info below).
3. Run a single instance of netperf from each of four external boxes concurrently. Target each box at a unique guest so there is a 1:1 mapping between an external driver and a guest. Record the aggregate throughput.
4. Compare the throughputs; you should find a large difference.

Actual results:

Expected results:

Additional info:
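For reference, a minimal sketch of the single-external-box test from step 2. The actual test script is not attached to this bug, so the guest IPs, the 60-second duration, and the overall script shape are assumptions:

#!/bin/bash
# Hypothetical reconstruction of the "single external driver" case:
# one netperf TCP_STREAM per guest, all started concurrently from this box.
# Guest IPs and duration below are placeholders.
GUESTS="172.17.10.221 172.17.10.222 172.17.10.223 172.17.10.224"
for g in $GUESTS; do
    netperf -H "$g" -t TCP_STREAM -l 60 -P 0 &   # -P 0 suppresses the banner
done
wait

For step 3, each external box would run just one of these netperf invocations against its assigned guest.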
I suppose some data would help. Run four netperfs on perf15, each to a different guest on perf22.

[root@perf15 np2.4]# ./duh_local.sh
 87380  16384  16384    20.01    1688.96  perf15 -> 172.17.10.222
 87380  16384  16384    20.01    2128.39  perf15 -> 172.17.10.223
 87380  16384  16384    20.01    1909.04  perf15 -> 172.17.10.224
 87380  16384  16384    20.01    1756.28  perf15 -> 172.17.10.221

Aggregate is approx. 7482 Mbits/sec

Now run to the same four guests, but drive from four different external boxes.

[root@perf15 np2.4]# ./duh.sh
 87380  16384  16384    60.01    1573.61  baby -> 172.17.10.222
 87380  16384  16384    60.01     694.65  perf10 -> 172.17.10.221
 87380  16384  16384    60.01     704.28  perf21 -> 172.17.10.223
 87380  16384  16384    60.01     848.95  perf3 -> 172.17.10.224

Aggregate is approx. 3822 Mbits/sec

Also note that I can drive two guests from either a single or two external boxes and get roughly equal aggregate performance. My fear is that this problem will scale negatively, getting worse with more external drivers; however, I currently do not have data to prove or disprove that.
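The aggregates above are just the sums of the per-stream throughputs. As a sanity check, a one-liner like the following reproduces them (results.txt is a hypothetical capture of the netperf result lines, with throughput in the fifth column):

# Sum the throughput column (10^6 bits/sec) of the captured result lines
awk '{ sum += $5 } END { printf "Aggregate: %.0f Mbits/sec\n", sum }' results.txt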
Actually, the data in comment #2 was from the host running an upstream kernel, so the problem is there too... Here are some scaling numbers using varying numbers of guests. The data clearly shows a slowdown when running to multiple guests from multiple external drivers. On the flip side, the scaling is actually pretty good when driven from a single driver.

Single driver to a single guest

 87380  16384  16384  1800.02  1501.89  perf15 -> 172.17.10.227
 87380  16384  16384  1800.01  1486.74  perf10 -> 172.17.10.221

Throughputs are pretty close.

Two guests / drivers

 87380  16384  16384  1800.02  1473.15  perf15 -> 172.17.10.227
 87380  16384  16384  1800.02  1473.36  perf15 -> 172.17.10.226
                               -------
                                  2946  driven from one external driver

 87380  16384  16384  1800.02   824.41  baby -> 172.17.10.222
 87380  16384  16384  1800.01   561.69  perf10 -> 172.17.10.221
                               -------
                                  1358  driven from two external systems

With two drivers we are already down to about 50%.

Three guests / drivers

 87380  16384  16384  1800.02  1429.59  perf15 -> 172.17.10.225
 87380  16384  16384  1800.02  1415.83  perf15 -> 172.17.10.227
 87380  16384  16384  1800.02  1431.67  perf15 -> 172.17.10.226
                               -------
                                  4277  driven from one external driver

 87380  16384  16384  1800.02   572.90  baby -> 172.17.10.222
 87380  16384  16384  1800.03   508.15  perf10 -> 172.17.10.221
 87380  16384  16384  1800.03   441.48  perf21 -> 172.17.10.223
                               -------
                                  1522  driven from three external systems

With three drivers it's almost down to 33%.

Four guests / drivers

 87380  16384  16384  1800.01  1362.66  perf15 -> 172.17.10.226
 87380  16384  16384  1800.01  1368.60  perf15 -> 172.17.10.225
 87380  16384  16384  1800.02  1378.92  perf15 -> 172.17.10.224
 87380  16384  16384  1800.02  1342.05  perf15 -> 172.17.10.227
                               -------
                                  5452  driven from one external driver

 87380  16384  16384  1800.03   419.36  perf10 -> 172.17.10.221
 87380  16384  16384  1800.05   455.94  baby -> 172.17.10.222
 87380  16384  16384  1800.02   407.48  perf21 -> 172.17.10.223
 87380  16384  16384  1800.01   507.50  perf3 -> 172.17.10.224
                               -------
                                  1789  driven from four external systems

With four drivers it is still about 33%.

Five guests / drivers

 87380  16384  16384  1800.01  1254.69  perf15 -> 172.17.10.227
 87380  16384  16384  1800.01  1289.23  perf15 -> 172.17.10.223
 87380  16384  16384  1800.02  1280.61  perf15 -> 172.17.10.225
 87380  16384  16384  1800.02  1283.09  perf15 -> 172.17.10.226
 87380  16384  16384  1800.02  1286.37  perf15 -> 172.17.10.224
                               -------
                                  6394  driven from one external driver

 87380  16384  16384  1800.03   373.85  perf15 -> 172.17.10.226
 87380  16384  16384  1800.02   384.43  baby -> 172.17.10.222
 87380  16384  16384  1800.02   408.68  perf10 -> 172.17.10.221
 87380  16384  16384  1800.03   386.65  perf21 -> 172.17.10.223
 87380  16384  16384  1800.01   406.95  perf3 -> 172.17.10.224
                               -------
                                  1961  driven from five external systems

With five drivers the percentages are slowly getting worse.

Six guests / drivers

 87380  16384  16384  1800.01  1210.63  perf15 -> 172.17.10.222
 87380  16384  16384  1800.02  1211.91  perf15 -> 172.17.10.223
 87380  16384  16384  1800.02  1198.98  perf15 -> 172.17.10.224
 87380  16384  16384  1800.02  1205.94  perf15 -> 172.17.10.226
 87380  16384  16384  1800.02  1209.49  perf15 -> 172.17.10.225
 87380  16384  16384  1800.02  1157.73  perf15 -> 172.17.10.227
                               -------
                                  7195  driven from one external driver

 87380  16384  16384  1800.01   316.41  perf15 -> 172.17.10.226
 87380  16384  16384  1800.03   330.22  perf21 -> 172.17.10.223
 87380  16384  16384  1800.02   382.57  perf21 -> 172.17.10.225
 87380  16384  16384  1800.03   350.84  baby -> 172.17.10.222
 87380  16384  16384  1800.02   381.62  perf10 -> 172.17.10.221
 87380  16384  16384  1800.01   350.75  perf3 -> 172.17.10.224
                               -------
                                  2111  driven from six external systems
Mark, can you re-check against a newer host (s/rhel5.3/rhel5.4)? Since you've seen it upstream too, the problem probably still exists. Changing the version to rhel5.4 so it will stay on my radar.
Rechecked with RHEL5.4 Beta1 bits (both guest and host). This includes GRO in the host. The problem is still there.

Multiple external drivers:

 87380  16384  16384    60.01    257.65  perf10 -> 172.17.10.225
 87380  16384  16384    60.01    626.70  perf21 -> 172.17.10.223
 87380  16384  16384    60.02    201.03  perf3 -> 172.17.10.224
 87380  16384  16384    60.03    313.64  baby -> 172.17.10.226

From a single external driver:

 87380  16384  16384    20.01   2362.32  perf15 -> 172.17.10.224
 87380  16384  16384    20.01   2283.06  perf15 -> 172.17.10.225
 87380  16384  16384    20.02   2284.92  perf15 -> 172.17.10.226
 87380  16384  16384    20.01   2327.48  perf15 -> 172.17.10.223

Does this match what you are seeing while trying to debug it?
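For reference, GRO on the host side can be confirmed and toggled with ethtool. A sketch, assuming the 10GbE interface is eth2 (the actual interface name is not recorded in this bug):

# Show current offload settings; look for generic-receive-offload
ethtool -k eth2

# Toggle GRO to rule it in or out
ethtool -K eth2 gro off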
changed target to beta: looks like it will take some time
(In reply to comment #5)
> changed target to beta: looks like it will take some time

I haven't seen your change. Added the rhel5.5? flag.
Mark, I've noticed that when you tested with multiple external drivers the test duration was 60 seconds, while from a single external driver it was 20 seconds. Please test both cases with the same parameters.
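(For what it's worth, the duration is just netperf's -l option, so matching the two cases is a one-flag change; e.g., with a placeholder guest IP:)

# 60-second TCP_STREAM run to one guest
netperf -H 172.17.10.224 -t TCP_STREAM -l 60 -P 0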
I have run both for 60 seconds many times. Please share the test data that indicates that 20 seconds is a problem.
I'm with Mark on this one. I always use 10s tests (call me impatient :) and I see exactly the same thing as Mark, so I highly doubt 20s makes much difference vs. 60s.
Note - Herbert helped identify some issues with our 10Gb/sec switch that were causing some traffic loss, so some of the numbers have changed. However, the problem is clearly still there.

Exec summary
------------
1) Retested on the new switch to confirm the problem was "not just the switch".
2) Tested with and without the TX mitigation code.
   a) Without the TX mitigation code we get roughly a 5% drop in aggregate bandwidth when we drive four guests from four external boxes instead of from a single external box. This is roughly wire speed and also approximates bare-metal performance.
   b) With the TX mitigation code, we see roughly a 2/3 reduction (63%) in aggregate throughput when driving with four external boxes.

Conclusion
----------
The TX mitigation code is causing this huge drop in aggregate throughput.

Supporting Data
---------------
Without TX mitigation, driving from multiple external boxes is at rough parity with driving from a single external box.

First, the multiple external drivers:

[root@perf15 np2.4]# ./duh.sh
 87380  16384  16384    60.00   2186.51  perf21 -> 172.17.10.223
 87380  16384  16384    60.00   2305.79  perf10 -> 172.17.10.225
 87380  16384  16384    60.00   2338.77  perf15 -> 172.17.10.224
 87380  16384  16384    60.00   2127.72  baby -> 172.17.10.226
                                --------
                                 ~8959 Mbits/sec

Now drive the same guests from a single box:

[root@perf15 np2.4]# ./duh_local.sh
 87380  16384  16384    60.01   2339.36  perf15 -> 172.17.10.226
 87380  16384  16384    60.01   2362.50  perf15 -> 172.17.10.224
 87380  16384  16384    60.01   2340.64  perf15 -> 172.17.10.225
 87380  16384  16384    60.01   2370.65  perf15 -> 172.17.10.223
                                --------
                                 ~9413 Mbits/sec

Recompile KVM with TX mitigation back in and we see a big slowdown. This indicates that the TX mitigation is responsible for the loss of throughput. This is basically a repeat of the previous test with the TX mitigation being the only difference.

First, four external boxes to four guests:

[root@perf15 np2.4]# ./duh.sh
 87380  16384  16384    60.01    947.61  perf21 -> 172.17.10.223
 87380  16384  16384    60.03    856.53  perf15 -> 172.17.10.224
 87380  16384  16384    60.01    844.71  baby -> 172.17.10.226
 87380  16384  16384    60.01    787.34  perf10 -> 172.17.10.225
                                --------
                                 ~3436 Mbits/sec

Now four TCP streams from a single external box, each stream to a different guest:

[root@perf15 np2.4]# ./duh_local.sh
 87380  16384  16384    60.01   2341.74  perf15 -> 172.17.10.224
 87380  16384  16384    60.01   2328.32  perf15 -> 172.17.10.225
 87380  16384  16384    60.01   2336.13  perf15 -> 172.17.10.226
 87380  16384  16384    60.01   2351.66  perf15 -> 172.17.10.223
                                --------
                                 ~9358 Mbits/sec

As a point of reference, I repeat the basic test but this time I drive four netperf TCP streams to the host itself. As this is bare-metal performance, there is no TX mitigation in the path. This makes a good set of comparative numbers for the no-TX-mitigation tests.

First, from four different machines to the host:

[root@perf15 np2.4]# ./duh_host.sh
 87380  16384  16384    60.01   2367.07  baby -> 172.17.10.22  -T 2,1
 87380  16384  16384    60.00   2397.21  perf10 -> 172.17.10.22  -T 2,4
 87380  16384  16384    60.00   2247.54  perf15 -> 172.17.10.22  -T 2,3
 87380  16384  16384    60.00   2384.61  perf21 -> 172.17.10.22  -T 2,2
                                --------
                                 ~9396 Mbits/sec

Now from one external box to the host:

[root@perf15 np2.4]# ./duh_local_host.sh
 87380  16384  16384    60.01   2353.88  perf15 -> 172.17.10.22
 87380  16384  16384    60.01   2349.01  perf15 -> 172.17.10.22
 87380  16384  16384    60.01   2354.72  perf15 -> 172.17.10.22
 87380  16384  16384    60.01   2356.31  perf15 -> 172.17.10.22
                                --------
                                 ~9414 Mbits/sec

So there is a clear win in this scenario when we don't use the TX mitigation.
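A note on the "-T 2,1" etc. shown in the host output above: that is netperf's CPU-binding option, which pins the local netperf and the remote netserver to specific CPUs. A single invocation presumably looks something like the following (the host IP and CPU numbers are taken from the output above; the rest of the command line is an assumption):

# Pin the local netperf to CPU 2 and the remote netserver to CPU 1
netperf -H 172.17.10.22 -t TCP_STREAM -l 60 -P 0 -T 2,1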
In fact, the aggregate throughput with mitigation is only about 37% of what it is without it. The question is: given the UDP issues and the poor aggregate throughput with mitigation, does this drop in performance qualify removing the TX mitigation code as a blocker for RHEL5.4?
What about a simple hack: detect whether the TX traffic is using GSO and cancel the TX mitigation timer for a short while? In the future we'll have some major rewrites when we migrate to the in-kernel virtio host implementation.
Turning off tx mitigation would be even simpler. Do we have data that TX mitigation helps under some workload?
re: comment #12

From a pure throughput perspective, TX mitigation does help with small-message TCP data. I am working on pulling all of the data together and should have something posted within 24 hours.
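In the meantime, the small-message case can be approximated with netperf's test-specific -m option, which sets the send message size. A sketch (the guest IP and 256-byte size are placeholders):

# Small sends stress the per-packet path, which is where TX mitigation
# batching is expected to help.
netperf -H 172.17.10.224 -t TCP_STREAM -l 60 -P 0 -- -m 256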
The patch for BZ 504647 resolves this issue as well. The performance data is listed in previous comments (#10). Even though the failure mode is quite different from BZ 504647, I am marking this as a duplicate of that bug since it has the same fix.

*** This bug has been marked as a duplicate of bug 504647 ***