Description of problem:
When running netperf to multiple KVM guests (RHEL5.3 guests) over a bridged 10GbE connection, I see a huge performance drop when I run to four different guests from four different drivers (one driver per guest). This is compared to running four netperfs to four guests from a single external box. The issue is lessened by turning TSO off on the external driver machines; however, the aggregates are equal.

Version-Release number of selected component (if applicable):

How reproducible:
Every time

Steps to Reproduce:
1. Create and start four KVM guests on a single host, bridge a 10GbE network to each guest, and start netserver in each guest.
2. Run concurrent netperfs to each guest from a single external box and note the aggregate throughput (a sketch of such a script is included under Additional info below).
3. Run a single instance of netperf from each of four external boxes concurrently. Target each box at a unique guest so there is a 1:1 mapping between an external driver and a guest. Record the aggregate throughput.
4. Compare the throughputs; you should find a large difference.

Actual results:

Expected results:

Additional info:
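For reference, a minimal sketch of the single-external-box test from step 2. The actual test script is not attached to this bug, so the guest IPs, the 60-second duration, and the overall script shape are assumptions:

#!/bin/bash
# Hypothetical reconstruction of the "single external driver" case:
# one netperf TCP_STREAM per guest, all started concurrently from this box.
# Guest IPs and duration below are placeholders.
GUESTS="172.17.10.221 172.17.10.222 172.17.10.223 172.17.10.224"
for g in $GUESTS; do
    netperf -H "$g" -t TCP_STREAM -l 60 -P 0 &   # -P 0 suppresses the banner
done
wait

For step 3, each external box would run just one of these netperf invocations against its assigned guest.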
I suppose some data would help. Run four netperfs on perf15, each to a different guest on perf22.

[root@perf15 np2.4]# ./duh_local.sh
 87380  16384  16384    20.01    1688.96  perf15 -> 172.17.10.222
 87380  16384  16384    20.01    2128.39  perf15 -> 172.17.10.223
 87380  16384  16384    20.01    1909.04  perf15 -> 172.17.10.224
 87380  16384  16384    20.01    1756.28  perf15 -> 172.17.10.221

Aggregate is approx. 7482 Mbits/sec

Now run to the same four guests, but drive from four different external boxes.

[root@perf15 np2.4]# ./duh.sh
 87380  16384  16384    60.01    1573.61  baby -> 172.17.10.222
 87380  16384  16384    60.01     694.65  perf10 -> 172.17.10.221
 87380  16384  16384    60.01     704.28  perf21 -> 172.17.10.223
 87380  16384  16384    60.01     848.95  perf3 -> 172.17.10.224

Aggregate is approx. 3822 Mbits/sec

Also note that I can drive two guests from either a single or two external boxes and get roughly equal aggregate performance. My fear is that this problem will scale negatively, getting worse with more external drivers; however, I currently do not have data to prove or disprove that.
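The aggregates above are just the sums of the per-stream throughputs. As a sanity check, a one-liner like the following reproduces them (results.txt is a hypothetical capture of the netperf result lines, with throughput in the fifth column):

# Sum the throughput column (10^6 bits/sec) of the captured result lines
awk '{ sum += $5 } END { printf "Aggregate: %.0f Mbits/sec\n", sum }' results.txt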
Actually, the data in comment #2 was from the host running an upstream kernel, so the problem is there too... Here are some scaling numbers using varying numbers of guests. The data clearly shows a slowdown when running to multiple guests from multiple external drivers. On the flip side, the scaling is actually pretty good when driven from a single driver.

Single driver to a single guest

 87380  16384  16384  1800.02  1501.89  perf15 -> 172.17.10.227
 87380  16384  16384  1800.01  1486.74  perf10 -> 172.17.10.221

Throughputs are pretty close.

Two guests / drivers

 87380  16384  16384  1800.02  1473.15  perf15 -> 172.17.10.227
 87380  16384  16384  1800.02  1473.36  perf15 -> 172.17.10.226
                               -------
                                  2946  driven from one external driver

 87380  16384  16384  1800.02   824.41  baby -> 172.17.10.222
 87380  16384  16384  1800.01   561.69  perf10 -> 172.17.10.221
                               -------
                                  1358  driven from two external systems

With two drivers we are already down to about 50%.

Three guests / drivers

 87380  16384  16384  1800.02  1429.59  perf15 -> 172.17.10.225
 87380  16384  16384  1800.02  1415.83  perf15 -> 172.17.10.227
 87380  16384  16384  1800.02  1431.67  perf15 -> 172.17.10.226
                               -------
                                  4277  driven from one external driver

 87380  16384  16384  1800.02   572.90  baby -> 172.17.10.222
 87380  16384  16384  1800.03   508.15  perf10 -> 172.17.10.221
 87380  16384  16384  1800.03   441.48  perf21 -> 172.17.10.223
                               -------
                                  1522  driven from three external systems

With three drivers it's almost down to 33%.

Four guests / drivers

 87380  16384  16384  1800.01  1362.66  perf15 -> 172.17.10.226
 87380  16384  16384  1800.01  1368.60  perf15 -> 172.17.10.225
 87380  16384  16384  1800.02  1378.92  perf15 -> 172.17.10.224
 87380  16384  16384  1800.02  1342.05  perf15 -> 172.17.10.227
                               -------
                                  5452  driven from one external driver

 87380  16384  16384  1800.03   419.36  perf10 -> 172.17.10.221
 87380  16384  16384  1800.05   455.94  baby -> 172.17.10.222
 87380  16384  16384  1800.02   407.48  perf21 -> 172.17.10.223
 87380  16384  16384  1800.01   507.50  perf3 -> 172.17.10.224
                               -------
                                  1789  driven from four external systems

With four drivers it is still about 33%.

Five guests / drivers

 87380  16384  16384  1800.01  1254.69  perf15 -> 172.17.10.227
 87380  16384  16384  1800.01  1289.23  perf15 -> 172.17.10.223
 87380  16384  16384  1800.02  1280.61  perf15 -> 172.17.10.225
 87380  16384  16384  1800.02  1283.09  perf15 -> 172.17.10.226
 87380  16384  16384  1800.02  1286.37  perf15 -> 172.17.10.224
                               -------
                                  6394  driven from one external driver

 87380  16384  16384  1800.03   373.85  perf15 -> 172.17.10.226
 87380  16384  16384  1800.02   384.43  baby -> 172.17.10.222
 87380  16384  16384  1800.02   408.68  perf10 -> 172.17.10.221
 87380  16384  16384  1800.03   386.65  perf21 -> 172.17.10.223
 87380  16384  16384  1800.01   406.95  perf3 -> 172.17.10.224
                               -------
                                  1961  driven from five external systems

With five drivers the percentages are slowly getting worse.

Six guests / drivers

 87380  16384  16384  1800.01  1210.63  perf15 -> 172.17.10.222
 87380  16384  16384  1800.02  1211.91  perf15 -> 172.17.10.223
 87380  16384  16384  1800.02  1198.98  perf15 -> 172.17.10.224
 87380  16384  16384  1800.02  1205.94  perf15 -> 172.17.10.226
 87380  16384  16384  1800.02  1209.49  perf15 -> 172.17.10.225
 87380  16384  16384  1800.02  1157.73  perf15 -> 172.17.10.227
                               -------
                                  7195  driven from one external driver

 87380  16384  16384  1800.01   316.41  perf15 -> 172.17.10.226
 87380  16384  16384  1800.03   330.22  perf21 -> 172.17.10.223
 87380  16384  16384  1800.02   382.57  perf21 -> 172.17.10.225
 87380  16384  16384  1800.03   350.84  baby -> 172.17.10.222
 87380  16384  16384  1800.02   381.62  perf10 -> 172.17.10.221
 87380  16384  16384  1800.01   350.75  perf3 -> 172.17.10.224
                               -------
                                  2111  driven from six external systems
Mark, can you re-check against a newer host (s/rhel5.3/rhel5.4)? Since you've seen it upstream too, the problem probably still exists. Changing the version to rhel5.4 so it will stay on my radar.
Rechecked with RHEL5.4 Beta1 bits (both guest and host). This includes GRO in the host. The problem is still there.

Multiple external drivers:

 87380  16384  16384    60.01    257.65  perf10 -> 172.17.10.225
 87380  16384  16384    60.01    626.70  perf21 -> 172.17.10.223
 87380  16384  16384    60.02    201.03  perf3 -> 172.17.10.224
 87380  16384  16384    60.03    313.64  baby -> 172.17.10.226

From a single external driver:

 87380  16384  16384    20.01   2362.32  perf15 -> 172.17.10.224
 87380  16384  16384    20.01   2283.06  perf15 -> 172.17.10.225
 87380  16384  16384    20.02   2284.92  perf15 -> 172.17.10.226
 87380  16384  16384    20.01   2327.48  perf15 -> 172.17.10.223

Does this match what you are seeing while trying to debug it?
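For reference, GRO on the host side can be confirmed and toggled with ethtool. A sketch, assuming the 10GbE interface is eth2 (the actual interface name is not recorded in this bug):

# Show current offload settings; look for generic-receive-offload
ethtool -k eth2

# Toggle GRO to rule it in or out
ethtool -K eth2 gro off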
changed target to beta: looks like it will take some time
(In reply to comment #5)
> changed target to beta: looks like it will take some time

I haven't seen your change. Added the rhel5.5? flag.
Mark, I've noticed that when you tested with multiple external drivers the test duration was 60 seconds, while from a single external driver it was 20 seconds. Please test both cases with the same parameters.
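(For what it's worth, the duration is just netperf's -l option, so matching the two cases is a one-flag change; e.g., with a placeholder guest IP:)

# 60-second TCP_STREAM run to one guest
netperf -H 172.17.10.224 -t TCP_STREAM -l 60 -P 0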
I have run both for 60 seconds many times. Please share the test data that indicates that 20 seconds is a problem.
I'm with Mark on this one. I always use 10s tests (call me impatient :) and I see exactly the same thing as Mark, so I highly doubt 20s makes much difference vs. 60s.
Note - Herbert helped identify some issues with our 10Gb/sec switch that were causing some traffic loss, so some of the numbers have changed. However, the problem is clearly still there.

Exec summary
------------
1) Retested on the new switch to confirm the problem was "not just the switch".
2) Tested with and without the TX mitigation code.
   a) Without the TX mitigation code we get roughly a 5% drop in aggregate bandwidth when we drive four guests from four external boxes instead of from a single external box. This is roughly wire speed and also approximates bare-metal performance.
   b) With the TX mitigation code, we see roughly a 2/3 reduction (63%) in aggregate throughput when driving with four external boxes.

Conclusion
----------
The TX mitigation code is causing this huge drop in aggregate throughput.

Supporting Data
---------------
Without TX mitigation, driving from multiple external boxes is at rough parity with driving from a single external box.

First, the multiple external drivers:

[root@perf15 np2.4]# ./duh.sh
 87380  16384  16384    60.00   2186.51  perf21 -> 172.17.10.223
 87380  16384  16384    60.00   2305.79  perf10 -> 172.17.10.225
 87380  16384  16384    60.00   2338.77  perf15 -> 172.17.10.224
 87380  16384  16384    60.00   2127.72  baby -> 172.17.10.226
                                --------
                                 ~8959 Mbits/sec

Now drive the same guests from a single box:

[root@perf15 np2.4]# ./duh_local.sh
 87380  16384  16384    60.01   2339.36  perf15 -> 172.17.10.226
 87380  16384  16384    60.01   2362.50  perf15 -> 172.17.10.224
 87380  16384  16384    60.01   2340.64  perf15 -> 172.17.10.225
 87380  16384  16384    60.01   2370.65  perf15 -> 172.17.10.223
                                --------
                                 ~9413 Mbits/sec

Recompile KVM with TX mitigation back in and we see a big slowdown. This indicates that the TX mitigation is responsible for the loss of throughput. This is basically a repeat of the previous test with the TX mitigation being the only difference.

First, four external boxes to four guests:

[root@perf15 np2.4]# ./duh.sh
 87380  16384  16384    60.01    947.61  perf21 -> 172.17.10.223
 87380  16384  16384    60.03    856.53  perf15 -> 172.17.10.224
 87380  16384  16384    60.01    844.71  baby -> 172.17.10.226
 87380  16384  16384    60.01    787.34  perf10 -> 172.17.10.225
                                --------
                                 ~3436 Mbits/sec

Now four TCP streams from a single external box, each stream to a different guest:

[root@perf15 np2.4]# ./duh_local.sh
 87380  16384  16384    60.01   2341.74  perf15 -> 172.17.10.224
 87380  16384  16384    60.01   2328.32  perf15 -> 172.17.10.225
 87380  16384  16384    60.01   2336.13  perf15 -> 172.17.10.226
 87380  16384  16384    60.01   2351.66  perf15 -> 172.17.10.223
                                --------
                                 ~9358 Mbits/sec

As a point of reference, I repeat the basic test but this time I drive four netperf TCP streams to the host itself. As this is bare-metal performance, there is no TX mitigation in the path. This makes a good set of comparative numbers for the no-TX-mitigation tests.

First, from four different machines to the host:

[root@perf15 np2.4]# ./duh_host.sh
 87380  16384  16384    60.01   2367.07  baby -> 172.17.10.22  -T 2,1
 87380  16384  16384    60.00   2397.21  perf10 -> 172.17.10.22  -T 2,4
 87380  16384  16384    60.00   2247.54  perf15 -> 172.17.10.22  -T 2,3
 87380  16384  16384    60.00   2384.61  perf21 -> 172.17.10.22  -T 2,2
                                --------
                                 ~9396 Mbits/sec

Now from one external box to the host:

[root@perf15 np2.4]# ./duh_local_host.sh
 87380  16384  16384    60.01   2353.88  perf15 -> 172.17.10.22
 87380  16384  16384    60.01   2349.01  perf15 -> 172.17.10.22
 87380  16384  16384    60.01   2354.72  perf15 -> 172.17.10.22
 87380  16384  16384    60.01   2356.31  perf15 -> 172.17.10.22
                                --------
                                 ~9414 Mbits/sec

So there is a clear win in this scenario when we don't use the TX mitigation.
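A note on the "-T 2,1" etc. shown in the host output above: that is netperf's CPU-binding option, which pins the local netperf and the remote netserver to specific CPUs. A single invocation presumably looks something like the following (the host IP and CPU numbers are taken from the output above; the rest of the command line is an assumption):

# Pin the local netperf to CPU 2 and the remote netserver to CPU 1
netperf -H 172.17.10.22 -t TCP_STREAM -l 60 -P 0 -T 2,1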
In fact, the aggregate throughput with mitigation is only about 37% of what it is without it. The question is: given the UDP issues and the poor aggregate throughput with mitigation, does this drop in performance qualify removing the TX mitigation code as a blocker for RHEL5.4?
What about a simple hack: detect whether the TX traffic is using GSO and cancel the TX mitigation timer for a short while? In the future we'll have some major rewrites when we migrate to the in-kernel virtio host implementation.
Turning off tx mitigation would be even simpler. Do we have data that TX mitigation helps under some workload?
re: comment #12

From a pure throughput perspective, TX mitigation does help with small-message TCP data. I am working on pulling all of the data together and should have something posted within 24 hours.
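In the meantime, the small-message case can be approximated with netperf's test-specific -m option, which sets the send message size. A sketch (the guest IP and 256-byte size are placeholders):

# Small sends stress the per-packet path, which is where TX mitigation
# batching is expected to help.
netperf -H 172.17.10.224 -t TCP_STREAM -l 60 -P 0 -- -m 256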
The patch for BZ 504647 resolves this issue as well. The performance data is listed in previous comments (#10). Even though the failure mode is quite different from BZ 504647, I am marking this as a duplicate of that bug since it has the same fix.

*** This bug has been marked as a duplicate of bug 504647 ***