Bug 1026874 - Enabling pvticketlocks causes CPU performance to drop
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm
7.0
x86_64 Linux
urgent Severity high
: rc
: ---
Assigned To: Andrew Jones
Virtualization Bugs
Depends On:
Blocks:
Reported: 2013-11-05 10:06 EST by FuXiangChun
Modified: 2014-02-24 08:51 EST (History)
12 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-02-24 08:51:23 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description FuXiangChun 2013-11-05 10:06:13 EST
Description of problem:
Used the dbench tool to test pvticketlocks performance on Intel and AMD hosts. Throughput falls when pvticketlocks is enabled. QE reproduced this across many dbench runs.

Version-Release number of selected component (if applicable):
qemu-kvm-1.5.3-10.el7.x86_64 
3.10.0-37.el7.x86_64 

How reproducible:
100%

Steps to Reproduce:
1. Boot the guest:
/usr/libexec/qemu-kvm -M q35 -cpu Opteron_G2,+kvm_pv_unhalt
-enable-kvm -m 4096 -smp 4,sockets=1,cores=4,threads=1 ....

2. Inside the guest: mount -t tmpfs -o size=1G tmpfs /mnt

3. Inside the guest: dbench -D /mnt -t 60 32 -c
/usr/local/share/doc/dbench/loadfiles/client.txt

Actual results:
Intel host:

enable pvticketlocks(-cpu SandyBridge,+kvm_pv_unhalt)
Throughput 5252.95 MB/sec  32 clients  32 procs  max_latency=57.459 ms
Throughput 5481.55 MB/sec  32 clients  32 procs  max_latency=37.068 ms
Throughput 5447.84 MB/sec  32 clients  32 procs  max_latency=41.034 ms
Throughput 5490.55 MB/sec  32 clients  32 procs  max_latency=42.116 ms
Throughput 5468.27 MB/sec  32 clients  32 procs  max_latency=39.693 ms

disable pvticketlocks(-cpu SandyBridge,-kvm_pv_unhalt)
Throughput 5529.27 MB/sec  32 clients  32 procs  max_latency=44.026 ms
Throughput 5556.54 MB/sec  32 clients  32 procs  max_latency=42.031 ms
Throughput 5495.86 MB/sec  32 clients  32 procs  max_latency=41.051 ms
Throughput 5560.48 MB/sec  32 clients  32 procs  max_latency=37.887 ms
Throughput 5553.49 MB/sec  32 clients  32 procs  max_latency=44.023 ms

AMD host:

Enable pvticketlocks(-M q35 -cpu Opteron_G2,+kvm_pv_unhalt, or -M q35
-cpu host,+kvm_pv_unhalt)
dbench result:
Throughput 1941.31 MB/sec  32 clients  32 procs  max_latency=76.107 ms
Throughput 1949.63 MB/sec  32 clients  32 procs  max_latency=330.093 ms
Throughput 1956.27 MB/sec  32 clients  32 procs  max_latency=180.038 ms
Throughput 1947.99 MB/sec  32 clients  32 procs  max_latency=72.166 ms
Throughput 1954.01 MB/sec  32 clients  32 procs  max_latency=80.094 ms
Throughput 1950.92 MB/sec  32 clients  32 procs  max_latency=69.060 ms
Throughput 1950.03 MB/sec  32 clients  32 procs  max_latency=73.183 ms
Throughput 1944.33 MB/sec  32 clients  32 procs  max_latency=64.051 ms
Throughput 1957.74 MB/sec  32 clients  32 procs  max_latency=170.082 ms
Throughput 1953.79 MB/sec  32 clients  32 procs  max_latency=65.057 ms

Disable pvticketlocks(-M q35 -cpu Opteron_G2)

Dbench result:
Throughput 2005.98 MB/sec  32 clients  32 procs max_latency=130.086 ms
Throughput 2012.03 MB/sec  32 clients  32 procs max_latency=97.109 ms
Throughput 2035.5 MB/sec  32 clients  32 procs  max_latency=87.768 ms
Throughput 2019.16 MB/sec  32 clients  32 procs max_latency=66.029 ms
Throughput 2020.43 MB/sec  32 clients  32 procs max_latency=133.097 ms
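Since the runs above vary from iteration to iteration, comparing mean throughput is more reliable than eyeballing individual lines. A minimal sketch of doing that with awk (the helper name and temp file path are illustrative, not from the original report):

```shell
# Hypothetical helper: average the Throughput column across saved dbench
# summary lines (dbench prints one "Throughput ... MB/sec" line per run).
avg_throughput() {
    awk '/^Throughput/ {sum += $2; n++} END {printf "%.1f\n", sum / n}' "$@"
}

# Example with two of the "disabled" runs captured above:
printf 'Throughput 2005.98 MB/sec\nThroughput 2012.03 MB/sec\n' > /tmp/dbench-runs.txt
avg_throughput /tmp/dbench-runs.txt
```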

Expected results:
Enabling pvticketlocks should improve CPU performance, not regress it.

Additional info:
Comment 2 FuXiangChun 2013-12-26 20:10:16 EST
Re-tested this feature's performance with the latest kernel, 3.10.0-64.el7.x86_64 (guest and host), and qemu-kvm-1.5.3-30.el7.x86_64.

steps:
same as comment 0

a)Enable pvticketlocks
result:
Throughput 2187.92 MB/sec  32 clients  32 procs  max_latency=219.047 ms

b)disable pvticketlocks
Throughput 2229 MB/sec  32 clients  32 procs  max_latency=219.078 ms

So the issue can still be reproduced.
Comment 5 FuXiangChun 2014-01-08 05:35:00 EST
QE tested 4 scenarios again, with and without +kvm_pv_unhalt.

Scenario 1: boot one guest; vcpu count == host cpu count
cli:-cpu SandyBridge,+kvm_pv_unhalt -m 4096 -smp 8,sockets=2,cores=2,threads=2,maxcpus=160

1.1 with +kvm_pv_unhalt
result
Throughput 5869.41 MB/sec  32 clients  32 procs  max_latency=110.400 ms

1.2 w/o kvm_pv_unhalt
result:
Throughput 5969.32 MB/sec  32 clients  32 procs  max_latency=134.305 ms

Scenario 2: boot one guest; vcpu count = 1/2 host cpu count
cli:-cpu SandyBridge,+kvm_pv_unhalt -m 4096 -smp 4,sockets=2,cores=2,threads=1,maxcpus=160

2.1 with +kvm_pv_unhalt
result:
Throughput 4878.82 MB/sec  32 clients  32 procs  max_latency=219.040 ms

2.2 w/o kvm_pv_unhalt
result:
Throughput 4945.82 MB/sec  32 clients  32 procs  max_latency=219.025 ms

Scenario 3: boot 2 guests, each with vcpu count == host cpu count (total vcpus = 2 * host cpus)
cli:-M pc -cpu SandyBridge,+kvm_pv_unhalt -enable-kvm -m 4096 -smp 8,sockets=2,cores=2,threads=2,maxcpus=160

3.1 with +kvm_pv_unhalt
result
guest 1
Throughput 2711.75 MB/sec  32 clients  32 procs  max_latency=304.435 ms
guest 2
Throughput 2812.6 MB/sec  32 clients  32 procs  max_latency=343.341 ms

3.2 w/o kvm_pv_unhalt
results:
guest1
Throughput 42.4602 MB/sec  32 clients  32 procs  max_latency=411.488 ms
guest2
Throughput 183.776 MB/sec  32 clients  32 procs  max_latency=1410.923 ms

Scenario 4: boot 5 guests, each with vcpu count = 1/2 host cpu count (total vcpus = 2.5 * host cpus)
cli:-M pc -cpu SandyBridge,+kvm_pv_unhalt -enable-kvm -m 4096 -smp 4,sockets=2,cores=2,threads=1,maxcpus=160

4.1 with +kvm_pv_unhalt
results:
guest1-guest5
Throughput 1091.51 MB/sec  32 clients  32 procs  max_latency=810.757 ms
Throughput 1040.13 MB/sec  32 clients  32 procs  max_latency=831.144 ms
Throughput 1103.84 MB/sec  32 clients  32 procs  max_latency=843.273 ms
Throughput 931.447 MB/sec  32 clients  32 procs  max_latency=848.634 ms
Throughput 1002.89 MB/sec  32 clients  32 procs  max_latency=840.861 ms

4.2 w/o +kvm_pv_unhalt
results:
guest1-guest5
Throughput 1014.25 MB/sec  32 clients  32 procs  max_latency=844.221 ms
Throughput 1016.37 MB/sec  32 clients  32 procs  max_latency=1001.080 ms
Throughput 838.244 MB/sec  32 clients  32 procs  max_latency=1206.591 ms
Throughput 991.877 MB/sec  32 clients  32 procs  max_latency=975.590 ms
Throughput 731.11 MB/sec  32 clients  32 procs  max_latency=2074.715 ms
Comment 6 FuXiangChun 2014-01-08 05:45:02 EST
Summing up:

As scenarios 1 and 2 show, with no vcpu overcommit, enabling kvm_pv_unhalt decreases throughput by about 85 MB/sec.

As scenario 3 shows, with vcpu overcommit, enabling kvm_pv_unhalt increases throughput by about 2700 MB/sec for each guest.


As scenario 4 shows, with vcpu overcommit, enabling kvm_pv_unhalt increases throughput by about 115 MB/sec for each guest on average.
Comment 7 Andrew Jones 2014-01-08 06:06:53 EST
(In reply to Quan Wenli from comment #4)
> (In reply to Karen Noel from comment #3)
> > FuXiang and Drew,
> > 
> > We need to decide if pvticketlocks will be supported in RHEL 7.0, and if it
> > should be the default. Therefore, we need to determine in what cases it
> > improves performance and when performance degrades.
> > 
> 
> Firstly dbench is a filesystem benchmark tool, it's not fit for testing cpu
> performance very much.
> 

We're not testing cpu performance; we're testing the throughput of whatever requires spinlocks (and file systems require spinlocks). Using a ramdisk (as is done here) for the file system helps avoid run-to-run variability in the benchmark results.

> Then if you have to use dbench, expect for throughput result in dbench, the
> host cpu utilization and throughput per host cpu are also important while
> running debench in guest. it's possible that pvticketlocks uses more less
> host cpu utilization in comment #0, thus from throughput per cpu aspect, it
> may be no difference or give improvement with pvticketlock.

We want the guest's throughput to be optimized when vcpus are overcommitted, and not regressed when vcpus are undercommitted. Whether or not the host is fully utilizing the machine's cpus isn't the concern. If vcpus are getting scheduled out of a guest, then the host (or another guest) is certainly using the corresponding cpus, and thus you'll see utilization on them, but it may not be the right utilization as far as the guest you're measuring is concerned. So what matters is the combined throughput of all guests running dbench simultaneously; each guest should be scheduling the correct vcpus at the correct times in order to optimize throughput, even when vcpus are overcommitted on the system (i.e. not all vcpus can run simultaneously). And there should be no regressions while using pvticketlocks in undercommitted configs either. I believe our dbench benchmark is a good test for this, but it's not the only use case we should test.

> 
> To compare cpu performance between pvticketlock and ticketlock, I suggest
> that run test with 'make -j 50" on 2/4/8/16/32 vcpus on a 8 cpu system with
> three iterations for example. If needed, We would like give a help.

If the system only has 8 cpus then you need to do something like this

test1: vm1:vcpus=4 (.5x commitment)
test2: vm1:vcpus=8 (1x commitment)
test3: vm1:vcpus=4, vm2:vcpus=4 (1x commitment)
test4: vm1:vcpus=8, vm2:vcpus=8 (2x commitment)
test5: vm1:vcpus=8, vm2:vcpus=8, vm3:vcpus=8 (3x commitment)

and add more tests in between with other vcpu counts to try other commitment levels if desired. Note, we NEVER allocate a single vm more vcpus than the host has cpus - that just doesn't make any sense, and the results will surely be poor.
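The commitment levels in the matrix above are just total vcpus divided by host cpus. A quick sketch of that arithmetic (the function name is illustrative, not from the original comment):

```shell
# Compute the vcpu overcommitment ratio for a test configuration.
# Usage: commitment TOTAL_VCPUS HOST_CPUS
commitment() {
    awk -v v="$1" -v h="$2" 'BEGIN {printf "%.1fx\n", v / h}'
}

commitment 4 8    # test1: one 4-vcpu vm on 8 host cpus -> 0.5x
commitment 16 8   # test4: two 8-vcpu vms -> 2.0x
commitment 24 8   # test5: three 8-vcpu vms -> 3.0x
```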

> 
> > Can we run the dbench test under various scenarios? Maybe -smp 2, 4, 8, 16
> > (up to the number of cpus on the host)? 
> > 
> > Maybe run dbench with cpu overcommit? If the host has 8 cpus, run 5 guests
> > with -smp 4. 
> > 
> > Drew, What tests make sense?
> > 
> > I also thought that perhaps the perf regression suite can be run with
> > pvticketlock enabled. This would test several different workloads and give
> > us an idea how performance compares to RHEL 6.5 and 7.0 with pvticktlocks
> > disabled.
> > 

This is definitely a good idea. Let's run the full suite and see what the overall results are.

> 
> Since only block/network performance regression tests in our current plan, I
> suggest we can confirm that if pvticketlock give us cpu performance
> improvement and if make eanbled it by default firstly, if it's yes, we test
> block/netperf to make sure that pvticketlock does not bring us any
> performance regression.   

As I said above, it's not about cpu perf; it's about improving subsystems that do lots of locking. The block/network subsystems sound like the perfect target, but we must measure the combined throughput of all guests, not just a single guest, when looking at overcommitment scenarios. For undercommit it should be fine to look at each guest's results separately, but the combined throughput would also be interesting.

It looks like FuXiangChun has some results we can interpret in comment 5 and comment 6; I'll comment on them next.
Comment 8 Andrew Jones 2014-01-08 07:44:36 EST
(In reply to FuXiangChun from comment #5)
> QE test 4 scenarios again with +kvm_pv_unhalt and w/o +kvm_pv_unhalt.
> 
> Scenario1 boot one guest and vcpu number == host cpu number 
> cli:-cpu SandyBridge,+kvm_pv_unhalt -m 4096 -smp
> 8,sockets=2,cores=2,threads=2,maxcpus=160
> 
> 1.1 with +kvm_pv_unhalt
> result
> Throughput 5869.41 MB/sec  32 clients  32 procs  max_latency=110.400 ms
> 
> 1.2 w/o kvm_pv_unhalt
> result:
> Throughput 5969.32 MB/sec  32 clients  32 procs  max_latency=134.305 ms

Not a horrible regression here; it'd be good to see other benchmarks with undercommit. Assuming this machine has PLE, we have another couple of knobs to turn (ple_gap, ple_window), which may allow us to improve things.

> 
> Scenario2 boot one guest and vcpu number = 1/2 host cpu number 
> cli:-cpu SandyBridge,+kvm_pv_unhalt -m 4096 -smp
> 4,sockets=2,cores=2,threads=1,maxcpus=160
> 
> 2.1 with +kvm_pv_unhalt
> result:
> Throughput 4878.82 MB/sec  32 clients  32 procs  max_latency=219.040 ms
> 
> 2.2 w/o kvm_pv_unhalt
> result:
> Throughput 4945.82 MB/sec  32 clients  32 procs  max_latency=219.025 ms

About the same as Scenario 1.

> 
> Scenario3 boot 2 guest and each guest vcpu number == host cpu number. total
> vcpu number = 2* host cpu number
> cli:-M pc -cpu SandyBridge,+kvm_pv_unhalt -enable-kvm -m 4096 -smp
> 8,sockets=2,cores=2,threads=2,maxcpus=160
> 
> 3.1 with +kvm_pv_unhalt
> result
> guest 1
> Throughput 2711.75 MB/sec  32 clients  32 procs  max_latency=304.435 ms
> guest 2
> Throughput 2812.6 MB/sec  32 clients  32 procs  max_latency=343.341 ms

Combined throughput: 2711 + 2812 = 5523 MB/sec; summed max latencies: 304 + 343 = 647 ms

> 
> 3.2 w/o kvm_pv_unhalt
> results:
> guest1
> Throughput 42.4602 MB/sec  32 clients  32 procs  max_latency=411.488 ms
> guest2
> Throughput 183.776 MB/sec  32 clients  32 procs  max_latency=1410.923 ms

Huge difference between guests... not good, and max_latency for guest2 is horrible too
 
Combined throughput: 42 + 183 = 225 MB/sec; summed max latencies: 411 + 1410 = 1821 ms

This scenario shows a clear win for pvticketlocks!!

> 
> Scenario4 boot 5 guest and each guest vcpu number = 1/2 host cpu number.
> total vcpu number = 2.5* host cpu number
> cli:-M pc -cpu SandyBridge,+kvm_pv_unhalt -enable-kvm -m 4096 -smp
> 4,sockets=2,cores=2,threads=1,maxcpus=160
> 
> 4.1 with +kvm_pv_unhalt
> results:
> guest1-guest5
> Throughput 1091.51 MB/sec  32 clients  32 procs  max_latency=810.757 ms
> Throughput 1040.13 MB/sec  32 clients  32 procs  max_latency=831.144 ms
> Throughput 1103.84 MB/sec  32 clients  32 procs  max_latency=843.273 ms
> Throughput 931.447 MB/sec  32 clients  32 procs  max_latency=848.634 ms
> Throughput 1002.89 MB/sec  32 clients  32 procs  max_latency=840.861 ms

Combined throughput: 5169.817 MB/sec; summed max latencies: 4174.669 ms

> 
> 4.2 w/o +kvm_pv_unhalt
> results:
> guest1-guest5
> Throughput 1014.25 MB/sec  32 clients  32 procs  max_latency=844.221 ms
> Throughput 1016.37 MB/sec  32 clients  32 procs  max_latency=1001.080 ms
> Throughput 838.244 MB/sec  32 clients  32 procs  max_latency=1206.591 ms
> Throughput 991.877 MB/sec  32 clients  32 procs  max_latency=975.590 ms
> Throughput 731.11 MB/sec  32 clients  32 procs  max_latency=2074.715 ms

Combined throughput: 4591.851 MB/sec; summed max latencies: 6102.197 ms

pvticketlocks wins again
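The combined-throughput figures quoted in this comment are straight sums of the per-guest dbench numbers. A sketch of that calculation for scenario 4.1 (helper name and temp file path are illustrative):

```shell
# Sum per-guest dbench throughputs to get the combined throughput for a
# given overcommit configuration (what the comparisons above compute).
combined_throughput() {
    awk '/^Throughput/ {sum += $2} END {printf "%.2f\n", sum}' "$@"
}

# Scenario 4.1 per-guest results:
cat > /tmp/scenario41.txt <<'EOF'
Throughput 1091.51 MB/sec
Throughput 1040.13 MB/sec
Throughput 1103.84 MB/sec
Throughput 931.447 MB/sec
Throughput 1002.89 MB/sec
EOF
combined_throughput /tmp/scenario41.txt   # -> 5169.82 MB/sec
```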
Comment 9 Andrew Jones 2014-01-08 07:49:02 EST
(In reply to FuXiangChun from comment #6)
> sum up:
> 
> As scenario1 and scenario2 show, if no vcpu overcommit, enable kvm_pv_unhalt
> decrease about 85 MB/sec Throughput.
> 
> and scenario3 show, with vcpu overcommit, enable kvm_pv_unhalt increases
> about 2700 MB/sec Throughput for each guest.
> 
> 
> As scenario4 show, with vcpu overcommit, enable kvm_pv_unhalt increases
> about  
> 115 MB/sec Throughput for each guest on average.

Yup, these results show pvticketlocks wins. We need to confirm that with undercommit at least some benchmarks show wins too (i.e. we should run all the perf regression tests). And we need to confirm that we see results like these on machines with PLE. Testing with/without pvticketlocks and with/without PLE would be best, but that creates a pretty big test matrix. If we need to cut the matrix in half, we can test only with PLE (as rhel7 targets more recent hardware, which has PLE).
Comment 14 Andrew Jones 2014-02-24 08:51:23 EST
The results in comment 8 actually show that pvticketlocks improves performance. Closing this as not-a-bug.
