Description of problem:
Used the dbench tool to test pvticketlocks performance on Intel and AMD hosts. Throughput falls when pvticketlocks is enabled. QE tested this many times with dbench.

Version-Release number of selected component (if applicable):
qemu-kvm-1.5.3-10.el7.x86_64
3.10.0-37.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Boot guest:
/usr/libexec/qemu-kvm -M q35 -cpu Opteron_G2,+kvm_pv_unhalt -enable-kvm -m 4096 -smp 4,sockets=1,cores=4,threads=1 ....
2. Inside the guest: mount -t tmpfs -o size=1G tmpfs /mnt
3. Inside the guest: dbench -D /mnt -t 60 32 -c /usr/local/share/doc/dbench/loadfiles/client.txt

Actual results:

Intel host:

Enable pvticketlocks (-cpu SandyBridge,+kvm_pv_unhalt):
Throughput 5252.95 MB/sec 32 clients 32 procs max_latency=57.459 ms
Throughput 5481.55 MB/sec 32 clients 32 procs max_latency=37.068 ms
Throughput 5447.84 MB/sec 32 clients 32 procs max_latency=41.034 ms
Throughput 5490.55 MB/sec 32 clients 32 procs max_latency=42.116 ms
Throughput 5468.27 MB/sec 32 clients 32 procs max_latency=39.693 ms

Disable pvticketlocks (-cpu SandyBridge,-kvm_pv_unhalt):
Throughput 5529.27 MB/sec 32 clients 32 procs max_latency=44.026 ms
Throughput 5556.54 MB/sec 32 clients 32 procs max_latency=42.031 ms
Throughput 5495.86 MB/sec 32 clients 32 procs max_latency=41.051 ms
Throughput 5560.48 MB/sec 32 clients 32 procs max_latency=37.887 ms
Throughput 5553.49 MB/sec 32 clients 32 procs max_latency=44.023 ms

AMD host:

Enable pvticketlocks (-M q35 -cpu Opteron_G2,+kvm_pv_unhalt, or -M q35 -cpu host,+kvm_pv_unhalt):
dbench result:
Throughput 1941.31 MB/sec 32 clients 32 procs max_latency=76.107 ms
Throughput 1949.63 MB/sec 32 clients 32 procs max_latency=330.093 ms
Throughput 1956.27 MB/sec 32 clients 32 procs max_latency=180.038 ms
Throughput 1947.99 MB/sec 32 clients 32 procs max_latency=72.166 ms
Throughput 1954.01 MB/sec 32 clients 32 procs max_latency=80.094 ms
Throughput 1950.92 MB/sec 32 clients 32 procs max_latency=69.060 ms
Throughput 1950.03 MB/sec 32 clients 32 procs max_latency=73.183 ms
Throughput 1944.33 MB/sec 32 clients 32 procs max_latency=64.051 ms
Throughput 1957.74 MB/sec 32 clients 32 procs max_latency=170.082 ms
Throughput 1953.79 MB/sec 32 clients 32 procs max_latency=65.057 ms

Disable pvticketlocks (-M q35 -cpu Opteron_G2):
dbench result:
Throughput 2005.98 MB/sec 32 clients 32 procs max_latency=130.086 ms
Throughput 2012.03 MB/sec 32 clients 32 procs max_latency=97.109 ms
Throughput 2035.5 MB/sec 32 clients 32 procs max_latency=87.768 ms
Throughput 2019.16 MB/sec 32 clients 32 procs max_latency=66.029 ms
Throughput 2020.43 MB/sec 32 clients 32 procs max_latency=133.097 ms

Expected results:
Enabling pvticketlocks should not reduce throughput.

Additional info:
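For reference, the per-run figures above can be averaged mechanically when comparing configurations; a minimal sketch (the `avg_throughput` helper name is ours, not part of dbench), assuming dbench's standard "Throughput N MB/sec ..." result-line format:

```shell
#!/bin/sh
# Average the throughput figures from a set of dbench result lines
# ("Throughput N MB/sec ..." format, as in the results above), read
# from stdin.
avg_throughput() {
    awk '/^Throughput/ { sum += $2; n++ }
         END { if (n) printf "%.2f MB/sec over %d runs\n", sum / n, n }'
}
```

For example, piping the five Intel "enable" lines above through `avg_throughput` yields their mean, about 5428 MB/sec.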
Re-tested this feature's performance with the latest kernel, 3.10.0-64.el7.x86_64 (guest and host), and qemu-kvm-1.5.3-30.el7.x86_64.

Steps: as in comment 0.

a) Enable pvticketlocks result:
Throughput 2187.92 MB/sec 32 clients 32 procs max_latency=219.047 ms

b) Disable pvticketlocks result:
Throughput 2229 MB/sec 32 clients 32 procs max_latency=219.078 ms

So the issue can still be reproduced.
QE tested 4 scenarios again, with and without +kvm_pv_unhalt.

Scenario 1: boot one guest with vcpu number == host cpu number.
cli: -cpu SandyBridge,+kvm_pv_unhalt -m 4096 -smp 8,sockets=2,cores=2,threads=2,maxcpus=160

1.1 with +kvm_pv_unhalt result:
Throughput 5869.41 MB/sec 32 clients 32 procs max_latency=110.400 ms

1.2 w/o +kvm_pv_unhalt result:
Throughput 5969.32 MB/sec 32 clients 32 procs max_latency=134.305 ms

Scenario 2: boot one guest with vcpu number = 1/2 host cpu number.
cli: -cpu SandyBridge,+kvm_pv_unhalt -m 4096 -smp 4,sockets=2,cores=2,threads=1,maxcpus=160

2.1 with +kvm_pv_unhalt result:
Throughput 4878.82 MB/sec 32 clients 32 procs max_latency=219.040 ms

2.2 w/o +kvm_pv_unhalt result:
Throughput 4945.82 MB/sec 32 clients 32 procs max_latency=219.025 ms

Scenario 3: boot 2 guests, each with vcpu number == host cpu number (total vcpu number = 2x host cpu number).
cli: -M pc -cpu SandyBridge,+kvm_pv_unhalt -enable-kvm -m 4096 -smp 8,sockets=2,cores=2,threads=2,maxcpus=160

3.1 with +kvm_pv_unhalt results:
guest 1: Throughput 2711.75 MB/sec 32 clients 32 procs max_latency=304.435 ms
guest 2: Throughput 2812.6 MB/sec 32 clients 32 procs max_latency=343.341 ms

3.2 w/o +kvm_pv_unhalt results:
guest 1: Throughput 42.4602 MB/sec 32 clients 32 procs max_latency=411.488 ms
guest 2: Throughput 183.776 MB/sec 32 clients 32 procs max_latency=1410.923 ms

Scenario 4: boot 5 guests, each with vcpu number = 1/2 host cpu number (total vcpu number = 2.5x host cpu number).
cli: -M pc -cpu SandyBridge,+kvm_pv_unhalt -enable-kvm -m 4096 -smp 4,sockets=2,cores=2,threads=1,maxcpus=160

4.1 with +kvm_pv_unhalt results (guest1-guest5):
Throughput 1091.51 MB/sec 32 clients 32 procs max_latency=810.757 ms
Throughput 1040.13 MB/sec 32 clients 32 procs max_latency=831.144 ms
Throughput 1103.84 MB/sec 32 clients 32 procs max_latency=843.273 ms
Throughput 931.447 MB/sec 32 clients 32 procs max_latency=848.634 ms
Throughput 1002.89 MB/sec 32 clients 32 procs max_latency=840.861 ms

4.2 w/o +kvm_pv_unhalt results (guest1-guest5):
Throughput 1014.25 MB/sec 32 clients 32 procs max_latency=844.221 ms
Throughput 1016.37 MB/sec 32 clients 32 procs max_latency=1001.080 ms
Throughput 838.244 MB/sec 32 clients 32 procs max_latency=1206.591 ms
Throughput 991.877 MB/sec 32 clients 32 procs max_latency=975.590 ms
Throughput 731.11 MB/sec 32 clients 32 procs max_latency=2074.715 ms
Sum up:

As scenarios 1 and 2 show, with no vcpu overcommit, enabling kvm_pv_unhalt decreases throughput by about 85 MB/sec.

As scenario 3 shows, with vcpu overcommit, enabling kvm_pv_unhalt increases throughput by about 2700 MB/sec for each guest.

As scenario 4 shows, with vcpu overcommit, enabling kvm_pv_unhalt increases throughput by about 115 MB/sec for each guest on average.
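The per-guest averages behind that last figure can be checked from the scenario 4 numbers quoted above; a small sketch (the `avg` helper name is ours):

```shell
#!/bin/sh
# Per-guest average throughput for scenario 4, from the figures above.
avg() { awk '{ s += $1; n++ } END { printf "%.2f", s / n }'; }

with_pv=$(printf '1091.51\n1040.13\n1103.84\n931.447\n1002.89\n' | avg)
without_pv=$(printf '1014.25\n1016.37\n838.244\n991.877\n731.11\n' | avg)
echo "with +kvm_pv_unhalt: $with_pv MB/sec"
echo "w/o  +kvm_pv_unhalt: $without_pv MB/sec"
# The difference, 1033.96 - 918.37 = 115.59 MB/sec per guest, is where
# the "about 115 MB/sec" figure comes from.
```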
(In reply to Quan Wenli from comment #4)
> (In reply to Karen Noel from comment #3)
> > FuXiang and Drew,
> >
> > We need to decide if pvticketlocks will be supported in RHEL 7.0, and if it
> > should be the default. Therefore, we need to determine in what cases it
> > improves performance and when performance degrades.
>
> Firstly, dbench is a filesystem benchmark tool; it's not a good fit for
> testing cpu performance.

We're not testing cpu performance, we're testing throughput of whatever requires spinlocks (and file systems require spinlocks). Using a ramdisk (as is done here) for the file system helps avoid run-to-run variability in the benchmark results.

> Then, if you have to use dbench, besides the throughput result, the host cpu
> utilization and throughput per host cpu are also important while running
> dbench in a guest. It's possible that pvticketlocks uses less host cpu
> utilization in comment #0, so from a throughput-per-cpu aspect it may show
> no difference, or even an improvement, with pvticketlocks.

We want the guest's throughput to be optimized, even when vcpus are overcommitted, and not regressed, even when vcpus are undercommitted. Whether or not the host is fully utilizing the machine's cpus isn't the concern. If vcpus are getting scheduled out of a guest, then the host (or another guest) is certainly using the corresponding cpus, and thus you'll see utilization on them, but it may not be the right utilization as far as the guest you're measuring is concerned.

So what matters is the combined throughput of all guests running dbench simultaneously, i.e. each guest should be scheduling the correct vcpus at the correct times in order to optimize the throughput - even when vcpus are overcommitted on the system (i.e. it can't have all vcpus running simultaneously). And there should be no regressions in performance while using pvticketlocks in undercommitted configs either.
I believe our dbench benchmark is a good benchmark to test this, but it's not the only use case we should test.

> To compare cpu performance between pvticketlock and ticketlock, I suggest
> running the test with 'make -j 50' on 2/4/8/16/32 vcpus on an 8-cpu system,
> with three iterations, for example. If needed, we would like to help.

If the system only has 8 cpus, then you need to do something like this:

test1: vm1:vcpus=4 (.5x commitment)
test2: vm1:vcpus=8 (1x commitment)
test3: vm1:vcpus=4, vm2:vcpus=4 (1x commitment)
test4: vm1:vcpus=8, vm2:vcpus=8 (2x commitment)
test5: vm1:vcpus=8, vm2:vcpus=8, vm3:vcpus=8 (3x commitment)

and add more tests in between with other vcpu counts to try other commitment levels if desired. Note, we NEVER allocate a single vm more vcpus than the host has cpus - that just doesn't make any sense, and the results will surely be poor.

> > Can we run the dbench test under various scenarios? Maybe -smp 2, 4, 8, 16
> > (up to the number of cpus on the host)?
> >
> > Maybe run dbench with cpu overcommit? If the host has 8 cpus, run 5 guests
> > with -smp 4.
> >
> > Drew, what tests make sense?
> >
> > I also thought that perhaps the perf regression suite can be run with
> > pvticketlock enabled. This would test several different workloads and give
> > us an idea how performance compares to RHEL 6.5 and 7.0 with pvticketlocks
> > disabled.
>
> This is definitely a good idea.

That is, run the full suite and see what the overall results are.

> Since only block/network performance regression tests are in our current
> plan, I suggest we first confirm whether pvticketlock gives us a cpu
> performance improvement and whether to enable it by default; if yes, we then
> test block/netperf to make sure that pvticketlock does not bring us any
> performance regression.
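The commitment level in each of the tests above is just total vcpus divided by host cpus; a throwaway sketch (the `commitment` helper name is ours) for the 8-cpu layout described:

```shell
#!/bin/sh
# Commitment level = total vcpus across all vms / host cpus.
# The vm/vcpu layouts below mirror test1-test5 above, on an 8-cpu host.
HOST_CPUS=8

commitment() {
    total=0
    for vcpus in "$@"; do total=$((total + vcpus)); done
    # ratio of total vcpus to host cpus, e.g. "0.5x"
    awk -v t="$total" -v h="$HOST_CPUS" 'BEGIN { printf "%gx\n", t / h }'
}

commitment 4       # test1 -> 0.5x
commitment 8       # test2 -> 1x
commitment 4 4     # test3 -> 1x
commitment 8 8     # test4 -> 2x
commitment 8 8 8   # test5 -> 3x
```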
As I said above, it's not about cpu perf, it's about improving subsystems that have lots of locking - the block/network subsystems sound like the perfect target, but we must measure the combined throughput of all guests, not just a single guest, when looking at overcommitment scenarios. For undercommit it should be fine to just look at the results of each guest separately, but looking at the combined throughput would also be interesting.

It looks like FuXiangChun has some results we can interpret in comment 5 and comment 6; I'll comment on them next.
(In reply to FuXiangChun from comment #5)
> QE tested 4 scenarios again, with and without +kvm_pv_unhalt.
>
> Scenario 1: boot one guest with vcpu number == host cpu number.
> cli: -cpu SandyBridge,+kvm_pv_unhalt -m 4096 -smp
> 8,sockets=2,cores=2,threads=2,maxcpus=160
>
> 1.1 with +kvm_pv_unhalt result:
> Throughput 5869.41 MB/sec 32 clients 32 procs max_latency=110.400 ms
>
> 1.2 w/o +kvm_pv_unhalt result:
> Throughput 5969.32 MB/sec 32 clients 32 procs max_latency=134.305 ms

Not a horrible regression here; it'd be good to see other benchmarks with undercommit. Assuming this machine has PLE, we do have another couple of knobs to turn (ple_gap, ple_window), which may allow us to improve things.

> Scenario 2: boot one guest with vcpu number = 1/2 host cpu number.
> cli: -cpu SandyBridge,+kvm_pv_unhalt -m 4096 -smp
> 4,sockets=2,cores=2,threads=1,maxcpus=160
>
> 2.1 with +kvm_pv_unhalt result:
> Throughput 4878.82 MB/sec 32 clients 32 procs max_latency=219.040 ms
>
> 2.2 w/o +kvm_pv_unhalt result:
> Throughput 4945.82 MB/sec 32 clients 32 procs max_latency=219.025 ms

About the same as scenario 1.

> Scenario 3: boot 2 guests, each with vcpu number == host cpu number
> (total vcpu number = 2x host cpu number).
> cli: -M pc -cpu SandyBridge,+kvm_pv_unhalt -enable-kvm -m 4096 -smp
> 8,sockets=2,cores=2,threads=2,maxcpus=160
>
> 3.1 with +kvm_pv_unhalt results:
> guest 1: Throughput 2711.75 MB/sec 32 clients 32 procs max_latency=304.435 ms
> guest 2: Throughput 2812.6 MB/sec 32 clients 32 procs max_latency=343.341 ms

2711 + 2812 = 5523 MB/sec; 304 + 343 = 647 ms

> 3.2 w/o +kvm_pv_unhalt results:
> guest 1: Throughput 42.4602 MB/sec 32 clients 32 procs max_latency=411.488 ms
> guest 2: Throughput 183.776 MB/sec 32 clients 32 procs max_latency=1410.923 ms

Huge difference between guests... not good, and the max_latency for guest 2 is horrible too.

42 + 183 = 225 MB/sec; 411 + 1410 = 1821 ms

This scenario shows a clear win for pvticketlocks!!
> Scenario 4: boot 5 guests, each with vcpu number = 1/2 host cpu number
> (total vcpu number = 2.5x host cpu number).
> cli: -M pc -cpu SandyBridge,+kvm_pv_unhalt -enable-kvm -m 4096 -smp
> 4,sockets=2,cores=2,threads=1,maxcpus=160
>
> 4.1 with +kvm_pv_unhalt results (guest1-guest5):
> Throughput 1091.51 MB/sec 32 clients 32 procs max_latency=810.757 ms
> Throughput 1040.13 MB/sec 32 clients 32 procs max_latency=831.144 ms
> Throughput 1103.84 MB/sec 32 clients 32 procs max_latency=843.273 ms
> Throughput 931.447 MB/sec 32 clients 32 procs max_latency=848.634 ms
> Throughput 1002.89 MB/sec 32 clients 32 procs max_latency=840.861 ms

Totals: 5169.817 MB/sec, 4174.669 ms

> 4.2 w/o +kvm_pv_unhalt results (guest1-guest5):
> Throughput 1014.25 MB/sec 32 clients 32 procs max_latency=844.221 ms
> Throughput 1016.37 MB/sec 32 clients 32 procs max_latency=1001.080 ms
> Throughput 838.244 MB/sec 32 clients 32 procs max_latency=1206.591 ms
> Throughput 991.877 MB/sec 32 clients 32 procs max_latency=975.590 ms
> Throughput 731.11 MB/sec 32 clients 32 procs max_latency=2074.715 ms

Totals: 4591.851 MB/sec, 6102.197 ms

pvticketlocks wins again.
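The combined figures quoted in this comment are just per-guest sums of throughput and of max_latency; a sketch reproducing them (the `combined` helper name is ours), fed the scenario 4, case 4.1 numbers:

```shell
#!/bin/sh
# Combined throughput and summed max_latency across guests.
# stdin: one "throughput max_latency" pair per guest.
combined() {
    awk '{ t += $1; l += $2 }
         END { printf "%.3f MB/sec, %.3f ms\n", t, l }'
}

# Scenario 4, with +kvm_pv_unhalt (figures quoted above):
combined <<'EOF'
1091.51 810.757
1040.13 831.144
1103.84 843.273
931.447 848.634
1002.89 840.861
EOF
```

Running this prints 5169.817 MB/sec, 4174.669 ms, matching the totals above.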
(In reply to FuXiangChun from comment #6)
> Sum up:
>
> As scenarios 1 and 2 show, with no vcpu overcommit, enabling kvm_pv_unhalt
> decreases throughput by about 85 MB/sec.
>
> As scenario 3 shows, with vcpu overcommit, enabling kvm_pv_unhalt increases
> throughput by about 2700 MB/sec for each guest.
>
> As scenario 4 shows, with vcpu overcommit, enabling kvm_pv_unhalt increases
> throughput by about 115 MB/sec for each guest on average.

Yup, these results show pvticketlocks wins. We need to confirm that with undercommit at least some benchmarks show wins too (i.e. we should run all perf regression tests). And we need to confirm that we see results like these on machines with PLE. Testing with/without pvticketlocks and with/without PLE would be best, but that creates a pretty big test matrix. If we need to cut the matrix in half, then we can test only with PLE (as rhel7 targets more recent hardware, which has PLE).
The results in comment 8 actually show that pvticketlocks improves performance. Closing this as not-a-bug.