| Summary: | Enabling pvticketlocks causes CPU performance to drop | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | FuXiangChun <xfu> |
| Component: | qemu-kvm | Assignee: | Andrew Jones <drjones> |
| Status: | CLOSED NOTABUG | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 7.0 | CC: | acathrow, drjones, juzhang, knoel, michen, riehecky, riel, sgordon, shyu, virt-maint, wquan, xigao |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-02-24 13:51:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
FuXiangChun
2013-11-05 15:06:13 UTC
Re-tested this feature's performance with the latest kernel 3.10.0-64.el7.x86_64 (guest and host) and qemu-kvm-1.5.3-30.el7.x86_64.

Steps: as comment 0.

a) Enable pvticketlocks
result: Throughput 2187.92 MB/sec 32 clients 32 procs max_latency=219.047 ms

b) Disable pvticketlocks
result: Throughput 2229 MB/sec 32 clients 32 procs max_latency=219.078 ms

So the issue can still be reproduced.

QE tested 4 scenarios again, with and without +kvm_pv_unhalt.

Scenario 1: boot one guest with vcpu count == host cpu count
cli: -cpu SandyBridge,+kvm_pv_unhalt -m 4096 -smp 8,sockets=2,cores=2,threads=2,maxcpus=160

1.1 with +kvm_pv_unhalt
result: Throughput 5869.41 MB/sec 32 clients 32 procs max_latency=110.400 ms

1.2 w/o +kvm_pv_unhalt
result: Throughput 5969.32 MB/sec 32 clients 32 procs max_latency=134.305 ms

Scenario 2: boot one guest with vcpu count = 1/2 host cpu count
cli: -cpu SandyBridge,+kvm_pv_unhalt -m 4096 -smp 4,sockets=2,cores=2,threads=1,maxcpus=160

2.1 with +kvm_pv_unhalt
result: Throughput 4878.82 MB/sec 32 clients 32 procs max_latency=219.040 ms

2.2 w/o +kvm_pv_unhalt
result: Throughput 4945.82 MB/sec 32 clients 32 procs max_latency=219.025 ms

Scenario 3: boot 2 guests, each with vcpu count == host cpu count (total vcpus = 2x host cpus)
cli: -M pc -cpu SandyBridge,+kvm_pv_unhalt -enable-kvm -m 4096 -smp 8,sockets=2,cores=2,threads=2,maxcpus=160

3.1 with +kvm_pv_unhalt
guest 1: Throughput 2711.75 MB/sec 32 clients 32 procs max_latency=304.435 ms
guest 2: Throughput 2812.6 MB/sec 32 clients 32 procs max_latency=343.341 ms

3.2 w/o +kvm_pv_unhalt
guest 1: Throughput 42.4602 MB/sec 32 clients 32 procs max_latency=411.488 ms
guest 2: Throughput 183.776 MB/sec 32 clients 32 procs max_latency=1410.923 ms

Scenario 4: boot 5 guests, each with vcpu count = 1/2 host cpu count;
total vcpus = 2.5x host cpus
cli: -M pc -cpu SandyBridge,+kvm_pv_unhalt -enable-kvm -m 4096 -smp 4,sockets=2,cores=2,threads=1,maxcpus=160

4.1 with +kvm_pv_unhalt (guest1-guest5):
Throughput 1091.51 MB/sec 32 clients 32 procs max_latency=810.757 ms
Throughput 1040.13 MB/sec 32 clients 32 procs max_latency=831.144 ms
Throughput 1103.84 MB/sec 32 clients 32 procs max_latency=843.273 ms
Throughput 931.447 MB/sec 32 clients 32 procs max_latency=848.634 ms
Throughput 1002.89 MB/sec 32 clients 32 procs max_latency=840.861 ms

4.2 w/o +kvm_pv_unhalt (guest1-guest5):
Throughput 1014.25 MB/sec 32 clients 32 procs max_latency=844.221 ms
Throughput 1016.37 MB/sec 32 clients 32 procs max_latency=1001.080 ms
Throughput 838.244 MB/sec 32 clients 32 procs max_latency=1206.591 ms
Throughput 991.877 MB/sec 32 clients 32 procs max_latency=975.590 ms
Throughput 731.11 MB/sec 32 clients 32 procs max_latency=2074.715 ms

Summary: as scenarios 1 and 2 show, with no vcpu overcommit, enabling kvm_pv_unhalt decreases throughput by about 85 MB/sec. As scenario 3 shows, with vcpu overcommit, enabling kvm_pv_unhalt increases throughput by about 2700 MB/sec per guest. As scenario 4 shows, with vcpu overcommit, enabling kvm_pv_unhalt increases throughput by about 115 MB/sec per guest on average.

(In reply to Quan Wenli from comment #4)
> (In reply to Karen Noel from comment #3)
> > FuXiang and Drew,
> >
> > We need to decide if pvticketlocks will be supported in RHEL 7.0, and if it
> > should be the default. Therefore, we need to determine in what cases it
> > improves performance and when performance degrades.
>
> Firstly, dbench is a filesystem benchmark tool; it is not well suited to
> testing cpu performance.

We're not testing cpu performance; we're testing the throughput of whatever requires spinlocks (and file systems require spinlocks). Using a ramdisk (as is done here) for the file system helps avoid run-to-run variability in the benchmark results.
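For reference, the two configurations compared in each scenario differ only in the -cpu flag. A minimal sketch of the two variants; `MACHINE_OPTS` is a hypothetical stand-in for the parts of the qemu-kvm invocation the report elides (disks, network, display):

```shell
# Build the two command lines compared in each scenario. Only the -cpu
# flag differs; MACHINE_OPTS stands in for elided options.
MACHINE_OPTS="-enable-kvm -m 4096 -smp 8,sockets=2,cores=2,threads=2,maxcpus=160"
CMD_WITH_PV="qemu-kvm -cpu SandyBridge,+kvm_pv_unhalt $MACHINE_OPTS"
CMD_WITHOUT_PV="qemu-kvm -cpu SandyBridge $MACHINE_OPTS"
echo "$CMD_WITH_PV"
echo "$CMD_WITHOUT_PV"
```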
> Then, if you have to use dbench: besides the throughput result from dbench,
> the host cpu utilization and throughput per host cpu are also important
> while running dbench in the guest. It's possible that pvticketlocks uses
> less host cpu in comment #0, so from a throughput-per-cpu perspective there
> may be no difference, or even an improvement, with pvticketlocks.

We want the guest's throughput to be optimized when vcpus are overcommitted, and not regressed when vcpus are undercommitted. Whether or not the host is fully utilizing the machine's cpus isn't the concern. If vcpus are getting scheduled out of a guest, then the host (or another guest) is certainly using the corresponding cpus, and thus you'll see utilization on them, but it may not be the right utilization as far as the guest you're measuring is concerned. So what matters is the combined throughput of all guests running dbench simultaneously; each guest should be scheduling the correct vcpus at the correct times in order to optimize throughput, even when vcpus are overcommitted on the system (i.e. not all vcpus can run simultaneously). And there should be no performance regressions while using pvticketlocks in undercommitted configs either. I believe our dbench benchmark is a good test for this, but it's not the only use case we should test.

> To compare cpu performance between pvticketlock and ticketlock, I suggest
> running the test with 'make -j 50' on 2/4/8/16/32 vcpus on an 8-cpu system,
> with three iterations, for example. If needed, we would be glad to help.

If the system only has 8 cpus, then you need to do something like this:

test1: vm1:vcpus=4 (.5x commitment)
test2: vm1:vcpus=8 (1x commitment)
test3: vm1:vcpus=4, vm2:vcpus=4 (1x commitment)
test4: vm1:vcpus=8, vm2:vcpus=8 (2x commitment)
test5: vm1:vcpus=8, vm2:vcpus=8, vm3:vcpus=8 (3x commitment)

and add more tests in between with other vcpu counts to try other commitment levels if desired.
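The matrix above can be enumerated mechanically. A sketch assuming the 8-cpu host from the discussion; the test names and total vcpu counts come straight from the list above:

```shell
# Print the total vcpu count and commitment ratio for each proposed test
# on an (assumed) 8-cpu host, per the matrix above.
host_cpus=8
for entry in "test1:4" "test2:8" "test3:8" "test4:16" "test5:24"; do
    name=${entry%%:*}
    vcpus=${entry##*:}
    ratio=$(awk -v v="$vcpus" -v h="$host_cpus" 'BEGIN { printf "%.1f", v / h }')
    echo "$name: $vcpus total vcpus -> ${ratio}x commitment"
done
```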
Note: we NEVER allocate a single vm more vcpus than the host has cpus; that just doesn't make sense, and the results will surely be poor.

> > Can we run the dbench test under various scenarios? Maybe -smp 2, 4, 8, 16
> > (up to the number of cpus on the host)?
> >
> > Maybe run dbench with cpu overcommit? If the host has 8 cpus, run 5 guests
> > with -smp 4.
> >
> > Drew, what tests make sense?
> >
> > I also thought that perhaps the perf regression suite can be run with
> > pvticketlocks enabled. This would test several different workloads and
> > give us an idea how performance compares to RHEL 6.5 and to 7.0 with
> > pvticketlocks disabled.

This is definitely a good idea. Let's run the full suite and see what the overall results are.

> Since only block/network performance regression tests are in our current
> plan, I suggest we first confirm whether pvticketlock gives a cpu
> performance improvement and whether to enable it by default; if so, we can
> then test block/netperf to make sure that pvticketlock does not bring any
> performance regression.

As I said above, it's not about cpu perf; it's about improving subsystems that do lots of locking. The block/network subsystems sound like the perfect target, but we must measure the combined throughput of all guests, not just a single guest, when looking at overcommitment scenarios. For undercommit it should be fine to just look at the results of each guest separately, but looking at the combined throughput would be interesting as well.

It looks like FuXiangChun has some results we can interpret in comment 5 and comment 6; I'll reply to them next.

(In reply to FuXiangChun from comment #5)
> QE tested 4 scenarios again with +kvm_pv_unhalt and w/o +kvm_pv_unhalt.
>
> Scenario 1: boot one guest with vcpu count == host cpu count
> cli: -cpu SandyBridge,+kvm_pv_unhalt -m 4096 -smp
> 8,sockets=2,cores=2,threads=2,maxcpus=160
>
> 1.1 with +kvm_pv_unhalt
> result: Throughput 5869.41 MB/sec 32 clients 32 procs max_latency=110.400 ms
>
> 1.2 w/o kvm_pv_unhalt
> result: Throughput 5969.32 MB/sec 32 clients 32 procs max_latency=134.305 ms

Not a horrible regression here; it'd be good to see other benchmarks with undercommit. Assuming this machine has PLE, then we do have another couple of knobs to turn (ple_gap, ple_window), which may allow us to improve things.

> Scenario 2: boot one guest with vcpu count = 1/2 host cpu count
> cli: -cpu SandyBridge,+kvm_pv_unhalt -m 4096 -smp
> 4,sockets=2,cores=2,threads=1,maxcpus=160
>
> 2.1 with +kvm_pv_unhalt
> result: Throughput 4878.82 MB/sec 32 clients 32 procs max_latency=219.040 ms
>
> 2.2 w/o kvm_pv_unhalt
> result: Throughput 4945.82 MB/sec 32 clients 32 procs max_latency=219.025 ms

About the same as Scenario 1.

> Scenario 3: boot 2 guests, each with vcpu count == host cpu count
> (total vcpus = 2x host cpus)
> cli: -M pc -cpu SandyBridge,+kvm_pv_unhalt -enable-kvm -m 4096 -smp
> 8,sockets=2,cores=2,threads=2,maxcpus=160
>
> 3.1 with +kvm_pv_unhalt
> guest 1: Throughput 2711.75 MB/sec 32 clients 32 procs max_latency=304.435 ms
> guest 2: Throughput 2812.6 MB/sec 32 clients 32 procs max_latency=343.341 ms

2711 + 2812 = 5523 MB/sec; 304 + 343 = 647 ms

> 3.2 w/o kvm_pv_unhalt
> guest 1: Throughput 42.4602 MB/sec 32 clients 32 procs max_latency=411.488 ms
> guest 2: Throughput 183.776 MB/sec 32 clients 32 procs max_latency=1410.923 ms

Huge difference between the guests... not good, and the max_latency for guest 2 is horrible too.

42 + 183 = 225 MB/sec; 411 + 1410 = 1821 ms

This scenario shows a clear win for pvticketlocks!!

> Scenario 4: boot 5 guests, each with vcpu count = 1/2 host cpu count;
> total vcpus = 2.5x host cpus
> cli: -M pc -cpu SandyBridge,+kvm_pv_unhalt -enable-kvm -m 4096 -smp
> 4,sockets=2,cores=2,threads=1,maxcpus=160
>
> 4.1 with +kvm_pv_unhalt (guest1-guest5):
> Throughput 1091.51 MB/sec 32 clients 32 procs max_latency=810.757 ms
> Throughput 1040.13 MB/sec 32 clients 32 procs max_latency=831.144 ms
> Throughput 1103.84 MB/sec 32 clients 32 procs max_latency=843.273 ms
> Throughput 931.447 MB/sec 32 clients 32 procs max_latency=848.634 ms
> Throughput 1002.89 MB/sec 32 clients 32 procs max_latency=840.861 ms

Totals: 5169.817 MB/sec; 4174.669 ms

> 4.2 w/o +kvm_pv_unhalt (guest1-guest5):
> Throughput 1014.25 MB/sec 32 clients 32 procs max_latency=844.221 ms
> Throughput 1016.37 MB/sec 32 clients 32 procs max_latency=1001.080 ms
> Throughput 838.244 MB/sec 32 clients 32 procs max_latency=1206.591 ms
> Throughput 991.877 MB/sec 32 clients 32 procs max_latency=975.590 ms
> Throughput 731.11 MB/sec 32 clients 32 procs max_latency=2074.715 ms

Totals: 4591.851 MB/sec; 6102.197 ms

pvticketlocks wins again.

(In reply to FuXiangChun from comment #6)
> Sum up:
>
> As scenarios 1 and 2 show, with no vcpu overcommit, enabling kvm_pv_unhalt
> decreases throughput by about 85 MB/sec.
>
> As scenario 3 shows, with vcpu overcommit, enabling kvm_pv_unhalt increases
> throughput by about 2700 MB/sec per guest.
>
> As scenario 4 shows, with vcpu overcommit, enabling kvm_pv_unhalt increases
> throughput by about 115 MB/sec per guest on average.

Yup, these results show that pvticketlocks wins. We need to confirm that with undercommit at least some benchmarks show wins too (i.e. we should run all perf regression tests). And we need to confirm that we see results like these on machines with PLE. Testing with/without pvticketlocks and with/without PLE would be best, but that creates a pretty big test matrix. If we need to cut the matrix in half, then we can test only with PLE (as RHEL 7 targets more recent hardware, which has PLE).
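Since the metric that matters under overcommit is the combined throughput of all guests, the per-scenario totals quoted above can be reproduced mechanically. A small sketch using the scenario 4 per-guest numbers:

```shell
# Sum the per-guest dbench throughputs (MB/sec) from scenario 4 to get
# the combined figure that matters under overcommit.
with_pv="1091.51 1040.13 1103.84 931.447 1002.89"
without_pv="1014.25 1016.37 838.244 991.877 731.11"
sum() { echo "$1" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "%.3f", s }'; }
total_with=$(sum "$with_pv")
total_without=$(sum "$without_pv")
echo "combined, with pvticketlocks:    $total_with MB/sec"
echo "combined, without pvticketlocks: $total_without MB/sec"
```

This reproduces the 5169.817 vs. 4591.851 MB/sec totals cited in the analysis.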
The results in comment 8 actually show that pvticketlocks improves performance. Closing this as not a bug.