Bug 1536398
| Field | Value |
|---|---|
| Summary | AioContext polling inhibits itself when facing heavy load from multiple vCPUs |
| Product | Red Hat Enterprise Linux 7 |
| Component | qemu-kvm-rhev |
| Version | 7.4 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | high |
| Reporter | Sergio Lopez <slopezpa> |
| Assignee | Stefan Hajnoczi <stefanha> |
| QA Contact | Yanhui Ma <yama> |
| Docs Contact | |
| CC | aliang, atheurer, chayang, coli, cww, djdumas, dsafford, gveitmic, gwatson, jentrena, juzhang, knoel, michen, mkalinin, sfroemer, slopezpa, stefanha, virt-maint, wquan, xuwei, yama, yhong |
| Target Milestone | rc |
| Target Release | --- |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | If docs needed, set a value |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2018-12-17 16:27:58 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1477664 |
| Attachments | Patch implementing a more conservative shrinking of poll_ns (attachment 1383289) |
I tried it with qemu-kvm-rhev-2.10.0-16.el7.x86_64 and kernel-3.10.0-820.el7.x86_64, using an NVMe device rather than tmpfs. The fio configuration is the same as in comment 0. There is no IOPS decrease with larger numjobs in the VM.

* Server specs:
- Intel(R) Xeon(R) CPU E7-4820 v2 @ 2.00GHz
- 64G RAM

* VM specs: vCPUs and IOThreads pinned in the same way as in comment 0

/usr/libexec/qemu-kvm \
    -S \
    -name 'avocado-vt-vm1' \
    -sandbox off \
    -machine pc \
    -nodefaults \
    -vga std \
    -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel75-64-virtio.qcow2 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=03 \
    -drive if=none,id=drive0,file=/dev/nvme0n1,format=raw,cache=none,aio=threads \
    -object iothread,id=iothread0 \
    -device virtio-blk-pci,iothread=iothread0,drive=drive0 \
    -device virtio-net-pci,id=id10Nyb9,vectors=4,netdev=id0msQUV,bus=pci.0,addr=05 \
    -netdev tap,id=id0msQUV \
    -m 1024 \
    -smp 16 \
    -cpu 'SandyBridge',+kvm_pv_unhalt,hv_spinlocks=0x1fff,hv_vapic,hv_time \
    -vnc :0 \
    -rtc base=localtime,clock=host,driftfix=slew \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm -monitor stdio

* Tests on the host:
1 thread:   iops=11121  latency=87.81us
2 threads:  iops=22008  latency=88.66us
4 threads:  iops=43017  latency=90.77us
8 threads:  iops=82524  latency=94.72us
12 threads: iops=116544 latency=100.71us
14 threads: iops=134024 latency=102.16us
16 threads: iops=148334 latency=105.58us
32 threads: iops=250153 latency=125.24us

* Tests on the VM:
1 thread:   iops=7487   latency=130.69us
2 threads:  iops=14841  latency=131.96us
4 threads:  iops=30127  latency=129.90us
8 threads:  iops=58565  latency=133.36us
12 threads: iops=78155  latency=149.99us
14 threads: iops=86214  latency=158.64us
16 threads: iops=91474  latency=170.99us
32 threads: iops=110177 latency=286.31us

I tried it with qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64.rpm; everything else is the same as in comment 3. Again, no IOPS decrease with larger numjobs in the VM.

* Tests on the VM:
1 thread:   iops=7634   latency=128.09us
2 threads:  iops=14904  latency=131.33us
4 threads:  iops=30302  latency=129.16us
8 threads:  iops=58926  latency=132.60us
12 threads: iops=80586  latency=145.40us
14 threads: iops=89001  latency=153.58us
16 threads: iops=95559  latency=163.53us
32 threads: iops=121226 latency=259.81us

(In reply to Yanhui Ma from comment #4)
> I tried it with qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64.rpm; everything else
> is the same as in comment 3. Again, no IOPS decrease with larger numjobs in
> the VM.
>
> * Tests on the VM:
> 1 thread: iops=7634 latency=128.09us

Polling is only active when latency <= 32us by default. The drive tested is too slow to take advantage of polling (too much CPU would be spent waiting for a request to complete).

The -object iothread,poll-max-ns= parameter can be set to increase the limit (the default is 32us), but anything above 100us is probably going to cause noticeable CPU utilization from the IOThread on the host.

(In reply to Sergio Lopez from comment #0)
> Created attachment 1383289 [details]
> Patch implementing a more conservative shrinking of poll_ns

The patch and your reasoning look good. Please post it upstream, CCing Christian Borntraeger <borntraeger.com> and myself. Please test aio=native on real hardware, if possible, and include those numbers when you send the patch upstream.
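For illustration only, a minimal sketch of how that limit could be raised when defining the IOThread. The values are examples: poll-max-ns is given in nanoseconds, so 256us corresponds to 256000, and poll-shrink is the related shrink divisor discussed later in this report.

    # Illustrative only: create the IOThread with a larger polling window.
    # poll-max-ns is in nanoseconds (the built-in default corresponds to 32us);
    # poll-shrink=2 halves the current polling time instead of resetting it to 0.
    /usr/libexec/qemu-kvm \
        ... \
        -object iothread,id=iothread0,poll-max-ns=256000,poll-shrink=2 \
        -device virtio-blk-pci,iothread=iothread0,drive=drive0 \
        ...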
(In reply to Stefan Hajnoczi from comment #5)
> (In reply to Yanhui Ma from comment #4)
> > I tried it with qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64.rpm; everything
> > else is the same as in comment 3. Again, no IOPS decrease with larger
> > numjobs in the VM.
> >
> > * Tests on the VM:
> > 1 thread: iops=7634 latency=128.09us
>
> Polling is only active when latency <= 32us by default. The drive tested is
> too slow to take advantage of polling (too much CPU would be spent waiting
> for a request to complete).
>
> The -object iothread,poll-max-ns= parameter can be set to increase the limit
> (the default is 32us), but anything above 100us is probably going to cause
> noticeable CPU utilization from the IOThread on the host.

Hi Stefan,

The latency is 163.53us with 16 threads in the results above. Do you mean I need to set poll-max-ns=256us (>163.53us) to make polling active from 1 thread to 16 threads? If it is meaningful, I will try it again.

Thanks,
Yanhui

(In reply to Yanhui Ma from comment #7)
> The latency is 163.53us with 16 threads in the results above. Do you mean I
> need to set poll-max-ns=256us (>163.53us) to make polling active from 1
> thread to 16 threads?

Yes, exactly.

> If it is meaningful, I will try it again.

I'm not sure it's meaningful. Users are unlikely to enable polling on slow devices because it consumes too much CPU.

From Comment #1:
> - 1 virtio-blk device with iothread (with aio=threads), backed by a raw file
>   on a tmpfs filesystem, used by the FIO tests.

If you use tmpfs like Sergio does then the latency should be much lower. I suggest testing the same configuration as Sergio if you want to reproduce this.
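A minimal sketch of how such a sweep can be driven from inside the guest, reusing the fio command from comment 0 (reproduced in the problem description at the end of this report). The numjobs values simply mirror the result tables below, and /dev/vdb is assumed to be the virtio-blk test disk.

    # Sweep the randread job from comment 0 over an increasing number of jobs.
    # /dev/vdb is assumed to be the virtio-blk test disk; adjust if it differs.
    for jobs in 1 2 4 8 12 14 16; do
        fio --runtime=300 --time_based --clocksource=clock_gettime \
            --name=randread --rw=randread --bs=4k --filename=/dev/vdb \
            --size=10g --numjobs=$jobs --iodepth=1 --direct=1 \
            --ioengine=sync --group_reporting
    done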
Hi Sergio,

I tried it with tmpfs, and IOPS only decreased a little with numjobs=14. Could you help check my steps? The host specs, fio specs and NUMA pinning are the same as in comment 3, and are as close as possible to your configuration.

# mount -t tmpfs -o size=30G,mode=700 tmpfs /mytmpfs
# qemu-img create -f raw /mytmpfs/storage2.raw 20G

qemu command line:

/usr/libexec/qemu-kvm \
    ...
    -object iothread,id=iothread0 \
    -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel75-64-virtio.qcow2 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,iothread=iothread0,bus=pci.0,addr=0x3 \
    -drive id=drive_disk1,if=none,snapshot=off,aio=threads,cache=writeback,format=raw,file=/mytmpfs/storage2.raw \
    -device virtio-blk-pci,id=disk1,drive=drive_disk1,bootindex=1,iothread=iothread0,bus=pci.0,addr=0x4 \
    -m 1024 \
    -smp 16,cores=8,threads=1,sockets=2 \
    -cpu 'SandyBridge',+kvm_pv_unhalt \
    ...

Results with the default poll-max-ns=32:
1 thread:   iops=14962 latency=63.33us
2 threads:  iops=30401 latency=61.95us
4 threads:  iops=46302 latency=83.14us
8 threads:  iops=69719 latency=111.41us
12 threads: iops=83105 latency=140.61us
14 threads: iops=82755 latency=165.17us
16 threads: iops=84206 latency=185.96us

Results with poll-max-ns=256:
1 thread:   iops=21932 latency=42.71us
2 threads:  iops=35889 latency=52.73us
4 threads:  iops=56226 latency=67.94us
8 threads:  iops=78410 latency=98.61us
12 threads: iops=83841 latency=139.45us
14 threads: iops=83389 latency=163.98us
16 threads: iops=85881 latency=182.45us

OK, I'll send the info by email.
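As a side note, the polling limit that differs between the two runs above can also be adjusted on a running guest from the monitor. This sketch assumes the IOThread was created with id=iothread0, so its QOM path would be /objects/iothread0; the value is in nanoseconds.

    # From the HMP monitor (the command lines above use -monitor stdio):
    (qemu) qom-set /objects/iothread0 poll-max-ns 256000
    # Check the value currently in effect:
    (qemu) qom-get /objects/iothread0 poll-max-ns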
Created attachment 1383289 [details]
Patch implementing a more conservative shrinking of poll_ns

Description of problem:

* Server specs
- Xeon E5-2650 @ 2.00GHz (2 sockets, 8 cores, 2 threads)
- 64 GB of RAM

* VM specs
- 16 vCPUs pinned to NUMA node 1
- 1 GB of RAM
- 1 IOThread pinned to 8 cores from NUMA node 0
- 1 virtio-blk device without iothread, backed by a qcow2 file on an xfs filesystem, holding the guest OS.
- 1 virtio-blk device with iothread (with aio=threads), backed by a raw file on a tmpfs filesystem, used by the FIO tests.

* FIO test specs:
- fio --runtime=300 --time_based --clocksource=clock_gettime --name=randread --rw=randread --bs=4k --filename=/dev/vdb --size=10g --numjobs=X --iodepth=1 --direct=1 --ioengine=sync --group_reporting
- numjobs is the only parameter changed in each run.

* Tests on the Host:
- 1 Thread: latency=3us, iops=170K

* Tests on the VM:
- 1 Thread:   latency=33us,  iops=25K,  iothread-cpu-load=92%
- 2 Threads:  latency=33us,  iops=50K,  iothread-cpu-load=96%
- 4 Threads:  latency=40us,  iops=83K,  iothread-cpu-load=97%
- 8 Threads:  latency=55us,  iops=128K, iothread-cpu-load=99%
- 12 Threads: latency=71us,  iops=150K, iothread-cpu-load=99%
- 14 Threads: latency=124us, iops=103K, iothread-cpu-load=74%
- 16 Threads: latency=128us, iops=106K, iothread-cpu-load=75%

Please note the sudden increase in latency at 14 threads, accompanied by the decrease in IOPS and IOThread CPU load. The latter clearly signals that the IOThread is no longer actively polling the AioContext.

Enabling tracing shows a constant repetition of this pattern (I've added block_ns to the trace point):

<snip>
5991:poll_shrink ctx 0x55b02b45c5c0 block_ns 36383 old 32000 new 0
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 27767 old 0 new 4000
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 31081 old 4000 new 8000
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 25130 old 8000 new 16000
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 26365 old 16000 new 32000
5991:poll_shrink ctx 0x55b02b45c5c0 block_ns 41143 old 32000 new 0
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 26033 old 0 new 4000
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 25253 old 4000 new 8000
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 26373 old 8000 new 16000
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 25224 old 16000 new 32000
5991:poll_shrink ctx 0x55b02b45c5c0 block_ns 38603 old 32000 new 0
</snip>

This reveals two problems with the current adaptive algorithm:

- It misinterprets a high block_ns as wasted time: under heavy load, block_ns can exceed poll_max_ns simply because of the amount of work done processing the virtqueues.
- The default value for poll_shrink is 0, which causes ctx->poll_ns to drop straight to 0 each time poll_max_ns is exceeded.

Manually tuning poll_max_ns helps, but I think these two changes could also be considered (also see the attached patch):

- Only shrink poll_ns when no progress has been made.
- Use a default shrink factor of 2.

With the above changes, the tests with numjobs from 1 to 12 give the same values as before, while the 14- and 16-thread runs change to this:

- 14 Threads: latency=84us, iops=150K, iothread-cpu-load=99%
- 16 Threads: latency=92us, iops=156K, iothread-cpu-load=99%

It looks like we've reached the maximum throughput we can get from a single IOThread on this server, but at least throughput no longer drops.

Version-Release number of selected component (if applicable):
Tested with qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64. I suspect upstream behaves the same, but I haven't actually tried it yet.

How reproducible:
Always.
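For completeness, a minimal sketch of how the poll_grow/poll_shrink trace points quoted above could be enabled from the command line. This assumes a QEMU binary built with the "log" or "simple" trace backend (output goes to stderr or a trace file, depending on the backend); RHEL builds may instead expose these events as SystemTap/DTrace probes.

    # Enable the AioContext polling trace events (assumes a log/simple trace backend).
    /usr/libexec/qemu-kvm \
        ... \
        -object iothread,id=iothread0 \
        -trace enable=poll_grow \
        -trace enable=poll_shrink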