Bug 1536398
| Field | Value |
|---|---|
| Summary | AioContext polling inhibits itself when facing heavy load from multiple vCPUs |
| Product | Red Hat Enterprise Linux 7 |
| Component | qemu-kvm-rhev |
| Version | 7.4 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | high |
| Reporter | Sergio Lopez <slopezpa> |
| Assignee | Stefan Hajnoczi <stefanha> |
| QA Contact | Yanhui Ma <yama> |
| Docs Contact | |
| CC | aliang, atheurer, chayang, coli, cww, djdumas, dsafford, gveitmic, gwatson, jentrena, juzhang, knoel, michen, mkalinin, sfroemer, slopezpa, stefanha, virt-maint, wquan, xuwei, yama, yhong |
| Target Milestone | rc |
| Target Release | --- |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | If docs needed, set a value |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2018-12-17 16:27:58 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1477664 |
| Attachments | Patch implementing a more conservative shrinking of poll_ns (attachment 1383289) |
I tried it with qemu-kvm-rhev-2.10.0-16.el7.x86_64 and kernel-3.10.0-820.el7.x86_64, using an NVMe device rather than tmpfs. The fio configuration is the same as in comment 0. There is no IOPS decrease with larger numjobs in the VM.

* Server specs:
- Intel(R) Xeon(R) CPU E7-4820 v2 @ 2.00GHz
- 64G RAM

* VM specs: vCPUs and IOThreads pinned in the same way as in comment 0

/usr/libexec/qemu-kvm \
    -S \
    -name 'avocado-vt-vm1' \
    -sandbox off \
    -machine pc \
    -nodefaults \
    -vga std \
    -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel75-64-virtio.qcow2 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=03 \
    -drive if=none,id=drive0,file=/dev/nvme0n1,format=raw,cache=none,aio=threads \
    -object iothread,id=iothread0 \
    -device virtio-blk-pci,iothread=iothread0,drive=drive0 \
    -device virtio-net-pci,id=id10Nyb9,vectors=4,netdev=id0msQUV,bus=pci.0,addr=05 \
    -netdev tap,id=id0msQUV \
    -m 1024 \
    -smp 16 \
    -cpu 'SandyBridge',+kvm_pv_unhalt,hv_spinlocks=0x1fff,hv_vapic,hv_time \
    -vnc :0 \
    -rtc base=localtime,clock=host,driftfix=slew \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm -monitor stdio

* Tests on the host:
1 thread:   iops=11121  latency=87.81us
2 threads:  iops=22008  latency=88.66us
4 threads:  iops=43017  latency=90.77us
8 threads:  iops=82524  latency=94.72us
12 threads: iops=116544 latency=100.71us
14 threads: iops=134024 latency=102.16us
16 threads: iops=148334 latency=105.58us
32 threads: iops=250153 latency=125.24us

* Tests on the VM:
1 thread:   iops=7487   latency=130.69us
2 threads:  iops=14841  latency=131.96us
4 threads:  iops=30127  latency=129.90us
8 threads:  iops=58565  latency=133.36us
12 threads: iops=78155  latency=149.99us
14 threads: iops=86214  latency=158.64us
16 threads: iops=91474  latency=170.99us
32 threads: iops=110177 latency=286.31us

I tried it with qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64.rpm; everything else is the same as in comment 3. Again, no IOPS decrease with larger numjobs in the VM.

* Tests on the VM:
1 thread:   iops=7634   latency=128.09us
2 threads:  iops=14904  latency=131.33us
4 threads:  iops=30302  latency=129.16us
8 threads:  iops=58926  latency=132.60us
12 threads: iops=80586  latency=145.40us
14 threads: iops=89001  latency=153.58us
16 threads: iops=95559  latency=163.53us
32 threads: iops=121226 latency=259.81us

(In reply to Yanhui Ma from comment #4)
> I tried it with qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64.rpm; everything else
> is the same as in comment 3. Again, no IOPS decrease with larger numjobs in
> the VM.
>
> * Tests on the VM:
> 1 thread: iops=7634 latency=128.09us

Polling is only active when latency <= 32us by default. The drive tested is too slow to take advantage of polling (too much CPU would be spent waiting for a request to complete).

The -object iothread,poll-max-ns= parameter can be set to increase the limit (the default is 32us), but anything above 100us is probably going to cause noticeable CPU utilization from the IOThread on the host.

(In reply to Sergio Lopez from comment #0)
> Created attachment 1383289 [details]
> Patch implementing a more conservative shrinking of poll_ns

The patch and your reasoning look good. Please post it upstream, CCing Christian Borntraeger <borntraeger.com> and myself. Please test aio=native on real hardware, if possible, and include those numbers when you send the patch upstream.
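For illustration only, a minimal sketch of how that limit could be raised when defining the IOThread. The values are examples: poll-max-ns is given in nanoseconds, so 256us corresponds to 256000, and poll-shrink is the related shrink divisor discussed later in this report.

    # Illustrative only: create the IOThread with a larger polling window.
    # poll-max-ns is in nanoseconds (the built-in default corresponds to 32us);
    # poll-shrink=2 halves the current polling time instead of resetting it to 0.
    /usr/libexec/qemu-kvm \
        ... \
        -object iothread,id=iothread0,poll-max-ns=256000,poll-shrink=2 \
        -device virtio-blk-pci,iothread=iothread0,drive=drive0 \
        ...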
(In reply to Stefan Hajnoczi from comment #5)
> (In reply to Yanhui Ma from comment #4)
> > I tried it with qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64.rpm; everything
> > else is the same as in comment 3. Again, no IOPS decrease with larger
> > numjobs in the VM.
> >
> > * Tests on the VM:
> > 1 thread: iops=7634 latency=128.09us
>
> Polling is only active when latency <= 32us by default. The drive tested is
> too slow to take advantage of polling (too much CPU would be spent waiting
> for a request to complete).
>
> The -object iothread,poll-max-ns= parameter can be set to increase the limit
> (the default is 32us), but anything above 100us is probably going to cause
> noticeable CPU utilization from the IOThread on the host.

Hi Stefan,

The latency is 163.53us with 16 threads in the results above. Do you mean I need to set poll-max-ns=256us (>163.53us) to make polling active from 1 thread to 16 threads? If it is meaningful, I will try it again.

Thanks,
Yanhui

(In reply to Yanhui Ma from comment #7)
> The latency is 163.53us with 16 threads in the results above. Do you mean I
> need to set poll-max-ns=256us (>163.53us) to make polling active from 1
> thread to 16 threads?

Yes, exactly.

> If it is meaningful, I will try it again.

I'm not sure it's meaningful. Users are unlikely to enable polling on slow devices because it consumes too much CPU.

From Comment #1:
> - 1 virtio-blk device with iothread (with aio=threads), backed by a raw file
>   on a tmpfs filesystem, used by the FIO tests.

If you use tmpfs like Sergio does then the latency should be much lower. I suggest testing the same configuration as Sergio if you want to reproduce this.
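A minimal sketch of how such a sweep can be driven from inside the guest, reusing the fio command from comment 0 (reproduced in the problem description at the end of this report). The numjobs values simply mirror the result tables below, and /dev/vdb is assumed to be the virtio-blk test disk.

    # Sweep the randread job from comment 0 over an increasing number of jobs.
    # /dev/vdb is assumed to be the virtio-blk test disk; adjust if it differs.
    for jobs in 1 2 4 8 12 14 16; do
        fio --runtime=300 --time_based --clocksource=clock_gettime \
            --name=randread --rw=randread --bs=4k --filename=/dev/vdb \
            --size=10g --numjobs=$jobs --iodepth=1 --direct=1 \
            --ioengine=sync --group_reporting
    done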
Hi Sergio,

I tried it with tmpfs, and IOPS only decreased a little with numjobs=14. Could you help check my steps? The host specs, fio specs and NUMA pinning are the same as in comment 3, and are as close as possible to your configuration.

# mount -t tmpfs -o size=30G,mode=700 tmpfs /mytmpfs
# qemu-img create -f raw /mytmpfs/storage2.raw 20G

qemu command line:

/usr/libexec/qemu-kvm \
    ...
    -object iothread,id=iothread0 \
    -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel75-64-virtio.qcow2 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,iothread=iothread0,bus=pci.0,addr=0x3 \
    -drive id=drive_disk1,if=none,snapshot=off,aio=threads,cache=writeback,format=raw,file=/mytmpfs/storage2.raw \
    -device virtio-blk-pci,id=disk1,drive=drive_disk1,bootindex=1,iothread=iothread0,bus=pci.0,addr=0x4 \
    -m 1024 \
    -smp 16,cores=8,threads=1,sockets=2 \
    -cpu 'SandyBridge',+kvm_pv_unhalt \
    ...

Results with the default poll-max-ns=32:
1 thread:   iops=14962 latency=63.33us
2 threads:  iops=30401 latency=61.95us
4 threads:  iops=46302 latency=83.14us
8 threads:  iops=69719 latency=111.41us
12 threads: iops=83105 latency=140.61us
14 threads: iops=82755 latency=165.17us
16 threads: iops=84206 latency=185.96us

Results with poll-max-ns=256:
1 thread:   iops=21932 latency=42.71us
2 threads:  iops=35889 latency=52.73us
4 threads:  iops=56226 latency=67.94us
8 threads:  iops=78410 latency=98.61us
12 threads: iops=83841 latency=139.45us
14 threads: iops=83389 latency=163.98us
16 threads: iops=85881 latency=182.45us

OK, I'll send the info by email.
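As a side note, the polling limit that differs between the two runs above can also be adjusted on a running guest from the monitor. This sketch assumes the IOThread was created with id=iothread0, so its QOM path would be /objects/iothread0; the value is in nanoseconds.

    # From the HMP monitor (the command lines above use -monitor stdio):
    (qemu) qom-set /objects/iothread0 poll-max-ns 256000
    # Check the value currently in effect:
    (qemu) qom-get /objects/iothread0 poll-max-ns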
Created attachment 1383289 [details]
Patch implementing a more conservative shrinking of poll_ns

Description of problem:

* Server specs
- Xeon E5-2650 @ 2.00GHz (2 sockets, 8 cores, 2 threads)
- 64 GB of RAM

* VM specs
- 16 vCPUs pinned to NUMA node 1
- 1 GB of RAM
- 1 IOThread pinned to 8 cores from NUMA node 0
- 1 virtio-blk device without iothread, backed by a qcow2 file on an xfs filesystem, holding the guest OS.
- 1 virtio-blk device with iothread (with aio=threads), backed by a raw file on a tmpfs filesystem, used by the FIO tests.

* FIO test specs:
- fio --runtime=300 --time_based --clocksource=clock_gettime --name=randread --rw=randread --bs=4k --filename=/dev/vdb --size=10g --numjobs=X --iodepth=1 --direct=1 --ioengine=sync --group_reporting
- numjobs is the only parameter changed in each run.

* Tests on the Host:
- 1 Thread: latency=3us, iops=170K

* Tests on the VM:
- 1 Thread:   latency=33us,  iops=25K,  iothread-cpu-load=92%
- 2 Threads:  latency=33us,  iops=50K,  iothread-cpu-load=96%
- 4 Threads:  latency=40us,  iops=83K,  iothread-cpu-load=97%
- 8 Threads:  latency=55us,  iops=128K, iothread-cpu-load=99%
- 12 Threads: latency=71us,  iops=150K, iothread-cpu-load=99%
- 14 Threads: latency=124us, iops=103K, iothread-cpu-load=74%
- 16 Threads: latency=128us, iops=106K, iothread-cpu-load=75%

Please note the sudden increase in latency at 14 threads, accompanied by the decrease in IOPS and IOThread CPU load. The latter clearly signals that the IOThread is no longer actively polling the AioContext.

Enabling tracing shows a constant repetition of this pattern (I've added block_ns to the trace point):

<snip>
5991:poll_shrink ctx 0x55b02b45c5c0 block_ns 36383 old 32000 new 0
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 27767 old 0 new 4000
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 31081 old 4000 new 8000
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 25130 old 8000 new 16000
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 26365 old 16000 new 32000
5991:poll_shrink ctx 0x55b02b45c5c0 block_ns 41143 old 32000 new 0
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 26033 old 0 new 4000
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 25253 old 4000 new 8000
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 26373 old 8000 new 16000
5991:poll_grow ctx 0x55b02b45c5c0 block_ns 25224 old 16000 new 32000
5991:poll_shrink ctx 0x55b02b45c5c0 block_ns 38603 old 32000 new 0
</snip>

This reveals two problems with the current adaptive algorithm:

- It misinterprets a high block_ns as wasted time: under heavy load, block_ns can exceed poll_max_ns simply because of the amount of work done processing the virtqueues.
- The default value for poll_shrink is 0, which causes ctx->poll_ns to drop straight to 0 each time poll_max_ns is exceeded.

Manually tuning poll_max_ns helps, but I think these two changes could also be considered (also see the attached patch):

- Only shrink poll_ns when no progress has been made.
- Use a default shrink factor of 2.

With the above changes, the tests with numjobs from 1 to 12 give the same values as before, while the 14- and 16-thread runs change to this:

- 14 Threads: latency=84us, iops=150K, iothread-cpu-load=99%
- 16 Threads: latency=92us, iops=156K, iothread-cpu-load=99%

It looks like we've reached the maximum throughput we can get from a single IOThread on this server, but at least throughput no longer drops.

Version-Release number of selected component (if applicable):
Tested with qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64. I suspect upstream behaves the same, but I haven't actually tried it yet.

How reproducible:
Always.
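For completeness, a minimal sketch of how the poll_grow/poll_shrink trace points quoted above could be enabled from the command line. This assumes a QEMU binary built with the "log" or "simple" trace backend (output goes to stderr or a trace file, depending on the backend); RHEL builds may instead expose these events as SystemTap/DTrace probes.

    # Enable the AioContext polling trace events (assumes a log/simple trace backend).
    /usr/libexec/qemu-kvm \
        ... \
        -object iothread,id=iothread0 \
        -trace enable=poll_grow \
        -trace enable=poll_shrink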