Description of problem:
The 'hv_vapic' flag doesn't evidently improve Windows guest performance.

Version-Release number of selected component (if applicable):
Host:
qemu-kvm-4.0.0-5.module+el8.1.0+3622+5812d9bf.x86_64
kernel-4.18.0-112.el8.x86_64
seabios-bin-1.11.1-4.module+el8.1.0+3531+2918145b.noarch
Guest:
en_windows_10_business_editions_version_1903_x64_dvd_37200948.iso

How reproducible:
3/3

Steps to Reproduce:
1. Boot the guest with command [1], without the 'hv_vapic' flag.
2. Observe the storage performance using the IOmeter tool:
   a. Download the tool inside the guest:
      http://sourceforge.net/projects/iometer/files/iometer-stable/1.1.0/iometer-1.1.0-win64.x86_64-bin.zip/download
   b. Open IOmeter and configure:
      "Disk Target" ==> "D:"
      "Access Specifications" ==> "4KiB 100% Read"
      "Test Setup" ==> "30 Minutes"
   c. Start the test.
3. Shut down the guest, then boot the same guest again with
   "-cpu Skylake-Client-IBRS,+kvm_pv_unhalt,hv_vapic" and repeat step 2.

Actual results:
[test 1] -- two workers in IOmeter:
storage performance without any flag:
PROCESSOR,CPU ==> 24.24%
IOPS ==> 5388.54
storage performance with the "hv_vapic" flag:
PROCESSOR,CPU ==> 23.80%
IOPS ==> 5400.17

[test 2] -- two workers in IOmeter:
storage performance without any flag:
PROCESSOR,CPU ==> 24.39%
IOPS ==> 5459.93
storage performance with the "hv_vapic" flag:
PROCESSOR,CPU ==> 24.01%
IOPS ==> 5438

[test 3] -- one worker in IOmeter:
storage performance without any flag:
PROCESSOR,CPU ==> 24.60%
IOPS ==> 5423.28
storage performance with the "hv_vapic" flag:
PROCESSOR,CPU ==> 24.00%
IOPS ==> 5393.71

Expected results:
The 'hv_vapic' flag evidently improves Windows guest performance.

Additional info:
[1] /usr/libexec/qemu-kvm -name win10-edk2 -M q35 -enable-kvm \
    -cpu SandyBridge-IBRS,+kvm_pv_unhalt,hv_time \
    -monitor stdio \
    -nodefaults -rtc base=utc \
    -m 4G \
    -boot menu=on,splash-time=12000 \
    -global driver=cfi.pflash01,property=secure,value=on \
    -drive file=/usr/share/edk2/ovmf/OVMF_CODE.secboot.fd,if=pflash,format=raw,readonly=on,unit=0 \
    -drive file=/home/1-win10-edk2/OVMF/OVMF_VARS.fd,if=pflash,format=raw,unit=1,readonly=off \
    -smp 2,sockets=1,cores=2,threads=2,maxcpus=4 \
    -object secret,id=sec0,data=redhat \
    -blockdev node-name=back_image,driver=file,cache.direct=on,cache.no-flush=off,filename=/home/1-win10-edk2/win10.luks,aio=threads \
    -blockdev node-name=drive-virtio-disk0,driver=luks,cache.direct=on,cache.no-flush=off,file=back_image,key-secret=sec0 \
    -device pcie-root-port,id=root0,slot=0 \
    -device virtio-blk-pci,drive=drive-virtio-disk0,id=disk0,bus=root0 \
    -device pcie-root-port,id=root1,slot=1 \
    -device virtio-net-pci,mac=70:5a:0f:38:cd:a3,id=idhRa7sf,vectors=4,netdev=idNIlYmb,bus=root1 \
    -netdev tap,id=idNIlYmb,vhost=on \
    -drive id=drive_cd1,if=none,snapshot=off,aio=threads,cache=none,media=cdrom,file=/home/iso/windows/virtio-win-prewhql-0.1-172.iso \
    -device ide-cd,id=cd1,drive=drive_cd1,bus=ide.0,unit=0 \
    -device ich9-usb-uhci6 \
    -device usb-tablet,id=mouse \
    -device qxl-vga,id=video1 \
    -spice port=5901,disable-ticketing \
    -device virtio-serial-pci,id=virtio-serial1 \
    -chardev spicevmc,id=charchannel0,name=vdagent \
    -device virtserialport,bus=virtio-serial1.0,nr=3,chardev=charchannel0,id=channel0,name=com.redhat.spice.0
Update status on the RHEL 8.2 fast train:

Test environment:
4.18.0-148.el8.x86_64
qemu-kvm-4.2.0-0.module+el8.2.0+4714+8670762e.x86_64
seabios-1.12.0-5.module+el8.2.0+4673+ff4b3b61.x86_64
en_windows_server_2019_updated_march_2019_x64_dvd_2ae967ab.iso

Test results:
storage performance without any flag:
PROCESSOR,CPU 1 ==> 24.84%
IOPS ==> 7652.10
storage performance with the flags "+kvm_pv_unhalt,hv_vapic":
PROCESSOR,CPU ==> 21.94%
IOPS ==> 7658.34
QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks
Vadim, do you by any chance remember how 'hv_vapic' feature was tested when it was introduced? Or, maybe, you know how to construct a good test for it? Thanks!
(In reply to Vitaly Kuznetsov from comment #6) > Vadim, do you by any chance remember how 'hv_vapic' feature was tested when > it was introduced? Or, maybe, you know how to construct a good test for it? > Thanks! Honestly, at that time the only tool that I used to check any hyper-v related performance improvements was IoMeter. I was testing 512B read/write IO against a FAT-formatted volume with 512B sector size. Best, Vadim.
I checked and the old trick seems to work; the IO test shows a moderate improvement.

What I did was:
1) Create a new raw volume on the host on tmpfs (important):
   # qemu-img create -f raw /tmp/disk.raw 8G
2) Start the Windows guest (I used WS2016); the command line was:
   qemu-system-x86_64 -machine q35,accel=kvm,kernel-irqchip=split -name guest=win10 \
    -cpu host,hv_vpindex,hv_time,hv_relaxed,hv_vapic,hv_synic,hv_stimer,-vmx \
    -smp 6 -m 16384 \
    -drive file=/var/lib/libvirt/images/WindowsServer2016_Gen1.qcow2,format=qcow2,if=none,id=drive-ide0-0-0 \
    -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 \
    -drive file=/tmp/disk.raw,format=raw,if=none,id=drive-ide1-0-0 \
    -device ide-hd,bus=ide.1,unit=0,drive=drive-ide1-0-0,id=ide1-0-0,bootindex=2 \
    -net nic,model=e1000e -net bridge,br=br0 -vnc :0
3) Partition the new hard drive and create an NTFS partition (D: in my case).
4) Install FIO (https://bsdio.com/fio/).
5) Create a fio job file; I used the following:

[global]
name=fio-rand-RW
filename=fio-rand-RW
directory=D\:\
rw=randwrite
bs=512B
direct=1
numjobs=6
time_based=1
runtime=300

[file1]
size=1G
iodepth=16

6) Note: the job uses the same 'numjobs' as the number of vCPUs the guest has.
7) Run the job: 'fio job.fio'.
8) Reboot the guest without 'hv_vapic', let it calm down, run the same job again, and compare the results.

You may also want to do vCPU pinning (libvirt can be used for that). In my testing I'm seeing >10% improvement.
(In reply to Vitaly Kuznetsov from comment #8)
> I checked and the old trick seems to work, IO test shows moderate
> improvement.
>
> What I did was:
> 1) Create a new raw volume on the host on tmpfs (important):
> # qemu-img create -f raw /tmp/disk.raw 8G
> 2) Start Windows guest, I used WS2016, the command line was:
> qemu-system-x86_64 -machine q35,accel=kvm,kernel-irqchip=split -name
> guest=win10 -cpu
> host,hv_vpindex,hv_time,hv_relaxed,hv_vapic,hv_synic,hv_stimer,-vmx -smp 6
> -m 16384 -drive
> file=/var/lib/libvirt/images/WindowsServer2016_Gen1.qcow2,format=qcow2,
> if=none,id=drive-ide0-0-0 -device
> ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive
> file=/tmp/disk.raw,format=raw,if=none,id=drive-ide1-0-0 -device
> ide-hd,bus=ide.1,unit=0,drive=drive-ide1-0-0,id=ide1-0-0,bootindex=2 -net
> nic,model=e1000e -net bridge,br=br0 -vnc :0
> 3) Partition hard drive, create an NTFS partition (D: in my case)
> 4) Install FIO (https://bsdio.com/fio/)
> 5) Create fio job, I used the following:
>
> [global]
> name=fio-rand-RW
> filename=fio-rand-RW
> directory=D\:\
> rw=randwrite
> bs=512B
> direct=1
> numjobs=6
> time_based=1
> runtime=300
>
> [file1]
> size=1G
> iodepth=16
>
> 6) Note, the job uses the same 'numjobs' as the number of vCPUs the guest has
> 7) Run the job, 'fio job.fio'
> 8) Reboot the guest without 'hv_vapic', let it calm down and run the same
> job, compare the result.

Tried as above (win10-64):
without the flag: bw=7505KiB/s
with the flags "hv_vpindex,hv_time,hv_relaxed,hv_vapic,hv_synic,hv_stimer": bw=8209KiB/s
About 9.38% improvement.

And I have some questions:

1. This case is testing hv_vapic; is it enough to test with only hv_vapic,
   rather than "hv_vpindex,hv_time,hv_relaxed,hv_vapic,hv_synic,hv_stimer"?

2. "Create a new raw volume on the host on tmpfs (important)" -- we must use
   a tmpfs filesystem for the test, right? I did it as below; is that correct?
   1 # mount tmpfs /mnt/tmpfs -t tmpfs
   2 # qemu-img create -f raw /mnt/tmpfs/data.raw 10G
   And why must we use tmpfs? Is qcow2 OK for this case?

> You may also want to do vCPU pinning (can use libvirt for that).

Do you mean we must do vCPU pinning for our test? We usually test with the
qemu command line, not libvirt.

> In my testsing I'm seeing >10% improvement.

In my testing it increased 9% and sometimes 8%; what percentage should we
achieve at least?

Thanks
Yu Wang
(In reply to Yu Wang from comment #9)
> About 9.38% improvement.

Sounds great)

> And I have some question:
> 1 this case is testing for hv_vapic, is it enough that only testing with
> hv_vapic, not "hv_vpindex,hv_time,hv_relaxed,hv_vapic,hv_synic,hv_stimer"

Should be, but I haven't tried myself.

> 2 Create a new raw volume on the host on tmpfs (important)
> we must use tmpfs filesystem to test, right?
>
> I do this as below, is it right?
> 1 #mount tmpfs /mnt/tmpfs -t tmpfs
> 2 #qemu-img create -f raw /mnt/tmpfs/data.raw 10G
>
> And why we must use tmpfs? Is qcow2 ok for this case?

We use tmpfs so we don't get blocked on the real storage, where we likely
wouldn't see any improvement. If you can get fast storage (an NVMe SSD, for
example), it should be equally good.

'qcow2' will work after it's fully populated. When created, the image is small
and it grows over time; you may get unstable test results before it reaches
the desired capacity. 'raw', on the other hand, doesn't have this problem as
it is fully populated upon creation.

> In my testing, it increase 9% and sometimes 8%, what percentage we should
> achieve at least?

It probably depends on the environment and may depend on the Windows version
used. In case you need a number for test automation, I'd say let's set it
fairly low (e.g. 5%); this way we know that the feature works and we'll catch
regressions if they ever happen.
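The host-side scratch-disk setup discussed above can be sketched as the command fragment below. The mount point and sizes are illustrative, and the explicit tmpfs 'size=' option is an addition (without it, tmpfs defaults to half of the host's RAM); the qcow2 alternative uses the standard 'preallocation=full' option so the image is fully populated up front:

```shell
# RAM-backed scratch filesystem, so the guest IO test is not
# bottlenecked by (and does not get blocked on) real storage
mount -t tmpfs -o size=12G tmpfs /mnt/tmpfs

# raw image: no growth effects, stable results from the first run
qemu-img create -f raw /mnt/tmpfs/data.raw 10G

# if qcow2 is preferred, preallocate it fully up front instead of
# letting it grow on demand during the test
qemu-img create -f qcow2 -o preallocation=full /mnt/tmpfs/data.qcow2 10G
```

Either image can then be attached to the guest exactly as in the command lines quoted in this thread.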
Hi Vitaly,

> You may also want to do vCPU pinning (can use libvirt for that).

Do you mean vCPU pinning is a must for our test, or not?
We use "numactl --physcpubind=1,2,3,4" for CPU pinning; is that right?

Another question:
> -drive file=/tmp/disk.raw,format=raw,if=none,id=drive-ide1-0-0 -device ide-hd,bus=ide.1,unit=0,drive=drive-ide1-0-0,id=ide1-0-0,bootindex=2

Do you suggest testing with an IDE disk, or is virtio-scsi/virtio-blk OK?
Since we use our own driver for virtio-scsi/virtio-blk, not a Microsoft
built-in driver, will that influence this hv flag's performance?

I retested this case today; the result is not as good as expected (without CPU pinning):

KiB/s                  with hv_vapic    without hv_vapic
--------------------------------------------------------
win2019/ide            9603             9256
win10-64/ide           7835             7565
win2019/virtio-scsi    11100            10400
win10-64/virtio-scsi   9153             8803

I also tested with vCPU pinning, but the improvement is low or even absent:

win2019/virtio-scsi    11800            11900

Boot cmd (full):
numactl --physcpubind=1,2,3,4,5,6 /usr/libexec/qemu-kvm \
    -S \
    -name 'avocado-vt-vm1' \
    -sandbox on \
    -machine q35 \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,port=0x1,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 6144 \
    -smp 6,maxcpus=6,cores=3,threads=1,sockets=2 \
    -cpu 'Skylake-Server',+kvm_pv_unhalt,hv_vapic \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_y_i74ftz1,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_y_i74ftz1,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idOFBXlG \
    -chardev socket,server,path=/var/tmp/avocado_y_i74ftz1,nowait,id=chardev_serial0 \
    -device isa-serial,id=serial0,chardev=chardev_serial0 \
    -chardev socket,id=seabioslog_id_20200210-015430-r3CRLBJ0,path=/var/tmp/avocado_y_i74ftz1,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20200210-015430-r3CRLBJ0,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-2,addr=0x1.0x1,port=0x2,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-2,addr=0x0 \
    -drive file=/home/kvm_autotest_root/images/win2019-64-virtio-scsi.qcow2,format=qcow2,if=none,id=drive-ide0-0-0 \
    -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 \
    -device pcie-root-port,id=pcie.0-root-port-5,slot=5,chassis=5,addr=0x5,bus=pcie.0 \
    -device virtio-scsi-pci,id=virtio_scsi_pci1,bus=pcie.0-root-port-5,addr=0x0 \
    -drive id=drive_image2,if=none,snapshot=off,aio=threads,format=raw,file=/mnt/tmpfs/data.raw \
    -device scsi-hd,id=image2,drive=drive_image2 \
    -device pcie-root-port,id=pcie-root-port-4,addr=0x1.0x3,port=0x4,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:24:b5:18:61:aa,id=idb3Oo7Z,mq=on,vectors=14,netdev=id1yLkRM,bus=pcie-root-port-4,addr=0x0 \
    -netdev tap,id=id1yLkRM,vhost=on,queues=6 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -vnc :0 \
    -rtc base=localtime,clock=host,driftfix=slew \
    -boot menu=on,order=cdn,once=c,strict=off \
    -enable-kvm \
    -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,port=0x5,bus=pcie.0,addr=0x3,chassis=5 \
    -monitor stdio

Thanks
Yu Wang
(In reply to Yu Wang from comment #11)
> Hi Vitaly
>
> > You may also want to do vCPU pinning (can use libvirt for that).
>
> Do you mean vCPU pinning is a must for our test or not?
> we use "numactl --physcpubind=1,2,3,4" for cpu pinning, is that right?

It is not a must, but it may give you more stable results, as with any
performance-related testing.

> Another question:
> > -drive file=/tmp/disk.raw,format=raw,if=none,id=drive-ide1-0-0 -device ide-hd,bus=ide.1,unit=0,drive=drive-ide1-0-0,id=ide1-0-0,bootindex=2
>
> Do you suggest testing with ide disk? Or virtio-scsi/virtio-blk is ok?
> Since we use our own driver for virtio-scsi/virtio-blk, not a microsoft
> build-in driver, will it influence this hv flag performance?

Modern devices may not generate that many interrupts, which is why I was using
IDE as the simplest possible test. I guess it's also possible to use some
legacy networking device instead of storage, but I haven't tested that.
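For per-vCPU pinning without libvirt, something like the sketch below can be used after the guest is started. This is only a sketch under stated assumptions: it assumes a single running qemu-kvm process and relies on QEMU's default vCPU thread naming ("CPU n/KVM"); the host CPU numbering is arbitrary and would need adjusting to the machine's topology:

```shell
#!/bin/sh
# Pin each vCPU thread of a running QEMU process to its own host CPU.
QEMU_PID=$(pgrep -of qemu-kvm)   # oldest matching process; assumes one guest

host_cpu=1
# vCPU threads are named "CPU 0/KVM", "CPU 1/KVM", ... by default
for tid in $(ps -L -o tid=,comm= -p "$QEMU_PID" | grep 'CPU.*KVM' | awk '{print $1}')
do
    taskset -pc "$host_cpu" "$tid"   # bind this vCPU thread to one host CPU
    host_cpu=$((host_cpu + 1))
done
```

Unlike "numactl --physcpubind=..." on the whole process, this gives each vCPU its own dedicated host CPU, which tends to reduce run-to-run variance.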
Could you have a look at comment#13? The result is not as good as expected on my side.

Thanks a lot
Yu Wang
Is there any difference between testing you've done for https://bugzilla.redhat.com/show_bug.cgi?id=1729077#c9 and https://bugzilla.redhat.com/show_bug.cgi?id=1729077#c13? Is it the same guest and the same hardware on the host? It's not very easy to achieve very stable test results with Windows guests, unfortunately. It would probably be possible to write a synthetic test (e.g. for kvm-unit-tests) for the feature but this won't tell us much about genuine Windows behavior across versions.
(In reply to Vitaly Kuznetsov from comment #15)
> Is there any difference between testing you've done for
> https://bugzilla.redhat.com/show_bug.cgi?id=1729077#c9 and
> https://bugzilla.redhat.com/show_bug.cgi?id=1729077#c13?
> Is it the same guest and the same hardware on the host? It's not very easy
> to achieve very stable test results with Windows guests, unfortunately. It
> would probably be possible to write a synthetic test (e.g. for
> kvm-unit-tests) for the feature but this won't tell us much about genuine
> Windows behavior across versions.

The only difference is that comment#9 used
"hv_vpindex,hv_time,hv_relaxed,hv_vapic,hv_synic,hv_stimer" as in comment#8.
I re-tried on win10-64 with all those flags, with only hv_vapic, and with no
flag:

            "hv_vpindex,hv_time,hv_relaxed,hv_vapic,hv_synic,hv_stimer"    only hv_vapic    no flag
win10-64    3448                                                           2500             2303

So it seems that with more flags the performance is better; the performance
with only hv_vapic versus no flag did not increase obviously.

Thanks
Yu Wang

numactl --physcpubind=1,2,3,4,5,6 /usr/libexec/qemu-kvm \
    -S \
    -name 'avocado-vt-vm1' \
    -sandbox on \
    -machine q35 \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,port=0x1,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 6144 \
    -smp 6,maxcpus=6,cores=3,threads=1,sockets=2 \
    -cpu 'Skylake-Server',hv_vpindex,hv_time,hv_relaxed,hv_vapic,hv_synic,hv_stimer \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_y_i74ftz1,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_y_i74ftz1,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idOFBXlG \
    -chardev socket,server,path=/var/tmp/avocado_y_i74ftz1,nowait,id=chardev_serial0 \
    -device isa-serial,id=serial0,chardev=chardev_serial0 \
    -chardev socket,id=seabioslog_id_20200210-015430-r3CRLBJ0,path=/var/tmp/avocado_y_i74ftz1,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20200210-015430-r3CRLBJ0,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-2,addr=0x1.0x1,port=0x2,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-2,addr=0x0 \
    -drive file=/home/kvm_autotest_root/images/win10-64-virtio-scsi.qcow2,format=qcow2,if=none,id=drive-ide0-0-0 \
    -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 \
    -drive file=/mnt/tmpfs/data.raw,format=raw,if=none,id=drive-ide1-0-0 \
    -device ide-hd,bus=ide.1,unit=0,drive=drive-ide1-0-0,id=ide1-0-0,bootindex=2 \
    -device pcie-root-port,id=pcie-root-port-4,addr=0x1.0x3,port=0x4,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:24:b5:18:61:aa,id=idb3Oo7Z,mq=on,vectors=14,netdev=id1yLkRM,bus=pcie-root-port-4,addr=0x0 \
    -netdev tap,id=id1yLkRM,vhost=on,queues=6 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -vnc :0 \
    -rtc base=localtime,clock=host,driftfix=slew \
    -boot menu=on,order=cdn,once=c,strict=off \
    -enable-kvm \
    -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,port=0x5,bus=pcie.0,addr=0x3,chassis=5 \
    -monitor stdio
(In reply to Yu Wang from comment #16)
> The only difference is used
> "hv_vpindex,hv_time,hv_relaxed,hv_vapic,hv_synic,hv_stimer"
> as comment#8, and I re-tried on win10-64 with these flags above, only
> hv_vapic, and no flag:
>
>             all six flags    only hv_vapic    no flag
> win10-64    3448             2500             2303
>
> So it seems that with more flags, the perfomance is better, the performance
> for only hv_vapic and no flag not incresed obviously.

With no hv_time/hv_stimer we get way more vmexits, and this may explain the
result you're getting: e.g. an exit for EOI still happens when a timer
interrupt gets injected, even when it's not needed (with hv_vapic) we will
still exit. That said, I think it makes sense to change the test to use 'all'
and 'all but hv_vapic' flag sets, to actually see what our users are seeing
(as no one will likely run with 'hv_vapic' only).
Tested with "all flags", "all but hv_vapic", and "no flags". Results are as below:

            all          all but hv_vapic    none
Win10-64    6742/6899    6926/7028           2288/2790

So it's almost the same with "all" and "all but hv_vapic", but both are much
higher than with "no flags".
Tested with "all flags" and "all but hv_vapic" on the fast train
(hv_evmcs depends on hv_vapic, so no hv_evmcs either).

Results are as below (two runs):

            all          all but hv_vapic
Win10-64    3151/3409    2750/2881
Win2016     2688/2439    2245/2200

Almost 10% improvement.

Steps as https://bugzilla.redhat.com/show_bug.cgi?id=1727238#c19

Thanks
Yu Wang
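For test automation, the ">=5%" pass threshold suggested earlier in this thread can be checked with plain shell arithmetic. A minimal sketch, using the win10-64 fio bandwidths (KiB/s) reported earlier in the thread (7505 without the flag, 8209 with it); the variable names and PASS/FAIL output are illustrative:

```shell
#!/bin/sh
# Percent improvement of the "with flag" run over the "without flag" run
# (integer math, which is enough for a coarse pass/fail gate).
without=7505
with=8209
improvement=$(( (with - without) * 100 / without ))
echo "improvement: ${improvement}%"

# Fail the automated run if below the agreed 5% threshold.
if [ "$improvement" -ge 5 ]; then echo PASS; else echo FAIL; fi
```

With these numbers the script reports a 9% improvement (the thread's more precise figure was 9.38%), which clears the 5% gate.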
(In reply to Yu Wang from comment #19)
> Tested with "all flags" and "all but hv_vapic" on the fast train
> (hv_evmcs depends on hv_vapic, so no hv_evmcs either).
>
> Results are as below (two runs):
>
>             all          all but hv_vapic
> Win10-64    3151/3409    2750/2881
> Win2016     2688/2439    2245/2200
>
> Almost 10% improvement.

Looks good to me, thanks!
According to comment#20, changing this bug to VERIFIED.

Thanks
Yu Wang
Changing this TestOnly BZ as CLOSED CURRENTRELEASE. Please reopen if the issue is not resolved.