We've made substantial changes to the NVMe driver in QEMU and, although we don't consider it fully supported, we want to encourage layered products to give it a try (it is still Tech Preview). This BZ is to bring some awareness to the changes and to request additional testing of the driver.

What follows is copied from a document Stefan shared with QE earlier; it includes the testing procedure:

----

A userspace NVMe driver has been available in QEMU but was experimental until recently. It is now ready to be used when the physical storage is a local NVMe PCI device. Disk I/O performance is improved over the traditional file-backed block drivers in QEMU.

The entire PCI adapter is assigned to a single guest. The host cannot access the NVMe device while the guest is running. Users may choose VFIO Device Assignment instead for even lower overhead if they do not require live migration.

This feature is available on x86. POWER and aarch64 are not yet supported, but may be available by the release date.

The userspace NVMe driver is a good choice when I/O performance is a priority but VFIO Device Assignment cannot be used. Storage migration and other storage features are available with the userspace NVMe driver.

The libvirt domain XML is as follows:

  <disk type='nvme' device='disk'>
    <driver name='qemu' type='raw'/>
    <source type='pci' managed='yes' namespace='1'>
      <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </source>
    <target dev='vde' bus='virtio'/>
  </disk>

The NVMe namespace can be selected on drives that support multiple namespaces using <source namespace='N'>.

--- Testing ---

Requires: a host with a spare NVMe drive that is not in use by the host operating system.

Configure a virtio-blk device with the userspace NVMe driver as shown in the libvirt domain XML above. Define an IOThread and assign it to the virtio-blk device (see the sketch below). When the guest boots it sees a virtio-blk device. Comparing the I/O performance to the host /dev/nvme0n1 block device with aio=native shows that the userspace NVMe driver is at least as fast as the host block device.
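For reference, a minimal sketch of the IOThread definition and assignment in the domain XML (illustrative only; a working configuration along these lines is shown later in this BZ):

  <domain type='kvm'>
    ...
    <iothreads>1</iothreads>
    <devices>
      <disk type='nvme' device='disk'>
        <!-- iothread='1' binds this disk's I/O to the IOThread defined above -->
        <driver name='qemu' type='raw' iothread='1'/>
        <source type='pci' managed='yes' namespace='1'>
          <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
        </source>
        <target dev='vde' bus='virtio'/>
      </disk>
      ...
    </devices>
  </domain>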
Use the following XML to start a VM with a disk:

> <disk type='nvme' device='disk'>
>   <driver name='qemu' type='raw'/>
>   <source type='pci' managed='yes' namespace='1'>
>     <address domain='0x0000' bus='0x65' slot='0x00' function='0x0'/>
>   </source>
>   <target dev='vde' bus='virtio'/>
> </disk>

The QEMU command line is like:

-blockdev {"driver":"nvme","device":"0000:65:00.0","namespace":1,"node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"} \
-blockdev {"node-name":"libvirt-1-format","read-only":false,"driver":"raw","file":"libvirt-1-storage"} \
-device virtio-blk-pci,bus=pci.4,addr=0x0,drive=libvirt-1-format,id=virtio-disk4 \

Xueqiang, could you have a look?
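(Since the XML uses managed='yes', libvirt should detach the controller from the host nvme driver and bind it to vfio-pci automatically when the VM starts. A quick way to confirm the binding from the host, as a sketch with the PCI address taken from the XML above and the output abbreviated:

# lspci -nnk -s 0000:65:00.0
  ...
  Kernel driver in use: vfio-pci
)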
According to the Description, I ran the fio test on the host /dev/nvme0n1 block device and on a virtio-blk device backed by the userspace NVMe driver in the guest. Comparing the I/O performance between them, the userspace NVMe driver is much slower than the host block device.

Details:

Version:
kernel-4.18.0-255.el8.x86_64
qemu-kvm-5.2.0-0.module+el8.4.0+8855+a9e237a9

1. Create a partition on host /dev/nvme0n1 and mount it at /home/fio_nvme

# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 372G 0 disk
├─sda1 8:1 0 1G 0 part /boot
└─sda2 8:2 0 371G 0 part
 ├─rhel_dell--per740xd--01-root 253:0 0 70G 0 lvm /
 ├─rhel_dell--per740xd--01-swap 253:1 0 31.4G 0 lvm [SWAP]
 └─rhel_dell--per740xd--01-home 253:2 0 269.7G 0 lvm /home
sdb 8:16 0 558.4G 0 disk
└─sdb1 8:17 0 558.4G 0 part
nvme0n1 259:0 0 745.2G 0 disk
└─nvme0n1p1 259:2 0 745.2G 0 part /home/fio_nvme

2. Run the fio test on /home/fio_nvme

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/home/fio_nvme/test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][r=547MiB/s,w=137MiB/s][r=140k,w=34.9k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=32471: Thu Dec 3 09:31:22 2020
  read: IOPS=133k, BW=521MiB/s (546MB/s)(153GiB/300005msec)
   slat (nsec): min=1791, max=256544, avg=7424.69, stdev=3190.10
   clat (usec): min=33, max=12300, avg=634.19, stdev=703.31
   lat (usec): min=46, max=12306, avg=641.74, stdev=703.23
   clat percentiles (usec):
    | 1.00th=[ 90], 5.00th=[ 105], 10.00th=[ 123], 20.00th=[ 167],
    | 30.00th=[ 215], 40.00th=[ 273], 50.00th=[ 347], 60.00th=[ 449],
    | 70.00th=[ 603], 80.00th=[ 906], 90.00th=[ 1860], 95.00th=[ 2343],
    | 99.00th=[ 2835], 99.50th=[ 3064], 99.90th=[ 4555], 99.95th=[ 4883],
    | 99.99th=[ 5735]
   bw ( KiB/s): min=487200, max=587008, per=100.00%, avg=534129.38, stdev=2211.75, samples=4784
   iops : min=121800, max=146752, avg=133532.32, stdev=552.93, samples=4784
  write: IOPS=33.3k, BW=130MiB/s (137MB/s)(38.2GiB/300005msec); 0 zone resets
   slat (nsec): min=1924, max=348193, avg=8215.24, stdev=3844.45
   clat (usec): min=11, max=13123, avg=1259.29, stdev=1435.59
   lat (usec): min=22, max=13130, avg=1267.63, stdev=1435.34
   clat percentiles (usec):
    | 1.00th=[ 29], 5.00th=[ 56], 10.00th=[ 100], 20.00th=[ 182],
    | 30.00th=[ 265], 40.00th=[ 375], 50.00th=[ 553], 60.00th=[ 865],
    | 70.00th=[ 1532], 80.00th=[ 2638], 90.00th=[ 3523], 95.00th=[ 3982],
    | 99.00th=[ 5997], 99.50th=[ 6456], 99.90th=[ 7373], 99.95th=[ 7832],
    | 99.99th=[ 9110]
   bw ( KiB/s): min=118944, max=149098, per=100.00%, avg=133549.00, stdev=608.14, samples=4784
   iops : min=29736, max=37274, avg=33387.24, stdev=152.03, samples=4784
  lat (usec) : 20=0.01%, 50=0.90%, 100=3.91%, 250=30.07%, 500=25.66%
  lat (usec) : 750=11.62%, 1000=5.82%
  lat (msec) : 2=9.93%, 4=10.95%, 10=1.14%, 20=0.01%
  cpu : usr=7.14%, sys=15.25%, ctx=27755970, majf=0, minf=379
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
   submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
   complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
   issued rwts: total=39999049,10001019,0,0 short=0,0,0,0 dropped=0,0,0,0
   latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  READ: bw=521MiB/s (546MB/s), 521MiB/s-521MiB/s (546MB/s-546MB/s), io=153GiB (164GB), run=300005-300005msec
  WRITE: bw=130MiB/s (137MB/s), 130MiB/s-130MiB/s (137MB/s-137MB/s), io=38.2GiB (40.0GB), run=300005-300005msec

Disk stats (read/write):
  nvme0n1: ios=39986219/9997872, merge=0/3, ticks=25066034/12163862, in_queue=37229896, util=100.00%

3. Configure the userspace NVMe driver

Unbind the host NVMe controller from the host:
# echo 0000:bc:00.0 > /sys/bus/pci/devices/0000\:bc\:00.0/driver/unbind

Bind the host NVMe controller to the host vfio-pci driver:
# echo 144d a822 > /sys/bus/pci/drivers/vfio-pci/new_id

4. Create an image

# qemu-img create -f raw nvme://0000:bc:00.0/1 20G
# qemu-img info nvme://0000:bc:00.0/1
image: nvme://0000:bc:00.0/1
file format: raw
virtual size: 745 GiB (800166076416 bytes)
disk size: unavailable

5. Configure a virtio-blk device with the userspace NVMe driver, define an IOThread, and assign it to the virtio-blk device. Then install a rhel8.4 guest on it

/usr/libexec/qemu-kvm \
-S \
-name 'avocado-vt-vm1' \
-sandbox on \
-machine q35 \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
-nodefaults \
-device VGA,bus=pcie.0,addr=0x2 \
-m 15360 \
-smp 16,maxcpus=16,cores=8,threads=1,dies=1,sockets=2 \
-cpu 'Haswell-noTSX',+kvm_pv_unhalt \
-chardev socket,nowait,path=/var/tmp/avocado_xpeuo28b/monitor-qmpmonitor1-20200522-125204-4Vi7sqOR,server,id=qmp_id_qmpmonitor1 \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-chardev socket,nowait,path=/var/tmp/avocado_xpeuo28b/monitor-catch_monitor-20200522-125204-4Vi7sqOR,server,id=qmp_id_catch_monitor \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device pvpanic,ioport=0x505,id=idX2dIhI \
-chardev socket,nowait,path=/var/tmp/avocado_xpeuo28b/serial-serial0-20200522-125204-4Vi7sqOR,server,id=chardev_serial0 \
-device isa-serial,id=serial0,chardev=chardev_serial0 \
-chardev socket,id=seabioslog_id_20200522-125204-4Vi7sqOR,path=/var/tmp/avocado_xpeuo28b/seabios-20200522-125204-4Vi7sqOR,server,nowait \
-device isa-debugcon,chardev=seabioslog_id_20200522-125204-4Vi7sqOR,iobase=0x402 \
-device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
-device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-object iothread,id=iothread0 \
-object iothread,id=iothread1 \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device virtio-net-pci,mac=9a:1c:0c:0d:e3:4c,id=idjmZXQS,netdev=idEFQ4i1,bus=pcie-root-port-3,addr=0x0 \
-netdev tap,id=idEFQ4i1,vhost=on \
-vnc :0 \
-rtc base=utc,clock=host,driftfix=slew \
-boot menu=off,order=cdn,once=c,strict=off \
-enable-kvm \
-monitor stdio \
-device pcie-root-port,id=pcie-root-port-5,port=0x5,addr=0x1.0x5,bus=pcie.0,chassis=5 \
-blockdev node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,discard=unmap \
-device virtio-blk-pci,id=nvme1,drive=drive_nvme1,bootindex=0,bus=pcie-root-port-5,addr=0x0,iothread=iothread1 \
-device pcie-root-port,id=pcie-root-port-6,port=0x6,addr=0x1.0x6,bus=pcie.0,chassis=6 \
-device virtio-scsi-pci,id=virtio_scsi_pci2,bus=pcie-root-port-6,addr=0x0 \
-blockdev node-name=file_cd1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/iso/linux/RHEL-8.4.0-20200905.n.0-x86_64-dvd1.iso,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_cd1,driver=raw,read-only=on,cache.direct=on,cache.no-flush=off,file=file_cd1 \
-device scsi-cd,id=cd1,drive=drive_cd1,write-cache=on,bootindex=1 \

6. Boot the guest after the installation and run the fio test on /home/test

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/home/test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][3.5%][eta 08d:21h:09m:48s]

After step 2, the fio test finished in 5 minutes.
After step 6, the fio test didn't finish even after 12 hours.

Hi Philippe, Ademar,

I want to confirm the following items; please correct me if I am wrong. Many thanks.

1. I think we also need to compare the I/O performance between the host /dev/nvme0n1 block device and NVMe Device Assignment, right? (NVMe Device Assignment, e.g. -device vfio-pci,host=0000:65:00.0,id=pf2,bus=root.5,addr=0x0) And this bug just tracks the NVMe userspace driver, right?

2. Parameter 'aio' is unexpected when booting a virtio-blk device with the userspace NVMe driver.

qemu cmd lines:
-object iothread,id=iothread1 \
-device pcie-root-port,id=pcie-root-port-5,port=0x5,addr=0x1.0x5,bus=pcie.0,chassis=5 \
-blockdev node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-only=on,discard=unmap,aio=native \
-blockdev node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,discard=unmap \
-device virtio-blk-pci,id=nvme1,drive=drive_nvme1,bootindex=0,bus=pcie-root-port-5,addr=0x0,iothread=iothread1 \

error message:
qemu-kvm: -blockdev node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-only=on,discard=unmap,aio=native: Parameter 'aio' is unexpected

Do we need an RFE bug to track this issue?
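(Note on question 2, as my reading of the option rather than an authoritative statement: aio= belongs to the file-backed blockdev drivers such as driver=file and driver=host_device, where it selects the host AIO engine. The nvme blockdev driver submits I/O to the controller directly through VFIO, so there is no host AIO engine to select and the option does not exist on that driver. That matches how the install command line above uses it only on the ISO's file node:

-blockdev node-name=file_cd1,driver=file,aio=threads,filename=/home/kvm_autotest_root/iso/linux/RHEL-8.4.0-20200905.n.0-x86_64-dvd1.iso,... \

If that reading is right, rejecting aio= on driver=nvme is expected behavior rather than an RFE candidate.)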
(In reply to Xueqiang Wei from comment #2)
[...]
> 4. create a image
>
> # qemu-img create -f raw nvme://0000:bc:00.0/1 20G

Since you use a 20G size here,

> # qemu-img info nvme://0000:bc:00.0/1
> image: nvme://0000:bc:00.0/1
> file format: raw
> virtual size: 745 GiB (800166076416 bytes)
> disk size: unavailable

...

> -blockdev
> node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-
> only=on,discard=unmap \
> -blockdev
> node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,
> discard=unmap \

I think you should use ...,size=20G here.
(In reply to Philippe Mathieu-Daudé from comment #4)
> (In reply to Xueqiang Wei from comment #2)
> [...]
> > 4. create a image
> >
> > # qemu-img create -f raw nvme://0000:bc:00.0/1 20G
>
> Since you use a 20G size here,
>
> > # qemu-img info nvme://0000:bc:00.0/1
> > image: nvme://0000:bc:00.0/1
> > file format: raw
> > virtual size: 745 GiB (800166076416 bytes)
> > disk size: unavailable
>
> ...
>
> > -blockdev
> > node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-
> > only=on,discard=unmap \
> > -blockdev
> > node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,
> > discard=unmap \
>
> I think you should use ...,size=20G here.

1. I created an image with size=30G and added size=32212254720 to the blockdev command line. I then installed a rhel8.4 guest on it; the installation didn't finish after 2 hours. Details are shown below.

2. If I don't use size=32212254720 in the command line, the installation finishes in 30 minutes.

3. Philippe, please check the two questions I asked in Comment 2. Many thanks.

Details:

# qemu-img create -f raw nvme://0000:bc:00.0/1 30G
Formatting 'nvme://0000:bc:00.0/1', fmt=raw size=32212254720

# qemu-img info nvme://0000:bc:00.0/1
image: nvme://0000:bc:00.0/1
file format: raw
virtual size: 745 GiB (800166076416 bytes)
disk size: unavailable

qemu cmd lines:
-object iothread,id=iothread1 \
-device pcie-root-port,id=pcie-root-port-5,port=0x5,addr=0x1.0x5,bus=pcie.0,chassis=5 \
-blockdev node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,discard=unmap,size=32212254720 \
-device virtio-blk-pci,id=nvme1,drive=drive_nvme1,bootindex=0,bus=pcie-root-port-5,addr=0x0,iothread=iothread1 \

screenshot: http://fileshare.englab.nay.redhat.com/pub/section2/kvm/xuwei/bug/installation_screenshot.png
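(As an aside, a sketch of how the size restriction could be checked without starting a guest, using the json: pseudo-protocol so that the raw driver's size option can be passed to qemu-img; option placement mirrors the -blockdev usage above:

# qemu-img info 'json:{"driver":"raw","size":32212254720,"file":{"driver":"nvme","device":"0000:bc:00.0","namespace":1}}'

If the size option is applied, the reported virtual size should be 30 GiB rather than the full 745 GiB namespace.)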
Tested with qemu-kvm-5.2.0-1.module+el8.4.0+9091+650b220a and did not hit the issue mentioned in Comment 5. But the fio test in the guest still cannot finish, even though the progress shows 100%.

Details:

Versions:
kernel-4.18.0-260.el8.x86_64
qemu-kvm-5.2.0-1.module+el8.4.0+9091+650b220a

1. Create a raw image on the NVMe device

# qemu-img create -f raw nvme://0000:bc:00.0/1 30G
Formatting 'nvme://0000:bc:00.0/1', fmt=raw size=32212254720

# qemu-img info nvme://0000:bc:00.0/1
image: nvme://0000:bc:00.0/1
file format: raw
virtual size: 745 GiB (800166076416 bytes)
disk size: unavailable

2. Install rhel8.4 on it

/usr/libexec/qemu-kvm \
-S \
-name 'avocado-vt-vm1' \
-sandbox on \
-machine q35 \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
-nodefaults \
-device VGA,bus=pcie.0,addr=0x2 \
-m 15360 \
-smp 16,maxcpus=16,cores=8,threads=1,dies=1,sockets=2 \
-cpu 'Skylake-Server',+kvm_pv_unhalt \
-chardev socket,nowait,path=/var/tmp/avocado_xpeuo28b/monitor-qmpmonitor1-20200522-125204-4Vi7sqOR,server,id=qmp_id_qmpmonitor1 \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-chardev socket,nowait,path=/var/tmp/avocado_xpeuo28b/monitor-catch_monitor-20200522-125204-4Vi7sqOR,server,id=qmp_id_catch_monitor \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device pvpanic,ioport=0x505,id=idX2dIhI \
-chardev socket,nowait,path=/var/tmp/avocado_xpeuo28b/serial-serial0-20200522-125204-4Vi7sqOR,server,id=chardev_serial0 \
-device isa-serial,id=serial0,chardev=chardev_serial0 \
-chardev socket,id=seabioslog_id_20200522-125204-4Vi7sqOR,path=/var/tmp/avocado_xpeuo28b/seabios-20200522-125204-4Vi7sqOR,server,nowait \
-device isa-debugcon,chardev=seabioslog_id_20200522-125204-4Vi7sqOR,iobase=0x402 \
-device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
-device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-object iothread,id=iothread0 \
-object iothread,id=iothread1 \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device virtio-net-pci,mac=9a:1c:0c:0d:e3:4c,id=idjmZXQS,netdev=idEFQ4i1,bus=pcie-root-port-3,addr=0x0 \
-netdev tap,id=idEFQ4i1,vhost=on \
-vnc :0 \
-rtc base=utc,clock=host,driftfix=slew \
-boot menu=off,order=cdn,once=c,strict=off \
-enable-kvm \
-monitor stdio \
-device pcie-root-port,id=pcie-root-port-5,port=0x5,addr=0x1.0x5,bus=pcie.0,chassis=5 \
-blockdev node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,discard=unmap,size=32212254720 \
-device virtio-blk-pci,id=nvme1,drive=drive_nvme1,bootindex=0,bus=pcie-root-port-5,addr=0x0,iothread=iothread1 \
-device pcie-root-port,id=pcie-root-port-6,port=0x6,addr=0x1.0x6,bus=pcie.0,chassis=6 \
-device virtio-scsi-pci,id=virtio_scsi_pci2,bus=pcie-root-port-6,addr=0x0 \
-blockdev node-name=file_cd1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/iso/linux/RHEL-8.4.0-20201209.n.0-x86_64-dvd1.iso,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_cd1,driver=raw,read-only=on,cache.direct=on,cache.no-flush=off,file=file_cd1 \
-device scsi-cd,id=cd1,drive=drive_cd1,write-cache=on,bootindex=1 \

3. Check info in the guest

# uname -r
kernel-4.18.0-259.el8.x86_64

# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sr0 11:0 1 9.3G 0 rom /run/media/xuwei/RHEL-8-4-0-BaseOS-x86_64
vda 252:0 0 30G 0 disk
├─vda1 252:1 0 1G 0 part /boot
└─vda2 252:2 0 29G 0 part
 ├─rhel-root 253:0 0 26G 0 lvm /
 └─rhel-swap 253:1 0 3G 0 lvm [SWAP]

4. Run the fio test

# mkdir -p /home/fio_nvme/
# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/home/fio_nvme/test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][eta 00m:00s]

************* The fio test didn't finish after 12 hours

5. Check the VM status:

QEMU 5.2.0 monitor - type 'help' for more information
(qemu) c
(qemu) info status
VM status: running

After step 2, the guest installed successfully, and the disk size is 30G in the guest.
After step 4, the fio test didn't finish after 12 hours, even though the progress shows 100%. Checking the VM status, it is running.

Hi Philippe,

Please help check whether I missed any steps. By the way, please also help check the two questions I asked in Comment 2. Many thanks.

Hi Yanghang,

Please help test this in your NVMe environment and check whether you also hit it. Thanks.
(In reply to Xueqiang Wei from comment #6)
> Hi Yanghang,
>
> Please help test this in your NVMe environment and check whether you also
> hit it. Thanks.

Hi,

My machine with the NVMe disk is currently running other test tasks. I will test this as soon as I get the machine back.
Hi Philippe,

Could you check the following test scenario:

1. Prepare a host with an NVMe disk and bind the driver of the host NVMe disk to vfio-pci
# virsh nodedev-detach pci_$domain_$bus_$slot_$function

2. Assign the NVMe disk to the VM and start the VM

The XML for this NVMe disk is like:

<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='$domain' bus='$bus' slot='$slot' function='$function'/>
  </source>
  <alias name='hostdev0'/>
</hostdev>

The qemu cmd line for this NVMe disk is like:

-device vfio-pci,host=$domain:$bus:$slot.$function,id=nvme_disk,addr=0x0

3. Do some performance tests on the disk in the VM

According to your description in comment 0, it seems to me that we don't need to cover the test scenario above for this bug, and the qemu command line and domain XML used to verify this bug should be the same as the ones I posted in comment 1.

Is my understanding correct? If I have any misunderstanding, or I need to do some additional testing for this bug, please feel free to let me know. Thanks in advance.
> Please help test this in your NVMe environment and check whether you also hit it.

In my test environment, I hit the same problem Xueqiang mentioned in comment 6.

Test Env:
host:
qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f.x86_64
4.18.0-262.el8.dt4.x86_64
guest:
4.18.0-262.el8.x86_64

Test Steps:

(1) # virsh nodedev-detach pci_0000_65_00_0

(2) # qemu-img create -f raw nvme://0000:65:00.0/1 30G
Formatting 'nvme://0000:65:00.0/1', fmt=raw size=32212254720

# qemu-img info nvme://0000:65:00.0/1
image: nvme://0000:65:00.0/1
file format: raw
virtual size: 745 GiB (800166076416 bytes)
disk size: unavailable

(3) Install a VM on the userspace NVMe disk

-object iothread,id=iothread1 \
-blockdev node-name=nvme_image1,driver=nvme,device=0000:65:00.0,namespace=1,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,discard=unmap,size=32212254720 \
-device virtio-blk-pci,id=nvme1,drive=drive_nvme1,bootindex=0,bus=root.2,addr=0x0,iothread=iothread1 \
...
-device virtio-scsi-pci,id=virtio_scsi_pci2,bus=root.4,addr=0x0 \
-blockdev node-name=file_cd1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/iso/RHEL-8.4.0-20201217.n.0-x86_64-dvd1.iso,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_cd1,driver=raw,read-only=on,cache.direct=on,cache.no-flush=off,file=file_cd1 \
-device scsi-cd,id=cd1,drive=drive_cd1,write-cache=on,bootindex=1 \
...

(4) Run the fio test in the VM

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/home/fio_nvme_test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][eta 00m:00s]
Jobs: 8 (f=8): [m(8)][100.0%][eta 00m:00s]
Jobs: 8 (f=8): [m(8)][100.0%][eta 00m:00s]    <---- This fio test cannot be completed.

Error dmesg in the guest:
INFO: task X blocked for more than 120 seconds.
Not tainted 4.18.0-262.el8.x86_64 #1
According to the bug description and comment 1, this bug belongs to the NVMe userspace driver area, so I am assigning the QA Contact to Xueqiang first. Please feel free to ping me if there is anything I can help with.
With the same steps as in Comment 6, I retested on qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f; the fio test in the guest still cannot finish.

Versions:
Host:
kernel-4.18.0-270.el8.x86_64
qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f
Guest:
kernel-4.18.0-259.el8.x86_64

In the guest:

1. Check dmesg before the fio test; "Call Trace" is not found:

# dmesg | grep "Call Trace"
#

2. Check dmesg during the fio test; "Call Trace" is found:

# dmesg | grep "Call Trace"
[ 615.352754] Call Trace:
[ 738.222864] Call Trace:
[ 738.223445] Call Trace:

[ 861.093906] INFO: task in:imjournal:1604 blocked for more than 120 seconds.
[ 861.093909] Not tainted 4.18.0-259.el8.x86_64 #1
[ 861.093910] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 861.093947] in:imjournal D 0 1604 1 0x00000080
[ 861.093949] Call Trace:
[ 861.093959] __schedule+0x2a6/0x700
[ 861.093963] schedule+0x38/0xa0
[ 861.093965] io_schedule+0x12/0x40
[ 861.093969] wait_on_page_bit+0x137/0x230
[ 861.093973] ? xas_find+0x173/0x1b0
[ 861.093977] ? file_check_and_advance_wb_err+0xd0/0xd0
[ 861.093984] truncate_inode_pages_range+0x484/0x8b0
[ 861.094038] ? xfs_rename+0x5f7/0x9b0 [xfs]
[ 861.094049] ? __d_move+0x296/0x510
[ 861.094053] ? __inode_wait_for_writeback+0x7f/0xf0
[ 861.094058] ? init_wait_var_entry+0x50/0x50
[ 861.094062] evict+0x183/0x1a0
[ 861.094065] __dentry_kill+0xd5/0x170
[ 861.094068] dentry_kill+0x4d/0x190
[ 861.094071] dput.part.34+0xd9/0x120
[ 861.094075] do_renameat2+0x39d/0x530
[ 861.094080] __x64_sys_rename+0x1c/0x20
[ 861.094084] do_syscall_64+0x5b/0x1a0
[ 861.094088] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 861.094090] RIP: 0033:0x7fb7278da9bb
[ 861.094094] Code: Bad RIP value.
[ 861.094095] RSP: 002b:00007fb72516cae8 EFLAGS: 00000213 ORIG_RAX: 0000000000000052
[ 861.094098] RAX: ffffffffffffffda RBX: 00007fb72516caf0 RCX: 00007fb7278da9bb
[ 861.094099] RDX: 000055a7da65b250 RSI: 000055a7da65b940 RDI: 00007fb72516caf0
[ 861.094100] RBP: 000055a7da65b170 R08: 00007fb71806e9f0 R09: 0000000000000003
[ 861.094102] R10: 000000000000003f R11: 0000000000000213 R12: 0000000000000000
[ 861.094103] R13: 00007fb718020df0 R14: 0000000000000051 R15: 00007fb725ab5c38
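(For completeness: blocked-task dumps like the one above can also be triggered on demand inside the guest, a generic diagnostic sketch that is not specific to this driver and assumes sysrq is enabled:

# echo w > /proc/sysrq-trigger    # dump blocked (uninterruptible) tasks to the kernel log
# dmesg | tail -n 50
)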
Hi Yanghang,

(In reply to Yanghang Liu from comment #8)
> Hi Philippe,
>
> Could you check the following test scenario:
>
> 1. Prepare a host with an NVMe disk and bind the driver of the host NVMe
> disk to vfio-pci
> # virsh nodedev-detach pci_$domain_$bus_$slot_$function
>
> 2. Assign the NVMe disk to the VM and start the VM
>
> The XML for this NVMe disk is like:
>
> <hostdev mode='subsystem' type='pci' managed='yes'>
>   <driver name='vfio'/>
>   <source>
>     <address domain='$domain' bus='$bus' slot='$slot' function='$function'/>
>   </source>
>   <alias name='hostdev0'/>
> </hostdev>
>
> The qemu cmd line for this NVMe disk is like:
>
> -device vfio-pci,host=$domain:$bus:$slot.$function,id=nvme_disk,addr=0x0
>
> 3. Do some performance tests on the disk in the VM
>
> According to your description in comment 0, it seems to me that we don't
> need to cover the test scenario above for this bug.

Indeed. Per comment #0 this is for when "VFIO Device Assignment cannot be used". So we do not want to test the "-device vfio-pci,host=..." command in this BZ.

> And the qemu command line and domain XML used to verify this bug should be
> the same as the ones I posted in comment 1.
>
> Is my understanding correct?

Correct, I am using the format from comment #1. Manually I use:

-drive file=nvme://0000:04:00.0/1,if=none,id=drive0 -device virtio-blk-pci,drive=drive0

Or with libvirt:

<disk type='nvme' device='disk'>
  <driver name='qemu' type='raw'/>
  <source type='pci' managed='no' namespace='1'>
    <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
  </source>
  <target dev='vdb' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
</disk>

Expanded to:

-blockdev {"driver":"nvme","device":"0000:04:00.0","namespace":1,"node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"} \
-blockdev {"node-name":"libvirt-1-format","read-only":false,"driver":"raw","file":"libvirt-1-storage"} \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0xa,drive=libvirt-1-format,id=virtio-disk1

I created an 80G ext2 partition on the NVMe drive:

# cat /proc/partitions
major minor #blocks name
 252   0  16777216 vda
 252   1   1048576 vda1
 252   2  15727616 vda2
 252  16 366292584 vdb
 252  17  83886080 vdb1   <---
 253   0  14045184 dm-0
 253   1   1679360 dm-1

Then mounted it on /mnt and ran your test:

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/mnt/fio_nvme_test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
Jobs: 8 (f=8): [m(8)][100.0%][r=253MiB/s,w=63.0MiB/s][r=64.7k,w=16.1k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=1472: Tue Jan 19 19:25:39 2021
  read: IOPS=57.3k, BW=224MiB/s (235MB/s)(65.6GiB/300001msec)
   slat (usec): min=2, max=30277, avg=95.85, stdev=541.92
   clat (usec): min=47, max=64752, avg=1685.80, stdev=3262.41
   lat (usec): min=51, max=64920, avg=1782.13, stdev=3402.31
   clat percentiles (usec):
    | 1.00th=[ 93], 5.00th=[ 103], 10.00th=[ 112], 20.00th=[ 180],
    | 30.00th=[ 204], 40.00th=[ 260], 50.00th=[ 273], 60.00th=[ 293],
    | 70.00th=[ 420], 80.00th=[ 2180], 90.00th=[ 5932], 95.00th=[ 8717],
    | 99.00th=[15270], 99.50th=[17171], 99.90th=[23462], 99.95th=[25297],
    | 99.99th=[31327]
   bw ( KiB/s): min=88776, max=501955, per=100.00%, avg=229412.70, stdev=8163.76, samples=4776
   iops : min=22194, max=125487, avg=57352.19, stdev=2040.94, samples=4776
  write: IOPS=14.3k, BW=55.9MiB/s (58.7MB/s)(16.4GiB/300001msec); 0 zone resets
   slat (usec): min=2, max=30249, avg=155.89, stdev=720.80
   clat (usec): min=48, max=61061, avg=1644.51, stdev=3208.78
   lat (usec): min=53, max=61092, avg=1800.93, stdev=3441.64
   clat percentiles (usec):
    | 1.00th=[ 91], 5.00th=[ 104], 10.00th=[ 115], 20.00th=[ 180],
    | 30.00th=[ 204], 40.00th=[ 260], 50.00th=[ 273], 60.00th=[ 289],
    | 70.00th=[ 404], 80.00th=[ 1958], 90.00th=[ 5735], 95.00th=[ 8586],
    | 99.00th=[15008], 99.50th=[17171], 99.90th=[22938], 99.95th=[24773],
    | 99.99th=[30802]
   bw ( KiB/s): min=22176, max=126822, per=100.00%, avg=57349.80, stdev=2056.94, samples=4776
   iops : min= 5544, max=31704, avg=14336.51, stdev=514.23, samples=4776
  lat (usec) : 50=0.01%, 100=3.38%, 250=32.55%, 500=34.89%, 750=1.97%
  lat (usec) : 1000=1.99%
  lat (msec) : 2=4.91%, 4=4.28%, 10=12.29%, 20=3.49%, 50=0.24%
  lat (msec) : 100=0.01%
  cpu : usr=2.36%, sys=12.00%, ctx=3372753, majf=0, minf=136
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
   submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
   complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
   issued rwts: total=17188322,4296826,0,0 short=0,0,0,0 dropped=0,0,0,0
   latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  READ: bw=224MiB/s (235MB/s), 224MiB/s-224MiB/s (235MB/s-235MB/s), io=65.6GiB (70.4GB), run=300001-300001msec
  WRITE: bw=55.9MiB/s (58.7MB/s), 55.9MiB/s-55.9MiB/s (58.7MB/s-58.7MB/s), io=16.4GiB (17.6GB), run=300001-300001msec

Disk stats (read/write):
  vdb: ios=17173202/4293091, merge=20/0, ticks=910678/237756, in_queue=1148434, util=100.00%

Checking guest stats after the test:

{ "execute": "query-blockstats" }
{ "return": [
  { "device": "virtio-disk1",
    "parent": {
      "node-name": "#block014",
      "driver-specific": {
        "aligned-accesses": 17563525,
        "driver": "nvme",
        "completion-errors": 0,
        "unaligned-accesses": 4
      }
    },
    "stats": { "unmap_operations": 0, "unmap_merged": 0, "flush_total_time_ns": 105668, "wr_highest_offset": 114294784, "wr_total_time_ns": 324639788037, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 43, "wr_bytes": 14381838336, "timed_stats": [ ], "failed_unmap_operations": 0, "failed_flush_operations": 0, "account_invalid": true, "rd_total_time_ns": 1321636726291, "invalid_unmap_operations": 0, "flush_operations": 1, "wr_operations": 3511189, "unmap_bytes": 0, "rd_merged": 1346, "rd_bytes": 57570702848, "unmap_total_time_ns": 0, "invalid_flush_operations": 0, "account_failed": true, "idle_time_ns": 20045456348, "rd_operations": 14053718, "invalid_wr_operations": 0, "invalid_rd_operations": 0 },
    "node-name": "#block105",
    "qdev": "/machine/peripheral-anon/device[4]/virtio-backend"
  },

Running the same test on the host:

/dev/nvme0n1p1 on /mnt type ext2 (rw,relatime,seclabel)

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/mnt/fio_nvme_test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.21
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][r=1663MiB/s,w=417MiB/s][r=426k,w=107k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=46916: Tue Jan 19 16:14:34 2021
  read: IOPS=426k, BW=1663MiB/s (1744MB/s)(487GiB/300002msec)
   slat (nsec): min=1947, max=456132, avg=4679.92, stdev=1690.68
   clat (usec): min=59, max=2144, avg=234.07, stdev=13.28
   lat (usec): min=62, max=2149, avg=238.97, stdev=12.97
   clat percentiles (usec):
    | 1.00th=[ 212], 5.00th=[ 219], 10.00th=[ 221], 20.00th=[ 225],
    | 30.00th=[ 227], 40.00th=[ 231], 50.00th=[ 233], 60.00th=[ 235],
    | 70.00th=[ 237], 80.00th=[ 241], 90.00th=[ 253], 95.00th=[ 265],
    | 99.00th=[ 277], 99.50th=[ 285], 99.90th=[ 302], 99.95th=[ 306],
    | 99.99th=[ 326]
   bw ( MiB/s): min= 1638, max= 1680, per=100.00%, avg=1665.44, stdev= 0.70, samples=4792
   iops : min=419474, max=430258, avg=426352.63, stdev=178.78, samples=4792
  write: IOPS=106k, BW=416MiB/s (436MB/s)(122GiB/300002msec); 0 zone resets
   slat (usec): min=2, max=367, avg= 5.34, stdev= 1.93
   clat (usec): min=66, max=2351, avg=236.07, stdev=13.43
   lat (usec): min=73, max=2356, avg=241.65, stdev=13.06
   clat percentiles (usec):
    | 1.00th=[ 215], 5.00th=[ 221], 10.00th=[ 223], 20.00th=[ 227],
    | 30.00th=[ 229], 40.00th=[ 231], 50.00th=[ 235], 60.00th=[ 237],
    | 70.00th=[ 239], 80.00th=[ 243], 90.00th=[ 255], 95.00th=[ 265],
    | 99.00th=[ 277], 99.50th=[ 289], 99.90th=[ 306], 99.95th=[ 310],
    | 99.99th=[ 326]
   bw ( KiB/s): min=411256, max=442344, per=100.00%, avg=426231.17, stdev=589.19, samples=4792
   iops : min=102814, max=110586, avg=106557.77, stdev=147.30, samples=4792
  lat (usec) : 100=0.01%, 250=89.13%, 500=10.86%, 750=0.01%, 1000=0.01%
  lat (msec) : 2=0.01%, 4=0.01%
  cpu : usr=16.16%, sys=34.43%, ctx=128107520, majf=0, minf=510
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
   submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
   complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
   issued rwts: total=127731635,31923845,0,0 short=0,0,0,0 dropped=0,0,0,0
   latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  READ: bw=1663MiB/s (1744MB/s), 1663MiB/s-1663MiB/s (1744MB/s-1744MB/s), io=487GiB (523GB), run=300002-300002msec
  WRITE: bw=416MiB/s (436MB/s), 416MiB/s-416MiB/s (436MB/s-436MB/s), io=122GiB (131GB), run=300002-300002msec

Disk stats (read/write):
  nvme0n1: ios=127713173/31919281, merge=0/3, ticks=29747708/7397612, in_queue=37145320, util=100.00%

There is a big performance difference, but no kernel crash.

Note that I am using a different host kernel; tomorrow I plan to reinstall with the same kernel used in your test: kernel-4.18.0-270.el8.x86_64
I reinstalled my host with RHEL8.4-AV and did the same tests. Host: kernel-4.18.0-275.el8.x86_64 qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f Guest: kernel-4.18.0-275.el8.x86_64 # virsh dumpxml rhel84c ... <devices> <emulator>/usr/libexec/qemu-kvm</emulator> <disk type='file' device='disk'> <driver name='qemu' type='qcow2'/> <source file='/var/lib/libvirt/images/rhel84c.qcow2' index='2'/> <backingStore/> <target dev='vda' bus='virtio'/> <alias name='virtio-disk0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/> </disk> <disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> <source type='pci' managed='no' namespace='1' index='1'> <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/> </source> <target dev='vdb' bus='virtio'/> <alias name='virtio-disk1'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/> </disk> [root@guest ~]# cat /proc/partitions major minor #blocks name 252 0 16777216 vda 252 1 1048576 vda1 252 2 15727616 vda2 252 16 366292584 vdb 253 0 14045184 dm-0 253 1 1679360 dm-1 [root@guest ~]# mke2fs -F /dev/vdb mke2fs 1.45.6 (20-Mar-2020) Found a dos partition table in /dev/vdb Creating filesystem with 91573146 4k blocks and 22896640 inodes Filesystem UUID: 46efb6ba-47ed-4a30-ac46-7e6a6a71aa0a Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968 Allocating group tables: done Writing inode tables: done Writing superblocks and filesystem accounting information: done [root@guest ~]# mount /dev/vdb /mnt [root@guest ~]# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/mnt/fio_nvme_test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1 job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16 ... 
fio-3.19 Starting 8 processes job1: Laying out IO file (1 file / 100MiB) Jobs: 8 (f=8): [m(8)][100.0%][r=616MiB/s,w=155MiB/s][r=158k,w=39.6k IOPS][eta 00m:00s] job1: (groupid=0, jobs=8): err= 0: pid=1561: Wed Jan 20 13:56:49 2021 read: IOPS=152k, BW=595MiB/s (623MB/s)(174GiB/300001msec) slat (usec): min=2, max=34921, avg=31.73, stdev=219.39 clat (usec): min=28, max=52468, avg=629.41, stdev=1146.41 lat (usec): min=35, max=52545, avg=661.42, stdev=1186.37 clat percentiles (usec): | 1.00th=[ 77], 5.00th=[ 91], 10.00th=[ 101], 20.00th=[ 125], | 30.00th=[ 281], 40.00th=[ 383], 50.00th=[ 465], 60.00th=[ 545], | 70.00th=[ 635], 80.00th=[ 750], 90.00th=[ 930], 95.00th=[ 1156], | 99.00th=[ 6128], 99.50th=[ 8356], 99.90th=[14484], 99.95th=[16450], | 99.99th=[23462] bw ( KiB/s): min=282317, max=899654, per=100.00%, avg=609671.56, stdev=9967.15, samples=4784 iops : min=70579, max=224912, avg=152417.61, stdev=2491.79, samples=4784 write: IOPS=38.1k, BW=149MiB/s (156MB/s)(43.5GiB/300001msec); 0 zone resets slat (usec): min=2, max=34865, avg=57.19, stdev=310.85 clat (usec): min=9, max=52387, avg=656.92, stdev=1196.70 lat (usec): min=41, max=52459, avg=714.46, stdev=1268.16 clat percentiles (usec): | 1.00th=[ 77], 5.00th=[ 92], 10.00th=[ 104], 20.00th=[ 128], | 30.00th=[ 285], 40.00th=[ 396], 50.00th=[ 486], 60.00th=[ 570], | 70.00th=[ 660], 80.00th=[ 775], 90.00th=[ 963], 95.00th=[ 1205], | 99.00th=[ 6521], 99.50th=[ 8717], 99.90th=[15008], 99.95th=[16909], | 99.99th=[23987] bw ( KiB/s): min=70531, max=226711, per=100.00%, avg=152402.62, stdev=2508.06, samples=4784 iops : min=17632, max=56677, avg=38100.38, stdev=627.02, samples=4784 lat (usec) : 10=0.01%, 50=0.01%, 100=9.15%, 250=18.54%, 500=25.97% lat (usec) : 750=26.13%, 1000=12.29% lat (msec) : 2=4.93%, 4=1.07%, 10=1.58%, 20=0.31%, 50=0.02% lat (msec) : 100=0.01% cpu : usr=6.30%, sys=15.73%, ctx=19379351, majf=0, minf=143 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=45665464,11415186,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=16 Run status group 0 (all jobs): READ: bw=595MiB/s (623MB/s), 595MiB/s-595MiB/s (623MB/s-623MB/s), io=174GiB (187GB), run=300001-300001msec WRITE: bw=149MiB/s (156MB/s), 149MiB/s-149MiB/s (156MB/s-156MB/s), io=43.5GiB (46.8GB), run=300001-300001msec Disk stats (read/write): vdb: ios=45635088/11407656, merge=0/0, ticks=7636538/1932687, in_queue=9569225, util=100.00% [root@guest ~]# Test finished in 5min, still no kernel crash.
Performance improved using the following XML configuration: <iothreads>1</iothreads> <devices> ... <disk type='nvme' device='disk'> <driver name='qemu' type='raw' iothread='1'/> <source type='pci' managed='no' namespace='1' index='1'> <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/> </source> <target dev='vdb' bus='virtio'/> <alias name='virtio-disk1'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/> </disk> [root@guest ~]# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/mnt/fio_nvme_test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1 job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16 ... fio-3.19 Starting 8 processes Jobs: 8 (f=8): [m(8)][100.0%][r=418MiB/s,w=105MiB/s][r=107k,w=26.8k IOPS][eta 00m:00s] job1: (groupid=0, jobs=8): err= 0: pid=1470: Thu Jan 21 06:50:11 2021 read: IOPS=108k, BW=423MiB/s (444MB/s)(124GiB/300001msec) slat (usec): min=2, max=39593, avg=49.25, stdev=523.16 clat (usec): min=7, max=78085, avg=893.77, stdev=3211.67 lat (usec): min=32, max=78133, avg=943.23, stdev=3349.10 clat percentiles (usec): | 1.00th=[ 96], 5.00th=[ 100], 10.00th=[ 102], 20.00th=[ 108], | 30.00th=[ 109], 40.00th=[ 110], 50.00th=[ 111], 60.00th=[ 113], | 70.00th=[ 118], 80.00th=[ 130], 90.00th=[ 486], 95.00th=[ 6980], | 99.00th=[16450], 99.50th=[22152], 99.90th=[31327], 99.95th=[33817], | 99.99th=[42206] bw ( KiB/s): min=73594, max=973981, per=100.00%, avg=434044.37, stdev=20074.88, samples=4776 iops : min=18398, max=243494, avg=108509.89, stdev=5018.73, samples=4776 write: IOPS=27.1k, BW=106MiB/s (111MB/s)(31.0GiB/300001msec); 0 zone resets slat (usec): min=2, max=39670, avg=88.26, stdev=732.69 clat (usec): min=22, max=78077, avg=859.92, stdev=3119.56 lat (usec): min=26, max=78701, avg=948.42, stdev=3360.97 clat percentiles (usec): | 1.00th=[ 95], 5.00th=[ 100], 10.00th=[ 102], 20.00th=[ 108], | 30.00th=[ 109], 40.00th=[ 110], 50.00th=[ 111], 60.00th=[ 113], | 70.00th=[ 118], 80.00th=[ 130], 90.00th=[ 453], 95.00th=[ 6456], | 99.00th=[16319], 99.50th=[21103], 99.90th=[30802], 99.95th=[32637], | 99.99th=[41157] bw ( KiB/s): min=19034, max=245008, per=100.00%, avg=108541.22, stdev=5038.50, samples=4776 iops : min= 4758, max=61251, avg=27134.15, stdev=1259.63, samples=4776 lat (usec) : 10=0.01%, 50=0.01%, 100=4.44%, 250=83.88%, 500=1.85% lat (usec) : 750=1.41%, 1000=0.48% lat (msec) : 2=0.45%, 4=1.02%, 10=3.66%, 20=2.18%, 50=0.64% lat (msec) : 100=0.01% cpu : usr=3.43%, sys=7.95%, ctx=2331181, majf=0, minf=152 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=32511521,8130273,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=16 Run status group 0 (all jobs): READ: bw=423MiB/s (444MB/s), 423MiB/s-423MiB/s (444MB/s-444MB/s), io=124GiB (133GB), run=300001-300001msec WRITE: bw=106MiB/s (111MB/s), 106MiB/s-106MiB/s (111MB/s-111MB/s), io=31.0GiB (33.3GB), run=300001-300001msec Disk stats (read/write): vdb: ios=32496594/8126521, merge=0/0, ticks=625043/156398, in_queue=781441, util=100.00%
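A possible further tuning step (untested here; standard libvirt <cputune> syntax, with an illustrative cpuset value) is to pin the IOThread to a dedicated host CPU so it is not scheduled away under load:

<cputune>
  <iothreadpin iothread='1' cpuset='2'/>
</cputune>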
Using Xueqiang Wei's hardware I get the same hang reported in comment #16 in the guest:

> 2. Check dmesg during the fio test; "Call Trace" is found:
> # dmesg | grep "Call Trace"
> [ 615.352754] Call Trace:
> [ 738.222864] Call Trace:
> [ 738.223445] Call Trace:
>
> [ 861.093906] INFO: task in:imjournal:1604 blocked for more than 120
> seconds.
> [ 861.093909] Not tainted 4.18.0-259.el8.x86_64 #1
> [ 861.093910] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 861.093947] in:imjournal D 0 1604 1 0x00000080
> [ 861.093949] Call Trace:
> [ 861.093959] __schedule+0x2a6/0x700
> [ 861.093963] schedule+0x38/0xa0
> [ 861.093965] io_schedule+0x12/0x40
> [ 861.093969] wait_on_page_bit+0x137/0x230
> [ 861.093973] ? xas_find+0x173/0x1b0
> [ 861.093977] ? file_check_and_advance_wb_err+0xd0/0xd0
> [ 861.093984] truncate_inode_pages_range+0x484/0x8b0
> [ 861.094038] ? xfs_rename+0x5f7/0x9b0 [xfs]
> [ 861.094049] ? __d_move+0x296/0x510
> [ 861.094053] ? __inode_wait_for_writeback+0x7f/0xf0
> [ 861.094058] ? init_wait_var_entry+0x50/0x50
> [ 861.094062] evict+0x183/0x1a0
> [ 861.094065] __dentry_kill+0xd5/0x170
> [ 861.094068] dentry_kill+0x4d/0x190
> [ 861.094071] dput.part.34+0xd9/0x120
> [ 861.094075] do_renameat2+0x39d/0x530
> [ 861.094080] __x64_sys_rename+0x1c/0x20
> [ 861.094084] do_syscall_64+0x5b/0x1a0
> [ 861.094088] entry_SYSCALL_64_after_hwframe+0x65/0xca
> [ 861.094090] RIP: 0033:0x7fb7278da9bb
> [ 861.094094] Code: Bad RIP value.
> [ 861.094095] RSP: 002b:00007fb72516cae8 EFLAGS: 00000213 ORIG_RAX:
> 0000000000000052
> [ 861.094098] RAX: ffffffffffffffda RBX: 00007fb72516caf0 RCX:
> 00007fb7278da9bb
> [ 861.094099] RDX: 000055a7da65b250 RSI: 000055a7da65b940 RDI:
> 00007fb72516caf0
> [ 861.094100] RBP: 000055a7da65b170 R08: 00007fb71806e9f0 R09:
> 0000000000000003
> [ 861.094102] R10: 000000000000003f R11: 0000000000000213 R12:
> 0000000000000000
> [ 861.094103] R13: 00007fb718020df0 R14: 0000000000000051 R15:
> 00007fb725ab5c38
Hi Philippe,

1. Could you please share the details of your NVMe device? Maybe QE needs to purchase a new NVMe device.

2. I tested with another NVMe disk and did not hit the issue reported in Comment 16.

Versions:
Host:
kernel-4.18.0-270.el8.x86_64
qemu-kvm-5.2.0-1.module+el8.4.0+9091+650b220a
Guest:
kernel-4.18.0-259.el8.x86_64

In the host:

# lspci | grep -i "Non-Volatile"
81:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)

# lspci -n -s 0000:81:00.0
81:00.0 0108: 8086:0953 (rev 01)

# echo 0000:81:00.0 > /sys/bus/pci/devices/0000\:81\:00.0/driver/unbind
# echo 8086 0953 > /sys/bus/pci/drivers/vfio-pci/new_id

# qemu-img create -f raw nvme://0000:81:00.0/1 30G
Formatting 'nvme://0000:81:00.0/1', fmt=raw size=32212254720

# qemu-img info nvme://0000:81:00.0/1
image: nvme://0000:81:00.0/1
file format: raw
virtual size: 373 GiB (400088457216 bytes)
disk size: unavailable

In the host:

# mkfs.xfs /dev/nvme0n1
# mount /dev/nvme0n1 /mnt/nvme_fio_test/
# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/mnt/nvme_fio_test/test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][r=811MiB/s,w=204MiB/s][r=208k,w=52.3k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=8638: Fri Jan 22 02:31:42 2021
  read: IOPS=219k, BW=857MiB/s (899MB/s)(251GiB/300002msec)
   slat (usec): min=3, max=276, avg= 6.57, stdev= 3.24
   clat (usec): min=7, max=17087, avg=545.15, stdev=662.55
   lat (usec): min=29, max=17093, avg=551.95, stdev=662.55
   clat percentiles (usec):
    | 1.00th=[ 79], 5.00th=[ 93], 10.00th=[ 104], 20.00th=[ 131],
    | 30.00th=[ 172], 40.00th=[ 229], 50.00th=[ 297], 60.00th=[ 396],
    | 70.00th=[ 537], 80.00th=[ 717], 90.00th=[ 1418], 95.00th=[ 2245],
    | 99.00th=[ 2868], 99.50th=[ 3032], 99.90th=[ 3359], 99.95th=[ 3687],
    | 99.99th=[12256]
   bw ( KiB/s): min=785320, max=978192, per=100.00%, avg=878976.52, stdev=5890.75, samples=4784
   iops : min=196330, max=244546, avg=219744.10, stdev=1472.68, samples=4784
  write: IOPS=54.9k, BW=214MiB/s (225MB/s)(62.8GiB/300002msec); 0 zone resets
   slat (usec): min=4, max=396, avg= 7.23, stdev= 3.94
   clat (usec): min=4, max=14517, avg=110.82, stdev=180.05
   lat (usec): min=17, max=14523, avg=118.29, stdev=180.10
   clat percentiles (usec):
    | 1.00th=[ 15], 5.00th=[ 16], 10.00th=[ 18], 20.00th=[ 24],
    | 30.00th=[ 32], 40.00th=[ 43], 50.00th=[ 57], 60.00th=[ 75],
    | 70.00th=[ 104], 80.00th=[ 159], 90.00th=[ 260], 95.00th=[ 359],
    | 99.00th=[ 816], 99.50th=[ 1090], 99.90th=[ 1713], 99.95th=[ 1991],
    | 99.99th=[ 5538]
   bw ( KiB/s): min=192472, max=250000, per=100.00%, avg=219831.45, stdev=1548.75, samples=4784
   iops : min=48118, max=62500, avg=54957.86, stdev=387.19, samples=4784
  lat (usec) : 10=0.01%, 20=2.88%, 50=6.37%, 100=11.10%, 250=31.95%
  lat (usec) : 500=21.31%, 750=11.16%, 1000=4.47%
  lat (msec) : 2=5.53%, 4=5.19%, 10=0.02%, 20=0.01%
  cpu : usr=11.13%, sys=22.81%, ctx=48265710, majf=0, minf=1984
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
   submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
   complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
   issued rwts: total=65833373,16465027,0,0 short=0,0,0,0 dropped=0,0,0,0
   latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  READ: bw=857MiB/s (899MB/s), 857MiB/s-857MiB/s (899MB/s-899MB/s), io=251GiB (270GB), run=300002-300002msec
  WRITE: bw=214MiB/s (225MB/s), 214MiB/s-214MiB/s (225MB/s-225MB/s), io=62.8GiB (67.4GB), run=300002-300002msec

Disk stats (read/write):
  nvme0n1: ios=65802590/16457320, merge=0/3, ticks=35459756/1622270, in_queue=37082027, util=100.00%

In the guest:

# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sr0 11:0 1 9.3G 0 rom /run/media/root/RHEL-8-4-0-BaseOS-x86_64
vda 252:0 0 30G 0 disk
vdb 252:16 0 20G 0 disk
├─vdb1 252:17 0 1G 0 part /boot
└─vdb2 252:18 0 19G 0 part
 ├─rhel_vm--197--177-root 253:0 0 17G 0 lvm /
 └─rhel_vm--197--177-swap 253:1 0 2G 0 lvm [SWAP]

# mkdir -p /home/fio_nvme/
# mkfs.xfs /dev/vda
# mount /dev/vda /home/fio_nvme/
# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/home/fio_nvme/test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][r=522MiB/s,w=132MiB/s][r=134k,w=33.7k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=3181: Fri Jan 22 15:54:10 2021
  read: IOPS=129k, BW=504MiB/s (528MB/s)(148GiB/300003msec)
   slat (usec): min=5, max=5240, avg=11.19, stdev= 8.44
   clat (usec): min=48, max=30506, avg=840.15, stdev=607.15
   lat (usec): min=62, max=30512, avg=851.70, stdev=606.89
   clat percentiles (usec):
    | 1.00th=[ 188], 5.00th=[ 253], 10.00th=[ 314], 20.00th=[ 486],
    | 30.00th=[ 611], 40.00th=[ 676], 50.00th=[ 742], 60.00th=[ 816],
    | 70.00th=[ 906], 80.00th=[ 1029], 90.00th=[ 1303], 95.00th=[ 1942],
    | 99.00th=[ 2966], 99.50th=[ 3163], 99.90th=[ 3687], 99.95th=[ 4948],
    | 99.99th=[23987]
   bw ( KiB/s): min=313457, max=617248, per=100.00%, avg=516616.82, stdev=7492.02, samples=4784
   iops : min=78363, max=154312, avg=129153.98, stdev=1873.02, samples=4784
  write: IOPS=32.2k, BW=126MiB/s (132MB/s)(36.9GiB/300003msec); 0 zone resets
   slat (usec): min=5, max=5309, avg=12.39, stdev= 8.81
   clat (usec): min=24, max=29131, avg=541.07, stdev=385.53
   lat (usec): min=37, max=29139, avg=553.83, stdev=385.33
   clat percentiles (usec):
    | 1.00th=[ 151], 5.00th=[ 202], 10.00th=[ 237], 20.00th=[ 285],
    | 30.00th=[ 330], 40.00th=[ 379], 50.00th=[ 441], 60.00th=[ 529],
    | 70.00th=[ 644], 80.00th=[ 799], 90.00th=[ 988], 95.00th=[ 1139],
    | 99.00th=[ 1401], 99.50th=[ 1516], 99.90th=[ 1762], 99.95th=[ 3654],
    | 99.99th=[ 9634]
   bw ( KiB/s): min=76696, max=157896, per=100.00%, avg=129156.98, stdev=1903.94, samples=4784
   iops : min=19174, max=39474, avg=32289.06, stdev=476.01, samples=4784
  lat (usec) : 50=0.01%, 100=0.02%, 250=6.31%, 500=21.69%, 750=28.23%
  lat (usec) : 1000=24.14%
  lat (msec) : 2=15.80%, 4=3.74%, 10=0.05%, 20=0.01%, 50=0.01%
  cpu : usr=9.40%, sys=21.46%, ctx=10682480, majf=0, minf=147
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
   submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
   complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
   issued rwts: total=38684995,9671505,0,0 short=0,0,0,0 dropped=0,0,0,0
   latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  READ: bw=504MiB/s (528MB/s), 504MiB/s-504MiB/s (528MB/s-528MB/s), io=148GiB (158GB), run=300003-300003msec
  WRITE: bw=126MiB/s (132MB/s), 126MiB/s-126MiB/s (132MB/s-132MB/s), io=36.9GiB (39.6GB), run=300003-300003msec

Disk stats (read/write):
  vda: ios=38668202/9667305, merge=0/5, ticks=30154026/4415942, in_queue=34569968, util=100.00%

3. According to Comment 20 and item 2 in this comment, not every NVMe device works well. How should we deal with this issue? What do you think about it? Many thanks.

I get the hang reported in comment 16 in the guest with the following NVMe device:

# lspci | grep -i "Non-Volatile"
bc:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172Xa/172Xb (rev 01)

I do not get the hang reported in comment 16 in the guest with the following NVMe device:

# lspci | grep -i "Non-Volatile"
81:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
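(When two controllers behave this differently, capturing their identify data may help the comparison. A sketch using nvme-cli; it assumes the device is temporarily bound back to the host nvme driver so that /dev/nvme0 exists:

# nvme id-ctrl /dev/nvme0 | grep -E '^(vid|mn|fr|mdts)'

vid is the PCI vendor, mn/fr the model and firmware revision, and mdts the maximum data transfer size, which bounds the largest single transfer the controller accepts.)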
(In reply to Xueqiang Wei from comment #23)
> Hi Philippe,
>
> 1. Could you please share the details of your NVMe device?

[root@virtlab505] # nvme intel id-ctrl /dev/nvme0
NVME Identify Controller:
vid : 0x8086
ssvid : 0x8086
sn : PHKS917200LB375AGN
mn : INTEL SSDPED1K375GA
fr : E2010435
...

> Maybe QE needs to purchase a new NVMe device.

No, it has to work :)
(In reply to Xueqiang Wei from comment #2)
> 6. boot the guest after the installation. do fio test on /home/test
>
> # fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1
> --filename=/home/test --ioengine=libaio --size=100M --rwmixread=80
> --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1
> --numjobs=8 --name=job1
> job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> 4096B-4096B, ioengine=libaio, iodepth=16
> ...
> fio-3.19
> Starting 8 processes
> job1: Laying out IO file (1 file / 100MiB)
> Jobs: 8 (f=8): [m(8)][3.5%][eta 08d:21h:09m:48s]
>
> After step 2, fio test finished in 5 minutes.
> After step 6, fio test didn't finish after 12 hours.

When the VM "hangs" (12h without finishing) there is an error:

qemu-kvm: VFIO_MAP_DMA failed: No space left on device

Looking at the Linux kernel source, we end up in vfio_dma_do_map() in drivers/vfio/vfio_iommu_type1.c:

    if (!iommu->dma_avail) {
        ret = -ENOSPC;
        goto out_unlock;
    }

Alex Williamson said this limit can be changed for testing purposes:

static unsigned int dma_entry_limit __read_mostly = U16_MAX;
MODULE_PARM_DESC(dma_entry_limit,
                 "Maximum number of user DMA mappings per container (65535).");

So I tried:

# modprobe vfio_iommu_type1 dma_entry_limit=$((0xffffff))

And your test passed (with horrible performance):

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/mnt/fio_nvme_test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][r=115MiB/s,w=29.0MiB/s][r=29.4k,w=7435 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=1398: Wed Jan 27 02:56:22 2021
  read: IOPS=29.7k, BW=116MiB/s (122MB/s)(33.0GiB/300002msec)
   slat (usec): min=3, max=32761, avg=52.90, stdev=77.13
   clat (usec): min=129, max=93768, avg=3334.61, stdev=943.20
   lat (usec): min=193, max=93826, avg=3388.05, stdev=947.23
   clat percentiles (usec):
    | 1.00th=[ 2008], 5.00th=[ 2278], 10.00th=[ 2442], 20.00th=[ 2704],
    | 30.00th=[ 2900], 40.00th=[ 3097], 50.00th=[ 3261], 60.00th=[ 3490],
    | 70.00th=[ 3687], 80.00th=[ 3949], 90.00th=[ 4293], 95.00th=[ 4490],
    | 99.00th=[ 4883], 99.50th=[ 5211], 99.90th=[ 6783], 99.95th=[ 9372],
    | 99.99th=[39060]
   bw ( KiB/s): min=101082, max=127816, per=100.00%, avg=118918.66, stdev=335.38, samples=4784
   iops : min=25267, max=31954, avg=29728.99, stdev=83.87, samples=4784
  write: IOPS=7433, BW=29.0MiB/s (30.4MB/s)(8712MiB/300002msec); 0 zone resets
   slat (usec): min=3, max=25613, avg=81.57, stdev=99.55
   clat (usec): min=132, max=93707, avg=3595.95, stdev=857.93
   lat (usec): min=149, max=93743, avg=3678.08, stdev=865.10
   clat percentiles (usec):
    | 1.00th=[ 2278], 5.00th=[ 2638], 10.00th=[ 2835], 20.00th=[ 3097],
    | 30.00th=[ 3261], 40.00th=[ 3425], 50.00th=[ 3556], 60.00th=[ 3720],
    | 70.00th=[ 3916], 80.00th=[ 4113], 90.00th=[ 4359], 95.00th=[ 4555],
    | 99.00th=[ 4883], 99.50th=[ 5014], 99.90th=[ 6915], 99.95th=[ 9634],
    | 99.99th=[36963]
   bw ( KiB/s): min=25082, max=33768, per=100.00%, avg=29776.92, stdev=172.10, samples=4784
   iops : min= 6270, max= 8442, avg=7444.10, stdev=43.02, samples=4784
  lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec) : 2=0.75%, 4=79.55%, 10=19.63%, 20=0.02%, 50=0.02%
  lat (msec) : 100=0.01%
  cpu : usr=2.63%, sys=8.63%, ctx=5048257, majf=0, minf=131
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
   submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
   complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
   issued rwts: total=8906102,2230196,0,0 short=0,0,0,0 dropped=0,0,0,0
   latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  READ: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=33.0GiB (36.5GB), run=300002-300002msec
  WRITE: bw=29.0MiB/s (30.4MB/s), 29.0MiB/s-29.0MiB/s (30.4MB/s-30.4MB/s), io=8712MiB (9135MB), run=300002-300002msec

Disk stats (read/write):
  vda: ios=8902053/2229124, merge=31/29, ticks=24529673/6297896, in_queue=30827568, util=100.00%

I will escalate this problem to Stefan Hajnoczi.
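(For reference, a sketch of checking the current limit and making the override persistent; the parameter name comes from the kernel source quoted above, and the sysfs path assumes the module is loaded:

# cat /sys/module/vfio_iommu_type1/parameters/dma_entry_limit
65535
# echo 'options vfio_iommu_type1 dma_entry_limit=16777215' > /etc/modprobe.d/vfio-dma.conf

65535 is the documented default, per the MODULE_PARM_DESC text above.)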
(In reply to Philippe Mathieu-Daudé from comment #25)
> > After step 6, fio test didn't finish after 12 hours.
> 
> When the VM "hangs" (12 hours without finishing) there is an error:
> 
> qemu-kvm: VFIO_MAP_DMA failed: No space left on device

This issue was tracked by Bug 1848881 - qemu-kvm: VFIO_MAP_DMA failed: Invalid argument with nvme://

Related comments:
https://bugzilla.redhat.com/show_bug.cgi?id=1848881#c2
https://bugzilla.redhat.com/show_bug.cgi?id=1848881#c3
https://bugzilla.redhat.com/show_bug.cgi?id=1848881#c6
https://bugzilla.redhat.com/show_bug.cgi?id=1848881#c7
https://bugzilla.redhat.com/show_bug.cgi?id=1848881#c8 (tested with raw format and q35 machine type; the guest installs successfully)
(In reply to Philippe Mathieu-Daudé from comment #25)
> (In reply to Xueqiang Wei from comment #2)
> > 6. boot the guest after the installation. do fio test on /home/test
> 
> When the VM "hangs" (12 hours without finishing) there is an error:
> 
> qemu-kvm: VFIO_MAP_DMA failed: No space left on device
> 
> Looking at the Linux kernel source, we end up in vfio_dma_do_map() in
> drivers/vfio/vfio_iommu_type1.c:
> 
> 	if (!iommu->dma_avail) {
> 		ret = -ENOSPC;
> 		goto out_unlock;
> 	}

This issue has been assigned to a different BZ: bug 1934172
Done
Since QE's bug testing is triggered through the "Fixed In Version" field, Philippe, could you remove the "Fixed In Version:" value? Thanks.
(In reply to Xueqiang Wei from comment #35)
> Since QE's bug testing is triggered through the "Fixed In Version" field,
> Philippe, could you remove the "Fixed In Version:" value? Thanks.

Done.
Hi Philippe, Could you set the DTM? Thanks.
Are there any issues or code changes that still need to be tracked through this BZ? I'm a bit confused, since the only apparent issue was closed via bug 1934172.
This bug is blocked by bug 1848881.
That's fine.
This bug is ready to go since Bug 1848881 has now been fixed. There is no code change associated with this bug. It is a high-level bug that tracks the QEMU NVMe userspace driver.
Hi Stefan,

I tried to verify this bug as below. The I/O performance via the virtual machine improved a lot, but it is still slower than the host block device accessed directly (read: ~35% degradation; write: ~35% degradation). Could you please help to check whether the result is okay? Thanks.

Tested env:
qemu-kvm-6.0.0-27.module+el8.5.0+12121+c40c8708
kernel-modules-4.18.0-330.el8.x86_64

Steps:

Test the NVMe block performance via the virtual guest:

1. Install a guest on the NVMe disk

/usr/libexec/qemu-kvm \
-S \
-name 'avocado-vt-vm1' \
-sandbox on \
-machine q35 \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
-nodefaults \
-device VGA,bus=pcie.0,addr=0x2 \
-m 15360 \
-smp 16,maxcpus=16,cores=8,threads=1,dies=1,sockets=2 \
-cpu 'Haswell-noTSX',+kvm_pv_unhalt \
-device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
-device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-object iothread,id=iothread0 \
-object iothread,id=iothread1 \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device virtio-net-pci,mac=9a:1c:0c:0d:e3:4c,id=idjmZXQS,netdev=idEFQ4i1,bus=pcie-root-port-3,addr=0x0 \
-netdev tap,id=idEFQ4i1,vhost=on \
-vnc :0 \
-rtc base=utc,clock=host,driftfix=slew \
-boot menu=off,order=cdn,once=c,strict=off \
-enable-kvm \
-monitor stdio \
-chardev socket,server=on,path=/var/tmp/monitor-qmpmonitor1-20210721-024113-AsZ7KYro,id=qmp_id_qmpmonitor1,wait=off \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-device pcie-root-port,id=pcie-root-port-5,port=0x5,addr=0x1.0x5,bus=pcie.0,chassis=5 \
-device virtio-scsi-pci,id=virtio_scsi_pci1,bus=pcie-root-port-5,addr=0x0,iothread=iothread1 \
-blockdev node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,discard=unmap \
-device scsi-hd,id=nvme1,drive=drive_nvme1 \
-device pcie-root-port,id=pcie-root-port-6,port=0x6,addr=0x1.0x6,bus=pcie.0,chassis=6 \
-device virtio-scsi-pci,id=virtio_scsi_pci2,bus=pcie-root-port-6,addr=0x0 \
-blockdev node-name=file_cd1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/iso/linux/RHEL-8.4.0-20210503.1-x86_64-dvd1.iso,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_cd1,driver=raw,read-only=on,cache.direct=on,cache.no-flush=off,file=file_cd1 \
-device scsi-cd,id=cd1,drive=drive_cd1,write-cache=on \

2. Do the fio test in the guest

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/home/test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][r=651MiB/s,w=161MiB/s][r=167k,w=41.3k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=39152: Wed Aug 18 05:44:37 2021
  read: IOPS=175k, BW=685MiB/s (719MB/s)(201GiB/300003msec)
    slat (usec): min=2, max=1084, avg= 5.59, stdev= 4.30
    clat (usec): min=12, max=8797, avg=640.88, stdev=567.03
     lat (usec): min=32, max=8800, avg=646.58, stdev=567.00
    clat percentiles (usec):
     |  1.00th=[   99],  5.00th=[  149], 10.00th=[  190], 20.00th=[  251],
     | 30.00th=[  314], 40.00th=[  400], 50.00th=[  502], 60.00th=[  586],
     | 70.00th=[  676], 80.00th=[  791], 90.00th=[ 1303], 95.00th=[ 2089],
     | 99.00th=[ 2737], 99.50th=[ 2900], 99.90th=[ 3392], 99.95th=[ 4178],
     | 99.99th=[ 5604]
   bw (  KiB/s): min=524920, max=826223, per=100.00%, avg=702643.65, stdev=7011.59, samples=4784
   iops        : min=131230, max=206555, avg=175660.57, stdev=1752.91, samples=4784
  write: IOPS=43.9k, BW=171MiB/s (180MB/s)(50.2GiB/300003msec); 0 zone resets
    slat (usec): min=2, max=1209, avg= 6.17, stdev= 4.84
    clat (usec): min=3, max=11172, avg=322.63, stdev=229.52
     lat (usec): min=23, max=11177, avg=328.90, stdev=229.53
    clat percentiles (usec):
     |  1.00th=[   33],  5.00th=[   63], 10.00th=[   94], 20.00th=[  149],
     | 30.00th=[  194], 40.00th=[  233], 50.00th=[  269], 60.00th=[  310],
     | 70.00th=[  367], 80.00th=[  457], 90.00th=[  652], 95.00th=[  783],
     | 99.00th=[ 1012], 99.50th=[ 1090], 99.90th=[ 1631], 99.95th=[ 2114],
     | 99.99th=[ 3490]
   bw (  KiB/s): min=130631, max=209416, per=100.00%, avg=175726.23, stdev=1804.07, samples=4784
   iops        : min=32657, max=52354, avg=43931.25, stdev=451.02, samples=4784
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.69%, 100=2.37%
  lat (usec)   : 250=21.71%, 500=31.58%, 750=23.88%, 1000=9.46%
  lat (msec)   : 2=5.85%, 4=4.43%, 10=0.05%, 20=0.01%
  cpu          : usr=4.94%, sys=13.42%, ctx=21400099, majf=0, minf=155
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=52625300,13161245,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=685MiB/s (719MB/s), 685MiB/s-685MiB/s (719MB/s-719MB/s), io=201GiB (216GB), run=300003-300003msec
  WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=50.2GiB (53.9GB), run=300003-300003msec

Disk stats (read/write):
  dm-2: ios=52611588/13157829, merge=0/0, ticks=32267355/3792809, in_queue=36060164, util=100.00%, aggrios=52625300/13161320, aggrmerge=0/7, aggrticks=32357463/3825128, aggrin_queue=36182591, aggrutil=100.00%
  sda: ios=52625300/13161320, merge=0/7, ticks=32357463/3825128, in_queue=36182591, util=100.00%

Test the NVMe block performance via the host directly:
1. Mkfs the NVMe partition on the host

# lsblk
NAME                            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                               8:0    0   372G  0 disk
├─sda1                            8:1    0     1G  0 part /boot
└─sda2                            8:2    0   371G  0 part
  ├─rhel_dell--per740xd--01-root
  │                             253:0    0    70G  0 lvm  /
  ├─rhel_dell--per740xd--01-swap
  │                             253:1    0  31.4G  0 lvm  [SWAP]
  └─rhel_dell--per740xd--01-home
                                253:2    0 269.7G  0 lvm  /home
sdb                               8:16   0 558.4G  0 disk
nvme0n1                         259:0    0 745.2G  0 disk
├─nvme0n1p1                     259:1    0   400G  0 part /mnt
└─nvme0n1p2                     259:2    0 345.2G  0 part

[root@dell-per740xd-01 ~]# mkfs.xfs /dev/nvme0n1p1
meta-data=/dev/nvme0n1p1         isize=512    agcount=4, agsize=26214400 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=104857600, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=51200, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Discarding blocks...Done.

2. Mount the NVMe partition and do the fio test

# mount /dev/nvme0n1p1 /mnt/
# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/mnt/test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][r=1065MiB/s,w=265MiB/s][r=273k,w=67.9k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=3204: Wed Aug 18 06:21:30 2021
  read: IOPS=271k, BW=1059MiB/s (1111MB/s)(310GiB/300003msec)
    slat (nsec): min=1564, max=490348, avg=4939.65, stdev=3115.13
    clat (usec): min=3, max=10776, avg=440.68, stdev=675.77
     lat (usec): min=25, max=10781, avg=445.68, stdev=675.66
    clat percentiles (usec):
     |  1.00th=[   44],  5.00th=[   79], 10.00th=[   86], 20.00th=[   96],
     | 30.00th=[  108], 40.00th=[  124], 50.00th=[  149], 60.00th=[  188],
     | 70.00th=[  265], 80.00th=[  486], 90.00th=[ 1598], 95.00th=[ 2212],
     | 99.00th=[ 2704], 99.50th=[ 2999], 99.90th=[ 4359], 99.95th=[ 4621],
     | 99.99th=[ 5211]
   bw (  MiB/s): min=  982, max= 1172, per=100.00%, avg=1060.89, stdev= 3.85, samples=4784
   iops        : min=251512, max=300036, avg=271587.33, stdev=985.88, samples=4784
  write: IOPS=67.8k, BW=265MiB/s (278MB/s)(77.6GiB/300003msec); 0 zone resets
    slat (nsec): min=1644, max=376985, avg=5759.48, stdev=4071.35
    clat (usec): min=2, max=8316, avg=96.50, stdev=286.17
     lat (usec): min=16, max=8320, avg=102.33, stdev=286.27
    clat percentiles (usec):
     |  1.00th=[   18],  5.00th=[   20], 10.00th=[   22], 20.00th=[   26],
     | 30.00th=[   31], 40.00th=[   36], 50.00th=[   42], 60.00th=[   50],
     | 70.00th=[   61], 80.00th=[   79], 90.00th=[  125], 95.00th=[  219],
     | 99.00th=[ 1516], 99.50th=[ 2409], 99.90th=[ 3490], 99.95th=[ 3982],
     | 99.99th=[ 5342]
   bw (  KiB/s): min=245376, max=302080, per=100.00%, avg=271626.80, stdev=1127.93, samples=4784
   iops        : min=61344, max=75520, avg=67906.69, stdev=281.98, samples=4784
  lat (usec)   : 4=0.01%, 10=0.01%, 20=1.34%, 50=11.77%, 100=23.47%
  lat (usec)   : 250=37.36%, 500=9.77%, 750=3.51%, 1000=1.79%
  lat (msec)   : 2=5.19%, 4=5.64%, 10=0.16%, 20=0.01%
  cpu          : usr=8.73%, sys=19.55%, ctx=53203380, majf=0, minf=761
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=81348879,20339988,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=1059MiB/s (1111MB/s), 1059MiB/s-1059MiB/s (1111MB/s-1111MB/s), io=310GiB (333GB), run=300003-300003msec
  WRITE: bw=265MiB/s (278MB/s), 265MiB/s-265MiB/s (278MB/s-278MB/s), io=77.6GiB (83.3GB), run=300003-300003msec

Disk stats (read/write):
  nvme0n1: ios=81328363/20334880, merge=0/3, ticks=35267074/1549526, in_queue=36816600, util=100.00%

Results:
The read IOPS: host directly : virtual machine = 271k : 175k ((271 - 175) / 271 ≈ 35% degradation)
The write IOPS: host directly : virtual machine = 67.8k : 43.9k ((67.8 - 43.9) / 67.8 ≈ 35% degradation)
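A setup detail worth noting for anyone reproducing the bare qemu-kvm invocation above: QEMU's userspace NVMe driver sits on top of VFIO, so the NVMe adapter must be unbound from the host nvme driver and bound to vfio-pci before QEMU starts (libvirt handles this automatically when the domain XML uses managed='yes'). A minimal sketch using the standard sysfs driver_override mechanism, reusing the 0000:bc:00.0 address from the command line above:

# modprobe vfio-pci
# echo 0000:bc:00.0 > /sys/bus/pci/devices/0000:bc:00.0/driver/unbind
# echo vfio-pci > /sys/bus/pci/devices/0000:bc:00.0/driver_override
# echo 0000:bc:00.0 > /sys/bus/pci/drivers_probe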
Overhead compared to bare metal is expected because further optimizations are still in development and will be added later (separate from this BZ).

Some ways to make the performance comparison more direct (but there will still be a gap, so don't worry about rerunning right now):

- Use virtio-blk instead of virtio-scsi (its overhead is generally lower than virtio-scsi's).

- Use filename=$DEV where DEV is a block device (a virtio-blk device in the guest and an NVMe device/partition on the host) to avoid extra software layers that make it harder to compare results.

- Remove -blockdev node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,discard=unmap. It's not needed and adds a little overhead. Use the nvme blockdev node directly instead (see the sketch after this list).

- Run another fio job with iodepth=1 numjobs=1 to measure latency. The iodepth=16 numjobs=8 job tries to saturate the drive by queuing up many I/O requests, which is interesting, but it's also useful to benchmark a latency-sensitive workload to see the latency of a single request in isolation (a sample invocation follows below).

Thanks!
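To make the virtio-blk and raw-node suggestions concrete, here is a sketch of how the disk portion of the command line from comment 38 could look. This is an untested adaptation that drops the raw format node and the virtio-scsi controller and attaches the nvme node straight to a virtio-blk device (the node name, PCI address, and iothread0 are reused from that command line):

-blockdev node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-only=on,discard=unmap \
-device virtio-blk-pci,bus=pcie-root-port-5,addr=0x0,drive=nvme_image1,id=nvme1,iothread=iothread0 \

And a possible form of the latency-oriented fio job, mirroring the flags used throughout this BZ but with iodepth=1 and numjobs=1 (/dev/vdb is a placeholder for whichever raw block device is being measured, e.g. the virtio-blk disk inside the guest):

# fio --rw=randrw --bs=4k --iodepth=1 --numjobs=1 --runtime=5m --direct=1 --filename=/dev/vdb --ioengine=libaio --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --name=latency_job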
Set this bug to verified accordingly. Thanks, Stefan.