Bug 1900136 - NVMe VFIO driver improvements in QEMU (TechPreview)
Summary: NVMe VFIO driver improvements in QEMU (TechPreview)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 8.3
Assignee: Stefan Hajnoczi
QA Contact: Tingting Mao
URL:
Whiteboard:
Depends On: 1848881
Blocks:
 
Reported: 2020-11-20 22:33 UTC by Ademar Reis
Modified: 2022-11-04 04:04 UTC (History)
14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1918914
Environment:
Last Closed: 2021-11-19 16:19:42 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+



Description Ademar Reis 2020-11-20 22:33:57 UTC
We've made substantial changes to the NVMe driver in QEMU and, although we don't consider it fully supported, we want to encourage layered products to give it a try (it's still TechPreview).

This BZ is to bring some awareness to the changes and request some additional testing of the driver.

This is a copy & paste from a document Stefan shared with QE earlier. It includes testing procedures:

----

A userspace NVMe driver has been available in QEMU but was experimental until recently. It is now ready to be used when the physical storage is a local NVMe PCI device. Disk I/O performance is improved over the traditional file-backed block drivers in QEMU.

The entire PCI adapter is assigned to a single guest. The host cannot access the NVMe device while the guest is running. Users may choose to use VFIO Device Assignment instead for even lower overhead if they do not require live migration.

--- Documentation ---

This feature is available on x86. POWER and aarch64 are not yet supported, but may be available by the release date.

The userspace NVMe driver is a good choice when I/O performance is a priority but VFIO Device Assignment cannot be used. Storage migration and other storage features are available with the userspace NVMe driver.

The libvirt domain XML is as follows:
<disk type='nvme' device='disk'>
  <driver name='qemu' type='raw'/>
  <source type='pci' managed='yes' namespace='1'>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </source>
  <target dev='vde' bus='virtio'/>
</disk>

The NVMe namespace can be selected on drives that support multiple namespaces using <source namespace='N'>.
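For example, a sketch of a disk on a hypothetical second namespace of the same controller (only the namespace attribute and the target dev differ; the PCI address is the same placeholder used above):

<disk type='nvme' device='disk'>
  <driver name='qemu' type='raw'/>
  <source type='pci' managed='yes' namespace='2'>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </source>
  <target dev='vdf' bus='virtio'/>
</disk>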


--- Testing ---

Requires: A host with a spare NVMe drive that is not in use by the host operating system.

Configure a virtio-blk device with the userspace NVMe driver as shown in the libvirt domain XML above. Define an IOThread and assign the virtio-blk device to it. When the guest boots, it sees a virtio-blk device. Comparing the I/O performance against the host /dev/nvme0n1 block device (with aio=native), the userspace NVMe driver should be at least as fast as the host block device.
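A minimal sketch of the IOThread wiring in the domain XML (the thread count shown is an assumption; the iothread attribute on <driver> is what ties the disk to the IOThread):

  <iothreads>1</iothreads>
  ...
  <disk type='nvme' device='disk'>
    <driver name='qemu' type='raw' iothread='1'/>
    ...
  </disk>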

Comment 1 Yanghang Liu 2020-11-30 09:05:19 UTC
Use the following XML to start a VM with a disk:

> <disk type='nvme' device='disk'>
>  <driver name='qemu' type='raw'/>
>  <source type='pci' managed='yes' namespace='1'>
>    <address domain='0x0000' bus='0x65' slot='0x00' function='0x0'/>
>  </source>
>  <target dev='vde' bus='virtio'/>
> </disk>


The QEMU command line looks like:

-blockdev {"driver":"nvme","device":"0000:65:00.0","namespace":1,"node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"} \
-blockdev {"node-name":"libvirt-1-format","read-only":false,"driver":"raw","file":"libvirt-1-storage"} \
-device virtio-blk-pci,bus=pci.4,addr=0x0,drive=libvirt-1-format,id=virtio-disk4 \


Xueqiang, could you have a look?

Comment 2 Xueqiang Wei 2020-12-03 16:16:02 UTC
Following the Description, I ran the fio test on the host /dev/nvme0n1 block device and on a virtio-blk device backed by the userspace NVMe driver in the guest.

Comparing the I/O performance between them, the userspace NVMe driver is much slower than the host block device.


Details:

Version:
kernel-4.18.0-255.el8.x86_64
qemu-kvm-5.2.0-0.module+el8.4.0+8855+a9e237a9


1. create a partition on host /dev/nvme0n1, and mount it to /home/fio_nvme

# lsblk
NAME                             MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                8:0    0   372G  0 disk 
├─sda1                             8:1    0     1G  0 part /boot
└─sda2                             8:2    0   371G  0 part 
  ├─rhel_dell--per740xd--01-root 253:0    0    70G  0 lvm  /
  ├─rhel_dell--per740xd--01-swap 253:1    0  31.4G  0 lvm  [SWAP]
  └─rhel_dell--per740xd--01-home 253:2    0 269.7G  0 lvm  /home
sdb                                8:16   0 558.4G  0 disk 
└─sdb1                             8:17   0 558.4G  0 part 
nvme0n1                          259:0    0 745.2G  0 disk 
└─nvme0n1p1                      259:2    0 745.2G  0 part /home/fio_nvme


2. do fio test on /home/fio_nvme

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/home/fio_nvme/test   --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1

job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][r=547MiB/s,w=137MiB/s][r=140k,w=34.9k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=32471: Thu Dec  3 09:31:22 2020
  read: IOPS=133k, BW=521MiB/s (546MB/s)(153GiB/300005msec)
    slat (nsec): min=1791, max=256544, avg=7424.69, stdev=3190.10
    clat (usec): min=33, max=12300, avg=634.19, stdev=703.31
     lat (usec): min=46, max=12306, avg=641.74, stdev=703.23
    clat percentiles (usec):
     |  1.00th=[   90],  5.00th=[  105], 10.00th=[  123], 20.00th=[  167],
     | 30.00th=[  215], 40.00th=[  273], 50.00th=[  347], 60.00th=[  449],
     | 70.00th=[  603], 80.00th=[  906], 90.00th=[ 1860], 95.00th=[ 2343],
     | 99.00th=[ 2835], 99.50th=[ 3064], 99.90th=[ 4555], 99.95th=[ 4883],
     | 99.99th=[ 5735]
   bw (  KiB/s): min=487200, max=587008, per=100.00%, avg=534129.38, stdev=2211.75, samples=4784
   iops        : min=121800, max=146752, avg=133532.32, stdev=552.93, samples=4784
  write: IOPS=33.3k, BW=130MiB/s (137MB/s)(38.2GiB/300005msec); 0 zone resets
    slat (nsec): min=1924, max=348193, avg=8215.24, stdev=3844.45
    clat (usec): min=11, max=13123, avg=1259.29, stdev=1435.59
     lat (usec): min=22, max=13130, avg=1267.63, stdev=1435.34
    clat percentiles (usec):
     |  1.00th=[   29],  5.00th=[   56], 10.00th=[  100], 20.00th=[  182],
     | 30.00th=[  265], 40.00th=[  375], 50.00th=[  553], 60.00th=[  865],
     | 70.00th=[ 1532], 80.00th=[ 2638], 90.00th=[ 3523], 95.00th=[ 3982],
     | 99.00th=[ 5997], 99.50th=[ 6456], 99.90th=[ 7373], 99.95th=[ 7832],
     | 99.99th=[ 9110]
   bw (  KiB/s): min=118944, max=149098, per=100.00%, avg=133549.00, stdev=608.14, samples=4784
   iops        : min=29736, max=37274, avg=33387.24, stdev=152.03, samples=4784
  lat (usec)   : 20=0.01%, 50=0.90%, 100=3.91%, 250=30.07%, 500=25.66%
  lat (usec)   : 750=11.62%, 1000=5.82%
  lat (msec)   : 2=9.93%, 4=10.95%, 10=1.14%, 20=0.01%
  cpu          : usr=7.14%, sys=15.25%, ctx=27755970, majf=0, minf=379
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=39999049,10001019,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=521MiB/s (546MB/s), 521MiB/s-521MiB/s (546MB/s-546MB/s), io=153GiB (164GB), run=300005-300005msec
  WRITE: bw=130MiB/s (137MB/s), 130MiB/s-130MiB/s (137MB/s-137MB/s), io=38.2GiB (40.0GB), run=300005-300005msec

Disk stats (read/write):
  nvme0n1: ios=39986219/9997872, merge=0/3, ticks=25066034/12163862, in_queue=37229896, util=100.00%


3.  Configure the userspace NVMe driver

Unbind the host NVMe controller from its host driver
# echo 0000:bc:00.0 > /sys/bus/pci/devices/0000\:bc\:00.0/driver/unbind

Bind the host NVMe controller to the host vfio-pci driver
# echo 144d a822 > /sys/bus/pci/drivers/vfio-pci/new_id
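(For reference, a sketch of how the vendor:device pair written to new_id can be looked up and how the resulting binding can be checked; it assumes the vfio-pci module was already loaded and uses the same 0000:bc:00.0 address as above:)

# lspci -n -s 0000:bc:00.0                           # numeric output shows the vendor:device pair (144d:a822 here)
# readlink /sys/bus/pci/devices/0000:bc:00.0/driver  # should now point at .../drivers/vfio-pci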


4. create an image

# qemu-img create -f raw nvme://0000:bc:00.0/1 20G
# qemu-img info nvme://0000:bc:00.0/1
image: nvme://0000:bc:00.0/1
file format: raw
virtual size: 745 GiB (800166076416 bytes)
disk size: unavailable


5. Configure a virtio-blk device with the userspace NVMe driver, define an IOThread and assign it to the virtio-blk device.
   Then install a RHEL 8.4 guest on it.

/usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine q35 \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 15360  \
    -smp 16,maxcpus=16,cores=8,threads=1,dies=1,sockets=2  \
    -cpu 'Haswell-noTSX',+kvm_pv_unhalt \
    -chardev socket,nowait,path=/var/tmp/avocado_xpeuo28b/monitor-qmpmonitor1-20200522-125204-4Vi7sqOR,server,id=qmp_id_qmpmonitor1  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,nowait,path=/var/tmp/avocado_xpeuo28b/monitor-catch_monitor-20200522-125204-4Vi7sqOR,server,id=qmp_id_catch_monitor  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idX2dIhI \
    -chardev socket,nowait,path=/var/tmp/avocado_xpeuo28b/serial-serial0-20200522-125204-4Vi7sqOR,server,id=chardev_serial0 \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20200522-125204-4Vi7sqOR,path=/var/tmp/avocado_xpeuo28b/seabios-20200522-125204-4Vi7sqOR,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20200522-125204-4Vi7sqOR,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -object iothread,id=iothread0 \
    -object iothread,id=iothread1 \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:1c:0c:0d:e3:4c,id=idjmZXQS,netdev=idEFQ4i1,bus=pcie-root-port-3,addr=0x0  \
    -netdev tap,id=idEFQ4i1,vhost=on  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -monitor stdio \
    -device pcie-root-port,id=pcie-root-port-5,port=0x5,addr=0x1.0x5,bus=pcie.0,chassis=5 \
    -blockdev node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-only=on,discard=unmap \
    -blockdev node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,discard=unmap \
    -device virtio-blk-pci,id=nvme1,drive=drive_nvme1,bootindex=0,bus=pcie-root-port-5,addr=0x0,iothread=iothread1 \
    -device pcie-root-port,id=pcie-root-port-6,port=0x6,addr=0x1.0x6,bus=pcie.0,chassis=6 \
    -device virtio-scsi-pci,id=virtio_scsi_pci2,bus=pcie-root-port-6,addr=0x0 \
    -blockdev node-name=file_cd1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/iso/linux/RHEL-8.4.0-20200905.n.0-x86_64-dvd1.iso,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_cd1,driver=raw,read-only=on,cache.direct=on,cache.no-flush=off,file=file_cd1 \
    -device scsi-cd,id=cd1,drive=drive_cd1,write-cache=on,bootindex=1 \


6. boot the guest after the installation and run the fio test on /home/test

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/home/test   --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][3.5%][eta 08d:21h:09m:48s]]



After step 2, fio test finished in 5 minutes.
After step 6, fio test didn't finish after 12 hours.



Hi Philippe, Ademar,

I want to confirm the following items; please correct me if I am wrong. Many thanks.

1. I think we also need to compare the I/O performance between the host /dev/nvme0n1 block device 
   and NVMe Device Assignment, right? (NVMe Device Assignment, e.g. -device vfio-pci,host=0000:65:00.0,id=pf2,bus=root.5,addr=0x0)
   And this bug just tracks the NVMe userspace driver, right?

2. Parameter 'aio' is unexpected when booting a virtio-blk device with the userspace NVMe driver

qemu cmd lines:
    -object iothread,id=iothread1 \
    -device pcie-root-port,id=pcie-root-port-5,port=0x5,addr=0x1.0x5,bus=pcie.0,chassis=5 \
    -blockdev node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-only=on,discard=unmap,aio=native  \
    -blockdev node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,discard=unmap \
    -device virtio-blk-pci,id=nvme1,drive=drive_nvme1,bootindex=0,bus=pcie-root-port-5,addr=0x0,iothread=iothread1 \

error message:
qemu-kvm: -blockdev node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-only=on,discard=unmap,aio=native: Parameter 'aio' is unexpected

Do we need an RFE bug to track this issue?

Comment 4 Philippe Mathieu-Daudé 2020-12-03 22:10:11 UTC
(In reply to Xueqiang Wei from comment #2)
[...]
> 4. create a image 
> 
> # qemu-img create -f raw nvme://0000:bc:00.0/1 20G

Since you use a 20G size here,

> # qemu-img info nvme://0000:bc:00.0/1
> image: nvme://0000:bc:00.0/1
> file format: raw
> virtual size: 745 GiB (800166076416 bytes)
> disk size: unavailable

...
>     -blockdev
> node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-
> only=on,discard=unmap \
>     -blockdev
> node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,
> discard=unmap \

I think you should use ...,size=20G here.
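(For clarity, a sketch of what the adjusted raw-format blockdev line could look like, with the size given in bytes as in the later tests; 20G = 21474836480:)

-blockdev node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,discard=unmap,size=21474836480 \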

Comment 5 Xueqiang Wei 2020-12-04 12:11:17 UTC
(In reply to Philippe Mathieu-Daudé from comment #4)
> (In reply to Xueqiang Wei from comment #2)
> [...]
> > 4. create a image 
> > 
> > # qemu-img create -f raw nvme://0000:bc:00.0/1 20G
> 
> Since you use a 20G size here,
> 
> > # qemu-img info nvme://0000:bc:00.0/1
> > image: nvme://0000:bc:00.0/1
> > file format: raw
> > virtual size: 745 GiB (800166076416 bytes)
> > disk size: unavailable
> 
> ...
> >     -blockdev
> > node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-
> > only=on,discard=unmap \
> >     -blockdev
> > node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,
> > discard=unmap \
> 
> I think you should use ...,size=20G here.



1. Create an image with size=30G, and add size=32212254720 to the blockdev command line.
   Install a RHEL 8.4 guest on it; the installation didn't finish after 2 hours. The details are shown below.

2. Without size=32212254720 in the command line, the installation finished in 30 minutes.

3. Philippe, please check the two questions I asked in Comment 2. Many thanks.



Details:

# qemu-img create -f raw nvme://0000:bc:00.0/1 30G
Formatting 'nvme://0000:bc:00.0/1', fmt=raw size=32212254720

# qemu-img info nvme://0000:bc:00.0/1
image: nvme://0000:bc:00.0/1
file format: raw
virtual size: 745 GiB (800166076416 bytes)
disk size: unavailable

qemu cmd lines:
-object iothread,id=iothread1 \
-device pcie-root-port,id=pcie-root-port-5,port=0x5,addr=0x1.0x5,bus=pcie.0,chassis=5 \
-blockdev node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,discard=unmap,size=32212254720 \
-device virtio-blk-pci,id=nvme1,drive=drive_nvme1,bootindex=0,bus=pcie-root-port-5,addr=0x0,iothread=iothread1 \


screenshot: http://fileshare.englab.nay.redhat.com/pub/section2/kvm/xuwei/bug/installation_screenshot.png

Comment 6 Xueqiang Wei 2020-12-13 10:45:58 UTC
Tested with qemu-kvm-5.2.0-1.module+el8.4.0+9091+650b220a; the issue mentioned in Comment 5 is no longer hit. But the fio test in the guest still cannot finish, even though the progress is shown as 100%.

Details:

Versions:
kernel-4.18.0-260.el8.x86_64
qemu-kvm-5.2.0-1.module+el8.4.0+9091+650b220a


1. create raw image on nvme device
# qemu-img create -f raw nvme://0000:bc:00.0/1 30G
Formatting 'nvme://0000:bc:00.0/1', fmt=raw size=32212254720

# qemu-img info nvme://0000:bc:00.0/1
image: nvme://0000:bc:00.0/1
file format: raw
virtual size: 745 GiB (800166076416 bytes)
disk size: unavailable

2. install rhel8.4 on it
/usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine q35 \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 15360  \
    -smp 16,maxcpus=16,cores=8,threads=1,dies=1,sockets=2  \
    -cpu 'Skylake-Server',+kvm_pv_unhalt \
    -chardev socket,nowait,path=/var/tmp/avocado_xpeuo28b/monitor-qmpmonitor1-20200522-125204-4Vi7sqOR,server,id=qmp_id_qmpmonitor1  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,nowait,path=/var/tmp/avocado_xpeuo28b/monitor-catch_monitor-20200522-125204-4Vi7sqOR,server,id=qmp_id_catch_monitor  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idX2dIhI \
    -chardev socket,nowait,path=/var/tmp/avocado_xpeuo28b/serial-serial0-20200522-125204-4Vi7sqOR,server,id=chardev_serial0 \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20200522-125204-4Vi7sqOR,path=/var/tmp/avocado_xpeuo28b/seabios-20200522-125204-4Vi7sqOR,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20200522-125204-4Vi7sqOR,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -object iothread,id=iothread0 \
    -object iothread,id=iothread1 \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:1c:0c:0d:e3:4c,id=idjmZXQS,netdev=idEFQ4i1,bus=pcie-root-port-3,addr=0x0  \
    -netdev tap,id=idEFQ4i1,vhost=on  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -monitor stdio \
    -device pcie-root-port,id=pcie-root-port-5,port=0x5,addr=0x1.0x5,bus=pcie.0,chassis=5 \
    -blockdev node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-only=on,discard=unmap \
    -blockdev node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,discard=unmap,size=32212254720 \
    -device virtio-blk-pci,id=nvme1,drive=drive_nvme1,bootindex=0,bus=pcie-root-port-5,addr=0x0,iothread=iothread1 \
    -device pcie-root-port,id=pcie-root-port-6,port=0x6,addr=0x1.0x6,bus=pcie.0,chassis=6 \
    -device virtio-scsi-pci,id=virtio_scsi_pci2,bus=pcie-root-port-6,addr=0x0 \
    -blockdev node-name=file_cd1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/iso/linux/RHEL-8.4.0-20201209.n.0-x86_64-dvd1.iso,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_cd1,driver=raw,read-only=on,cache.direct=on,cache.no-flush=off,file=file_cd1 \
    -device scsi-cd,id=cd1,drive=drive_cd1,write-cache=on,bootindex=1 \


3. check info in guest
# uname -r
kernel-4.18.0-259.el8.x86_64

# lsblk
NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sr0            11:0    1  9.3G  0 rom  /run/media/xuwei/RHEL-8-4-0-BaseOS-x86_64
vda           252:0    0   30G  0 disk 
├─vda1        252:1    0    1G  0 part /boot
└─vda2        252:2    0   29G  0 part 
  ├─rhel-root 253:0    0   26G  0 lvm  /
  └─rhel-swap 253:1    0    3G  0 lvm  [SWAP]


4. fio test
# mkdir -p /home/fio_nvme/

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/home/fio_nvme/test   --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][eta 00m:00s]   *************  fio test didn't finish after 12 hours


5. check the vm status:
QEMU 5.2.0 monitor - type 'help' for more information
(qemu) c
(qemu) info status 
VM status: running


After step 2, the guest installed successfully, and the disk size is 30G in the guest.
After step 4, the fio test didn't finish after 12 hours, even though the progress was shown as 100%.
              Checking the VM status, it is still running.



Hi Philippe,

Please help check whether I missed any messages or steps.

By the way, please also check the two questions I asked in Comment 2. Many thanks.


Hi Yanghang,

Please help test it in your NVMe environment and check whether you also hit it. Thanks.

Comment 7 Yanghang Liu 2020-12-13 10:51:43 UTC
(In reply to Xueqiang Wei from comment #6)

> Hi Yanghang,
> 
> Please help test it with your nvme environment, check if you will also hit
> it. Thanks.

Hi,
My machine with the NVMe disk is currently busy with other test tasks.
I will test this as soon as I get the machine back.

Comment 8 Yanghang Liu 2020-12-14 15:57:17 UTC
Hi Philippe,

Could you check the following test scenario:

  1. prepare a host with an NVMe disk and bind the host NVMe disk's driver to vfio-pci
    # virsh nodedev-detach pci_$domain_$bus_$device_$function

  2. assign the NVMe disk to the VM and start the VM
    the XML of this NVMe is like:

      <hostdev mode='subsystem' type='pci' managed='yes'>
        <driver name='vfio'/>
        <source>
          <address domain='$domain' bus='$bus' slot='$device' function='$function'/>
        </source>
        <alias name='hostdev0'/>
      </hostdev>


    the qemu cmd line of this nvme is like:

      -device vfio-pci,host=$domain:$bus:$device.$function,id=nvme_disk,addr=0x0 

  3. do some performance tests for the disk in the VM



According to your description in comment 0, it seems to me that we don't need to cover the test scenario above for this bug.

And the QEMU command line and domain XML used to verify this bug should be the same as the ones I posted in comment 1.

Is my understanding correct?

If I have any misunderstandings, or if I need to do some additional testing for this bug, please feel free to let me know.

Thanks in advance.

Comment 11 Yanghang Liu 2020-12-22 08:19:45 UTC
> Please help test it with your nvme environment,check if you will also hit it.
In my test environment, I encountered the same problem as Xueqiang mentioned in comment 6.


Test Env:
host:
qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f.x86_64
4.18.0-262.el8.dt4.x86_64
guest:
4.18.0-262.el8.x86_64



Test Step:
(1)
# virsh nodedev-detach pci_0000_65_00_0

(2)
# qemu-img create -f raw nvme://0000:65:00.0/1 30G
Formatting 'nvme://0000:65:00.0/1', fmt=raw size=32212254720

# qemu-img info nvme://0000:65:00.0/1
image: nvme://0000:65:00.0/1
file format: raw
virtual size: 745 GiB (800166076416 bytes)
disk size: unavailable

(3) install a vm on the userspace nvme disk 
-object iothread,id=iothread1 \
-blockdev node-name=nvme_image1,driver=nvme,device=0000:65:00.0,namespace=1,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,discard=unmap,size=32212254720 \
-device virtio-blk-pci,id=nvme1,drive=drive_nvme1,bootindex=0,bus=root.2,addr=0x0,iothread=iothread1 \
...
-device virtio-scsi-pci,id=virtio_scsi_pci2,bus=root.4,addr=0x0 \
-blockdev node-name=file_cd1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/iso/RHEL-8.4.0-20201217.n.0-x86_64-dvd1.iso,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_cd1,driver=raw,read-only=on,cache.direct=on,cache.no-flush=off,file=file_cd1 \
-device scsi-cd,id=cd1,drive=drive_cd1,write-cache=on,bootindex=1 \
...

(4) do fio test in the vm

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/home/fio_nvme_test   --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][eta 00m:00s]                                           
Jobs: 8 (f=8): [m(8)][100.0%][eta 00m:00s]
Jobs: 8 (f=8): [m(8)][100.0%][eta 00m:00s]  <---- This fio test cannot be completed.

Some error dmesg:
INFO: task X blocked for more than 120 seconds.
Not tainted 4.18.0-262.el8.x86_64 #1

Comment 15 Yanghang Liu 2021-01-06 05:55:17 UTC
According to the bug description and comment 1, this bug belongs to the NVMe userspace driver part.

Assigning QA Contact to Xueqiang first.

Please feel free to ping me if there is anything I can help.

Comment 16 Xueqiang Wei 2021-01-14 03:08:04 UTC
With the same steps as in Comment 6, I retested on qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f; the fio test in the guest still can't finish.

Versions:
Host:
kernel-4.18.0-270.el8.x86_64
qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f
Guest:
kernel-4.18.0-259.el8.x86_64


In guest:
1. check dmesg before the fio test; no "Call Trace" found
# dmesg | grep "Call Trace"
#

2. check dmesg during the fio test; "Call Trace" found
# dmesg | grep "Call Trace"
[  615.352754] Call Trace:
[  738.222864] Call Trace:
[  738.223445] Call Trace:

[  861.093906] INFO: task in:imjournal:1604 blocked for more than 120 seconds.
[  861.093909]       Not tainted 4.18.0-259.el8.x86_64 #1
[  861.093910] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  861.093947] in:imjournal    D    0  1604      1 0x00000080
[  861.093949] Call Trace:
[  861.093959]  __schedule+0x2a6/0x700
[  861.093963]  schedule+0x38/0xa0
[  861.093965]  io_schedule+0x12/0x40
[  861.093969]  wait_on_page_bit+0x137/0x230
[  861.093973]  ? xas_find+0x173/0x1b0
[  861.093977]  ? file_check_and_advance_wb_err+0xd0/0xd0
[  861.093984]  truncate_inode_pages_range+0x484/0x8b0
[  861.094038]  ? xfs_rename+0x5f7/0x9b0 [xfs]
[  861.094049]  ? __d_move+0x296/0x510
[  861.094053]  ? __inode_wait_for_writeback+0x7f/0xf0
[  861.094058]  ? init_wait_var_entry+0x50/0x50
[  861.094062]  evict+0x183/0x1a0
[  861.094065]  __dentry_kill+0xd5/0x170
[  861.094068]  dentry_kill+0x4d/0x190
[  861.094071]  dput.part.34+0xd9/0x120
[  861.094075]  do_renameat2+0x39d/0x530
[  861.094080]  __x64_sys_rename+0x1c/0x20
[  861.094084]  do_syscall_64+0x5b/0x1a0
[  861.094088]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  861.094090] RIP: 0033:0x7fb7278da9bb
[  861.094094] Code: Bad RIP value.
[  861.094095] RSP: 002b:00007fb72516cae8 EFLAGS: 00000213 ORIG_RAX: 0000000000000052
[  861.094098] RAX: ffffffffffffffda RBX: 00007fb72516caf0 RCX: 00007fb7278da9bb
[  861.094099] RDX: 000055a7da65b250 RSI: 000055a7da65b940 RDI: 00007fb72516caf0
[  861.094100] RBP: 000055a7da65b170 R08: 00007fb71806e9f0 R09: 0000000000000003
[  861.094102] R10: 000000000000003f R11: 0000000000000213 R12: 0000000000000000
[  861.094103] R13: 00007fb718020df0 R14: 0000000000000051 R15: 00007fb725ab5c38

Comment 18 Philippe Mathieu-Daudé 2021-01-20 00:32:53 UTC
Hi Yanghang,

(In reply to Yanghang Liu from comment #8)
> Hi Philippe,
> 
> Could you check the following test sceniro:
> 
>   1.prepare a host with nvme disk and bind the driver of host nvme disk to
> vfio-pci
>     # virsh nodedev-detach pci_$domain_$bus_$device_$function
> 
>   2.assign the nvme disk to the vm and start the vm
>     the xml of this nvme is like:
> 
>       <hostdev mode='subsystem' type='pci' managed='yes'>
>         <driver name='vfio'/>
>         <source>
>           <address domain='$domain' bus='$bus' slot='$device'
> function='$device'/>
>         </source>
>         <alias name='hostdev0'/>
>       </hostdev>
> 
> 
>     the qemu cmd line of this nvme is like:
> 
>       -device
> vfio-pci,host=$domain:$bus:$device.$device,id=nvme_disk,addr=0x0 
> 
>   3.do some performance test for the disk in the vm
> 
> 
> 
> According to your description in comment0, it seems to me that we don’t need
> to cover the test scenarios above for this bug.

Indeed. Per comment #0 this is for when "VFIO Device Assignment cannot be used".
So we do not want to test the "-device vfio-pci,host=..." command in this BZ.

> And the qemu command line and domain xml used to verify this bug should be
> the same as the one as I posted in the comment 1.
> 
> Is my understanding correct?

Correct, I am using the format from comment #1.

Manually I use:
-drive file=nvme://0000:04:00.0/1,if=none,id=drive0 -device virtio-blk-pci,drive=drive0

Or with libvirt:

    <disk type='nvme' device='disk'>
      <driver name='qemu' type='raw'/>
      <source type='pci' managed='no' namespace='1'>
        <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
      </source>
      <target dev='vdb' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </disk>

Expanded to:

-blockdev {"driver":"nvme","device":"0000:04:00.0","namespace":1,"node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-1-format","read-only":false,"driver":"raw","file":"libvirt-1-storage"} -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0xa,drive=libvirt-1-format,id=virtio-disk1

I created an 80G ext2 partition on the NVMe drive:

# cat /proc/partitions 
major minor  #blocks  name

 252        0   16777216 vda
 252        1    1048576 vda1
 252        2   15727616 vda2
 252       16  366292584 vdb
 252       17   83886080 vdb1 <---
 253        0   14045184 dm-0
 253        1    1679360 dm-1

Then mounted it on /mnt and ran your test:

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/mnt/fio_nvme_test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1     
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
Jobs: 8 (f=8): [m(8)][100.0%][r=253MiB/s,w=63.0MiB/s][r=64.7k,w=16.1k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=1472: Tue Jan 19 19:25:39 2021
  read: IOPS=57.3k, BW=224MiB/s (235MB/s)(65.6GiB/300001msec)
    slat (usec): min=2, max=30277, avg=95.85, stdev=541.92
    clat (usec): min=47, max=64752, avg=1685.80, stdev=3262.41
     lat (usec): min=51, max=64920, avg=1782.13, stdev=3402.31
    clat percentiles (usec):
     |  1.00th=[   93],  5.00th=[  103], 10.00th=[  112], 20.00th=[  180],
     | 30.00th=[  204], 40.00th=[  260], 50.00th=[  273], 60.00th=[  293],
     | 70.00th=[  420], 80.00th=[ 2180], 90.00th=[ 5932], 95.00th=[ 8717],
     | 99.00th=[15270], 99.50th=[17171], 99.90th=[23462], 99.95th=[25297],
     | 99.99th=[31327]
   bw (  KiB/s): min=88776, max=501955, per=100.00%, avg=229412.70, stdev=8163.76, samples=4776
   iops        : min=22194, max=125487, avg=57352.19, stdev=2040.94, samples=4776
  write: IOPS=14.3k, BW=55.9MiB/s (58.7MB/s)(16.4GiB/300001msec); 0 zone resets
    slat (usec): min=2, max=30249, avg=155.89, stdev=720.80
    clat (usec): min=48, max=61061, avg=1644.51, stdev=3208.78
     lat (usec): min=53, max=61092, avg=1800.93, stdev=3441.64
    clat percentiles (usec):
     |  1.00th=[   91],  5.00th=[  104], 10.00th=[  115], 20.00th=[  180],
     | 30.00th=[  204], 40.00th=[  260], 50.00th=[  273], 60.00th=[  289],
     | 70.00th=[  404], 80.00th=[ 1958], 90.00th=[ 5735], 95.00th=[ 8586],
     | 99.00th=[15008], 99.50th=[17171], 99.90th=[22938], 99.95th=[24773],
     | 99.99th=[30802]
   bw (  KiB/s): min=22176, max=126822, per=100.00%, avg=57349.80, stdev=2056.94, samples=4776
   iops        : min= 5544, max=31704, avg=14336.51, stdev=514.23, samples=4776
  lat (usec)   : 50=0.01%, 100=3.38%, 250=32.55%, 500=34.89%, 750=1.97%
  lat (usec)   : 1000=1.99%
  lat (msec)   : 2=4.91%, 4=4.28%, 10=12.29%, 20=3.49%, 50=0.24%
  lat (msec)   : 100=0.01%
  cpu          : usr=2.36%, sys=12.00%, ctx=3372753, majf=0, minf=136
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=17188322,4296826,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=224MiB/s (235MB/s), 224MiB/s-224MiB/s (235MB/s-235MB/s), io=65.6GiB (70.4GB), run=300001-300001msec
  WRITE: bw=55.9MiB/s (58.7MB/s), 55.9MiB/s-55.9MiB/s (58.7MB/s-58.7MB/s), io=16.4GiB (17.6GB), run=300001-300001msec

Disk stats (read/write):
  vdb: ios=17173202/4293091, merge=20/0, ticks=910678/237756, in_queue=1148434, util=100.00%

Checking guest stats after the test:

{ "execute": "query-blockstats" }
{
    "return": [
        {
            "device": "virtio-disk1",
            "parent": {
                "node-name": "#block014",
                "driver-specific": {
                    "aligned-accesses": 17563525,
                    "driver": "nvme",
                    "completion-errors": 0,
                    "unaligned-accesses": 4
                }
            },
            "stats": {
                "unmap_operations": 0,
                "unmap_merged": 0,
                "flush_total_time_ns": 105668,
                "wr_highest_offset": 114294784,
                "wr_total_time_ns": 324639788037,
                "failed_wr_operations": 0,
                "failed_rd_operations": 0,
                "wr_merged": 43,
                "wr_bytes": 14381838336,
                "timed_stats": [
                ],
                "failed_unmap_operations": 0,
                "failed_flush_operations": 0,
                "account_invalid": true,
                "rd_total_time_ns": 1321636726291,
                "invalid_unmap_operations": 0,
                "flush_operations": 1,
                "wr_operations": 3511189,
                "unmap_bytes": 0,
                "rd_merged": 1346,
                "rd_bytes": 57570702848,
                "unmap_total_time_ns": 0,
                "invalid_flush_operations": 0,
                "account_failed": true,
                "idle_time_ns": 20045456348,
                "rd_operations": 14053718,
                "invalid_wr_operations": 0,
                "invalid_rd_operations": 0
            },
            "node-name": "#block105",
            "qdev": "/machine/peripheral-anon/device[4]/virtio-backend"
        },

Running the same test on the host:

/dev/nvme0n1p1 on /mnt type ext2 (rw,relatime,seclabel)

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/mnt/fio_nvme_test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.21
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][r=1663MiB/s,w=417MiB/s][r=426k,w=107k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=46916: Tue Jan 19 16:14:34 2021
  read: IOPS=426k, BW=1663MiB/s (1744MB/s)(487GiB/300002msec)
    slat (nsec): min=1947, max=456132, avg=4679.92, stdev=1690.68
    clat (usec): min=59, max=2144, avg=234.07, stdev=13.28
     lat (usec): min=62, max=2149, avg=238.97, stdev=12.97
    clat percentiles (usec):
     |  1.00th=[  212],  5.00th=[  219], 10.00th=[  221], 20.00th=[  225],
     | 30.00th=[  227], 40.00th=[  231], 50.00th=[  233], 60.00th=[  235],
     | 70.00th=[  237], 80.00th=[  241], 90.00th=[  253], 95.00th=[  265],
     | 99.00th=[  277], 99.50th=[  285], 99.90th=[  302], 99.95th=[  306],
     | 99.99th=[  326]
   bw (  MiB/s): min= 1638, max= 1680, per=100.00%, avg=1665.44, stdev= 0.70, samples=4792
   iops        : min=419474, max=430258, avg=426352.63, stdev=178.78, samples=4792
  write: IOPS=106k, BW=416MiB/s (436MB/s)(122GiB/300002msec); 0 zone resets
    slat (usec): min=2, max=367, avg= 5.34, stdev= 1.93
    clat (usec): min=66, max=2351, avg=236.07, stdev=13.43
     lat (usec): min=73, max=2356, avg=241.65, stdev=13.06
    clat percentiles (usec):
     |  1.00th=[  215],  5.00th=[  221], 10.00th=[  223], 20.00th=[  227],
     | 30.00th=[  229], 40.00th=[  231], 50.00th=[  235], 60.00th=[  237],
     | 70.00th=[  239], 80.00th=[  243], 90.00th=[  255], 95.00th=[  265],
     | 99.00th=[  277], 99.50th=[  289], 99.90th=[  306], 99.95th=[  310],
     | 99.99th=[  326]
   bw (  KiB/s): min=411256, max=442344, per=100.00%, avg=426231.17, stdev=589.19, samples=4792
   iops        : min=102814, max=110586, avg=106557.77, stdev=147.30, samples=4792
  lat (usec)   : 100=0.01%, 250=89.13%, 500=10.86%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=16.16%, sys=34.43%, ctx=128107520, majf=0, minf=510
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=127731635,31923845,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=1663MiB/s (1744MB/s), 1663MiB/s-1663MiB/s (1744MB/s-1744MB/s), io=487GiB (523GB), run=300002-300002msec
  WRITE: bw=416MiB/s (436MB/s), 416MiB/s-416MiB/s (436MB/s-436MB/s), io=122GiB (131GB), run=300002-300002msec

Disk stats (read/write):
  nvme0n1: ios=127713173/31919281, merge=0/3, ticks=29747708/7397612, in_queue=37145320, util=100.00%

There is a big performance difference, but no kernel crash.

Note that I am using a different host kernel; tomorrow I plan to reinstall with the same one
used in your test: kernel-4.18.0-270.el8.x86_64

Comment 19 Philippe Mathieu-Daudé 2021-01-20 18:59:05 UTC
I reinstalled my host with RHEL8.4-AV and did the same tests.

Host:
kernel-4.18.0-275.el8.x86_64
qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f
Guest:
kernel-4.18.0-275.el8.x86_64

# virsh dumpxml rhel84c
...
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/rhel84c.qcow2' index='2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </disk>
    <disk type='nvme' device='disk'>
      <driver name='qemu' type='raw'/>
      <source type='pci' managed='no' namespace='1' index='1'>
        <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
      </source>
      <target dev='vdb' bus='virtio'/>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </disk>

[root@guest ~]# cat /proc/partitions 
major minor  #blocks  name

 252        0   16777216 vda
 252        1    1048576 vda1
 252        2   15727616 vda2
 252       16  366292584 vdb
 253        0   14045184 dm-0
 253        1    1679360 dm-1

[root@guest ~]# mke2fs -F /dev/vdb 
mke2fs 1.45.6 (20-Mar-2020)
Found a dos partition table in /dev/vdb
Creating filesystem with 91573146 4k blocks and 22896640 inodes
Filesystem UUID: 46efb6ba-47ed-4a30-ac46-7e6a6a71aa0a
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968

Allocating group tables: done                            
Writing inode tables: done                            
Writing superblocks and filesystem accounting information: done     

[root@guest ~]# mount /dev/vdb /mnt
[root@guest ~]# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/mnt/fio_nvme_test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][r=616MiB/s,w=155MiB/s][r=158k,w=39.6k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=1561: Wed Jan 20 13:56:49 2021
  read: IOPS=152k, BW=595MiB/s (623MB/s)(174GiB/300001msec)
    slat (usec): min=2, max=34921, avg=31.73, stdev=219.39
    clat (usec): min=28, max=52468, avg=629.41, stdev=1146.41
     lat (usec): min=35, max=52545, avg=661.42, stdev=1186.37
    clat percentiles (usec):
     |  1.00th=[   77],  5.00th=[   91], 10.00th=[  101], 20.00th=[  125],
     | 30.00th=[  281], 40.00th=[  383], 50.00th=[  465], 60.00th=[  545],
     | 70.00th=[  635], 80.00th=[  750], 90.00th=[  930], 95.00th=[ 1156],
     | 99.00th=[ 6128], 99.50th=[ 8356], 99.90th=[14484], 99.95th=[16450],
     | 99.99th=[23462]
   bw (  KiB/s): min=282317, max=899654, per=100.00%, avg=609671.56, stdev=9967.15, samples=4784
   iops        : min=70579, max=224912, avg=152417.61, stdev=2491.79, samples=4784
  write: IOPS=38.1k, BW=149MiB/s (156MB/s)(43.5GiB/300001msec); 0 zone resets
    slat (usec): min=2, max=34865, avg=57.19, stdev=310.85
    clat (usec): min=9, max=52387, avg=656.92, stdev=1196.70
     lat (usec): min=41, max=52459, avg=714.46, stdev=1268.16
    clat percentiles (usec):
     |  1.00th=[   77],  5.00th=[   92], 10.00th=[  104], 20.00th=[  128],
     | 30.00th=[  285], 40.00th=[  396], 50.00th=[  486], 60.00th=[  570],
     | 70.00th=[  660], 80.00th=[  775], 90.00th=[  963], 95.00th=[ 1205],
     | 99.00th=[ 6521], 99.50th=[ 8717], 99.90th=[15008], 99.95th=[16909],
     | 99.99th=[23987]
   bw (  KiB/s): min=70531, max=226711, per=100.00%, avg=152402.62, stdev=2508.06, samples=4784
   iops        : min=17632, max=56677, avg=38100.38, stdev=627.02, samples=4784
  lat (usec)   : 10=0.01%, 50=0.01%, 100=9.15%, 250=18.54%, 500=25.97%
  lat (usec)   : 750=26.13%, 1000=12.29%
  lat (msec)   : 2=4.93%, 4=1.07%, 10=1.58%, 20=0.31%, 50=0.02%
  lat (msec)   : 100=0.01%
  cpu          : usr=6.30%, sys=15.73%, ctx=19379351, majf=0, minf=143
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=45665464,11415186,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=595MiB/s (623MB/s), 595MiB/s-595MiB/s (623MB/s-623MB/s), io=174GiB (187GB), run=300001-300001msec
  WRITE: bw=149MiB/s (156MB/s), 149MiB/s-149MiB/s (156MB/s-156MB/s), io=43.5GiB (46.8GB), run=300001-300001msec

Disk stats (read/write):
  vdb: ios=45635088/11407656, merge=0/0, ticks=7636538/1932687, in_queue=9569225, util=100.00%
[root@guest ~]# 

Test finished in 5min, still no kernel crash.

Comment 21 Philippe Mathieu-Daudé 2021-01-21 12:17:43 UTC
Performance improved using the following XML configuration:

  <iothreads>1</iothreads>
  <devices>
   ...
    <disk type='nvme' device='disk'>
      <driver name='qemu' type='raw' iothread='1'/>
      <source type='pci' managed='no' namespace='1' index='1'>
        <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
      </source>
      <target dev='vdb' bus='virtio'/>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </disk>

[root@guest ~]# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/mnt/fio_nvme_test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
Jobs: 8 (f=8): [m(8)][100.0%][r=418MiB/s,w=105MiB/s][r=107k,w=26.8k IOPS][eta 00m:00s] 
job1: (groupid=0, jobs=8): err= 0: pid=1470: Thu Jan 21 06:50:11 2021
  read: IOPS=108k, BW=423MiB/s (444MB/s)(124GiB/300001msec)
    slat (usec): min=2, max=39593, avg=49.25, stdev=523.16
    clat (usec): min=7, max=78085, avg=893.77, stdev=3211.67
     lat (usec): min=32, max=78133, avg=943.23, stdev=3349.10
    clat percentiles (usec):
     |  1.00th=[   96],  5.00th=[  100], 10.00th=[  102], 20.00th=[  108],
     | 30.00th=[  109], 40.00th=[  110], 50.00th=[  111], 60.00th=[  113],
     | 70.00th=[  118], 80.00th=[  130], 90.00th=[  486], 95.00th=[ 6980],
     | 99.00th=[16450], 99.50th=[22152], 99.90th=[31327], 99.95th=[33817],
     | 99.99th=[42206]
   bw (  KiB/s): min=73594, max=973981, per=100.00%, avg=434044.37, stdev=20074.88, samples=4776
   iops        : min=18398, max=243494, avg=108509.89, stdev=5018.73, samples=4776
  write: IOPS=27.1k, BW=106MiB/s (111MB/s)(31.0GiB/300001msec); 0 zone resets
    slat (usec): min=2, max=39670, avg=88.26, stdev=732.69
    clat (usec): min=22, max=78077, avg=859.92, stdev=3119.56
     lat (usec): min=26, max=78701, avg=948.42, stdev=3360.97
    clat percentiles (usec):
     |  1.00th=[   95],  5.00th=[  100], 10.00th=[  102], 20.00th=[  108],
     | 30.00th=[  109], 40.00th=[  110], 50.00th=[  111], 60.00th=[  113],
     | 70.00th=[  118], 80.00th=[  130], 90.00th=[  453], 95.00th=[ 6456],
     | 99.00th=[16319], 99.50th=[21103], 99.90th=[30802], 99.95th=[32637],
     | 99.99th=[41157]
   bw (  KiB/s): min=19034, max=245008, per=100.00%, avg=108541.22, stdev=5038.50, samples=4776
   iops        : min= 4758, max=61251, avg=27134.15, stdev=1259.63, samples=4776
  lat (usec)   : 10=0.01%, 50=0.01%, 100=4.44%, 250=83.88%, 500=1.85%
  lat (usec)   : 750=1.41%, 1000=0.48%
  lat (msec)   : 2=0.45%, 4=1.02%, 10=3.66%, 20=2.18%, 50=0.64%
  lat (msec)   : 100=0.01%
  cpu          : usr=3.43%, sys=7.95%, ctx=2331181, majf=0, minf=152
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=32511521,8130273,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=423MiB/s (444MB/s), 423MiB/s-423MiB/s (444MB/s-444MB/s), io=124GiB (133GB), run=300001-300001msec
  WRITE: bw=106MiB/s (111MB/s), 106MiB/s-106MiB/s (111MB/s-111MB/s), io=31.0GiB (33.3GB), run=300001-300001msec

Disk stats (read/write):
  vdb: ios=32496594/8126521, merge=0/0, ticks=625043/156398, in_queue=781441, util=100.00%

Comment 22 Philippe Mathieu-Daudé 2021-01-21 18:00:59 UTC
Using Xueqiang Wei's hardware, I get the same hang in the guest as reported in comment #16:

> 2. check dmesg during fio test, found "call trace"
> # dmesg | grep "Call Trace"
> [  615.352754] Call Trace:
> [  738.222864] Call Trace:
> [  738.223445] Call Trace:
> 
> [  861.093906] INFO: task in:imjournal:1604 blocked for more than 120
> seconds.
> [  861.093909]       Not tainted 4.18.0-259.el8.x86_64 #1
> [  861.093910] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [  861.093947] in:imjournal    D    0  1604      1 0x00000080
> [  861.093949] Call Trace:
> [  861.093959]  __schedule+0x2a6/0x700
> [  861.093963]  schedule+0x38/0xa0
> [  861.093965]  io_schedule+0x12/0x40
> [  861.093969]  wait_on_page_bit+0x137/0x230
> [  861.093973]  ? xas_find+0x173/0x1b0
> [  861.093977]  ? file_check_and_advance_wb_err+0xd0/0xd0
> [  861.093984]  truncate_inode_pages_range+0x484/0x8b0
> [  861.094038]  ? xfs_rename+0x5f7/0x9b0 [xfs]
> [  861.094049]  ? __d_move+0x296/0x510
> [  861.094053]  ? __inode_wait_for_writeback+0x7f/0xf0
> [  861.094058]  ? init_wait_var_entry+0x50/0x50
> [  861.094062]  evict+0x183/0x1a0
> [  861.094065]  __dentry_kill+0xd5/0x170
> [  861.094068]  dentry_kill+0x4d/0x190
> [  861.094071]  dput.part.34+0xd9/0x120
> [  861.094075]  do_renameat2+0x39d/0x530
> [  861.094080]  __x64_sys_rename+0x1c/0x20
> [  861.094084]  do_syscall_64+0x5b/0x1a0
> [  861.094088]  entry_SYSCALL_64_after_hwframe+0x65/0xca
> [  861.094090] RIP: 0033:0x7fb7278da9bb
> [  861.094094] Code: Bad RIP value.
> [  861.094095] RSP: 002b:00007fb72516cae8 EFLAGS: 00000213 ORIG_RAX:
> 0000000000000052
> [  861.094098] RAX: ffffffffffffffda RBX: 00007fb72516caf0 RCX:
> 00007fb7278da9bb
> [  861.094099] RDX: 000055a7da65b250 RSI: 000055a7da65b940 RDI:
> 00007fb72516caf0
> [  861.094100] RBP: 000055a7da65b170 R08: 00007fb71806e9f0 R09:
> 0000000000000003
> [  861.094102] R10: 000000000000003f R11: 0000000000000213 R12:
> 0000000000000000
> [  861.094103] R13: 00007fb718020df0 R14: 0000000000000051 R15:
> 00007fb725ab5c38

Comment 23 Xueqiang Wei 2021-01-25 15:00:25 UTC
Hi Philippe,

1. Could you please share the details of your NVMe device? Maybe QE needs to purchase a new NVMe device.

2. I tested with another NVMe disk and did not hit the issue reported in Comment 16. 

Versions:
Host:
kernel-4.18.0-270.el8.x86_64
qemu-kvm-5.2.0-1.module+el8.4.0+9091+650b220a
Guest:
kernel-4.18.0-259.el8.x86_64

in the host:
#  lspci|grep -i "Non-Volatile"
81:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)

# lspci -n -s 0000:81:00.0
81:00.0 0108: 8086:0953 (rev 01)

# echo 0000:81:00.0 > /sys/bus/pci/devices/0000\:81\:00.0/driver/unbind
# echo 8086 0953 > /sys/bus/pci/drivers/vfio-pci/new_id

# qemu-img create -f raw nvme://0000:81:00.0/1 30G
Formatting 'nvme://0000:81:00.0/1', fmt=raw size=32212254720

# qemu-img info nvme://0000:81:00.0/1
image: nvme://0000:81:00.0/1
file format: raw
virtual size: 373 GiB (400088457216 bytes)
disk size: unavailable

in the host:
# mkfs.xfs /dev/nvme0n1
# mount /dev/nvme0n1 /mnt/nvme_fio_test/

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/mnt/nvme_fio_test/test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][1.0%][r=888MiB/s,w=221MiB/s][r=227k,w=56.5k IOPS][eta 04m:Jobs: 8 (f=8): [m(8)][1.3%][r=862MiB/s,w=214MiB/s][r=221k,w=54.9k IOPS][eta 04m:Jobs: 8 (f=8): [m(8)]......[99.7%][r=790MiB/s,w=196MiB/s][r=202k,w=50.3k IOPS][eta 00mJobs: 8 (f=8): [m(8)][100.0%][r=811MiB/s,w=204MiB/s][r=208k,w=52.3k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=8638: Fri Jan 22 02:31:42 2021
  read: IOPS=219k, BW=857MiB/s (899MB/s)(251GiB/300002msec)
    slat (usec): min=3, max=276, avg= 6.57, stdev= 3.24
    clat (usec): min=7, max=17087, avg=545.15, stdev=662.55
     lat (usec): min=29, max=17093, avg=551.95, stdev=662.55
    clat percentiles (usec):
     |  1.00th=[   79],  5.00th=[   93], 10.00th=[  104], 20.00th=[  131],
     | 30.00th=[  172], 40.00th=[  229], 50.00th=[  297], 60.00th=[  396],
     | 70.00th=[  537], 80.00th=[  717], 90.00th=[ 1418], 95.00th=[ 2245],
     | 99.00th=[ 2868], 99.50th=[ 3032], 99.90th=[ 3359], 99.95th=[ 3687],
     | 99.99th=[12256]
   bw (  KiB/s): min=785320, max=978192, per=100.00%, avg=878976.52, stdev=5890.75, samples=4784
   iops        : min=196330, max=244546, avg=219744.10, stdev=1472.68, samples=4784
  write: IOPS=54.9k, BW=214MiB/s (225MB/s)(62.8GiB/300002msec); 0 zone resets
    slat (usec): min=4, max=396, avg= 7.23, stdev= 3.94
    clat (usec): min=4, max=14517, avg=110.82, stdev=180.05
     lat (usec): min=17, max=14523, avg=118.29, stdev=180.10
    clat percentiles (usec):
     |  1.00th=[   15],  5.00th=[   16], 10.00th=[   18], 20.00th=[   24],
     | 30.00th=[   32], 40.00th=[   43], 50.00th=[   57], 60.00th=[   75],
     | 70.00th=[  104], 80.00th=[  159], 90.00th=[  260], 95.00th=[  359],
     | 99.00th=[  816], 99.50th=[ 1090], 99.90th=[ 1713], 99.95th=[ 1991],
     | 99.99th=[ 5538]
   bw (  KiB/s): min=192472, max=250000, per=100.00%, avg=219831.45, stdev=1548.75, samples=4784
   iops        : min=48118, max=62500, avg=54957.86, stdev=387.19, samples=4784
  lat (usec)   : 10=0.01%, 20=2.88%, 50=6.37%, 100=11.10%, 250=31.95%
  lat (usec)   : 500=21.31%, 750=11.16%, 1000=4.47%
  lat (msec)   : 2=5.53%, 4=5.19%, 10=0.02%, 20=0.01%
  cpu          : usr=11.13%, sys=22.81%, ctx=48265710, majf=0, minf=1984
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=65833373,16465027,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=857MiB/s (899MB/s), 857MiB/s-857MiB/s (899MB/s-899MB/s), io=251GiB (270GB), run=300002-300002msec
  WRITE: bw=214MiB/s (225MB/s), 214MiB/s-214MiB/s (225MB/s-225MB/s), io=62.8GiB (67.4GB), run=300002-300002msec

Disk stats (read/write):
  nvme0n1: ios=65802590/16457320, merge=0/3, ticks=35459756/1622270, in_queue=37082027, util=100.00%


in the guest:
# lsblk
NAME           MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sr0             11:0    1  9.3G  0 rom  /run/media/root/RHEL-8-4-0-BaseOS-x86_64
vda            252:0    0   30G  0 disk 
vdb            252:16   0   20G  0 disk 
├─vdb1         252:17   0    1G  0 part /boot
└─vdb2         252:18   0   19G  0 part 
  ├─rhel_vm--197--177-root
  │            253:0    0   17G  0 lvm  /
  └─rhel_vm--197--177-swap
               253:1    0    2G  0 lvm  [SWAP]

# mkdir -p /home/fio_nvme/
# mkfs.xfs /dev/vda
# mount /dev/vda /home/fio_nvme/

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/home/fio_nvme/test   --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][1.0%][r=403MiB/s,w=102MiB/s][r=103k,w=26.0k IOPS][eta 04m:Jobs: 8 (f=8): [m(8)][1.3%][r=411MiB/s,w=102MiB/s][r=105k,w=26.2k IOPS][eta 04m:Jobs: 8 (f=8): [m(8)]......[99.3%][r=518MiB/s,w=129MiB/s][r=133k,w=33.0k IOPS][eta 00mJobs: 8 (f=8): [m(8)][99.7%][r=511MiB/s,w=129MiB/s][r=131k,w=32.9k IOPS][eta 00mJobs: 8 (f=8): [m(8)][100.0%][r=522MiB/s,w=132MiB/s][r=134k,w=33.7k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=3181: Fri Jan 22 15:54:10 2021
  read: IOPS=129k, BW=504MiB/s (528MB/s)(148GiB/300003msec)
    slat (usec): min=5, max=5240, avg=11.19, stdev= 8.44
    clat (usec): min=48, max=30506, avg=840.15, stdev=607.15
     lat (usec): min=62, max=30512, avg=851.70, stdev=606.89
    clat percentiles (usec):
     |  1.00th=[  188],  5.00th=[  253], 10.00th=[  314], 20.00th=[  486],
     | 30.00th=[  611], 40.00th=[  676], 50.00th=[  742], 60.00th=[  816],
     | 70.00th=[  906], 80.00th=[ 1029], 90.00th=[ 1303], 95.00th=[ 1942],
     | 99.00th=[ 2966], 99.50th=[ 3163], 99.90th=[ 3687], 99.95th=[ 4948],
     | 99.99th=[23987]
   bw (  KiB/s): min=313457, max=617248, per=100.00%, avg=516616.82, stdev=7492.02, samples=4784
   iops        : min=78363, max=154312, avg=129153.98, stdev=1873.02, samples=4784
  write: IOPS=32.2k, BW=126MiB/s (132MB/s)(36.9GiB/300003msec); 0 zone resets
    slat (usec): min=5, max=5309, avg=12.39, stdev= 8.81
    clat (usec): min=24, max=29131, avg=541.07, stdev=385.53
     lat (usec): min=37, max=29139, avg=553.83, stdev=385.33
    clat percentiles (usec):
     |  1.00th=[  151],  5.00th=[  202], 10.00th=[  237], 20.00th=[  285],
     | 30.00th=[  330], 40.00th=[  379], 50.00th=[  441], 60.00th=[  529],
     | 70.00th=[  644], 80.00th=[  799], 90.00th=[  988], 95.00th=[ 1139],
     | 99.00th=[ 1401], 99.50th=[ 1516], 99.90th=[ 1762], 99.95th=[ 3654],
     | 99.99th=[ 9634]
   bw (  KiB/s): min=76696, max=157896, per=100.00%, avg=129156.98, stdev=1903.94, samples=4784
   iops        : min=19174, max=39474, avg=32289.06, stdev=476.01, samples=4784
  lat (usec)   : 50=0.01%, 100=0.02%, 250=6.31%, 500=21.69%, 750=28.23%
  lat (usec)   : 1000=24.14%
  lat (msec)   : 2=15.80%, 4=3.74%, 10=0.05%, 20=0.01%, 50=0.01%
  cpu          : usr=9.40%, sys=21.46%, ctx=10682480, majf=0, minf=147
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=38684995,9671505,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=504MiB/s (528MB/s), 504MiB/s-504MiB/s (528MB/s-528MB/s), io=148GiB (158GB), run=300003-300003msec
  WRITE: bw=126MiB/s (132MB/s), 126MiB/s-126MiB/s (132MB/s-132MB/s), io=36.9GiB (39.6GB), run=300003-300003msec

Disk stats (read/write):
  vda: ios=38668202/9667305, merge=0/5, ticks=30154026/4415942, in_queue=34569968, util=100.00%


3. According to Comment 20 and item 2 in this comment, not every NVMe device works well.
   How should we deal with this issue? What do you think? Many thanks.

The hang reported in comment 16 is reproduced in the guest with the following NVMe device:
#  lspci|grep -i "Non-Volatile"
bc:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172Xa/172Xb (rev 01)

The same hang reported in comment 16 is not reproduced in the guest with the following NVMe device:
#  lspci|grep -i "Non-Volatile"
81:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)

Comment 24 Philippe Mathieu-Daudé 2021-01-26 19:23:05 UTC
(In reply to Xueqiang Wei from comment #23)
> Hi Philippe,
> 
> 1. Could you please share the details of your’s nvme device?

[root@virtlab505] # nvme intel id-ctrl /dev/nvme0
NVME Identify Controller:
vid       : 0x8086
ssvid     : 0x8086
sn        : PHKS917200LB375AGN  
mn        : INTEL SSDPED1K375GA                     
fr        : E2010435
...

> Maybe QE needs to purchase a new nvme device.

No, it has to work :)

Comment 25 Philippe Mathieu-Daudé 2021-01-26 19:32:53 UTC
(In reply to Xueqiang Wei from comment #2)
> 6. boot the guest after the installation. do fio test on /home/test
> 
> # fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1
> --filename=/home/test   --ioengine=libaio --size=100M --rwmixread=80
> --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1
> --numjobs=8 --name=job1
> job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> 4096B-4096B, ioengine=libaio, iodepth=16
> ...
> fio-3.19
> Starting 8 processes
> job1: Laying out IO file (1 file / 100MiB)
> Jobs: 8 (f=8): [m(8)][3.5%][eta 08d:21h:09m:48s]]
> 
> 
> 
> After step 2, fio test finished in 5 minutes.
> After step 6, fio test didn't finish after 12 hours.

When the VM "hang" (12h without finishing) there is an error:

qemu-kvm: VFIO_MAP_DMA failed: No space left on device

Looking at the Linux kernel source, we end up in vfio_dma_do_map() in drivers/vfio/vfio_iommu_type1.c:

	if (!iommu->dma_avail) {
		ret = -ENOSPC;
		goto out_unlock;
	}

Alex Williamson said this limit can be changed for testing purposes:

static unsigned int dma_entry_limit __read_mostly = U16_MAX;
MODULE_PARM_DESC(dma_entry_limit,
		 "Maximum number of user DMA mappings per container (65535).");

So I tried:

# modprobe vfio_iommu_type1 dma_entry_limit=$((0xffffff))
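
For reference, a minimal sketch of checking the active limit and making the same override persist across reboots, assuming the parameter is exposed in sysfs on the running kernel (16777215 is 0xffffff, matching the modprobe value above; the modprobe.d file name is arbitrary):

# cat /sys/module/vfio_iommu_type1/parameters/dma_entry_limit
# echo "options vfio_iommu_type1 dma_entry_limit=16777215" > /etc/modprobe.d/vfio-iommu-type1.conf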

And your test passed (with horrible performance):

# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/mnt/fio_nvme_test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][r=115MiB/s,w=29.0MiB/s][r=29.4k,w=7435 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=1398: Wed Jan 27 02:56:22 2021
  read: IOPS=29.7k, BW=116MiB/s (122MB/s)(33.0GiB/300002msec)
    slat (usec): min=3, max=32761, avg=52.90, stdev=77.13
    clat (usec): min=129, max=93768, avg=3334.61, stdev=943.20
     lat (usec): min=193, max=93826, avg=3388.05, stdev=947.23
    clat percentiles (usec):
     |  1.00th=[ 2008],  5.00th=[ 2278], 10.00th=[ 2442], 20.00th=[ 2704],
     | 30.00th=[ 2900], 40.00th=[ 3097], 50.00th=[ 3261], 60.00th=[ 3490],
     | 70.00th=[ 3687], 80.00th=[ 3949], 90.00th=[ 4293], 95.00th=[ 4490],
     | 99.00th=[ 4883], 99.50th=[ 5211], 99.90th=[ 6783], 99.95th=[ 9372],
     | 99.99th=[39060]
   bw (  KiB/s): min=101082, max=127816, per=100.00%, avg=118918.66, stdev=335.38, samples=4784
   iops        : min=25267, max=31954, avg=29728.99, stdev=83.87, samples=4784
  write: IOPS=7433, BW=29.0MiB/s (30.4MB/s)(8712MiB/300002msec); 0 zone resets
    slat (usec): min=3, max=25613, avg=81.57, stdev=99.55
    clat (usec): min=132, max=93707, avg=3595.95, stdev=857.93
     lat (usec): min=149, max=93743, avg=3678.08, stdev=865.10
    clat percentiles (usec):
     |  1.00th=[ 2278],  5.00th=[ 2638], 10.00th=[ 2835], 20.00th=[ 3097],
     | 30.00th=[ 3261], 40.00th=[ 3425], 50.00th=[ 3556], 60.00th=[ 3720],
     | 70.00th=[ 3916], 80.00th=[ 4113], 90.00th=[ 4359], 95.00th=[ 4555],
     | 99.00th=[ 4883], 99.50th=[ 5014], 99.90th=[ 6915], 99.95th=[ 9634],
     | 99.99th=[36963]
   bw (  KiB/s): min=25082, max=33768, per=100.00%, avg=29776.92, stdev=172.10, samples=4784
   iops        : min= 6270, max= 8442, avg=7444.10, stdev=43.02, samples=4784
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.75%, 4=79.55%, 10=19.63%, 20=0.02%, 50=0.02%
  lat (msec)   : 100=0.01%
  cpu          : usr=2.63%, sys=8.63%, ctx=5048257, majf=0, minf=131
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=8906102,2230196,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=33.0GiB (36.5GB), run=300002-300002msec
  WRITE: bw=29.0MiB/s (30.4MB/s), 29.0MiB/s-29.0MiB/s (30.4MB/s-30.4MB/s), io=8712MiB (9135MB), run=300002-300002msec

Disk stats (read/write):
  vda: ios=8902053/2229124, merge=31/29, ticks=24529673/6297896, in_queue=30827568, util=100.00%

I will escalate this problem to Stefan Hajnoczi.

Comment 26 Xueqiang Wei 2021-01-27 03:47:25 UTC
(In reply to Philippe Mathieu-Daudé from comment #25)
> > After step 6, fio test didn't finish after 12 hours.
> 
> When the VM "hang" (12h without finishing) there is an error:
> 
> qemu-kvm: VFIO_MAP_DMA failed: No space left on device
> 

This issue was tracked by Bug 1848881 - qemu-kvm: VFIO_MAP_DMA failed: Invalid argument with nvme://

related comments:
https://bugzilla.redhat.com/show_bug.cgi?id=1848881#c2
https://bugzilla.redhat.com/show_bug.cgi?id=1848881#c3
https://bugzilla.redhat.com/show_bug.cgi?id=1848881#c6
https://bugzilla.redhat.com/show_bug.cgi?id=1848881#c7
https://bugzilla.redhat.com/show_bug.cgi?id=1848881#c8, tested with raw format and q35 machine type, guest installs successfully.

Comment 30 Philippe Mathieu-Daudé 2021-03-02 16:23:16 UTC
(In reply to Philippe Mathieu-Daudé from comment #25)
> (In reply to Xueqiang Wei from comment #2)
> > 6. boot the guest after the installation. do fio test on /home/test
>
> When the VM "hang" (12h without finishing) there is an error:
> 
> qemu-kvm: VFIO_MAP_DMA failed: No space left on device
> 
> Looking at the Linux kernel source, we end in vfio_dma_do_map() in
> drivers/vfio/vfio_iommu_type1.c:
> 
> 	if (!iommu->dma_avail) {
> 		ret = -ENOSPC;
> 		goto out_unlock;
> 	}

This issue has been assigned to a different BZ: bug 1934172

Comment 34 Danilo de Paula 2021-03-23 20:54:02 UTC
Done

Comment 35 Xueqiang Wei 2021-06-10 16:27:15 UTC
Since QE triggers bug testing based on the "Fixed In Version" field, Philippe, could you clear the "Fixed In Version:" field? Thanks.

Comment 36 Philippe Mathieu-Daudé 2021-06-10 16:34:05 UTC
(In reply to Xueqiang Wei from comment #35)
> Since QE triggers bug testing based on the "Fixed In Version" field, Philippe,
> could you clear the "Fixed In Version:" field? Thanks.
Done.

Comment 37 Xueqiang Wei 2021-06-23 01:48:21 UTC
Hi Philippe,

Could you set the DTM? Thanks.

Comment 38 Klaus Heinrich Kiwi 2021-06-28 18:41:54 UTC
Are there any issues or code that still need to be tracked through this BZ? I'm a bit confused, since the only apparent issue was closed with bug 1934172.

Comment 39 Philippe Mathieu-Daudé 2021-07-01 17:15:52 UTC
This bug is blocked by bug 1848881.

Comment 43 Stefan Hajnoczi 2021-07-26 14:32:21 UTC
That's fine.

Comment 45 Stefan Hajnoczi 2021-08-02 16:17:41 UTC
This bug is ready to go since Bug 1848881 has now been fixed.

There is no code change associated with this bug. It is a high-level bug that tracks the QEMU NVMe userspace driver.

Comment 47 Tingting Mao 2021-08-18 10:32:16 UTC
Hi Stefan,

I tried to verify this bug as below. The I/O performance in the virtual machine improved a lot, but it is still slower than the performance on the host directly (read: ~19% degradation; write: ~35% degradation). Could you please help check whether the result is okay?

Thanks.



Tested env:
qemu-kvm-6.0.0-27.module+el8.5.0+12121+c40c8708
kernel-modules-4.18.0-330.el8.x86_64


Steps:
Test the NVMe block performance via virtual guest:
1. Install guest on the NVMe disk 
/usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine q35 \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 15360  \
    -smp 16,maxcpus=16,cores=8,threads=1,dies=1,sockets=2  \
    -cpu 'Haswell-noTSX',+kvm_pv_unhalt \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -object iothread,id=iothread0 \
    -object iothread,id=iothread1 \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:1c:0c:0d:e3:4c,id=idjmZXQS,netdev=idEFQ4i1,bus=pcie-root-port-3,addr=0x0  \
    -netdev tap,id=idEFQ4i1,vhost=on  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -monitor stdio \
    -chardev socket,server=on,path=/var/tmp/monitor-qmpmonitor1-20210721-024113-AsZ7KYro,id=qmp_id_qmpmonitor1,wait=off  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -device pcie-root-port,id=pcie-root-port-5,port=0x5,addr=0x1.0x5,bus=pcie.0,chassis=5 \
    -device virtio-scsi-pci,id=virtio_scsi_pci1,bus=pcie-root-port-5,addr=0x0,iothread=iothread1 \
    -blockdev node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-only=on,discard=unmap \
    -blockdev node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,discard=unmap \
    -device scsi-hd,id=nvme1,drive=drive_nvme1 \
    -device pcie-root-port,id=pcie-root-port-6,port=0x6,addr=0x1.0x6,bus=pcie.0,chassis=6 \
    -device virtio-scsi-pci,id=virtio_scsi_pci2,bus=pcie-root-port-6,addr=0x0 \
    -blockdev node-name=file_cd1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/iso/linux/RHEL-8.4.0-20210503.1-x86_64-dvd1.iso,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_cd1,driver=raw,read-only=on,cache.direct=on,cache.no-flush=off,file=file_cd1 \
    -device scsi-cd,id=cd1,drive=drive_cd1,write-cache=on \

2. Do fio test in the guest
# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/home/test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][r=651MiB/s,w=161MiB/s][r=167k,w=41.3k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=39152: Wed Aug 18 05:44:37 2021
  read: IOPS=175k, BW=685MiB/s (719MB/s)(201GiB/300003msec)
    slat (usec): min=2, max=1084, avg= 5.59, stdev= 4.30
    clat (usec): min=12, max=8797, avg=640.88, stdev=567.03
     lat (usec): min=32, max=8800, avg=646.58, stdev=567.00
    clat percentiles (usec):
     |  1.00th=[   99],  5.00th=[  149], 10.00th=[  190], 20.00th=[  251],
     | 30.00th=[  314], 40.00th=[  400], 50.00th=[  502], 60.00th=[  586],
     | 70.00th=[  676], 80.00th=[  791], 90.00th=[ 1303], 95.00th=[ 2089],
     | 99.00th=[ 2737], 99.50th=[ 2900], 99.90th=[ 3392], 99.95th=[ 4178],
     | 99.99th=[ 5604]
   bw (  KiB/s): min=524920, max=826223, per=100.00%, avg=702643.65, stdev=7011.59, samples=4784
   iops        : min=131230, max=206555, avg=175660.57, stdev=1752.91, samples=4784
  write: IOPS=43.9k, BW=171MiB/s (180MB/s)(50.2GiB/300003msec); 0 zone resets
    slat (usec): min=2, max=1209, avg= 6.17, stdev= 4.84
    clat (usec): min=3, max=11172, avg=322.63, stdev=229.52
     lat (usec): min=23, max=11177, avg=328.90, stdev=229.53
    clat percentiles (usec):
     |  1.00th=[   33],  5.00th=[   63], 10.00th=[   94], 20.00th=[  149],
     | 30.00th=[  194], 40.00th=[  233], 50.00th=[  269], 60.00th=[  310],
     | 70.00th=[  367], 80.00th=[  457], 90.00th=[  652], 95.00th=[  783],
     | 99.00th=[ 1012], 99.50th=[ 1090], 99.90th=[ 1631], 99.95th=[ 2114],
     | 99.99th=[ 3490]
   bw (  KiB/s): min=130631, max=209416, per=100.00%, avg=175726.23, stdev=1804.07, samples=4784
   iops        : min=32657, max=52354, avg=43931.25, stdev=451.02, samples=4784
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.69%, 100=2.37%
  lat (usec)   : 250=21.71%, 500=31.58%, 750=23.88%, 1000=9.46%
  lat (msec)   : 2=5.85%, 4=4.43%, 10=0.05%, 20=0.01%
  cpu          : usr=4.94%, sys=13.42%, ctx=21400099, majf=0, minf=155
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=52625300,13161245,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=685MiB/s (719MB/s), 685MiB/s-685MiB/s (719MB/s-719MB/s), io=201GiB (216GB), run=300003-300003msec
  WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=50.2GiB (53.9GB), run=300003-300003msec

Disk stats (read/write):
    dm-2: ios=52611588/13157829, merge=0/0, ticks=32267355/3792809, in_queue=36060164, util=100.00%, aggrios=52625300/13161320, aggrmerge=0/7, aggrticks=32357463/3825128, aggrin_queue=36182591, aggrutil=100.00%
  sda: ios=52625300/13161320, merge=0/7, ticks=32357463/3825128, in_queue=36182591, util=100.00%



Test the NVMe block performance via host directly:
1. Create a filesystem on the NVMe block device on the host
# lsblk 
NAME                            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                               8:0    0   372G  0 disk 
├─sda1                            8:1    0     1G  0 part /boot
└─sda2                            8:2    0   371G  0 part 
  ├─rhel_dell--per740xd--01-root
  │                             253:0    0    70G  0 lvm  /
  ├─rhel_dell--per740xd--01-swap
  │                             253:1    0  31.4G  0 lvm  [SWAP]
  └─rhel_dell--per740xd--01-home
                                253:2    0 269.7G  0 lvm  /home
sdb                               8:16   0 558.4G  0 disk 
nvme0n1                         259:0    0 745.2G  0 disk 
├─nvme0n1p1                     259:1    0   400G  0 part /mnt
└─nvme0n1p2                     259:2    0 345.2G  0 part 
[root@dell-per740xd-01 ~]# mkfs.xfs /dev/nvme0n1p1
meta-data=/dev/nvme0n1p1         isize=512    agcount=4, agsize=26214400 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=104857600, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=51200, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Discarding blocks...Done.

2. Mount the NVMe partition and run the fio test
# mount /dev/nvme0n1p1 /mnt/
# fio --rw=randrw --bs=4k --iodepth=16 --runtime=5m --direct=1 --filename=/mnt/test --ioengine=libaio --size=100M --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --numjobs=8 --name=job1
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 8 processes
job1: Laying out IO file (1 file / 100MiB)
Jobs: 8 (f=8): [m(8)][100.0%][r=1065MiB/s,w=265MiB/s][r=273k,w=67.9k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=3204: Wed Aug 18 06:21:30 2021
  read: IOPS=271k, BW=1059MiB/s (1111MB/s)(310GiB/300003msec)
    slat (nsec): min=1564, max=490348, avg=4939.65, stdev=3115.13
    clat (usec): min=3, max=10776, avg=440.68, stdev=675.77
      lat (usec): min=25, max=10781, avg=445.68, stdev=675.66
    clat percentiles (usec):
     |  1.00th=[   44],  5.00th=[   79], 10.00th=[   86], 20.00th=[   96],
     | 30.00th=[  108], 40.00th=[  124], 50.00th=[  149], 60.00th=[  188],
     | 70.00th=[  265], 80.00th=[  486], 90.00th=[ 1598], 95.00th=[ 2212],
     | 99.00th=[ 2704], 99.50th=[ 2999], 99.90th=[ 4359], 99.95th=[ 4621],
     | 99.99th=[ 5211]
   bw (  MiB/s): min=  982, max= 1172, per=100.00%, avg=1060.89, stdev= 3.85, samples=4784
   iops        : min=251512, max=300036, avg=271587.33, stdev=985.88, samples=4784
  write: IOPS=67.8k, BW=265MiB/s (278MB/s)(77.6GiB/300003msec); 0 zone resets
    slat (nsec): min=1644, max=376985, avg=5759.48, stdev=4071.35
    clat (usec): min=2, max=8316, avg=96.50, stdev=286.17
     lat (usec): min=16, max=8320, avg=102.33, stdev=286.27
    clat percentiles (usec):
     |  1.00th=[   18],  5.00th=[   20], 10.00th=[   22], 20.00th=[   26],
     | 30.00th=[   31], 40.00th=[   36], 50.00th=[   42], 60.00th=[   50],
     | 70.00th=[   61], 80.00th=[   79], 90.00th=[  125], 95.00th=[  219],
     | 99.00th=[ 1516], 99.50th=[ 2409], 99.90th=[ 3490], 99.95th=[ 3982],
     | 99.99th=[ 5342]
   bw (  KiB/s): min=245376, max=302080, per=100.00%, avg=271626.80, stdev=1127.93, samples=4784
   iops        : min=61344, max=75520, avg=67906.69, stdev=281.98, samples=4784
  lat (usec)   : 4=0.01%, 10=0.01%, 20=1.34%, 50=11.77%, 100=23.47%
  lat (usec)   : 250=37.36%, 500=9.77%, 750=3.51%, 1000=1.79%
  lat (msec)   : 2=5.19%, 4=5.64%, 10=0.16%, 20=0.01%
  cpu          : usr=8.73%, sys=19.55%, ctx=53203380, majf=0, minf=761
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=81348879,20339988,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=1059MiB/s (1111MB/s), 1059MiB/s-1059MiB/s (1111MB/s-1111MB/s), io=310GiB (333GB), run=300003-300003msec
  WRITE: bw=265MiB/s (278MB/s), 265MiB/s-265MiB/s (278MB/s-278MB/s), io=77.6GiB (83.3GB), run=300003-300003msec

Disk stats (read/write):
  nvme0n1: ios=81328363/20334880, merge=0/3, ticks=35267074/1549526, in_queue=36816600, util=100.00%



Results:
The read IOPS, host directly vs. virtual machine: 217k vs. 175k (~19% degradation)
The write IOPS, host directly vs. virtual machine: 67.8k vs. 43.9k (~35% degradation)

Comment 48 Stefan Hajnoczi 2021-08-20 06:59:22 UTC
Overhead compared to bare metal is expected because further optimizations are still in development and will be added later (separate from this BZ).

Some ways to make the performance comparison more direct (but there will still be a gap, so don't worry about rerunning right now); a rough sketch of the adjusted setup follows the list:
- Use virtio-blk instead of virtio-scsi (overhead is generally lower than virtio-scsi).
- Use filename=$DEV where DEV is a block device (virtio-blk in the guest and an NVMe device/partition on the host) to avoid extra software layers that make it harder to compare results.
- Remove -blockdev node-name=drive_nvme1,driver=raw,file=nvme_image1,read-only=off,discard=unmap. It's not needed and adds a little overhead. Use the nvme blockdev node directly instead.
- Run another fio job with iodepth=1 numjobs=1 to measure latency. The iodepth=16 numjobs=8 job tries to saturate the drive by queuing up many I/O requests, which is interesting, but it's also useful to benchmark a latency-sensitive workload to see the latency of a single request in isolation.
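
A minimal sketch of those adjustments, assuming a dedicated data disk (writing to the raw device destroys its contents, so it must not hold the guest installation) that shows up as /dev/vdb in the guest; bus, iothread and node names are reused from the command line in comment 47. Replace the virtio-scsi controller, the raw-format blockdev and the scsi-hd device so that the nvme node backs a virtio-blk device directly:

    -blockdev node-name=nvme_image1,driver=nvme,device=0000:bc:00.0,namespace=1,auto-read-only=on,discard=unmap \
    -device virtio-blk-pci,drive=nvme_image1,id=nvme1,iothread=iothread1,bus=pcie-root-port-5,addr=0x0 \

In the guest, point fio at the block device itself instead of a file on a filesystem, and add a single-request run to measure latency:

# fio --rw=randrw --bs=4k --iodepth=16 --numjobs=8 --runtime=5m --direct=1 --filename=/dev/vdb --ioengine=libaio --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --name=job1
# fio --rw=randrw --bs=4k --iodepth=1 --numjobs=1 --runtime=5m --direct=1 --filename=/dev/vdb --ioengine=libaio --rwmixread=80 --randrepeat=0 --norandommap=1 --group_reporting=1 --time_based=1 --name=latency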

Thanks!

Comment 49 Tingting Mao 2021-08-20 07:10:53 UTC
Set this bug as verified accordingly, thanks Stefan.

