Bug 2024544 - Fio workers hang when running fio with 32 jobs, iodepth 32, and QEMU's userspace NVMe driver
Summary: Fio workers hang when running fio with 32 jobs, iodepth 32, and QEMU's userspace NVMe driver
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Stefan Hajnoczi
QA Contact: Tingting Mao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-11-18 10:13 UTC by Lukáš Doktor
Modified: 2022-05-17 12:31 UTC (History)
CC List: 7 users

Fixed In Version: qemu-kvm-6.2.0-4.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-17 12:25:10 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Fio job resulting in hang (295 bytes, text/plain), attached 2021-11-18 10:13 UTC by Lukáš Doktor
Libvirt xml file with nvme disk which is still resulting in hanged fio jobs (3.90 KB, application/xml), attached 2022-01-15 17:44 UTC by Lukáš Doktor


Links
Gitlab: redhat/centos-stream/src qemu-kvm merge_requests 58 (last updated 2022-01-06 10:36:36 UTC)
Red Hat Issue Tracker: RHELPLAN-103135 (last updated 2021-11-18 10:14:20 UTC)
Red Hat Product Errata: RHBA-2022:2307 (last updated 2022-05-17 12:25:57 UTC)

Description Lukáš Doktor 2021-11-18 10:13:45 UTC
Created attachment 1842534 [details]
Fio job resulting in hang

Description of problem:
I'm trying to saturate a passthrough NVMe disk by using a high iodepth and a high number of jobs. With 32 jobs and iodepth 32 it never finishes, the guest becomes unavailable (not able to ssh in), and it won't recover.

Version-Release number of selected component (if applicable):
Host: RHEL-9.0.0-20211109.d.7
Guest: RHEL-9.0.0-20211109.d.7
* qemu-kvm-core-6.1.0-6.el9.x86_64
* libvirt-7.9.0-1.el9.x86_64

How reproducible:
Always

Steps to Reproduce:
1. grubby --args="intel_iommu=on iommu=pt" --update-kernel="$(grubby --default-kernel)"
2. Download RHEL cloudinit image
3. virt-install --import --disk '/var/lib/libvirt/images/RHEL-9.0.0-20211109.d.7-virtlab722.DefaultLibvirt0.qcow2' --memory '20480' --name 'virtlab722.DefaultLibvirt0' --os-variant 'rhel8.0' --vcpus '10' --serial file,path='/var/log/libvirt/RHEL-9.0.0-20211109.d.7-virtlab722.DefaultLibvirt0.qcow2__serial.log' --disk format=raw,discard=unmap,type=nvme,source.type=pci,source.address.domain=0x0000,source.address.bus=0x89,source.address.slot=0x00,source.address.function=0x0,source.namespace=1
4. In the guest, create the provided fio.job (a sketch of its contents is shown below)
5. In the guest, run: fio fio.job
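
The attached fio.job is not reproduced inline here; the following is a sketch based on the job files quoted later in this bug, with iodepth and numjobs set to the reported 32/32 and the runtime matching the expected 30s. The actual attachment may differ in details.

    # cat > fio.job <<'EOF'
    [global]
    bs=4k
    runtime=30
    ioengine=libaio
    iodepth=32
    direct=1
    time_based=1

    [job-/dev/vdb]
    filename=/dev/vdb
    rw=read
    size=4096M
    numjobs=32
    EOF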

Actual results:
job-/dev/vdb: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.27
Starting 32 processes
^Cbs: 32 (f=0): [/(32)][-.-%][eta 00m:35s]
fio: terminating on signal 2
fio: job 'job-/dev/vdb' (state=3) hasn't exited in 300 seconds, it appears to be stuck. Doing forceful exit of this job.
[the same "hasn't exited in 300 seconds" message is printed once for each of the 32 jobs]
^C
fio: terminating on signal 2
^C
fio: terminating on signal 2


Expected results:
It should finish after 30s

Additional info:
The fio process starts, the counter goes from ~32s down to 0s, then it hangs for a bit and shows the 35s eta. Then it hangs indefinitely.

Sending ctrl+c produces the "hasn't exited in 300 seconds..." message, but the process still won't finish. Subsequent ctrl+c presses only produce "fio: terminating on signal 2"; the process never finishes.

SSH to the machine becomes impossible, there is nothing interesting on the serial console, and ctrl+Z on the fio process hangs, so it seems to be stuck in I/O.

Killing the qemu process with SIGTERM, "virsh destroy", "virsh reset", etc. does not help; only kill -9 destroys the machine.
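
When the guest wedges like this, a backtrace of all QEMU threads on the host is usually the most useful first artifact to capture. A minimal sketch, not taken from the original report; the domain name is the one used elsewhere in this bug:

    # Find the QEMU process of the stuck domain and dump all thread backtraces
    pid=$(pgrep -f 'guest=virtlab722.DefaultLibvirt0')
    gdb -p "$pid" --batch -ex 'thread apply all bt'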

Comment 2 Lukáš Doktor 2021-11-18 10:37:14 UTC
Note that I tried various settings; it:

* hung with iodepth 32 and numjobs 16
* passed with iodepth 32 and numjobs 10 (guest has 10 cpus)
* hung with iodepth 64 and numjobs 10
  - this combination actually reported some bytes being read, but after a few seconds it stopped and hung

I also tried upstream qemu 8d5fcb1990bc64b62c0bc12121fe510940be5664, which resulted in the same behaviour.

I also forgot to mention that the qemu CPU utilization stays high (200-450%) even long after fio should have finished.

The next thing I tried was sshing to the machine prior to the hang. I managed to issue "killall fio", which attempted to kill the fio processes (output appeared in the other ssh session) but without success. Running "killall -9 fio" produced no new output and the fio processes kept running. Any I/O in the other session (e.g. ls) hung.
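
A quick way to confirm from inside the guest that the fio workers are stuck in uninterruptible I/O sleep rather than spinning is standard procps/procfs tooling (these commands are not from the original report):

    # Stuck workers typically show state "D"; wchan hints at where they sleep in the kernel
    ps -eo pid,stat,wchan:32,cmd | grep '[f]io'
    # Kernel-side stack of one stuck worker; <pid> is a placeholder, needs root
    cat /proc/<pid>/stack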

Comment 3 Tingting Mao 2021-11-19 05:55:46 UTC
Hi Lukas,

I have tried in my environment, and the issue does not occur. Could you please check again? Or I could share my host machine with you if needed.


Tested with:
qemu-kvm-6.1.0-6.el9
kernel-5.14.0-0.rc4.35.el9.x86_64
(host)# lsblk 
NAME                           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0                            7:0    0  10.2G  1 loop /tmp/avocado_avocado.utils.iso9660hm7alik6
                                                         /tmp/avocado_avocado.utils.iso966020i3ee7j
sda                              8:0    0 558.9G  0 disk 
├─sda1                           8:1    0     1G  0 part /boot
└─sda2                           8:2    0 557.9G  0 part 
  ├─rhel_ibm--x3650m5--09-root 253:0    0    70G  0 lvm  /
  ├─rhel_ibm--x3650m5--09-swap 253:1    0  31.2G  0 lvm  [SWAP]
  └─rhel_ibm--x3650m5--09-home 253:2    0 456.7G  0 lvm  /home
sr0                             11:0    1  1024M  0 rom  
sr1                             11:1    1  1024M  0 rom  
nvme0n1                        259:0    0 372.6G  0 disk 



Steps:
1. Boot up a guest with an NVMe disk (passthrough)
/usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine q35,memory-backend=mem-machine_mem \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 4096 \
    -object memory-backend-ram,size=4096M,id=mem-machine_mem  \
    -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2  \
    -cpu 'Broadwell',+kvm_pv_unhalt \
    -chardev socket,path=/tmp/avocado_c9_4zzvx/monitor-qmpmonitor1-20211118-213644-6foHYCIo,id=qmp_id_qmpmonitor1,wait=off,server=on  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,path=/tmp/avocado_c9_4zzvx/monitor-catch_monitor-20211118-213644-6foHYCIo,id=qmp_id_catch_monitor,wait=off,server=on  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idS9Utfv \
    -chardev socket,path=/tmp/avocado_c9_4zzvx/serial-serial0-20211118-213644-6foHYCIo,id=chardev_serial0,wait=off,server=on \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20211118-213644-6foHYCIo,path=/tmp/avocado_c9_4zzvx/seabios-20211118-213644-6foHYCIo,server=on,wait=off \
    -device isa-debugcon,chardev=seabioslog_id_20211118-213644-6foHYCIo,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=native,filename=/home/kvm_autotest_root/images/rhel850-64-virtio.raw,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=raw,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,write-cache=on,bus=pcie-root-port-2,addr=0x0 \
    -blockdev node-name=file_disk1,driver=host_device,auto-read-only=on,discard=unmap,aio=native,filename=/dev/nvme0n1,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_disk1,driver=raw,read-only=off,cache.direct=on,cache.no-flush=off,file=file_disk1 \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-blk-pci,id=disk1,drive=drive_disk1,bootindex=1,write-cache=on,bus=pcie-root-port-3,addr=0x0 \
    -device pcie-root-port,id=pcie-root-port-4,port=0x4,addr=0x1.0x4,bus=pcie.0,chassis=5 \
    -device virtio-net-pci,mac=9a:37:37:37:37:9e,id=id33GX8j,netdev=idxsHdTt,bus=pcie-root-port-4,addr=0x0  \
    -netdev tap,id=idxsHdTt  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -monitor stdio \
    -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=6

2. Do fio in the guest on the NVMe data disk
2.1 fio with iodepth 64 and 32 jobs, run twice
# fio fio.job 
job-/dev/vdb: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
...
fio-3.19
Starting 32 processes
Jobs: 32 (f=32): [R(32)][100.0%][r=1227MiB/s][r=314k IOPS][eta 00m:00s]
job-/dev/vdb: (groupid=0, jobs=1): err= 0: pid=1906: Fri Nov 19 11:19:52 2021
  read: IOPS=9668, BW=37.8MiB/s (39.6MB/s)(11.1GiB/300001msec)
    slat (nsec): min=2000, max=103047k, avg=97805.27, stdev=1291463.90
    clat (usec): min=6, max=105033, avg=6519.75, stdev=9554.72
     lat (usec): min=418, max=105062, avg=6618.18, stdev=9610.37
    clat percentiles (usec):
     |  1.00th=[ 1745],  5.00th=[ 1827], 10.00th=[ 1876], 20.00th=[ 1926],
     | 30.00th=[ 1958], 40.00th=[ 2008], 50.00th=[ 2040], 60.00th=[ 2089],
     | 70.00th=[ 2147], 80.00th=[11207], 90.00th=[20055], 95.00th=[24249],
     | 99.00th=[38011], 99.50th=[45351], 99.90th=[56886], 99.95th=[63177],
     | 99.99th=[76022]
......
......

Run status group 0 (all jobs):
   READ: bw=1227MiB/s (1287MB/s), 37.5MiB/s-40.5MiB/s (39.3MB/s-42.4MB/s), io=359GiB (386GB), run=300001-300013msec

Disk stats (read/write):
  vdb: ios=95811627/0, merge=5309/0, ticks=14913997/0, in_queue=14913997, util=100.00%

Note:
# cat fio.job
[global]
bs=4k
runtime=300
ioengine=libaio
iodepth=64
direct=1
sync=0
time_based=1
clocksource=gettimeofday
ramp_time=5
write_bw_log=fio
write_iops_log=fio
write_lat_log=fio
log_avg_msec=1000
write_hist_log=fio
log_hist_msec=10000

[job-/dev/vdb]
filename=/dev/vdb
rw=read
size=4096M
numjobs=32

2.2 fio with iodepth 32 and 32 jobs, run twice


Results:
Both step 2.1 and step 2.2 finished with no errors, and the elapsed time was 5 min (300 s) each.

Comment 4 Lukáš Doktor 2021-11-19 08:06:20 UTC
Hello Tingting, I'm sorry for incorrectly stating that it is a passthrough device; it actually uses QEMU's userspace nvme driver. You can use:

    virt-install --import --disk '/var/lib/libvirt/images/RHEL-9.0.0-20211109.d.7-virtlab722.DefaultLibvirt0.qcow2' --memory '20480' --name 'virtlab722.DefaultLibvirt0' --os-variant 'rhel8.0' --vcpus '10' --serial file,path='/var/log/libvirt/RHEL-9.0.0-20211109.d.7-virtlab722.DefaultLibvirt0.qcow2__serial.log' --disk format=raw,discard=unmap,type=nvme,source.type=pci,source.address.domain=0x0000,source.address.bus=0x89,source.address.slot=0x00,source.address.function=0x0,source.namespace=1

to create the machine; on my system it results in something like this:

/usr/local/bin/qemu-system-x86_64
-name guest=virtlab722.DefaultLibvirt0,debug-threads=on
-S
-object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-5-virtlab722.DefaultLi/master-key.aes"}
-machine pc-q35-6.2,accel=kvm,usb=off,dump-guest-core=off,memory-backend=pc.ram
-cpu Cascadelake-Server,ss=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,umip=on,pku=on,md-clear=on,stibp=on,arch-capabilities=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rdctl-no=on,ibrs-all=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,tsx-ctrl=on,hle=off,rtm=off
-m 20480
-object {"qom-type":"memory-backend-ram","id":"pc.ram","size":21474836480}
-overcommit mem-lock=off
-smp 10,sockets=10,cores=1,threads=1
-uuid 0d7d3275-7f24-4775-a94d-a6d89ee2659d
-no-user-config
-nodefaults
-chardev socket,id=charmonitor,fd=22,server=on,wait=off
-mon chardev=charmonitor,id=monitor,mode=control
-rtc base=utc,driftfix=slew
-global kvm-pit.lost_tick_policy=delay
-no-hpet
-no-shutdown
-global ICH9-LPC.disable_s3=1
-global ICH9-LPC.disable_s4=1
-boot strict=on
-device {"driver":"pcie-root-port","port":16,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x2"}
-device {"driver":"pcie-root-port","port":17,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x2.0x1"}
-device {"driver":"pcie-root-port","port":18,"chassis":3,"id":"pci.3","bus":"pcie.0","addr":"0x2.0x2"}
-device {"driver":"pcie-root-port","port":19,"chassis":4,"id":"pci.4","bus":"pcie.0","addr":"0x2.0x3"}
-device {"driver":"pcie-root-port","port":20,"chassis":5,"id":"pci.5","bus":"pcie.0","addr":"0x2.0x4"}
-device {"driver":"pcie-root-port","port":21,"chassis":6,"id":"pci.6","bus":"pcie.0","addr":"0x2.0x5"}
-device {"driver":"pcie-root-port","port":22,"chassis":7,"id":"pci.7","bus":"pcie.0","addr":"0x2.0x6"}
-device {"driver":"pcie-root-port","port":23,"chassis":8,"id":"pci.8","bus":"pcie.0","addr":"0x2.0x7"}
-device {"driver":"qemu-xhci","p2":15,"p3":15,"id":"usb","bus":"pci.2","addr":"0x0"}
-device {"driver":"virtio-serial-pci","id":"virtio-serial0","bus":"pci.3","addr":"0x0"}
-blockdev {"driver":"file","filename":"/var/lib/libvirt/images/RHEL-9.0.0-20211117.d.7-virtlab722.DefaultLibvirt0.qcow2","node-name":"libvirt-2-storage","auto-read-only":true,"discard":"unmap"}
-blockdev {"node-name":"libvirt-2-format","read-only":false,"driver":"qcow2","file":"libvirt-2-storage","backing":null}
-device {"driver":"virtio-blk-pci","bus":"pci.4","addr":"0x0","drive":"libvirt-2-format","id":"virtio-disk0","bootindex":1}
-blockdev {"driver":"nvme","device":"0000:89:00.0","namespace":1,"node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}
-blockdev {"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","driver":"raw","file":"libvirt-1-storage"}
-device {"driver":"virtio-blk-pci","bus":"pci.5","addr":"0x0","drive":"libvirt-1-format","id":"virtio-disk1"}
-netdev tap,fd=24,id=hostnet0,vhost=on,vhostfd=25
-device {"driver":"virtio-net-pci","netdev":"hostnet0","id":"net0","mac":"52:54:00:2c:88:b0","bus":"pci.1","addr":"0x0"}
-add-fd set=3,fd=27
-chardev file,id=charserial0,path=/dev/fdset/3,append=on
-device {"driver":"isa-serial","chardev":"charserial0","id":"serial0"}
-chardev socket,id=charchannel0,fd=26,server=on,wait=off
-device {"driver":"virtserialport","bus":"virtio-serial0.0","nr":1,"chardev":"charchannel0","id":"channel0","name":"org.qemu.guest_agent.0"}
-device {"driver":"usb-tablet","id":"input0","bus":"usb.0","port":"1"}
-audiodev id=audio1,driver=none
-vnc 127.0.0.1:0,audiodev=audio1
-device {"driver":"VGA","id":"video0","vgamem_mb":16,"bus":"pcie.0","addr":"0x1"}
-device {"driver":"virtio-balloon-pci","id":"balloon0","bus":"pci.6","addr":"0x0"}
-object {"qom-type":"rng-random","id":"objrng0","filename":"/dev/urandom"}
-device {"driver":"virtio-rng-pci","rng":"objrng0","id":"rng0","bus":"pci.7","addr":"0x0"}
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny
-msg timestamp=on

On my system it reproduces 100% of the time.
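
As a sanity check, which block driver actually backs the data disk of a running domain can be confirmed over QMP. A sketch, assuming the domain name above; the storage node of the data disk should report "drv": "nvme" when QEMU's userspace NVMe driver is in use:

    virsh qemu-monitor-command virtlab722.DefaultLibvirt0 --pretty \
        '{"execute": "query-named-block-nodes"}' | grep -E '"node-name"|"drv"'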

Comment 6 Tingting Mao 2021-11-22 02:31:37 UTC
Tried with the userspace NVMe driver approach, and it hangs on my system too.


Tested with:
qemu-kvm-6.1.0-6.el9
kernel-5.14.0-0.rc4.35.el9.x86_64


Steps:
1. Boot up a guest with an NVMe disk (userspace driver; see the note on vfio-pci binding at the end of this comment)
/usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine q35,memory-backend=mem-machine_mem \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 4096 \
    -object memory-backend-ram,size=4096M,id=mem-machine_mem  \
    -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2  \
    -cpu 'Broadwell',+kvm_pv_unhalt \
    -chardev socket,path=/tmp/avocado_c9_4zzvx/monitor-qmpmonitor1-20211118-213644-6foHYCIo,id=qmp_id_qmpmonitor1,wait=off,server=on  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,path=/tmp/avocado_c9_4zzvx/monitor-catch_monitor-20211118-213644-6foHYCIo,id=qmp_id_catch_monitor,wait=off,server=on  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idS9Utfv \
    -chardev socket,path=/tmp/avocado_c9_4zzvx/serial-serial0-20211118-213644-6foHYCIo,id=chardev_serial0,wait=off,server=on \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20211118-213644-6foHYCIo,path=/tmp/avocado_c9_4zzvx/seabios-20211118-213644-6foHYCIo,server=on,wait=off \
    -device isa-debugcon,chardev=seabioslog_id_20211118-213644-6foHYCIo,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=native,filename=/home/kvm_autotest_root/images/rhel900-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,write-cache=on,bus=pcie-root-port-2,addr=0x0 \
    -blockdev node-name=file_disk1,driver=nvme,auto-read-only=on,discard=unmap,device=0000:bc:00.0,namespace=1 \
    -blockdev node-name=drive_disk1,driver=raw,read-only=off,cache.direct=on,cache.no-flush=off,file=file_disk1 \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-blk-pci,id=disk1,drive=drive_disk1,bootindex=1,write-cache=on,bus=pcie-root-port-3,addr=0x0 \
    -device pcie-root-port,id=pcie-root-port-4,port=0x4,addr=0x1.0x4,bus=pcie.0,chassis=5 \
    -device virtio-net-pci,mac=9a:37:37:37:37:9e,id=id33GX8j,netdev=idxsHdTt,bus=pcie-root-port-4,addr=0x0  \
    -netdev tap,id=idxsHdTt  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -monitor stdio \
    -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=6

2. Fio test in the guest
* hung with iodepth 32 and numjobs 32
* hung with iodepth 32 and numjobs 16
* passed with iodepth 32 and numjobs 10 (guest has 10 cpus)
* hung with iodepth 64 and numjobs 10

Note:
(guest)# cat fio.job
[global]
bs=4k
runtime=300
ioengine=libaio
iodepth=64
direct=1
sync=0
time_based=1
clocksource=gettimeofday
ramp_time=5
write_bw_log=fio
write_iops_log=fio
write_lat_log=fio
log_avg_msec=1000
write_hist_log=fio
log_hist_msec=10000

[job-/dev/vdb]
filename=/dev/vdb
rw=read
size=4096M
numjobs=10
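
Side note on step 1: for -blockdev driver=nvme to claim the controller, the device first has to be detached from the host nvme driver and bound to vfio-pci (libvirt does this automatically for managed nvme disks; a bare qemu-kvm invocation does not). A minimal sketch using the PCI address from the command line above; exact steps may differ per setup:

    modprobe vfio-pci
    echo 0000:bc:00.0 > /sys/bus/pci/devices/0000:bc:00.0/driver/unbind
    echo vfio-pci > /sys/bus/pci/devices/0000:bc:00.0/driver_override
    echo 0000:bc:00.0 > /sys/bus/pci/drivers_probe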

Comment 7 Stefan Hajnoczi 2021-12-06 16:37:57 UTC
I was unable to reproduce the hang. Is there a machine I can access to debug this issue? Thanks!

Comment 8 Stefan Hajnoczi 2021-12-08 15:27:20 UTC
Patch posted upstream:
https://patchew.org/QEMU/20211208152246.244585-1-stefanha@redhat.com/

Comment 11 Lukáš Doktor 2022-01-15 17:44:00 UTC
Created attachment 1850988 [details]
Libvirt xml file with nvme disk which is still resulting in hanged fio jobs

Hello Stefan, I'm not sure whether it's related, but I'm still having trouble with my attached nvme disks even with your fix. This time I'm attaching the disk via:

    <disk type="block" device="disk">
      <driver name="qemu" type="raw" cache="none" discard="unmap" io="native" iothread="1" packed="on"/>
      <source dev="/dev/nvme0n1"/>
      <target dev="vdb" bus="virtio"/>
    </disk>

and even with 10 jobs and iodepth 8 I'm getting failures. I tried the in-RHEL9 qemu as well as v6.2.0 and a recent qemu master (91f5f7a5df1fda8c34677a7c49ee8a4bb5b56a36, which includes your fix), using this fio job:

        [global]
        bs=4k
        runtime=300
        ioengine=libaio
        iodepth=2
        direct=1
        sync=0
        time_based=1
        clocksource=gettimeofday
        ramp_time=5
        write_bw_log=fio
        write_iops_log=fio
        write_lat_log=fio
        log_avg_msec=1000
        write_hist_log=fio
        log_hist_msec=10000

        [job-/dev/vdb]
        filename=/dev/vdb
        rw=read
        size=4096M
        numjobs=10

With 15 consecutive runs it always hangs; with 3 consecutive runs it sometimes survives but usually fails.

Note that this time the machine can be destroyed via "virsh destroy $machine", so it's not a complete hang, but killing the fio processes is impossible even via `killall -9 fio`. strace on the main process returns:

    newfstatat(AT_FDCWD, "/tmp/fio-dump-status", 0x7ffe28c9d550, 0) = -1 ENOENT (No such file or directory)
    clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, NULL) = 0
    wait4(1372, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    wait4(1373, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    wait4(1374, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    wait4(1375, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    wait4(1376, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    wait4(1377, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    wait4(1378, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    wait4(1380, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    newfstatat(AT_FDCWD, "/tmp/fio-dump-status", 0x7ffe28c9d550, 0) = -1 ENOENT (No such file or directory)
    clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, NULL) = 0
    wait4(1372, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    wait4(1373, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    wait4(1374, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    wait4(1375, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    wait4(1376, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    wait4(1377, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    wait4(1378, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    wait4(1380, 0x7ffe28c9e5c4, WNOHANG, NULL) = 0
    newfstatat(AT_FDCWD, "/tmp/fio-dump-status", 0x7ffe28c9d550, 0) = -1 ENOENT (No such file or directory)
    clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, ^Cstrace: Process 1360 detached

and the individual threads:

    strace: Process 1372 attached
    io_getevents(0x7f4bb7189000, 1, 1, ^Cstrace: Process 1372 detached

Note that the machine is booted with:

    BOOT_IMAGE=(hd0,gpt3)/boot/vmlinuz-5.14.0-1.7.1.el9.x86_64 root=UUID=b0bb50ab-82ac-45de-bbd8-51a4314e7719 console=tty0 console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M

and rc.local:

    #!/bin/bash -x
    # Variables
    # Number of pages per a single numa node (eg. 10)
    MEM_PER_NODE='10'
    # Number of numa nodes (eg. 2)
    NODES='2'
    # Size of a hugepage in kb (eg. 1048576)
    HUGEPAGE_KB='1048576'
    # Path to the performed setup paths (usually /var/lib/runperf/persistent_setup_finished)
    PERFORMED_SETUP_PATH='/var/lib/runperf/persistent_setup_finished'

    # Spread hugepages on all nodes equally
    ALL_GOOD=0
    for I in $(seq 10); do
        ALL_GOOD=1
        for NODE in $(seq 0 $(($NODES - 1))); do
            HP_PATH="/sys/devices/system/node/node$NODE/hugepages/hugepages-${HUGEPAGE_KB}kB/nr_hugepages"
            echo "$MEM_PER_NODE" > "$HP_PATH"
            [ "$(cat $HP_PATH)" -eq "$MEM_PER_NODE" ] || ALL_GOOD=0
        done
        [ "$ALL_GOOD" -eq 1 ] && break || true
        sleep 0.5
        echo 3 > /proc/sys/vm/drop_caches
    done
    mkdir /hugepages
    mount -t hugetlbfs -o pagesize=${HUGEPAGE_KB}K none /hugepages

    # Move non-libvirt tasks to the last cpu
    if [ -e '/sys/fs/cgroup/cgroup.procs' ]; then
        # cgroup2
        echo '+cpuset' > /sys/fs/cgroup/cgroup.subtree_control
        LAST_CPU="$(($(getconf _NPROCESSORS_ONLN) - 1))"
        for CGROUP_CPUS in /sys/fs/cgroup/*/cpuset.cpus; do
            [[ "$CGROUP_CPUS" = *'machine.slice'* ]] && continue
            echo $LAST_CPU > $CGROUP_CPUS
        done
        RUNPERF_CGROUP=$(mktemp -d /sys/fs/cgroup/runperf-XXXXXX)
        echo "$LAST_CPU" > "$RUNPERF_CGROUP/cpuset.cpus"
        for I in $(seq 3); do
            for TASK in $(ps axo pid); do
                # Some tasks are unmovable, ignore the result
                [[ "$(cat /proc/$TASK/cmdline)" = *'libvirtd'* ]] || echo $TASK >> "$RUNPERF_CGROUP/cgroup.procs" || true
            done
        done
    else
        # cgroup
        RUNPERF_CGROUP=$(mktemp -d /sys/fs/cgroup/cpuset/runperf-XXXXXX)
        cat /sys/fs/cgroup/cpuset/cpuset.mems > "$RUNPERF_CGROUP/cpuset.mems"
        echo $(($(getconf _NPROCESSORS_ONLN) - 1)) > "$RUNPERF_CGROUP/cpuset.cpus"
        for I in $(seq 3); do
            for TASK in $(cat /sys/fs/cgroup/cpuset/tasks); do
                # Some tasks are unmovable, ignore the result
                [[ "$(cat /proc/$TASK/cmdline)" = *'libvirtd'* ]] || echo $TASK >> "$RUNPERF_CGROUP/tasks" || true
            done
        done
    fi

    # Let run-perf know the persistent setup is ready
    echo rc_local >> "$PERFORMED_SETUP_PATH"

Please find the full machine XML file attached.

Comment 12 Lukáš Doktor 2022-01-15 17:46:25 UTC
> this fio job:
> 
>         [global]
>         bs=4k
>         runtime=300
>         ioengine=libaio
>         iodepth=2
>         direct=1
>         sync=0
>         time_based=1
>         clocksource=gettimeofday
>         ramp_time=5
>         write_bw_log=fio
>         write_iops_log=fio
>         write_lat_log=fio
>         log_avg_msec=1000
>         write_hist_log=fio
>         log_hist_msec=10000
> 
>         [job-/dev/vdb]
>         filename=/dev/vdb
>         rw=read
>         size=4096M
>         numjobs=10

Oops, I pasted the job from a testing commit with iodepth 2, which actually passed (in 15 iterations); the one that is failing looks like this:


        [global]
        bs=4k
        runtime=300
        ioengine=libaio
        iodepth=8
        direct=1
        sync=0
        time_based=1
        clocksource=gettimeofday
        ramp_time=5
        write_bw_log=fio
        write_iops_log=fio
        write_lat_log=fio
        log_avg_msec=1000
        write_hist_log=fio
        log_hist_msec=10000

        [job-/dev/vdb]
        filename=/dev/vdb
        rw=read
        size=4096M
        numjobs=10

Comment 13 Stefan Hajnoczi 2022-01-17 10:57:24 UTC
(In reply to Lukas Doktor from comment #11)
> Created attachment 1850988 [details]
> Libvirt xml file with nvme disk which is still resulting in hanged fio jobs
> 
> Hello Stefan, I'm not sure whether it's related, but I'm still having
> troubles with my attached nvme disks even with your fix. This time I'm
> attaching the disk via:
> 
>     <disk type="block" device="disk">
>       <driver name="qemu" type="raw" cache="none" discard="unmap"
> io="native" iothread="1" packed="on"/>
>       <source dev="/dev/nvme0n1"/>
>       <target dev="vdb" bus="virtio"/>
>     </disk>

Hi Lukas,
This domain XML does not use the userspace NVMe driver. The hang you're experiencing must have a different root cause from the one in this BZ.

Can you provide access to a machine where I/O is stuck? I'll attach gdb to the QEMU process and investigate what happened to the emulated virtio-blk device and Linux AIO block driver.

Thanks!
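
For reference, a libvirt disk that does go through QEMU's userspace NVMe driver is declared with type="nvme" and a PCI source instead of a block-device path. A sketch; the PCI address matches the controller from comment #4, other details may need adjusting:

    <disk type="nvme" device="disk">
      <driver name="qemu" type="raw"/>
      <source type="pci" managed="yes" namespace="1">
        <address domain="0x0000" bus="0x89" slot="0x00" function="0x0"/>
      </source>
      <target dev="vdb" bus="virtio"/>
    </disk>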

Comment 15 Yanan Fu 2022-01-18 06:49:56 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 18 Tingting Mao 2022-01-19 10:59:55 UTC
Verified this bug as below:


Tested with:
qemu-kvm-6.2.0-4.el9
kernel-5.14.0-39.el9.x86_64


Steps:
1. Boot a guest with an NVMe data disk
# /usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine q35,memory-backend=mem-machine_mem \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 4096 \
    -object memory-backend-ram,size=4096M,id=mem-machine_mem  \
    -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2  \
    -cpu 'Broadwell',+kvm_pv_unhalt \
    -chardev socket,path=/tmp/avocado_c9_4zzvx/monitor-qmpmonitor1-20211118-213644-6foHYCIo,id=qmp_id_qmpmonitor1,wait=off,server=on  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,path=/tmp/avocado_c9_4zzvx/monitor-catch_monitor-20211118-213644-6foHYCIo,id=qmp_id_catch_monitor,wait=off,server=on  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idS9Utfv \
    -chardev socket,path=/tmp/avocado_c9_4zzvx/serial-serial0-20211118-213644-6foHYCIo,id=chardev_serial0,wait=off,server=on \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20211118-213644-6foHYCIo,path=/tmp/avocado_c9_4zzvx/seabios-20211118-213644-6foHYCIo,server=on,wait=off \
    -device isa-debugcon,chardev=seabioslog_id_20211118-213644-6foHYCIo,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=native,filename=$1,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,write-cache=on,bus=pcie-root-port-2,addr=0x0 \
    -blockdev node-name=file_disk1,driver=nvme,auto-read-only=on,discard=unmap,device=0000:bc:00.0,namespace=1 \
    -blockdev node-name=drive_disk1,driver=raw,read-only=off,cache.direct=on,cache.no-flush=off,file=file_disk1 \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-blk-pci,id=disk1,drive=drive_disk1,bootindex=1,write-cache=on,bus=pcie-root-port-3,addr=0x0 \
    -device pcie-root-port,id=pcie-root-port-4,port=0x4,addr=0x1.0x4,bus=pcie.0,chassis=5 \
    -device virtio-net-pci,mac=9a:37:37:37:37:9e,id=id33GX8j,netdev=idxsHdTt,bus=pcie-root-port-4,addr=0x0  \
    -netdev tap,id=idxsHdTt  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -monitor stdio \
    -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=6

2. Do fio test in guest
(guest) # fio fio.job

Note:
(guest)# cat fio.job
[global]
bs=4k
runtime=300
ioengine=libaio
iodepth=64
direct=1
sync=0
time_based=1
clocksource=gettimeofday
ramp_time=5
write_bw_log=fio
write_iops_log=fio
write_lat_log=fio
log_avg_msec=1000
write_hist_log=fio
log_hist_msec=10000

[job-/dev/vdb]
filename=/dev/vdb
rw=read
size=4096M
numjobs=10


Results:
* passed with iodepth 32 and numjobs 32
* passed with iodepth 32 and numjobs 16
* passed with iodepth 32 and numjobs 10
* passed with iodepth 64 and numjobs 10
* passed with iodepth 64 and numjobs 64

Comment 19 Lukáš Doktor 2022-01-24 17:40:57 UTC
Clearing the needinfo as the machine has been provided and the issue from comment #11 is likely not related to this BZ.

Comment 21 errata-xmlrpc 2022-05-17 12:25:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: qemu-kvm), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2307

