Bug 2176702 - [RHEL9][virtio-scsi] scsi-hd cannot hot-plug successfully after hot-plugging it repeatedly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.2
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Stefano Garzarella
QA Contact: qing.wang
URL:
Whiteboard:
Depends On:
Blocks: 2207634 2208473
 
Reported: 2023-03-09 02:19 UTC by bfu
Modified: 2023-11-07 09:18 UTC
CC List: 24 users

Fixed In Version: qemu-kvm-8.0.0-9.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-07 08:27:12 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments: none


Links:
Gitlab redhat/centos-stream/src/qemu-kvm merge request 184 (opened): "scsi: fix issue with Linux guest and unit attention" - last updated 2023-07-17 07:57:46 UTC
Red Hat Issue Tracker RHELPLAN-151175 - last updated 2023-03-09 02:20:06 UTC
Red Hat Product Errata RHSA-2023:6368 - last updated 2023-11-07 08:28:45 UTC

Description bfu 2023-03-09 02:19:25 UTC
Description of problem:
After plugging and unplugging a SCSI disk several times, an I/O error occurs, and afterwards no disk can be found with "ls /dev/[vhs]d* | grep -v [0-9]$"

Version-Release number of selected component (if applicable):
compose id: RHEL-9.2.0-20230228.28
kernel version: kernel-5.14.0-283.el9.s390x
qemu version: qemu-kvm-7.2.0-10.el9.s390x

introduced by:
compose id: RHEL-9.2.0-20221122.2
kernel version: kernel-5.14.0-197.el9.s390x
qemu version: qemu-kvm-7.1.0-4.el9.s390x


How reproducible:
100% with the automated test
0% with manual debugging

Steps to Reproduce:
1. boot guest with a scsi disk
-device '{"id": "virtio_scsi_ccw0", "driver": "virtio-scsi-ccw"}' \
    -blockdev '{"node-name": "file_image1", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kar/vt_test_images/rhel920-s390x-virtio-scsi.qcow2", "cache": {"direct": true, "no-flush": false}}' \
    -blockdev '{"node-name": "drive_image1", "driver": "qcow2", "read-only": false, "cache": {"direct": true, "no-flush": false}, "file": "file_image1"}' \

2. create a qcow2 file with 1G
# /usr/bin/qemu-img create -f qcow2 /home/kar/vt_test_images/storage0.qcow2 1G

3. hotplug a scsi-hd disk through qmp
{"execute": "qmp_capabilities", "id": "YINWP12P"}
{"execute": "blockdev-add", "arguments": {"node-name": "file_stg0", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kar/vt_test_images/storage0.qcow2", "cache": {"direct": true, "no-flush": false}}, "id": "OokNtJB0"}
{"execute": "blockdev-add", "arguments": {"node-name": "drive_stg0", "driver": "qcow2", "read-only": false, "cache": {"direct": true, "no-flush": false}, "file": "file_stg0"}, "id": "aewQuCD8"}
{"execute": "device_add", "arguments": {"driver": "scsi-hd", "id": "stg0", "drive": "drive_stg0", "write-cache": "on", "bus": "virtio_scsi_ccw0.0"}, "id": "tzzoHjOF"}

4. unplug the scsi-hd disk through qmp
{"execute": "device_del", "arguments": {"id": "stg0"}, "id": "2kVnwMsp"}
{"execute": "blockdev-del", "arguments": {"node-name": "drive_stg0"}, "id": "a0pG9Mpo"}
{"execute": "blockdev-del", "arguments": {"node-name": "file_stg0"}, "id": "xnjIobL0"}

5. Repeat step 3 and step 4, and check the disk info each time the scsi-hd is hot-plugged (a loop sketch follows the console log below).
An I/O error occurs, the scsi-hd fails to hot-plug, and after a while no disk can be found in the guest at all:
2023-03-08 01:56:57: 192.168.122.89[root@localhost ~]#
2023-03-08 01:56:59: 192.168.122.89/dev/sda
2023-03-08 01:56:59: 192.168.122.89[root@localhost ~]#
2023-03-08 01:56:59: 192.168.122.890
2023-03-08 01:56:59: 192.168.122.89[root@localhost ~]#
2023-03-08 01:57:00: 192.168.122.89bash: grep: command not found
2023-03-08 01:57:00: 192.168.122.89bash: ls: command not found
2023-03-08 01:57:01: 192.168.122.89[root@localhost ~]#
2023-03-08 01:57:01: 192.168.122.89127
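
For reference, a minimal sketch of the plug/unplug loop in steps 3-4 (assuming the QMP commands above are saved as plug.txt and unplug.txt, each starting with the qmp_capabilities line, and that QEMU exposes a QMP UNIX socket at /tmp/qmp.sock - the file names and socket path are hypothetical):

for ((i = 0; i < 50; i++)); do
    nc -U /tmp/qmp.sock < plug.txt     # step 3: hot-plug the scsi-hd
    nc -U /tmp/qmp.sock < unplug.txt   # step 4: unplug it again
done
# then, inside the guest, check whether the disk is still visible:
ls /dev/[vhs]d* | grep -v '[0-9]$'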


Actual results:
Hot-plugging the scsi-hd fails, and afterwards no disk can be found inside the guest.
An I/O error occurs:
2023-03-08 01:57:00: [   36.959048] I/O error, dev sda, sector 4196352 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 2
2023-03-08 01:57:00: [   36.959069] I/O error, dev sda, sector 2168 op 0x1:(WRITE) flags 0x1000 phys_seg 2 prio class 2
2023-03-08 01:57:00: [   36.959081] I/O error, dev sda, sector 33656536 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
2023-03-08 01:57:00: [   36.959092] I/O error, dev sda, sector 2049 op 0x1:(WRITE) flags 0x1000 phys_seg 2 prio class 2
2023-03-08 01:57:00: [   36.959098] I/O error, dev sda, sector 2056 op 0x1:(WRITE) flags 0x1000 phys_seg 4 prio class 2
2023-03-08 01:57:00: [   36.959102] I/O error, dev sda, sector 34337912 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 2
2023-03-08 01:57:00: [   36.959106] I/O error, dev sda, sector 34337928 op 0x1:(WRITE) flags 0x100000 phys_seg 3 prio class 2
2023-03-08 01:57:00: [   36.959113] I/O error, dev sda, sector 16298784 op 0x1:(WRITE) flags 0x100000 phys_seg 2 prio class 2
2023-03-08 01:57:00: [   36.959120] I/O error, dev sda, sector 33639016 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 2
2023-03-08 01:57:00: [   36.959125] I/O error, dev sda, sector 33640928 op 0x1:(WRITE) flags 0x100000 phys_seg 2 prio class 2


Expected results:
The scsi-hd can be hot-plugged successfully and repeatedly.


Additional info:

Comment 1 bfu 2023-03-09 02:24:22 UTC
*** Bug 2168891 has been marked as a duplicate of this bug. ***

Comment 3 qing.wang 2023-03-09 06:40:59 UTC
It does not hit this issue on x86

python ConfigTest.py --testcase=block_hotplug.block_scsi.fmt_qcow2.default.with_plug.with_repetition.one_pci --driveformat=virtio_scsi --firmware=ovmf --guestname=RHEL.9.2.0 --nrepeat=50

Red Hat Enterprise Linux release 9.2 Beta (Plow)
5.14.0-283.el9.x86_64
qemu-kvm-7.2.0-10.el9.x86_64
seabios-bin-1.16.1-1.el9.noarch
edk2-ovmf-20221207gitfff6d81270b5-7.el9.noarch
libvirt-9.0.0-7.el9.x86_64
virtio-win-prewhql-0.1-234.iso

Comment 4 Thomas Huth 2023-03-09 08:09:43 UTC
@bfu : Since the "Regression" keyword has been added: Do you know whether it was still working fine with an older version of qemu-kvm and the kernel?

Comment 6 Stefan Hajnoczi 2023-03-09 21:22:08 UTC
The I/O errors are expected due to the device being unplugged.

When the device does not appear inside the guest after hotplug, please run:

  # echo "- - -" > /sys/class/scsi_host/hostX/scan

Where "hostX" is the SCSI host for the virtio-scsi device. If you're not sure you can scan all SCSI hosts.

This will tell us whether the device is accessible on the SCSI bus. If the device appears after this command, then the issue may be a bug in the virtio-scsi event queue that notifies the guest when it needs to rescan.
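
If in doubt, a minimal sketch that rescans every SCSI host in the guest (the "- - -" wildcard covers all channels, targets, and LUNs):

# as root inside the guest: force a rescan of every SCSI host
for h in /sys/class/scsi_host/host*; do
    echo "- - -" > "$h/scan"
done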

Comment 8 Zhenyu Zhang 2023-03-10 01:25:54 UTC
It does not hit this issue on aarch64

All pass the following cases:
virtio_serial_file_transfer.iommu_enabled.pty.default,block_hotplug.block_scsi.fmt_qcow2.default.with_plug.with_repetition.one_pci,block_hotplug.block_scsi.fmt_qcow2.default.with_plug.with_repetition.multi_pci,block_hotplug.block_scsi.fmt_raw.default.with_plug.with_repetition.one_pci,block_hotplug.block_scsi.fmt_raw.default.with_plug.with_repetition.multi_pci


Red Hat Enterprise Linux release 9.2 Beta (Plow)
5.14.0-283.el9.aarch64
qemu-kvm-7.2.0-10.el9
edk2-aarch64-20221207gitfff6d81270b5-7.el9.noarch
libvirt-9.0.0-7.el9.aarch64

Comment 9 qing.wang 2023-03-10 06:49:22 UTC
(In reply to qing.wang from comment #3)
> It does not hit this issue on x86
> 
> python ConfigTest.py
> --testcase=block_hotplug.block_scsi.fmt_qcow2.default.with_plug.
> with_repetition.one_pci --driveformat=virtio_scsi --firmware=ovmf
> --guestname=RHEL.9.2.0 --nrepeat=50
> 
> Red Hat Enterprise Linux release 9.2 Beta (Plow)
> 5.14.0-283.el9.x86_64
> qemu-kvm-7.2.0-10.el9.x86_64
> seabios-bin-1.16.1-1.el9.noarch
> edk2-ovmf-20221207gitfff6d81270b5-7.el9.noarch
> libvirt-9.0.0-7.el9.x86_64
> virtio-win-prewhql-0.1-234.iso

Same result on seabios

 python ConfigTest.py --testcase=block_hotplug.block_scsi.fmt_qcow2.default.with_plug.with_repetition.one_pci --driveformat=virtio_scsi --firmware=default_bios --guestname=RHEL.9.1.0 --nrepeat=30 --netdst=virbr0

Comment 10 Peixiu Hou 2023-03-10 09:11:38 UTC
Did not hit this issue on Win11 guest:

python ConfigTest.py --testcase=block_hotplug.block_scsi.fmt_qcow2.default.with_plug.with_repetition.one_pci --driveformat=virtio_scsi --firmware=ovmf --guestname=Win11 --clone=no --nrepeat=10

kernel-5.14.0-249.el9.x86_64
qemu-kvm-7.2.0-8.el9.x86_64
edk2-ovmf-20221207gitfff6d81270b5-6.el9.noarch
virtio-win-prewhql-234

Thanks~
Peixiu

Comment 11 Thomas Huth 2023-03-10 11:01:31 UTC
(In reply to Stefan Hajnoczi from comment #6)
> The I/O errors are expected due to the device being unplugged.

I think I can reproduce the issue now. The weird thing is: The SCSI errors only happen for the first disk (which has been cold-plugged and contains the root filesystem), so the guest gets completely unusable once the SCSI errors happen.

Comment 12 Thomas Huth 2023-03-13 12:45:36 UTC
The chances to reproduce this manually seem to be very, very low (I was just lucky on Friday, I guess).

@bfu : Could you please give me some instructions on how to use your automatic reproducer?

Comment 13 Thomas Huth 2023-03-13 18:22:24 UTC
Ok, after running the device_add + device_del commands a great many times in a tight loop in the shell, I was able to bisect the problem, I think. Looks like it broke between QEMU v7.1 and v7.2:

 8cc5583abe6419e7faaebc9fbd109f34f4c850f2 is the first bad commit
 virtio-scsi: Send "REPORTED LUNS CHANGED" sense data upon disk hotplug events

Looks like this commit changed the behavior of hotplugging so that it can now influence all devices on the bus - which explains why the root disk can now die instead of only the hot-plugged disk!

Paolo, could you please have a look at this? I think that commit might have been a bad idea...
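
For anyone repeating the bisection against upstream QEMU, a sketch (reproduce.sh is a hypothetical script that builds QEMU and exits non-zero when a reproducer, such as the one in comment 16 below, triggers the bug):

git clone https://gitlab.com/qemu-project/qemu.git && cd qemu
git bisect start v7.2.0 v7.1.0    # <bad> <good>
git bisect run ./reproduce.sh     # converges on 8cc5583abe6419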

Comment 16 Thomas Huth 2023-03-14 08:40:05 UTC
FWIW, here's how I reproduced the issue now semi-manually without additional testing framework:

qemu-img create -f qcow2 /tmp/storage0.qcow2 1G

cat > /tmp/cmd_plug.txt <<EOF
{ "execute": "qmp_capabilities" }
{ "execute": "device_add", "arguments": {"driver": "scsi-hd", "id": "stg0", "drive": "drive_stg0", "write-cache": "on", "bus": "virtio_scsi_ccw0.0"} }
EOF

cat >/tmp/cmd_unplug.txt <<EOF
{ "execute": "qmp_capabilities" }
{ "execute": "device_del", "arguments": {"id": "stg0"} }
EOF

/usr/libexec/qemu-kvm -nographic -m 4G -accel kvm \
 -device virtio-scsi-ccw,id=virtio_scsi_ccw0 \
 -drive if=none,id=dr1,file=/var/lib/libvirt/images/rhel9.qcow2 \
 -device scsi-hd,drive=dr1 \
 -blockdev file,node-name=file_stg0,filename=/tmp/storage0.qcow2 \
 -blockdev qcow2,node-name=drive_stg0,file=file_stg0 \
 -qmp unix:/tmp/qemu-sock,server,wait=off -serial mon:stdio

Then, in another terminal window, run this once the guest has been started:

for ((y=0;y<20;y++)) ; do for ((x=0;x<50;x++)); do \
  nc -U /tmp/qemu-sock < /tmp/cmd_plug.txt ; \
  nc -U /tmp/qemu-sock < /tmp/cmd_unplug.txt ; \
done ; sleep 5 ; done

After a while, you'll see a message like this in the guest:

sd 0:0:0:0: LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.

Note that 0:0:0:0 is the root disk sda - the hot-plugged disk is 0:0:1:0 instead.
So that's the first indication that the guest cannot handle the
sense data from commit 8cc5583abe6419.

After some additional time, the guest fails to access its
root disk sda completely and spills out messages like:

sd 0:0:0:0: [sda] tag#21 abort

followed by XFS file system error messages.

I think we should revert commit 8cc5583abe6419 ... Paolo?

Comment 17 Thomas Huth 2023-03-14 13:07:05 UTC
With the approach I described in comment 16, I can also reproduce the issue with upstream QEMU on an x86 host (just replace virtio-scsi-ccw with virtio-scsi-pci), so from what I can tell, this is not specific to s390x. (No clue why it didn't trigger for qing.wang in comment 3.)

Comment 18 qing.wang 2023-03-17 08:43:52 UTC
Hit this issue on 

Red Hat Enterprise Linux release 9.2 Beta (Plow)
5.14.0-283.el9.x86_64
qemu-kvm-7.2.0-10.el9.x86_64
seabios-bin-1.16.1-1.el9.noarch
edk2-ovmf-20221207gitfff6d81270b5-7.el9.noarch
libvirt-9.0.0-7.el9.x86_64
virtio-win-prewhql-0.1-234.iso


Test steps
1. Boot the VM

/usr/libexec/qemu-kvm \
  -name testvm \
  -machine q35 \
  -m  6G \
  -smp 2 \
  -cpu host,vmx,+kvm_pv_unhalt \
  -device ich9-usb-ehci1,id=usb1 \
  -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
   \
   \
  -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x3,chassis=1 \
  -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x3.0x1,bus=pcie.0,chassis=2 \
  -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x3.0x2,bus=pcie.0,chassis=3 \
  -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x3.0x3,bus=pcie.0,chassis=4 \
  -device pcie-root-port,id=pcie-root-port-4,port=0x4,addr=0x3.0x4,bus=pcie.0,chassis=5 \
  -device pcie-root-port,id=pcie-root-port-5,port=0x5,addr=0x3.0x5,bus=pcie.0,chassis=6 \
  -device pcie-root-port,id=pcie-root-port-6,port=0x6,addr=0x3.0x6,bus=pcie.0,chassis=7 \
  -device pcie-root-port,id=pcie-root-port-7,port=0x7,addr=0x3.0x7,bus=pcie.0,chassis=8 \
  -device pcie-root-port,id=pcie_extra_root_port_0,bus=pcie.0,addr=0x4  \
  -device virtio-scsi-pci,id=scsi0,bus=pcie-root-port-5 \
  -device virtio-scsi-pci,id=scsi1,bus=pcie-root-port-6 \
  -blockdev driver=qcow2,file.driver=file,cache.direct=off,cache.no-flush=on,file.filename=/home/kvm_autotest_root/images/rhel920-64-virtio-scsi.qcow2,node-name=drive_image1,file.aio=threads   \
  -device scsi-hd,id=os,drive=drive_image1,bus=scsi0.0,bootindex=0,serial=OS_DISK   \
  \
  -blockdev driver=qcow2,file.driver=file,file.filename=/home/kvm_autotest_root/images/data1.qcow2,node-name=data_image1   \
  -device scsi-hd,id=data1,drive=data_image1,bus=scsi0.0,bootindex=1,serial=DATA_DISK   \
  -vnc :5 \
  -monitor stdio \
  -qmp tcp:0:5955,server=on,wait=off \
  -device virtio-net-pci,mac=9a:b5:b6:b1:b2:b7,id=nic1,netdev=nicpci,bus=pcie-root-port-7 \
  -netdev tap,id=nicpci \
  -boot menu=on,reboot-timeout=1000,strict=off \
  \
  -chardev socket,id=socket-serial,path=/var/tmp/socket-serial,logfile=/var/tmp/file-serial.log,mux=on,server=on,wait=off \
  -serial chardev:socket-serial \
  -chardev file,path=/var/tmp/file-bios.log,id=file-bios \
  -device isa-debugcon,chardev=file-bios,iobase=0x402 \
  \
  -chardev socket,id=socket-qmp,path=/var/tmp/socket-qmp,logfile=/var/tmp/file-qmp.log,mux=on,server=on,wait=off \
  -mon chardev=socket-qmp,mode=control \
  -chardev socket,id=socket-hmp,path=/var/tmp/socket-hmp,logfile=/var/tmp/file-hmp.log,mux=on,server=on,wait=off \
  -mon chardev=socket-hmp,mode=readline \


2. Run the hotplug/unplug script

 cat dev_plug.txt
{ "execute": "qmp_capabilities" }
{ "execute": "device_add", "arguments": {"driver": "scsi-hd", "id": "data1", "drive": "data_image1", "write-cache": "on", "bus": "scsi0.0"} }

 cat dev_unplug.txt

{ "execute": "qmp_capabilities" }
{ "execute": "device_del", "arguments": {"id": "data1"} }


i=0
for ((y=0;y<20;y++)) ; do
	for ((x=0;x<50;x++)); do \
	  nc -U /var/tmp/socket-qmp < dev_plug.txt ; \
	  nc -U /var/tmp/socket-qmp < dev_unplug.txt ; \
  done ;
let i=i+1
echo ".... $i"
sleep 5 ;
done

3. Wait 3 minutes, then check the guest console.
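
Since the serial console is logged via the -chardev line above, the guest-side symptoms can also be checked without attaching to the console, e.g.:

# look for the I/O errors and the LUN-remap warning in the serial log
grep -E 'I/O error|LUN assignments' /var/tmp/file-serial.log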

Comment 19 qing.wang 2023-03-17 08:47:03 UTC
(In reply to qing.wang from comment #18)
> [test environment and full reproduction steps quoted verbatim from comment 18]

BTW, this issue only reproduces during the VM boot stage; it is hard to reproduce once the guest has finished booting (e.g., waiting 60 s before starting the test).

Not reproduced on:
Red Hat Enterprise Linux release 9.1 (Plow)
5.14.0-162.18.1.el9_1.x86_64
qemu-kvm-7.0.0-13.el9_1.2.x86_64
seabios-bin-1.16.0-4.el9.noarch


@bfu, could you please add a 60 s delay before starting the test in your automation?

Comment 27 Stefano Garzarella 2023-06-29 08:33:32 UTC
Talking to Paolo, it could be a problem in Linux caused by the device in QEMU throwing too many UNIT_ATTENTION conditions due to hotplug and hot-unplug events.

At this point, Linux thinks the device is broken. As a solution, Paolo suggests trying to limit these events.
I will try to see how to do that in QEMU.

@qinwang, as a test, can you try putting a sleep between plug and unplug in order to limit the rate?
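
Concretely, that would mean slowing down the loop from comment 18 along these lines (a sketch; the 2-second delay is just an example value):

for ((x = 0; x < 50; x++)); do
    nc -U /var/tmp/socket-qmp < dev_plug.txt
    sleep 2    # give the guest time to consume the UNIT_ATTENTION
    nc -U /var/tmp/socket-qmp < dev_unplug.txt
    sleep 2
done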

Comment 28 qing.wang 2023-07-03 03:46:33 UTC
(In reply to Stefano Garzarella from comment #27)
> Talking to Paolo, it could be a problem in Linux due to the device in QEMU
> throwing too many UNIT_ATTENTION due to hotplug and hotunplug events.
> 
> At this point, Linux thinks the device is broken. As a solution, Paolo
> suggests trying to limit these events.
> I will try to see how to do that in QEMU.
> 
> @qinwang as a test, can you try putting a sleep between plug and
> unplug in order to limit the rate?

In my understanding, this issue is related to the timing sequence.

This issue was originally reported as just a normal plug and unplug.

It can be reproduced quickly with the method from comment #16.
That looks like robustness testing, and it may indeed find real issues in the software.


It may still hit this issue after adding a 2 s sleep after plugging and unplugging. (It is not easy to reproduce; the sleep may reduce the frequency, but it may also hide some issues.)

http://fileshare.hosts.qa.psi.pek2.redhat.com/pub/section2/images_backup/qbugs/2176702/2023-07-02/
(search for "log I/O" in serial-serial0-avocado-vt-vm1.log)

Comment 29 Thomas Huth 2023-07-05 07:49:05 UTC
(In reply to Stefano Garzarella from comment #27)
> Talking to Paolo, it could be a problem in Linux due to the device in QEMU
> throwing too many UNIT_ATTENTION due to hotplug and hotunplug events.
> 
> At this point, Linux thinks the device is broken. As a solution, Paolo
> suggests trying to limit these events.

I had a quick try with this idea back in March, too, but IIRC it didn't really help, even when adding multiple seconds between the events. So I think it's best if we revert the patch now to fix the problem - it can be redone later in a better way once there is a proper understanding of what is really going on in the Linux kernel here.

Comment 30 Stefano Garzarella 2023-07-05 08:00:56 UTC
Qing, Thomas, thank you very much for the tests.

Yep, I also tried to put 20 seconds between every event, but after a few iterations the issue happened again.
So I agree to revert it for now, also because Oracle (the original author) already reverted it downstream.
I just sent the revert upstream: https://lore.kernel.org/qemu-devel/20230705071523.15496-1-sgarzare@redhat.com/

In the meantime, I'm trying to figure out whether that patch triggered a hidden bug in Linux, but I don't have that much time and my SCSI knowledge is limited.
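
For anyone who wants to test before an official build is available, the change can be undone locally with a plain revert of the offending commit (a sketch; the series linked above is the authoritative fix):

git revert 8cc5583abe6419e7faaebc9fbd109f34f4c850f2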

Comment 31 Stefano Garzarella 2023-07-11 17:16:04 UTC
I found a potential issue in the virtio-scsi Linux driver.
I wrote a report here asking some advice from SCSI guys: https://lore.kernel.org/qemu-devel/i3od362o6unuimlqna3aaedliaabauj6g545esg7txidd4s44e@bkx5des6zytx/

Comment 32 Stefano Garzarella 2023-07-12 13:51:38 UTC
I'm taking this BZ after talking with Paolo.

It seems that we found the problem in QEMU, and Paolo helped me to prepare a series that we just sent upstream: https://lore.kernel.org/qemu-devel/20230712134352.118655-1-sgarzare@redhat.com/

I'll prepare an RPM to test while we wait for the series to be reviewed and merged upstream, then I'll backport it downstream.

Comment 36 Yanan Fu 2023-07-25 05:33:59 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 39 qing.wang 2023-07-27 14:26:56 UTC
Passed test on 
Red Hat Enterprise Linux release 9.3 Beta (Plow)
5.14.0-342.el9.x86_64
qemu-kvm-8.0.0-9.el9.x86_64
seabios-bin-1.16.1-1.el9.noarch
edk2-ovmf-20230524-2.el9.noarch
virtio-win-prewhql-0.1-239.iso

python ConfigTest.py --testcase=multi_disk_wild_hotplug.without_delay --platform=x86_64 --guestname=RHEL.9.3.0 --driveformat=virtio_scsi  --imageformat=qcow2 --machines=q35 --firmware=default_bios --netdst=virbr0 --iothread_scheme=roundrobin --nr_iothreads=2 --customsparams="vm_mem_limit = 8G" --nrepeat=100
python ConfigTest.py --testcase=multi_disk_wild_hotplug.without_delay --platform=x86_64 --guestname=RHEL.9.3.0 --driveformat=virtio_blk  --imageformat=qcow2 --machines=q35 --firmware=default_bios --netdst=virbr0 --iothread_scheme=roundrobin --nr_iothreads=2 --customsparams="vm_mem_limit = 8G" --nrepeat=20
python ConfigTest.py --testcase=multi_disk_wild_hotplug --platform=x86_64 --guestname=RHEL.9.3.0 --driveformat=virtio_scsi  --imageformat=qcow2 --machines=q35 --firmware=default_bios --netdst=virbr0 --iothread_scheme=roundrobin --nr_iothreads=2 --customsparams="vm_mem_limit = 8G" --nrepeat=20 --clone=no

Comment 41 errata-xmlrpc 2023-11-07 08:27:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6368

