Bug 2207634
| Summary: | [qemu-kvm] Multiple hot-plug/hot-unplug virtio-scsi disks operations hit core dump | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Zhenyu Zhang <zhenyzha> |
| Component: | qemu-kvm | Assignee: | Stefano Garzarella <sgarzare> |
| qemu-kvm sub component: | virtio-blk,scsi | QA Contact: | Zhenyu Zhang <zhenyzha> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | coli, eric.auger, gshan, hreitz, jinzhao, juzhang, kraxel, lijin, pbonzini, qinwang, shahuang, stefanha, vgoyal, virt-maint, xuwei, yihyu |
| Version: | 9.3 | Keywords: | TestOnly |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | --- | ||
| Hardware: | aarch64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-08-08 05:59:56 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 2176702 | ||
| Bug Blocks: | |||
Hi,

Because apparently I can’t reproduce this easily myself (not reproducible on x86 as per comment 0), I’ll have to ask some questions:

Do I understand the order of operations right? We hot-plug some disks, then start I/O on them in the guest; while that I/O is running, we detach the disks again; and after a couple of iterations (notably, while disks are being plugged in), we get I/O errors, not on the hot-plugged disks but on the system disk instead (image1, /dev/sda, dm-0). Then applications in the guest dump core, presumably because of those I/O errors.

(The I/O errors seem to be preceded by:

2023-05-16 01:19:30: [ 174.274736] sd 0:0:0:0: device reset
2023-05-16 01:19:41: [ 185.455493] sd 0:0:0:0: [sda] tag#208 abort
2023-05-16 01:19:51: [ 195.695540] sd 0:0:0:0: [sda] tag#207 abort
2023-05-16 01:19:51: [ 195.696045] sd 0:0:0:0: device reset
2023-05-16 01:20:01: [ 205.935483] sd 0:0:0:0: [sda] tag#208 abort
2023-05-16 01:20:01: [ 205.935660] sd 0:0:0:0: Device offlined - not ready after error recovery
2023-05-16 01:20:01: [ 205.935665] sd 0:0:0:0: Device offlined - not ready after error recovery
2023-05-16 01:20:01: [ 205.935669] sd 0:0:0:0: Device offlined - not ready after error recovery
2023-05-16 01:20:01: [ 205.935820] sd 0:0:0:0: rejecting I/O to offline device
)

There is no visible problem on the host, right? I.e., qemu does not crash, it does not hang, and it does not show errors e.g. in the QMP log?

Is this also reproducible without the I/O (without step 3) in the guest?

To me, the fact that sda (the disk that is never plugged in or unplugged) suffers from a reset and I/O errors does look like the problem is likely in qemu’s device emulation. I’m Cc-ing Paolo, he’s the (virtio-)scsi maintainer, and also Stefan, because he’s no stranger to virtio-scsi either.

(In reply to Hanna Czenczek from comment #2)
> Because apparently I can’t reproduce this easily myself (not reproducible on
> x86 as per comment 0), I’ll have to ask some questions:
>
> There is no visible problem on the host, right? I.e., qemu does not crash,
> it does not hang, and it does not show errors e.g. in the QMP log?

Hello Hanna,

Yes, qemu is still running:

(qemu) info status
VM status: running
(qemu) info pci
(qemu) info qtree

This is the detailed QMP log:
http://10.0.136.47/zhenyzha/multi_disk_shared_bus/test-results/1-Host_RHEL.m9.u3.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.9.3.0.aarch64.page_64k.io-github-autotest-qemu.multi_disk_random_hotplug.single_type.shared_bus.arm64-pci/
Please click to view qmpmonitor1-avocado-vt-vm1-pid-116894.log

But the guest is hanging:

[root@localhost home]# poweroff
bash: poweroff: command not found
[root@localhost home]# ls
-bash: /usr/bin/ls: Input/output error

Please click to view serial-serial0-avocado-vt-vm1.log

> Is this also reproducible without the I/O (without step 3) in the guest?

In my attempts, fast hot-plugging without I/O did not reproduce the issue. In my manual test, I always wait about 10s after hot-plugging the disk, then do the dd operation, and then hot-unplug the disk. Repeat this about 3-4 times and you will encounter this problem.

> To me, the fact that sda (the disk that is never plugged in or unplugged)
> suffers from a reset and I/O errors does look like the problem is likely in
> qemu’s device emulation. I’m Cc-ing Paolo, he’s the (virtio-)scsi
> maintainer, and also Stefan, because he’s no stranger to virtio-scsi either.
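[Editorial note] The manual reproduction loop described in the reply above can be sketched roughly as follows. This is only a sketch, not part of the original report: it reuses the add_disk.sh and delete_disk.sh helpers from the bug description below, and GUEST is a hypothetical ssh alias for the guest.

#!/bin/bash
# Rough sketch of the manual loop: hot-plug, wait ~10s, dd in the guest,
# hot-unplug, and repeat 3-4 times. GUEST is a hypothetical ssh target.
GUEST=root@guest
for round in 1 2 3 4; do
    sh add_disk.sh          # hot-plug the 17 scsi-hd devices via QMP (see bug description)
    sleep 10                # wait for the guest to attach the new disks
    ssh "$GUEST" 'for x in {b..r}; do
        dd if=/dev/sd$x of=/dev/null bs=1M count=100 iflag=direct
        dd if=/dev/zero of=/dev/sd$x bs=1M count=100 oflag=direct
    done'
    sh delete_disk.sh       # hot-unplug the devices again via QMP
    sleep 10
done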
Hello xuwei and qinwang,

In the test on the x86 platform, I see that 200 disks can be hot-plugged/hot-unplugged without hitting this issue. But on the ARM platform, if I modify the script to hot-plug/hot-unplug only 17 disks (stg_image_num = 17), it hangs.

python3 ConfigTest.py --guestname=RHEL.9.3.0..page_64k --platform=aarch64 --machines=arm64-pci --driveformat=virtio_scsi --nicmodel=virtio_net --mem=8192 --vcpu=4 --testcase=multi_disk_random_hotplug..single_type.shared_bus --netdst=virbr0 --clone=no

Adding you to CC to make sure our message is consistent.

(In reply to Zhenyu Zhang from comment #3)
> this is QMP detailed log:
> http://10.0.136.47/zhenyzha/multi_disk_shared_bus/test-results/1-Host_RHEL.
> m9.u3.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.9.3.0.aarch64.page_64k.io-
> github-autotest-qemu.multi_disk_random_hotplug.single_type.shared_bus.arm64-
> pci/
> Please click to view qmpmonitor1-avocado-vt-vm1-pid-116894.log

And adding the HMP log: http://pastebin.test.redhat.com/1100489

(In reply to Zhenyu Zhang from comment #4)
> Hello xuwei and qinwang,
>
> In the test on the x86 platform, I see that 200 disks can be
> hot-plugged/hot-unplugged without hitting this issue. But on the ARM
> platform, if I modify the script to hot-plug/hot-unplug only 17 disks
> (stg_image_num = 17), it hangs.
>
> python3 ConfigTest.py --guestname=RHEL.9.3.0..page_64k --platform=aarch64
> --machines=arm64-pci --driveformat=virtio_scsi --nicmodel=virtio_net
> --mem=8192 --vcpu=4
> --testcase=multi_disk_random_hotplug..single_type.shared_bus --netdst=virbr0
> --clone=no
>
> Adding you to CC to make sure our message is consistent.

This issue may be a duplicate of, or have the same root cause as,
Bug 2176702 - [RHEL9][virtio-scsi] scsi-hd cannot hot-plug successfully after hot-plug it repeatly

It looks like the disk order is wrong: data is written to the OS disk, which breaks the OS.

2023-05-16 01:19:30: [ 175.220064] sd 0:0:0:0: LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
2023-05-16 01:19:30: [ 175.220163] scsi 0:0:16:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
2023-05-16 01:19:41: [ 185.455493] sd 0:0:0:0: [sda] tag#208 abort
2023-05-16 01:19:51: [ 195.695540] sd 0:0:0:0: [sda] tag#207 abort
2023-05-16 01:19:51: [ 195.696021] sd 0:0:0:0: LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
2023-05-16 01:19:51: [ 195.696045] sd 0:0:0:0: device reset
2023-05-16 01:20:01: [ 205.935483] sd 0:0:0:0: [sda] tag#208 abort
2023-05-16 01:20:01: [ 205.935660] sd 0:0:0:0: Device offlined - not ready after error recovery

(In reply to qing.wang from comment #6)
> This issue may be a duplicate of, or have the same root cause as,
> Bug 2176702 - [RHEL9][virtio-scsi] scsi-hd cannot hot-plug successfully
> after hot-plug it repeatly
>
> It looks like the disk order is wrong: data is written to the OS disk,
> which breaks the OS.

I agree, it does seem very plausible that both have the same cause.

(In reply to Hanna Czenczek from comment #7)
> (In reply to qing.wang from comment #6)
> > This issue may be a duplicate of, or have the same root cause as,
> > Bug 2176702 - [RHEL9][virtio-scsi] scsi-hd cannot hot-plug successfully
> > after hot-plug it repeatly
> >
> > It looks like the disk order is wrong: data is written to the OS disk,
> > which breaks the OS.
>
> I agree, it does seem very plausible that both have the same cause.
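[Editorial note] Along the same lines, and not part of the original report, a small guest-side sanity check can confirm which SCSI devices are safe to write to before running dd, so a reshuffled device name never points at the OS disk. The 0:0:0:0 SCSI address of the OS disk is taken from the logs above; which disks to skip on a different setup is an assumption to adjust.

#!/bin/bash
# Sketch: list SCSI devices with their H:C:T:L addresses and skip the OS disk.
lsblk -S -o NAME,HCTL,SIZE,MODEL
for dev in /sys/block/sd*; do
    name=$(basename "$dev")
    hctl=$(ls "$dev/device/scsi_device" 2>/dev/null)   # e.g. 0:0:0:0
    if [ "$hctl" = "0:0:0:0" ]; then
        echo "skipping /dev/$name (OS disk at $hctl)"
        continue
    fi
    echo "safe to test: /dev/$name (HCTL $hctl)"
done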
This bug comes from Bug 2203094 - Add more than 17 pcie-root-ports, display Out Of Resource - Comment 11.

Given the large difference in the number of hot-plugged disks needed on each platform (200 disks on the x86 platform do not hit the issue, while 17 disks on the ARM platform do), I suggest keeping this bug open for now and verifying again once bug 2176702 is resolved.
Description of problem:
Multiple hot-plug/hot-unplug virtio-scsi disks operations hit core dump

Version-Release number of selected component (if applicable):
host kernel: 5.14.0-311.el9.aarch64+64k
guest kernel: 5.14.0-311.el9.aarch64+64k
qemu: qemu-kvm-8.0.0-2.el9
edk2: edk2-aarch64-20230301gitf80f052277c8-2.el9.noarch

Steps to Reproduce:

1. Create 17 disk image files and boot the guest:

qemu-img create -f qcow2 /home/kvm_autotest_root/images/stg1.qcow2 128M
...
qemu-img create -f qcow2 /home/kvm_autotest_root/images/stg17.qcow2 128M

/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1' \
    -sandbox on \
    -blockdev '{"node-name": "file_aavmf_code", "driver": "file", "filename": "/usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.raw", "auto-read-only": true, "discard": "unmap"}' \
    -blockdev '{"node-name": "drive_aavmf_code", "driver": "raw", "read-only": true, "file": "file_aavmf_code"}' \
    -blockdev '{"node-name": "file_aavmf_vars", "driver": "file", "filename": "/root/avocado/data/avocado-vt/avocado-vt-vm1_rhel930-aarch64-64k-virtio-scsi_qcow2_filesystem_VARS.fd", "auto-read-only": true, "discard": "unmap"}' \
    -blockdev '{"node-name": "drive_aavmf_vars", "driver": "raw", "read-only": false, "file": "file_aavmf_vars"}' \
    -machine virt,gic-version=host,memory-backend=mem-machine_mem,pflash0=drive_aavmf_code,pflash1=drive_aavmf_vars \
    -device '{"id": "pcie-root-port-0", "driver": "pcie-root-port", "multifunction": true, "bus": "pcie.0", "addr": "0x1", "chassis": 1}' \
    -device '{"id": "pcie-pci-bridge-0", "driver": "pcie-pci-bridge", "addr": "0x0", "bus": "pcie-root-port-0"}' \
    -nodefaults \
    -device '{"id": "pcie-root-port-1", "port": 1, "driver": "pcie-root-port", "addr": "0x1.0x1", "bus": "pcie.0", "chassis": 2}' \
    -device '{"driver": "virtio-gpu-pci", "bus": "pcie-root-port-1", "addr": "0x0"}' \
    -m 8192 \
    -object '{"size": 8589934592, "id": "mem-machine_mem", "qom-type": "memory-backend-ram"}' \
    -smp 4,maxcpus=4,cores=2,threads=1,clusters=1,sockets=2 \
    -cpu 'host' \
    -chardev socket,path=/var/tmp/monitor-qmpmonitor1,wait=off,id=qmp_id_qmpmonitor1,server=on \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,path=/var/tmp/monitor-catch_monitor-2,wait=off,id=qmp_id_catch_monitor,server=on \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -serial unix:'/var/tmp/serial-serial0',server=on,wait=off \
    -device '{"id": "pcie-root-port-2", "port": 2, "driver": "pcie-root-port", "addr": "0x1.0x2", "bus": "pcie.0", "chassis": 3}' \
    -device '{"driver": "qemu-xhci", "id": "usb1", "bus": "pcie-root-port-2", "addr": "0x0"}' \
    -device '{"driver": "usb-tablet", "id": "usb-tablet1", "bus": "usb1.0", "port": "1"}' \
    -device '{"id": "pcie-root-port-3", "port": 3, "driver": "pcie-root-port", "addr": "0x1.0x3", "bus": "pcie.0", "chassis": 4}' \
    -device '{"id": "virtio_scsi_pci0", "driver": "virtio-scsi-pci", "bus": "pcie-root-port-3", "addr": "0x0"}' \
    -blockdev '{"node-name": "file_image1", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kvm_autotest_root/images/rhel930-aarch64-64k-virtio-scsi.qcow2", "cache": {"direct": true, "no-flush": false}}' \
    -blockdev '{"node-name": "drive_image1", "driver": "qcow2", "read-only": false, "cache": {"direct": true, "no-flush": false}, "file": "file_image1"}' \
    -device '{"driver": "scsi-hd", "id": "image1", "drive": "drive_image1", "bootindex": 0, "write-cache": "on"}' \
    -device '{"id": "pcie-root-port-4", "port": 4, "driver": "pcie-root-port", "addr": "0x1.0x4", "bus": "pcie.0", "chassis": 5}' \
    -device '{"driver": "virtio-net-pci", "mac": "9a:3d:0b:61:1f:a5", "rombar": 0, "id": "idHxGKI6", "netdev": "idNFGXZ3", "bus": "pcie-root-port-4", "addr": "0x0"}' \
    -netdev tap,id=idNFGXZ3,vhost=on \
    -qmp tcp:0:5955,server,nowait \
    -vnc :20 \
    -rtc base=utc,clock=host,driftfix=slew \
    -enable-kvm \
    -device '{"id": "pcie_extra_root_port_0", "driver": "pcie-root-port", "multifunction": true, "bus": "pcie.0", "addr": "0x2", "chassis": 6}' \
    -monitor stdio

2. After the guest boots, run add_disk.sh on the host:

[root@hpe-apollo80-02-n01 home]# cat add_disk.sh
#!/bin/bash
NUM_LUNS=17

add_devices() {
    exec 3<>/dev/tcp/localhost/5955
    echo "$@"
    echo -e "{'execute':'qmp_capabilities'}" >&3
    read response <&3
    echo $response
    for i in $(seq 1 $NUM_LUNS) ; do
        cmd="{'execute': 'blockdev-add', 'arguments': {'node-name': 'file_stg${i}', 'driver': 'file', 'filename': '/home/kvm_autotest_root/images/stg${i}.qcow2'}}"
        echo "$cmd"
        echo -e "$cmd" >&3
        read response <&3
        echo "$response"
        cmd="{'execute': 'blockdev-add', 'arguments': {'node-name': 'drive_stg${i}', 'driver': 'qcow2', 'file': 'file_stg${i}'}}"
        echo "$cmd"
        echo -e "$cmd" >&3
        read response <&3
        echo "$response"
        cmd="{'execute': 'device_add', 'arguments': {'driver': 'scsi-hd', 'id': 'stg${i}', 'drive': 'drive_stg${i}', 'write-cache': 'on', 'bus': 'virtio_scsi_pci0.0'}}"
        echo "$cmd"
        echo -e "$cmd" >&3
        read response <&3
        echo "$response"
        # sleep 0.1
    done
}

add_devices

[root@hpe-apollo80-02-n01 home]# sh add_disk.sh

3. Verify the disks after hotplug and run the dd test in the guest:

[root@localhost home]# cat dd_test.sh
#!/bin/bash
for x in {b..r}
do
    dd if=/dev/sd$x of=/dev/null bs=1M count=100 iflag=direct && dd if=/dev/zero of=/dev/sd$x bs=1M count=100 oflag=direct
done

[root@localhost home]# sh dd_test.sh

4. Run delete_disk.sh on the host:

[root@hpe-apollo80-02-n01 home]# cat delete_disk.sh
#!/bin/bash
NUM_LUNS=17

delete_devices() {
    exec 3<>/dev/tcp/localhost/5955
    echo "$@"
    echo -e "{'execute':'qmp_capabilities'}" >&3
    read response <&3
    echo $response
    for i in $(seq 1 $NUM_LUNS) ; do
        cmd="{'execute': 'device_del', 'arguments': {'id': 'stg${i}'}}"
        echo "$cmd"
        echo -e "$cmd" >&3
        read response <&3
        echo "$response"
        cmd="{'execute': 'blockdev-del', 'arguments': {'node-name': 'drive_stg${i}'}}"
        echo "$cmd"
        echo -e "$cmd" >&3
        read response <&3
        echo "$response"
        cmd="{'execute': 'blockdev-del', 'arguments': {'node-name': 'file_stg${i}'}}"
        echo "$cmd"
        echo -e "$cmd" >&3
        read response <&3
        echo "$response"
        sleep 0.1
    done
}

delete_devices

[root@hpe-apollo80-02-n01 home]# sh delete_disk.sh
5. Repeat steps 2-4 three times and encounter the following errors:

[ 247.133297] sd 0:0:12:0: Power-on or device reset occurred
[ 247.133612] sd 0:0:12:0: [sdm] 262144 512-byte logical blocks: (134 MB/128 MiB)
[ 247.133828] sd 0:0:12:0: [sdm] Write Protect is off
[ 247.134164] sd 0:0:12:0: [sdm] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 247.154658] sd 0:0:9:0: [sdj] Attached SCSI disk
[ 247.219121] I/O error, dev sda, sector 17677240 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
[ 247.219177] dm-0: writeback error on inode 18326607, offset 0, sector 10152888
[ 247.243271] sd 0:0:12:0: [sdm] Attached SCSI disk
[ 247.243350] sd 0:0:10:0: [sdk] Attached SCSI disk
[ 247.243387] scsi host0: scsi scan: INQUIRY result too short (5), using 36
[ 247.243433] sd 0:0:11:0: [sdl] Attached SCSI disk
[ 247.244975] scsi 0:0:13:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
[ 247.246342] scsi 0:0:14:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
[ 247.247721] scsi 0:0:15:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
[ 247.249081] scsi 0:0:16:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
[ 247.250452] scsi 0:0:17:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
[ 247.303637] sd 0:0:13:0: Power-on or device reset occurred
[ 247.303940] sd 0:0:13:0: [sdn] 262144 512-byte logical blocks: (134 MB/128 MiB)
[ 247.304067] sd 0:0:13:0: [sdn] Write Protect is off
[ 247.304348] sd 0:0:13:0: [sdn] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 247.305783] sd 0:0:13:0: Attached scsi generic sg13 type 0
[ 247.323554] sd 0:0:14:0: Attached scsi generic sg14 type 0
[ 247.324819] sd 0:0:15:0: Attached scsi generic sg15 type 0
[ 247.327185] sd 0:0:14:0: Power-on or device reset occurred
[ 247.327468] sd 0:0:14:0: [sdo] 262144 512-byte logical blocks: (134 MB/128 MiB)
[ 247.327581] sd 0:0:14:0: [sdo] Write Protect is off
[ 247.327890] sd 0:0:14:0: [sdo] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 247.333185] sd 0:0:16:0: Power-on or device reset occurred
[ 247.333494] sd 0:0:16:0: [sdq] 262144 512-byte logical blocks: (134 MB/128 MiB)
[ 247.333600] sd 0:0:16:0: [sdq] Write Protect is off
[ 247.333801] sd 0:0:16:0: [sdq] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 247.338167] sd 0:0:16:0: Attached scsi generic sg16 type 0
[ 247.343774] sd 0:0:17:0: Attached scsi generic sg17 type 0
[ 247.344424] sd 0:0:15:0: Power-on or device reset occurred
[ 247.344789] sd 0:0:15:0: [sdp] 262144 512-byte logical blocks: (134 MB/128 MiB)
[ 247.344895] sd 0:0:15:0: [sdp] Write Protect is off
[ 247.345133] sd 0:0:15:0: [sdp] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 247.349557] sd 0:0:17:0: Power-on or device reset occurred
[ 247.349858] sd 0:0:17:0: [sdr] 262144 512-byte logical blocks: (134 MB/128 MiB)
[ 247.349984] sd 0:0:17:0: [sdr] Write Protect is off
[ 247.350208] sd 0:0:17:0: [sdr] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 247.413588] sd 0:0:13:0: [sdn] Attached SCSI disk
[ 247.413735] sd 0:0:14:0: [sdo] Attached SCSI disk
[ 247.433168] sd 0:0:16:0: [sdq] Attached SCSI disk
[ 247.449896] dm-0: writeback error on inode 18326607, offset 0, sector 10152896
[ 247.472951] sd 0:0:17:0: [sdr] Attached SCSI disk
[ 247.473252] sd 0:0:15:0: [sdp] Attached SCSI disk
[ 247.487992] dm-0: writeback error on inode 18326607, offset 0, sector 10152864
[ 247.521506] dm-0: writeback error on inode 18326607, offset 0, sector 10152832
[ 247.555697] dm-0: writeback error on inode 18326607, offset 0, sector 10152840
[ 247.575702] dm-0: writeback error on inode 18326608, offset 0, sector 10152776
[ 248.802755] XFS (dm-0): log I/O error -5
[ 248.803052] XFS (dm-0): Log I/O Error (0x2) detected at xlog_ioend_work+0x84/0x90 [xfs] (fs/xfs/xfs_log.c:1378). Shutting down filesystem.
[ 248.803247] XFS (dm-0): Please unmount the filesystem and rectify the problem(s)
[ 329.757172] Core dump to |/usr/lib/systemd/systemd-coredump pipe failed
[ 329.886041] Core dump to |/usr/lib/systemd/systemd-coredump pipe failed

Detailed log:
http://10.0.136.47/zhenyzha/multi_disk_shared_bus/test-results/1-Host_RHEL.m9.u3.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.9.3.0.aarch64.page_64k.io-github-autotest-qemu.multi_disk_random_hotplug.single_type.shared_bus.arm64-pci/serial-serial0-avocado-vt-vm1.log

Additional info:
Tested related test cases on x86 and did not hit the issue.

Versions:
kernel-5.14.0-306.el9.x86_64
qemu-kvm-8.0.0-2.el9
edk2-ovmf-20230301gitf80f052277c8-3.el9.noarch

1. multi_disk_random_hotplug.single_type - results passed.
Job link: http://fileshare.hosts.qa.psi.pek2.redhat.com/pub/logs/multi_disk_random_hotplug.single_type/results.html

2. multi_disk - results passed.
Job link: http://fileshare.hosts.qa.psi.pek2.redhat.com/pub/logs/multi_disk_with_edk2-20230301gitf80f052277c8-3.el9/results.html
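[Editorial note] One note on the QMP helpers above: when QEMU exposes QMP on a socket (here -qmp tcp:0:5955,server,nowait), it sends a greeting line as soon as the client connects, before any command is accepted. The add_disk.sh and delete_disk.sh scripts read only one line after sending qmp_capabilities, so that read actually consumes the greeting and the later reads are off by one (the commands are still processed). A slightly more explicit handshake, shown here only as a sketch and not as part of the original reproducer, would be:

#!/bin/bash
# Sketch: explicit QMP handshake over the raw TCP monitor used above.
exec 3<>/dev/tcp/localhost/5955
read greeting <&3                        # {"QMP": {...}} greeting sent by QEMU
echo "greeting: $greeting"
echo '{"execute": "qmp_capabilities"}' >&3
read response <&3                        # expected: {"return": {}}
echo "capabilities: $response"
echo '{"execute": "query-status"}' >&3   # any QMP command can follow now
read response <&3
echo "status: $response"

Keep in mind that asynchronous QMP events (e.g. after device_del) can also appear on the same stream, so a line read is not always the reply to the most recent command.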