Bug 2207634
| Summary: | [qemu-kvm] Multiple hot-plug/hot-unplug virtio-scsi disks operations hit core dump | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Zhenyu Zhang <zhenyzha> |
| Component: | qemu-kvm | Assignee: | Stefano Garzarella <sgarzare> |
| qemu-kvm sub component: | virtio-blk,scsi | QA Contact: | Zhenyu Zhang <zhenyzha> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | coli, eric.auger, gshan, hreitz, jinzhao, juzhang, kraxel, lijin, pbonzini, qinwang, shahuang, stefanha, vgoyal, virt-maint, xuwei, yihyu |
| Version: | 9.3 | Keywords: | TestOnly |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | aarch64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-08-08 05:59:56 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2176702 | | |
| Bug Blocks: | | | |
Description
Zhenyu Zhang
2023-05-16 11:53:34 UTC
Comment 2
Hanna Czenczek

Hi,

Because apparently I can't reproduce this easily myself (not reproducible on x86 as per comment 0), I'll have to ask some questions:

Do I understand the order of operations right? We hot-plug some disks, then start I/O on them in the guest; while that I/O is running, we detach the disks again; and after a couple of iterations (notably, while disks are being plugged in), we get I/O errors, not on the hot-plugged disks but on the system disk instead (image1, /dev/sda, dm-0). Then applications in the guest dump core, perhaps because of those I/O errors.

(The I/O errors seem to be led by:

2023-05-16 01:19:30: [ 174.274736] sd 0:0:0:0: device reset
2023-05-16 01:19:41: [ 185.455493] sd 0:0:0:0: [sda] tag#208 abort
2023-05-16 01:19:51: [ 195.695540] sd 0:0:0:0: [sda] tag#207 abort
2023-05-16 01:19:51: [ 195.696045] sd 0:0:0:0: device reset
2023-05-16 01:20:01: [ 205.935483] sd 0:0:0:0: [sda] tag#208 abort
2023-05-16 01:20:01: [ 205.935660] sd 0:0:0:0: Device offlined - not ready after error recovery
2023-05-16 01:20:01: [ 205.935665] sd 0:0:0:0: Device offlined - not ready after error recovery
2023-05-16 01:20:01: [ 205.935669] sd 0:0:0:0: Device offlined - not ready after error recovery
2023-05-16 01:20:01: [ 205.935820] sd 0:0:0:0: rejecting I/O to offline device
)

There is no visible problem on the host, right? I.e., qemu does not crash, it does not hang, and it does not show errors e.g. in the QMP log?

Is this also reproducible without the I/O (without step 3) in the guest?

The fact that sda (the disk that is never plugged in or unplugged) suffers from a reset and I/O errors does look to me like the problem likely is in qemu's device emulation. I'm Cc-ing Paolo, he's the (virtio-)scsi maintainer, and also Stefan, because he's no stranger to virtio-scsi either.

Comment 3
Zhenyu Zhang

(In reply to Hanna Czenczek from comment #2)
> Because apparently I can't reproduce this easily myself (not reproducible
> on x86 as per comment 0), I'll have to ask some questions:
>
> There is no visible problem on the host, right? I.e., qemu does not crash,
> it does not hang, and it does not show errors e.g. in the QMP log?

Hello Hanna,

Yes, qemu is still running:

(qemu) info status
VM status: running
(qemu) info pci
(qemu) info qtree

This is the detailed QMP log:
http://10.0.136.47/zhenyzha/multi_disk_shared_bus/test-results/1-Host_RHEL.m9.u3.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.9.3.0.aarch64.page_64k.io-github-autotest-qemu.multi_disk_random_hotplug.single_type.shared_bus.arm64-pci/
Please click to view qmpmonitor1-avocado-vt-vm1-pid-116894.log

But the guest is hanging:

[root@localhost home]# poweroff
bash: poweroff: command not found
[root@localhost home]# ls
-bash: /usr/bin/ls: Input/output error

Please click to view serial-serial0-avocado-vt-vm1.log

> Is this also reproducible without the I/O (without step 3) in the guest?

In my attempts, fast hot-plugging without I/O did not reproduce the issue. In my manual test, I always wait about 10 s after hot-plugging the disk, then do the dd operation, and then hot-unplug the disk. Repeat this about 3-4 times and you will encounter the problem (a command-level sketch of this cycle is given after this comment).

> The fact that sda (the disk that is never plugged in or unplugged) suffers
> from a reset and I/O errors to me does look like the problem likely is in
> qemu's device emulation. I'm Cc-ing Paolo, he's the (virtio-)scsi
> maintainer, and also Stefan, because he's no stranger to virtio-scsi either.
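For reference, the manual reproduction cycle described in comment 3 (hot-plug, wait ~10 s, dd, hot-unplug) roughly corresponds to the following host-side QMP commands plus a guest-side I/O step. This is only a sketch: the node name `stg1`, device id `stg1_dev`, bus name `virtio_scsi_pci0.0`, the image path, and the guest device `/dev/sdb` are placeholders and depend on the actual test configuration.

```
# Host (QMP): add a backing node and hot-plug a scsi-hd device on the
# shared virtio-scsi bus (all names below are placeholders).
{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "node-name": "stg1",
  "file": {"driver": "file", "filename": "/path/to/stg1.qcow2"}}}
{"execute": "device_add", "arguments": {"driver": "scsi-hd", "drive": "stg1",
  "id": "stg1_dev", "bus": "virtio_scsi_pci0.0"}}

# Guest: wait ~10 s for the new disk to show up, then generate I/O on it.
dd if=/dev/zero of=/dev/sdb bs=1M count=1024 oflag=direct

# Host (QMP): hot-unplug the device and drop the backing node again.
{"execute": "device_del", "arguments": {"id": "stg1_dev"}}
{"execute": "blockdev-del", "arguments": {"node-name": "stg1"}}
```

Per comment 3, repeating this cycle 3-4 times on the aarch64 host is what triggers the resets and I/O errors on /dev/sda.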
Comment 4
Zhenyu Zhang

Hello xuwei and qinwang,

In the test on the x86 platform, 200 disks can be hot-plugged/hot-unplugged without hitting this issue. But on the ARM platform, when I modify the script to hot-plug/hot-unplug only 17 disks (stg_image_num = 17), it hangs.

python3 ConfigTest.py --guestname=RHEL.9.3.0..page_64k --platform=aarch64 --machines=arm64-pci --driveformat=virtio_scsi --nicmodel=virtio_net --mem=8192 --vcpu=4 --testcase=multi_disk_random_hotplug..single_type.shared_bus --netdst=virbr0 --clone=no

Adding CC to make sure our message is consistent.

(In reply to Zhenyu Zhang from comment #3)
> this is QMP detailed log:
> http://10.0.136.47/zhenyzha/multi_disk_shared_bus/test-results/1-Host_RHEL.
> m9.u3.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.9.3.0.aarch64.page_64k.io-
> github-autotest-qemu.multi_disk_random_hotplug.single_type.shared_bus.arm64-
> pci/
> Please click to view qmpmonitor1-avocado-vt-vm1-pid-116894.log

And adding the HMP log: http://pastebin.test.redhat.com/1100489

Comment 6
qing.wang

(In reply to Zhenyu Zhang from comment #4)
> Hello xuwei and qinwang,
>
> In the test on the x86 platform, 200 disks can be hot-plugged/hot-unplugged
> without hitting this issue. But on the ARM platform, when I modify the
> script to hot-plug/hot-unplug only 17 disks (stg_image_num = 17), it hangs.
>
> python3 ConfigTest.py --guestname=RHEL.9.3.0..page_64k --platform=aarch64
> --machines=arm64-pci --driveformat=virtio_scsi --nicmodel=virtio_net
> --mem=8192 --vcpu=4
> --testcase=multi_disk_random_hotplug..single_type.shared_bus --netdst=virbr0
> --clone=no
>
> Adding CC to make sure our message is consistent.

This issue may be a duplicate of, or share the same root cause as,
Bug 2176702 - [RHEL9][virtio-scsi] scsi-hd cannot hot-plug successfully after hot-plug it repeatly

It looks like the disk order is wrong: data gets written to the OS disk, which breaks the OS.

2023-05-16 01:19:30: [ 175.220064] sd 0:0:0:0: LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
2023-05-16 01:19:30: [ 175.220163] scsi 0:0:16:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
2023-05-16 01:19:41: [ 185.455493] sd 0:0:0:0: [sda] tag#208 abort
2023-05-16 01:19:51: [ 195.695540] sd 0:0:0:0: [sda] tag#207 abort
2023-05-16 01:19:51: [ 195.696021] sd 0:0:0:0: LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
2023-05-16 01:19:51: [ 195.696045] sd 0:0:0:0: device reset
2023-05-16 01:20:01: [ 205.935483] sd 0:0:0:0: [sda] tag#208 abort
2023-05-16 01:20:01: [ 205.935660] sd 0:0:0:0: Device offlined - not ready after error recovery

Comment 7
Hanna Czenczek

(In reply to qing.wang from comment #6)
> This issue may be a duplicate of, or share the same root cause as,
> Bug 2176702 - [RHEL9][virtio-scsi] scsi-hd cannot hot-plug successfully
> after hot-plug it repeatly
>
> It looks like the disk order is wrong: data gets written to the OS disk,
> which breaks the OS.

I agree, it does seem very plausible that both have the same cause.
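Comment 6 suggests that the guest's LUN-to-device mapping goes stale, so I/O intended for a hot-plugged disk ends up on the OS disk. A possible way to check this from inside the guest between iterations is sketched below; it assumes the lsscsi utility is installed (otherwise /proc/scsi/scsi gives similar information), and the sample output line is only illustrative.

```
# Guest: list SCSI devices with their host:channel:target:lun address
# and the block device each one is currently bound to.
lsscsi
# sample output line: [0:0:0:0]  disk  QEMU  QEMU HARDDISK  2.5+  /dev/sda

# Cross-check against the kernel's device list and the stable by-id links,
# which should stay consistent across hot-plug/hot-unplug iterations.
cat /proc/scsi/scsi
ls -l /dev/disk/by-id/
```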
(In reply to Hanna Czenczek from comment #7)
> I agree, it does seem very plausible that both have the same cause.

This bug comes from Bug 2203094 - Add more than 17 pcie-root-ports, display Out Of Resource - Comment 11.

In view of the large difference in the number of hot-plugged disks per platform (200 disks on the x86 platform do not hit the issue, while 17 disks on the ARM platform do), I suggest keeping this bug open for now and verifying again when bug 2176702 is resolved.
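When re-verifying after bug 2176702 is resolved, one low-effort way to spot this failure early is to watch the guest serial log for the error signature quoted in comments 2 and 6. A minimal sketch follows; the log file name is the one mentioned in comment 3 and would need to be adjusted for the actual test run.

```
# Host: follow the guest serial log and flag the error signature from this bug.
tail -f serial-serial0-avocado-vt-vm1.log | \
  grep -E "device reset|tag#[0-9]+ abort|Device offlined|rejecting I/O to offline device"
```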