Bug 1999051
Summary: Qemu VM with VirtIO gets paused directly after booting with: BLOCK_IO_ERROR, reason: "Invalid argument"

Product: Red Hat Enterprise Linux Advanced Virtualization | Reporter: Jean-Louis Dupond <jean-louis>
Component: qemu-kvm | Assignee: Virtualization Maintenance <virt-maint>
qemu-kvm sub component: Storage | QA Contact: qing.wang <qinwang>
Status: CLOSED DUPLICATE | Docs Contact:
Severity: urgent
Priority: medium | CC: coli, elima, jinzhao, juzhang, kchamart, kkiwi, qinwang, quintela, timao, virt-maint
Version: 8.4 | Keywords: Triaged
Target Milestone: rc
Target Release: 8.5
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: | Doc Type: If docs needed, set a value
Doc Text: | Story Points: ---
Clone Of: | Environment:
Last Closed: 2021-09-02 02:01:44 UTC | Type: Bug
Regression: --- | Mount Type: ---
Documentation: --- | CRM:
Verified Versions: | Category: ---
oVirt Team: --- | RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- | Target Upstream Version:
Embargoed:
Attachments: qemu args (1819048), libvirt xml (1819049), full libvirt debug log (1819057)
Description
Jean-Louis Dupond
2021-08-30 10:27:33 UTC
Created attachment 1819048 [details]
qemu args
Created attachment 1819049 [details]
libvirt xml
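
For reference, the contents of these two attachments can usually be regathered on a host with standard libvirt tooling; <domain> below is a placeholder for the VM name, not a value from this bug:

    $ virsh dumpxml <domain> > domain.xml         # the libvirt xml of the guest
    $ less /var/log/libvirt/qemu/<domain>.log     # libvirt logs the full qemu command line here on every start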
> image: /rhev/data-center/mnt/blockSD/17f5688c-11d0-4708-a52c-55ee43936f74/images/f81d418f-ba14-444a-aaf7-0f3510fe5fd0/63672aa8-4748-4ddc-923c-c06eb1b4ce3e
> file format: qcow2
> virtual size: 47.5 GiB (51011125248 bytes)
> disk size: 0 B

This 0 size is as weird as it can be. Can you make sure that there is enough space and that libvirt/qemu can write to that place on the destination?

> cluster_size: 65536
> Format specific information:

Thanks, Juan.

Hi, I know you've mentioned on #qemu (OFTC) that changing from 'virtio-blk' to 'virtio-scsi' has "fixed the issue" for you. Also, besides what Juan already asked above, one more thing: since you're doing migration, it can help to have the full QMP transactions from the source and destination hosts. You can obtain them in the following way:

(1) Install the `virt-admin` tool.

(2) Configure the log file to capture the output:

    $> virt-admin daemon-log-outputs "1:file:/var/log/libvirt/libvirtd.log"

(3) Then configure the dynamic log filters:

    $> virt-admin daemon-log-filters \
        "1:libvirt 1:qemu 1:conf 1:security 3:event 3:json 3:file 3:object 1:util"

That's it; no need to restart the libvirt daemon. Repeat your test, and attach the /var/log/libvirt/libvirtd.log from the source and the destination to the bug. But first check what Juan has asked in comment #3, please.

Created attachment 1819057 [details]
Full libvirt debug log for that VM
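
With the log filters above in place, the QMP traffic shows up in libvirtd.log as qemuMonitorJSONIOProcessLine debug lines, so the interesting events can be pulled out of a large debug log with a plain grep, for example:

    $ grep 'qemuMonitorJSONIOProcessLine' /var/log/libvirt/libvirtd.log   # QMP lines received from qemu
    $ grep 'BLOCK_IO_ERROR' /var/log/libvirt/libvirtd.log                 # just the I/O error events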
(In reply to Juan Quintela from comment #3)
> > image: /rhev/data-center/mnt/blockSD/17f5688c-11d0-4708-a52c-55ee43936f74/images/f81d418f-ba14-444a-aaf7-0f3510fe5fd0/63672aa8-4748-4ddc-923c-c06eb1b4ce3e
> > file format: qcow2
> > virtual size: 47.5 GiB (51011125248 bytes)
> > disk size: 0 B
>
> This 0 size is as weird as it can be. Can you make sure that there is enough
> space and that libvirt/qemu can write to that place on the destination?

There is plenty of space. If I just try to boot the VM on a non-upgraded hypervisor (so with qemu 5.2/libvirt 7.4), it boots fine, and that is on the same storage/network/...

The "disk size: 0 B" isn't a big issue, I think? The docs tell me: "How much space the image file occupies on the host file system (may be shown as 0 if this information is unavailable, e.g. because there is no file system)". As this is a Windows machine, that might be the cause? It is the same on the other VMs I checked as well.

Also, as mentioned above, I was able to get it working again by switching the disk from VirtIO to VirtIO-SCSI. So it really seems to be some kind of QEMU issue.

Please note that this happened first after a migration, but even a fresh start of the VM causes the same issue, so it is unrelated to the migration. I also tried to clone the VM that has this issue, and it was 100% reproducible on the clone.

Extracted from comment 1 (i.e. with a lot of the surrounding output removed):

"image": {"virtual-size": 51011125248, "filename": "/rhev/data-center/mnt/blockSD/17f5688c-11d0-4708-a52c-55ee43936f74/images/f81d418f-ba14-444a-aaf7-0f3510fe5fd0/63672aa8-4748-4ddc-923c-c06eb1b4ce3e", "cluster-size": 65536, "format": "qcow2", "actual-size": 0, "format-specific": {"type": "qcow2", "data": {"compat": "1.1", "compression-type": "zlib", "lazy-refcounts": false, "refcount-bits": 16, "corrupt": false, "extended-l2": false}}, "dirty-flag": false},

"image": {"virtual-size": 56237228032, "filename": "/rhev/data-center/mnt/blockSD/17f5688c-11d0-4708-a52c-55ee43936f74/images/f81d418f-ba14-444a-aaf7-0f3510fe5fd0/63672aa8-4748-4ddc-923c-c06eb1b4ce3e", "format": "host_device", "actual-size": 0, "dirty-flag": false},

You can see that the disk size is zero even for the block device. You can't create a qcow2 block device on something that has size 0.

Later, Juan.

This is on my system, a normal file, not a block device, but the qcow2 header will use the same size.

$ qemu-img create -f qcow2 /tmp/kk.img 10G
Formatting '/tmp/kk.img', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=10737418240 lazy_refcounts=off refcount_bits=16
$ ll /tmp/kk.img
-rw-r--r--. 1 quintela quintela 193K Aug 30 13:32 /tmp/kk.img
$ du /tmp/kk.img
196K    /tmp/kk.img
$ qemu-img info /tmp/kk.img
image: /tmp/kk.img
file format: qcow2
virtual size: 10 GiB (10737418240 bytes)
disk size: 196 KiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false

Notice how the disk file uses 196K, and qemu-img sees 196 KiB as the disk size. Comment 1 shows a disk size of 0, and note that this here is a completely empty qcow2 disk image.

(In reply to Juan Quintela from comment #9)
> This is on my system, a normal file, not a block device, but the qcow2 header
> will use the same size.
>
> Notice how the disk file uses 196K, and qemu-img sees 196 KiB as the disk size.
> Comment 1 shows a disk size of 0, and note that this here is a completely empty
> qcow2 disk image.

Seems like this is some qemu-img reporting issue.
# qemu-img info /rhev/data-center/mnt/blockSD/17f5688c-11d0-4708-a52c-55ee43936f74/images/0353604b-fab6-4a9d-bf39-42cbf6d3e74c/eee5169b-f75f-4e93-983f-9d8033d3ae70
image: /rhev/data-center/mnt/blockSD/17f5688c-11d0-4708-a52c-55ee43936f74/images/0353604b-fab6-4a9d-bf39-42cbf6d3e74c/eee5169b-f75f-4e93-983f-9d8033d3ae70
file format: qcow2
virtual size: 47.5 GiB (51011125248 bytes)
disk size: 0 B
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false

But then:

# dd if=/rhev/data-center/mnt/blockSD/17f5688c-11d0-4708-a52c-55ee43936f74/images/0353604b-fab6-4a9d-bf39-42cbf6d3e74c/eee5169b-f75f-4e93-983f-9d8033d3ae70 of=test.img
107741184+0 records in
107741184+0 records out
55163486208 bytes (55 GB, 51 GiB) copied, 277.342 s, 199 MB/s

# qemu-img info test.img
image: test.img
file format: qcow2
virtual size: 47.5 GiB (51011125248 bytes)
disk size: 51.4 GiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false

So it is qemu(-img) reporting a wrong size when the qcow2 is on a block device. It is not a block-device issue itself.

Maybe this issue is related to:
Bug 1989717 - [regression] blk_update_request: I/O error when generating load on iSCSI
Bug 1994494 - VM remains in paused state when trying to write on a resized disk resides on iscsi

Could you please try removing the VM error policy? Does it then hit the crash in qemu-kvm: ../util/iov.c:428: qemu_iovec_init_extended: Assertion `mid_qiov->niov <= IOV_MAX' failed.

From the log of comment 1 [1], the image /rhev/data-center/mnt/blockSD/17f5688c-11d0-4708-a52c-55ee43936f74/images/f81d418f-ba14-444a-aaf7-0f3510fe5fd0/63672aa8-4748-4ddc-923c-c06eb1b4ce3e is used as a 'host_device'. For a block-device image, qemu-img cannot get the real size, as the doc says: "How much space the image file occupies on the host file system (may be shown as 0 if this information is unavailable, e.g. because there is no file system)".

[1] -blockdev '{"driver":"host_device","filename":"/rhev/data-center/mnt/blockSD/17f5688c-11d0-4708-a52c-55ee43936f74/images/f81d418f-ba14-444a-aaf7-0f3510fe5fd0/63672aa8-4748-4ddc-923c-c06eb1b4ce3e","aio":"native","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
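
As background on the "disk size: 0 B" figure: qemu-img derives it from the number of allocated blocks the host reports for the image path, and for a block device node that number is typically zero even though the device itself has a real size. A rough illustration with GNU stat and util-linux blockdev (the LV path is a placeholder):

    $ stat -c 'allocated: %b blocks of %B bytes' /tmp/kk.img     # regular file: matches the qemu-img "disk size"
    $ stat -c 'allocated: %b blocks of %B bytes' /dev/vg0/lv0    # device node: usually 0 blocks, hence "disk size: 0 B"
    $ blockdev --getsize64 /dev/vg0/lv0                          # the device's real size in bytes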
(In reply to Jean-Louis Dupond from comment #10)
> Seems like this is some qemu-img reporting issue.
> [...]
> So it is qemu(-img) reporting a wrong size when the qcow2 is on a block device.
> It is not a block-device issue itself.

I think this is because of the speculative preallocation algorithm of the file system. For XFS, you could check https://linux-xfs.oss.sgi.narkive.com/jjjfnyI1/faq-xfs-speculative-preallocation for more details.

(In reply to qing.wang from comment #12)
> Could you please try removing the VM error policy? Does it then hit the crash in
> qemu-kvm: ../util/iov.c:428: qemu_iovec_init_extended: Assertion
> `mid_qiov->niov <= IOV_MAX' failed.

You mean the error_policy on the disk device, right?

-blockdev '{"driver":"host_device","filename":"/rhev/data-center/mnt/blockSD/ea2fe42d-2fa1-410f-894d-29621bb08716/images/95854454-a9ca-4979-9bf1-cbe7024c93aa/b3653cc3-e722-4091-bc34-88ff500ee98a","aio":"native","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
-device virtio-blk-pci,iothread=iothread1,bus=pci.0,addr=0x5,drive=libvirt-1-format,id=ua-95854454-a9ca-4979-9bf1-cbe7024c93aa,bootindex=1,write-cache=on,serial=95854454-a9ca-4979-9bf1-cbe7024c93aa \

Removed it, and it still throws the same error:

2021-08-31 07:58:18.694+0000: 375431: debug : qemuMonitorJSONIOProcessLine:222 : Line [{"timestamp": {"seconds": 1630396698, "microseconds": 694539}, "event": "BLOCK_IO_ERROR", "data": {"device": "", "nospace": false, "node-name": "libvirt-1-format", "reason": "Invalid argument", "operation": "read", "action": "report"}}]

It doesn't go into a paused state now, but just hangs. It doesn't crash with the message I reported in the other bug (the IOV_MAX one).

The hang action probably prevented the crash. Could you please try changing the device from virtio-blk-pci to scsi-hd? If it boots without any issue, that indicates the image is fine. If it boots successfully, I think this is still related to Bug 1994494. Or try the latest qemu-kvm (>= 29) and see whether this issue is fixed there.
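
For context on that suggestion: on the QEMU command line, switching a disk from virtio-blk to virtio-scsi means replacing the virtio-blk-pci device with a virtio-scsi controller plus a scsi-hd LUN attached to its bus. A minimal sketch (the controller id and PCI address are placeholders, not values from this VM):

    -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x5 \
    -device scsi-hd,bus=scsi0.0,drive=libvirt-1-format,bootindex=1,write-cache=on \

In oVirt/libvirt terms this corresponds to changing the disk interface from VirtIO to VirtIO-SCSI, the switch already reported earlier in the thread as making the problem disappear.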
(In reply to qing.wang from comment #17)
> The hang action probably prevented the crash.
> Could you please try changing the device from virtio-blk-pci to scsi-hd?
> If it boots without any issue, that indicates the image is fine.

-device scsi-hd,bus=ua-18fca1d1-a616-48e9-94c0-0705f9b29e31.0,channel=0,scsi-id=0,lun=0,device_id=95854454-a9ca-4979-9bf1-cbe7024c93aa,drive=libvirt-1-format,id=ua-95854454-a9ca-4979-9bf1-cbe7024c93aa,bootindex=1,write-cache=on,serial=95854454-a9ca-4979-9bf1-cbe7024c93aa \

Then the VM boots fine and everything seems to be normal. It indeed happens only on plain VirtIO, not on VirtIO-SCSI.

> If it boots successfully, I think this is still related to Bug 1994494.
> Or try the latest qemu-kvm (>= 29) and see whether this issue is fixed there.

As I am on CentOS Stream 8, I don't have a more recent version available. @

> As I am on CentOS Stream 8, I don't have a more recent version available.
> @
Maybe @etrunko could build a newer version?
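
As a general note, which qemu-kvm build a host is actually running can be confirmed with the standard RHEL/CentOS Stream packaging paths, for example:

    $ rpm -q qemu-kvm                      # installed package NVR, e.g. qemu-kvm-6.0.0-29.el8s
    $ /usr/libexec/qemu-kvm --version      # version string of the emulator binary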
As qemu-kvm-6.0.0-29.el8s was built today, I've tested it directly. I can confirm that this fixed the above issue!

(In reply to Jean-Louis Dupond from comment #19)
> > As I am on CentOS Stream 8, I don't have a more recent version available.
> > @
> Maybe @etrunko could build a newer version?

The qemu-kvm-6.0.0-29.el8s build is done: https://cbs.centos.org/koji/taskinfo?taskID=2528662

Thanks, Jean-Louis, for testing: https://bugzilla.redhat.com/show_bug.cgi?id=1999051#c21
This issue is closed as a duplicate of bug 1994494.

*** This bug has been marked as a duplicate of bug 1994494 ***