Description of problem:

The VM appeared to be running fine, then vdsm reported an error, qemu marked the image as corrupt, and the VM became paused. The issue came after LSM. However, I cannot find a reason for the image being marked corrupt: there were no storage issues, and the VM was running fine until that point. An attempt to repair the image failed; it was unrecoverable.

Version-Release number of selected component (if applicable):
4.4.3

How reproducible:
Unknown

Actual results:
The image got corrupted and the VM was paused.

Expected results:
The image is not corrupted.

Additional info:

In vdsm.log:
~~~
2021-06-19 12:44:50,153+0200 INFO (libvirt/events) [virt.vm] (vmId='4200a8c1-bc83-3901-0278-73f8ef9c41ba') abnormal vm stop device ua-9aa7e3ea-5729-49a1-ac04-bd945a6005ae error (vm:4732)
2021-06-19 12:44:50,153+0200 INFO (libvirt/events) [virt.vm] (vmId='4200a8c1-bc83-3901-0278-73f8ef9c41ba') CPU stopped: onIOError (vm:5842)
2021-06-19 12:44:50,156+0200 INFO (libvirt/events) [virt.vm] (vmId='4200a8c1-bc83-3901-0278-73f8ef9c41ba') CPU stopped: onSuspend (vm:5842)
2021-06-19 12:44:50,174+0200 WARN (libvirt/events) [virt.vm] (vmId='4200a8c1-bc83-3901-0278-73f8ef9c41ba') device vdc reported I/O error (vm:3873)
~~~

In engine.log:
~~~
2021-06-19 12:44:56,579+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-7) [] VM '4200a8c1-bc83-3901-0278-73f8ef9c41ba'(xxxxxx) moved from 'Up' --> 'Paused'
2021-06-19 12:44:56,596+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_PAUSED(1,025), VM xxxxx has been paused.
2021-06-19 12:44:56,606+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_PAUSED_ERROR(139), VM xxxxx has been paused due to unknown storage error.
~~~

The last line of the qemu log shows the corruption event:

$ tail -1 var/log/libvirt/qemu/xxxxx.log
qcow2: Marking image as corrupt: Cluster allocation offset 0x30303030302000 unaligned (L2 offset: 0x3eb50000, L2 index: 0x11a3); further corruption events will be suppressed

# qemu-img info /dev/583bce8f-fff6-4655-9fff-6ff2e48a0b5c/c19d500a-d990-464e-887e-72bd3fa8e3d2
image: /dev/583bce8f-fff6-4655-9fff-6ff2e48a0b5c/c19d500a-d990-464e-887e-72bd3fa8e3d2
file format: qcow2
virtual size: 100 GiB (107374182400 bytes)
disk size: 0 B
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: true   <<----

# qemu-img check /dev/583bce8f-fff6-4655-9fff-6ff2e48a0b5c/c19d500a-d990-464e-887e-72bd3fa8e3d2
.....
Leaked cluster 878243 refcount=1 reference=0
Leaked cluster 878244 refcount=1 reference=0
Leaked cluster 878245 refcount=1 reference=0
Leaked cluster 878246 refcount=1 reference=0
Leaked cluster 878247 refcount=1 reference=0
Leaked cluster 878248 refcount=1 reference=0
ERROR OFLAG_COPIED data cluster: l2_entry=8b1a000001000000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=2000200 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=5001c00 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=b56d080002200000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=ba9435090001da00 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8206fa1ed91eb71e refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=841d621d401d201d refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=bd1c9b1c781c5305 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=ab1b8a1b6a1b491b refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=9c1a7c1a581a361a refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8f196d194b192819 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=9f157b1559153915 refcount=0
.......
.............
....................
ERROR OFLAG_COPIED data cluster: l2_entry=100000056becd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=100000087c5cd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=100000089c6cd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=9491350900000000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=100000014bfcd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=100000036c0cd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=100000004c3cd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=100000047c9cd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=100000039c0cd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=126f098 refcount=1

5157 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.

812690 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.
402802/1638400 = 24.59% allocated, 2.59% fragmented, 0.18% compressed clusters
Image end offset: 80823255040

We tried to repair the disk, but that failed:

# qemu-img check -r all /dev/583bce8f-fff6-4655-9fff-6ff2e48a0b5c/c19d500a-d990-464e-887e-72bd3fa8e3d2
.......
................
Leaked cluster 878244 refcount=1 reference=0
Leaked cluster 878245 refcount=1 reference=0
Leaked cluster 878246 refcount=1 reference=0
Leaked cluster 878247 refcount=1 reference=0
Leaked cluster 878248 refcount=1 reference=0
Rebuilding refcount structure
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device

One thing possibly worth noting: this VM has 4 disks, several of which were live migrated to another SD. Two of those migrations failed, apparently hitting BZ 1957776. The disk in question (the one that caused the VM to pause and became corrupt) also went through LSM, but that migration appeared to finish fine, and the corruption/pause came an hour or more after it.

We were unable to determine what caused the issue; no storage problems were observed when it occurred.

Logs will be attached soon.
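Regarding the failed repair above: the refcount rebuild ran out of space on the block device, so the LV would need to be larger before `qemu-img check -r all` can succeed. Below is a minimal sketch of one way to retry, assuming the volume is a plain LVM logical volume that can safely be extended out-of-band while no VM uses it; the +5G increment is an arbitrary placeholder, and whether a manual lvextend is appropriate on an RHV-managed storage domain is a separate support question. This is not a confirmed fix for the corruption itself.

~~~
# Sketch only: give the refcount rebuild room to write new refblocks, then retry.
# Assumes the LV is active on this host and the volume is not attached to any VM.
LV=/dev/583bce8f-fff6-4655-9fff-6ff2e48a0b5c/c19d500a-d990-464e-887e-72bd3fa8e3d2

lvextend -L +5G "$LV"        # placeholder increment; the rebuild failed with ENOSPC
qemu-img check -r all "$LV"  # retry the repair
qemu-img check "$LV"         # verify whether any errors remain
~~~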
There have been many storage-related changes since 4.4.3. It may not be feasible to find a root cause on an old version; please upgrade.
Nir, can you please have a look?
We had similar reports in the past and never found the reason why qemu marks an image as corrupted. In general, RHV never modifies an image while it is used by a VM, so an image becoming corrupted can be a bug in qemu, or an external program running on the host and accessing the image. amashah, can you check whether a backup application is accessing this host? We know about 2 backup applications that access disks directly, bypassing RHV APIs. If such a backup application is involved, we cannot support this system. Do you see this issue with the current version (4.4.6)?
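One quick way to look for an external program touching the volume on the host (a sketch; the device path is the one from the report, and this only catches processes that hold the volume open at the moment you check):

~~~
# Sketch: list processes that currently have the volume open.
# Anything besides the qemu-kvm process of this VM would be suspicious.
LV=/dev/583bce8f-fff6-4655-9fff-6ff2e48a0b5c/c19d500a-d990-464e-887e-72bd3fa8e3d2

fuser -v "$(readlink -f "$LV")"   # PIDs holding the device-mapper node open
lsof "$(readlink -f "$LV")"       # same information, with full command names
~~~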
Kevin, how do you recommend handling this case? What info do we need to understand this corruption?
So it looks like at least one of the L2 tables has been overwritten with other data. The L2 entries mentioned in the error messages are completely invalid. I think that for identifying where the corruption came from, our only indication is the kind of data that has been written over the L2 table, i.e. look at a hexdump at the offset where an L2 table should be (here at least 0x3eb50000, possibly others, too). Of course, if it looks like guest data, it's still not clear whether it really comes from the guest or whether an external tool copied it there. With access to the image, we could also use other tools like John's qcheck (https://github.com/jnsnow/qcheck) to analyse the exact shape of the corruption, though in general this can't tell much about its origin.
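For reference, a sketch of the hexdump suggested above, using the L2 offset from the qemu log (0x3eb50000) and the 64 KiB cluster size reported by qemu-img info; the device path is the one from the report, and the image should ideally not be in use while reading it:

~~~
# Sketch: dump the 64 KiB cluster that should contain the L2 table at offset
# 0x3eb50000 and check whether it still looks like L2 entries (8-byte
# big-endian values pointing at 64 KiB-aligned offsets) or like unrelated
# data that overwrote the table.
IMG=/dev/583bce8f-fff6-4655-9fff-6ff2e48a0b5c/c19d500a-d990-464e-887e-72bd3fa8e3d2

dd if="$IMG" bs=64k skip=$((0x3eb50000 / 0x10000)) count=1 status=none | hexdump -C | less
~~~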
Either way, our recommendation is to upgrade to 4.4.6. Decreasing severity accordingly.
*** This bug has been marked as a duplicate of bug 1951507 ***