Description of problem:

The VM appeared to be running fine, then vdsm reported an error, qemu marked the image as corrupt, and the VM became paused. The issue came after LSM. However, I cannot find a reason for the image being marked corrupt: there were no storage issues, and the VM was running fine until that point. An attempt to repair the image failed; it was unrecoverable.

Version-Release number of selected component (if applicable):
4.4.3

How reproducible:
Unknown

Actual results:
The image got corrupted and the VM was paused.

Expected results:
The image is not corrupted.

Additional info:

In vdsm.log:
~~~
2021-06-19 12:44:50,153+0200 INFO (libvirt/events) [virt.vm] (vmId='4200a8c1-bc83-3901-0278-73f8ef9c41ba') abnormal vm stop device ua-9aa7e3ea-5729-49a1-ac04-bd945a6005ae error (vm:4732)
2021-06-19 12:44:50,153+0200 INFO (libvirt/events) [virt.vm] (vmId='4200a8c1-bc83-3901-0278-73f8ef9c41ba') CPU stopped: onIOError (vm:5842)
2021-06-19 12:44:50,156+0200 INFO (libvirt/events) [virt.vm] (vmId='4200a8c1-bc83-3901-0278-73f8ef9c41ba') CPU stopped: onSuspend (vm:5842)
2021-06-19 12:44:50,174+0200 WARN (libvirt/events) [virt.vm] (vmId='4200a8c1-bc83-3901-0278-73f8ef9c41ba') device vdc reported I/O error (vm:3873)
~~~

In engine.log:
~~~
2021-06-19 12:44:56,579+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-7) [] VM '4200a8c1-bc83-3901-0278-73f8ef9c41ba'(xxxxxx) moved from 'Up' --> 'Paused'
2021-06-19 12:44:56,596+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_PAUSED(1,025), VM xxxxx has been paused.
2021-06-19 12:44:56,606+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_PAUSED_ERROR(139), VM xxxxx has been paused due to unknown storage error.
~~~

The last line of the qemu log shows the corruption event:

$ tail -1 var/log/libvirt/qemu/xxxxx.log
qcow2: Marking image as corrupt: Cluster allocation offset 0x30303030302000 unaligned (L2 offset: 0x3eb50000, L2 index: 0x11a3); further corruption events will be suppressed

# qemu-img info /dev/583bce8f-fff6-4655-9fff-6ff2e48a0b5c/c19d500a-d990-464e-887e-72bd3fa8e3d2
image: /dev/583bce8f-fff6-4655-9fff-6ff2e48a0b5c/c19d500a-d990-464e-887e-72bd3fa8e3d2
file format: qcow2
virtual size: 100 GiB (107374182400 bytes)
disk size: 0 B
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: true   <<----

# qemu-img check /dev/583bce8f-fff6-4655-9fff-6ff2e48a0b5c/c19d500a-d990-464e-887e-72bd3fa8e3d2
.....
Leaked cluster 878243 refcount=1 reference=0
Leaked cluster 878244 refcount=1 reference=0
Leaked cluster 878245 refcount=1 reference=0
Leaked cluster 878246 refcount=1 reference=0
Leaked cluster 878247 refcount=1 reference=0
Leaked cluster 878248 refcount=1 reference=0
ERROR OFLAG_COPIED data cluster: l2_entry=8b1a000001000000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=2000200 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=5001c00 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=b56d080002200000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=ba9435090001da00 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8206fa1ed91eb71e refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=841d621d401d201d refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=bd1c9b1c781c5305 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=ab1b8a1b6a1b491b refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=9c1a7c1a581a361a refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8f196d194b192819 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=9f157b1559153915 refcount=0
.......
.............
....................
ERROR OFLAG_COPIED data cluster: l2_entry=100000056becd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=100000087c5cd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=100000089c6cd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=9491350900000000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=100000014bfcd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=100000036c0cd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=100000004c3cd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=100000047c9cd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=100000039c0cd60 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=126f098 refcount=1

5157 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.

812690 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.
402802/1638400 = 24.59% allocated, 2.59% fragmented, 0.18% compressed clusters
Image end offset: 80823255040

We tried to repair the disk, but that failed:

# qemu-img check -r all /dev/583bce8f-fff6-4655-9fff-6ff2e48a0b5c/c19d500a-d990-464e-887e-72bd3fa8e3d2
.......
................
Leaked cluster 878244 refcount=1 reference=0
Leaked cluster 878245 refcount=1 reference=0
Leaked cluster 878246 refcount=1 reference=0
Leaked cluster 878247 refcount=1 reference=0
Leaked cluster 878248 refcount=1 reference=0
Rebuilding refcount structure
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device

One thing possibly worth noting: this VM has 4 disks, several of which were live migrated to another SD. Two of those migrations failed, apparently hitting BZ 1957776. The disk in question (the one that caused the VM to pause and became corrupt) also went through LSM, but that migration appeared to finish fine, and the corruption/pause came an hour or more after it.

We were unable to determine what caused the issue; no storage problems were observed when it occurred.

Logs will be attached soon.
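Regarding the failed repair above: the refcount rebuild ran out of space on the block device, so the LV would need to be larger before `qemu-img check -r all` can succeed. Below is a minimal sketch of one way to retry, assuming the volume is a plain LVM logical volume that can safely be extended out-of-band while no VM uses it; the +5G increment is an arbitrary placeholder, and whether a manual lvextend is appropriate on an RHV-managed storage domain is a separate support question. This is not a confirmed fix for the corruption itself.

~~~
# Sketch only: give the refcount rebuild room to write new refblocks, then retry.
# Assumes the LV is active on this host and the volume is not attached to any VM.
LV=/dev/583bce8f-fff6-4655-9fff-6ff2e48a0b5c/c19d500a-d990-464e-887e-72bd3fa8e3d2

lvextend -L +5G "$LV"        # placeholder increment; the rebuild failed with ENOSPC
qemu-img check -r all "$LV"  # retry the repair
qemu-img check "$LV"         # verify whether any errors remain
~~~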
There have been many storage-related changes since 4.4.3. It may not be feasible to find a root cause on an old version; please upgrade.
Nir, can you please have a look?
We had similar reports in the past and never found the reason why qemu marks an image as corrupted. In general, RHV never modifies an image while it is used by a VM, so an image becoming corrupted can be a bug in qemu, or an external program running on the host and accessing the image. amashah, can you check whether a backup application is accessing this host? We know about 2 backup applications that access disks directly, bypassing RHV APIs. If such a backup application is involved, we cannot support this system. Do you see this issue with the current version (4.4.6)?
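One quick way to look for an external program touching the volume on the host (a sketch; the device path is the one from the report, and this only catches processes that hold the volume open at the moment you check):

~~~
# Sketch: list processes that currently have the volume open.
# Anything besides the qemu-kvm process of this VM would be suspicious.
LV=/dev/583bce8f-fff6-4655-9fff-6ff2e48a0b5c/c19d500a-d990-464e-887e-72bd3fa8e3d2

fuser -v "$(readlink -f "$LV")"   # PIDs holding the device-mapper node open
lsof "$(readlink -f "$LV")"       # same information, with full command names
~~~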
Kevin, how do you recommend handling this case? What info do we need to understand this corruption?
So it looks like at least one of the L2 tables has been overwritten with other data. The L2 entries mentioned in the error messages are completely invalid. I think that for identifying where the corruption came from, our only indication is the kind of data that has been written over the L2 table, i.e. look at a hexdump at the offset where an L2 table should be (here at least 0x3eb50000, possibly others, too). Of course, if it looks like guest data, it's still not clear whether it really comes from the guest or whether an external tool copied it there. With access to the image, we could also use other tools like John's qcheck (https://github.com/jnsnow/qcheck) to analyse the exact shape of the corruption, though in general this can't tell much about its origin.
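For reference, a sketch of the hexdump suggested above, using the L2 offset from the qemu log (0x3eb50000) and the 64 KiB cluster size reported by qemu-img info; the device path is the one from the report, and the image should ideally not be in use while reading it:

~~~
# Sketch: dump the 64 KiB cluster that should contain the L2 table at offset
# 0x3eb50000 and check whether it still looks like L2 entries (8-byte
# big-endian values pointing at 64 KiB-aligned offsets) or like unrelated
# data that overwrote the table.
IMG=/dev/583bce8f-fff6-4655-9fff-6ff2e48a0b5c/c19d500a-d990-464e-887e-72bd3fa8e3d2

dd if="$IMG" bs=64k skip=$((0x3eb50000 / 0x10000)) count=1 status=none | hexdump -C | less
~~~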
Either way, our recommendation is to upgrade to 4.4.6. Decreasing severity accordingly.
*** This bug has been marked as a duplicate of bug 1951507 ***