Bug 1519071
Summary: Fail to rebuild the reference count tables of qcow2 image on host block devices (e.g. LVs)

Product: Red Hat Enterprise Linux 8
Component: qemu-kvm
qemu-kvm sub component: qcow2
Version: 8.5
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Keywords: Reopened, Triaged, ZStream
Target Milestone: rc
Target Release: ---
Reporter: yilzhang
Assignee: Hanna Czenczek <hreitz>
QA Contact: Tingting Mao <timao>
CC: areis, chayang, coli, gveitmic, hreitz, jinzhao, juzhang, kanderso, kkiwi, knoel, kwolf, mtessun, ngu, nsoffer, qzhang, rbalakri, timao, vcojot, virt-maint, xuwei, xzhou, ymankad, zixchen
Fixed In Version: qemu-kvm-6.2.0-13.module+el8.7.0+15131+941fbd8d
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2022-11-08 09:18:32 UTC
Bug Blocks: 2072242, 2072379 (view as bug list)
Description
yilzhang
2017-11-30 03:27:25 UTC
Could you reproduce the issue on x86?

1. An NFS-backed image does not have this issue.
2. x86 and PPC both have this issue.

Version of components for the x86 platform:
Host kernel: 3.10.0-799.el7.x86_64
Guest install iso: RHEL-7.5-20171107.1-Server-x86_64-dvd1.iso
qemu-kvm-rhev: qemu-kvm-rhev-2.10.0-9.el7

Reproduced in:
kernel-3.10.0-798.el7.x86_64
qemu-kvm-rhev-2.10.0-7.el7

set qa-ack+.

(In reply to yilzhang from comment #0)
> Actual results:
> ERROR cluster 69574 refcount=0 reference=1
> ERROR cluster 69575 refcount=0 reference=1
> ERROR cluster 69576 refcount=0 reference=1
> ERROR cluster 69577 refcount=0 reference=1
> ERROR cluster 69578 refcount=0 reference=1
> ERROR cluster 69579 refcount=0 reference=1
> ERROR cluster 69580 refcount=0 reference=1
> ERROR cluster 69581 refcount=0 reference=1
> ERROR cluster 69582 refcount=0 reference=1
> ERROR cluster 69583 refcount=0 reference=1
> ERROR cluster 69584 refcount=0 reference=1
> ERROR cluster 69585 refcount=0 reference=1
> Rebuilding refcount structure
> qemu-img: iSCSI Failure: SENSE KEY:ILLEGAL_REQUEST(5)
> ASCQ:LBA_OUT_OF_RANGE(0x2100)
> ERROR writing refblock: No space left on device
> qemu-img: Check failed: No space left on device
> [Host]# echo $?
> 1

I think the refcount errors are expected with lazy_refcounts in this scenario, but the "ENOSPC" error is suspicious; I don't understand how iSCSI works in this case. Fam should know. ;-)

As shown above, the image is probably fully written, and the iSCSI LUN is therefore also full, hence the ENOSPC error. (Refcount rebuilding needs to allocate new clusters.)

Please test again with a much larger iSCSI LUN (e.g. 10G larger than the qcow2 image size).

(In reply to Fam Zheng from comment #8)
> As shown above, the image is probably fully written, and the iscsi LUN is
> therefore also full, hence the ENOSPC error. (Refcount rebuilding needs to
> allocate new clusters.)
>
> Please test again with a much larger iscsi LUN (e.g 10G larger than the
> qcow2 image size).

Result of re-testing: The iSCSI LUN created on the iSCSI target side is 31G, and I created only a 20G qcow2 image on it (that is, the iSCSI LUN is 11G larger than the qcow2 image). The problem in this bug still exists.

************ Host and qemu info: ************
Host: power9 with kernel 4.14.0-18.el7a.ppc64le
qemu-kvm: qemu-kvm-rhev-2.10.0-13.el7

Step1:
[Host]# qemu-img info iscsi://10.0.0.7/iqn.2017-08.com.yilzhang:libiscsi/2
image: json:{"driver": "raw", "file": {"lun": "2", "portal": "10.0.0.7", "driver": "iscsi", "transport": "tcp", "target": "iqn.2017-08.com.yilzhang:libiscsi"}}
file format: raw
virtual size: 31G (33285996544 bytes)
disk size: unavailable

[Host]# qemu-img create -f qcow2 -o compat=1.1,lazy_refcounts=on iscsi://10.0.0.7/iqn.2017-08.com.yilzhang:libiscsi/2 20G
Formatting 'iscsi://10.0.0.7/iqn.2017-08.com.yilzhang:libiscsi/2', fmt=qcow2 size=21474836480 compat=1.1 cluster_size=65536 lazy_refcounts=on refcount_bits=16
... ...

Step5:
[Host]# qemu-img check -r all iscsi://10.0.0.7/iqn.2017-08.com.yilzhang:libiscsi/2
ERROR cluster 65536 refcount=0 reference=1
... ...
ERROR cluster 72685 refcount=0 reference=1
ERROR cluster 72686 refcount=0 reference=1
ERROR cluster 72687 refcount=0 reference=1
Rebuilding refcount structure
qemu-img: iSCSI Failure: SENSE KEY:ILLEGAL_REQUEST(5) ASCQ:LBA_OUT_OF_RANGE(0x2100)
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device

Reproduced this issue with the qemu-kvm-rhev-2.12.0-9.el7 package.
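As a side note, a rough way to see how much headroom the backing device leaves for a refcount rebuild (which, as noted above, has to allocate new clusters) is to compare the image end offset reported by qemu-img with the device capacity. This is only a sketch; the device path below is an example, and for an iSCSI LUN you would check the LUN size on the target instead:

DEV=/dev/mapper/example-lun                              # hypothetical device path
qemu-img check "$DEV" 2>&1 | grep 'Image end offset'     # last byte currently used by the qcow2
blockdev --getsize64 "$DEV"                              # total capacity of the device
# If the two numbers are close, rebuilding the refcount structure
# (historically placed past the current end of the image) will run out of space.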
Steps:

1. Prepare the libiscsi file on the server
# targetcli
/backstores/fileio> create lun1 /home/iscsi/lun1.img 30G
/backstores/fileio> cd /iscsi/iqn.2018-07.com.example:t1/tpg1/luns/
/iscsi/iqn.20...:t1/tpg1/luns> create /backstores/fileio/lun1

2. Operate on the libiscsi file from the client

2.1 Create a qcow2 image on the libiscsi backend with lazy_refcounts=on
# qemu-img create -f qcow2 -o lazy_refcounts=on,compat=1.1 iscsi://10.66.11.19/iqn.2018-07.com.example:t1/1 30G
Formatting 'iscsi://10.66.11.19/iqn.2018-07.com.example:t1/1', fmt=qcow2 size=32212254720 compat=1.1 cluster_size=65536 lazy_refcounts=on refcount_bits=16

2.2 Install RHEL 7.6 with cache mode writethrough
/usr/libexec/qemu-kvm \
-name 'rhel7.6' \
-machine q35 \
-nodefaults \
-vga qxl \
-drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=writethrough,format=qcow2,file=$1 \
-device virtio-blk-pci,id=virtio_blk_pci0,drive=drive_image1,bus=pcie.0,addr=05 \
-drive id=drive_cd1,if=none,snapshot=off,aio=threads,cache=unsafe,media=cdrom,file=$2 \
-device ide-cd,id=cd1,drive=drive_cd1,bus=ide.0,unit=0 \
-monitor stdio \
-vnc :1 \
-m 8192 \
-smp 8 \
-device virtio-net-pci,mac=9a:b5:b6:b1:b5:b5,id=idMmq1jH,vectors=4,netdev=idxgXAlm,bus=pcie.0,addr=0x9 \
-netdev tap,id=idxgXAlm \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/timao/monitor-qmpmonitor1-20180220-094308-h9I6hRsI,server,nowait \
-mon chardev=qmp_id_qmpmonitor1,mode=control \

2.3 After installation, dd a file in the guest
# dd if=/dev/urandom of=/home/ftest bs=1M count=2048

2.4 Shut down the guest immediately after the dd finishes

2.5 Check the image
# qemu-img check iscsi://10.66.11.19/iqn.2018-07.com.example:t1/1
ERROR cluster 41455 refcount=0 reference=1
ERROR cluster 41456 refcount=0 reference=1
ERROR cluster 41457 refcount=0 reference=1
ERROR cluster 41458 refcount=0 reference=1
ERROR cluster 41459 refcount=0 reference=1
......
......
ERROR OFLAG_COPIED data cluster: l2_entry=80000000b5400000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000000b5410000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000000b5420000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000000b5430000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000000b5440000 refcount=0
9900 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
46388/491520 = 9.44% allocated, 6.63% fragmented, 0.00% compressed clusters
Image end offset: 3041198080

2.6 Repair the image
# qemu-img check -r all iscsi://10.66.11.19/iqn.2018-07.com.example:t1/1
ERROR cluster 41455 refcount=0 reference=1
ERROR cluster 41456 refcount=0 reference=1
ERROR cluster 41457 refcount=0 reference=1
ERROR cluster 41458 refcount=0 reference=1
ERROR cluster 41459 refcount=0 reference=1
......
......
ERROR cluster 46401 refcount=0 reference=1
ERROR cluster 46402 refcount=0 reference=1
ERROR cluster 46403 refcount=0 reference=1
ERROR cluster 46404 refcount=0 reference=1
Rebuilding refcount structure
qemu-img: iSCSI WRITE10/16 failed at lba 62914688: SENSE KEY:ILLEGAL_REQUEST(5) ASCQ:LBA_OUT_OF_RANGE(0x2100)
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device

QEMU has recently been split into sub-components, and as a one-time operation to avoid breaking tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ.
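Regarding steps 2.3 to 2.5 above: the refcount ERRORs after an unclean shutdown are the expected consequence of lazy_refcounts=on, since the image is left with the header's dirty bit set and the refcounts are only reconstructed on the next open or check. A quick read-only way to look at that flag, as a sketch based on my reading of the qcow2 v3 header layout (incompatible_features should be the 8-byte big-endian field at offset 72, with bit 0 being the dirty bit; the path below is an example and would need to be a local file or device, not an iscsi:// URL):

# peek at the incompatible_features field of the qcow2 header
dd if=/dev/loop0 bs=1 skip=72 count=8 2>/dev/null | xxd
# a value ending in ...01 means the dirty bit is set, so `qemu-img check`
# will report refcount=0 clusters like the ones listed above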
Thanks

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

I've found a small VM with a qcow2 that needs fixing:

[root@w6017qarhv05 ~]# qemu-img check /dev/dm-36
ERROR cluster 1132682 refcount=0 reference=1
ERROR OFLAG_COPIED data cluster: l2_entry=80000011488a0000 refcount=0
2 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
1132497/1310720 = 86.40% allocated, 0.20% fragmented, 0.00% compressed clusters
Image end offset: 74231513088

[root@w6017qarhv05 ~]# qemu-img info /dev/dm-36
image: /dev/dm-36
file format: qcow2
virtual size: 80 GiB (85899345920 bytes)
disk size: 0 B
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: true
    refcount bits: 16
    corrupt: false
    extended l2: false

[root@w6017qarhv05 ~]# qemu-img check -r all /dev/dm-36
ERROR cluster 1132682 refcount=0 reference=1
Rebuilding refcount structure
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device

I am currently dumping that LV to a file and will see if it repairs fine when in a filesystem.

[root@w6017qarhv05 ~]# dd if=/dev/dm-36 of=dm36.qcow2 bs=1024k
90112+0 records in
90112+0 records out
94489280512 bytes (94 GB, 88 GiB) copied, 432.391 s, 219 MB/s

[root@w6017qarhv05 ~]# qemu-img check dm36.qcow2
ERROR cluster 1132682 refcount=0 reference=1
ERROR OFLAG_COPIED data cluster: l2_entry=80000011488a0000 refcount=0
2 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
1132497/1310720 = 86.40% allocated, 0.20% fragmented, 0.00% compressed clusters
Image end offset: 74231513088

[root@w6017qarhv05 ~]# qemu-img check -r all dm36.qcow2.orig
ERROR cluster 1132682 refcount=0 reference=1
Rebuilding refcount structure
Repairing cluster 1 refcount=1 reference=0
Repairing cluster 2 refcount=1 reference=0
Repairing cluster 32785 refcount=1 reference=0
Repairing cluster 65537 refcount=1 reference=0
Repairing cluster 98304 refcount=1 reference=0
[....]
Repairing cluster 1114127 refcount=1 reference=0
The following inconsistencies were found and repaired:
    0 leaked clusters
    1 corruptions
Double checking the fixed image now...
No errors were found on the image.
1132497/1310720 = 86.40% allocated, 0.20% fragmented, 0.00% compressed clusters
Image end offset: 94489411584

It's the same data stream, but in the second case qemu-img was able to repair the structure:

[root@w6017qarhv05 ~]# qemu-img check -r all dm36.qcow2
No errors were found on the image.
1132497/1310720 = 86.40% allocated, 0.20% fragmented, 0.00% compressed clusters
Image end offset: 94489411584

Hi, I sent a series last year to fix this, but it didn't get much feedback, so it fizzled out.

So when `qemu-img check` encounters an image with clusters that are in use but not marked as allocated, it completely rebuilds the refcount structure to make sure it won't accidentally overwrite those clusters during the repair. However, it tended to put this new structure past the end of the image file (which seemed like a sensible idea at the time, because that area is definitely free to use), but that can never work when the qcow2 file is on a fixed-size disk volume.

I'm in the process of reworking the patches and sending another version.
Once I do so, I'll try to spin up a Brew build for testing.

I've sent a v2 series upstream: https://lists.nongnu.org/archive/html/qemu-block/2022-03/msg01260.html

And I've created a Brew build for testing: http://brew-task-repos.usersys.redhat.com/repos/scratch/hreitz/qemu-kvm/6.2.0/9.el8.hreitz202203291121/

As a reproducer, I've created a small broken qcow2 image that you can dd to any block device; trying to repair it on the block device is likely to fail:

$ truncate -s 10G device.img
$ sudo losetup --find --show device.img
/dev/loop0
$ sudo dd if=broken.qcow2 of=/dev/loop0
[...]
$ sudo qemu-img check -r all /dev/loop0
ERROR cluster 0 refcount=0 reference=1
Rebuilding refcount structure
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device
$ sudo losetup -d /dev/loop0

With the build posted in comment 48, it works:

$ sudo /path/to/fixed/qemu-img check -r all /dev/loop0
ERROR cluster 0 refcount=0 reference=1
Rebuilding refcount structure
Repairing cluster 1 refcount=1 reference=0
Repairing cluster 2 refcount=1 reference=0
The following inconsistencies were found and repaired:
    0 leaked clusters
    1 corruptions
Double checking the fixed image now...
No errors were found on the image.
128/131072 = 0.10% allocated, 0.00% fragmented, 0.00% compressed clusters
Image end offset: 85504

Created attachment 1868966 [details]
Small example of a broken qcow2 image
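For convenience, the loop-device reproducer from the comment above can be wrapped in a small script. This is just a sketch; the file names and the 10G size are taken from the example above, and broken.qcow2 refers to the attached sample image:

#!/bin/bash
# Reproduce the in-place repair failure on a fixed-size block device.
set -e
truncate -s 10G device.img
LOOP=$(sudo losetup --find --show device.img)
trap 'sudo losetup -d "$LOOP"' EXIT              # always detach the loop device on exit
sudo dd if=broken.qcow2 of="$LOOP" conv=fsync
# Unfixed qemu-img: expected to fail with "ERROR writing refblock: No space left on device".
# Fixed build: should rebuild the refcounts in place and report the image as clean.
sudo qemu-img check -r all "$LOOP"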
Hanna's build correctly notices that images need fixing:

[root@w6017qarhv05 bin]# ./qemu-img check /dev/dm-36
ERROR cluster 1132682 refcount=0 reference=1
ERROR OFLAG_COPIED data cluster: l2_entry=80000011488a0000 refcount=0
2 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
1132497/1310720 = 86.40% allocated, 0.20% fragmented, 0.00% compressed clusters
Image end offset: 74231513088

[root@w6017qarhv05 bin]# ./qemu-img check /dev/dm-37
ERROR cluster 1112588 refcount=0 reference=1
ERROR cluster 1112589 refcount=0 reference=1
ERROR cluster 1112590 refcount=0 reference=1
ERROR cluster 1112591 refcount=0 reference=1
ERROR cluster 1112592 refcount=0 reference=1
ERROR cluster 1112593 refcount=0 reference=1
ERROR cluster 1112594 refcount=0 reference=1
ERROR cluster 1112595 refcount=0 reference=1
ERROR cluster 1112596 refcount=0 reference=1
[....]
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa2d0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa2e0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa2f0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa300000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa310000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa320000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa330000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa340000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa350000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa360000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa370000 refcount=0
132 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
1112469/1310720 = 84.87% allocated, 0.48% fragmented, 0.00% compressed clusters
Image end offset: 72918892544

Hanna's binary build was able to fix it in place!
[root@w6017qarhv05 bin]# ./qemu-img check -r all /dev/dm-36
ERROR cluster 1132682 refcount=0 reference=1
Rebuilding refcount structure
Repairing cluster 1 refcount=1 reference=0
Repairing cluster 2 refcount=1 reference=0
Repairing cluster 32785 refcount=1 reference=0
Repairing cluster 65537 refcount=1 reference=0
Repairing cluster 98304 refcount=1 reference=0
Repairing cluster 131087 refcount=1 reference=0
Repairing cluster 163852 refcount=1 reference=0
Repairing cluster 196621 refcount=1 reference=0
Repairing cluster 229391 refcount=1 reference=0
Repairing cluster 262174 refcount=1 reference=0
Repairing cluster 294913 refcount=1 reference=0
Repairing cluster 327684 refcount=1 reference=0
Repairing cluster 360451 refcount=1 reference=0
Repairing cluster 393219 refcount=1 reference=0
Repairing cluster 425987 refcount=1 reference=0
Repairing cluster 458753 refcount=1 reference=0
Repairing cluster 491522 refcount=1 reference=0
Repairing cluster 524291 refcount=1 reference=0
Repairing cluster 557068 refcount=1 reference=0
Repairing cluster 589824 refcount=1 reference=0
Repairing cluster 622607 refcount=1 reference=0
Repairing cluster 655380 refcount=1 reference=0
Repairing cluster 688129 refcount=1 reference=0
Repairing cluster 720897 refcount=1 reference=0
Repairing cluster 753669 refcount=1 reference=0
Repairing cluster 786458 refcount=1 reference=0
Repairing cluster 819224 refcount=1 reference=0
Repairing cluster 851990 refcount=1 reference=0
Repairing cluster 884763 refcount=1 reference=0
Repairing cluster 917533 refcount=1 reference=0
Repairing cluster 950298 refcount=1 reference=0
Repairing cluster 983071 refcount=1 reference=0
Repairing cluster 1015811 refcount=1 reference=0
Repairing cluster 1048591 refcount=1 reference=0
Repairing cluster 1081369 refcount=1 reference=0
Repairing cluster 1114127 refcount=1 reference=0
The following inconsistencies were found and repaired:
    0 leaked clusters
    1 corruptions
Double checking the fixed image now...
No errors were found on the image.
1132497/1310720 = 86.40% allocated, 0.20% fragmented, 0.00% compressed clusters
Image end offset: 74233872384

[root@w6017qarhv05 bin]# ./qemu-img check -r all /dev/dm-37
ERROR cluster 1112588 refcount=0 reference=1
ERROR cluster 1112589 refcount=0 reference=1
ERROR cluster 1112590 refcount=0 reference=1
ERROR cluster 1112591 refcount=0 reference=1
ERROR cluster 1112592 refcount=0 reference=1
ERROR cluster 1112593 refcount=0 reference=1
ERROR cluster 1112594 refcount=0 reference=1
ERROR cluster 1112595 refcount=0 reference=1
ERROR cluster 1112596 refcount=0 reference=1
ERROR cluster 1112597 refcount=0 reference=1
ERROR cluster 1112598 refcount=0 reference=1
ERROR cluster 1112599 refcount=0 reference=1
ERROR cluster 1112600 refcount=0 reference=1
ERROR cluster 1112601 refcount=0 reference=1
ERROR cluster 1112602 refcount=0 reference=1
ERROR cluster 1112603 refcount=0 reference=1
ERROR cluster 1112604 refcount=0 reference=1
ERROR cluster 1112605 refcount=0 reference=1
ERROR cluster 1112606 refcount=0 reference=1
ERROR cluster 1112607 refcount=0 reference=1
ERROR cluster 1112608 refcount=0 reference=1
ERROR cluster 1112609 refcount=0 reference=1
ERROR cluster 1112610 refcount=0 reference=1
ERROR cluster 1112611 refcount=0 reference=1
ERROR cluster 1112612 refcount=0 reference=1
ERROR cluster 1112613 refcount=0 reference=1
ERROR cluster 1112614 refcount=0 reference=1
ERROR cluster 1112615 refcount=0 reference=1
ERROR cluster 1112616 refcount=0 reference=1
ERROR cluster 1112617 refcount=0 reference=1
ERROR cluster 1112618 refcount=0 reference=1
ERROR cluster 1112619 refcount=0 reference=1
ERROR cluster 1112620 refcount=0 reference=1
ERROR cluster 1112621 refcount=0 reference=1
ERROR cluster 1112622 refcount=0 reference=1
ERROR cluster 1112623 refcount=0 reference=1
ERROR cluster 1112624 refcount=0 reference=1
ERROR cluster 1112625 refcount=0 reference=1
ERROR cluster 1112626 refcount=0 reference=1
ERROR cluster 1112627 refcount=0 reference=1
ERROR cluster 1112628 refcount=0 reference=1
ERROR cluster 1112629 refcount=0 reference=1
ERROR cluster 1112630 refcount=0 reference=1
ERROR cluster 1112631 refcount=0 reference=1
ERROR cluster 1112632 refcount=0 reference=1
ERROR cluster 1112633 refcount=0 reference=1
ERROR cluster 1112634 refcount=0 reference=1
ERROR cluster 1112635 refcount=0 reference=1
ERROR cluster 1112636 refcount=0 reference=1
ERROR cluster 1112637 refcount=0 reference=1
ERROR cluster 1112638 refcount=0 reference=1
ERROR cluster 1112639 refcount=0 reference=1
ERROR cluster 1112640 refcount=0 reference=1
ERROR cluster 1112641 refcount=0 reference=1
ERROR cluster 1112642 refcount=0 reference=1
ERROR cluster 1112643 refcount=0 reference=1
ERROR cluster 1112644 refcount=0 reference=1
ERROR cluster 1112645 refcount=0 reference=1
ERROR cluster 1112646 refcount=0 reference=1
ERROR cluster 1112647 refcount=0 reference=1
ERROR cluster 1112648 refcount=0 reference=1
ERROR cluster 1112649 refcount=0 reference=1
ERROR cluster 1112650 refcount=0 reference=1
ERROR cluster 1112651 refcount=0 reference=1
ERROR cluster 1112652 refcount=0 reference=1
ERROR cluster 1112653 refcount=0 reference=1
Rebuilding refcount structure
Repairing cluster 1 refcount=1 reference=0
Repairing cluster 2 refcount=1 reference=0
Repairing cluster 32794 refcount=1 reference=0
Repairing cluster 65549 refcount=1 reference=0
Repairing cluster 98311 refcount=1 reference=0
Repairing cluster 131081 refcount=1 reference=0
Repairing cluster 163845 refcount=1 reference=0
Repairing cluster 196622 refcount=1 reference=0
Repairing cluster 229398 refcount=1 reference=0
Repairing cluster 262144 refcount=1 reference=0
Repairing cluster 294934 refcount=1 reference=0
Repairing cluster 327690 refcount=1 reference=0
Repairing cluster 360463 refcount=1 reference=0
Repairing cluster 393244 refcount=1 reference=0
Repairing cluster 426002 refcount=1 reference=0
Repairing cluster 458764 refcount=1 reference=0
Repairing cluster 491542 refcount=1 reference=0
Repairing cluster 524292 refcount=1 reference=0
Repairing cluster 557067 refcount=1 reference=0
Repairing cluster 589832 refcount=1 reference=0
Repairing cluster 622593 refcount=1 reference=0
Repairing cluster 655385 refcount=1 reference=0
Repairing cluster 688153 refcount=1 reference=0
Repairing cluster 720926 refcount=1 reference=0
Repairing cluster 753685 refcount=1 reference=0
Repairing cluster 786445 refcount=1 reference=0
Repairing cluster 819201 refcount=1 reference=0
Repairing cluster 851985 refcount=1 reference=0
Repairing cluster 884737 refcount=1 reference=0
Repairing cluster 917509 refcount=1 reference=0
Repairing cluster 950301 refcount=1 reference=0
Repairing cluster 983070 refcount=1 reference=0
Repairing cluster 1015812 refcount=1 reference=0
Repairing cluster 1048604 refcount=1 reference=0
Repairing cluster 1081356 refcount=1 reference=0
The following inconsistencies were found and repaired:
    0 leaked clusters
    66 corruptions
Double checking the fixed image now...
No errors were found on the image.
1112469/1310720 = 84.87% allocated, 0.48% fragmented, 0.00% compressed clusters
Image end offset: 72921186304

[root@w6017qarhv05 bin]# ./qemu-img check /dev/dm-37
No errors were found on the image.
1112469/1310720 = 84.87% allocated, 0.48% fragmented, 0.00% compressed clusters
Image end offset: 72921186304
[root@w6017qarhv05 bin]#

This bug is important for RHV. When "qemu-img check -r" cannot repair thin disks on block storage, the only way to recover is to copy the disk to other storage, repair it there, and copy it back to the original volume. With big disks (1.6 TB in the last incident) this is very slow and requires downtime.

Nir, which are the relevant versions for RHV? 8.5 is long done and 8.6 is only for blockers at this point, so I'm moving it forward to the next release (and the virt_storage pool so it's on our radar), but if necessary, I imagine z-stream could be done. Hanna, we'll probably also need a RHEL 9 clone if it doesn't exist yet.

(In reply to Kevin Wolf from comment #62)
> Nir, which are the relevant versions for RHV? 8.5 is long done and 8.6 is
> only for blockers at this point, so I'm moving it forward to the next
> release (and the virt_storage pool so it's on our radar), but if necessary,
> I imagine z-stream could be done.

I don't know which version is relevant for the first reporter, but Vincent reopened the bug while handling an incident in RHV 4.4.6 (RHEL 8.5). Having a fix in 8.6.z sounds good to me.

QE bot (pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Verified this bug as below:

Tested with:
qemu-kvm-6.2.0-13.module+el8.7.0+15131+941fbd8d
kernel-4.18.0-384.el8.x86_64

Steps:

1. Prepare an LV as below
# qemu-img create -f raw loop.img 50G
# losetup /dev/loop1 /home/timao/test/loop.img
# pvcreate /dev/loop1
# vgcreate vgroup /dev/loop1
# lvcreate -L 30G -n lv vgroup

2. Convert an already-installed (known-good) qcow2 image to the LV above
# qemu-img check -r all RHEL-8.6-x86_64-latest.qcow2
No errors were found on the image.
31054/163840 = 18.95% allocated, 92.09% fragmented, 90.65% compressed clusters
Image end offset: 966262784

# qemu-img convert -f qcow2 -O qcow2 -o lazy_refcounts=on,compat=1.1 RHEL-8.6-x86_64-latest.qcow2 /dev/vgroup/lv -p

# qemu-img info /dev/vgroup/lv
image: /dev/vgroup/lv
file format: qcow2
virtual size: 10 GiB (10737418240 bytes)
disk size: 0 B
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: true
    refcount bits: 16
    corrupt: false
    extended l2: false

3. Boot up a guest from the LV
# /usr/libexec/qemu-kvm \
-S \
-name 'avocado-vt-vm1' \
-sandbox on \
-machine q35 \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
-nodefaults \
-device VGA,bus=pcie.0,addr=0x2 \
-m 15360 \
-smp 16,maxcpus=16,cores=8,threads=1,dies=1,sockets=2 \
-cpu 'Haswell-noTSX',+kvm_pv_unhalt \
-device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
-device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-object iothread,id=iothread0 \
-object iothread,id=iothread1 \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device virtio-net-pci,mac=9a:1c:0c:0d:e3:4c,id=idjmZXQS,netdev=idEFQ4i1,bus=pcie-root-port-3,addr=0x0 \
-netdev tap,id=idEFQ4i1,vhost=on \
-vnc :0 \
-rtc base=utc,clock=host,driftfix=slew \
-boot menu=off,order=cdn,once=c,strict=off \
-enable-kvm \
-monitor stdio \
-device pcie-root-port,id=pcie-root-port-5,port=0x6,addr=0x1.0x5,bus=pcie.0,chassis=5 \
-device virtio-scsi-pci,id=virtio_scsi_pci2,bus=pcie-root-port-5,addr=0x0 \
-blockdev node-name=file_image1,driver=host_device,auto-read-only=on,discard=unmap,aio=threads,filename=/dev/vgroup/lv,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=off,cache.no-flush=off,file=file_image1 \
-device scsi-hd,id=image1,drive=drive_image1,write-cache=off \
-chardev socket,server=on,path=/var/tmp/monitor-qmpmonitor1-20210721-024113-AsZ7KYro,id=qmp_id_qmpmonitor1,wait=off \
-mon chardev=qmp_id_qmpmonitor1,mode=control \

4. dd a file and get its md5 value inside the guest, with sync
(guest)# dd if=/dev/urandom of=file1 conv=fsync bs=1M count=512 ; md5sum file1 ; sync

5. Kill the qemu-kvm process on the host immediately after the dd finishes in the previous step
# kill -9 `pidof qemu-kvm`

6. Check the LV image file
# qemu-img check -r all /dev/vgroup/lv
ERROR cluster 77672 refcount=0 reference=1
ERROR cluster 77673 refcount=0 reference=1
ERROR cluster 77674 refcount=0 reference=1
ERROR cluster 77675 refcount=0 reference=1
ERROR cluster 77676 refcount=0 reference=1
......
......
ERROR cluster 83248 refcount=0 reference=1
ERROR cluster 83249 refcount=0 reference=1
ERROR cluster 83250 refcount=0 reference=1
Rebuilding refcount structure
Repairing cluster 75054 refcount=1 reference=0
Repairing cluster 75055 refcount=1 reference=0
Repairing cluster 75056 refcount=1 reference=0
Repairing cluster 75057 refcount=1 reference=0
The following inconsistencies were found and repaired:
    0 leaked clusters
    5579 corruptions
Double checking the fixed image now...
No errors were found on the image.
83225/327680 = 25.40% allocated, 0.34% fragmented, 0.00% compressed clusters
Image end offset: 5455937536

Results: As above, 'check -r all' fixed the image.
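For completeness, on hosts that do not yet have the fixed build, the copy-out/repair/copy-back workaround described in the earlier comments could look roughly like this. This is only a sketch: the device and file names are examples, and since the repaired image can end up slightly larger than the original, the size check before copying back matters:

LV=/dev/vgroup/lv                      # example LV holding the broken qcow2
TMP=/var/tmp/lv-copy.qcow2             # scratch file on a filesystem with room to grow
dd if="$LV" of="$TMP" bs=1M conv=fsync
qemu-img check -r all "$TMP"           # the rebuild succeeds here because the file can grow
stat -c %s "$TMP"                      # size of the repaired image...
blockdev --getsize64 "$LV"             # ...must still fit within the LV before copying back
dd if="$TMP" of="$LV" bs=1M conv=fsync
qemu-img check "$LV"                   # confirm the on-device copy is now clean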
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Low: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7472