Bug 1519071
Summary: Fail to rebuild the reference count tables of qcow2 image on host block devices (e.g. LVs)

Product: Red Hat Enterprise Linux 8
Component: qemu-kvm
qemu-kvm sub component: qcow2
Version: 8.5
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Keywords: Reopened, Triaged, ZStream
Target Milestone: rc
Target Release: ---
Reporter: yilzhang
Assignee: Hanna Czenczek <hreitz>
QA Contact: Tingting Mao <timao>
CC: areis, chayang, coli, gveitmic, hreitz, jinzhao, juzhang, kanderso, kkiwi, knoel, kwolf, mtessun, ngu, nsoffer, qzhang, rbalakri, timao, vcojot, virt-maint, xuwei, xzhou, ymankad, zixchen
Fixed In Version: qemu-kvm-6.2.0-13.module+el8.7.0+15131+941fbd8d
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2022-11-08 09:18:32 UTC
Bug Blocks: 2072242, 2072379 (view as bug list)
Description
yilzhang
2017-11-30 03:27:25 UTC
Could you reproduce the issue on x86?

1. An NFS-backed image does not have this issue.
2. x86 and PPC both have this issue.

Version of components for the x86 platform:
Host kernel: 3.10.0-799.el7.x86_64
Guest install iso: RHEL-7.5-20171107.1-Server-x86_64-dvd1.iso
qemu-kvm-rhev: qemu-kvm-rhev-2.10.0-9.el7

Reproduced in:
kernel-3.10.0-798.el7.x86_64
qemu-kvm-rhev-2.10.0-7.el7

set qa-ack+.

(In reply to yilzhang from comment #0)
> Actual results:
> ERROR cluster 69574 refcount=0 reference=1
> ERROR cluster 69575 refcount=0 reference=1
> ERROR cluster 69576 refcount=0 reference=1
> ERROR cluster 69577 refcount=0 reference=1
> ERROR cluster 69578 refcount=0 reference=1
> ERROR cluster 69579 refcount=0 reference=1
> ERROR cluster 69580 refcount=0 reference=1
> ERROR cluster 69581 refcount=0 reference=1
> ERROR cluster 69582 refcount=0 reference=1
> ERROR cluster 69583 refcount=0 reference=1
> ERROR cluster 69584 refcount=0 reference=1
> ERROR cluster 69585 refcount=0 reference=1
> Rebuilding refcount structure
> qemu-img: iSCSI Failure: SENSE KEY:ILLEGAL_REQUEST(5)
> ASCQ:LBA_OUT_OF_RANGE(0x2100)
> ERROR writing refblock: No space left on device
> qemu-img: Check failed: No space left on device
> [Host]# echo $?
> 1

I think the refcount errors are expected with lazy_refcounts in this scenario, but the "ENOSPC" error is suspicious; I don't understand how iSCSI works in this case. Fam should know. ;-)

As shown above, the image is probably fully written, and the iSCSI LUN is therefore also full, hence the ENOSPC error. (Refcount rebuilding needs to allocate new clusters.)

Please test again with a much larger iSCSI LUN (e.g. 10G larger than the qcow2 image size).

(In reply to Fam Zheng from comment #8)
> As shown above, the image is probably fully written, and the iscsi LUN is
> therefore also full, hence the ENOSPC error. (Refcount rebuilding needs to
> allocate new clusters.)
>
> Please test again with a much larger iscsi LUN (e.g 10G larger than the
> qcow2 image size).

Result of re-testing: The iSCSI LUN created on the iSCSI target side is 31G, and I created only a 20G qcow2 image on it (that is, the iSCSI LUN is 11G larger than the qcow2 image). The problem in this bug still exists.

************ Host and qemu info: ************
Host: power9 with kernel 4.14.0-18.el7a.ppc64le
qemu-kvm: qemu-kvm-rhev-2.10.0-13.el7

Step1:
[Host]# qemu-img info iscsi://10.0.0.7/iqn.2017-08.com.yilzhang:libiscsi/2
image: json:{"driver": "raw", "file": {"lun": "2", "portal": "10.0.0.7", "driver": "iscsi", "transport": "tcp", "target": "iqn.2017-08.com.yilzhang:libiscsi"}}
file format: raw
virtual size: 31G (33285996544 bytes)
disk size: unavailable

[Host]# qemu-img create -f qcow2 -o compat=1.1,lazy_refcounts=on iscsi://10.0.0.7/iqn.2017-08.com.yilzhang:libiscsi/2 20G
Formatting 'iscsi://10.0.0.7/iqn.2017-08.com.yilzhang:libiscsi/2', fmt=qcow2 size=21474836480 compat=1.1 cluster_size=65536 lazy_refcounts=on refcount_bits=16
... ...

Step5:
[Host]# qemu-img check -r all iscsi://10.0.0.7/iqn.2017-08.com.yilzhang:libiscsi/2
ERROR cluster 65536 refcount=0 reference=1
... ...
ERROR cluster 72685 refcount=0 reference=1
ERROR cluster 72686 refcount=0 reference=1
ERROR cluster 72687 refcount=0 reference=1
Rebuilding refcount structure
qemu-img: iSCSI Failure: SENSE KEY:ILLEGAL_REQUEST(5) ASCQ:LBA_OUT_OF_RANGE(0x2100)
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device

Reproduced this issue with the qemu-kvm-rhev-2.12.0-9.el7 package.
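As a side note, a rough way to see how much headroom the backing device leaves for a refcount rebuild (which, as noted above, has to allocate new clusters) is to compare the image end offset reported by qemu-img with the device capacity. This is only a sketch; the device path below is an example, and for an iSCSI LUN you would check the LUN size on the target instead:

DEV=/dev/mapper/example-lun                              # hypothetical device path
qemu-img check "$DEV" 2>&1 | grep 'Image end offset'     # last byte currently used by the qcow2
blockdev --getsize64 "$DEV"                              # total capacity of the device
# If the two numbers are close, rebuilding the refcount structure
# (historically placed past the current end of the image) will run out of space.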
Steps:

1. Prepare the libiscsi file on the server
# targetcli
/backstores/fileio> create lun1 /home/iscsi/lun1.img 30G
/backstores/fileio> cd /iscsi/iqn.2018-07.com.example:t1/tpg1/luns/
/iscsi/iqn.20...:t1/tpg1/luns> create /backstores/fileio/lun1

2. Operate on the libiscsi file from the client

2.1 Create a qcow2 image on the libiscsi backend with lazy_refcounts=on
# qemu-img create -f qcow2 -o lazy_refcounts=on,compat=1.1 iscsi://10.66.11.19/iqn.2018-07.com.example:t1/1 30G
Formatting 'iscsi://10.66.11.19/iqn.2018-07.com.example:t1/1', fmt=qcow2 size=32212254720 compat=1.1 cluster_size=65536 lazy_refcounts=on refcount_bits=16

2.2 Install RHEL 7.6 with cache mode writethrough
/usr/libexec/qemu-kvm \
-name 'rhel7.6' \
-machine q35 \
-nodefaults \
-vga qxl \
-drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=writethrough,format=qcow2,file=$1 \
-device virtio-blk-pci,id=virtio_blk_pci0,drive=drive_image1,bus=pcie.0,addr=05 \
-drive id=drive_cd1,if=none,snapshot=off,aio=threads,cache=unsafe,media=cdrom,file=$2 \
-device ide-cd,id=cd1,drive=drive_cd1,bus=ide.0,unit=0 \
-monitor stdio \
-vnc :1 \
-m 8192 \
-smp 8 \
-device virtio-net-pci,mac=9a:b5:b6:b1:b5:b5,id=idMmq1jH,vectors=4,netdev=idxgXAlm,bus=pcie.0,addr=0x9 \
-netdev tap,id=idxgXAlm \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/timao/monitor-qmpmonitor1-20180220-094308-h9I6hRsI,server,nowait \
-mon chardev=qmp_id_qmpmonitor1,mode=control \

2.3 After installation, dd a file in the guest
# dd if=/dev/urandom of=/home/ftest bs=1M count=2048

2.4 Shut down the guest immediately after the dd finishes

2.5 Check the image
# qemu-img check iscsi://10.66.11.19/iqn.2018-07.com.example:t1/1
ERROR cluster 41455 refcount=0 reference=1
ERROR cluster 41456 refcount=0 reference=1
ERROR cluster 41457 refcount=0 reference=1
ERROR cluster 41458 refcount=0 reference=1
ERROR cluster 41459 refcount=0 reference=1
......
......
ERROR OFLAG_COPIED data cluster: l2_entry=80000000b5400000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000000b5410000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000000b5420000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000000b5430000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000000b5440000 refcount=0
9900 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
46388/491520 = 9.44% allocated, 6.63% fragmented, 0.00% compressed clusters
Image end offset: 3041198080

2.6 Repair the image
# qemu-img check -r all iscsi://10.66.11.19/iqn.2018-07.com.example:t1/1
ERROR cluster 41455 refcount=0 reference=1
ERROR cluster 41456 refcount=0 reference=1
ERROR cluster 41457 refcount=0 reference=1
ERROR cluster 41458 refcount=0 reference=1
ERROR cluster 41459 refcount=0 reference=1
......
......
ERROR cluster 46401 refcount=0 reference=1
ERROR cluster 46402 refcount=0 reference=1
ERROR cluster 46403 refcount=0 reference=1
ERROR cluster 46404 refcount=0 reference=1
Rebuilding refcount structure
qemu-img: iSCSI WRITE10/16 failed at lba 62914688: SENSE KEY:ILLEGAL_REQUEST(5) ASCQ:LBA_OUT_OF_RANGE(0x2100)
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device

QEMU has recently been split into sub-components, and as a one-time operation to avoid breaking tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ.
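Regarding steps 2.3 to 2.5 above: the refcount ERRORs after an unclean shutdown are the expected consequence of lazy_refcounts=on, since the image is left with the header's dirty bit set and the refcounts are only reconstructed on the next open or check. A quick read-only way to look at that flag, as a sketch based on my reading of the qcow2 v3 header layout (incompatible_features should be the 8-byte big-endian field at offset 72, with bit 0 being the dirty bit; the path below is an example and would need to be a local file or device, not an iscsi:// URL):

# peek at the incompatible_features field of the qcow2 header
dd if=/dev/loop0 bs=1 skip=72 count=8 2>/dev/null | xxd
# a value ending in ...01 means the dirty bit is set, so `qemu-img check`
# will report refcount=0 clusters like the ones listed above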
Thanks

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

I've found a small VM with a qcow2 that needs fixing:

[root@w6017qarhv05 ~]# qemu-img check /dev/dm-36
ERROR cluster 1132682 refcount=0 reference=1
ERROR OFLAG_COPIED data cluster: l2_entry=80000011488a0000 refcount=0
2 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
1132497/1310720 = 86.40% allocated, 0.20% fragmented, 0.00% compressed clusters
Image end offset: 74231513088

[root@w6017qarhv05 ~]# qemu-img info /dev/dm-36
image: /dev/dm-36
file format: qcow2
virtual size: 80 GiB (85899345920 bytes)
disk size: 0 B
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: true
    refcount bits: 16
    corrupt: false
    extended l2: false

[root@w6017qarhv05 ~]# qemu-img check -r all /dev/dm-36
ERROR cluster 1132682 refcount=0 reference=1
Rebuilding refcount structure
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device

I am currently dumping that LV to a file and will see if it repairs fine when in a filesystem.

[root@w6017qarhv05 ~]# dd if=/dev/dm-36 of=dm36.qcow2 bs=1024k
90112+0 records in
90112+0 records out
94489280512 bytes (94 GB, 88 GiB) copied, 432.391 s, 219 MB/s

[root@w6017qarhv05 ~]# qemu-img check dm36.qcow2
ERROR cluster 1132682 refcount=0 reference=1
ERROR OFLAG_COPIED data cluster: l2_entry=80000011488a0000 refcount=0
2 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
1132497/1310720 = 86.40% allocated, 0.20% fragmented, 0.00% compressed clusters
Image end offset: 74231513088

[root@w6017qarhv05 ~]# qemu-img check -r all dm36.qcow2.orig
ERROR cluster 1132682 refcount=0 reference=1
Rebuilding refcount structure
Repairing cluster 1 refcount=1 reference=0
Repairing cluster 2 refcount=1 reference=0
Repairing cluster 32785 refcount=1 reference=0
Repairing cluster 65537 refcount=1 reference=0
Repairing cluster 98304 refcount=1 reference=0
[....]
Repairing cluster 1114127 refcount=1 reference=0
The following inconsistencies were found and repaired:
    0 leaked clusters
    1 corruptions
Double checking the fixed image now...
No errors were found on the image.
1132497/1310720 = 86.40% allocated, 0.20% fragmented, 0.00% compressed clusters
Image end offset: 94489411584

It's the same data stream, but in the second case qemu-img was able to repair the structure:

[root@w6017qarhv05 ~]# qemu-img check -r all dm36.qcow2
No errors were found on the image.
1132497/1310720 = 86.40% allocated, 0.20% fragmented, 0.00% compressed clusters
Image end offset: 94489411584

Hi, I sent a series last year to fix this, but it didn't get much feedback, so it fizzled out.

So when `qemu-img check` encounters an image with clusters that are in use but not marked as allocated, it completely rebuilds the refcount structure to make sure it won't accidentally overwrite those clusters during the repair. However, it tended to put this new structure past the end of the image file (which seemed like a sensible idea at the time, because that area is definitely free to use), but that can never work when the qcow2 file is on a fixed-size disk volume.

I'm in the process of reworking the patches and sending another version.
Once I do so, I'll try to spin up a Brew build for testing.

I've sent a v2 series upstream: https://lists.nongnu.org/archive/html/qemu-block/2022-03/msg01260.html

And I've created a Brew build for testing: http://brew-task-repos.usersys.redhat.com/repos/scratch/hreitz/qemu-kvm/6.2.0/9.el8.hreitz202203291121/

As a reproducer, I've created a small broken qcow2 image that you can dd to any block device; trying to repair it on the block device is likely to fail:

$ truncate -s 10G device.img
$ sudo losetup --find --show device.img
/dev/loop0
$ sudo dd if=broken.qcow2 of=/dev/loop0
[...]
$ sudo qemu-img check -r all /dev/loop0
ERROR cluster 0 refcount=0 reference=1
Rebuilding refcount structure
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device
$ sudo losetup -d /dev/loop0

With the build posted in comment 48, it works:

$ sudo /path/to/fixed/qemu-img check -r all /dev/loop0
ERROR cluster 0 refcount=0 reference=1
Rebuilding refcount structure
Repairing cluster 1 refcount=1 reference=0
Repairing cluster 2 refcount=1 reference=0
The following inconsistencies were found and repaired:
    0 leaked clusters
    1 corruptions
Double checking the fixed image now...
No errors were found on the image.
128/131072 = 0.10% allocated, 0.00% fragmented, 0.00% compressed clusters
Image end offset: 85504

Created attachment 1868966 [details]
Small example of a broken qcow2 image
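For convenience, the loop-device reproducer from the comment above can be wrapped in a small script. This is just a sketch; the file names and the 10G size are taken from the example above, and broken.qcow2 refers to the attached sample image:

#!/bin/bash
# Reproduce the in-place repair failure on a fixed-size block device.
set -e
truncate -s 10G device.img
LOOP=$(sudo losetup --find --show device.img)
trap 'sudo losetup -d "$LOOP"' EXIT              # always detach the loop device on exit
sudo dd if=broken.qcow2 of="$LOOP" conv=fsync
# Unfixed qemu-img: expected to fail with "ERROR writing refblock: No space left on device".
# Fixed build: should rebuild the refcounts in place and report the image as clean.
sudo qemu-img check -r all "$LOOP"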
Hanna's build correctly notices that images need fixing:

[root@w6017qarhv05 bin]# ./qemu-img check /dev/dm-36
ERROR cluster 1132682 refcount=0 reference=1
ERROR OFLAG_COPIED data cluster: l2_entry=80000011488a0000 refcount=0
2 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
1132497/1310720 = 86.40% allocated, 0.20% fragmented, 0.00% compressed clusters
Image end offset: 74231513088

[root@w6017qarhv05 bin]# ./qemu-img check /dev/dm-37
ERROR cluster 1112588 refcount=0 reference=1
ERROR cluster 1112589 refcount=0 reference=1
ERROR cluster 1112590 refcount=0 reference=1
ERROR cluster 1112591 refcount=0 reference=1
ERROR cluster 1112592 refcount=0 reference=1
ERROR cluster 1112593 refcount=0 reference=1
ERROR cluster 1112594 refcount=0 reference=1
ERROR cluster 1112595 refcount=0 reference=1
ERROR cluster 1112596 refcount=0 reference=1
[....]
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa2d0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa2e0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa2f0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa300000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa310000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa320000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa330000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa340000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa350000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa360000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000010fa370000 refcount=0
132 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
1112469/1310720 = 84.87% allocated, 0.48% fragmented, 0.00% compressed clusters
Image end offset: 72918892544

Hanna's binary build was able to fix it in place!
[root@w6017qarhv05 bin]# ./qemu-img check -r all /dev/dm-36
ERROR cluster 1132682 refcount=0 reference=1
Rebuilding refcount structure
Repairing cluster 1 refcount=1 reference=0
Repairing cluster 2 refcount=1 reference=0
Repairing cluster 32785 refcount=1 reference=0
Repairing cluster 65537 refcount=1 reference=0
Repairing cluster 98304 refcount=1 reference=0
Repairing cluster 131087 refcount=1 reference=0
Repairing cluster 163852 refcount=1 reference=0
Repairing cluster 196621 refcount=1 reference=0
Repairing cluster 229391 refcount=1 reference=0
Repairing cluster 262174 refcount=1 reference=0
Repairing cluster 294913 refcount=1 reference=0
Repairing cluster 327684 refcount=1 reference=0
Repairing cluster 360451 refcount=1 reference=0
Repairing cluster 393219 refcount=1 reference=0
Repairing cluster 425987 refcount=1 reference=0
Repairing cluster 458753 refcount=1 reference=0
Repairing cluster 491522 refcount=1 reference=0
Repairing cluster 524291 refcount=1 reference=0
Repairing cluster 557068 refcount=1 reference=0
Repairing cluster 589824 refcount=1 reference=0
Repairing cluster 622607 refcount=1 reference=0
Repairing cluster 655380 refcount=1 reference=0
Repairing cluster 688129 refcount=1 reference=0
Repairing cluster 720897 refcount=1 reference=0
Repairing cluster 753669 refcount=1 reference=0
Repairing cluster 786458 refcount=1 reference=0
Repairing cluster 819224 refcount=1 reference=0
Repairing cluster 851990 refcount=1 reference=0
Repairing cluster 884763 refcount=1 reference=0
Repairing cluster 917533 refcount=1 reference=0
Repairing cluster 950298 refcount=1 reference=0
Repairing cluster 983071 refcount=1 reference=0
Repairing cluster 1015811 refcount=1 reference=0
Repairing cluster 1048591 refcount=1 reference=0
Repairing cluster 1081369 refcount=1 reference=0
Repairing cluster 1114127 refcount=1 reference=0
The following inconsistencies were found and repaired:
    0 leaked clusters
    1 corruptions
Double checking the fixed image now...
No errors were found on the image.
1132497/1310720 = 86.40% allocated, 0.20% fragmented, 0.00% compressed clusters
Image end offset: 74233872384

[root@w6017qarhv05 bin]# ./qemu-img check -r all /dev/dm-37
ERROR cluster 1112588 refcount=0 reference=1
ERROR cluster 1112589 refcount=0 reference=1
ERROR cluster 1112590 refcount=0 reference=1
ERROR cluster 1112591 refcount=0 reference=1
ERROR cluster 1112592 refcount=0 reference=1
ERROR cluster 1112593 refcount=0 reference=1
ERROR cluster 1112594 refcount=0 reference=1
ERROR cluster 1112595 refcount=0 reference=1
ERROR cluster 1112596 refcount=0 reference=1
ERROR cluster 1112597 refcount=0 reference=1
ERROR cluster 1112598 refcount=0 reference=1
ERROR cluster 1112599 refcount=0 reference=1
ERROR cluster 1112600 refcount=0 reference=1
ERROR cluster 1112601 refcount=0 reference=1
ERROR cluster 1112602 refcount=0 reference=1
ERROR cluster 1112603 refcount=0 reference=1
ERROR cluster 1112604 refcount=0 reference=1
ERROR cluster 1112605 refcount=0 reference=1
ERROR cluster 1112606 refcount=0 reference=1
ERROR cluster 1112607 refcount=0 reference=1
ERROR cluster 1112608 refcount=0 reference=1
ERROR cluster 1112609 refcount=0 reference=1
ERROR cluster 1112610 refcount=0 reference=1
ERROR cluster 1112611 refcount=0 reference=1
ERROR cluster 1112612 refcount=0 reference=1
ERROR cluster 1112613 refcount=0 reference=1
ERROR cluster 1112614 refcount=0 reference=1
ERROR cluster 1112615 refcount=0 reference=1
ERROR cluster 1112616 refcount=0 reference=1
ERROR cluster 1112617 refcount=0 reference=1
ERROR cluster 1112618 refcount=0 reference=1
ERROR cluster 1112619 refcount=0 reference=1
ERROR cluster 1112620 refcount=0 reference=1
ERROR cluster 1112621 refcount=0 reference=1
ERROR cluster 1112622 refcount=0 reference=1
ERROR cluster 1112623 refcount=0 reference=1
ERROR cluster 1112624 refcount=0 reference=1
ERROR cluster 1112625 refcount=0 reference=1
ERROR cluster 1112626 refcount=0 reference=1
ERROR cluster 1112627 refcount=0 reference=1
ERROR cluster 1112628 refcount=0 reference=1
ERROR cluster 1112629 refcount=0 reference=1
ERROR cluster 1112630 refcount=0 reference=1
ERROR cluster 1112631 refcount=0 reference=1
ERROR cluster 1112632 refcount=0 reference=1
ERROR cluster 1112633 refcount=0 reference=1
ERROR cluster 1112634 refcount=0 reference=1
ERROR cluster 1112635 refcount=0 reference=1
ERROR cluster 1112636 refcount=0 reference=1
ERROR cluster 1112637 refcount=0 reference=1
ERROR cluster 1112638 refcount=0 reference=1
ERROR cluster 1112639 refcount=0 reference=1
ERROR cluster 1112640 refcount=0 reference=1
ERROR cluster 1112641 refcount=0 reference=1
ERROR cluster 1112642 refcount=0 reference=1
ERROR cluster 1112643 refcount=0 reference=1
ERROR cluster 1112644 refcount=0 reference=1
ERROR cluster 1112645 refcount=0 reference=1
ERROR cluster 1112646 refcount=0 reference=1
ERROR cluster 1112647 refcount=0 reference=1
ERROR cluster 1112648 refcount=0 reference=1
ERROR cluster 1112649 refcount=0 reference=1
ERROR cluster 1112650 refcount=0 reference=1
ERROR cluster 1112651 refcount=0 reference=1
ERROR cluster 1112652 refcount=0 reference=1
ERROR cluster 1112653 refcount=0 reference=1
Rebuilding refcount structure
Repairing cluster 1 refcount=1 reference=0
Repairing cluster 2 refcount=1 reference=0
Repairing cluster 32794 refcount=1 reference=0
Repairing cluster 65549 refcount=1 reference=0
Repairing cluster 98311 refcount=1 reference=0
Repairing cluster 131081 refcount=1 reference=0
Repairing cluster 163845 refcount=1 reference=0
Repairing cluster 196622 refcount=1 reference=0
Repairing cluster 229398 refcount=1 reference=0
Repairing cluster 262144 refcount=1 reference=0
Repairing cluster 294934 refcount=1 reference=0
Repairing cluster 327690 refcount=1 reference=0
Repairing cluster 360463 refcount=1 reference=0
Repairing cluster 393244 refcount=1 reference=0
Repairing cluster 426002 refcount=1 reference=0
Repairing cluster 458764 refcount=1 reference=0
Repairing cluster 491542 refcount=1 reference=0
Repairing cluster 524292 refcount=1 reference=0
Repairing cluster 557067 refcount=1 reference=0
Repairing cluster 589832 refcount=1 reference=0
Repairing cluster 622593 refcount=1 reference=0
Repairing cluster 655385 refcount=1 reference=0
Repairing cluster 688153 refcount=1 reference=0
Repairing cluster 720926 refcount=1 reference=0
Repairing cluster 753685 refcount=1 reference=0
Repairing cluster 786445 refcount=1 reference=0
Repairing cluster 819201 refcount=1 reference=0
Repairing cluster 851985 refcount=1 reference=0
Repairing cluster 884737 refcount=1 reference=0
Repairing cluster 917509 refcount=1 reference=0
Repairing cluster 950301 refcount=1 reference=0
Repairing cluster 983070 refcount=1 reference=0
Repairing cluster 1015812 refcount=1 reference=0
Repairing cluster 1048604 refcount=1 reference=0
Repairing cluster 1081356 refcount=1 reference=0
The following inconsistencies were found and repaired:
    0 leaked clusters
    66 corruptions
Double checking the fixed image now...
No errors were found on the image.
1112469/1310720 = 84.87% allocated, 0.48% fragmented, 0.00% compressed clusters
Image end offset: 72921186304

[root@w6017qarhv05 bin]# ./qemu-img check /dev/dm-37
No errors were found on the image.
1112469/1310720 = 84.87% allocated, 0.48% fragmented, 0.00% compressed clusters
Image end offset: 72921186304
[root@w6017qarhv05 bin]#

This bug is important for RHV. When "qemu-img check -r" cannot repair thin disks on block storage, the only way to recover is to copy the disk to other storage, repair it there, and copy it back to the original volume. With big disks (1.6 TB in the last incident) this is very slow and requires downtime.

Nir, which are the relevant versions for RHV? 8.5 is long done and 8.6 is only for blockers at this point, so I'm moving it forward to the next release (and the virt_storage pool so it's on our radar), but if necessary, I imagine z-stream could be done. Hanna, we'll probably also need a RHEL 9 clone if it doesn't exist yet.

(In reply to Kevin Wolf from comment #62)
> Nir, which are the relevant versions for RHV? 8.5 is long done and 8.6 is
> only for blockers at this point, so I'm moving it forward to the next
> release (and the virt_storage pool so it's on our radar), but if necessary,
> I imagine z-stream could be done.

I don't know which version is relevant for the first reporter, but Vincent reopened the bug while handling an incident in RHV 4.4.6 (RHEL 8.5). Having a fix in 8.6.z sounds good to me.

QE bot (pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Verified this bug as below:

Tested with:
qemu-kvm-6.2.0-13.module+el8.7.0+15131+941fbd8d
kernel-4.18.0-384.el8.x86_64

Steps:

1. Prepare an LV as below
# qemu-img create -f raw loop.img 50G
# losetup /dev/loop1 /home/timao/test/loop.img
# pvcreate /dev/loop1
# vgcreate vgroup /dev/loop1
# lvcreate -L 30G -n lv vgroup

2. Convert an already-installed (known-good) qcow2 image to the LV above
# qemu-img check -r all RHEL-8.6-x86_64-latest.qcow2
No errors were found on the image.
31054/163840 = 18.95% allocated, 92.09% fragmented, 90.65% compressed clusters
Image end offset: 966262784

# qemu-img convert -f qcow2 -O qcow2 -o lazy_refcounts=on,compat=1.1 RHEL-8.6-x86_64-latest.qcow2 /dev/vgroup/lv -p

# qemu-img info /dev/vgroup/lv
image: /dev/vgroup/lv
file format: qcow2
virtual size: 10 GiB (10737418240 bytes)
disk size: 0 B
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: true
    refcount bits: 16
    corrupt: false
    extended l2: false

3. Boot up a guest from the LV
# /usr/libexec/qemu-kvm \
-S \
-name 'avocado-vt-vm1' \
-sandbox on \
-machine q35 \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
-nodefaults \
-device VGA,bus=pcie.0,addr=0x2 \
-m 15360 \
-smp 16,maxcpus=16,cores=8,threads=1,dies=1,sockets=2 \
-cpu 'Haswell-noTSX',+kvm_pv_unhalt \
-device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
-device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-object iothread,id=iothread0 \
-object iothread,id=iothread1 \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device virtio-net-pci,mac=9a:1c:0c:0d:e3:4c,id=idjmZXQS,netdev=idEFQ4i1,bus=pcie-root-port-3,addr=0x0 \
-netdev tap,id=idEFQ4i1,vhost=on \
-vnc :0 \
-rtc base=utc,clock=host,driftfix=slew \
-boot menu=off,order=cdn,once=c,strict=off \
-enable-kvm \
-monitor stdio \
-device pcie-root-port,id=pcie-root-port-5,port=0x6,addr=0x1.0x5,bus=pcie.0,chassis=5 \
-device virtio-scsi-pci,id=virtio_scsi_pci2,bus=pcie-root-port-5,addr=0x0 \
-blockdev node-name=file_image1,driver=host_device,auto-read-only=on,discard=unmap,aio=threads,filename=/dev/vgroup/lv,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=off,cache.no-flush=off,file=file_image1 \
-device scsi-hd,id=image1,drive=drive_image1,write-cache=off \
-chardev socket,server=on,path=/var/tmp/monitor-qmpmonitor1-20210721-024113-AsZ7KYro,id=qmp_id_qmpmonitor1,wait=off \
-mon chardev=qmp_id_qmpmonitor1,mode=control \

4. dd a file and get its md5 value inside the guest, with sync
(guest)# dd if=/dev/urandom of=file1 conv=fsync bs=1M count=512 ; md5sum file1 ; sync

5. Kill the qemu-kvm process on the host immediately after the dd finishes in the previous step
# kill -9 `pidof qemu-kvm`

6. Check the LV image file
# qemu-img check -r all /dev/vgroup/lv
ERROR cluster 77672 refcount=0 reference=1
ERROR cluster 77673 refcount=0 reference=1
ERROR cluster 77674 refcount=0 reference=1
ERROR cluster 77675 refcount=0 reference=1
ERROR cluster 77676 refcount=0 reference=1
......
......
ERROR cluster 83248 refcount=0 reference=1
ERROR cluster 83249 refcount=0 reference=1
ERROR cluster 83250 refcount=0 reference=1
Rebuilding refcount structure
Repairing cluster 75054 refcount=1 reference=0
Repairing cluster 75055 refcount=1 reference=0
Repairing cluster 75056 refcount=1 reference=0
Repairing cluster 75057 refcount=1 reference=0
The following inconsistencies were found and repaired:
    0 leaked clusters
    5579 corruptions
Double checking the fixed image now...
No errors were found on the image.
83225/327680 = 25.40% allocated, 0.34% fragmented, 0.00% compressed clusters
Image end offset: 5455937536

Results: As above, 'check -r all' fixed the image.
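For completeness, on hosts that do not yet have the fixed build, the copy-out/repair/copy-back workaround described in the earlier comments could look roughly like this. This is only a sketch: the device and file names are examples, and since the repaired image can end up slightly larger than the original, the size check before copying back matters:

LV=/dev/vgroup/lv                      # example LV holding the broken qcow2
TMP=/var/tmp/lv-copy.qcow2             # scratch file on a filesystem with room to grow
dd if="$LV" of="$TMP" bs=1M conv=fsync
qemu-img check -r all "$TMP"           # the rebuild succeeds here because the file can grow
stat -c %s "$TMP"                      # size of the repaired image...
blockdev --getsize64 "$LV"             # ...must still fit within the LV before copying back
dd if="$TMP" of="$LV" bs=1M conv=fsync
qemu-img check "$LV"                   # confirm the on-device copy is now clean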
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Low: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7472