Description of problem:

If a bitmap becomes inconsistent after abnormal VM termination, copying the volume with the inconsistent bitmap using the --bitmaps option fails with:

    qemu-img: Failed to populate bitmap 5f59b2d6-6b52-484c-ae7a-f8b43f2175a4: Bitmap '5f59b2d6-6b52-484c-ae7a-f8b43f2175a4' is inconsistent and cannot be used
    Try block-dirty-bitmap-remove to delete this bitmap from disk

This fails the storage job, which fails the move disk operation (both cold and live) and possibly other operations (bug 1946084).

qemu-img added a --skip-broken-bitmaps option to skip inconsistent bitmaps and avoid such failures. This feature will be available in qemu 6.1.0, and we hope to get it also in a future RHEL AV 8.4.z update.

We can add the code to use this now by detecting whether qemu-img convert supports this option, and always use --skip-broken-bitmaps when using the --bitmaps option. Once we have a fix in qemu-img, we can require the fixed version in the spec, but this may happen only in 4.4.9.

Reproducing the issue:
1. Start a VM with a qcow2 disk
2. Create a full backup
3. Kill the qemu process (kill -9 qemu-pid)
4. Try to move the disk to another storage domain (cold and live)

Actual result:
Move disk fails with the mentioned error in qemu-img convert.

Testing the fix:
With the fixed version, moving the disk will succeed. Testing requires a qemu-img version supporting --skip-broken-bitmaps. This version will be available soon in RHEL AV 8.5 nightly builds and in CentOS Stream.

This failure is basically a regression introduced in the version where we started to support copying bitmaps using --bitmaps (I think this is 4.4.6). Before that we did not copy bitmaps, so after moving a disk a full backup was required. In the current release, move disk is not possible without removing the inconsistent bitmap. Deleting the related checkpoint should remove the bitmap and fix this issue.
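The detection described above could be sketched as follows. This is not the actual vdsm code, just a minimal illustration; it assumes the option can be discovered by searching the qemu-img help text (as vdsm does for other option probes), and the helper names are made up:

```python
import subprocess


def supports_skip_broken_bitmaps(help_text):
    # Assumption: a qemu-img build with the fix advertises the option
    # in its help output.
    return "--skip-broken-bitmaps" in help_text


def convert_command(src, dst, copy_bitmaps, help_text):
    # Build a qemu-img convert command line, adding --skip-broken-bitmaps
    # whenever bitmaps are copied and the local qemu-img supports it.
    cmd = ["qemu-img", "convert", "-O", "qcow2"]
    if copy_bitmaps:
        cmd.append("--bitmaps")
        if supports_skip_broken_bitmaps(help_text):
            cmd.append("--skip-broken-bitmaps")
    cmd += [src, dst]
    return cmd


def qemu_img_help():
    # Probe the installed qemu-img (not exercised here).
    return subprocess.run(
        ["qemu-img", "--help"],
        capture_output=True, text=True, check=False,
    ).stdout
```

With this scheme, old qemu-img builds keep working and new builds silently skip broken bitmaps instead of failing the whole copy.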
Hi Nir,

I have a few questions regarding the verification flow.

Do I need to do steps 2 and 3 simultaneously?

For step 3 I ran the command "ps ax | grep qemu" - do I simply need to kill the first process that appears there?

Thanks.
(In reply to Amit Sharir from comment #1)
> Do I need to do steps 2 and 3 simultaneously?
>
> For step 3 I ran the command: "ps ax | grep qemu" - do I simply need to kill
> the first process that appears there?

No, you can just start a live backup; after it has started, power off the VM from outside the guest (you can use the REST API with the 'force' option). That is enough to corrupt the bitmap.
(In reply to Amit Sharir from comment #1)
> Do I need to do steps 2 and 3 simultaneously?

No, you should wait until the full backup is completed.

> For step 3 I ran the command: "ps ax | grep qemu" - do I simply need to kill
> the first process that appears there?

No, you can have other qemu instances on the host; you need to kill the right one. You can detect the right qemu instance by grepping for the VM name, which should be unique.
To make sure that the bitmaps got corrupted, you can run:

    qemu-img info <path-to-volume>

and check that the bitmaps have the 'in-use' flag.
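The 'in-use' check can also be scripted against `qemu-img info --output json`. A minimal sketch; the JSON layout below matches what current qemu-img emits for qcow2, but treat it as an assumption, and the helper name is made up:

```python
import json


def broken_bitmaps(info_json):
    # Bitmaps that still carry the "in-use" flag were not saved cleanly
    # and are inconsistent; qemu-img convert refuses to copy them.
    data = json.loads(info_json).get("format-specific", {}).get("data", {})
    return [
        b["name"]
        for b in data.get("bitmaps", [])
        if "in-use" in b.get("flags", [])
    ]
```

Feed it the output of `qemu-img info --output json <path-to-volume>`; an empty result means all bitmaps were persisted cleanly.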
(In reply to Eyal Shenitzky from comment #2)
> No, you can just start a live backup, after it was started, power-off the VM
> outside the guest (can use REST-API with 'force' option), It is enough to
> corrupt the bitmap.

Shutting down the guest via the API or virsh will *not* corrupt any bitmap; qemu persists bitmaps to storage when it terminates normally. The best way to corrupt a bitmap is to kill qemu with SIGKILL, since it cannot handle this signal.
Version:
vdsm-4.40.90.2-1.el8ev.x86_64
ovirt-engine-4.4.9.1-0.13.el8ev.noarch

Verification Flow:
1. Create a VM with a qcow2 disk.
2. Start the VM.
3. Create a full backup using "python3 backup_vm.py -c engine full <vm-id>".
4. Find the relevant QEMU process using "ps ax | grep guest=<VM name of the machine that was created in step 1>" on the host the VM is running on.
5. Kill the QEMU process using "kill -9 <qemu-pid>".
6. Move the disk to another storage domain (cold and live) - done via the UI.

Verification Conclusions:
The actual output matched the expected output. The whole flow completed with no errors; I was able to move the disk after completing the mentioned flow without any issues.

Bug verified.
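The pid lookup in step 4 could also be scripted by scanning /proc instead of grepping ps output. A minimal sketch, assuming libvirt's "guest=<vm-name>" naming convention in the qemu command line; the `proc` parameter exists only to make the sketch testable:

```python
import os


def find_qemu_pid(vm_name, proc="/proc"):
    # Scan process command lines for a qemu process started for this VM.
    # Assumption: libvirt passes "-name guest=<vm_name>,..." so the VM
    # name appears in the command line and is unique on the host.
    for pid in os.listdir(proc):
        if not pid.isdigit():
            continue
        try:
            with open(os.path.join(proc, pid, "cmdline"), "rb") as f:
                # cmdline arguments are NUL-separated.
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            # Process exited or is unreadable; skip it.
            continue
        if "qemu" in cmdline and "guest=" + vm_name in cmdline:
            return int(pid)
    return None
```

The returned pid is the one to pass to `kill -9` in step 5.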
(In reply to Amit Sharir from comment #8)
> Verification Flow:
> ...

Sounds correct, but did you verify the original issue with a vdsm version that does not use --skip-broken-bitmaps?
I just did the flow I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1984852#c8 and saw that I could move the disk without any issues. Do I need to check additional functionality in order to verify this bug?
(In reply to Amit Sharir from comment #10)
> I just did the flow I mentioned in
> https://bugzilla.redhat.com/show_bug.cgi?id=1984852#c8 and saw that I could
> move the disk without any issues.
> Do I need to check additional functionality in order to verify this bug?

If you don't reproduce the issue before testing, how do you know if your test was correct?

You should be able to downgrade vdsm to an older version that did not support --skip-broken-bitmaps (e.g. from 4.4.8), or run the same test on an older environment, and reproduce the failure to move or copy the disk in the same flow.
I was sure it was already reproduced in the same way.

Just ran the flow mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1984852#c8 on version vdsm-4.40.80.2-1.el8ev.x86_64 / ovirt-engine-4.4.8.1-0.9.el8ev.noarch and was able to complete the whole flow with no errors, meaning that the verification flow mentioned was not valid.

Do you have another way I can reproduce this? Maybe the way Eyal suggested?

Returning this to "on_qa" until this is clarified.
(In reply to Amit Sharir from comment #12)
> Just ran the flow mentioned in
> https://bugzilla.redhat.com/show_bug.cgi?id=1984852#c8 on version
> vdsm-4.40.80.2-1.el8ev.x86_64 / ovirt-engine-4.4.8.1-0.9.el8ev.noarch and
> was able to complete the whole flow with no errors. Meaning that the
> verification flow mentioned was not valid.

The flow described in comment 0 was incorrect. This is why we always need to reproduce the issue first.

The bitmap created during the backup is written to the disk only during shutdown, so killing the VM will not leave a broken bitmap in the disk; the disk will not have any bitmap, so moving the disk does not fail. We need to kill a VM which already had a bitmap before we started it.

This should work:

1. Create a VM with a qcow2 disk on file storage
   (testing on file storage is easier)

2. Start the VM

3. Perform a full backup

4. Stop the VM normally (e.g. from the UI)

5. Check that the disk has the expected bitmaps

Find the disk snapshot UUID in the UI:
VMs -> Snapshots -> Active VM -> Disks

Find the volume file:

# ls /rhev/data-center/mnt/*/*/images/*/d0bc130c-c908-4258-871b-88ad16bfd072
/rhev/data-center/mnt/alpine:_00/8ece2aae-5c72-4a5c-b23b-74bae65c88e1/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072

# qemu-img info /rhev/data-center/mnt/alpine:_00/8ece2aae-5c72-4a5c-b23b-74bae65c88e1/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072
image: /rhev/data-center/mnt/alpine:_00/8ece2aae-5c72-4a5c-b23b-74bae65c88e1/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072
file format: qcow2
virtual size: 6 GiB (6442450944 bytes)
disk size: 52.6 MiB
cluster_size: 65536
backing file: 9d63f782-7467-4243-af1e-5c1f8b49c111 (actual path: /rhev/data-center/mnt/alpine:_00/8ece2aae-5c72-4a5c-b23b-74bae65c88e1/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/9d63f782-7467-4243-af1e-5c1f8b49c111)
backing file format: qcow2
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    bitmaps:
        [0]:
            flags:
                [0]: auto
            name: b99cd4e4-b90f-4d78-95a5-04dec106634e
            granularity: 65536
    refcount bits: 16
    corrupt: false
    extended l2: false

6. Start the VM again

7. Find the qemu pid

8. Kill qemu

9. Check the disk again

# qemu-img info /rhev/data-center/mnt/alpine:_00/8ece2aae-5c72-4a5c-b23b-74bae65c88e1/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072
image: /rhev/data-center/mnt/alpine:_00/8ece2aae-5c72-4a5c-b23b-74bae65c88e1/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072
file format: qcow2
virtual size: 6 GiB (6442450944 bytes)
disk size: 72.9 MiB
cluster_size: 65536
backing file: 9d63f782-7467-4243-af1e-5c1f8b49c111 (actual path: /rhev/data-center/mnt/alpine:_00/8ece2aae-5c72-4a5c-b23b-74bae65c88e1/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/9d63f782-7467-4243-af1e-5c1f8b49c111)
backing file format: qcow2
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    bitmaps:
        [0]:
            flags:
                [0]: in-use
                [1]: auto
            name: b99cd4e4-b90f-4d78-95a5-04dec106634e
            granularity: 65536
    refcount bits: 16
    corrupt: false
    extended l2: false

We have a broken bitmap with the "in-use" flag.

10. Try to move the disk to another storage domain
    (another NFS domain in this example)

In 4.4.8 this will fail in vdsm with an error about broken bitmaps.
In 4.4.9 this will succeed.

11. Check the moved disk

Find the disk again:

# ls /rhev/data-center/mnt/*/*/images/*/d0bc130c-c908-4258-871b-88ad16bfd072
/rhev/data-center/mnt/alpine:_01/f07583a1-03d5-4716-9fb0-7dc5c347371a/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072

# qemu-img info /rhev/data-center/mnt/alpine:_01/f07583a1-03d5-4716-9fb0-7dc5c347371a/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072
image: /rhev/data-center/mnt/alpine:_01/f07583a1-03d5-4716-9fb0-7dc5c347371a/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072
file format: qcow2
virtual size: 6 GiB (6442450944 bytes)
disk size: 79.6 MiB
cluster_size: 65536
backing file: 9d63f782-7467-4243-af1e-5c1f8b49c111 (actual path: /rhev/data-center/mnt/alpine:_01/f07583a1-03d5-4716-9fb0-7dc5c347371a/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/9d63f782-7467-4243-af1e-5c1f8b49c111)
backing file format: qcow2
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false

The broken bitmap was skipped.

To test live disk move, you need to perform the entire flow again to create a new broken bitmap, and start the VM before you move the disk.
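Checking that only the broken bitmaps were dropped by the move could also be scripted by diffing the bitmap lists from `qemu-img info --output json` on the source and destination volumes. A hypothetical helper; the JSON layout is an assumption based on current qemu-img output:

```python
import json


def skipped_bitmaps(src_info_json, dst_info_json):
    # Compare bitmaps on source and destination volumes. Returns the
    # bitmaps missing from the destination (name -> flags) and whether
    # every one of them was broken ("in-use"), i.e. legitimately skipped.
    def bitmaps(info_json):
        data = json.loads(info_json).get("format-specific", {}).get("data", {})
        return {b["name"]: b.get("flags", []) for b in data.get("bitmaps", [])}

    src, dst = bitmaps(src_info_json), bitmaps(dst_info_json)
    skipped = {name: flags for name, flags in src.items() if name not in dst}
    only_broken = all("in-use" in flags for flags in skipped.values())
    return skipped, only_broken
```

If `only_broken` is False, a healthy bitmap went missing during the copy, which would indicate a real bug rather than the expected skip.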
> This should work:
> ...
> To test live disk move, you need to perform the entire flow again to create
> a new broken bitmap, and start the VM before you move the disk.

I had 100% success in reproducing the issue using this flow, on version ovirt-engine-4.4.8.6-0.1.el8ev.noarch / vdsm-4.40.80.6-1.el8ev.x86_64.
Version:
vdsm-4.40.90.2-1.el8ev.x86_64
ovirt-engine-4.4.9.1-0.13.el8ev.noarch

Verification Flow:
As mentioned by Nir in https://bugzilla.redhat.com/show_bug.cgi?id=1984852#c13 - performed both flows, for cold and live disk move (including bitmap validations along the flow).

Verification Conclusions:
The actual output matched the expected output. The whole flow completed with no errors; I was able to move the disk after completing the mentioned flow without any issues.

Bug verified.
This bugzilla is included in oVirt 4.4.9 release, published on October 20th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.