Bug 1984852
Summary: [CBT] Use --skip-broken-bitmaps in qemu-img convert --bitmaps to avoid failure if a bitmap is inconsistent

Product: [oVirt] vdsm
Reporter: Nir Soffer <nsoffer>
Component: General
Assignee: Eyal Shenitzky <eshenitz>
Status: CLOSED CURRENTRELEASE
QA Contact: Amit Sharir <asharir>
Severity: high
Docs Contact:
Priority: high
Version: 4.40.60.7
CC: ahadas, asharir, bugs, eshenitz
Target Milestone: ovirt-4.4.9
Keywords: ZStream
Target Release: ---
Flags: pm-rhel: ovirt-4.4+, asharir: testing_plan_complete+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause:
Broken bitmaps prevent VM disks from being moved or copied.
Consequence:
A VM with bitmaps broken by a failed backup cannot have its disks moved or copied.
Fix:
Added support for moving or copying VM disks even if they contain broken bitmaps.
Result:
Moving or copying the disks of a VM that contain bitmaps broken by a failed backup no longer fails.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-21 07:27:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1946084
Bug Blocks:
Description
Nir Soffer
2021-07-22 10:58:40 UTC
Hi Nir,

I have a few questions regarding the verification flow.

Do I need to do steps 2 and 3 simultaneously?

For step 3 I ran the command "ps ax | grep qemu" - do I simply need to kill the first process that appears there?

Thanks.

(In reply to Amit Sharir from comment #1)
> Hi Nir,
>
> I have a few questions regarding the verification flow.
>
> Do I need to do steps 2 and 3 simultaneously?
>
> For step 3 I ran the command "ps ax | grep qemu" - do I simply need to kill
> the first process that appears there?
>
> Thanks.

No, you can just start a live backup; after it has started, power off the VM from outside the guest (you can use the REST API with the 'force' option). That is enough to corrupt the bitmap.

(In reply to Amit Sharir from comment #1)
> Do I need to do steps 2 and 3 simultaneously?

No, you should wait until the full backup is completed.

> For step 3 I ran the command "ps ax | grep qemu" - do I simply need to kill
> the first process that appears there?

No, you can have other qemu instances on the host, so you need to kill the right one. You can find the right qemu instance by grepping for the VM name, which should be unique.

To make sure that the bitmaps got corrupted you can run:

  qemu-img info <path-to-volume>

and check that the bitmaps have the 'in-use' flag.

(In reply to Eyal Shenitzky from comment #2)
> No, you can just start a live backup; after it has started, power off the VM
> from outside the guest (you can use the REST API with the 'force' option).
> That is enough to corrupt the bitmap.

Shutting down the guest via the API or virsh will *not* corrupt any bitmap; qemu is what corrupts bitmaps. qemu persists bitmaps to storage when it terminates normally, so the best way to corrupt a bitmap is to kill qemu with SIGKILL, a signal it cannot handle.

Version:
vdsm-4.40.90.2-1.el8ev.x86_64
ovirt-engine-4.4.9.1-0.13.el8ev.noarch

Verification Flow:
1. Create a VM with a qcow2 disk.
2. Start the VM.
3. Create a full backup using "python3 backup_vm.py -c engine full <vm-id>".
4. Find the relevant QEMU process using "ps ax | grep guest=<VM name of the machine created in step 1>" on the host the VM is running on.
5. Kill the QEMU process using "kill -9 <qemu-pid>".
6. Move the disk to another storage domain (cold and live) - done via the UI.

Verification Conclusions:
The expected output matched the actual output. The whole flow completed with no errors, and I was able to move the disk without any issues.

Bug verified.

(In reply to Amit Sharir from comment #8)
> Verification Flow:
...

Sounds correct, but did you verify the original issue with a vdsm version that does not use --skip-broken-bitmaps?

I just did the flow I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1984852#c8 and saw that I could move the disk without any issues.

Do I need to check additional functionality in order to verify this bug?

(In reply to Amit Sharir from comment #10)
> I just did the flow I mentioned in
> https://bugzilla.redhat.com/show_bug.cgi?id=1984852#c8 and saw that I could
> move the disk without any issues.
> Do I need to check additional functionality in order to verify this bug?

If you don't reproduce the issue before testing, how do you know your test was correct?

You should be able to downgrade vdsm to an older version that did not support --skip-broken-bitmaps (e.g. from 4.4.8), or run the same test on an older environment, and reproduce the failure to move or copy a disk in the same flow.
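To make the corruption step discussed above concrete, here is a minimal shell sketch of the kill-and-check sequence. The VM name backup-test-vm and the placeholder path components are hypothetical, not taken from this bug:

# Find the qemu process that runs this specific VM (the VM name is unique on the host).
ps ax | grep 'guest=backup-test-vm' | grep -v grep

# SIGKILL gives qemu no chance to persist its bitmaps cleanly.
kill -9 <qemu-pid>

# The bitmap left over from the backup should now carry the 'in-use' flag.
qemu-img info /rhev/data-center/mnt/<server>:_<export>/<sd-uuid>/images/<image-uuid>/<volume-uuid>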
I was sure it was already reproduced in the same way.

Just ran the flow mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1984852#c8 on version vdsm-4.40.80.2-1.el8ev.x86_64 / ovirt-engine-4.4.8.1-0.9.el8ev.noarch and was able to complete the whole flow with no errors, meaning that the verification flow mentioned there is not valid.

Do you have another way I can reproduce this? Maybe the way Eyal suggested?

Returning this to "on_qa" until this is clarified.

(In reply to Amit Sharir from comment #12)
> Just ran the flow mentioned in
> https://bugzilla.redhat.com/show_bug.cgi?id=1984852#c8 on version
> vdsm-4.40.80.2-1.el8ev.x86_64 / ovirt-engine-4.4.8.1-0.9.el8ev.noarch and
> was able to complete the whole flow with no errors, meaning that the
> verification flow mentioned there is not valid.

The flow described in comment 0 was incorrect. This is why we always need to reproduce the issue first.

The bitmap created during the backup is written to the disk only during shutdown, so killing the VM will not leave a broken bitmap in the disk; the disk will not have any bitmap, so moving the disk does not fail. We need to kill a VM that already had a bitmap before we started it.

This should work:

1. Create a VM with a qcow2 disk on file storage
   (testing on file storage is easier)
2. Start the VM
3. Perform a full backup
4. Stop the VM normally (e.g. from the UI)
5. Check that the disk has the expected bitmaps

Find the disk snapshot UUID in the UI:
VMs -> snapshots -> Active VM -> disks

Find the volume file:

# ls /rhev/data-center/mnt/*/*/images/*/d0bc130c-c908-4258-871b-88ad16bfd072
/rhev/data-center/mnt/alpine:_00/8ece2aae-5c72-4a5c-b23b-74bae65c88e1/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072

# qemu-img info /rhev/data-center/mnt/alpine:_00/8ece2aae-5c72-4a5c-b23b-74bae65c88e1/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072
image: /rhev/data-center/mnt/alpine:_00/8ece2aae-5c72-4a5c-b23b-74bae65c88e1/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072
file format: qcow2
virtual size: 6 GiB (6442450944 bytes)
disk size: 52.6 MiB
cluster_size: 65536
backing file: 9d63f782-7467-4243-af1e-5c1f8b49c111 (actual path: /rhev/data-center/mnt/alpine:_00/8ece2aae-5c72-4a5c-b23b-74bae65c88e1/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/9d63f782-7467-4243-af1e-5c1f8b49c111)
backing file format: qcow2
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    bitmaps:
        [0]:
            flags:
                [0]: auto
            name: b99cd4e4-b90f-4d78-95a5-04dec106634e
            granularity: 65536
    refcount bits: 16
    corrupt: false
    extended l2: false

6. Start the VM again
7. Find the qemu pid
8. Kill qemu
9. Check the disk again

# qemu-img info /rhev/data-center/mnt/alpine:_00/8ece2aae-5c72-4a5c-b23b-74bae65c88e1/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072
image: /rhev/data-center/mnt/alpine:_00/8ece2aae-5c72-4a5c-b23b-74bae65c88e1/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072
file format: qcow2
virtual size: 6 GiB (6442450944 bytes)
disk size: 72.9 MiB
cluster_size: 65536
backing file: 9d63f782-7467-4243-af1e-5c1f8b49c111 (actual path: /rhev/data-center/mnt/alpine:_00/8ece2aae-5c72-4a5c-b23b-74bae65c88e1/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/9d63f782-7467-4243-af1e-5c1f8b49c111)
backing file format: qcow2
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    bitmaps:
        [0]:
            flags:
                [0]: in-use
                [1]: auto
            name: b99cd4e4-b90f-4d78-95a5-04dec106634e
            granularity: 65536
    refcount bits: 16
    corrupt: false
    extended l2: false

We have a broken bitmap with the "in-use" flag.

10. Try to move the disk to another storage domain
    (another NFS domain in this example).

In 4.4.8 this will fail in vdsm with an error about broken bitmaps.
In 4.4.9 this will succeed.

11. Check the moved disk

Find the disk again:

# ls /rhev/data-center/mnt/*/*/images/*/d0bc130c-c908-4258-871b-88ad16bfd072
/rhev/data-center/mnt/alpine:_01/f07583a1-03d5-4716-9fb0-7dc5c347371a/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072

# qemu-img info /rhev/data-center/mnt/alpine:_01/f07583a1-03d5-4716-9fb0-7dc5c347371a/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072
image: /rhev/data-center/mnt/alpine:_01/f07583a1-03d5-4716-9fb0-7dc5c347371a/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/d0bc130c-c908-4258-871b-88ad16bfd072
file format: qcow2
virtual size: 6 GiB (6442450944 bytes)
disk size: 79.6 MiB
cluster_size: 65536
backing file: 9d63f782-7467-4243-af1e-5c1f8b49c111 (actual path: /rhev/data-center/mnt/alpine:_01/f07583a1-03d5-4716-9fb0-7dc5c347371a/images/82fa5b89-bcfa-4de0-bc8b-834e65d97122/9d63f782-7467-4243-af1e-5c1f8b49c111)
backing file format: qcow2
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false

The broken bitmap was skipped.

To test live disk move, you need to perform the entire flow again to create a new broken bitmap, and start the VM before you move the disk.
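For reference, the options named in this bug's summary combine as in the sketch below. This is not the exact command vdsm builds; the file names are placeholders:

# --bitmaps copies the persistent bitmaps along with the data; on its own it
# fails when a bitmap is inconsistent ('in-use'), while --skip-broken-bitmaps
# makes qemu-img drop the broken bitmaps and copy everything else.
qemu-img convert -p -f qcow2 -O qcow2 \
    --bitmaps --skip-broken-bitmaps \
    src-volume.qcow2 dst-volume.qcow2

This matches the result above: the moved disk keeps its data but no longer lists the broken bitmap.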
(In reply to Nir Soffer from comment #13)
> [reproduction flow quoted in full above]
I reproduced the issue with 100% success using the flow above, on version ovirt-engine-4.4.8.6-0.1.el8ev.noarch / vdsm-4.40.80.6-1.el8ev.x86_64.
Version:
vdsm-4.40.90.2-1.el8ev.x86_64
ovirt-engine-4.4.9.1-0.13.el8ev.noarch

Verification Flow:
As described by Nir in https://bugzilla.redhat.com/show_bug.cgi?id=1984852#c13 - did both flows, for cold and live disk move (including the bitmap validations along the flow).

Verification Conclusions:
The expected output matched the actual output. The whole flow completed with no errors, and I was able to move the disk without any issues.

Bug verified.

This bugzilla is included in the oVirt 4.4.9 release, published on October 20th 2021.

Since the problem described in this bug report should be resolved in the oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.