Bug 2147617
| Summary: | qemu-img finishes successfully while having errors in commit or bitmaps operations | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Albert Esteve <aesteve> |
| Component: | qemu-kvm | Assignee: | Kevin Wolf <kwolf> |
| qemu-kvm sub component: | Storage | QA Contact: | aihua liang <aliang> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | bstinson, chayang, coli, eblake, hreitz, jinzhao, juzhang, jwboyer, kwolf, nsoffer, virt-maint, xuwei |
| Version: | CentOS Stream | Keywords: | Triaged |
| Target Milestone: | rc | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Fixed In Version: | qemu-kvm-6.2.0-30.module+el8.8.0+18165+621caf3a | Doc Type: | |
| Clone Of: | | | |
| : | 2150180 | Environment: | |
| Last Closed: | 2023-05-16 08:16:35 UTC | Type: | Bug |
| Bug Blocks: | 2150180 | | |
Okay, after looking at the code, the problem seems clear to me. This is an error that happens only while closing the image. Unfortunately, closing the image is done by bdrv_unref(), a code path that doesn't return an error code, so qemu-img never knows about it.

Explicitly inactivating the image first should allow qemu-img to see the error, so maybe that's an approach to try. The other option would be allowing bdrv_unref() to fail - this would require more code changes, but potentially cover more problematic cases.

Eric, Hanna, any preferences on which approach to take?

Would you have bdrv_unref() fail outright (i.e. refuse to actually close the image on error, and force the caller to retain its refcount somehow) or just return an error message but still succeed in its operation of dropping the refcount and closing the image?

The former sounds weird and difficult (not necessarily wrong, though), the latter sounds more sensible. I wonder what callers would do with the error message, though. It’s clear what to do in the root of a user-facing function, but when it’s deeply nested? If we can always push this upwards to a user-facing function (blockdev-del[1], qemu-img, ...), it seems like the better approach, but without trying, I don’t know.

[1] Even though I don’t know what to do in case of blockdev-del. Depending on the answer to my first question, we probably don’t want it to return an error, even if something went wrong when the node was closed.

If it turns out that qemu-img is the only place where we can really do something useful with the information of whether something went wrong during bdrv_unref(), explicit inactivation seems better to me.

Hanna

Can reproduce this issue with qemu-kvm-6.2.0-26.module+el8.8.0+17341+68372c23.

Steps:
1. Create LV devices
# qemu-img create -f raw test.img 400M
# losetup /dev/loop0 test.img
# pvcreate /dev/loop0
# vgcreate test /dev/loop0
# lvcreate -n base --size 128M test
# lvcreate -n top --size 128M test

2. Create base image and add 6 bitmaps to it.
# qemu-img create -f qcow2 /dev/test/base 128M
# qemu-img bitmap --add /dev/test/base stale-bitmap-1
# qemu-img bitmap --add /dev/test/base stale-bitmap-2
# qemu-img bitmap --add /dev/test/base stale-bitmap-3
# qemu-img bitmap --add /dev/test/base stale-bitmap-4
# qemu-img bitmap --add /dev/test/base stale-bitmap-5
# qemu-img bitmap --add /dev/test/base stale-bitmap-6

3. Create snapshot image, add a bitmap to it
# qemu-img create -f qcow2 /dev/test/top -F qcow2 -b /dev/test/base
# qemu-img bitmap --add /dev/test/top good-bitmap

4. Fully write top
# qemu-io -f qcow2 /dev/test/top -c "write 0 126M"
wrote 132120576/132120576 bytes at offset 0
126 MiB, 1 ops; 00.20 sec (624.019 MiB/sec and 4.9525 ops/sec)

5. Commit top into base
# qemu-img commit -f qcow2 -t none -b /dev/test/base -d -p /dev/test/top
(100.00/100%)
Image committed.

6. Add bitmap "good-bitmap" to base
# qemu-img bitmap --add /dev/test/base good-bitmap

7. Merge bitmap
# qemu-img bitmap --merge "bitmap-good" -F qcow2 -b /dev/test/top /dev/test/base "bitmap-good"
qcow2_free_clusters failed: No space left on device
qemu-img: Lost persistent bitmaps during inactivation of node '#block119': Failed to write bitmap 'bitmap-good' to file: No space left on device
qemu-img: Failed to flush the refcount block cache: No space left on device

Tested on qemu-kvm-6.2.0-30.module+el8.8.0+18165+621caf3a; both commit and bitmap merge failed as expected.

Scenario 1: commit with error

Test Steps:
1. Create LV devices
# qemu-img create -f raw test.img 400M
# losetup /dev/loop0 test.img
# pvcreate /dev/loop0
# vgcreate test /dev/loop0
# lvcreate -n base --size 128M test
# lvcreate -n top --size 128M test

2. Create base image and add 7 bitmaps to it.
# qemu-img create -f qcow2 /dev/test/base 128M
# qemu-img bitmap --add /dev/test/base stale-bitmap-1
# qemu-img bitmap --add /dev/test/base stale-bitmap-2
# qemu-img bitmap --add /dev/test/base stale-bitmap-3
# qemu-img bitmap --add /dev/test/base stale-bitmap-4
# qemu-img bitmap --add /dev/test/base stale-bitmap-5
# qemu-img bitmap --add /dev/test/base stale-bitmap-6
# qemu-img bitmap --add /dev/test/base stale-bitmap-7

3. Create snapshot image, add a bitmap to it
# qemu-img create -f qcow2 /dev/test/top -F qcow2 -b /dev/test/base
# qemu-img bitmap --add /dev/test/top good-bitmap

4. Fully write top
# qemu-io -f qcow2 /dev/test/top -c "write 0 126M"
wrote 132120576/132120576 bytes at offset 0
126 MiB, 1 ops; 00.20 sec (624.019 MiB/sec and 4.9525 ops/sec)

5. Commit top into base
# qemu-img commit -f qcow2 -t none -b /dev/test/base -d -p /dev/test/top
(100.00/100%)
qemu-img: Lost persistent bitmaps during inactivation of node '#block397': Failed to write bitmap 'stale-bitmap-7' to file: No space left on device
qemu-img: Lost persistent bitmaps during inactivation of node '#block397': Failed to write bitmap 'stale-bitmap-7' to file: No space left on device
qemu-img: Error while closing the image: Invalid argument

Scenario 2: bitmap merge with error

1. Create LV devices
# qemu-img create -f raw test.img 400M
# losetup /dev/loop0 test.img
# pvcreate /dev/loop0
# vgcreate test /dev/loop0
# lvcreate -n base --size 128M test
# lvcreate -n top --size 128M test

2. Create base image and add 6 bitmaps to it.
# qemu-img create -f qcow2 /dev/test/base 128M
# qemu-img bitmap --add /dev/test/base stale-bitmap-1
# qemu-img bitmap --add /dev/test/base stale-bitmap-2
# qemu-img bitmap --add /dev/test/base stale-bitmap-3
# qemu-img bitmap --add /dev/test/base stale-bitmap-4
# qemu-img bitmap --add /dev/test/base stale-bitmap-5
# qemu-img bitmap --add /dev/test/base stale-bitmap-6

3. Create snapshot image, add a bitmap to it
# qemu-img create -f qcow2 /dev/test/top -F qcow2 -b /dev/test/base
# qemu-img bitmap --add /dev/test/top good-bitmap

4. Fully write top
# qemu-io -f qcow2 /dev/test/top -c "write 0 126M"
wrote 132120576/132120576 bytes at offset 0
126 MiB, 1 ops; 00.20 sec (624.019 MiB/sec and 4.9525 ops/sec)

5. Commit top into base
# qemu-img commit -f qcow2 -t none -b /dev/test/base -d -p /dev/test/top
(100.00/100%)
Image committed.

6. Add bitmap good-bitmap to base
# qemu-img bitmap --add /dev/test/base good-bitmap

7. Merge bitmap
# qemu-img bitmap --merge good-bitmap -F qcow2 -b /dev/test/top /dev/test/base good-bitmap
qemu-img: Lost persistent bitmaps during inactivation of node '#block147': Failed to write bitmap 'good-bitmap' to file: No space left on device
qemu-img: Error while closing the image: Invalid argument
qemu-img: Lost persistent bitmaps during inactivation of node '#block147': Failed to write bitmap 'good-bitmap' to file: No space left on device

QE bot (pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2757
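
On builds without the fix, the reproduction above shows that stderr output is the only sign of the failure, because the exit status is still 0. A caller-side workaround could look roughly like the sketch below. The helper name run_strict is hypothetical, and treating any stderr output as a failure is an assumption that may be too strict for commands that print harmless warnings.

```shell
#!/bin/sh
# run_strict CMD [ARGS...]   (hypothetical helper, not part of qemu-img)
# Runs CMD, forwards its stderr, and returns non-zero if CMD failed
# OR if it exited 0 but still wrote something to stderr.
run_strict() {
    errfile=$(mktemp) || return 1

    status=0
    "$@" 2>"$errfile" || status=$?

    if [ -s "$errfile" ]; then
        # Forward the captured messages so nothing is hidden from the user.
        cat "$errfile" >&2
        # Assumption: stderr output together with exit 0 is a hidden failure.
        if [ "$status" -eq 0 ]; then
            status=1
        fi
    fi

    rm -f "$errfile"
    return "$status"
}

# Example use with the commit command from the steps above:
#   run_strict qemu-img commit -f qcow2 -t none -b /dev/test/base -d -p /dev/test/top \
#       || echo "commit reported errors" >&2
```

With the fixed package the wrapper should be unnecessary, since qemu-img itself exits with an error in both scenarios.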
Description of problem:
The problem arises when trying to merge two images where the top image is almost full and the base image has stale bitmaps (bitmaps missing from the top image). In our use case, the size of the LV that contains the base image does not account for the stale bitmaps, and therefore, when we run commit or bitmap --merge, it fails with:

> qcow2_free_clusters failed: No space left on device
> qemu-img: Lost persistent bitmaps during inactivation of node '#block308': Failed to write bitmap 'stale-bitmap-002' to file: No space left on device
> qemu-img: Failed to flush the refcount block cache: No space left on device

However, in both cases qemu-img returned successfully, even though it printed errors to stderr and the merge failed.

Two cases:
- qemu-img commit: the data was committed successfully, but adding the bitmaps failed. Still, the process exits with success. This can be OK, since bitmaps are basically an optimization.
- qemu-img bitmap: the process should return an error code. Failing to write a bitmap is a fatal error. Also, bitmaps in the base image are left with the in-use flag set.

Still, it is probably best to fail in both cases.

It would be best to allocate the bitmap storage upfront so that failing with ENOSPC at the end is not possible. A user cannot recover from this failure since the bitmaps are left in the in-use state, and there is no way to fix them.

Version-Release number of selected component (if applicable):
6.2.0

How reproducible:
100%

Steps to Reproduce:
Check the linked upstream ticket for steps to reproduce.

Actual results:
Operations return with a success exit code.

Expected results:
Operations should fail.

Additional info:
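
The in-use state mentioned in the description can be checked after a suspicious run. The sketch below is a rough heuristic: it assumes the JSON printed by `qemu-img info --output=json` lists each bitmap's flags as strings such as "in-use", and it simply searches for that string. The helper name has_in_use_bitmap is hypothetical, and a bitmap literally named in-use would produce a false positive.

```shell
#!/bin/sh
# has_in_use_bitmap   (hypothetical helper)
# Reads `qemu-img info --output=json` output on stdin and exits 0 if the
# quoted string "in-use" appears anywhere in it, non-zero otherwise.
has_in_use_bitmap() {
    grep -q '"in-use"'
}

# Example use against the base image from the report:
#   if qemu-img info --output=json /dev/test/base | has_in_use_bitmap; then
#       echo "base image has bitmaps marked in-use" >&2
#   fi
```

A stricter check would parse the JSON and inspect only the flags of each bitmap entry, but that needs a JSON tool, which this sketch deliberately avoids.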