Bug 2150180 - qemu-img finishes successfully while having errors in commit or bitmaps operations
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Kevin Wolf
QA Contact: aihua liang
URL:
Whiteboard:
Depends On: 2147617
Blocks:
 
Reported: 2022-12-02 01:34 UTC by aihua liang
Modified: 2023-05-25 02:28 UTC
CC List: 13 users

Fixed In Version: qemu-kvm-7.2.0-8.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2147617
Environment:
Last Closed: 2023-05-09 07:20:55 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Gitlab | redhat/centos-stream/src qemu-kvm merge_requests 143 | 0 | None | opened | qemu-img: Fix exit code for errors closing the image | 2023-02-03 20:55:17 UTC
Red Hat Issue Tracker | RHELPLAN-141107 | 0 | None | None | None | 2022-12-02 01:44:18 UTC
Red Hat Product Errata | RHSA-2023:2162 | 0 | None | None | None | 2023-05-09 07:21:42 UTC

Description aihua liang 2022-12-02 01:34:33 UTC
+++ This bug was initially created as a clone of Bug #2147617 +++

Description of problem:
The problem arises when trying to merge two images where the top image
is almost full and the base image has stale bitmaps (bitmaps missing from
the top image).
In our use case, the size of the LV that contains the base image does not
account for the stale bitmaps, and therefore, when we run commit or
bitmap --merge, it fails with:

> qcow2_free_clusters failed: No space left on device
> qemu-img: Lost persistent bitmaps during inactivation of node '#block308': Failed to write bitmap 'stale-bitmap-002' to file: No space left on device
> qemu-img: Failed to flush the refcount block cache: No space left on device

However, in both cases qemu-img exited with a success code,
even though it printed errors to stderr and the merge failed.

Two cases:
- qemu-img commit: the data was committed successfully, and
  the failure happened while adding the bitmaps. Still, the process
  exits with success. This can be OK, since bitmaps are
  basically an optimization.
- qemu-img bitmap: the process should return an error code.
  Failing to write a bitmap is a fatal error.
  Also, bitmaps in the base image are left with the in-use
  flag set.

Still, it is probably best to fail in both cases.
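
On affected builds, a wrapper script can catch the mismatch between the exit code and the error output. The following is a minimal sketch, not part of qemu-img: run_strict is a hypothetical helper name, and it assumes that anything written to stderr indicates a failure, which may be too strict for commands that print warnings there.

```shell
# Hypothetical helper: run a command and treat it as failed if it exits
# non-zero, or if it exits 0 but wrote anything to stderr.
run_strict() {
    _rs_err=$(mktemp) || return 1
    "$@" 2>"$_rs_err"
    _rs_status=$?
    # Replay the captured stderr so the caller still sees the messages.
    cat "$_rs_err" >&2
    if [ "$_rs_status" -eq 0 ] && [ -s "$_rs_err" ]; then
        _rs_status=1   # success exit code contradicted by error output
    fi
    rm -f "$_rs_err"
    return "$_rs_status"
}
```

For example, "run_strict qemu-img bitmap --merge good-bitmap -F qcow2 -b /dev/test/top /dev/test/base good-bitmap" would return non-zero in the failure case described above, even if qemu-img itself exits with 0.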

It would be best to allocate the bitmap storage upfront so failing
with ENOSPC is not possible at the end.
A user cannot recover from this failure since the bitmaps
are left in in-use state, and there is no way to fix them.
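
To get a feeling for how much space the bitmaps need, the data size of one bitmap can be estimated from the image geometry. This is a rough sketch only: bitmap_data_clusters is a hypothetical helper, it relies on each bitmap bit covering "granularity" bytes of the virtual disk, and it counts only the worst-case bitmap data clusters. The bitmap table, the bitmap directory, and refcount metadata need additional space that is not counted here.

```shell
# Hypothetical helper: worst-case number of qcow2 clusters holding the
# data of ONE persistent bitmap. Metadata (bitmap table, bitmap
# directory, refcounts) is deliberately not included.
bitmap_data_clusters() {
    virtual_size=$1   # virtual disk size in bytes
    granularity=$2    # bytes of guest data covered by one bitmap bit
    cluster_size=$3   # qcow2 cluster size in bytes
    bits=$(( (virtual_size + granularity - 1) / granularity ))
    bytes=$(( (bits + 7) / 8 ))
    echo $(( (bytes + cluster_size - 1) / cluster_size ))
}
```

With a 128 MiB virtual size, 65536-byte granularity, and 65536-byte clusters, "bitmap_data_clusters 134217728 65536 65536" prints 1, so each bitmap can take at least one extra 64 KiB cluster on the LV once it is written out.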

Version-Release number of selected component (if applicable):
6.2.0

How reproducible:
100%

Steps to Reproduce:
Check the linked upstream ticket for steps to reproduce.

Actual results:
Operations return a success exit code.

Expected results:
Operations shall fail.

Additional info:

--- Additional comment from Kevin Wolf on 2022-11-24 13:13:06 UTC ---

Okay, after looking at the code, the problem seems clear to me.

This is an error that happens only while closing the image. Unfortunately closing the image is done by bdrv_unref(), a code path that doesn't return an error code, so qemu-img never knows about it. Explicitly inactivating the image first should allow qemu-img to see the error, so maybe that's an approach to try. The other option would be allowing bdrv_unref() to fail - this would require more code changes, but potentially cover more problematic cases.
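
The difference between the two code paths can be illustrated with a shell analogy. This is not QEMU code: all three function names are hypothetical stand-ins, and the status 28 is chosen only because it mirrors ENOSPC on Linux.

```shell
# Stand-in for the final metadata flush; always fails, like the
# ENOSPC case in this report.
flush_metadata() {
    echo "Failed to write bitmap: No space left on device" >&2
    return 28
}

# Analogue of a close path with no error return: the message is
# printed, but the caller only ever sees success.
close_without_status() {
    flush_metadata
    return 0
}

# Analogue of inactivating explicitly before closing: the failure
# status reaches the caller, which can turn it into an exit code.
inactivate_then_close() {
    flush_metadata || return $?
    return 0
}
```

A caller of close_without_status cannot tell that anything went wrong, while a caller of inactivate_then_close receives the failure status and can exit with an error.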

Eric, Hanna, any preferences on which approach to take?

--- Additional comment from Hanna Reitz on 2022-11-24 16:56:26 UTC ---

Would you have bdrv_unref() fail outright (i.e. refuse to actually close the image on error, and force the caller to retain its refcount somehow) or just return an error message but still succeed in its operation of dropping the refcount and closing the image?

The former sounds weird and difficult (not necessarily wrong, though), the latter sounds more sensible.  I wonder what callers would do with the error message, though.  It’s clear what to do in the root of a user-facing function, but when it’s deeply nested?  If we can always push this upwards to a user-facing function (blockdev-del[1], qemu-img, ...), it seems like the better approach, but without trying, I don’t know.

[1] Even though I don’t know what to do in case of blockdev-del.  Depending on the answer to my first question, we probably don’t want it to return an error, even if something went wrong when the node was closed.

If it turns out that qemu-img is the only place where we can really do something useful with the information of whether something went wrong during bdrv_unref(), explicit inactivation seems better to me.

Hanna

--- Additional comment from aihua liang on 2022-12-01 11:05:15 UTC ---

Can reproduce this issue with qemu-kvm-6.2.0-26.module+el8.8.0+17341+68372c23.

Steps:
 1. Create lv devices
   #qemu-img create -f raw test.img 400M
   #losetup /dev/loop0 test.img
   #pvcreate /dev/loop0
   #vgcreate test /dev/loop0
   #lvcreate -n base --size 128M test
   #lvcreate -n top --size 128M test

 2. Create base image and add 6 bitmaps to it.
   #qemu-img create -f qcow2 /dev/test/base 128M
   #qemu-img bitmap --add /dev/test/base stale-bitmap-1
   #qemu-img bitmap --add /dev/test/base stale-bitmap-2
   #qemu-img bitmap --add /dev/test/base stale-bitmap-3
   #qemu-img bitmap --add /dev/test/base stale-bitmap-4
   #qemu-img bitmap --add /dev/test/base stale-bitmap-5
   #qemu-img bitmap --add /dev/test/base stale-bitmap-6

 3. Create snapshot image, add a bitmap to it
   #qemu-img create -f qcow2 /dev/test/top -F qcow2 -b /dev/test/base
   #qemu-img bitmap --add /dev/test/top good-bitmap

 4. Fill the top image with data
   # qemu-io -f qcow2 /dev/test/top -c "write 0 126M"
   wrote 132120576/132120576 bytes at offset 0
   126 MiB, 1 ops; 00.20 sec (624.019 MiB/sec and 4.9525 ops/sec)

 5. Commit from top to base
   #qemu-img commit -f qcow2 -t none -b /dev/test/base -d -p /dev/test/top
   (100.00/100%)
Image committed.

 6. Add bitmap "good-bitmap" to base
   #qemu-img bitmap --add /dev/test/base good-bitmap

 7. Merge bitmap
   #qemu-img bitmap --merge good-bitmap -F qcow2 -b /dev/test/top /dev/test/base good-bitmap
qcow2_free_clusters failed: No space left on device
qemu-img: Lost persistent bitmaps during inactivation of node '#block119': Failed to write bitmap 'good-bitmap' to file: No space left on device
qemu-img: Failed to flush the refcount block cache: No space left on device


Also hit this issue on RHEL 9.2 with qemu-kvm-7.1.0-5.el9, so cloning it.

Comment 3 Kevin Wolf 2023-01-31 14:16:49 UTC
Patches have been posted upstream, hopefully to be merged tomorrow when upstream gets new CI minutes:
https://lists.gnu.org/archive/html/qemu-block/2023-01/msg00336.html

Comment 5 aihua liang 2023-02-10 08:17:15 UTC
Tested on qemu-kvm-7.2.0-8.el9; the results are as below:

Scenario 1: commit fails because no space is left
Steps:
 1. Create lv devices
   #qemu-img create -f raw test.img 400M
   #losetup /dev/loop0 test.img
   #pvcreate /dev/loop0
   #vgcreate test /dev/loop0
   #lvcreate -n base --size 128M test

 2. Create base image and add 8 bitmaps to it.
   #qemu-img create -f qcow2 /dev/test/base 128M
   #qemu-img bitmap --add /dev/test/base stale-bitmap-1
   #qemu-img bitmap --add /dev/test/base stale-bitmap-2
   #qemu-img bitmap --add /dev/test/base stale-bitmap-3
   #qemu-img bitmap --add /dev/test/base stale-bitmap-4
   #qemu-img bitmap --add /dev/test/base stale-bitmap-5
   #qemu-img bitmap --add /dev/test/base stale-bitmap-6
   #qemu-img bitmap --add /dev/test/base stale-bitmap-7
   #qemu-img bitmap --add /dev/test/base stale-bitmap-8

 3. Create snapshot image, add a bitmap to it
   #qemu-img create -f qcow2 top.img -F qcow2 -b /dev/test/base
   #qemu-img bitmap --add top.img good-bitmap

 4. Fill the top image with data
   # qemu-io -f qcow2 top.img -c "write 0 126M"
   wrote 132120576/132120576 bytes at offset 0
   126 MiB, 1 ops; 00.20 sec (624.019 MiB/sec and 4.9525 ops/sec)

 5. Commit from top to base
   #qemu-img commit -f qcow2 -t none -b /dev/test/base -d -p top.img
  (100.00/100%)
qemu-img: Lost persistent bitmaps during inactivation of node '#block325': Failed to write bitmap 'stale-bitmap-5' to file: No space left on device
qemu-img: Lost persistent bitmaps during inactivation of node '#block325': Failed to write bitmap 'stale-bitmap-5' to file: No space left on device
qemu-img: Error while closing the image: Invalid argument

 6. Check bitmap status in /dev/test/base: all bitmaps are in the in-use state.
   # qemu-img info /dev/test/base 
image: /dev/test/base
file format: qcow2
virtual size: 128 MiB (134217728 bytes)
disk size: 0 B
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    bitmaps:
        [0]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-1
            granularity: 65536
        [1]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-2
            granularity: 65536
        [2]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-3
            granularity: 65536
        [3]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-4
            granularity: 65536
        [4]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-5
            granularity: 65536
        [5]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-6
            granularity: 65536
        [6]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-7
            granularity: 65536
        [7]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-8
            granularity: 65536
        [8]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-9
            granularity: 65536
    refcount bits: 16
    corrupt: false
    extended l2: false
Child node '/file':
    filename: /dev/test/base
    protocol type: host_device
    file length: 128 MiB (134217728 bytes)
    disk size: 0 B

 7. Extend the space for /dev/test/base.
   #lvextend -L 350M /dev/test/base 
  Rounding size to boundary between physical extents: 352.00 MiB.
  Size of logical volume test/base changed from 128.00 MiB (32 extents) to 352.00 MiB (88 extents).
  Logical volume test/base successfully resized.

 8. Commit again.
  #qemu-img commit -f qcow2 -t none -b /dev/test/base -d -p top.img
    (100.00/100%)
Image committed.
 
 9. Check bitmaps of /dev/test/base: all bitmaps are still in the in-use state.
   # qemu-img info /dev/test/base 
image: /dev/test/base
file format: qcow2
virtual size: 128 MiB (134217728 bytes)
disk size: 0 B
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    bitmaps:
        [0]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-1
            granularity: 65536
        [1]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-2
            granularity: 65536
        [2]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-3
            granularity: 65536
        [3]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-4
            granularity: 65536
        [4]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-5
            granularity: 65536
        [5]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-6
            granularity: 65536
        [6]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-7
            granularity: 65536
        [7]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-8
            granularity: 65536
        [8]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-9
            granularity: 65536
    refcount bits: 16
    corrupt: false
    extended l2: false
Child node '/file':
    filename: /dev/test/base
    protocol type: host_device
    file length: 352 MiB (369098752 bytes)
    disk size: 0 B
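
Checking the in-use flags by eye gets tedious with many bitmaps. The following is a minimal sketch that extracts the names of in-use bitmaps from the human-readable "qemu-img info" output in the layout shown above. list_in_use_bitmaps is a hypothetical helper name, and the parsing assumes the flags of a bitmap are printed before its name, as in this output. For real scripts, parsing "qemu-img info --output=json" is likely more robust.

```shell
# Hypothetical helper: read human-readable "qemu-img info" output on
# stdin and print the name of every bitmap that has the in-use flag.
list_in_use_bitmaps() {
    awk '
        /^ *\[[0-9]+\]: *$/         { in_use = 0 }   # start of a bitmap entry
        /^ *\[[0-9]+\]: in-use *$/  { in_use = 1 }   # flag line inside "flags:"
        /^ *name: /                 { if (in_use) print $2; in_use = 0 }
    '
}
```

Usage: "qemu-img info /dev/test/base | list_in_use_bitmaps". An empty result means no bitmap is flagged in-use.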


Scenario 2: bitmap merge fails because no space is left
Steps:
 1. Create lv devices
   #qemu-img create -f raw test.img 400M
   #losetup /dev/loop0 test.img
   #pvcreate /dev/loop0
   #vgcreate test /dev/loop0
   #lvcreate -n base --size 128M test

 2. Create base image and add 6 bitmaps to it.
   #qemu-img create -f qcow2 /dev/test/base 128M
   #qemu-img bitmap --add /dev/test/base stale-bitmap-1
   #qemu-img bitmap --add /dev/test/base stale-bitmap-2
   #qemu-img bitmap --add /dev/test/base stale-bitmap-3
   #qemu-img bitmap --add /dev/test/base stale-bitmap-4
   #qemu-img bitmap --add /dev/test/base stale-bitmap-5
   #qemu-img bitmap --add /dev/test/base stale-bitmap-6

 3. Create snapshot image, add a bitmap to it
   #qemu-img create -f qcow2 top.img -F qcow2 -b /dev/test/base
   #qemu-img bitmap --add top.img good-bitmap

 4. Fill the top image with data
   # qemu-io -f qcow2 top.img -c "write 0 126M"
   wrote 132120576/132120576 bytes at offset 0
   126 MiB, 1 ops; 00.20 sec (624.019 MiB/sec and 4.9525 ops/sec)

 5. Commit from top to base
   #qemu-img commit -f qcow2 -t none -b /dev/test/base -d -p top.img
  (100.00/100%)
Image committed.

 6. Check bitmap status in /dev/test/base: all bitmaps have only the auto flag.
   # qemu-img info /dev/test/base  
image: /dev/test/base
file format: qcow2
virtual size: 128 MiB (134217728 bytes)
disk size: 0 B
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    bitmaps:
        [0]:
            flags:
                [0]: auto
            name: stale-bitmap-1
            granularity: 65536
        [1]:
            flags:
                [0]: auto
            name: stale-bitmap-2
            granularity: 65536
        [2]:
            flags:
                [0]: auto
            name: stale-bitmap-3
            granularity: 65536
        [3]:
            flags:
                [0]: auto
            name: stale-bitmap-4
            granularity: 65536
        [4]:
            flags:
                [0]: auto
            name: stale-bitmap-5
            granularity: 65536
        [5]:
            flags:
                [0]: auto
            name: stale-bitmap-6
            granularity: 65536
    refcount bits: 16
    corrupt: false
    extended l2: false
Child node '/file':
    filename: /dev/test/base
    protocol type: host_device
    file length: 128 MiB (134217728 bytes)
    disk size: 0 B

 7. Add a new bitmap to /dev/test/base
   #qemu-img bitmap --add /dev/test/base good-bitmap

 8. Merge the bitmap from top to base
   # qemu-img bitmap --merge good-bitmap -F qcow2 -b top.img /dev/test/base good-bitmap
qemu-img: Lost persistent bitmaps during inactivation of node '#block151': Failed to write bitmap 'good-bitmap' to file: No space left on device
qemu-img: Error while closing the image: Invalid argument
qemu-img: Lost persistent bitmaps during inactivation of node '#block151': Failed to write bitmap 'good-bitmap' to file: No space left on device

 9. Check bitmaps of /dev/test/base: all bitmaps are in the in-use state.
    # qemu-img info /dev/test/base 
image: /dev/test/base
file format: qcow2
virtual size: 128 MiB (134217728 bytes)
disk size: 0 B
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    bitmaps:
        [0]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-1
            granularity: 65536
        [1]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-2
            granularity: 65536
        [2]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-3
            granularity: 65536
        [3]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-4
            granularity: 65536
        [4]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-5
            granularity: 65536
        [5]:
            flags:
                [0]: in-use
                [1]: auto
            name: stale-bitmap-6
            granularity: 65536
        [6]:
            flags:
                [0]: in-use
                [1]: auto
            name: good-bitmap
            granularity: 65536
    refcount bits: 16
    corrupt: false
    extended l2: false
Child node '/file':
    filename: /dev/test/base
    protocol type: host_device
    file length: 128 MiB (134217728 bytes)
    disk size: 0 B

 10. Extend the space for /dev/test/base.
   #lvextend -L 350M /dev/test/base 
  Rounding size to boundary between physical extents: 352.00 MiB.
  Size of logical volume test/base changed from 128.00 MiB (32 extents) to 352.00 MiB (88 extents).
  Logical volume test/base successfully resized.

 11. Merge bitmaps again: the bitmap merge fails.
  #qemu-img bitmap --merge good-bitmap -F qcow2 -b top.img /dev/test/base good-bitmap
qemu-img: Operation merge on bitmap good-bitmap failed: Bitmap 'good-bitmap' is inconsistent and cannot be used
Try block-dirty-bitmap-remove to delete this bitmap from disk
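
The error message points at block-dirty-bitmap-remove; from the command line, "qemu-img bitmap --remove" is the corresponding subcommand. The following is a sketch under assumptions: remove_bitmaps and the QEMU_IMG override variable are hypothetical, and whether a given qemu-img build accepts --remove on an inconsistent bitmap has not been verified in this report. Removing a bitmap also discards its dirty-tracking data, so the next backup that would have used it has to be a full one.

```shell
# Hypothetical helper: remove the named bitmaps from an image, stopping
# at the first failure. Set QEMU_IMG=echo for a dry run that only
# prints the arguments that would be passed to qemu-img.
remove_bitmaps() {
    image=$1
    shift
    for name in "$@"; do
        "${QEMU_IMG:-qemu-img}" bitmap --remove "$image" "$name" || return $?
    done
}
```

Usage: "remove_bitmaps /dev/test/base good-bitmap", optionally after a dry run with QEMU_IMG=echo.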


There is only one small issue: these two error messages are identical.
     qemu-img: Lost persistent bitmaps during inactivation of node '#block325': Failed to write bitmap 'stale-bitmap-5' to file: No space left on device
     qemu-img: Lost persistent bitmaps during inactivation of node '#block325': Failed to write bitmap 'stale-bitmap-5' to file: No space left on device




Hi, Kevin

 Can you help to check the error msg issue? I know they come from different component maybe, but looks the same. Is that acceptable?

BR,
Aliang

Comment 6 Kevin Wolf 2023-02-10 13:01:09 UTC
(In reply to aihua liang from comment #5)
>  Can you help to check the error msg issue? I know they come from different
> component maybe, but looks the same. Is that acceptable?

The error messages are expected to be mostly the same, as this is a real error condition. The "Error while closing the image:" message is new, but this is not the important part of the change.

The actual problem that was fixed is that before, qemu-img returned success (exit code 0) despite printing the error messages, and now it correctly returns failure (exit code != 0).

Comment 7 Yanan Fu 2023-02-13 08:55:21 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 10 aihua liang 2023-02-14 07:55:24 UTC
Per comment 5 and comment 7, setting the bug's status to "VERIFIED".

Comment 12 errata-xmlrpc 2023-05-09 07:20:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2162

