Bug 1946084 - qemu-img convert --bitmaps fail if a bitmap is inconsistent
Summary: qemu-img convert --bitmaps fail if a bitmap is inconsistent
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.4
Assignee: Eric Blake
QA Contact: aihua liang
URL:
Whiteboard:
Depends On:
Blocks: 1957194 1984852 1993308
 
Reported: 2021-04-04 07:29 UTC by Eyal Shenitzky
Modified: 2021-11-16 08:28 UTC (History)

Fixed In Version: qemu-kvm-6.0.0-28.module+el8.5.0+12271+fffa967b
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1993308 (view as bug list)
Environment:
Last Closed: 2021-11-16 07:52:31 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
Engine VDSM libvirt and QEMU logs (10.03 MB, application/zip)
2021-04-04 07:29 UTC, Eyal Shenitzky


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2021:4684 0 None None None 2021-11-16 07:53:15 UTC

Description Eyal Shenitzky 2021-04-04 07:29:26 UTC
Created attachment 1769034
Engine VDSM libvirt and QEMU logs

Description of problem:

The qemu-img convert command failed with the following error when used on a volume with a bitmap that was created during a backup operation.

From host-0 VDSM log - 
qemu-img: Failed to populate bitmap 5f59b2d6-6b52-484c-ae7a-f8b43f2175a4: Bitmap \'5f59b2d6-6b52-484c-ae7a-f8b43f2175a4\' is inconsistent and cannot be used\nTry block-dirty-bitmap-remove to delete this bitmap from disk"

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Create a VM and start a full and an incremental backup.
2. Kill the QEMU process and start it again
3. Move the VM disk to another storage domain (using QEMU convert)

Actual results:
Disk migration failed due to inconsistent bitmap

Expected results:
Disk should be migrated without any issues

Additional info:

Comment 2 aihua liang 2021-06-04 09:22:20 UTC
Hi, Eric
 According to your mail reply:
  << It makes no sense to copy an inconsistent bitmap.  Thus, live migration
  << and qemu-img convert are both probably going to fail because of the
  << inability to copy a corrupt bitmap, and the solution is to delete the
  << bad bitmap before attempting any other operation that wants to copy
  << bitmaps.  The only reason qemu does not automatically ignore/delete the
  << inconsistent bitmap is because we want to ensure the management app is
  << aware of the bitmap loss; the management app acknowledges it by deleting
  << the bad bitmap.

Can you confirm that this is still the expected behavior? Then I can close the bug.


Thanks,
Aliang

Comment 3 Eric Blake 2021-06-10 17:50:39 UTC
Correct - the only recovery method is for the user to manually delete the broken bitmap (qemu-img bitmap --remove file.qcow2 broken_name) before migrating the file.
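
A minimal sketch of how a management app might script this manual recovery, in Python. The command line is the one given above (`qemu-img bitmap --remove file.qcow2 broken_name`); the function names are hypothetical, and the image must not be in use by a running QEMU when the command is executed.

```python
import subprocess
from typing import List


def bitmap_remove_argv(image: str, bitmap: str) -> List[str]:
    """Build the argv for: qemu-img bitmap --remove IMAGE BITMAP."""
    return ["qemu-img", "bitmap", "--remove", image, bitmap]


def remove_bitmap(image: str, bitmap: str) -> None:
    """Run the removal; raises CalledProcessError if qemu-img fails.

    Assumes qemu-img is on PATH and the image is not opened by QEMU.
    """
    subprocess.run(bitmap_remove_argv(image, bitmap), check=True)


# Only builds and prints the command; nothing is executed here.
print(" ".join(bitmap_remove_argv("file.qcow2", "broken_name")))
```

Keeping the argv construction separate from the execution makes it possible to log or review the command before touching the image.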

Comment 4 Nir Soffer 2021-06-10 20:26:34 UTC
This means we have to check if all bitmaps are valid before the copy, and remove
invalid bitmaps.

It would be more useful if qemu-img convert could skip invalid bitmaps during the copy.
Bitmaps are only an optimization. If we don't copy a bitmap, the next incremental
backup will fail and we can fall back to a full backup.

The current situation is that entire copy fails, and recovery requires manual
support.
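
The pre-copy check described here could look roughly like the following Python sketch. It assumes the caller has the output of `qemu-img info --output=json`, that qcow2 bitmaps are listed under `format-specific` / `data` / `bitmaps` with a `flags` list, and that the `in-use` flag marks a bitmap as inconsistent. The function name and the sample data are hypothetical, written by hand for illustration.

```python
import json
from typing import List


def inconsistent_bitmaps(info_json: str) -> List[str]:
    """Return the names of bitmaps carrying the 'in-use' flag.

    info_json is expected to be the text printed by
    `qemu-img info --output=json IMAGE` (layout assumed, see lead-in).
    """
    info = json.loads(info_json)
    data = info.get("format-specific", {}).get("data", {})
    return [
        b["name"]
        for b in data.get("bitmaps", [])
        if "in-use" in b.get("flags", [])
    ]


# Hand-written sample, not captured tool output.
SAMPLE_INFO = json.dumps({
    "format": "qcow2",
    "format-specific": {
        "type": "qcow2",
        "data": {
            "bitmaps": [
                {"name": "bitmap_persistent", "flags": ["in-use", "auto"],
                 "granularity": 65536},
                {"name": "bitmap_add", "flags": ["auto"],
                 "granularity": 65536},
            ],
        },
    },
})

print(inconsistent_bitmaps(SAMPLE_INFO))  # prints ['bitmap_persistent']
```

Images without bitmaps, or non-qcow2 images without a `format-specific` section, simply yield an empty list.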

Comment 5 Eric Blake 2021-06-16 18:52:00 UTC
(In reply to Nir Soffer from comment #4)
> This means we have to check if all bitmaps are valid before the copy, and
> remove
> invalid bitmaps.
> 
> It will be more useful if qemu-img convert could skip invalid bitmaps during
> copy.
> Bitmaps are only an optimization. If we don't copy a bitmap the next
> incremental
> backup will fail and we can fallback to full backup.
> 
> The current situation is that entire copy fails, and recovery requires manual
> support.

Deleting broken bitmaps by default may be risky, but copying all working bitmaps while removing the broken ones via a command line option (maybe --skip-broken, borrowing naming from dnf) seems like it might be a nice UI addition.

Comment 6 Nir Soffer 2021-06-16 22:43:09 UTC
(In reply to Eric Blake from comment #5)
> Deleting broken bitmaps by default may be risky, but copying all working
> bitmaps while removing the broken ones via a command line option (maybe
> --skip-broken, borrowing naming from dnf) seems like it might be a nice UI
> addition.

I agree. Can we get this in 8.5?

Comment 7 Klaus Heinrich Kiwi 2021-07-06 16:58:37 UTC
(In reply to Nir Soffer from comment #6)
> (In reply to Eric Blake from comment #5)
> > Deleting broken bitmaps by default may be risky, but copying all working
> > bitmaps while removing the broken ones via a command line option (maybe
> > --skip-broken, borrowing naming from dnf) seems like it might be a nice UI
> > addition.
> 
> I agree. Can we get this in 8.5?

Any updates here? I took the liberty of adjusting the summary as well as reclassifying this bug to be of low severity, low priority - let me know if there's disagreement.

Comment 8 Nir Soffer 2021-07-06 18:39:12 UTC
(In reply to Klaus Heinrich Kiwi from comment #7)
> reclassifying this bug to be of low severity, low priority - let me know if
> there's disagreement.

This is a regression that causes failures in basic flows in RHV.

After a VM is terminated abnormally, it will have bitmaps with 'in-use' flag.
Copying the vm disks to other storage will fail because of the invalid bitmaps,
and there is no way to fix this without deleting the bitmaps manually, which
means support case.

Before adding the feature to copy bitmaps, abnormal termination did not cause
any failures when copying the disks because qemu-img did not copy any bitmaps.

So I don't think this is a low severity.

Regarding priority, we can fix this in RHV by checking and deleting bitmaps 
before copying disks, but this is not an easy change in RHV.

So the right way to fix this seems to be adding an option to skip broken
bitmaps in qemu-img convert, which RHV will always use. Adding another option
to the qemu-img convert command is a trivial change.

I also assume that this is relatively easy to fix in qemu, a minor change
in the code copying bitmaps to skip bitmaps with in-use flag.

Fixing this in qemu also means that this issue will be fixed for all users,
instead of only for RHV.

So from my point of view, it is high priority that this will be fixed in
qemu and not in RHV.
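
On the caller side, always passing such an option could be sketched as below, in Python. The option name `--skip-broken-bitmaps` is the one that appears in the verification in comment 27; the helper name and its parameters are hypothetical, and the sketch assumes the skip option is only meaningful together with `--bitmaps`.

```python
from typing import List


def convert_argv(src: str, dst: str, out_fmt: str = "qcow2",
                 copy_bitmaps: bool = True,
                 skip_broken: bool = True) -> List[str]:
    """Build a qemu-img convert command line as an argv list."""
    argv = ["qemu-img", "convert", "-O", out_fmt]
    if copy_bitmaps:
        argv.append("--bitmaps")
        # Assumption: skipping broken bitmaps only applies when
        # bitmaps are being copied at all.
        if skip_broken:
            argv.append("--skip-broken-bitmaps")
    argv += [src, dst]
    return argv


# Only builds and prints the command; nothing is executed here.
print(" ".join(convert_argv("/home/data.qcow2", "/home/test/data_new.qcow2")))
```

The resulting list can be handed to `subprocess.run(..., check=True)` by the caller.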

Comment 9 Nir Soffer 2021-07-06 18:47:45 UTC
(In reply to Klaus Heinrich Kiwi from comment #7)
I adjusted the summary again since "improve options and defaults"
does not reflect the issue well. That wording does sound like a low-priority
thing, while the actual issue is a failure to copy a disk.

Comment 10 Nir Soffer 2021-07-07 13:10:05 UTC
More info - we need to support copying disks used by a vm, for example during
live storage migration. In this flow we create a snapshot, mirror the snapshot
to the destination disk using blockCopy job, and copy the rest of the layers
using qemu-img convert.

If qemu-img convert fails because a layer includes an inconsistent bitmap, the
entire live storage migration will fail.

Removing the bitmap before the copy requires modifying images *used* by qemu
using "qemu-img bitmap --remove". This is likely not possible, and even if it is
possible it sounds like a really bad idea. So this must be handled by qemu-img
convert.

Comment 12 Eric Blake 2021-07-08 01:34:27 UTC
Patch posted for upstream discussion:
https://lists.gnu.org/archive/html/qemu-devel/2021-07/msg01731.html

Comment 13 Klaus Heinrich Kiwi 2021-07-13 19:48:24 UTC
(In reply to Nir Soffer from comment #8)
> (In reply to Klaus Heinrich Kiwi from comment #7)
> > reclassifying this bug to be of low severity, low priority - let me know if
> > there's disagreement.
> 
> This is a regression that causes failures in basic flows in RHV.
> 
> After a VM is terminated abnormally, it will have bitmaps with 'in-use' flag.
> Copying the vm disks to other storage will fail because of the invalid
> bitmaps,
> and there is no way to fix this without deleting the bitmaps manually, which
> means support case.
> 
> Before adding the feature to copy bitmaps, abnormal termination did not cause
> any failures when copying the disks because qemu-img did not copy any
> bitmaps.
> 
> So I don't think this is a low severity.
> 
<... snip ...>
> 
> So from my point of view, it is high priority that this will be fixed in
> qemu and not in RHV.


Thanks for the clarification. If this is an interface that is exposed to / supported for users / LPs and is now behaving in a way that can crash, I am convinced this is a high-priority regression. Luckily, it seems like the upstream work is progressing.

Comment 14 Eric Blake 2021-07-21 20:13:08 UTC
Upstream pull request for 6.1-rc1
https://lists.gnu.org/archive/html/qemu-devel/2021-07/msg05705.html

Comment 15 Nir Soffer 2021-07-22 10:33:31 UTC
(In reply to Klaus Heinrich Kiwi from comment #13)
Is it possible to get the fix in 8.4.z?

The fix affects only qemu-img convert when using the --bitmaps option. So testing
is limited to qemu-img tool.

Comment 16 Klaus Heinrich Kiwi 2021-07-22 19:58:12 UTC
(In reply to Nir Soffer from comment #15)
> (In reply to Klaus Heinrich Kiwi from comment #13)
> Is it possible to get the fix in 8.4.z?
> 
> The fix affects only qemu-img convert when using the --bitmaps option. So
> testing
> is limited to qemu-img tool.

I generally agree with bringing it to 8.4.z since customers could be affected otherwise. I assume the same updates needed on your end for 8.5 would be needed for 8.4.z? 

As for doability, I'd defer to Eric - If it's not too much (throw-away/risky) work, I'd suggest we backport it.

Comment 18 John Ferlan 2021-08-06 15:24:14 UTC
zstream requested per comment 15 as RHV customer w/ RHEL-AV.  Eric notes patches are essentially the same as posted for 8.5.

Comment 19 John Ferlan 2021-08-10 12:49:59 UTC
Can we get a qa_ack+ please.  I set ITM=26 since that's about the end of the non exception part of the release and the normal 2 week after DTM date.  Bug has 1 more downstream ack, so it should move to on_qa soon.

Comment 26 Yanan Fu 2021-08-19 06:35:16 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 27 aihua liang 2021-08-19 06:39:58 UTC
Tested on qemu-kvm-6.0.0-28.module+el8.5.0+12271+fffa967b; the problem has been resolved.

Test Env:
  Kernel version:4.18.0-330.el8.x86_64
  qemu-kvm version:qemu-kvm-6.0.0-28.module+el8.5.0+12271+fffa967b

Test Steps:
 1.Create a qcow2 image
   #qemu-img create -f qcow2 /home/data.qcow2 2G

 2.Add a persistent bitmap offline
   #qemu-img bitmap /home/data.qcow2 --add bitmap_persistent

 3.Expose the persistent bitmap via qemu-nbd
   #qemu-nbd -t -p 10098 /home/data.qcow2 -B bitmap_persistent

 4.Kill qemu-nbd
   #kill -9 $pid_qemu_nbd

 5.Check image info of data.qcow2
   # qemu-img info /home/data.qcow2 
image: /home/data.qcow2
file format: qcow2
virtual size: 2 GiB (2147483648 bytes)
disk size: 324 KiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    bitmaps:
        [0]:
            flags:
                [0]: in-use
                [1]: auto
            name: bitmap_persistent
            granularity: 65536
    refcount bits: 16
    corrupt: false
    extended l2: false

  6.Add a new persistent bitmap to /home/data.qcow2
    #qemu-img bitmap /home/data.qcow2 --add bitmap_add

  7.Check image info of /home/data.qcow2
    #qemu-img info /home/data.qcow2 
image: /home/data.qcow2
file format: qcow2
virtual size: 2 GiB (2147483648 bytes)
disk size: 452 KiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    bitmaps:
        [0]:
            flags:
                [0]: in-use
                [1]: auto
            name: bitmap_persistent
            granularity: 65536
        [1]:
            flags:
                [0]: auto
            name: bitmap_add
            granularity: 65536
    refcount bits: 16
    corrupt: false
    extended l2: false

 8.Convert image with all bitmaps
   #qemu-img convert -O qcow2 /home/data.qcow2 /home/test/data_new.qcow2 --bitmaps
qemu-img: Cannot copy inconsistent bitmap 'bitmap_persistent'
Try --skip-broken-bitmaps, or use 'qemu-img bitmap --remove' to delete it

 9.Convert image with inconsistent bitmap skipped.
   #qemu-img convert -O qcow2 /home/data.qcow2 /home/test/data_new.qcow2 --bitmaps --skip-broken-bitmaps
qemu-img: warning: Skipping inconsistent bitmap 'bitmap_persistent'

 10.Check image info of converted image
   #qemu-img info /home/test/data_new.qcow2 
image: /home/test/data_new.qcow2
file format: qcow2
virtual size: 2 GiB (2147483648 bytes)
disk size: 324 KiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    bitmaps:
        [0]:
            flags:
                [0]: auto
            name: bitmap_add
            granularity: 65536
    refcount bits: 16
    corrupt: false
    extended l2: false
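
The outcome of steps 7 to 10 can be expressed as a small Python check: after a convert with `--bitmaps --skip-broken-bitmaps`, the target should contain exactly the source bitmaps that do not carry the `in-use` flag. The bitmap entries below are copied by hand from the `qemu-img info` outputs above; the function name is hypothetical.

```python
from typing import Dict, List


def expected_after_skip(bitmaps: List[Dict]) -> List[str]:
    """Names of bitmaps expected to survive a skip-broken convert."""
    return [b["name"] for b in bitmaps if "in-use" not in b["flags"]]


# Bitmaps of /home/data.qcow2 as shown in step 7.
source_bitmaps = [
    {"name": "bitmap_persistent", "flags": ["in-use", "auto"],
     "granularity": 65536},
    {"name": "bitmap_add", "flags": ["auto"], "granularity": 65536},
]

# Bitmaps of /home/test/data_new.qcow2 as shown in step 10.
converted_bitmaps = [
    {"name": "bitmap_add", "flags": ["auto"], "granularity": 65536},
]

converted_names = [b["name"] for b in converted_bitmaps]
assert expected_after_skip(source_bitmaps) == converted_names
print(converted_names)  # prints ['bitmap_add']
```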

Comment 28 aihua liang 2021-08-20 03:06:38 UTC
Per comment 27, setting the bug's status to "Verified".

Comment 29 Klaus Heinrich Kiwi 2021-08-25 17:17:10 UTC
(In reply to Klaus Heinrich Kiwi from comment #16)
> 
> As for doability, I'd defer to Eric - If it's not too much
> (throw-away/risky) work, I'd suggest we backport it.
z-stream was approved, clearing the needinfo flag

Comment 31 errata-xmlrpc 2021-11-16 07:52:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684

