Bug 1829829

Summary: [incremental backup] Creating incremental backup that includes a new VM disk that requires full backup is impossible
Product: Red Hat Enterprise Linux Advanced Virtualization Reporter: Eyal Shenitzky <eshenitz>
Component: libvirtAssignee: Peter Krempa <pkrempa>
Status: CLOSED ERRATA QA Contact: yisun
Severity: high Docs Contact:
Priority: high    
Version: 8.2CC: aliang, dyuan, jdenemar, lmen, lmiksik, pkrempa, virt-maint, xuzhang
Target Milestone: rcKeywords: Triaged
Target Release: 8.3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: libvirt-6.6.0-1.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-11-17 17:48:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1861674    
Attachments:
Description Flags
libvirt logs none

Description Eyal Shenitzky 2020-04-30 12:51:29 UTC
Created attachment 1683301 [details]
libvirt logs

Description of problem:

Libvirt supports only one checkpoint_id per backup for creating an incremental backup, this behavior prevents from including new disks that were added after the backup was taken (full or incremental).

It causes the user to maintain separated chains of backups for each newly added disks for the VM.

For example - 

Create a VM with an OS disk and run it.
Perform a full backup for the disk.
Add a new disk for the VM
Start an incremental backup that includes both disks using the given checkpoint ID that was created in step 2.

The operation fails with the following error - 
"Error starting backup: invalid argument: failed to find bitmap \'0d25d274-35f1-4c33-b9b8-f5894540206c\' in image \'sdd6\'"}'


Version-Release number of selected component (if applicable):
libvirt-6.1.0-2
qemu-4.2.0-7

How reproducible:
100%

Steps to Reproduce (oVirt operations):
1. Create a VM with a disk
2. Start the VM
3. Create a full backup for the disk -> checkpoint '123' created
4. Stop the backup
5. Attach a new disk for the VM.
6. Start a new backup that includes both disks using the created checkpoint ID from step 3

Actual results:
Mixed incremental/full backup operation failed

Expected results:
Libvirt should provide a way to unify the backup creation of newly added disks

Additional info:

Comment 1 Peter Krempa 2020-04-30 13:08:42 UTC
As outlined in the private email thread that lead to filing this bug, there are multiple possible options how to tackle this problem:

1) Allow a full disk backup of the new disk to be included in an incremental backup
  1a) by having a global per backup job flag to enable full backup for any disk where the bitmaps are not present
  1b) by having a per-disk flag to notify that this disk is allowed to have a full backup (doesn't fail if the disk has the appropriate checkpoints, just creates incremental backup)
  1c) by having a per-disk flag to force a full backup (full backup even if checkpoints are present)
  1d) by having a per-disk-flag to force a full backup and report errors if checkpoints are present

2) allow more than one checkpoint as a source for the incremental backup

  This one may be problematic due to the design we chose for the hierarchy of backups. I'll need to think about it.

Additionally even 1) has a similar possible set of issues though. The new disk which is not part of the previous checkpoint might cause problems when deleting checkpoints and other situations e.g. when taking backup starting from an earlier checkpoint.

Thus please state what would be the best solution for you and then I can start designing how to overcome the design limitations we have.

Comment 3 Peter Krempa 2020-07-08 06:48:26 UTC
Libvirt now allows configuring the checkpoint for incremental backup and also backup mode per disk. This can be used to force a full backup for an individualdisk or select a different checkpoint to backup from while creating a common checkpoint for any of them.

Committed upstream as:

commit 7e5b993d3b8cae9c43b753591a7b12db5c540da5
Author: Peter Krempa <pkrempa>
Date:   Thu Jun 25 16:16:14 2020 +0200

    backup: Allow configuring incremental backup per-disk individually
    
    The semantics of the backup operation don't strictly require that all
    disks being backed up are part of the same incremental part (when a disk
    was checkpointed/backed up separately or in a different VM), or even
    they may not have a previous checkpoint at all (e.g. when the disk
    was freshly hotplugged to the vm).
    
    In such cases we can still create a common checkpoint for all of them
    and backup differences according to configuration.
    
    This patch adds a per-disk configuration of the checkpoint to do the
    incremental backup from via the 'incremental' attribute and allows
    perform full backups via the 'backupmode' attribute.
    
    Note that no changes to the qemu driver are necessary to take advantage
    of this as we already obey the per-disk 'incremental' field.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1829829
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Eric Blake <eblake>


commit 6f33e441e236dd738fd7776bf062cc546835226c
Author: Peter Krempa <pkrempa>
Date:   Tue Jul 7 16:38:00 2020 +0200

    backupxml2xmltest: Call 'virDomainBackupAlignDisks' before formatting output
    
    Call the post-processing function so that we can validate that it does
    the correct thing.
    
    virDomainBackupAlignDisks requires disk definitions to be present so
    let's fake them by copying disks from the backup definition and add one
    extra disk 'vdextradisk'.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Eric Blake <eblake>

commit 1a7ce56ae12c71f4a28fb9578240630dc9afee49
Author: Peter Krempa <pkrempa>
Date:   Tue Jul 7 16:52:50 2020 +0200

    virDomainBackupDiskDefFormat: Format internal disk state only when valid
    
    Format the disk state only when it isn't _NONE.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Eric Blake <eblake>

commit 5c08a3739b7c1d45c35bf403451cbb5f35f4fa58
Author: Peter Krempa <pkrempa>
Date:   Tue Jul 7 16:56:16 2020 +0200

    backupxml2xmltest: Remove output symlink of 'backup-pull-internal-invalid'
    
    Replace the output by a copy of the input file for further changes once
    we start testing virDomainBackupAlignDisks.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Eric Blake <eblake>
:

Comment 4 Eyal Shenitzky 2020-08-03 10:22:11 UTC
Is the fix for this bug will introduce a API change? the checkpoint/backup XML will be changed?

If not can you please explain in a high level what will be the change?

Comment 5 Peter Krempa 2020-08-03 10:28:31 UTC
It's achieved using two new XML attributes for the backup disk in the backup XML.

I suggest you look at the commit which also adds documentation of the new fields:

https://gitlab.com/libvirt/libvirt/-/commit/7e5b993d3b8cae9c43b753591a7b12db5c540da5

Comment 8 yisun 2020-09-18 09:20:08 UTC
Test with: 
libvirt-6.6.0-6.module+el8.3.0+8125+aefcf088.x86_64
qemu-kvm-5.1.0-7.module+el8.3.0+8099+dba2fe3e.x86_64

Result: PASS

Pls note:
Only pull mode backup test result recorded here, since the pull mode and push mode will enter same code path.
Push mode and negative test scenarios will be put in our test case document.

==================================================
Script to dump nbd export to local file
==================================================
[root@dell-per740xd-10 ~]# cat copyif3.sh 
copyif2() {
map_from="--image-opts driver=nbd,export=$4,server.type=inet"
map_from+=",server.host=127.0.0.1,server.port=10809"
map_from+=",x-dirty-bitmap=qemu:dirty-bitmap:$3"
state=false
qemu-img rebase -u -f qcow2 -F raw -b $1 $2
while read line; do
  [[ $line =~ .*start.:.([0-9]*).*length.:.([0-9]*).*data.:.$state.* ]] || continue
  start=${BASH_REMATCH[1]} len=${BASH_REMATCH[2]}
  echo
  echo " $start $len:"
  qemu-io -C -c "r $start $len" -f qcow2 $2
done < <(qemu-img map --output=json $map_from)
qemu-img rebase -u -f qcow2 -b '' $2
}

copyif2 $1 $2 $3 $4


==================================================
Script to make a disk chain for some qcow2 files
==================================================
[root@dell-per740xd-10 images]# cat make_chain.py 
import sys
import os
import shutil
import subprocess 


USAGE = """
====================================================================
Usage: python make_chain.py file1 file2 file3 ...
Script will make a copy of each file, and make the copies as a chain:
    cp_file1 <- cp_file2  <- cp_file3 ...
This means cp_file3 based on cp_file2 based on cp_file1
====================================================================
"""

def check_files(img_files):
    for img_file in (img_files):
        if not os.path.exists(img_file):
            exit(("File '%s' not existing" % img_file) + "\n" + USAGE)


def copy_files(img_files):
    img_copies = []
    for img_file in img_files:
        new_img_file = img_file + ".copy"
        if os.path.exists(new_img_file):
            os.remove(new_img_file)
            print("File '%s' deleted" % new_img_file)
        shutil.copyfile(img_file, new_img_file)
        img_copies.append(new_img_file)
    return img_copies


def make_chain(img_files):
    cmd_pattern = "qemu-img rebase -u -f qcow2 -b %s -F qcow2 %s"
    chain = ""
    for i in range(len(img_files)):
        if i == 0:
            chain += img_files[i]
            continue
        cmd = cmd_pattern % (img_files[i-1], img_files[i])
        print("Making Chain: '%s' based on '%s'" % (img_files[i], img_files[i-1]))
        status = subprocess.call(cmd,shell=True)
        if status:
            exit()
        chain += " <=== %s" % img_files[i]
    print("\nNOW THE CHAIN IS:\n"
          "======================================================================\n"
          "\t%s\n"
          "======================================================================"
          % chain)

if len(sys.argv) <= 2:
    exit(USAGE)

img_files = sys.argv[1:]
check_files(img_files)
img_copies = copy_files(img_files)
make_chain(img_copies)



==================================================
Test steps:
==================================================
1. prepare images for vdb and vdc
[root@dell-per740xd-10 ~]# qemu-img create -f qcow2 /var/lib/libvirt/images/vdb.qcow2 100M
Formatting '/var/lib/libvirt/images/vdb.qcow2', fmt=qcow2 cluster_size=65536 compression_type=zlib size=104857600 lazy_refcounts=off refcount_bits=16
[root@dell-per740xd-10 ~]# qemu-img create -f qcow2 /var/lib/libvirt/images/vdc.qcow2 200M
Formatting '/var/lib/libvirt/images/vdc.qcow2', fmt=qcow2 cluster_size=65536 compression_type=zlib size=209715200 lazy_refcounts=off refcount_bits=16


2. make sure vm1 already has vdb and we'll attach vdc to it
[root@dell-per740xd-10 ~]# virsh domblklist vm1
 Target   Source
--------------------------------------------------------
 vda      /var/lib/libvirt/images/jeos-27-x86_64.qcow2
 vdb      /var/lib/libvirt/images/vdb.qcow2

3. prepare vdc disk xml
[root@dell-per740xd-10 ~]# cat vdc.xml 
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/vdc.qcow2'/>
      <target dev='vdc' bus='virtio'/>
    </disk>


4. Generete some data in vm's vdb, and get one line data from hexdump
[root@localhost ~]# dd if=/dev/urandom of=/dev/vdb bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.0950622 s, 110 MB/s

[root@localhost ~]# hexdump /dev/vdb | head -1
0000000 fd6a 31ce 6c5b d0f3 b5ea b26a 62b3 23ff

5. start full backup for vdb
[root@dell-per740xd-10 ~]# cat ck1.xml 
<domaincheckpoint>
  <name>ck1</name>
  <disks>
    <disk name='vda' checkpoint='no'/>
    <disk name='vdb' checkpoint='bitmap'/>
  </disks>
</domaincheckpoint>

[root@dell-per740xd-10 ~]# cat bk1.xml 
<domainbackup mode='pull'>
  <server name="localhost" port="10809"/>
  <disks>
    <disk name='vda' backup='no'/>
    <disk name='vdb' backup='yes' type='file'>
        <scratch file='/tmp/vdb_scratch'/>
    </disk>
  </disks>
</domainbackup>

[root@dell-per740xd-10 ~]# virsh backup-begin vm1 bk1.xml ck1.xml 
Backup started

[root@dell-per740xd-10 ~]# qemu-img convert -f raw nbd://localhost:10809/vdb -O qcow2 /var/lib/libvirt/images/vdb.full.backup

6. attach vdc to disk
[root@dell-per740xd-10 ~]# virsh attach-device vm1 vdc.xml 
Device attached successfully

7. generate some new data for vdb and vdc
[root@localhost ~]# dd if=/dev/urandom of=/dev/vdb bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.0893557 s, 117 MB/s

[root@localhost ~]# hexdump /dev/vdb | head -1
0000000 03e1 2d26 9d85 0f43 7fa8 692c ad35 b8e4

[root@localhost ~]# dd if=/dev/urandom of=/dev/vdc bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.0896498 s, 117 MB/s

[root@localhost ~]# hexdump /dev/vdc | head -1
0000000 f66c 2cf9 388d ea59 e97d 3934 0ef5 cecd

8. prepare checkpoint and backup xml for the 2nd round backup
[root@dell-per740xd-10 ~]# cat ck2.xml 
<domaincheckpoint>
  <name>ck2</name>
  <disks>
    <disk name='vda' checkpoint='no'/>
    <disk name='vdb' checkpoint='bitmap'/>
    <disk name='vdc' checkpoint='bitmap'/>
  </disks>
</domaincheckpoint>
[root@dell-per740xd-10 ~]# cat bk2.xml 
<domainbackup mode='pull'>
  <incremental>ck1</incremental>
  <server name="localhost" port="10809"/>
  <disks>
    <disk name='vda' backup='no'/>
    <disk name='vdb' backup='yes' type='file'>
        <scratch file='/tmp/vdb_scratch'/>
    </disk>
    <disk name='vdc' backup='yes' type='file' backupmode='full'>
        <scratch file='/tmp/vdc_scratch'/>
    </disk>
  </disks>
</domainbackup>

9. start 2nd round backup
[root@dell-per740xd-10 ~]# virsh backup-begin vm1 bk2.xml ck2.xml 
Backup started

10. save the backuped data and abort backup job
vdb_inc1:
[root@dell-per740xd-10 ~]# qemu-img create -f qcow2 /var/lib/libvirt/images/vdb.inc1.backup 100M
Formatting '/var/lib/libvirt/images/vdb.inc1.backup', fmt=qcow2 cluster_size=65536 compression_type=zlib size=104857600 lazy_refcounts=off refcount_bits=16

[root@dell-per740xd-10 ~]# ./copyif3.sh nbd://127.0.0.1:10809/vdb /var/lib/libvirt/images/vdb.inc1.backup backup-vdb vdb
 0 10485760:
read 10485760/10485760 bytes at offset 0
10 MiB, 1 ops; 00.03 sec (371.044 MiB/sec and 37.1044 ops/sec)

vdc_full:
[root@dell-per740xd-10 ~]# qemu-img convert -f raw nbd://localhost:10809/vdc -O qcow2 /var/lib/libvirt/images/vdc.full.backup

[root@dell-per740xd-10 ~]# virsh domjobabort vm1

11. generate some new data for vdb and vdc
[root@localhost ~]# dd if=/dev/urandom of=/dev/vdb bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.0827228 s, 127 MB/s

[root@localhost ~]# hexdump /dev/vdb | head -1
0000000 2e47 afbd 106f 989e 3a81 535d edaf 74f5

[root@localhost ~]# dd if=/dev/urandom of=/dev/vdc bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.082781 s, 127 MB/s

[root@localhost ~]# hexdump /dev/vdc | head -1
0000000 8299 aaea c9fc f1b6 6907 6202 f8ca 702c

12. prepare checkpoint and backup xml for 3rd round backup
[root@dell-per740xd-10 ~]# cat bk3.xml 
<domainbackup mode='pull'>
  <server name="localhost" port="10809"/>
  <disks>
    <disk name='vda' backup='no'/>
    <disk name='vdb' backup='yes' type='file' backupmode='incremental' incremental='ck2'>
        <scratch file='/tmp/vdb_scratch'/>
    </disk>
    <disk name='vdc' backup='yes' type='file' incremental='ck2'>
        <scratch file='/tmp/vdc_scratch'/>
    </disk>
  </disks>
</domainbackup>
[root@dell-per740xd-10 ~]# cat ck3.xml 
<domaincheckpoint>
  <name>ck3</name>
  <disks>
    <disk name='vda' checkpoint='no'/>
    <disk name='vdb' checkpoint='bitmap'/>
    <disk name='vdc' checkpoint='bitmap'/>
  </disks>
</domaincheckpoint>

13. start 3rd round backup
[root@dell-per740xd-10 ~]# virsh backup-begin vm1 bk3.xml ck3.xml 
Backup started


14. save the backuped data
vdb incremental backup 2:
[root@dell-per740xd-10 ~]# qemu-img create -f qcow2 /var/lib/libvirt/images/vdb.inc2.backup 100M
Formatting '/var/lib/libvirt/images/vdb.inc2.backup', fmt=qcow2 cluster_size=65536 compression_type=zlib size=104857600 lazy_refcounts=off refcount_bits=16

[root@dell-per740xd-10 ~]# ./copyif3.sh nbd://127.0.0.1:10809/vdb /var/lib/libvirt/images/vdb.inc2.backup backup-vdb vdb

 0 10485760:
read 10485760/10485760 bytes at offset 0
10 MiB, 1 ops; 00.03 sec (357.202 MiB/sec and 35.7202 ops/sec)

vdc incremental backup 1:
[root@dell-per740xd-10 ~]# qemu-img create -f qcow2 /var/lib/libvirt/images/vdc.inc1.backup 200M
Formatting '/var/lib/libvirt/images/vdc.inc1.backup', fmt=qcow2 cluster_size=65536 compression_type=zlib size=209715200 lazy_refcounts=off refcount_bits=16
[root@dell-per740xd-10 ~]# ./copyif3.sh nbd://127.0.0.1:10809/vdc /var/lib/libvirt/images/vdc.inc1.backup backup-vdc vdc

 0 10485760:
read 10485760/10485760 bytes at offset 0
10 MiB, 1 ops; 00.03 sec (345.078 MiB/sec and 34.5078 ops/sec)

15. make disk chain for vdb and vdc backup files and check their data is correct
15.1 make disk chain for vdb backup files
[root@dell-per740xd-10 images]# python make_chain.py vdb.full.backup vdb.inc1.backup vdb.inc2.backup
Making Chain: 'vdb.inc1.backup.copy' based on 'vdb.full.backup.copy'
Making Chain: 'vdb.inc2.backup.copy' based on 'vdb.inc1.backup.copy'

NOW THE CHAIN IS:
======================================================================
	vdb.full.backup.copy <=== vdb.inc1.backup.copy <=== vdb.inc2.backup.copy
======================================================================

15.2 make disk chain for vdc backup files
[root@dell-per740xd-10 images]# python make_chain.py vdc.full.backup vdc.inc1.backup 
Making Chain: 'vdc.inc1.backup.copy' based on 'vdc.full.backup.copy'

NOW THE CHAIN IS:
======================================================================
	vdc.full.backup.copy <=== vdc.inc1.backup.copy
======================================================================

15.2 check vdb data
using vdb.inc2.backup.copy
data of hexdump: 
[root@localhost ~]# hexdump /dev/vdb | head -1
0000000 2e47 afbd 106f 989e 3a81 535d edaf 74f5
<== same as step 11 data

using vdb.inc1.backup.copy
data of hexdump:
[root@localhost ~]# hexdump /dev/vdb | head -1
0000000 03e1 2d26 9d85 0f43 7fa8 692c ad35 b8e4
<== same as step 7

using vdb.full.backup.copy
data of hexdump:
[root@localhost ~]# hexdump /dev/vdb | head -1
0000000 fd6a 31ce 6c5b d0f3 b5ea b26a 62b3 23ff
<== same as step 4

15.3 check vdc data
using: vdc.inc1.backup.copy
data of hexdump: 
[root@localhost ~]# hexdump /dev/vdc | head -1
0000000 8299 aaea c9fc f1b6 6907 6202 f8ca 702c
<== same as step 11

using: vdc.full.backup.copy
[root@localhost ~]# hexdump /dev/vdc | head -1
0000000 f66c 2cf9 388d ea59 e97d 3934 0ef5 cecd
<== same as step 7

Comment 11 errata-xmlrpc 2020-11-17 17:48:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137