Bug 1904487 - [incremental_backup] Provide more detailed errors for virDomainBackupBegin
Summary: [incremental_backup] Provide more detailed errors for virDomainBackupBegin
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.2
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: rc
: 8.3
Assignee: Peter Krempa
QA Contact: yisun
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-12-04 15:40 UTC by Peter Krempa
Modified: 2021-05-25 06:45 UTC (History)
10 users (show)

Fixed In Version: libvirt-7.0.0-1.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1874846
Environment:
Last Closed: 2021-05-25 06:45:17 UTC
Type: Bug
Target Upstream Version: 6.10.0
Embargoed:
pm-rhel: mirror+


Attachments (Terms of Use)

Description Peter Krempa 2020-12-04 15:40:09 UTC
+++ This bug was initially created as a clone of Bug #1874846 +++

Description of problem:

When virDomainBackupBegin fails, we always get a generic internal error.

The documentation does not document any error for this API:
https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBackupBegin

When we get internal error, we don't have a good way to recover from the error.
The only way to ensure that the next backup will succeed (if we don't have some
bug in libvirt, or unexpected condition) is to delete all checkpoints, and start
from full backup. This is not good error handling strategy.

We can assume that the error is temporary and retry the operation, but this is
also not attractive option. For example if backup failed because the request was
invalid (e.g, trying to do incremental backup with raw disk) it can never succeed.
If the backup failed because a bitmap is missing or inconsistent, we know that
we must delete some checkpoints, and we can do backup only for checkpoints after
the checkpoint with the bad bitmap.

So what we need is public documented error codes like:
https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainDiskErrorCode

That allow detection of:
- Invalid request (e.g. incremental backup of raw file)
- Temporary failure - the request failed but it may succeed in the future
- Unrecoverable failure - this backup cannot succeed and should not be retried

In the unrecoverable failure category, having a special error code for missing
bitmap can help. I don't know if this is possible since before backup we redefine
all checkpoints, and we assume that libvirt validate that bitmap exists during
checkpoint redefinition.

Version-Release number of selected component (if applicable):
8.2

More info:
We had one report so far on backup failing because a bitmap was missing after
all checkpoints were redefined. We don't have enough details about this case.

--- Additional comment from  on 2020-09-09 04:41:13 CEST ---

Here is a reproduce steps:
1. vm has a vdb pointing to a raw img
[root@dell-per740xd-11 ~]# virsh dumpxml vm1 | awk '/<disk/,/<\/disk/'
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/vdb.raw' index='1'/>
      <backingStore/>
      <target dev='vdb' bus='virtio'/>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </disk>

2. start the vm
[root@dell-per740xd-11 ~]# virsh start vm1
virsh baDomain vm1 started

3. prepare backup and checkpoint xml to do backup
[root@dell-per740xd-11 ~]# cat backup_push_full.xml checkpoint_0.xml 
<domainbackup mode="push">
  <disks>
    <disk name='vda' backup='no'/>
    <disk name='vdb' type='file'>
      <driver type='qcow2'/>
      <target file='/tmp/push_full_backup.img'>
        <encryption format='luks'>
          <secret type='passphrase' usage='/luks/img'/>
        </encryption>
      </target>
    </disk>
  </disks>
</domainbackup>
<domaincheckpoint>
  <disks>
    <disk checkpoint="no" name="vda" />
    <disk checkpoint="bitmap" name="vdb" />
  </disks>
  <name>cp0</name>
  <description>cp0</description>
</domaincheckpoint>


4. start the backup
[root@dell-per740xd-11 ~]# virsh backup-begin vm1 backup_push_full.xml checkpoint_0.xml 
error: unsupported configuration: checkpoint for disk vdb unsupported for storage type raw

5. get the error number, it's '1'
[root@dell-per740xd-11 ~]# echo $?
1

--- Additional comment from Peter Krempa on 2020-12-04 16:33:34 CET ---

Upstream:

commit 2a917e6756578e43b21695aa284b39bdc05fa96f
Author: Daniel P. Berrangé <berrange>
Date:   Mon Nov 9 16:14:53 2020 +0000

    Fix name prefix of VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE
    
    The enum constant names should all have a prefix that matches the enum
    name. VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE was missing the "CREATE_"
    part of the name prefix.
    
    Reviewed-by: Peter Krempa <pkrempa>
    Signed-off-by: Daniel P. Berrangé <berrange>


commit 5ab8cc78c4ab026338070409196fe14ee56a0d04
Author: Peter Krempa <pkrempa>
Date:   Wed Nov 4 13:37:35 2020 +0100

    qemu: backup: Add partial validation of incremental backup checkpoint
    
    Verify that the checkpoint requested by an incremental backup exists.
    Unfortunately validating whether the checkpoint configuration actually
    matches the disk may not be reasonably feasible as the disk may have
    been renamed/snapshotted/etc. We still rely on bitmap presence.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit a4d4d2bd5deb3efaa45a5a3386f681363a54a6e5
Author: Peter Krempa <pkrempa>
Date:   Wed Nov 4 10:16:02 2020 +0100

    qemu: checkpoint: Implement VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE
    
    Validate that the bitmaps are present when redefining a checkpoint.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit 93873c9bcc4cba575616d977eb00acf73a6f5403
Author: Peter Krempa <pkrempa>
Date:   Wed Nov 4 10:10:56 2020 +0100

    conf: checkpoint: Split virDomainCheckpointRedefinePrep into two functions
    
    First one prepares and validates the definition, the second one actually
    either updates an existing checkpoint or assigns definition for the new
    one.
    
    This will allow driver code to add extra validation between those
    steps.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>



commit f37d306f6e02a469bd79293f8431df9ca3034744
Author: Peter Krempa <pkrempa>
Date:   Tue Nov 3 12:49:11 2020 +0100

    virsh: checkpoint-create: Add support for VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit 9b54eb84c8a25ed1b71d9d1cc0e65a630915da30
Author: Peter Krempa <pkrempa>
Date:   Tue Nov 3 12:28:46 2020 +0100

    checkpoint: Introduce VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE flag
    
    Introduce a flag which will allow users to perform hypervisor-specific
    validation when redefining the checkpoint metadata. This will allow
    checking metadata which is stored e.g. in disk images when populating
    the libvirt metadata.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit e33e89d8399042f55fa6999fdfaa40ee884a9bfe
Author: Peter Krempa <pkrempa>
Date:   Tue Nov 3 12:00:21 2020 +0100

    qemu: backup: Use VIR_ERR_CHECKPOINT_INCONSISTENT when starting a backup
    
    If we don't have a consistent chain of bitmaps for the backup to proceed
    we'd report VIR_ERR_INVALID_ARG error code, which makes it hard to
    decide whether an incremental backup makes even sense.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit facfa8262ecab062229228b8d3662b391384ebe1
Author: Peter Krempa <pkrempa>
Date:   Tue Nov 3 11:57:58 2020 +0100

    error: Introduce VIR_ERR_CHECKPOINT_INCONSISTENT error code
    
    This code will be used to signal cases when the checkpoint is broken
    either during backup or other operations where a user might want to make
    decision based on the presence of the checkpoint, such as do a full
    backup instead of an incremental one.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit af7047717f83a7f75c5ae3b318dc4d44b81627fd
Author: Peter Krempa <pkrempa>
Date:   Tue Nov 3 12:33:55 2020 +0100

    man: virsh: Mention that '--size' for 'checkpoint-dumpxml' may require running vm
    
    Separate the docs for the '--size' flag into its own paragraph and
    mention that the domain may be required to be running.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit 1bb33357eaaa36e12b31c794c603d27f6ae03835
Author: Peter Krempa <pkrempa>
Date:   Tue Nov 3 12:30:53 2020 +0100

    checkpoint: Mention that VIR_DOMAIN_CHECKPOINT_XML_SIZE may require running vm
    
    The qemu implementation requires that the VM associated with the
    checkpoint is running when checking the size. Mention this possibility
    with the flag.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

Comment 1 yisun 2021-01-08 04:10:51 UTC
PreVerified with libvirt-6.10.0-1.fc34.x86_64
Result: PASS

➜  fedora virsh start pc
Domain 'pc' started

➜  fedora virsh domblklist pc
 Target   Source
---------------------------------------------
 vda      /var/lib/libvirt/images/pc.qcow2
 vdb      /var/lib/libvirt/images/vdb.qcow2

➜  fedora virsh checkpoint-list pc
 Name   Creation Time
-----------------------


➜  fedora cat ck1.xml
<domaincheckpoint>
  <name>ck1</name>
  <disks>
    <disk name='vda' checkpoint='no'/>
    <disk name='vdb' checkpoint='bitmap'/>
  </disks>
</domaincheckpoint>

➜  fedora virsh checkpoint-create pc ck1.xml
Domain checkpoint ck1 created from 'ck1.xml'

➜  fedora virsh checkpoint-dumpxml pc ck1 --no-domain > created_ck.xml

➜  fedora cat created_ck.xml
<domaincheckpoint>
  <name>ck1</name>
  <creationTime>1610078860</creationTime>
  <disks>
    <disk name='vda' checkpoint='no'/>
    <disk name='vdb' checkpoint='bitmap' bitmap='ck1'/>
  </disks>
</domaincheckpoint>

➜  fedora virsh checkpoint-list pc
 Name   Creation Time
-----------------------------------
 ck1    2021-01-08 04:07:40 +0000

➜  fedora virsh destroy pc
Domain 'pc' destroyed

➜  fedora qemu-img create -f qcow2 /var/lib/libvirt/images/vdb.qcow2 100M
Formatting '/var/lib/libvirt/images/vdb.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=104857600 lazy_refcounts=off refcount_bits=16
➜  fedora virsh start pc
Domain 'pc' started

➜  fedora virsh checkpoint-delete pc ck1 --metadata
Domain checkpoint ck1 deleted

➜  fedora virsh checkpoint-create pc created_ck.xml --redefine --redefine-validate
error: checkpoint inconsistent: missing or broken bitmap 'ck1' for disk 'vdb'

Comment 7 errata-xmlrpc 2021-05-25 06:45:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2098


Note You need to log in before you can comment on or make changes to this bug.