Bug 1874846 - [incremental_backup] Provide more detailed errors for virDomainBackupBegin
Summary: [incremental_backup] Provide more detailed errors for virDomainBackupBegin
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.2
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: rc
: 8.3
Assignee: Peter Krempa
QA Contact: yisun
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-09-02 12:29 UTC by Nir Soffer
Modified: 2021-02-22 15:40 UTC (History)
7 users (show)

Fixed In Version: libvirt-6.6.0-9.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1904487 (view as bug list)
Environment:
Last Closed: 2021-02-22 15:39:38 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Nir Soffer 2020-09-02 12:29:12 UTC
Description of problem:

When virDomainBackupBegin fails, we always get a generic internal error.

The documentation does not document any error for this API:
https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBackupBegin

When we get internal error, we don't have a good way to recover from the error.
The only way to ensure that the next backup will succeed (if we don't have some
bug in libvirt, or unexpected condition) is to delete all checkpoints, and start
from full backup. This is not good error handling strategy.

We can assume that the error is temporary and retry the operation, but this is
also not attractive option. For example if backup failed because the request was
invalid (e.g, trying to do incremental backup with raw disk) it can never succeed.
If the backup failed because a bitmap is missing or inconsistent, we know that
we must delete some checkpoints, and we can do backup only for checkpoints after
the checkpoint with the bad bitmap.

So what we need is public documented error codes like:
https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainDiskErrorCode

That allow detection of:
- Invalid request (e.g. incremental backup of raw file)
- Temporary failure - the request failed but it may succeed in the future
- Unrecoverable failure - this backup cannot succeed and should not be retried

In the unrecoverable failure category, having a special error code for missing
bitmap can help. I don't know if this is possible since before backup we redefine
all checkpoints, and we assume that libvirt validate that bitmap exists during
checkpoint redefinition.

Version-Release number of selected component (if applicable):
8.2

More info:
We had one report so far on backup failing because a bitmap was missing after
all checkpoints were redefined. We don't have enough details about this case.

Comment 1 yisun 2020-09-09 02:41:13 UTC
Here is a reproduce steps:
1. vm has a vdb pointing to a raw img
[root@dell-per740xd-11 ~]# virsh dumpxml vm1 | awk '/<disk/,/<\/disk/'
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/vdb.raw' index='1'/>
      <backingStore/>
      <target dev='vdb' bus='virtio'/>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </disk>

2. start the vm
[root@dell-per740xd-11 ~]# virsh start vm1
virsh baDomain vm1 started

3. prepare backup and checkpoint xml to do backup
[root@dell-per740xd-11 ~]# cat backup_push_full.xml checkpoint_0.xml 
<domainbackup mode="push">
  <disks>
    <disk name='vda' backup='no'/>
    <disk name='vdb' type='file'>
      <driver type='qcow2'/>
      <target file='/tmp/push_full_backup.img'>
        <encryption format='luks'>
          <secret type='passphrase' usage='/luks/img'/>
        </encryption>
      </target>
    </disk>
  </disks>
</domainbackup>
<domaincheckpoint>
  <disks>
    <disk checkpoint="no" name="vda" />
    <disk checkpoint="bitmap" name="vdb" />
  </disks>
  <name>cp0</name>
  <description>cp0</description>
</domaincheckpoint>


4. start the backup
[root@dell-per740xd-11 ~]# virsh backup-begin vm1 backup_push_full.xml checkpoint_0.xml 
error: unsupported configuration: checkpoint for disk vdb unsupported for storage type raw

5. get the error number, it's '1'
[root@dell-per740xd-11 ~]# echo $?
1

Comment 3 Peter Krempa 2020-12-04 15:33:34 UTC
Upstream:

commit 2a917e6756578e43b21695aa284b39bdc05fa96f
Author: Daniel P. Berrangé <berrange>
Date:   Mon Nov 9 16:14:53 2020 +0000

    Fix name prefix of VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE
    
    The enum constant names should all have a prefix that matches the enum
    name. VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE was missing the "CREATE_"
    part of the name prefix.
    
    Reviewed-by: Peter Krempa <pkrempa>
    Signed-off-by: Daniel P. Berrangé <berrange>


commit 5ab8cc78c4ab026338070409196fe14ee56a0d04
Author: Peter Krempa <pkrempa>
Date:   Wed Nov 4 13:37:35 2020 +0100

    qemu: backup: Add partial validation of incremental backup checkpoint
    
    Verify that the checkpoint requested by an incremental backup exists.
    Unfortunately validating whether the checkpoint configuration actually
    matches the disk may not be reasonably feasible as the disk may have
    been renamed/snapshotted/etc. We still rely on bitmap presence.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit a4d4d2bd5deb3efaa45a5a3386f681363a54a6e5
Author: Peter Krempa <pkrempa>
Date:   Wed Nov 4 10:16:02 2020 +0100

    qemu: checkpoint: Implement VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE
    
    Validate that the bitmaps are present when redefining a checkpoint.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit 93873c9bcc4cba575616d977eb00acf73a6f5403
Author: Peter Krempa <pkrempa>
Date:   Wed Nov 4 10:10:56 2020 +0100

    conf: checkpoint: Split virDomainCheckpointRedefinePrep into two functions
    
    First one prepares and validates the definition, the second one actually
    either updates an existing checkpoint or assigns definition for the new
    one.
    
    This will allow driver code to add extra validation between those
    steps.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>



commit f37d306f6e02a469bd79293f8431df9ca3034744
Author: Peter Krempa <pkrempa>
Date:   Tue Nov 3 12:49:11 2020 +0100

    virsh: checkpoint-create: Add support for VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit 9b54eb84c8a25ed1b71d9d1cc0e65a630915da30
Author: Peter Krempa <pkrempa>
Date:   Tue Nov 3 12:28:46 2020 +0100

    checkpoint: Introduce VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE flag
    
    Introduce a flag which will allow users to perform hypervisor-specific
    validation when redefining the checkpoint metadata. This will allow
    checking metadata which is stored e.g. in disk images when populating
    the libvirt metadata.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit e33e89d8399042f55fa6999fdfaa40ee884a9bfe
Author: Peter Krempa <pkrempa>
Date:   Tue Nov 3 12:00:21 2020 +0100

    qemu: backup: Use VIR_ERR_CHECKPOINT_INCONSISTENT when starting a backup
    
    If we don't have a consistent chain of bitmaps for the backup to proceed
    we'd report VIR_ERR_INVALID_ARG error code, which makes it hard to
    decide whether an incremental backup makes even sense.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit facfa8262ecab062229228b8d3662b391384ebe1
Author: Peter Krempa <pkrempa>
Date:   Tue Nov 3 11:57:58 2020 +0100

    error: Introduce VIR_ERR_CHECKPOINT_INCONSISTENT error code
    
    This code will be used to signal cases when the checkpoint is broken
    either during backup or other operations where a user might want to make
    decision based on the presence of the checkpoint, such as do a full
    backup instead of an incremental one.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit af7047717f83a7f75c5ae3b318dc4d44b81627fd
Author: Peter Krempa <pkrempa>
Date:   Tue Nov 3 12:33:55 2020 +0100

    man: virsh: Mention that '--size' for 'checkpoint-dumpxml' may require running vm
    
    Separate the docs for the '--size' flag into its own paragraph and
    mention that the domain may be required to be running.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit 1bb33357eaaa36e12b31c794c603d27f6ae03835
Author: Peter Krempa <pkrempa>
Date:   Tue Nov 3 12:30:53 2020 +0100

    checkpoint: Mention that VIR_DOMAIN_CHECKPOINT_XML_SIZE may require running vm
    
    The qemu implementation requires that the VM associated with the
    checkpoint is running when checking the size. Mention this possibility
    with the flag.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

Comment 7 yisun 2020-12-16 03:53:23 UTC
Hi Peter,
The patches introduce a new flag VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE and new error VIR_ERR_CHECKPOINT_INCONSISTENT to make sure the qcow2 image has consistent bitmap as the xml defination. And this works well during my test.
But I saw comment 0, reporter mentioned: '- Invalid request (e.g. incremental backup of raw file)' and now we still use a common error VIR_ERR_CONFIG_UNSUPPORTED in src/qemu/qemu_checkpoint.c. Should we provide a new err for this kind of error, or the VIR_ERR_CONFIG_UNSUPPORTED is enough? Pls help to confirm, thx

295 /* Called inside job lock */                                                    
296 static int                                                                      
297 qemuCheckpointPrepare(virQEMUDriverPtr driver,                                  
298 |   |   |   |   |   | virDomainObjPtr vm,                                       
299 |   |   |   |   |   | virDomainCheckpointDefPtr def)                            
300 {                                                              
...
332 |   |   if (vm->def->disks[i]->src->format != VIR_STORAGE_FILE_QCOW2) {         
333 |   |   |   virReportError(VIR_ERR_CONFIG_UNSUPPORTED,                          
334 |   |   |   |   |   |   |  _("checkpoint for disk %s unsupported "              
335 |   |   |   |   |   |   |   |"for storage type %s"),                            
336 |   |   |   |   |   |   |  disk->name,                                          
337 |   |   |   |   |   |   |  virStorageFileFormatTypeToString(                    
338 |   |   |   |   |   |   |   |  vm->def->disks[i]->src->format));                
339 |   |   |   return -1;                                                          
340 |   |   }

Comment 8 yisun 2020-12-25 05:51:53 UTC
Test the newly introduced --redefine-validate param for checkponit-create cmd

version: libvirt-6.6.0-11.module+el8.3.1+9196+74a80ca4.x86_64
result: pass

1. prepare a new qcow2 image vdb.qcow2
[root@dell-per740-01 ~]# qemu-img create -f qcow2 /var/lib/libvirt/images/vdb.qcow2 100M
Formatting '/var/lib/libvirt/images/vdb.qcow2', fmt=qcow2 cluster_size=65536 compression_type=zlib size=104857600 lazy_refcounts=off refcount_bits=16

2. use the vdb.qcow2 image in vm as vdb
[root@dell-per740-01 ~]# virsh start vm1
Domain vm1 started

[root@dell-per740-01 ~]# virsh domblklist vm1
 Target   Source
--------------------------------------------------------
 vda      /var/lib/libvirt/images/jeos-27-x86_64.qcow2
 vdb      /var/lib/libvirt/images/vdb.qcow2

3. create a checkpoint for vdb
[root@dell-per740-01 ~]# cat ck1.xml 
<domaincheckpoint>
  <name>ck1</name>
  <disks>
    <disk name='vda' checkpoint='no'/>
    <disk name='vdb' checkpoint='bitmap'/>
  </disks>
</domaincheckpoint>


[root@dell-per740-01 ~]# virsh checkpoint-create vm1 ck1.xml 
Domain checkpoint ck1 created from 'ck1.xml'

4. dumpxml ck1 to created_ck.xml for checkpoint redefinition
[root@dell-per740-01 ~]# virsh checkpoint-dumpxml vm1 ck1 --no-domain > created_ck.xml

5. destroy the vm and create the vdb.qcow2 again to clear its block dirty bitmap
[root@dell-per740-01 ~]# virsh destroy vm1
Domain vm1 destroyed


[root@dell-per740-01 ~]# qemu-img create -f qcow2 /var/lib/libvirt/images/vdb.qcow2 100M
Formatting '/var/lib/libvirt/images/vdb.qcow2', fmt=qcow2 cluster_size=65536 compression_type=zlib size=104857600 lazy_refcounts=off refcount_bits=16

[root@dell-per740-01 ~]# virsh start vm1
Domain vm1 started


6. remove the ck1's metadata
[root@dell-per740-01 ~]# virsh checkpoint-list vm1
 Name   Creation Time
-----------------------------------
 ck1    2020-12-25 00:43:16 -0500


[root@dell-per740-01 ~]# virsh checkpoint-delete vm1 ck1 --metadata
Domain checkpoint ck1 deleted

7. redefine the checkpoint ck1 with --redefine-validate
[root@dell-per740-01 ~]# virsh checkpoint-create vm1 created_ck.xml --redefine --redefine-validate
error: checkpoint inconsistent: missing or broken bitmap 'ck1' for disk 'vdb'
<=== expected error happens

Comment 9 Peter Krempa 2021-01-04 10:52:30 UTC
(In reply to yisun from comment #7)
> Hi Peter,
> The patches introduce a new flag VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE and
> new error VIR_ERR_CHECKPOINT_INCONSISTENT to make sure the qcow2 image has
> consistent bitmap as the xml defination. And this works well during my test.
> But I saw comment 0, reporter mentioned: '- Invalid request (e.g.
> incremental backup of raw file)' and now we still use a common error
> VIR_ERR_CONFIG_UNSUPPORTED in src/qemu/qemu_checkpoint.c. Should we provide
> a new err for this kind of error, or the VIR_ERR_CONFIG_UNSUPPORTED is
> enough? Pls help to confirm, thx
> 
> 295 /* Called inside job lock */                                            
> 
> 296 static int                                                              
> 
> 297 qemuCheckpointPrepare(virQEMUDriverPtr driver,                          
> 
> 298 |   |   |   |   |   | virDomainObjPtr vm,                               
> 
> 299 |   |   |   |   |   | virDomainCheckpointDefPtr def)                    
> 
> 300 {                                                              
> ...
> 332 |   |   if (vm->def->disks[i]->src->format != VIR_STORAGE_FILE_QCOW2) { 
> 
> 333 |   |   |   virReportError(VIR_ERR_CONFIG_UNSUPPORTED,                  
> 
> 334 |   |   |   |   |   |   |  _("checkpoint for disk %s unsupported "      

That case can happen only if the user provides wrong configuration of the backup job. In that case VIR_ERR_CONFIG_UNSUPPORTED is correct as that is an hard unrecoverable error caused by wrong parameters to the API. Specifically oVirt needs to know when the disk state/bitmaps are corrupted as that can happen in normal usage and there is a recovery path that can be taken, thus the only required new error code is for that scenario. Usage error should not happen for them, and any other error (such as failure to start the NBD server) needs analysis anyways.

Comment 11 errata-xmlrpc 2021-02-22 15:39:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0639


Note You need to log in before you can comment on or make changes to this bug.