Description of problem: When virDomainBackupBegin fails, we always get a generic internal error. The documentation does not document any error for this API: https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBackupBegin When we get internal error, we don't have a good way to recover from the error. The only way to ensure that the next backup will succeed (if we don't have some bug in libvirt, or unexpected condition) is to delete all checkpoints, and start from full backup. This is not good error handling strategy. We can assume that the error is temporary and retry the operation, but this is also not attractive option. For example if backup failed because the request was invalid (e.g, trying to do incremental backup with raw disk) it can never succeed. If the backup failed because a bitmap is missing or inconsistent, we know that we must delete some checkpoints, and we can do backup only for checkpoints after the checkpoint with the bad bitmap. So what we need is public documented error codes like: https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainDiskErrorCode That allow detection of: - Invalid request (e.g. incremental backup of raw file) - Temporary failure - the request failed but it may succeed in the future - Unrecoverable failure - this backup cannot succeed and should not be retried In the unrecoverable failure category, having a special error code for missing bitmap can help. I don't know if this is possible since before backup we redefine all checkpoints, and we assume that libvirt validate that bitmap exists during checkpoint redefinition. Version-Release number of selected component (if applicable): 8.2 More info: We had one report so far on backup failing because a bitmap was missing after all checkpoints were redefined. We don't have enough details about this case.
Here is a reproduce steps: 1. vm has a vdb pointing to a raw img [root@dell-per740xd-11 ~]# virsh dumpxml vm1 | awk '/<disk/,/<\/disk/' ... <disk type='file' device='disk'> <driver name='qemu' type='raw'/> <source file='/var/lib/libvirt/images/vdb.raw' index='1'/> <backingStore/> <target dev='vdb' bus='virtio'/> <alias name='virtio-disk1'/> <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/> </disk> 2. start the vm [root@dell-per740xd-11 ~]# virsh start vm1 virsh baDomain vm1 started 3. prepare backup and checkpoint xml to do backup [root@dell-per740xd-11 ~]# cat backup_push_full.xml checkpoint_0.xml <domainbackup mode="push"> <disks> <disk name='vda' backup='no'/> <disk name='vdb' type='file'> <driver type='qcow2'/> <target file='/tmp/push_full_backup.img'> <encryption format='luks'> <secret type='passphrase' usage='/luks/img'/> </encryption> </target> </disk> </disks> </domainbackup> <domaincheckpoint> <disks> <disk checkpoint="no" name="vda" /> <disk checkpoint="bitmap" name="vdb" /> </disks> <name>cp0</name> <description>cp0</description> </domaincheckpoint> 4. start the backup [root@dell-per740xd-11 ~]# virsh backup-begin vm1 backup_push_full.xml checkpoint_0.xml error: unsupported configuration: checkpoint for disk vdb unsupported for storage type raw 5. get the error number, it's '1' [root@dell-per740xd-11 ~]# echo $? 1
Upstream: commit 2a917e6756578e43b21695aa284b39bdc05fa96f Author: Daniel P. Berrangé <berrange> Date: Mon Nov 9 16:14:53 2020 +0000 Fix name prefix of VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE The enum constant names should all have a prefix that matches the enum name. VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE was missing the "CREATE_" part of the name prefix. Reviewed-by: Peter Krempa <pkrempa> Signed-off-by: Daniel P. Berrangé <berrange> commit 5ab8cc78c4ab026338070409196fe14ee56a0d04 Author: Peter Krempa <pkrempa> Date: Wed Nov 4 13:37:35 2020 +0100 qemu: backup: Add partial validation of incremental backup checkpoint Verify that the checkpoint requested by an incremental backup exists. Unfortunately validating whether the checkpoint configuration actually matches the disk may not be reasonably feasible as the disk may have been renamed/snapshotted/etc. We still rely on bitmap presence. Signed-off-by: Peter Krempa <pkrempa> Reviewed-by: Ján Tomko <jtomko> commit a4d4d2bd5deb3efaa45a5a3386f681363a54a6e5 Author: Peter Krempa <pkrempa> Date: Wed Nov 4 10:16:02 2020 +0100 qemu: checkpoint: Implement VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE Validate that the bitmaps are present when redefining a checkpoint. Signed-off-by: Peter Krempa <pkrempa> Reviewed-by: Ján Tomko <jtomko> commit 93873c9bcc4cba575616d977eb00acf73a6f5403 Author: Peter Krempa <pkrempa> Date: Wed Nov 4 10:10:56 2020 +0100 conf: checkpoint: Split virDomainCheckpointRedefinePrep into two functions First one prepares and validates the definition, the second one actually either updates an existing checkpoint or assigns definition for the new one. This will allow driver code to add extra validation between those steps. Signed-off-by: Peter Krempa <pkrempa> Reviewed-by: Ján Tomko <jtomko> commit f37d306f6e02a469bd79293f8431df9ca3034744 Author: Peter Krempa <pkrempa> Date: Tue Nov 3 12:49:11 2020 +0100 virsh: checkpoint-create: Add support for VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE Signed-off-by: Peter Krempa <pkrempa> Reviewed-by: Ján Tomko <jtomko> commit 9b54eb84c8a25ed1b71d9d1cc0e65a630915da30 Author: Peter Krempa <pkrempa> Date: Tue Nov 3 12:28:46 2020 +0100 checkpoint: Introduce VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE flag Introduce a flag which will allow users to perform hypervisor-specific validation when redefining the checkpoint metadata. This will allow checking metadata which is stored e.g. in disk images when populating the libvirt metadata. Signed-off-by: Peter Krempa <pkrempa> Reviewed-by: Ján Tomko <jtomko> commit e33e89d8399042f55fa6999fdfaa40ee884a9bfe Author: Peter Krempa <pkrempa> Date: Tue Nov 3 12:00:21 2020 +0100 qemu: backup: Use VIR_ERR_CHECKPOINT_INCONSISTENT when starting a backup If we don't have a consistent chain of bitmaps for the backup to proceed we'd report VIR_ERR_INVALID_ARG error code, which makes it hard to decide whether an incremental backup makes even sense. Signed-off-by: Peter Krempa <pkrempa> Reviewed-by: Ján Tomko <jtomko> commit facfa8262ecab062229228b8d3662b391384ebe1 Author: Peter Krempa <pkrempa> Date: Tue Nov 3 11:57:58 2020 +0100 error: Introduce VIR_ERR_CHECKPOINT_INCONSISTENT error code This code will be used to signal cases when the checkpoint is broken either during backup or other operations where a user might want to make decision based on the presence of the checkpoint, such as do a full backup instead of an incremental one. Signed-off-by: Peter Krempa <pkrempa> Reviewed-by: Ján Tomko <jtomko> commit af7047717f83a7f75c5ae3b318dc4d44b81627fd Author: Peter Krempa <pkrempa> Date: Tue Nov 3 12:33:55 2020 +0100 man: virsh: Mention that '--size' for 'checkpoint-dumpxml' may require running vm Separate the docs for the '--size' flag into its own paragraph and mention that the domain may be required to be running. Signed-off-by: Peter Krempa <pkrempa> Reviewed-by: Ján Tomko <jtomko> commit 1bb33357eaaa36e12b31c794c603d27f6ae03835 Author: Peter Krempa <pkrempa> Date: Tue Nov 3 12:30:53 2020 +0100 checkpoint: Mention that VIR_DOMAIN_CHECKPOINT_XML_SIZE may require running vm The qemu implementation requires that the VM associated with the checkpoint is running when checking the size. Mention this possibility with the flag. Signed-off-by: Peter Krempa <pkrempa> Reviewed-by: Ján Tomko <jtomko>
Hi Peter, The patches introduce a new flag VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE and new error VIR_ERR_CHECKPOINT_INCONSISTENT to make sure the qcow2 image has consistent bitmap as the xml defination. And this works well during my test. But I saw comment 0, reporter mentioned: '- Invalid request (e.g. incremental backup of raw file)' and now we still use a common error VIR_ERR_CONFIG_UNSUPPORTED in src/qemu/qemu_checkpoint.c. Should we provide a new err for this kind of error, or the VIR_ERR_CONFIG_UNSUPPORTED is enough? Pls help to confirm, thx 295 /* Called inside job lock */ 296 static int 297 qemuCheckpointPrepare(virQEMUDriverPtr driver, 298 | | | | | | virDomainObjPtr vm, 299 | | | | | | virDomainCheckpointDefPtr def) 300 { ... 332 | | if (vm->def->disks[i]->src->format != VIR_STORAGE_FILE_QCOW2) { 333 | | | virReportError(VIR_ERR_CONFIG_UNSUPPORTED, 334 | | | | | | | _("checkpoint for disk %s unsupported " 335 | | | | | | | |"for storage type %s"), 336 | | | | | | | disk->name, 337 | | | | | | | virStorageFileFormatTypeToString( 338 | | | | | | | | vm->def->disks[i]->src->format)); 339 | | | return -1; 340 | | }
Test the newly introduced --redefine-validate param for checkponit-create cmd version: libvirt-6.6.0-11.module+el8.3.1+9196+74a80ca4.x86_64 result: pass 1. prepare a new qcow2 image vdb.qcow2 [root@dell-per740-01 ~]# qemu-img create -f qcow2 /var/lib/libvirt/images/vdb.qcow2 100M Formatting '/var/lib/libvirt/images/vdb.qcow2', fmt=qcow2 cluster_size=65536 compression_type=zlib size=104857600 lazy_refcounts=off refcount_bits=16 2. use the vdb.qcow2 image in vm as vdb [root@dell-per740-01 ~]# virsh start vm1 Domain vm1 started [root@dell-per740-01 ~]# virsh domblklist vm1 Target Source -------------------------------------------------------- vda /var/lib/libvirt/images/jeos-27-x86_64.qcow2 vdb /var/lib/libvirt/images/vdb.qcow2 3. create a checkpoint for vdb [root@dell-per740-01 ~]# cat ck1.xml <domaincheckpoint> <name>ck1</name> <disks> <disk name='vda' checkpoint='no'/> <disk name='vdb' checkpoint='bitmap'/> </disks> </domaincheckpoint> [root@dell-per740-01 ~]# virsh checkpoint-create vm1 ck1.xml Domain checkpoint ck1 created from 'ck1.xml' 4. dumpxml ck1 to created_ck.xml for checkpoint redefinition [root@dell-per740-01 ~]# virsh checkpoint-dumpxml vm1 ck1 --no-domain > created_ck.xml 5. destroy the vm and create the vdb.qcow2 again to clear its block dirty bitmap [root@dell-per740-01 ~]# virsh destroy vm1 Domain vm1 destroyed [root@dell-per740-01 ~]# qemu-img create -f qcow2 /var/lib/libvirt/images/vdb.qcow2 100M Formatting '/var/lib/libvirt/images/vdb.qcow2', fmt=qcow2 cluster_size=65536 compression_type=zlib size=104857600 lazy_refcounts=off refcount_bits=16 [root@dell-per740-01 ~]# virsh start vm1 Domain vm1 started 6. remove the ck1's metadata [root@dell-per740-01 ~]# virsh checkpoint-list vm1 Name Creation Time ----------------------------------- ck1 2020-12-25 00:43:16 -0500 [root@dell-per740-01 ~]# virsh checkpoint-delete vm1 ck1 --metadata Domain checkpoint ck1 deleted 7. redefine the checkpoint ck1 with --redefine-validate [root@dell-per740-01 ~]# virsh checkpoint-create vm1 created_ck.xml --redefine --redefine-validate error: checkpoint inconsistent: missing or broken bitmap 'ck1' for disk 'vdb' <=== expected error happens
(In reply to yisun from comment #7) > Hi Peter, > The patches introduce a new flag VIR_DOMAIN_CHECKPOINT_REDEFINE_VALIDATE and > new error VIR_ERR_CHECKPOINT_INCONSISTENT to make sure the qcow2 image has > consistent bitmap as the xml defination. And this works well during my test. > But I saw comment 0, reporter mentioned: '- Invalid request (e.g. > incremental backup of raw file)' and now we still use a common error > VIR_ERR_CONFIG_UNSUPPORTED in src/qemu/qemu_checkpoint.c. Should we provide > a new err for this kind of error, or the VIR_ERR_CONFIG_UNSUPPORTED is > enough? Pls help to confirm, thx > > 295 /* Called inside job lock */ > > 296 static int > > 297 qemuCheckpointPrepare(virQEMUDriverPtr driver, > > 298 | | | | | | virDomainObjPtr vm, > > 299 | | | | | | virDomainCheckpointDefPtr def) > > 300 { > ... > 332 | | if (vm->def->disks[i]->src->format != VIR_STORAGE_FILE_QCOW2) { > > 333 | | | virReportError(VIR_ERR_CONFIG_UNSUPPORTED, > > 334 | | | | | | | _("checkpoint for disk %s unsupported " That case can happen only if the user provides wrong configuration of the backup job. In that case VIR_ERR_CONFIG_UNSUPPORTED is correct as that is an hard unrecoverable error caused by wrong parameters to the API. Specifically oVirt needs to know when the disk state/bitmaps are corrupted as that can happen in normal usage and there is a recovery path that can be taken, thus the only required new error code is for that scenario. Usage error should not happen for them, and any other error (such as failure to start the NBD server) needs analysis anyways.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0639