1812827 – [incremental_backup] RFE: backup job could provide more detailed info when backup failed or provide --wating --verbose as blockcopy does

Bug 1812827 - [incremental_backup] RFE: backup job could provide more detailed info when backup failed or provide --wating --verbose as blockcopy does

Summary: [incremental_backup] RFE: backup job could provide more detailed info when ba...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux Advanced Virtualization
Classification:	Red Hat
Component:	libvirt
Sub Component:
Version:	8.2
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	rc
Target Release:	8.0
Assignee:	Peter Krempa
QA Contact:	yisun
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1799015
TreeView+	depends on / blocked

Reported:	2020-03-12 09:59 UTC by yisun
Modified:	2020-11-17 17:48 UTC (History)
CC List:	6 users (show)
Fixed In Version:	libvirt-6.3.0-1.el8
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-11-17 17:47:36 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description yisun 2020-03-12 09:59:23 UTC

Description:
[incremental_backup] RFE: backup job could provide more detailed info when backup failed or provide --wating --verbose as blockcopy does

@Peter, pls help to confirm if this is worth implementing

Here we used a low space backup target to generate a backup failure. 
Since the backup in push mode is a async cmd, no error happens.
And with 'virsh event' or 'virsh domjobinfo --completed', there is no clear message about why backup failure.
But actually, qmp returns a clear error message.
Could we add something like 'blockcopy --wating --verbose' to backup cmd?
Or could we export qmp error message to user in 'virsh event' or 'domjobinfo --completed'? 

Versions:
libvirt-6.0.0-9.module+el8.2.0+5957+7ae8988e.x86_64

How reproducible:
100%

Steps:
1. Prepare a small block device, here I used 100M iscsi disk. And vm's vda taking around 1.6G data
[root@hp-dl320eg8-05 images]# lsblk | grep sde
sde                              8:64   0   100M  0 disk /mnt


2. Use a file as backup target
[root@hp-dl320eg8-05 images]# mount | grep mnt
/dev/sde on /mnt type ext4 (rw,relatime,seclabel,errors=continue,stripe=8192)

3. Prepare checkpoint and backup xml
[root@hp-dl320eg8-05 push]# cat no_space_backup_full.xml
<domainbackup mode='push'>
  <disks>
    <disk name='vda' type='file'>
      <target file='/mnt/vda.full.backup'/>
      <driver type='qcow2'/>
    </disk>
    <disk name='vdb' backup='no'/>
  </disks>
</domainbackup>

[root@hp-dl320eg8-05 push]# cat checkpoint_push_full.xml
<domaincheckpoint>
  <name>check_full</name>
</domaincheckpoint>

4. Do backup
[root@hp-dl320eg8-05 push]# virsh backup-begin vm1 no_space_backup_full.xml checkpoint_push_full.xml
Backup started
<==== This seems to be an async job, so virsh returned immediately 

[root@hp-dl320eg8-05 push]# virsh domjobinfo vm1 --completed
Job type:         Failed
Operation:        Backup

5. Monitor the vm's event in another terminal, it showed the backup failed
[root@hp-dl320eg8-05 push]# virsh event vm1 --all --loop
event 'job-completed' for domain vm1:
	operation: 9
	time_elapsed: 2074
	disk_total: 10737418240
	disk_processed: 104529920
	disk_remaining: 10632888320
	success: 0
event 'block-job' for domain vm1: Backup for /var/lib/libvirt/images/base_os.active failed
event 'block-job-2' for domain vm1: Backup for vda failed


6. Actually we received an clear error from qmp, but the error not exported to any user interface.
[root@hp-dl320eg8-05 push]# cat /var/log/libvirtd-debug.log | grep -i "no space"
2020-03-12 09:15:01.686+0000: 50756: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1584004501, "microseconds": 685667}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "backup-vda-libvirt-2-format", "len": 10737418240, "offset": 104529920, "speed": 0, "type": "backup", "error": "No space left on device"}}]
2020-03-12 09:15:01.686+0000: 50756: debug : virJSONValueFromString:1844 : string={"timestamp": {"seconds": 1584004501, "microseconds": 685667}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "backup-vda-libvirt-2-format", "len": 10737418240, "offset": 104529920, "speed": 0, "type": "backup", "error": "No space left on device"}}
2020-03-12 09:15:01.686+0000: 50756: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f808c048af0 event={"timestamp": {"seconds": 1584004501, "microseconds": 685667}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "backup-vda-libvirt-2-format", "len": 10737418240, "offset": 104529920, "speed": 0, "type": "backup", "error": "No space left on device"}}
2020-03-12 09:15:01.687+0000: 50756: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"return": [{"current-progress": 104529920, "status": "concluded", "total-progress": 10737418240, "type": "backup", "id": "backup-vda-libvirt-2-format", "error": "No space left on device"}], "id": "libvirt-434"}]
2020-03-12 09:15:01.687+0000: 50756: debug : virJSONValueFromString:1844 : string={"return": [{"current-progress": 104529920, "status": "concluded", "total-progress": 10737418240, "type": "backup", "id": "backup-vda-libvirt-2-format", "error": "No space left on device"}], "id": "libvirt-434"}
2020-03-12 09:15:01.687+0000: 50756: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f808c048af0 reply={"return": [{"current-progress": 104529920, "status": "concluded", "total-progress": 10737418240, "type": "backup", "id": "backup-vda-libvirt-2-format", "error": "No space left on device"}], "id": "libvirt-434"}

Comment 1 Peter Krempa 2020-04-24 07:05:50 UTC

Fixed upstream:

commit b37fdfb9d4ead0a01377837e02ecedee48337c97
Author: Peter Krempa <pkrempa>
Date:   Thu Apr 16 11:23:07 2020 +0200

    backup: Store error message for failed backups
    
    If a backup job fails midway it's hard to figure out what happened as
    it's running asynchronous. Use the VIR_DOMAIN_JOB_ERRMSG job statistics
    field to pass through the error from the first failed backup-blockjob
    so that both the consumer of the virDomainGetJobStats and the
    corresponding event can see the error.
    
    event 'job-completed' for domain backup-test:
            operation: 9
            time_elapsed: 46
            disk_total: 104857600
            disk_processed: 10158080
            disk_remaining: 94699520
            success: 0
            errmsg: No space left on device
    
    virsh domjobinfo backup-test --completed --anystats
    Job type:         Failed
    Operation:        Backup
    Time elapsed:     46           ms
    File processed:   9.688 MiB
    File remaining:   90.312 MiB
    File total:       100.000 MiB
    Error message:    No space left on device
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1812827
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Eric Blake <eblake>

commit 75a66f1076548c240e675976d690835aadb6e79c
Author: Peter Krempa <pkrempa>
Date:   Thu Apr 16 11:21:51 2020 +0200

    qemu: domain: Add 'errmsg' field to qemuDomainJobInfo
    
    The field can be used by jobs to add an optional error message to a
    completed (failed) job.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Eric Blake <eblake>

commit 4f39774610cac5e8de019b0be2aec2548b1acfd0
Author: Peter Krempa <pkrempa>
Date:   Wed Apr 15 12:27:53 2020 +0200

    API: Add VIR_DOMAIN_JOB_ERRMSG domain job statistics field
    
    In some cases it's useful to report the error which caused the domain
    job to fail. Add an optional field for holding the error message so that
    it can be later retrieved from statistics of a completed job.
    
    Add the field name macro and code for extracting it in virsh.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Eric Blake <eblake>

commit 72186f9c8c63983c55adb5194dda739ddc0f4043
Author: Peter Krempa <pkrempa>
Date:   Thu Mar 26 17:55:00 2020 +0100

    qemu: Add free and copy function for qemuDomainJobInfo and use it
    
    In order to add a string to qemuDomainJobInfo we must ensure that it's
    freed and copied properly. Add helpers to copy and free the structure
    and adjust the code to use them properly for the new semantics.
    
    Additionally also allocation is changed to g_new0 as it includes the
    type and thus it's very easy to grep for all the allocations of a given
    type.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Eric Blake <eblake>

commit e77a0941419554f203febe43ba8cc1c5ac05753e
Author: Peter Krempa <pkrempa>
Date:   Thu Apr 16 11:14:00 2020 +0200

    remote: remoteDispatchDomainGetJobStats: Encode typed parameter strings
    
    String typed parameter values were introduced in v0.9.7-30-g40624d32fb.
    virDomainGetJobStats was introduced in v1.0.2-239-g4dd00f4238 so all
    clients already support typed parameter stings at that time thus we can
    enable it unconditionally.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Eric Blake <eblake>

Comment 5 yisun 2020-08-11 04:18:12 UTC

Test with libvirt-6.6.0-2.module+el8.3.0+7567+dc41c0a9.x86_64
Result: PASS

1. prepare a 100M block device and mount it to /mnt/ (I used iscsi for convinence)
yum install targetcli -y
mkdir -p /var/tmp/avocado_glngTO/
dd if=/dev/zero of=/var/tmp/avocado_glngTO/logical-pool count=100 bs=1024K
setenforce permissive
targetcli /backstores/fileio/ create device.logical-pool /var/tmp/avocado_glngTO/logical-pool
targetcli /iscsi/ create iqn.2016-03.com.virttest:logical-pool.target
targetcli /iscsi/iqn.2016-03.com.virttest:logical-pool.target/tpg1/portals ls
targetcli /iscsi/iqn.2016-03.com.virttest:logical-pool.target/tpg1/luns/ create /backstores/fileio/device.logical-pool
setenforce enforcing
targetcli /iscsi/iqn.2016-03.com.virttest:logical-pool.target/tpg1/ set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1
targetcli / saveconfig
#systemctl restart iscsid.service
iscsiadm -m discovery -t sendtargets -p 127.0.0.1
iscsiadm --mode node --login --targetname iqn.2016-03.com.virttest:logical-pool.target --portal 127.0.0.1
iscsiadm --mode session
iscsiadm -m session -P 3


root@dell-per730-67 ~]# lsscsi
...
[11:0:0:0]   disk    LIO-ORG  device.logical-  4.0   /dev/sdb 

[root@dell-per730-67 ~]# lsblk
NAME                           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
...
sdb                              8:16   0   100M  0 disk 

[root@dell-per730-67 ~]# mkfs.ext4 /dev/sdb -F
mke2fs 1.45.6 (20-Mar-2020)
Creating filesystem with 102400 1k blocks and 25688 inodes
Filesystem UUID: 8afda60d-40b7-4a67-8e3b-fc5e7ebd14d0
Superblock backups stored on blocks: 
	8193, 24577, 40961, 57345, 73729

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done 

[root@dell-per730-67 ~]# mount /dev/sdb /mnt/


2. Using a vm with disk image having more than 100M data 
[root@dell-per730-67 ~]# virsh domblklist vm1
 Target   Source
--------------------------------------------------------
 vda      /var/lib/libvirt/images/jeos-27-x86_64.qcow2

[root@dell-per730-67 ~]# qemu-img info /var/lib/libvirt/images/jeos-27-x86_64.qcow2 -U
image: /var/lib/libvirt/images/jeos-27-x86_64.qcow2
file format: qcow2
virtual size: 10 GiB (10737418240 bytes)
disk size: 900 MiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

3. Prepare checkpoint and backup xml
[root@dell-per730-67 ~]# cat backup.xml 
<domainbackup mode='push'>
  <disks>
    <disk name='vda' type='file'>
      <target file='/mnt/vda.full.backup'/>
      <driver type='qcow2'/>
    </disk>
  </disks>
</domainbackup>

[root@dell-per730-67 ~]# cat checkpoint.xml 
<domaincheckpoint>
  <name>check_full</name>
</domaincheckpoint>

4. Start a event monitor on another terminal
[root@dell-per730-67 ~]# virsh event vm1 --all --loop

5. Do the backup
[root@dell-per730-67 ~]# virsh backup-begin vm1 backup.xml checkpoint.xml 
Backup started

6. Check the event monitor and domjobinfo output
[root@dell-per730-67 ~]# virsh event vm1 --all --loop
event 'job-completed' for domain vm1:
	operation: 9
	time_elapsed: 444
	disk_total: 10737418240
	disk_processed: 89325568
	disk_remaining: 10648092672
	success: 0
	errmsg: No space left on device
<=== Err msg is clear
event 'block-job' for domain vm1: Backup for /var/lib/libvirt/images/jeos-27-x86_64.qcow2 failed
event 'block-job-2' for domain vm1: Backup for vda failed


[root@dell-per730-67 ~]# virsh domjobinfo vm1 --completed --anystats
Job type:         Failed      
Operation:        Backup      
Time elapsed:     411          ms
File processed:   85.188 MiB
File remaining:   9.917 GiB
File total:       10.000 GiB
Error message:    No space left on device
<=== Err msg is clear

Comment 8 errata-xmlrpc 2020-11-17 17:47:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137

Note You need to log in before you can comment on or make changes to this bug.