Bug 1870488

Summary: [incremental_backup] After restart libvirtd, pull mode backup with tls enabled causing qemu crashed
Product: Red Hat Enterprise Linux Advanced Virtualization
Component: libvirt
Version: 8.3
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Target Milestone: rc
Target Release: 8.3
Reporter: yisun
Assignee: Peter Krempa <pkrempa>
QA Contact: yisun
CC: dyuan, jdenemar, lmen, meili, pkrempa, virt-maint, xuzhang
Fixed In Version: libvirt-6.6.0-6.el8
Last Closed: 2020-11-17 17:50:55 UTC
Type: Bug
Attachments:
gdb-qemu-kvm-vm1.txt (flags: none)
libvirtd-debug.log (flags: none)

Description yisun 2020-08-20 08:44:32 UTC
Description:
[incremental_backup] After restart libvirtd, pull mode backup with tls enabled causing qemu crashed

Versions:
qemu-kvm-5.0.0-2.module+el8.3.0+7379+0505d6ca.x86_64
libvirt-6.6.0-2.module+el8.3.0+7567+dc41c0a9.x86_64

How reproducible:
100%

Steps:
0. A CA-signed server-side cert is needed for the TLS-enabled backup. For details on preparing the CA/server/client keys and certs, refer to:
http://pastebin.test.redhat.com/895075

1. The VM has 2 disks; vdb is used to reproduce this issue
(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# virsh domblklist vm1
 Target   Source
--------------------------------------------------------
 vda      /var/lib/libvirt/images/jeos-27-x86_64.qcow2
 vdb      /var/lib/libvirt/images/vdb.qcow2


2. No special setting for backup TLS:
(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# cat /etc/libvirt/qemu.conf  | grep "#.*backup_tls.*="
# backup_tls_x509_cert_dir = "/etc/pki/libvirt-backup"
# backup_tls_x509_verify = 1 
# backup_tls_x509_secret_uuid = "00000000-0000-0000-0000-000000000000"


3. Prepare the backup XML that backs up vdb, with tls="yes" set on the server element
(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# cat backup_full.xml 
<domainbackup mode="pull">
	<server name="dell-per730-67.lab.eng.pek2.redhat.com" port="10809" tls="yes"/>
	<disks>
		<disk backup="no" name="vda" />
		<disk backup="yes" name="vdb" type="file">
			<scratch file="/tmp/scratch_file_0" />
		</disk>
	</disks>
</domainbackup>

4. clear libvirtd log
(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# echo "" > /var/log/libvirtd-debug.log

5. start the first backup, it's ok
(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# virsh backup-begin vm1 backup_full.xml  
Backup started

6. restart libvirtd daemon
(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# systemctl restart libvirtd

7. abort the backup job of step 5
(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# virsh domjobabort vm1

8. Start the backup job again
(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# virsh backup-begin vm1 backup_full.xml  
error: internal error: unexpected async job 7 type expected 0
<=== qemu process crashed

9. The gdb backtrace of `pidof /usr/libexec/qemu-kvm` can be found in the attachment named "gdb-qemu-kvm-vm1.txt"
The libvirtd log can be found in the attachment named "libvirtd-debug.log"
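
Steps 5 to 8 can be collected into one script. This is a minimal sketch, not part of the original report: the `run` and `reproduce` helper names and the `DOMAIN`, `BACKUP_XML` and `DRY_RUN` variables are illustrative assumptions, while the virsh and systemctl commands are the ones used in the steps above. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of steps 5-8. All variable and function names here are hypothetical.
DOMAIN=${DOMAIN:-vm1}
BACKUP_XML=${BACKUP_XML:-backup_full.xml}
DRY_RUN=${DRY_RUN:-1}   # 1: only print the commands; anything else: execute them

# Print a command and, unless DRY_RUN=1, execute it.
run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

reproduce() {
    run virsh backup-begin "$DOMAIN" "$BACKUP_XML"   # step 5: first backup
    run systemctl restart libvirtd                   # step 6: restart the daemon
    run virsh domjobabort "$DOMAIN"                  # step 7: abort the backup job
    run virsh backup-begin "$DOMAIN" "$BACKUP_XML"   # step 8: second backup
}

reproduce
```

Running it as root with `DRY_RUN=0` executes the steps; with the versions listed in this report, the last command is the one that failed and took the qemu process down.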


Additional info:
1. If TLS is not enabled, nothing goes wrong:
(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# cat backup.xml 
<domainbackup mode="pull">
	<server name="dell-per730-67.lab.eng.pek2.redhat.com" port="10809"/>
	<disks>
		<disk backup="no" name="vda" />
		<disk backup="yes" name="vdb" type="file">
			<scratch file="/tmp/scratch_file_0" />
		</disk>
	</disks>
</domainbackup>

(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# virsh backup-begin vm1 backup.xml; systemctl restart libvirtd; virsh domjobabort vm1; virsh backup-begin vm1 backup.xml
Backup started
Backup started


2. If libvirtd is not restarted, nothing goes wrong:
(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# cat backup_full.xml 
<domainbackup mode="pull">
	<server name="dell-per730-67.lab.eng.pek2.redhat.com" port="10809" tls="yes"/>
	<disks>
		<disk backup="no" name="vda" />
		<disk backup="yes" name="vdb" type="file">
			<scratch file="/tmp/scratch_file_0" />
		</disk>
	</disks>
</domainbackup>

(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# virsh backup-begin vm1 backup_full.xml; virsh domjobabort vm1; virsh backup-begin vm1 backup_full.xml; virsh domjobabort vm1; virsh backup-begin vm1 backup_full.xml
Backup started
Backup started
Backup started


3. After the qemu crash, even if we restart the VM and libvirtd, the full backup always provides wrong data
3.1 vdb has a disk size of 123 MiB
(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# qemu-img info /var/lib/libvirt/images/vdb.qcow2  -U
image: /var/lib/libvirt/images/vdb.qcow2
file format: qcow2
virtual size: 1 GiB (1073741824 bytes)
disk size: 123 MiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

3.2 start the backup job again after the qemu crash
(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# virsh backup-begin vm1 backup_full.xml  
Backup started

3.3 Dump the backup data from the NBD export to a local image
(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# qemu-img convert -O qcow2 --object tls-creds-x509,id=tls0,endpoint=client,dir=/etc/pki/libvirt-backup 'json:{"file":{"driver":"nbd", "server":{"host":"dell-per730-67.lab.eng.pek2.redhat.com", "port":10809, "type":"inet"}, "export":"vdb", "tls-creds":"tls0"}}' test.qcow2

3.4 test.qcow2 has a disk size of only 196 KiB, not 123 MiB
(.libvirt-ci-venv-ci-runtest-z1MFOW) [root@dell-per730-67 ~]# qemu-img info test.qcow2 
image: test.qcow2
file format: qcow2
virtual size: 1 GiB (1073741824 bytes)
disk size: 196 KiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
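
The disk-size difference above is only an indirect signal. A more direct check, offered here as a sketch and not as something the reporter ran, is `qemu-img compare`, which exits with 0 when both images hold the same content and 1 when they differ. The `compare_backup` helper and the `QEMU_IMG` variable are hypothetical names. `-U` is passed because the source image is in use by the running VM; note that a guest that keeps writing to vdb after the backup started will legitimately differ from the backup.

```shell
#!/bin/sh
# Hypothetical helper, not from the original report.
QEMU_IMG=${QEMU_IMG:-qemu-img}

# compare_backup SOURCE BACKUP
# Prints "identical", "different" or "error" and returns 0, 1 or 2.
compare_backup() {
    rc=0
    "$QEMU_IMG" compare -U "$1" "$2" >/dev/null 2>&1 || rc=$?
    case $rc in
        0) echo "identical" ;;
        1) echo "different" ;;
        *) echo "error"; return 2 ;;
    esac
    return "$rc"
}
```

For the images in this report the call would be `compare_backup /var/lib/libvirt/images/vdb.qcow2 test.qcow2`.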

Comment 1 yisun 2020-08-20 08:46:59 UTC
Created attachment 1711978
gdb-qemu-kvm-vm1.txt

Comment 2 yisun 2020-08-20 08:48:15 UTC
Created attachment 1711979
libvirtd-debug.log

Comment 3 yisun 2020-08-20 08:50:30 UTC
This is a qemu crash, but it involves a libvirtd restart, so I set the component to 'libvirt' for now. If debugging shows it is a qemu issue, please move it to the qemu team. Thanks.

Comment 4 Peter Krempa 2020-09-14 16:25:26 UTC
The qemu process abort()s because libvirt didn't delete the TLS_x509 and secret objects when aborting the backup job after the restart of libvirtd: their aliases had not been written out to the status XML, so the restarted daemon did not know about them.

Note that upstream qemu now reports an error rather than abort()-ing.
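
One way to see whether a given build is affected is to look at the domain status XML while the backup job is running and check whether the object aliases are recorded there. The sketch below is an illustration under assumptions, not libvirt's own tooling: it assumes the status XML lives at /run/libvirt/qemu/vm1.xml and that the aliases appear as alias attributes somewhere inside the <domainbackup> element. The `backup_aliases` helper name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every alias attribute value found between
# <domainbackup ...> and </domainbackup> in the given file.
# Assumption: at most one alias attribute per line (pretty-printed XML).
backup_aliases() {
    sed -n '/<domainbackup/,/<\/domainbackup>/p' "$1" |
        sed -n "s/.*alias=['\"]\([^'\"]*\)['\"].*/\1/p"
}
```

Calling `backup_aliases /run/libvirt/qemu/vm1.xml` during a TLS-enabled backup and getting no output would be consistent with the failure described in this comment.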

Comment 6 Peter Krempa 2020-09-15 14:54:21 UTC
Fixed upstream:

commit 1a5f35dbd2c4d83f7629579bcd8b23929a492b29
Author: Peter Krempa <pkrempa>
Date:   Mon Sep 14 17:59:07 2020 +0200

    qemu: backup: Write TLS cert and secret object aliases into status XML
    
    We've put the aliases into the backup job definition after the status
    XML was already written so they didn't appear in the on-disk state.
    
    Move the code putting them into the private definition earlier, so that
    the status XML update done by saving blockjobs already writes them out.
    
    Also add a note notifying that the block job status update writes the
    status XML.
    
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1870488
    Fixes: 423576679a5
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Michal Privoznik <mprivozn>
    Reviewed-by: Ján Tomko <jtomko>

commit 5058062b5daa6d841154eda7f6a53c39d64e765e
Author: Peter Krempa <pkrempa>
Date:   Mon Sep 14 17:58:09 2020 +0200

    qemu: backup: Remove note that TLS should be implemented
    
    Commit 423576679a5 implementing TLS forgot to remove the comment.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Michal Privoznik <mprivozn>
    Reviewed-by: Ján Tomko <jtomko>

commit 6c2d91118dc99426a79bf48c8d795e243c522dbd
Author: Peter Krempa <pkrempa>
Date:   Mon Sep 14 17:46:42 2020 +0200

    qemustatusxml2xml: backup-pull: Test private data formatting/parsing
    
    Modify the test case to enable TLS and add private data containing
    aliases of objects corresponding to a TLS setup.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Michal Privoznik <mprivozn>
    Reviewed-by: Ján Tomko <jtomko>

Comment 11 yisun 2020-09-17 10:11:26 UTC
Verified with: libvirt-6.6.0-6.module+el8.3.0+8125+aefcf088.x86_64
Result: PASS

[root@dell-per740xd-10 ~]# cat backup_full.xml 
<domainbackup mode="pull">
	<server name="dell-per740xd-10.lab.eng.pek2.redhat.com" port="10809" tls="yes"/>
	<disks>
		<disk backup="no" name="vda" />
		<disk backup="yes" name="vdb" type="file">
			<scratch file="/tmp/scratch_file_0" />
		</disk>
	</disks>
</domainbackup>

[root@dell-per740xd-10 ~]# virsh backup-begin vm1 backup_full.xml  
Backup started

[root@dell-per740xd-10 ~]# systemctl restart libvirtd

[root@dell-per740xd-10 ~]# virsh domjobinfo vm1
Job type:         Unbounded   
Operation:        Backup      
Time elapsed:     7529         ms
Temporary disk space use: 0.000 B
Temporary disk space total: 5.000 GiB

[root@dell-per740xd-10 ~]# virsh domjobabort vm1

[root@dell-per740xd-10 ~]# virsh list
 Id   Name   State
----------------------
...
 5    vm1    running

[root@dell-per740xd-10 ~]# virsh backup-begin vm1 backup_full.xml  
Backup started


[root@dell-per740xd-10 ~]# virsh domjobinfo vm1
Job type:         Unbounded   
Operation:        Backup      
Time elapsed:     16057        ms
Temporary disk space use: 0.000 B
Temporary disk space total: 5.000 GiB

[root@dell-per740xd-10 ~]# virsh domjobabort vm1

[root@dell-per740xd-10 ~]# virsh list
 Id   Name   State
----------------------
...
 5    vm1    running
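
The check used in this verification run (the domain is still running after each abort) can be scripted. A minimal sketch, assuming `virsh domstate` prints `running` for a live domain; the `domain_alive` helper name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if virsh reports the domain as running.
domain_alive() {
    state=$(virsh domstate "$1" 2>/dev/null) || return 1
    [ "$state" = "running" ]
}
```

For example, `virsh domjobabort vm1; domain_alive vm1 && virsh backup-begin vm1 backup_full.xml` starts the second backup only if qemu survived the abort.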

Comment 14 errata-xmlrpc 2020-11-17 17:50:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137