Bug 1946918

Summary: Undefine guest with --snapshots-metadata failed after blockpull
Product: Red Hat Enterprise Linux 9
Component: libvirt
Version: 9.0
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: medium
Reporter: Meina Li <meili>
Assignee: Peter Krempa <pkrempa>
QA Contact: Meina Li <meili>
Docs Contact:
CC: jdenemar, pkrempa, virt-maint, xuzhang
Keywords: Automation, Triaged
Target Milestone: beta
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: libvirt-7.3.0-1.el9
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Cloned To: 1948619
Environment:
Last Closed: 2021-12-07 21:57:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version: 7.3.0
Embargoed:
Bug Depends On:
Bug Blocks: 1948619
Attachments:
  bt.log (flags: none)
  libvirtd.log (flags: none)
  coredump.log (flags: none)

Description Meina Li 2021-04-07 08:59:19 UTC
Created attachment 1769793 [details]
bt.log

Description of problem:
Undefine guest with --snapshots-metadata failed after blockpull

Version-Release number of selected component (if applicable):
libvirt-7.0.0-4.el9.x86_64
qemu-kvm-5.2.0-11.el9.x86_64
kernel-5.11.0-2.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare a running guest.
# virsh domstate avocado-vt-vm1 
running

2. Create snapshots for the guest.
# for i in {1..4}; do virsh snapshot-create-as avocado-vt-vm1 s$i --disk-only; done
Domain snapshot s1 created
Domain snapshot s2 created
Domain snapshot s3 created
Domain snapshot s4 created

3. Do blockpull for the virtual disk.
# virsh blockpull avocado-vt-vm1 vda --wait --verbose --async
Block Pull: [100 %]
Pull complete

4. Destroy the guest and undefine it with --snapshots-metadata.
# virsh destroy avocado-vt-vm1 
Domain 'avocado-vt-vm1' destroyed
# virsh undefine avocado-vt-vm1 --snapshots-metadata
error: Disconnected from qemu:///system due to end of file
error: Failed to undefine domain 'avocado-vt-vm1'
error: End of file while reading data: Input/output error

Actual results:
Undefining the guest fails.

Expected results:
The guest is undefined successfully.

Additional info:

Comment 1 Peter Krempa 2021-04-07 12:20:08 UTC
I didn't manage to reproduce the problem. Could you please also attach debug logs of libvirtd from when the issue reproduces?

Additionally, the attached backtrace is from an idle libvirtd, not from a crashed one, so it doesn't help at all. If libvirtd really crashed, the backtrace needs to be collected from the crashed process.
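
For reference, debug logging of the daemon can typically be enabled by setting something like the following in /etc/libvirt/libvirtd.conf and restarting libvirtd (the filter list below is only an example; tune it to the area being debugged):

# example debug-log settings; adjust the filters and output path as needed
log_filters="1:qemu 1:libvirt 3:object 3:json 3:event 3:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"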

Comment 2 Meina Li 2021-04-08 03:42:36 UTC
(In reply to Peter Krempa from comment #1)
> I didn't manage to reproduce the problem. Could you please also attach debug
> logs of libvirtd from when the issue reproduces?
> 
> Additionally, the attached backtrace is from an idle libvirtd, not from a
> crashed one, so it doesn't help at all. If libvirtd really crashed, the
> backtrace needs to be collected from the crashed process.

I found that the guest is actually undefined successfully even though the "Input/output error" is reported. And judging from the backtrace, libvirtd may not have crashed.
Please check the libvirtd log. 
Thank you.

Comment 3 Meina Li 2021-04-08 03:43:22 UTC
Created attachment 1770110 [details]
libvirtd.log

Comment 4 Peter Krempa 2021-04-09 07:11:16 UTC
Could you please try getting the coredump backtrace from the actual crash e.g. via "coredumpctl dump /usr/sbin/libvirtd"

Comment 5 Meina Li 2021-04-09 07:37:30 UTC
Created attachment 1770504 [details]
coredump.log

Comment 6 Peter Krempa 2021-04-09 07:43:04 UTC
Thank you! The important part is:
          
                Stack trace of thread 1099086:
                #0  0x00007fb5d40da7bd virObjectUnref (libvirt.so.0 + 0x10a7bd)
                #1  0x00007fb5d40f24c1 virStorageSourceBackingStoreClear (libvirt.so.0 + 0x1224c1)
                #2  0x00007fb5d40fb7d3 virStorageSourceClear (libvirt.so.0 + 0x12b7d3)
                #3  0x00007fb5d40da76e vir_object_finalize (libvirt.so.0 + 0x10a76e)
                #4  0x00007fb5d3e54e1c g_object_unref (libgobject-2.0.so.0 + 0x20e1c)
                #5  0x00007fb5d40da7d8 virObjectUnref (libvirt.so.0 + 0x10a7d8)
                #6  0x00007fb5d4123bc9 virDomainDiskDefFree (libvirt.so.0 + 0x153bc9)
                #7  0x00007fb5d4124ebc virDomainDefFree (libvirt.so.0 + 0x154ebc)
                #8  0x00007fb5d4125b31 virDomainObjDispose (libvirt.so.0 + 0x155b31)
                #9  0x00007fb5d40da76e vir_object_finalize (libvirt.so.0 + 0x10a76e)
                #10 0x00007fb5d3e54e1c g_object_unref (libgobject-2.0.so.0 + 0x20e1c)
                #11 0x00007fb5d40da7d8 virObjectUnref (libvirt.so.0 + 0x10a7d8)
                #12 0x00007fb5d4125e8d virDomainObjEndAPI (libvirt.so.0 + 0x155e8d)
                #13 0x00007fb5c8760f8a qemuDomainUndefineFlags (libvirt_driver_qemu.so + 0xcaf8a)
                #14 0x00007fb5d42a61e9 virDomainUndefineFlags (libvirt.so.0 + 0x2d61e9)
                #15 0x0000559b1807ed08 remoteDispatchDomainUndefineFlagsHelper.lto_priv.0 (libvirtd + 0x46d08)
                #16 0x00007fb5d41bdab6 virNetServerProgramDispatch (libvirt.so.0 + 0x1edab6)
                #17 0x00007fb5d41c3b28 virNetServerHandleJob (libvirt.so.0 + 0x1f3b28)
                #18 0x00007fb5d40fbcf2 virThreadPoolWorker (libvirt.so.0 + 0x12bcf2)
                #19 0x00007fb5d4102e29 virThreadHelper (libvirt.so.0 + 0x132e29)
                #20 0x00007fb5d336d299 start_thread (libpthread.so.0 + 0x9299)
                #21 0x00007fb5d3b49863 __clone (libc.so.6 + 0x100863)

Comment 7 Peter Krempa 2021-04-09 08:08:26 UTC
Did your VM config have any <backingStore> for disks defined explicitly prior to the startup? Ideally post the full VM xml prior to the startup of the VM.

Comment 8 Meina Li 2021-04-09 09:11:26 UTC
(In reply to Peter Krempa from comment #7)
> Did your VM config have any <backingStore> for disks defined explicitly
> prior to the startup? Ideally post the full VM xml prior to the startup of
> the VM.

Yes, the XML includes <backingStore/> before startup, but this scenario passes on RHEL-AV 8.4.
And the guest can be undefined without error after I removed the element.

    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/var/lib/avocado/data/avocado-vt/images/jeos-27-x86_64.qcow2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
    </disk>

Besides this, it also hits this error for an iSCSI network disk even though its XML doesn't include <backingStore/>:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source protocol='iscsi' name='iqn.2021-04.com.virttest:emulated-iscsi.target/0'>
        <host name='127.0.0.1' port='3260'/>
      </source>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
    </disk>

Similar stack trace for the iSCSI network disk:
Stack trace of thread 1103759:
#0  0x00007f1eb5b157bd virObjectUnref (libvirt.so.0 + 0x10a7bd)
#1  0x00007f1eb5b2d4c1 virStorageSourceBackingStoreClear (libvirt.so.0 + 0x1224c1)
#2  0x00007f1eb5b367d3 virStorageSourceClear (libvirt.so.0 + 0x12b7d3)
#3  0x00007f1eb5b1576e vir_object_finalize (libvirt.so.0 + 0x10a76e)
#4  0x00007f1eb588fe1c g_object_unref (libgobject-2.0.so.0 + 0x20e1c)
#5  0x00007f1eb5b157d8 virObjectUnref (libvirt.so.0 + 0x10a7d8)
#6  0x00007f1eb5b5ebc9 virDomainDiskDefFree (libvirt.so.0 + 0x153bc9)
#7  0x00007f1eb5b5febc virDomainDefFree (libvirt.so.0 + 0x154ebc)
#8  0x00007f1eb5b60b31 virDomainObjDispose (libvirt.so.0 + 0x155b31)
#9  0x00007f1eb5b1576e vir_object_finalize (libvirt.so.0 + 0x10a76e)
#10 0x00007f1eb588fe1c g_object_unref (libgobject-2.0.so.0 + 0x20e1c)
#11 0x00007f1eb5b157d8 virObjectUnref (libvirt.so.0 + 0x10a7d8)
#12 0x00007f1eb5b60e8d virDomainObjEndAPI (libvirt.so.0 + 0x155e8d)
#13 0x00007f1ea9980f8a qemuDomainUndefineFlags (libvirt_driver_qemu.so + 0xcaf8a)
#14 0x00007f1eb5ce11e9 virDomainUndefineFlags (libvirt.so.0 + 0x2d61e9)
#15 0x000055b06bdbcd08 remoteDispatchDomainUndefineFlagsHelper.lto_priv.0 (libvirtd + 0x46d08)
#16 0x00007f1eb5bf8ab6 virNetServerProgramDispatch (libvirt.so.0 + 0x1edab6)
#17 0x00007f1eb5bfeb28 virNetServerHandleJob (libvirt.so.0 + 0x1f3b28)
#18 0x00007f1eb5b36cf2 virThreadPoolWorker (libvirt.so.0 + 0x12bcf2)
#19 0x00007f1eb5b3de29 virThreadHelper (libvirt.so.0 + 0x132e29)
#20 0x00007f1eb4da8299 start_thread (libpthread.so.0 + 0x9299)
#21 0x00007f1eb5584863 __clone (libc.so.6 + 0x100863)

Comment 9 Peter Krempa 2021-04-09 10:28:53 UTC
Yes, that happens regardless of disk type. The handling differs depending on the presence of <backingStore/>.

Thanks for the info.

Comment 10 Peter Krempa 2021-04-12 15:50:50 UTC
Note that the bug may not always manifest itself in a crash, but it's easily observable when running libvirtd under valgrind.
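
For example, one way to run the daemon under valgrind (a rough sketch; the exact systemd unit names and valgrind options may vary between setups) is to stop the regular service and start libvirtd by hand:

# systemctl stop libvirtd.service libvirtd.socket libvirtd-ro.socket libvirtd-admin.socket
# valgrind --leak-check=full /usr/sbin/libvirtd   # daemon path may differ on other setups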

Already after the 'virsh blockpull' step an invalid read is reported:

==2118357== Invalid read of size 8
==2118357==    at 0x496E121: VIR_IS_OBJECT (virobject.h:44)
==2118357==    by 0x496E121: virObjectUnref (virobject.c:377)
==2118357==    by 0x4A33ABC: virStorageSourceBackingStoreClear (storage_source_conf.c:1105)
==2118357==    by 0x4A34515: virStorageSourceClear (storage_source_conf.c:1142)
==2118357==    by 0x496DAED: vir_object_finalize (virobject.c:326)
==2118357==    by 0x4E8BA6F: g_object_unref (in /usr/lib64/libgobject-2.0.so.0.6600.8)
==2118357==    by 0x496E147: virObjectUnref (virobject.c:380)
==2118357==    by 0x496E147: virObjectUnref (virobject.c:373)
==2118357==    by 0x49BFE64: virDomainDiskDefFree (domain_conf.c:2261)
==2118357==    by 0x49BFE64: virDomainDiskDefFree (domain_conf.c:2256)
==2118357==    by 0x49DAB1B: virDomainDefFree (domain_conf.c:3598)
==2118357==    by 0x49DAB1B: virDomainDefFree (domain_conf.c:3562)
==2118357==    by 0x49DB40C: virDomainObjDispose (domain_conf.c:3744)
==2118357==    by 0x496DAED: vir_object_finalize (virobject.c:326)
==2118357==    by 0x4E8BA6F: g_object_unref (in /usr/lib64/libgobject-2.0.so.0.6600.8)
==2118357==    by 0x496E147: virObjectUnref (virobject.c:380)
==2118357==    by 0x496E147: virObjectUnref (virobject.c:373)
==2118357==  Address 0x1b92a140 is 32 bytes inside a block of size 520 free'd
==2118357==    at 0x483A9F5: free (vg_replace_malloc.c:538)
==2118357==    by 0x4D9170C: g_free (in /usr/lib64/libglib-2.0.so.0.6600.8)
==2118357==    by 0x4DAB9FF: g_slice_free1 (in /usr/lib64/libglib-2.0.so.0.6600.8)
==2118357==    by 0x4EA0203: g_type_free_instance (in /usr/lib64/libgobject-2.0.so.0.6600.8)
==2118357==    by 0x496E147: virObjectUnref (virobject.c:380)
==2118357==    by 0x496E147: virObjectUnref (virobject.c:373)
==2118357==    by 0x16033BA5: qemuBlockJobProcessEventCompletedCommit (qemu_blockjob.c:1229)
==2118357==    by 0x16033BA5: qemuBlockJobEventProcessConcludedTransition (qemu_blockjob.c:1546)
==2118357==    by 0x16033BA5: qemuBlockJobEventProcessConcluded (qemu_blockjob.c:1647)
==2118357==    by 0x16033BA5: qemuBlockJobEventProcess (qemu_blockjob.c:1686)
==2118357==    by 0x16033BA5: qemuBlockJobUpdate (qemu_blockjob.c:1732)
==2118357==    by 0x160A8178: processJobStatusChangeEvent (qemu_driver.c:4195)
==2118357==    by 0x160A8178: qemuProcessEventHandler (qemu_driver.c:4355)
==2118357==    by 0x498A541: virThreadPoolWorker (virthreadpool.c:163)
==2118357==    by 0x4989BC4: virThreadHelper (virthread.c:233)
==2118357==    by 0x59C23F8: start_thread (in /usr/lib64/libpthread-2.32.so)
==2118357==    by 0x51E0B52: clone (in /usr/lib64/libc-2.32.so)
==2118357==  Block was alloc'd at
==2118357==    at 0x4839809: malloc (vg_replace_malloc.c:307)
==2118357==    by 0x4D94BB8: g_malloc (in /usr/lib64/libglib-2.0.so.0.6600.8)
==2118357==    by 0x4DAC481: g_slice_alloc (in /usr/lib64/libglib-2.0.so.0.6600.8)
==2118357==    by 0x4DACB0D: g_slice_alloc0 (in /usr/lib64/libglib-2.0.so.0.6600.8)
==2118357==    by 0x4EA4C19: g_type_create_instance (in /usr/lib64/libgobject-2.0.so.0.6600.8)
==2118357==    by 0x4E8D8C4: ??? (in /usr/lib64/libgobject-2.0.so.0.6600.8)
==2118357==    by 0x4E8E69C: g_object_new_with_properties (in /usr/lib64/libgobject-2.0.so.0.6600.8)
==2118357==    by 0x4E8F310: g_object_new (in /usr/lib64/libgobject-2.0.so.0.6600.8)
==2118357==    by 0x496E093: virObjectNew (virobject.c:255)
==2118357==    by 0x4A33B12: virStorageSourceNew (storage_source_conf.c:1191)
==2118357==    by 0x4A33CF2: virStorageSourceCopy (storage_source_conf.c:801)
==2118357==    by 0x1610AAB4: qemuSnapshotDiskPrepareOne (qemu_snapshot.c:1068)

(note that the wrong function is printed in the trace: qemuBlockJobProcessEventCompletedCommit, whereas the bug is in qemuBlockJobProcessEventCompletedPull)

Comment 11 Peter Krempa 2021-04-13 09:01:04 UTC
Fixed upstream:

commit b4d020790642fa4d7b8a6783b81d5d9d73cbe3d9
Author: Peter Krempa <pkrempa>
Date:   Mon Apr 12 17:42:23 2021 +0200

    qemuBlockJobProcessEventCompletedPull: Add backingStore terminators if base is NULL
    
    When doing a blockpull with NULL base the full contents of the disk are
    pulled into the topmost image which then becomes fully self-contained.
    
    qemuBlockJobProcessEventCompletedPull doesn't install the backing chain
    terminators though, although it's guaranteed that there will be no
    backing chain behind disk->src.
    
    Add the terminators for completeness and for disabling backing chain
    detection on further boots.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Pavel Hrdina <phrdina>

commit 46e748aa02cbd5923fa4b500352f528de35dc665
Author: Peter Krempa <pkrempa>
Date:   Mon Apr 12 17:24:22 2021 +0200

    qemuBlockJobProcessEventCompletedPull: Avoid dangling pointer after blockpull
    
    When doing a full block pull job (base == NULL) and the config XML
    contains a compatible disk, the completer function would leave a
    dangling pointer in 'cfgdisk->src->backingStore' as cfgdisk->src would
    be set to the value of 'cfgbase' which was always set to
    'cfgdisk->src->backingStore'.
    
    This is wrong though since for the live definition XML we set the
    respective counterpart to 'job->data.pull.base' which is NULL in the
    above scenario.
    
    This leads to an invalid pointer read when saving the config XML and may
    end up in a crash.
    
    Resolve it by setting 'cfgbase' only when 'job->data.pull.base' is
    non-NULL.
    
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1946918
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Pavel Hrdina <phrdina>
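
To illustrate the corrected handling (a minimal standalone sketch in plain C; this is not the actual libvirt code, and the struct and function names are made up to loosely mirror the commit message): the config backing chain is only rewired to a new base when a partial pull was done, while a full pull simply frees the whole chain and leaves it explicitly terminated.

#include <stdlib.h>

/* Illustrative types only; they do not match the real libvirt structures. */
struct source {
    char *path;
    struct source *backingStore;   /* next image in the backing chain */
};

struct disk {
    struct source *src;            /* topmost image of the chain */
};

static void source_free(struct source *s)
{
    if (!s)
        return;
    free(s->path);
    free(s);
}

/* Handle completion of a block pull for the persistent (config) definition.
 * 'base' is NULL for a full pull, loosely mirroring job->data.pull.base. */
static void pull_completed_config(struct disk *cfgdisk, struct source *base)
{
    struct source *s = cfgdisk->src->backingStore;

    if (base) {
        /* Partial pull: drop the images between the top and 'base', then
         * make 'base' the direct backing store of the topmost image. */
        while (s && s != base) {
            struct source *next = s->backingStore;
            source_free(s);
            s = next;
        }
        cfgdisk->src->backingStore = base;
    } else {
        /* Full pull: the topmost image now holds all of the data, so free
         * the whole backing chain and terminate it explicitly rather than
         * leaving a pointer to freed memory behind. */
        while (s) {
            struct source *next = s->backingStore;
            source_free(s);
            s = next;
        }
        cfgdisk->src->backingStore = NULL;
    }
}

In the inactive XML such a terminated chain shows up as an empty <backingStore/> element, the same element visible in the disk definitions quoted in this bug.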

Comment 15 Meina Li 2021-06-09 02:08:04 UTC
Verified Version:
libvirt-7.3.0-1.el9.x86_64
qemu-kvm-6.0.0-3.el9.x86_64

Verified Steps:
1. Prepare a running guest.
# virsh domstate lmn 
running
# virsh dumpxml lmn --inactive | grep /disk -B6
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/lmn.qcow2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </disk>
2. Create snapshots for the guest.
# for i in {1..4}; do virsh snapshot-create-as lmn s$i --disk-only; done
Domain snapshot s1 created
Domain snapshot s2 created
Domain snapshot s3 created
Domain snapshot s4 created
3. Do blockpull for the virtual disk.
# virsh blockpull lmn vda --wait --verbose --async
Block Pull: [100 %]
Pull complete
4. Destroy the guest and undefine it with --snapshots-metadata.
# virsh destroy lmn
Domain 'lmn' destroyed
# virsh undefine lmn --snapshots-metadata
Domain 'lmn' has been undefined