Bug 1963813 - Libvirt gets SIGSEGV after cancelling migration at lifecycle event and then shutdown+destroy the VM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: 9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Jiri Denemark
QA Contact: Fangge Jin
URL:
Whiteboard:
Depends On: 1949864
Blocks:
 
Reported: 2021-05-24 06:08 UTC by Han Han
Modified: 2022-05-10 08:33 UTC (History)
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1949864
Environment:
Last Closed: 2022-05-10 08:32:52 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
full backtrace (72.14 KB, text/plain)
2021-05-24 06:08 UTC, Han Han

Description Han Han 2021-05-24 06:08:02 UTC
Created attachment 1786271 [details]
full backtrace

Reproduced on:
libvirt-7.3.0-1.module+el8.5.0+11004+f4810536.x86_64
qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.x86_64

See the full backtrace of all threads in the attachment.
+++ This bug was initially created as a clone of Bug #1949864 +++

Description of problem:
As subject

Version-Release number of selected component (if applicable):
libvirt v7.2.0-184-g8674faaf32
qemu-6.0.0-0.1.rc2.fc35.x86_64
libvirt-python v7.2.0-12-g80ed190
python3-3.9.4-1.fc33.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare two hosts for migration, installing libvirt v7.2.0-184-g8674faaf32 and qemu-6.0.0-0.1.rc2.fc35.x86_64 on both.

2. Start a VM on the src host

3. Run the script libvirt-abortJob-lifecycle_event.py; it will try to abort the domain job whenever a lifecycle event happens

4. Migrate the VM
# virsh migrate hhan qemu+ssh://root@XXXX/system --live --p2p --copy-storage-all


5. After the console from step 3 shows messages like the following:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
hhan: event: VIR_DOMAIN_EVENT_SUSPENDED (3)
hhan: state: VIR_DOMAIN_PAUSED (3)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
hhan: event: VIR_DOMAIN_EVENT_RESUMED (4)
hhan: state: VIR_DOMAIN_RUNNING (1)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Then try to execute `virsh shutdown hhan`

6. At last execute `virsh destroy hhan`. libvirtd will get a SIGSEGV

Actual results:
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f1cf67fccd6 in qemuBlockJobEventProcessLegacy (asyncJob=1, job=0x7f1ca8365610, vm=0x7f1ca81c2820, driver=0x7f1ca812cab0) at ../src/qemu/qemu_blockjob.c:784
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.33-5.fc34.x86_64
784         VIR_DEBUG("disk=%s, mirrorState=%s, type=%d, state=%d, newstate=%d",

(gdb) bt
#0  0x00007f1cf67fccd6 in qemuBlockJobEventProcessLegacy (asyncJob=1, job=0x7f1ca8365610, vm=0x7f1ca81c2820, driver=0x7f1ca812cab0) at ../src/qemu/qemu_blockjob.c:784
#1  qemuBlockJobUpdate (vm=0x7f1ca81c2820, job=0x7f1ca8365610, asyncJob=1) at ../src/qemu/qemu_blockjob.c:1744
#2  0x00007f1cf68865f8 in qemuMigrationSrcNBDCopyCancel (driver=0x7f1ca812cab0, vm=0x7f1ca81c2820, check=true, asyncJob=QEMU_ASYNC_JOB_MIGRATION_OUT, dconn=0x7f1ca8012d70)
    at ../src/qemu/qemu_migration.c:794
#3  0x00007f1cf6892ba1 in qemuMigrationSrcRun
    (driver=0x7f1ca812cab0, vm=0x7f1ca81c2820, persist_xml=<optimized out>, cookiein=<optimized out>, cookieinlen=<optimized out>, cookieout=0x7f1cffd25520, cookieoutlen=0x7f1cffd254f4, flags=67, resource=0, spec=0x7f1cffd25350, dconn=0x7f1ca8012d70, graphicsuri=0x0, nmigrate_disks=0, migrate_disks=0x0, migParams=0x7f1cec004c00, nbdURI=0x0) at ../src/qemu/qemu_migration.c:4275
#4  0x00007f1cf68938a2 in qemuMigrationSrcPerformNative
    (driver=0x7f1ca812cab0, vm=0x7f1ca81c2820, persist_xml=0x0, uri=<optimized out>, cookiein=0x7f1cec036000 "<qemu-migration>\n  <name>hhan</name>\n  <uuid>3dab2b2e-7009-4140-9818-5c7158b448ba</uuid>\n  <hostname>hhan-patchwork</hostname>\n  <hostuuid>f7e23fcb-faef-4935-9af8-e08a2502c901</hostuuid>\n  <graphics t"..., cookieinlen=764, cookieout=0x7f1cffd25520, cookieoutlen=0x7f1cffd254f4, flags=67, resource=0, dconn=0x7f1ca8012d70, graphicsuri=0x0, nmigrate_disks=0, migrate_disks=0x0, migParams=0x7f1cec004c00, nbdURI=0x0) at ../src/qemu/qemu_migration.c:4462
#5  0x00007f1cf68955fd in qemuMigrationSrcPerformPeer2Peer3
    (flags=<optimized out>, useParams=true, bandwidth=<optimized out>, migParams=0x7f1cec004c00, nbdURI=0x0, nbdPort=0, migrate_disks=0x0, nmigrate_disks=<optimized out>, listenAddress=<optimized out>, graphicsuri=0x0, uri=<optimized out>, dname=0x0, persist_xml=0x0, xmlin=<optimized out>, vm=0x7f1ca81c2820, dconnuri=0x7f1cec004a20 "qemu+ssh://root.78.5/system", dconn=0x7f1ca8012d70, sconn=0x7f1ca8012270, driver=0x7f1ca812cab0) at ../src/qemu/qemu_migration.c:4879
#6  qemuMigrationSrcPerformPeer2Peer
    (v3proto=<synthetic pointer>, resource=<optimized out>, dname=0x0, flags=67, migParams=0x7f1cec004c00, nbdURI=0x0, nbdPort=0, migrate_disks=0x0, nmigrate_disks=<optimized out>, listenAddress=<optimized out>, graphicsuri=0x0, uri=<optimized out>, dconnuri=0x7f1cec004a20 "qemu+ssh://root.78.5/system", persist_xml=0x0, xmlin=<optimized out>, vm=0x7f1ca81c2820, sconn=0x7f1ca8012270, driver=0x7f1ca812cab0) at ../src/qemu/qemu_migration.c:5188
#7  qemuMigrationSrcPerformJob
    (driver=0x7f1ca812cab0, conn=0x7f1ca8012270, vm=0x7f1ca81c2820, xmlin=<optimized out>, persist_xml=0x0, dconnuri=0x7f1cec004a20 "qemu+ssh://root.78.5/system", uri=<optimized out>, graphicsuri=<optimized out>, listenAddress=<optimized out>, nmigrate_disks=<optimized out>, migrate_disks=<optimized out>, nbdPort=0, nbdURI=<optimized out>, migParams=<optimized out>, cookiein=<optimized out>, cookieinlen=0, cookieout=<optimized out>, cookieoutlen=<optimized out>, flags=<optimized out>, dname=<optimized out>, resource=<optimized out>, v3proto=<optimized out>) at ../src/qemu/qemu_migration.c:5263
#8  0x00007f1cf6895eb3 in qemuMigrationSrcPerform
    (driver=0x7f1ca812cab0, conn=0x7f1ca8012270, vm=0x7f1ca81c2820, xmlin=0x0, persist_xml=0x0, dconnuri=0x7f1cec004a20 "qemu+ssh://root.78.5/system", uri=0x0, graphicsuri=0x0, listenAddress=0x0, nmigrate_disks=0, migrate_disks=0x0, nbdPort=0, nbdURI=0x0, migParams=0x7f1cec004c00, cookiein=0x0, cookieinlen=0, cookieout=0x7f1cffd258a8, cookieoutlen=0x7f1cffd2589c, flags=67, dname=0x0, resource=0, v3proto=true) at ../src/qemu/qemu_migration.c:5467
#9  0x00007f1cf685bdd8 in qemuDomainMigratePerform3Params
    (dom=0x7f1cf8004330, dconnuri=0x7f1cec004a20 "qemu+ssh://root.78.5/system", params=<optimized out>, nparams=0, cookiein=0x0, cookieinlen=0, cookieout=0x7f1cffd258a8, cookieoutlen=0x7f1cffd2589c, flags=67) at ../src/qemu/qemu_driver.c:11878
#10 0x00007f1d02c00b75 in virDomainMigratePerform3Params
    (domain=domain@entry=0x7f1cf8004330, dconnuri=0x7f1cec004a20 "qemu+ssh://root.78.5/system", params=0x0, nparams=0, cookiein=0x0, cookieinlen=0, cookieout=0x7f1cffd258a8, cookieoutlen=0x7f1cffd2589c, flags=67) at ../src/libvirt-domain.c:5118
#11 0x0000559d16c2f987 in remoteDispatchDomainMigratePerform3Params (server=<optimized out>, msg=0x559d18263880, ret=0x7f1cec0045c0, args=0x7f1cec006c00, rerr=0x7f1cffd259a0, client=<optimized out>)
    at ../src/remote/remote_daemon_dispatch.c:5719
#12 remoteDispatchDomainMigratePerform3ParamsHelper (server=<optimized out>, client=<optimized out>, msg=0x559d18263880, rerr=0x7f1cffd259a0, args=0x7f1cec006c00, ret=0x7f1cec0045c0)
    at src/remote/remote_daemon_dispatch_stubs.h:8761
#13 0x00007f1d02aef33a in virNetServerProgramDispatchCall (msg=0x559d18263880, client=0x559d1825f290, server=0x559d181bf880, prog=0x559d18252810) at ../src/rpc/virnetserverprogram.c:428
#14 virNetServerProgramDispatch (prog=0x559d18252810, server=0x559d181bf880, client=0x559d1825f290, msg=0x559d18263880) at ../src/rpc/virnetserverprogram.c:302
#15 0x00007f1d02af5ab8 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x559d181bf880) at ../src/rpc/virnetserver.c:135
#16 virNetServerHandleJob (jobOpaque=0x559d1825e7c0, opaque=0x559d181bf880) at ../src/rpc/virnetserver.c:152
#17 0x00007f1d02a25c62 in virThreadPoolWorker (opaque=<optimized out>) at ../src/util/virthreadpool.c:159
#18 0x00007f1d02a2cf09 in virThreadHelper (data=<optimized out>) at ../src/util/virthread.c:233
#19 0x00007f1d01c33299 in start_thread () at /lib64/libpthread.so.0
#20 0x00007f1d024586a3 in clone () at /lib64/libc.so.6

Expected results:
No SIGSEGV

Additional info:
See the script, backtrace, libvirtd log, VM xml in the attachment

Comment 1 John Ferlan 2021-09-09 15:35:59 UTC
Bulk update: Move RHEL-AV bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release.

Comment 2 Martin Kletzander 2022-02-16 13:34:14 UTC
Is the libvirt-abortJob-lifecycle_event.py script available somewhere?  Or can you shed some light on which exact APIs it calls?

Comment 3 Peter Krempa 2022-02-16 15:08:36 UTC
The script, XML and such are attached to the bug this was cloned from.

Looking at the debug log, the VM was started in blockdev mode, but the backtrace points to the crash being in 'qemuBlockJobEventProcessLegacy', which is called in non-blockdev mode. The only way I can see that happening is if at that point (due to all of the shutdown shenanigans that happened) something (I presume qemuProcessStop) cleared priv->qemuCaps, in which case we'd end up in the legacy branch despite starting in blockdev mode.

Comment 4 Martin Kletzander 2022-02-21 14:37:28 UTC
So, the script calls libvirt APIs from the callback, which is not supported.  Even so, I tried reproducing this, and with current libvirt the crash does not happen.  Other things do, but they are expected when blocking all APIs from a callback.  After modifying the script to do the reporting and aborting in another thread, everything works as expected.  There is one case which I am not confident enough to dismiss as expected *yet*, but it is nevertheless unrelated to this bug report.
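The safe pattern described here, where the event callback only records the event and a separate worker thread makes any libvirt calls, can be sketched as follows. This is a stdlib-only simulation, not the actual reproducer script: the queue hand-off and the `abortJob`-style call it simulates are illustrative assumptions.

```python
import queue
import threading

events = queue.Queue()

def lifecycle_callback(dom_name, event, detail):
    # Called from the event-loop thread: do NOT call libvirt APIs here,
    # only hand the event off to the worker thread.
    events.put((dom_name, event, detail))

def worker(calls_made):
    # Worker thread: the safe place for blocking libvirt calls such as
    # dom.abortJob() (simulated here by recording the intended call).
    while True:
        item = events.get()
        if item is None:  # sentinel: shut down
            break
        dom_name, event, detail = item
        calls_made.append(f"abortJob({dom_name}) on event {event}")

calls = []
t = threading.Thread(target=worker, args=(calls,))
t.start()
lifecycle_callback("hhan", "VIR_DOMAIN_EVENT_SUSPENDED", 3)
events.put(None)
t.join()
print(calls)
```

With a real libvirt connection, the worker would hold its own domain reference and call its APIs there, keeping the registered callback itself non-blocking.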

Please try to reproduce with current libvirt so that we can see whether the earlier result was due to my slower machine or to a bug in the older version.  Please also bear in mind that unless we find something obviously wrong, this might still be closed purely because of the API calls from the callback itself.

Comment 5 Han Han 2022-03-15 08:23:02 UTC
(In reply to Martin Kletzander from comment #4)
> So, the script calls libvirt APIs from the callback, which is not supported.
Can we throw an "unsupported" error when libvirt APIs are called from the callback?
> Even with that I tried reproducing this and with current libvirt the crash
> does not happen.  Other things do, but they are expected when blocking all
> APIs from a callback.  After modifying the script to do the reporting and
> aborting in another thread all works as expected. There is one case which I
> am not confident enough to dismiss as expected *yet*, but nevertheless it is
> unrelated to this bug report).
> 
> Please try to reproduce with current libvirt so that we see whether that
> happened because I have a slower machine or a bug in older version.  Please
> also bear in mind that unless we find something obviously wrong this might
> still be closed purely because of the API calls from the callback itself.
Now I cannot reproduce it on libvirt-8.0.0-6.el9.x86_64 with qemu-kvm-6.2.0-11.el9.x86_64 either
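Regarding the question in comment 5 about rejecting API calls made from inside a callback, one way such detection could look is a thread-local reentrancy guard at the API entry points. This is purely illustrative and not libvirt's actual implementation; all names here are hypothetical.

```python
import threading

_state = threading.local()

class CallbackReentrancyError(RuntimeError):
    pass

def dispatch_event(callback, *args):
    # The event loop marks the current thread as "inside a callback"
    # before invoking user code, and clears the mark afterwards.
    _state.in_callback = True
    try:
        callback(*args)
    finally:
        _state.in_callback = False

def api_call(name):
    # Every API entry point could check the flag and refuse to run.
    if getattr(_state, "in_callback", False):
        raise CallbackReentrancyError(f"{name} called from event callback")
    return f"{name}: ok"

# From a normal thread the call succeeds...
print(api_call("abortJob"))

# ...but from inside a dispatched callback it is rejected.
errors = []
def bad_callback():
    try:
        api_call("abortJob")
    except CallbackReentrancyError as e:
        errors.append(str(e))

dispatch_event(bad_callback)
print(errors)
```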

Comment 6 Martin Kletzander 2022-03-17 09:57:57 UTC
(In reply to Han Han from comment #5)
I would expect so, thanks for checking that.  I would close this BZ in this case, although I'm not sure what the proper resolution should be.  Moving to MODIFIED seems weird to me.  What do you think Jirka?

Comment 7 Jiri Denemark 2022-03-17 14:21:35 UTC
Since it cannot be reproduced anymore, I guess CLOSED with CURRENTRELEASE or
WORKSFORME would be appropriate. Or if you can come up with a proper
reproducer, we could turn this into a TestOnly bug.

Comment 8 Jaroslav Suchanek 2022-05-10 08:32:52 UTC
All right, let's close it for now.

