Bug 1949869 - Migration hangs if VM is shut down during live migration
Summary: Migration hangs if VM is shut down during live migration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.4
Assignee: Jiri Denemark
QA Contact: Fangge Jin
URL:
Whiteboard:
Duplicates: 1967715 (view as bug list)
Depends On:
Blocks: 1983694
 
Reported: 2021-04-15 09:36 UTC by Fangge Jin
Modified: 2021-11-16 08:24 UTC
CC List: 6 users

Fixed In Version: libvirt-7.6.0-1.module+el8.5.0+12097+2c77910b
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1983694 (view as bug list)
Environment:
Last Closed: 2021-11-16 07:52:40 UTC
Type: Bug
Target Upstream Version: 7.6.0
Embargoed:
pm-rhel: mirror+


Attachments
libvirtd backtrace (16.76 KB, text/plain), 2021-04-15 09:36 UTC, Fangge Jin
logs from both src and dest hosts (181.38 KB, application/x-bzip), 2021-04-15 09:39 UTC, Fangge Jin


Links
Red Hat Product Errata RHBA-2021:4684, last updated 2021-11-16 07:53:23 UTC

Description Fangge Jin 2021-04-15 09:36:36 UTC
Created attachment 1772107 [details]
libvirtd backtrace

Description of problem:
Start a live migration and shut down the VM from inside the guest before the migration completes; the migration job on the source libvirtd then hangs.

Version-Release number of selected component (if applicable):
libvirt-7.0.0-13

How reproducible:
100%

Steps to Reproduce:
1. Start a VM and begin a live migration (an equivalent API-level call is sketched after these steps):
# virsh migrate vm1 qemu+ssh://***/system --live --verbose --p2p  

2. Shut down the VM from inside the guest:
[in vm] # shutdown -h now

3. Check the migration result; it hangs:
# virsh migrate vm1 qemu+ssh://***/system --live --verbose --p2p  
Migration: [ 80 %]^C^C^C^C^C
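
For reference, the same peer-to-peer live migration can be started through the libvirt C API. The sketch below is illustrative only (the domain name "vm1" matches the steps above, but the destination URI and file name are placeholders, not taken from this report); shutting the guest down while this call is in flight leaves the caller stuck in virDomainMigratePerform3Params, as the backtrace below shows.

    /* Illustrative only: equivalent of "virsh migrate vm1 <dest> --live --p2p"
     * using the libvirt C API. Destination URI is a placeholder.
     * Build with: gcc migrate-repro.c -o migrate-repro -lvirt
     */
    #include <stdio.h>
    #include <libvirt/libvirt.h>
    #include <libvirt/virterror.h>

    int main(void)
    {
        virConnectPtr conn = virConnectOpen("qemu:///system");
        if (!conn) {
            fprintf(stderr, "failed to open connection: %s\n",
                    virGetLastErrorMessage());
            return 1;
        }

        virDomainPtr dom = virDomainLookupByName(conn, "vm1");
        if (!dom) {
            fprintf(stderr, "failed to find domain: %s\n",
                    virGetLastErrorMessage());
            virConnectClose(conn);
            return 1;
        }

        /* With the affected libvirtd this call never returns if the guest is
         * shut down (or destroyed) before the migration completes. */
        if (virDomainMigrateToURI3(dom, "qemu+ssh://dest.example.com/system",
                                   NULL, 0,
                                   VIR_MIGRATE_LIVE | VIR_MIGRATE_PEER2PEER) < 0)
            fprintf(stderr, "migration failed: %s\n", virGetLastErrorMessage());

        virDomainFree(dom);
        virConnectClose(conn);
        return 0;
    }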


Actual results:
The migration job hangs on the source host and never finishes; the virsh migrate command has to be interrupted manually.

Expected results:
The migration fails promptly with an error such as "operation failed: domain is not running" (the behaviour seen with libvirt-6.6.0-13.1).

Additional info:
1. This cannot be reproduced with libvirt-6.6.0-13.1, which reports an error and returns:
Migration: [ 85 %]error: operation failed: domain is not running

2. Backtrace:
Thread 5 (Thread 0x7fba47b04700 (LWP 245983)):
#0  0x00007fba632982fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fba66f9922a in virCondWait (c=c@entry=0x7fb9ec3a60e0, m=m@entry=0x7fb9ec3a60b8) at ../src/util/virthread.c:148
#2  0x00007fba66fcf485 in virDomainObjWait (vm=vm@entry=0x7fb9ec3a60a0) at ../src/conf/domain_conf.c:3758
#3  0x00007fba1d8fc05d in qemuMigrationSrcWaitForCompletion (driver=driver@entry=0x7fb9ec11d7f0, vm=vm@entry=0x7fb9ec3a60a0, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_MIGRATION_OUT, dconn=dconn@entry=0x7fb9ec016490, flags=flags@entry=8) at ../src/qemu/qemu_migration.c:1878
#4  0x00007fba1d902844 in qemuMigrationSrcRun (driver=0x7fb9ec11d7f0, vm=0x7fb9ec3a60a0, persist_xml=<optimized out>, cookiein=<optimized out>, cookieinlen=<optimized out>, cookieout=0x7fba47b03558, cookieoutlen=0x7fba47b03528, flags=3, resource=0, spec=0x7fba47b03380, dconn=0x7fb9ec016490, graphicsuri=<optimized out>, nmigrate_disks=0, migrate_disks=0x0, migParams=<optimized out>, nbdURI=<optimized out>) at ../src/qemu/qemu_migration.c:4261
#5  0x00007fba1d9035d4 in qemuMigrationSrcPerformNative (driver=0x7fb9ec11d7f0, vm=0x7fb9ec3a60a0, persist_xml=0x0, uri=<optimized out>, cookiein=0x7fba38010a70 "<qemu-migration>\n  <name>vm1</name>\n  <uuid>2907ada3-fb7b-43e1-be05-004bc37f2df3</uuid>\n  <hostname>fjin-3-vgpu</hostname>\n  <hostuuid>df9986b0-c0f7-11e6-9c43-bc0000b40000</hostuuid>\n  <graphics type="..., cookieinlen=639, cookieout=0x7fba47b03558, cookieoutlen=0x7fba47b03528, flags=3, resource=0, dconn=0x7fb9ec016490, graphicsuri=0x0, nmigrate_disks=0, migrate_disks=0x0, migParams=0x7fba38061840, nbdURI=0x0) at ../src/qemu/qemu_migration.c:4471
#6  0x00007fba1d9051f3 in qemuMigrationSrcPerformPeer2Peer3 (flags=<optimized out>, useParams=true, bandwidth=<optimized out>, migParams=0x7fba38061840, nbdURI=0x0, nbdPort=0, migrate_disks=0x0, nmigrate_disks=<optimized out>, listenAddress=<optimized out>, graphicsuri=0x0, uri=<optimized out>, dname=0x0, persist_xml=0x0, xmlin=<optimized out>, vm=0x7fb9ec3a60a0, dconnuri=0x7fba38011b20 "qemu+ssh://fjin3-vgpu.usersys.redhat.com/system", dconn=0x7fb9ec016490, sconn=0x7fba3400a270, driver=0x7fb9ec11d7f0) at ../src/qemu/qemu_migration.c:4888
#7  qemuMigrationSrcPerformPeer2Peer (v3proto=<synthetic pointer>, resource=<optimized out>, dname=0x0, flags=3, migParams=0x7fba38061840, nbdURI=0x0, nbdPort=0, migrate_disks=0x0, nmigrate_disks=<optimized out>, listenAddress=<optimized out>, graphicsuri=0x0, uri=<optimized out>, dconnuri=0x7fba38011b20 "qemu+ssh://fjin3-vgpu.usersys.redhat.com/system", persist_xml=0x0, xmlin=<optimized out>, vm=0x7fb9ec3a60a0, sconn=0x7fba3400a270, driver=0x7fb9ec11d7f0) at ../src/qemu/qemu_migration.c:5197
#8  qemuMigrationSrcPerformJob (driver=0x7fb9ec11d7f0, conn=0x7fba3400a270, vm=0x7fb9ec3a60a0, xmlin=<optimized out>, persist_xml=0x0, dconnuri=0x7fba38011b20 "qemu+ssh://fjin3-vgpu.usersys.redhat.com/system", uri=<optimized out>, graphicsuri=<optimized out>, listenAddress=<optimized out>, nmigrate_disks=<optimized out>, migrate_disks=<optimized out>, nbdPort=0, nbdURI=<optimized out>, migParams=<optimized out>, cookiein=<optimized out>, cookieinlen=0, cookieout=<optimized out>, cookieoutlen=<optimized out>, flags=<optimized out>, dname=<optimized out>, resource=<optimized out>, v3proto=<optimized out>) at ../src/qemu/qemu_migration.c:5272
#9  0x00007fba1d90592f in qemuMigrationSrcPerform (driver=driver@entry=0x7fb9ec11d7f0, conn=0x7fba3400a270, vm=0x7fb9ec3a60a0, xmlin=0x0, persist_xml=0x0, dconnuri=dconnuri@entry=0x7fba38011b20 "qemu+ssh://fjin3-vgpu.usersys.redhat.com/system", uri=0x0, graphicsuri=0x0, listenAddress=0x0, nmigrate_disks=0, migrate_disks=0x0, nbdPort=0, nbdURI=0x0, migParams=0x7fba38061840, cookiein=0x0, cookieinlen=0, cookieout=0x7fba47b038f8, cookieoutlen=0x7fba47b038ec, flags=3, dname=0x0, resource=0, v3proto=true) at ../src/qemu/qemu_migration.c:5453
#10 0x00007fba1d8bdb12 in qemuDomainMigratePerform3Params (dom=0x7fba38009500, dconnuri=0x7fba38011b20 "qemu+ssh://fjin3-vgpu.usersys.redhat.com/system", params=<optimized out>, nparams=0, cookiein=0x0, cookieinlen=0, cookieout=0x7fba47b038f8, cookieoutlen=0x7fba47b038ec, flags=3) at ../src/qemu/qemu_driver.c:11840
#11 0x00007fba6715c645 in virDomainMigratePerform3Params (domain=domain@entry=0x7fba38009500, dconnuri=0x7fba38011b20 "qemu+ssh://fjin3-vgpu.usersys.redhat.com/system", params=0x0, nparams=0, cookiein=0x0, cookieinlen=0, cookieout=0x7fba47b038f8, cookieoutlen=0x7fba47b038ec, flags=3) at ../src/libvirt-domain.c:5120
#12 0x0000561583e35333 in remoteDispatchDomainMigratePerform3Params (server=<optimized out>, msg=0x5615852ac9c0, ret=0x7fba3805acd0, ret=0x7fba3805acd0, args=0x7fba38012030, rerr=0x7fba47b039f0, client=<optimized out>) at ../src/remote/remote_daemon_dispatch.c:5722
#13 remoteDispatchDomainMigratePerform3ParamsHelper (server=<optimized out>, client=<optimized out>, msg=0x5615852ac9c0, rerr=0x7fba47b039f0, args=0x7fba38012030, ret=0x7fba3805acd0) at src/remote/remote_daemon_dispatch_stubs.h:8734
#14 0x00007fba6705bee7 in virNetServerProgramDispatchCall (msg=0x5615852ac9c0, client=0x5615852401e0, server=0x5615851f6080, prog=0x561585257810) at ../src/rpc/virnetserverprogram.c:428
#15 virNetServerProgramDispatch (prog=0x561585257810, server=server@entry=0x5615851f6080, client=0x5615852401e0, msg=0x5615852ac9c0) at ../src/rpc/virnetserverprogram.c:302
#16 0x00007fba67061276 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x5615851f6080) at ../src/rpc/virnetserver.c:137
#17 virNetServerHandleJob (jobOpaque=0x56158520da20, opaque=0x5615851f6080) at ../src/rpc/virnetserver.c:154
#18 0x00007fba66f99d0f in virThreadPoolWorker (opaque=<optimized out>) at ../src/util/virthreadpool.c:163
#19 0x00007fba66f9937b in virThreadHelper (data=<optimized out>) at ../src/util/virthread.c:233
#20 0x00007fba6329214a in start_thread () from /lib64/libpthread.so.0
#21 0x00007fba65a40db3 in clone () from /lib64/libc.so.6

Comment 1 Fangge Jin 2021-04-15 09:39:18 UTC
Created attachment 1772108 [details]
logs from both src and dest hosts

Comment 2 Fangge Jin 2021-04-15 09:43:11 UTC
The issue can also be reproduced by destroying the domain with "virsh destroy" instead of shutting it down from inside the guest.

Comment 3 Fangge Jin 2021-04-15 09:52:54 UTC
libvirtd still responds to other virsh commands; only the migration job hangs.

Comment 5 Jiri Denemark 2021-07-16 15:09:06 UTC
A fix was sent upstream for review: https://listman.redhat.com/archives/libvir-list/2021-July/msg00451.html

Comment 6 Jiri Denemark 2021-07-16 15:11:57 UTC
*** Bug 1967715 has been marked as a duplicate of this bug. ***

Comment 7 Jiri Denemark 2021-07-19 13:51:49 UTC
This issue is fixed upstream by

commit 364995ed5708b71f2cab09c0416a66013f0a283f
Refs: v7.5.0-117-g364995ed57
Author:     Jiri Denemark <jdenemar>
AuthorDate: Fri Jul 16 15:52:50 2021 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Mon Jul 19 15:49:16 2021 +0200

    qemu: Signal domain condition in qemuProcessStop a bit later

    Signaling the condition before vm->def->id is reset to -1 is dangerous:
    in case a waiting thread wakes up, it does not see anything interesting
    (the domain is still marked as running) and just enters virDomainObjWait
    where it waits forever because the condition will never be signalled
    again.

    Originally it was impossible to get into such situation because the vm
    object was locked all the time between signaling the condition and
    resetting vm->def->id, but after commit 860a999802 released in 6.8.0,
    qemuDomainObjStopWorker called in qemuProcessStop between
    virDomainObjBroadcast and setting vm->def->id to -1 unlocks the vm
    object giving other threads a chance to wake up and possibly hang.

    In real world, this can be easily reproduced by killing, destroying, or
    just shutting down (from the guest OS) a domain while it is being
    migrated somewhere else. The migration job would never finish.

    So let's make sure we delay signaling the domain condition to the point
    when a woken up thread can detect the domain is not active anymore.

    https://bugzilla.redhat.com/show_bug.cgi?id=1949869

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Michal Privoznik <mprivozn>
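
To make the ordering problem concrete, here is a self-contained toy model of the race described above (plain pthreads, not libvirt code; all names are invented for illustration). The "stopper" broadcasts the condition, drops the lock (as qemuDomainObjStopWorker does), and only then marks the domain as stopped, so the woken "waiter" still sees a running domain and blocks forever; the program intentionally never exits, mirroring the hung migration job.

    /* Toy model of the ordering bug, not libvirt code.
     * Build with: gcc race-demo.c -o race-demo -lpthread
     */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Stand-in for virDomainObj: one lock, one condition, one "running" flag
     * (the flag models vm->def->id != -1). */
    struct dom {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        bool running;
    };

    /* Models the migration thread in virDomainObjWait: it can only stop
     * waiting once it observes running == false after a wakeup. */
    static void *waiter(void *opaque)
    {
        struct dom *d = opaque;

        pthread_mutex_lock(&d->lock);
        for (;;) {
            pthread_cond_wait(&d->cond, &d->lock);   /* virCondWait */
            if (!d->running) {
                printf("waiter: domain is not running, giving up\n");
                break;
            }
            printf("waiter: woken up, domain still looks running, waiting again\n");
        }
        pthread_mutex_unlock(&d->lock);
        return NULL;
    }

    /* Models the pre-fix qemuProcessStop ordering: broadcast first, drop the
     * lock (stop worker), and only then mark the domain as stopped. */
    static void *stopper_buggy(void *opaque)
    {
        struct dom *d = opaque;

        pthread_mutex_lock(&d->lock);
        pthread_cond_broadcast(&d->cond);   /* virDomainObjBroadcast, too early */
        pthread_mutex_unlock(&d->lock);     /* lock dropped in the stop worker  */
        sleep(1);                           /* waiter runs here, re-enters wait */
        pthread_mutex_lock(&d->lock);
        d->running = false;                 /* vm->def->id = -1, nobody is told */
        pthread_mutex_unlock(&d->lock);
        return NULL;
    }

    int main(void)
    {
        struct dom d = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, true };
        pthread_t w, s;

        pthread_create(&w, NULL, waiter, &d);
        sleep(1);                           /* let the waiter block first */
        pthread_create(&s, NULL, stopper_buggy, &d);
        pthread_join(s, NULL);
        pthread_join(w, NULL);              /* never returns: the hang from this bug */
        return 0;
    }

Swapping the flag reset and the broadcast in stopper_buggy(), i.e. signaling only after running has been set to false, which mirrors the upstream change of signaling the condition once the domain can be seen as inactive, lets the waiter observe the stopped state and the toy program terminates.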

Comment 10 Fangge Jin 2021-08-02 09:05:37 UTC
Pre-verified with libvirt-7.6.0-1.fc34.x86_64

Comment 13 Fangge Jin 2021-08-16 08:52:13 UTC
Verified with libvirt-7.6.0-1.module+el8.5.0+12097+2c77910b.x86_64

Comment 15 errata-xmlrpc 2021-11-16 07:52:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684

