Bug 1949869

Summary: Migration hangs if vm is shut down during live migration
Product: Red Hat Enterprise Linux Advanced Virtualization
Component: libvirt
Version: 8.4
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Fangge Jin <fjin>
Assignee: Jiri Denemark <jdenemar>
QA Contact: Fangge Jin <fjin>
CC: jdenemar, lmen, mzamazal, virt-maint, xuzhang, ymankad
Keywords: Regression, Triaged, VerifiedUpstream, ZStream
Target Milestone: rc
Target Release: 8.4
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libvirt-7.6.0-1.module+el8.5.0+12097+2c77910b
Target Upstream Version: 7.6.0
Cloned As: 1983694
Bug Blocks: 1983694
Type: Bug
Last Closed: 2021-11-16 07:52:40 UTC

Attachments:
- libvirtd backtrace
- logs from both src and dest hosts

Description Fangge Jin 2021-04-15 09:36:36 UTC
Created attachment 1772107 [details]
libvirtd backtrace

Description of problem:
Start a live migration and shut down the VM from inside the guest before the migration completes; the migration job on the source libvirtd then hangs.

Version-Release number of selected component (if applicable):
libvirt-7.0.0-13

How reproducible:
100%

Steps to Reproduce:
1. Start a VM and begin a live migration:
# virsh migrate vm1 qemu+ssh://***/system --live --verbose --p2p  

2. Shut down the VM from inside the guest:
[in vm] # shutdown -h now

3. Check the migration result; it hangs:
# virsh migrate vm1 qemu+ssh://***/system --live --verbose --p2p  
Migration: [ 80 %]^C^C^C^C^C


Actual results:
The migration job hangs (here at 80%) and never finishes; virsh migrate does not return, and pressing Ctrl-C has no effect.

Expected results:
The migration fails promptly with an error such as "operation failed: domain is not running", as libvirt-6.6.0 does (see Additional info).


Additional info:
1. Cannot reproduce with libvirt-6.6.0-13.1; there the migration reports an error and returns:
Migration: [ 85 %]error: operation failed: domain is not running

2. Backtrace:
Thread 5 (Thread 0x7fba47b04700 (LWP 245983)):
#0  0x00007fba632982fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fba66f9922a in virCondWait (c=c@entry=0x7fb9ec3a60e0, m=m@entry=0x7fb9ec3a60b8) at ../src/util/virthread.c:148
#2  0x00007fba66fcf485 in virDomainObjWait (vm=vm@entry=0x7fb9ec3a60a0) at ../src/conf/domain_conf.c:3758
#3  0x00007fba1d8fc05d in qemuMigrationSrcWaitForCompletion (driver=driver@entry=0x7fb9ec11d7f0, vm=vm@entry=0x7fb9ec3a60a0, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_MIGRATION_OUT, dconn=dconn@entry=0x7fb9ec016490, flags=flags@entry=8) at ../src/qemu/qemu_migration.c:1878
#4  0x00007fba1d902844 in qemuMigrationSrcRun (driver=0x7fb9ec11d7f0, vm=0x7fb9ec3a60a0, persist_xml=<optimized out>, cookiein=<optimized out>, cookieinlen=<optimized out>, cookieout=0x7fba47b03558, cookieoutlen=0x7fba47b03528, flags=3, resource=0, spec=0x7fba47b03380, dconn=0x7fb9ec016490, graphicsuri=<optimized out>, nmigrate_disks=0, migrate_disks=0x0, migParams=<optimized out>, nbdURI=<optimized out>) at ../src/qemu/qemu_migration.c:4261
#5  0x00007fba1d9035d4 in qemuMigrationSrcPerformNative (driver=0x7fb9ec11d7f0, vm=0x7fb9ec3a60a0, persist_xml=0x0, uri=<optimized out>, cookiein=0x7fba38010a70 "<qemu-migration>\n  <name>vm1</name>\n  <uuid>2907ada3-fb7b-43e1-be05-004bc37f2df3</uuid>\n  <hostname>fjin-3-vgpu</hostname>\n  <hostuuid>df9986b0-c0f7-11e6-9c43-bc0000b40000</hostuuid>\n  <graphics type="..., cookieinlen=639, cookieout=0x7fba47b03558, cookieoutlen=0x7fba47b03528, flags=3, resource=0, dconn=0x7fb9ec016490, graphicsuri=0x0, nmigrate_disks=0, migrate_disks=0x0, migParams=0x7fba38061840, nbdURI=0x0) at ../src/qemu/qemu_migration.c:4471
#6  0x00007fba1d9051f3 in qemuMigrationSrcPerformPeer2Peer3 (flags=<optimized out>, useParams=true, bandwidth=<optimized out>, migParams=0x7fba38061840, nbdURI=0x0, nbdPort=0, migrate_disks=0x0, nmigrate_disks=<optimized out>, listenAddress=<optimized out>, graphicsuri=0x0, uri=<optimized out>, dname=0x0, persist_xml=0x0, xmlin=<optimized out>, vm=0x7fb9ec3a60a0, dconnuri=0x7fba38011b20 "qemu+ssh://fjin3-vgpu.usersys.redhat.com/system", dconn=0x7fb9ec016490, sconn=0x7fba3400a270, driver=0x7fb9ec11d7f0) at ../src/qemu/qemu_migration.c:4888
#7  qemuMigrationSrcPerformPeer2Peer (v3proto=<synthetic pointer>, resource=<optimized out>, dname=0x0, flags=3, migParams=0x7fba38061840, nbdURI=0x0, nbdPort=0, migrate_disks=0x0, nmigrate_disks=<optimized out>, listenAddress=<optimized out>, graphicsuri=0x0, uri=<optimized out>, dconnuri=0x7fba38011b20 "qemu+ssh://fjin3-vgpu.usersys.redhat.com/system", persist_xml=0x0, xmlin=<optimized out>, vm=0x7fb9ec3a60a0, sconn=0x7fba3400a270, driver=0x7fb9ec11d7f0) at ../src/qemu/qemu_migration.c:5197
#8  qemuMigrationSrcPerformJob (driver=0x7fb9ec11d7f0, conn=0x7fba3400a270, vm=0x7fb9ec3a60a0, xmlin=<optimized out>, persist_xml=0x0, dconnuri=0x7fba38011b20 "qemu+ssh://fjin3-vgpu.usersys.redhat.com/system", uri=<optimized out>, graphicsuri=<optimized out>, listenAddress=<optimized out>, nmigrate_disks=<optimized out>, migrate_disks=<optimized out>, nbdPort=0, nbdURI=<optimized out>, migParams=<optimized out>, cookiein=<optimized out>, cookieinlen=0, cookieout=<optimized out>, cookieoutlen=<optimized out>, flags=<optimized out>, dname=<optimized out>, resource=<optimized out>, v3proto=<optimized out>) at ../src/qemu/qemu_migration.c:5272
#9  0x00007fba1d90592f in qemuMigrationSrcPerform (driver=driver@entry=0x7fb9ec11d7f0, conn=0x7fba3400a270, vm=0x7fb9ec3a60a0, xmlin=0x0, persist_xml=0x0, dconnuri=dconnuri@entry=0x7fba38011b20 "qemu+ssh://fjin3-vgpu.usersys.redhat.com/system", uri=0x0, graphicsuri=0x0, listenAddress=0x0, nmigrate_disks=0, migrate_disks=0x0, nbdPort=0, nbdURI=0x0, migParams=0x7fba38061840, cookiein=0x0, cookieinlen=0, cookieout=0x7fba47b038f8, cookieoutlen=0x7fba47b038ec, flags=3, dname=0x0, resource=0, v3proto=true) at ../src/qemu/qemu_migration.c:5453
#10 0x00007fba1d8bdb12 in qemuDomainMigratePerform3Params (dom=0x7fba38009500, dconnuri=0x7fba38011b20 "qemu+ssh://fjin3-vgpu.usersys.redhat.com/system", params=<optimized out>, nparams=0, cookiein=0x0, cookieinlen=0, cookieout=0x7fba47b038f8, cookieoutlen=0x7fba47b038ec, flags=3) at ../src/qemu/qemu_driver.c:11840
#11 0x00007fba6715c645 in virDomainMigratePerform3Params (domain=domain@entry=0x7fba38009500, dconnuri=0x7fba38011b20 "qemu+ssh://fjin3-vgpu.usersys.redhat.com/system", params=0x0, nparams=0, cookiein=0x0, cookieinlen=0, cookieout=0x7fba47b038f8, cookieoutlen=0x7fba47b038ec, flags=3) at ../src/libvirt-domain.c:5120
#12 0x0000561583e35333 in remoteDispatchDomainMigratePerform3Params (server=<optimized out>, msg=0x5615852ac9c0, ret=0x7fba3805acd0, ret=0x7fba3805acd0, args=0x7fba38012030, rerr=0x7fba47b039f0, client=<optimized out>) at ../src/remote/remote_daemon_dispatch.c:5722
#13 remoteDispatchDomainMigratePerform3ParamsHelper (server=<optimized out>, client=<optimized out>, msg=0x5615852ac9c0, rerr=0x7fba47b039f0, args=0x7fba38012030, ret=0x7fba3805acd0) at src/remote/remote_daemon_dispatch_stubs.h:8734
#14 0x00007fba6705bee7 in virNetServerProgramDispatchCall (msg=0x5615852ac9c0, client=0x5615852401e0, server=0x5615851f6080, prog=0x561585257810) at ../src/rpc/virnetserverprogram.c:428
#15 virNetServerProgramDispatch (prog=0x561585257810, server=server@entry=0x5615851f6080, client=0x5615852401e0, msg=0x5615852ac9c0) at ../src/rpc/virnetserverprogram.c:302
#16 0x00007fba67061276 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x5615851f6080) at ../src/rpc/virnetserver.c:137
#17 virNetServerHandleJob (jobOpaque=0x56158520da20, opaque=0x5615851f6080) at ../src/rpc/virnetserver.c:154
#18 0x00007fba66f99d0f in virThreadPoolWorker (opaque=<optimized out>) at ../src/util/virthreadpool.c:163
#19 0x00007fba66f9937b in virThreadHelper (data=<optimized out>) at ../src/util/virthread.c:233
#20 0x00007fba6329214a in start_thread () from /lib64/libpthread.so.0
#21 0x00007fba65a40db3 in clone () from /lib64/libc.so.6
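
To make the hang mechanics concrete, below is a minimal self-contained pthread model of the race (all names are illustrative stand-ins, not libvirt source): the "migration" thread plays the waiter in frames #0-#3 above, and the "stop" thread broadcasts the domain condition before clearing the domain id, which is the ordering that loses the wakeup.

/* Toy model of the lost-wakeup race. Compile with: cc -pthread race.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int dom_id = 42;              /* stands in for vm->def->id */
static bool migration_done = false;  /* never becomes true in this bug */

/* Plays qemuMigrationSrcWaitForCompletion + virDomainObjWait. */
static void *migration_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!migration_done) {
        pthread_cond_wait(&cond, &lock);   /* frames #0-#1 above */
        if (dom_id == -1) {                /* domain no longer active */
            fprintf(stderr, "error: operation failed: domain is not running\n");
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        /* Woken while the domain still looks active: wait again.
         * No further broadcast ever comes, so we sleep forever. */
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Plays qemuProcessStop with the pre-fix ordering. */
static void *stop_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    pthread_cond_broadcast(&cond);   /* virDomainObjBroadcast */
    pthread_mutex_unlock(&lock);     /* qemuDomainObjStopWorker unlocks
                                        the vm, letting the waiter run */
    usleep(1000);                    /* widen the race window */
    pthread_mutex_lock(&lock);
    dom_id = -1;                     /* too late: nobody is woken again */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t m, s;
    pthread_create(&m, NULL, migration_thread, NULL);
    sleep(1);                        /* let the waiter block first */
    pthread_create(&s, NULL, stop_thread, NULL);
    pthread_join(s, NULL);
    pthread_join(m, NULL);           /* hangs here, like virsh migrate */
    return 0;
}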

Comment 1 Fangge Jin 2021-04-15 09:39:18 UTC
Created attachment 1772108 [details]
logs from both src and dest hosts

Comment 2 Fangge Jin 2021-04-15 09:43:11 UTC
It can also be reproduced with "virsh destroy vm".

Comment 3 Fangge Jin 2021-04-15 09:52:54 UTC
libvirtd still responds to other virsh commands; only the migration process hangs.

Comment 5 Jiri Denemark 2021-07-16 15:09:06 UTC
A fix was sent upstream for review: https://listman.redhat.com/archives/libvir-list/2021-July/msg00451.html

Comment 6 Jiri Denemark 2021-07-16 15:11:57 UTC
*** Bug 1967715 has been marked as a duplicate of this bug. ***

Comment 7 Jiri Denemark 2021-07-19 13:51:49 UTC
This issue is fixed upstream by

commit 364995ed5708b71f2cab09c0416a66013f0a283f
Refs: v7.5.0-117-g364995ed57
Author:     Jiri Denemark <jdenemar>
AuthorDate: Fri Jul 16 15:52:50 2021 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Mon Jul 19 15:49:16 2021 +0200

    qemu: Signal domain condition in qemuProcessStop a bit later

    Signaling the condition before vm->def->id is reset to -1 is dangerous:
    in case a waiting thread wakes up, it does not see anything interesting
    (the domain is still marked as running) and just enters virDomainObjWait
    where it waits forever because the condition will never be signalled
    again.

    Originally it was impossible to get into such situation because the vm
    object was locked all the time between signaling the condition and
    resetting vm->def->id, but after commit 860a999802 released in 6.8.0,
    qemuDomainObjStopWorker called in qemuProcessStop between
    virDomainObjBroadcast and setting vm->def->id to -1 unlocks the vm
    object giving other threads a chance to wake up and possibly hang.

    In real world, this can be easily reproduced by killing, destroying, or
    just shutting down (from the guest OS) a domain while it is being
    migrated somewhere else. The migration job would never finish.

    So let's make sure we delay signaling the domain condition to the point
    when a woken up thread can detect the domain is not active anymore.

    https://bugzilla.redhat.com/show_bug.cgi?id=1949869

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Michal Privoznik <mprivozn>
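
In terms of the toy pthread model sketched in the description (names illustrative, not libvirt source), the fix corresponds to clearing the domain id before broadcasting, so a woken waiter always observes the domain as inactive and fails out instead of re-entering the wait:

/* Post-fix ordering in the toy model: reset the id, then broadcast. */
static void *stop_thread_fixed(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    dom_id = -1;                     /* vm->def->id = -1 first...         */
    pthread_cond_broadcast(&cond);   /* ...then virDomainObjBroadcast,    */
    pthread_mutex_unlock(&lock);     /* so waiters see an inactive domain */
    return NULL;
}

With this ordering the migration thread wakes, sees dom_id == -1, and returns the "domain is not running" error, matching the pre-6.8.0 behavior described in the Additional info above.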

Comment 10 Fangge Jin 2021-08-02 09:05:37 UTC
Pre-verified with libvirt-7.6.0-1.fc34.x86_64

Comment 13 Fangge Jin 2021-08-16 08:52:13 UTC
Verified with libvirt-7.6.0-1.module+el8.5.0+12097+2c77910b.x86_64

Comment 15 errata-xmlrpc 2021-11-16 07:52:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684