Bug 982816 - libvirtd freezes if qemu crashes during virDomainRestore of compressed save image
Status: CLOSED UPSTREAM
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Assigned To: Libvirt Maintainers
Keywords: Reopened
Depends On:
Blocks:
Reported: 2013-07-09 20:37 EDT by Benjamin Gilbert
Modified: 2016-04-20 12:39 EDT
CC List: 2 users

Doc Type: Bug Fix
Last Closed: 2016-04-20 12:39:38 EDT
Type: Bug


Attachments: None
Description Benjamin Gilbert 2013-07-09 20:37:49 EDT
Description of problem:
libvirtd freezes if qemu crashes during a virDomainRestore of a compressed save image.  libvirtd can be made responsive again by manually killing the decompressor process.

Version-Release number of selected component (if applicable):
libvirt-1.0.5.2-1.fc19.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Restore a compressed save image (see the C sketch after these steps).
2. While the domain is restoring, killall -9 qemu-system-x86_64.
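
For reference, a minimal C sketch of step 1 against the public libvirt API; the connection URI and the save-image path are placeholders, not values taken from this report:

    /* restore-test.c: trigger virDomainRestore() on a compressed save image.
     * Build with: gcc restore-test.c -o restore-test -lvirt */
    #include <stdio.h>
    #include <libvirt/libvirt.h>

    int main(void)
    {
        virConnectPtr conn = virConnectOpen("qemu:///system");
        if (!conn) {
            fprintf(stderr, "failed to connect to libvirtd\n");
            return 1;
        }

        /* With this bug, the call below never returns if qemu is killed
         * while the compressed image is still being fed to it. */
        if (virDomainRestore(conn, "/var/lib/libvirt/qemu/save/test.save") < 0)
            fprintf(stderr, "restore failed\n");

        virConnectClose(conn);
        return 0;
    }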

Actual results:
virDomainRestore does not return.  Other API calls (e.g. virsh commands) block indefinitely.

Expected results:
virDomainRestore fails.

Additional info:
Fedora 18, with qemu 1.2.2 and libvirt 0.10.2.6, does not have this problem.  Fedora 19, with qemu 1.4.2, does.

While qemu was loading the memory image, I captured a backtrace of the pertinent libvirtd thread.  On Fedora 18, libvirtd blocks here:

Thread 11 (Thread 0x7ff083dbc700 (LWP 9606)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:165
#1  0x00000033cc270206 in virCondWait (c=c@entry=0x7ff07438a018, 
    m=m@entry=0x7ff074389ff0) at util/threads-pthread.c:117
#2  0x00007ff07b06723c in qemuMonitorSend (mon=mon@entry=0x7ff074389fe0, 
    msg=msg@entry=0x7ff083dbad90) at qemu/qemu_monitor.c:898
#3  0x00007ff07b07458e in qemuMonitorJSONCommandWithFd (
    mon=mon@entry=0x7ff074389fe0, cmd=cmd@entry=0x7ff07438a270, 
    scm_fd=scm_fd@entry=-1, reply=reply@entry=0x7ff083dbae18)
    at qemu/qemu_monitor_json.c:259
#4  0x00007ff07b0746dd in qemuMonitorJSONCommand (
    mon=mon@entry=0x7ff074389fe0, cmd=cmd@entry=0x7ff07438a270, 
    reply=reply@entry=0x7ff083dbae18) at qemu/qemu_monitor_json.c:288
#5  0x00007ff07b0758bc in qemuMonitorJSONSetCapabilities (
    mon=mon@entry=0x7ff074389fe0) at qemu/qemu_monitor_json.c:960
#6  0x00007ff07b067e18 in qemuMonitorSetCapabilities (mon=0x7ff074389fe0, 
    caps=0x7ff074383160) at qemu/qemu_monitor.c:1143
#7  0x00007ff07b055769 in qemuConnectMonitor (
    driver=driver@entry=0x7ff07400cd90, vm=vm@entry=0x7ff0743823f0)
    at qemu/qemu_process.c:1253
#8  0x00007ff07b058f30 in qemuProcessWaitForMonitor (
    driver=driver@entry=0x7ff07400cd90, vm=vm@entry=0x7ff0743823f0, 
    caps=0x7ff074383160, pos=pos@entry=1175) at qemu/qemu_process.c:1601
#9  0x00007ff07b05c4d9 in qemuProcessStart (conn=conn@entry=0x7ff074323b60, 
    driver=driver@entry=0x7ff07400cd90, vm=vm@entry=0x7ff0743823f0, 
    migrateFrom=migrateFrom@entry=0x7ff07b0a7447 "stdio", stdin_fd=21, 
    stdin_path=stdin_path@entry=0x7ff074145cf0 "/run/user/1000/vmnetfs-6XuTu4/memory/image", snapshot=snapshot@entry=0x0, 
    vmop=vmop@entry=VIR_NETDEV_VPORT_PROFILE_OP_RESTORE, flags=flags@entry=2)
    at qemu/qemu_process.c:3726
#10 0x00007ff07b09bda7 in qemuDomainSaveImageStartVM (
    conn=conn@entry=0x7ff074323b60, driver=driver@entry=0x7ff07400cd90, 
    vm=vm@entry=0x7ff0743823f0, fd=fd@entry=0x7ff083dbba2c, 
    header=header@entry=0x7ff083dbba40, 
    path=path@entry=0x7ff074145cf0 "/run/user/1000/vmnetfs-6XuTu4/memory/image", start_paused=start_paused@entry=false) at qemu/qemu_driver.c:4723
#11 0x00007ff07b09ccc2 in qemuDomainRestoreFlags (conn=0x7ff074323b60, 
    path=0x7ff074145cf0 "/run/user/1000/vmnetfs-6XuTu4/memory/image", 
    dxml=<optimized out>, flags=<optimized out>) at qemu/qemu_driver.c:4841
#12 0x00000033cc2f0a19 in virDomainRestoreFlags (conn=0x7ff074323b60, 
    from=0x7ff0743825d0 "/run/user/1000/vmnetfs-6XuTu4/memory/image", 
    dxml=0x7ff074382610 "<?xml version='1.0' encoding='UTF-8'?>\n<domain type=\"kvm\">\n  <name>vmnetx-12026-66ea47bd-5fa0-47d9-b286-4474f8b79a15</name>\n  <uuid>b9bed9d5-f174-42f1-83b9-0160e4c2ffc3</uuid>\n  <memory>1048576</memor"..., 
    flags=2) at libvirt.c:2824
#13 0x0000000000415cba in remoteDispatchDomainRestoreFlags (
    server=<optimized out>, msg=<optimized out>, args=0x7ff0743825b0, 
    rerr=0x7ff083dbbc70, client=0x1840780) at remote_dispatch.h:4470
#14 remoteDispatchDomainRestoreFlagsHelper (server=<optimized out>, 
    client=0x1840780, msg=<optimized out>, rerr=0x7ff083dbbc70, 
    args=0x7ff0743825b0, ret=<optimized out>) at remote_dispatch.h:4449
#15 0x00000033cc34f9b2 in virNetServerProgramDispatchCall (msg=0x183ee10, 
    client=0x1840780, server=0x1817a60, prog=0x183b500)
    at rpc/virnetserverprogram.c:431
#16 virNetServerProgramDispatch (prog=0x183b500, 
    server=server@entry=0x1817a60, client=0x1840780, msg=0x183ee10)
    at rpc/virnetserverprogram.c:304
#17 0x00000033cc34bad1 in virNetServerProcessMsg (msg=<optimized out>, 
    prog=<optimized out>, client=<optimized out>, srv=0x1817a60)
    at rpc/virnetserver.c:170
#18 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x1817a60)
    at rpc/virnetserver.c:191
#19 0x00000033cc2705ae in virThreadPoolWorker (opaque=opaque@entry=0x1808510)
    at util/threadpool.c:144
#20 0x00000033cc270039 in virThreadHelper (data=<optimized out>)
    at util/threads-pthread.c:161
#21 0x00000033a8407d15 in start_thread (arg=0x7ff083dbc700)
    at pthread_create.c:308
#22 0x00000033a7cf253d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:114

but on Fedora 19 it blocks here:

Thread 12 (Thread 0x7fa4ac67e700 (LWP 16316)):
#0  0x00007fa4b9501b59 in __libc_waitpid (pid=pid@entry=20637, 
    stat_loc=stat_loc@entry=0x7fa4ac67d894, options=options@entry=0)
    at ../sysdeps/unix/sysv/linux/waitpid.c:40
#1  0x00007fa4bc4da1e2 in virProcessWait (pid=pid@entry=20637, 
    exitstatus=exitstatus@entry=0x7fa4ac67d8fc) at util/virprocess.c:165
#2  0x00007fa4bc4b3cab in virCommandWait (cmd=cmd@entry=0x7fa49c010cd0, 
    exitstatus=exitstatus@entry=0x0) at util/vircommand.c:2326
#3  0x00007fa4a23e0973 in qemuDomainSaveImageStartVM (
    conn=conn@entry=0x7fa48c000a80, driver=driver@entry=0x7fa49c018d10, 
    vm=vm@entry=0x7fa49c0f0220, fd=fd@entry=0x7fa4ac67da2c, 
    header=header@entry=0x7fa4ac67da40, 
    path=path@entry=0x7fa49c188350 "/run/user/500/vmnetfs-vzsC7Z/memory/image", start_paused=start_paused@entry=false) at qemu/qemu_driver.c:4772
#4  0x00007fa4a23e1787 in qemuDomainRestoreFlags (conn=0x7fa48c000a80, 
    path=0x7fa49c188350 "/run/user/500/vmnetfs-vzsC7Z/memory/image", 
    dxml=<optimized out>, flags=<optimized out>) at qemu/qemu_driver.c:4876
#5  0x00007fa4bc558ff7 in virDomainRestoreFlags (conn=0x7fa48c000a80, 
    from=0x7fa49c004e00 "/run/user/500/vmnetfs-vzsC7Z/memory/image", 
    dxml=0x7fa49c184b20 "<?xml version='1.0' encoding='UTF-8'?>\n<domain type=\"kvm\">\n  <name>vmnetx-20613-a7615641-e1e7-4016-af5e-0bb06a96e413</name>\n  <uuid>b9bed9d5-f174-42f1-83b9-0160e4c2ffc3</uuid>\n  <memory>1048576</memor"..., 
    flags=2) at libvirt.c:2861
#6  0x00007fa4bcffebd6 in remoteDispatchDomainRestoreFlags (
    server=<optimized out>, msg=<optimized out>, args=0x7fa49c0ad1a0, 
    rerr=0x7fa4ac67dc90, client=0x7fa4be4deff0) at remote_dispatch.h:6488
#7  remoteDispatchDomainRestoreFlagsHelper (server=<optimized out>, 
    client=0x7fa4be4deff0, msg=<optimized out>, rerr=0x7fa4ac67dc90, 
    args=0x7fa49c0ad1a0, ret=<optimized out>) at remote_dispatch.h:6467
#8  0x00007fa4bc5ba6e7 in virNetServerProgramDispatchCall (msg=0x7fa4be4df850, 
    client=0x7fa4be4deff0, server=0x7fa4be4c6c60, prog=0x7fa4be4dc060)
    at rpc/virnetserverprogram.c:439
#9  virNetServerProgramDispatch (prog=0x7fa4be4dc060, 
    server=server@entry=0x7fa4be4c6c60, client=0x7fa4be4deff0, 
    msg=0x7fa4be4df850) at rpc/virnetserverprogram.c:305
#10 0x00007fa4bc5b58f8 in virNetServerProcessMsg (msg=<optimized out>, 
    prog=<optimized out>, client=<optimized out>, srv=0x7fa4be4c6c60)
    at rpc/virnetserver.c:162
#11 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x7fa4be4c6c60)
    at rpc/virnetserver.c:183
#12 0x00007fa4bc4e17a5 in virThreadPoolWorker (
    opaque=opaque@entry=0x7fa4be441320) at util/virthreadpool.c:144
#13 0x00007fa4bc4e1231 in virThreadHelper (data=<optimized out>)
    at util/virthreadpthread.c:161
#14 0x00007fa4b94fac53 in start_thread (arg=0x7fa4ac67e700)
    at pthread_create.c:308
#15 0x00007fa4b8e2113d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Starting in qemu 1.3.0, the qemu monitor remains responsive during incoming migration:

    http://wiki.qemu.org/ChangeLog/1.3#Live_Migration.2C_Save.2FRestore

So with older qemu, the qemuProcessStart() call in qemuDomainSaveImageStartVM() does not return until migration succeeds or qemu crashes, and the "intermediatefd != -1" cleanup code runs correctly.  With current qemu, qemuProcessStart() returns 0 before qemu has finished the incoming migration.  If qemu subsequently crashes, libvirtd never forcibly terminates the decompressor and gets stuck in virCommandWait() waiting for it to exit on its own.
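
For illustration, the wait pattern can be reproduced outside libvirt under one plausible assumption: the decompressor never exits on its own because it is blocked writing into a pipe whose read end the waiting process still holds open.  The standalone sketch below is not libvirt code and all names in it are made up:

    /* pipe-hang.c: illustration of the deadlock pattern described above.
     * The "decompressor" child blocks writing into a full pipe; the parent
     * blocks in waitpid() without ever aborting the child. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int pfd[2];
        if (pipe(pfd) < 0)
            return 1;

        pid_t child = fork();
        if (child == 0) {
            /* Child: ignore SIGPIPE so a vanished reader shows up as a
             * write() error, then write until the pipe buffer fills and
             * write() blocks for good. */
            signal(SIGPIPE, SIG_IGN);
            close(pfd[0]);
            char buf[4096];
            memset(buf, 0, sizeof(buf));
            while (write(pfd[1], buf, sizeof(buf)) >= 0)
                ;
            _exit(0);
        }

        /* Parent: the real consumer of the data (qemu, in the bug) is gone,
         * but the parent still holds the read end open, so the child never
         * unblocks.  Closing pfd[0] here, or kill(child, SIGKILL), would let
         * waitpid() return -- matching "killing the decompressor by hand". */
        printf("waiting for child %d (this blocks indefinitely as written)\n",
               (int)child);
        waitpid(child, NULL, 0);   /* the analogue of virCommandWait() */
        return 0;
    }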
Comment 1 Cole Robinson 2016-04-19 14:03:44 EDT
Sorry this never received a timely response. I just tried this with the latest libvirt several times and I couldn't reproduce the error, so I suspect it's fixed:

$ sudo virsh start f23
error: Failed to start domain f23
error: internal error: early end of file from monitor, possible problem:
Comment 2 Benjamin Gilbert 2016-04-19 18:35:11 EDT
It still happens for me with libvirt 1.2.18.2-3.fc23.x86_64.  Backtrace on Fedora 23:

Thread 14 (Thread 0x7fef91e99700 (LWP 21862)):
#0  0x00007fef9b1e25db in __waitpid (pid=pid@entry=22564, 
    stat_loc=stat_loc@entry=0x7fef91e987bc, options=options@entry=0)
    at ../sysdeps/unix/sysv/linux/waitpid.c:29
#1  0x00007fef9eacbdea in virProcessWait (pid=pid@entry=22564, 
    exitstatus=exitstatus@entry=0x7fef91e98814, raw=raw@entry=true)
    at util/virprocess.c:247
#2  0x00007fef9ea8e3dd in virCommandWait (cmd=cmd@entry=0x7fef700063e0, 
    exitstatus=exitstatus@entry=0x0) at util/vircommand.c:2523
#3  0x00007fef86e0e795 in qemuDomainSaveImageStartVM (
    conn=conn@entry=0x7fef780009a0, driver=driver@entry=0x7fef7c46a4e0, 
    vm=vm@entry=0x7fef7c32d560, fd=fd@entry=0x7fef91e98934, 
    header=header@entry=0x7fef91e98960, 
    path=path@entry=0x7fef7000be80 "/home/bgilbert/.config/libvirt/qemu/save/fedora23.save", start_paused=false) at qemu/qemu_driver.c:6697
#4  0x00007fef86e0ee96 in qemuDomainObjRestore (bypass_cache=false, 
    start_paused=false, 
    path=0x7fef7000be80 "/home/bgilbert/.config/libvirt/qemu/save/fedora23.save", vm=0x7fef7c32d560, driver=0x7fef7c46a4e0, conn=0x7fef780009a0)
    at qemu/qemu_driver.c:7050
#5  qemuDomainObjStart (conn=0x7fef780009a0, 
    driver=driver@entry=0x7fef7c46a4e0, vm=0x7fef7c32d560, flags=flags@entry=0)
    at qemu/qemu_driver.c:7377
#6  0x00007fef86e0f5b6 in qemuDomainCreateWithFlags (dom=0x7fef70002610, 
    flags=0) at qemu/qemu_driver.c:7452
#7  0x00007fef9eb86c2f in virDomainCreate (domain=domain@entry=0x7fef70002610)
    at libvirt-domain.c:6788
#8  0x0000561a14036f6b in remoteDispatchDomainCreate (server=0x561a147f6e50, 
    msg=0x561a1481f6b0, args=<optimized out>, rerr=0x7fef91e98be0, 
    client=0x561a1481f410) at remote_dispatch.h:3613
#9  remoteDispatchDomainCreateHelper (server=0x561a147f6e50, 
    client=0x561a1481f410, msg=0x561a1481f6b0, rerr=0x7fef91e98be0, 
    args=<optimized out>, ret=0x7fef7000be60) at remote_dispatch.h:3589
#10 0x00007fef9ebe8479 in virNetServerProgramDispatchCall (msg=0x561a1481f6b0, 
    client=0x561a1481f410, server=0x561a147f6e50, prog=0x561a1481b9d0)
    at rpc/virnetserverprogram.c:437
#11 virNetServerProgramDispatch (prog=0x561a1481b9d0, 
    server=server@entry=0x561a147f6e50, client=0x561a1481f410, 
    msg=0x561a1481f6b0) at rpc/virnetserverprogram.c:307
#12 0x00007fef9ebe3838 in virNetServerProcessMsg (msg=<optimized out>, 
    prog=<optimized out>, client=<optimized out>, srv=0x561a147f6e50)
    at rpc/virnetserver.c:136
#13 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x561a147f6e50)
    at rpc/virnetserver.c:157
#14 0x00007fef9eadcde6 in virThreadPoolWorker (
    opaque=opaque@entry=0x561a147f70a0) at util/virthreadpool.c:145
#15 0x00007fef9eadc368 in virThreadHelper (data=<optimized out>)
    at util/virthread.c:206
#16 0x00007fef9b1d960a in start_thread (arg=0x7fef91e99700)
    at pthread_create.c:334
#17 0x00007fef9af13a4d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

When you tested, did you use a *compressed* save image?  (That is, one created with save_image_format in qemu.conf set to something other than "raw".)  You can verify by checking that byte 0x1c of the .save file is non-zero.
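
For convenience, a small C sketch of that check; the 0x1c offset and the non-zero-means-compressed interpretation come from the comment above, and the path argument is whatever .save file you are testing with:

    /* savefmt.c: print the byte at offset 0x1c of a libvirt save image. */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s /path/to/domain.save\n", argv[0]);
            return 2;
        }
        FILE *f = fopen(argv[1], "rb");
        if (!f) {
            perror(argv[1]);
            return 2;
        }
        int c = EOF;
        if (fseek(f, 0x1c, SEEK_SET) == 0)
            c = fgetc(f);
        fclose(f);
        if (c == EOF) {
            fprintf(stderr, "could not read byte 0x1c\n");
            return 2;
        }
        printf("byte 0x1c = %d (%s)\n", c, c != 0 ? "compressed" : "raw");
        return 0;
    }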
Comment 3 Cole Robinson 2016-04-19 18:39:44 EDT
(In reply to Benjamin Gilbert from comment #2)
> 
> When you tested, did you use a *compressed* save image?  (That is, one
> created with save_image_format in qemu.conf set to something other than
> "raw".)  You can verify by checking that byte 0x1c of the .save file is
> non-zero.

Hmm, nope, I missed that part, sorry. I'll test again tomorrow.
Comment 4 Cole Robinson 2016-04-20 12:39:38 EDT
Okay, I reproduced with f23 libvirt. But it _is_ indeed fixed in libvirt.git. I bisected it to:

commit 2c4ba8b4f3d20c0cb14e8ee91f087d98a0406802
Author: Jiri Denemark <jdenemar@redhat.com>
Date:   Wed Nov 11 18:02:23 2015 +0100

    qemu: Use -incoming defer for migrations

That commit first appeared in libvirt 1.3.0. It essentially uses a newer qemu capability to avoid blocking while waiting for qemu to handle the incoming migration (which is what save/restore uses).
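
For context, a rough illustration of that capability; the exact arguments and QMP traffic are simplified here, and the fd number is a placeholder:

    # old scheme: the migration source is handed to qemu up front
    qemu-system-x86_64 ... -incoming fd:21

    # scheme used by the commit above: start with migration deferred,
    # then kick it off later over the QMP monitor
    qemu-system-x86_64 ... -incoming defer
    { "execute": "migrate-incoming", "arguments": { "uri": "fd:21" } }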

Unfortunately it doesn't really seem feasible to backport to f23, since it's essentially a feature and the patch sits in the middle of a bunch of other cleanups. So I'm just closing this report as UPSTREAM. F24 will have fixed packages, FWIW.
