Bug 727254 - unable to kill libvirtd in the middle of managedsave
Summary: unable to kill libvirtd in the middle of managedsave
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Libvirt Maintainers
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 1117142
Reported: 2011-08-01 16:46 UTC by Eric Blake
Modified: 2020-11-03 16:30 UTC

Fixed In Version:
Clone Of:
Cloned To: 1117142
Environment:
Last Closed: 2020-11-03 16:30:50 UTC



Description Eric Blake 2011-08-01 16:46:57 UTC
Description of problem:
Attempting to kill libvirtd (for example with SIGINT, or 'service libvirtd stop') while it is in the middle of a 'virsh managedsave dom' can leave a thread stuck waiting on the result of the managedsave with no I/O thread remaining to complete the transaction, causing libvirtd to hang instead of exiting cleanly.

Version-Release number of selected component (if applicable):
libvirt-0.9.4-0rc2.el6.x86_64

How reproducible:
I'm still trying to reproduce this with an existing build; so far, I have only seen it on a self-built libvirtd that was carrying fixes for other issues.

Steps to Reproduce:
1. virsh managedsave dom # takes several seconds
2. while that is still running, kill libvirtd
Actual results:
libvirtd hung

Expected results:
libvirtd should exit cleanly, and on restart it should notice that a managed save had been in progress and properly clean up after that action.

Additional info:

This was noticed while patching the issue in bug 727249, but is thought to be an independent issue, and since stopping libvirtd is less common than a completed migration, it should not hold up the upstream release of libvirt 0.9.4.  Still, if a fix can be found in reasonable time, it would be worth backporting into RHEL 6.2.

Comment 1 Eric Blake 2011-08-01 18:00:14 UTC
Definitely an independent bug - I was able to reproduce it using libvirt 0.8.8-4.fc14.x86_64 from virt-preview on Fedora 14.

Here's the backtrace that I got when using libvirt.git at commit 193cd0f3:

Thread 2 (Thread 0x7fb708b65700 (LWP 19323)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007fb710dc0015 in virCondWait (c=0x7fb70015b328, m=0x7fb70015b300)
    at util/threads-pthread.c:117
#2  0x00000000004ba7c5 in qemuMonitorSend (mon=0x7fb70015b300, 
    msg=0x7fb708b64340) at qemu/qemu_monitor.c:802
#3  0x00000000004c84f2 in qemuMonitorJSONCommandWithFd (mon=0x7fb70015b300, 
    cmd=0x7fb6f0000a80, scm_fd=-1, reply=0x7fb708b64440)
    at qemu/qemu_monitor_json.c:225
#4  0x00000000004c8629 in qemuMonitorJSONCommand (mon=0x7fb70015b300, 
    cmd=0x7fb6f0000a80, reply=0x7fb708b64440) at qemu/qemu_monitor_json.c:254
#5  0x00000000004ccee0 in qemuMonitorJSONGetMigrationStatus (
    mon=0x7fb70015b300, status=0x7fb708b64540, transferred=0x7fb708b64530, 
    remaining=0x7fb708b64528, total=0x7fb708b64520)
    at qemu/qemu_monitor_json.c:1920
#6  0x00000000004bcef7 in qemuMonitorGetMigrationStatus (mon=0x7fb70015b300, 
    status=0x7fb708b64540, transferred=0x7fb708b64530, 
    remaining=0x7fb708b64528, total=0x7fb708b64520) at qemu/qemu_monitor.c:1532
#7  0x00000000004b2d06 in qemuMigrationUpdateJobStatus (driver=0x7fb70008ef40, 
    vm=0x7fb700091f20, job=0x5435fe "domain save job", 
    asyncJob=QEMU_ASYNC_JOB_SAVE) at qemu/qemu_migration.c:764
#8  0x00000000004b3075 in qemuMigrationWaitForCompletion (
    driver=0x7fb70008ef40, vm=0x7fb700091f20, asyncJob=QEMU_ASYNC_JOB_SAVE)
    at qemu/qemu_migration.c:845
#9  0x00000000004b854d in qemuMigrationToFile (driver=0x7fb70008ef40, 
    vm=0x7fb700091f20, fd=20, offset=4096, 
    path=0x7fb6f0000a40 "/var/lib/libvirt/qemu/save/fedora_12.save", 
    compressor=0x0, is_reg=true, bypassSecurityDriver=true, 
    asyncJob=QEMU_ASYNC_JOB_SAVE) at qemu/qemu_migration.c:2777
#10 0x000000000046ab4c in qemuDomainSaveInternal (driver=0x7fb70008ef40, 
    dom=0x7fb6f00008c0, vm=0x7fb700091f20, 
    path=0x7fb6f0000a40 "/var/lib/libvirt/qemu/save/fedora_12.save", 
    compressed=0, bypass_cache=false, xmlin=0x0) at qemu/qemu_driver.c:2407
#11 0x000000000046b4d2 in qemuDomainManagedSave (dom=0x7fb6f00008c0, flags=0)
    at qemu/qemu_driver.c:2589
#12 0x00007fb710e5935c in virDomainManagedSave (dom=0x7fb6f00008c0, flags=0)
    at libvirt.c:15319
#13 0x000000000042875f in remoteDispatchDomainManagedSave (server=0x195b280, 
    client=0x1966a30, hdr=0x19a6ce8, rerr=0x7fb708b64b30, args=0x7fb6f0000900)
    at remote_dispatch.h:2573
#14 0x0000000000428658 in remoteDispatchDomainManagedSaveHelper (
    server=0x195b280, client=0x1966a30, hdr=0x19a6ce8, rerr=0x7fb708b64b30, 
    args=0x7fb6f0000900, ret=0x7fb6f0000930) at remote_dispatch.h:2551
#15 0x0000000000453efe in virNetServerProgramDispatchCall (prog=0x195b250, 
    server=0x195b280, client=0x1966a30, msg=0x1966cd0)
    at rpc/virnetserverprogram.c:375
#16 0x0000000000453a00 in virNetServerProgramDispatch (prog=0x195b250, 
    server=0x195b280, client=0x1966a30, msg=0x1966cd0)
    at rpc/virnetserverprogram.c:252
#17 0x0000000000456b21 in virNetServerHandleJob (jobOpaque=0x1960050, 
    opaque=0x195b280) at rpc/virnetserver.c:155
#18 0x00007fb710dc06a4 in virThreadPoolWorker (opaque=0x195b370)
    at util/threadpool.c:98
#19 0x00007fb710dc01d7 in virThreadHelper (data=0x1965120)
    at util/threads-pthread.c:157
#20 0x0000003d16606ccb in start_thread (arg=0x7fb708b65700)
    at pthread_create.c:301
#21 0x0000003d15ee0c2d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7fb71042b860 (LWP 19301)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007fb710dc0015 in virCondWait (c=0x195b3f0, m=0x195b398)
    at util/threads-pthread.c:117
#2  0x00007fb710dc096d in virThreadPoolFree (pool=0x195b370)
    at util/threadpool.c:172
#3  0x000000000045846f in virNetServerFree (srv=0x195b280)
    at rpc/virnetserver.c:757
#4  0x0000000000422337 in main (argc=1, argv=0x7fff94dfbf88) at libvirtd.c:1561

Comment 2 Jiri Denemark 2011-08-02 19:42:35 UTC
The thing is that virThreadPoolFree asks all threads to quit and waits until they do. If one of those threads is waiting for an I/O event and the I/O thread quits before signaling it, the waiting thread will never quit, since its condition will never be signaled. The bug is even easier to hit when a thread is inside qemuMigrationWaitForCompletion, which sends commands to the qemu monitor in a loop: even if the I/O thread signals the right condition before quitting, the migration thread will send another command to qemu and then wait on a condition that no one will ever signal.

Comment 3 dyuan 2011-08-09 11:07:08 UTC
I can reproduce this bug with "kill -SIGINT" but can't reproduce it with "service libvirtd stop".

# virsh managedsave dom & sleep 1; service libvirtd stop
[1] 19235
Stopping libvirtd daemon: error: Failed to save domain dom state
error: End of file while reading data: Input/output error

                                                           [  OK  ]
[1]+  Exit 1                  virsh managedsave dom

==========================

# virsh managedsave dom & sleep 1; kill -SIGINT `pidof libvirtd`
[2] 19067

# ps aux|grep libvirtd
root     18957  1.3  0.1 639660 14468 ?        Sl   06:47   0:00 libvirtd --daemon
root     19073  0.0  0.0 103304   876 pts/0    S+   06:48   0:00 grep libvirtd

# service libvirtd status
libvirtd (pid  18957) is running...

# virsh list --all
error: Failed to reconnect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory

# ps aux|grep virsh
root     19067  0.0  0.0 225332  4324 pts/0    S    06:48   0:00 virsh managedsave dom

Comment 10 Cole Robinson 2016-03-23 13:07:29 UTC
Has anyone verified this is still an issue? It hasn't been materially updated in over 4 years.

Comment 11 Jiri Denemark 2016-03-23 13:27:04 UTC
Yes.

Comment 12 Daniel Berrangé 2020-11-03 16:30:50 UTC
Thank you for reporting this issue to the libvirt project. Unfortunately we have been unable to resolve this issue due to insufficient maintainer capacity, and it will now be closed. This is not a reflection on the possible validity of the issue, merely the lack of resources to investigate and address it, for which we apologise. If you nonetheless feel the issue is still important, you may choose to report it again at the new project issue tracker: https://gitlab.com/libvirt/libvirt/-/issues The project also welcomes contributions from anyone who believes they can provide a solution.
