Description of problem:
Attempting to kill libvirtd (such as by SIGINT, or 'service libvirtd stop') while libvirtd is in the middle of a 'virsh managedsave dom' can result in a thread being stuck waiting on the result of the managedsave but no I/O thread remaining to complete the transaction, causing libvirtd to hang instead of cleanly exiting.

Version-Release number of selected component (if applicable):
libvirt-0.9.4-0rc2.el6.x86_64

How reproducible:
I'm still trying to reproduce this with an existing build (so far, I have only seen it on a self-built libvirtd that was trying to fix other issues).

Steps to Reproduce:
1. virsh managedsave dom   # takes several seconds
2. while that is still running, kill libvirtd

Actual results:
libvirtd hung

Expected results:
libvirtd should cleanly exit, and should be able to restart and notice that a managed save had been in progress, so it can properly clean up after that action.

Additional info:
This was noticed while patching the issue in bug 727249, but is thought to be an independent issue, and since stopping libvirtd is less common than a completed migration, it should not hold up the upstream release of libvirt 0.9.4. Still, if a fix can be found in reasonable time, it would be worth backporting into RHEL 6.2.
Definitely an independent bug - I was able to reproduce it using libvirt 0.8.8-4.fc14.x86_64 from virt-preview on Fedora 14. Here's the backtrace that I got when using libvirt.git at commit 193cd0f3:

Thread 2 (Thread 0x7fb708b65700 (LWP 19323)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007fb710dc0015 in virCondWait (c=0x7fb70015b328, m=0x7fb70015b300) at util/threads-pthread.c:117
#2  0x00000000004ba7c5 in qemuMonitorSend (mon=0x7fb70015b300, msg=0x7fb708b64340) at qemu/qemu_monitor.c:802
#3  0x00000000004c84f2 in qemuMonitorJSONCommandWithFd (mon=0x7fb70015b300, cmd=0x7fb6f0000a80, scm_fd=-1, reply=0x7fb708b64440) at qemu/qemu_monitor_json.c:225
#4  0x00000000004c8629 in qemuMonitorJSONCommand (mon=0x7fb70015b300, cmd=0x7fb6f0000a80, reply=0x7fb708b64440) at qemu/qemu_monitor_json.c:254
#5  0x00000000004ccee0 in qemuMonitorJSONGetMigrationStatus (mon=0x7fb70015b300, status=0x7fb708b64540, transferred=0x7fb708b64530, remaining=0x7fb708b64528, total=0x7fb708b64520) at qemu/qemu_monitor_json.c:1920
#6  0x00000000004bcef7 in qemuMonitorGetMigrationStatus (mon=0x7fb70015b300, status=0x7fb708b64540, transferred=0x7fb708b64530, remaining=0x7fb708b64528, total=0x7fb708b64520) at qemu/qemu_monitor.c:1532
#7  0x00000000004b2d06 in qemuMigrationUpdateJobStatus (driver=0x7fb70008ef40, vm=0x7fb700091f20, job=0x5435fe "domain save job", asyncJob=QEMU_ASYNC_JOB_SAVE) at qemu/qemu_migration.c:764
#8  0x00000000004b3075 in qemuMigrationWaitForCompletion (driver=0x7fb70008ef40, vm=0x7fb700091f20, asyncJob=QEMU_ASYNC_JOB_SAVE) at qemu/qemu_migration.c:845
#9  0x00000000004b854d in qemuMigrationToFile (driver=0x7fb70008ef40, vm=0x7fb700091f20, fd=20, offset=4096, path=0x7fb6f0000a40 "/var/lib/libvirt/qemu/save/fedora_12.save", compressor=0x0, is_reg=true, bypassSecurityDriver=true, asyncJob=QEMU_ASYNC_JOB_SAVE) at qemu/qemu_migration.c:2777
#10 0x000000000046ab4c in qemuDomainSaveInternal (driver=0x7fb70008ef40, dom=0x7fb6f00008c0, vm=0x7fb700091f20, path=0x7fb6f0000a40 "/var/lib/libvirt/qemu/save/fedora_12.save", compressed=0, bypass_cache=false, xmlin=0x0) at qemu/qemu_driver.c:2407
#11 0x000000000046b4d2 in qemuDomainManagedSave (dom=0x7fb6f00008c0, flags=0) at qemu/qemu_driver.c:2589
#12 0x00007fb710e5935c in virDomainManagedSave (dom=0x7fb6f00008c0, flags=0) at libvirt.c:15319
#13 0x000000000042875f in remoteDispatchDomainManagedSave (server=0x195b280, client=0x1966a30, hdr=0x19a6ce8, rerr=0x7fb708b64b30, args=0x7fb6f0000900) at remote_dispatch.h:2573
#14 0x0000000000428658 in remoteDispatchDomainManagedSaveHelper (server=0x195b280, client=0x1966a30, hdr=0x19a6ce8, rerr=0x7fb708b64b30, args=0x7fb6f0000900, ret=0x7fb6f0000930) at remote_dispatch.h:2551
#15 0x0000000000453efe in virNetServerProgramDispatchCall (prog=0x195b250, server=0x195b280, client=0x1966a30, msg=0x1966cd0) at rpc/virnetserverprogram.c:375
#16 0x0000000000453a00 in virNetServerProgramDispatch (prog=0x195b250, server=0x195b280, client=0x1966a30, msg=0x1966cd0) at rpc/virnetserverprogram.c:252
#17 0x0000000000456b21 in virNetServerHandleJob (jobOpaque=0x1960050, opaque=0x195b280) at rpc/virnetserver.c:155
#18 0x00007fb710dc06a4 in virThreadPoolWorker (opaque=0x195b370) at util/threadpool.c:98
#19 0x00007fb710dc01d7 in virThreadHelper (data=0x1965120) at util/threads-pthread.c:157
#20 0x0000003d16606ccb in start_thread (arg=0x7fb708b65700) at pthread_create.c:301
#21 0x0000003d15ee0c2d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7fb71042b860 (LWP 19301)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007fb710dc0015 in virCondWait (c=0x195b3f0, m=0x195b398) at util/threads-pthread.c:117
#2  0x00007fb710dc096d in virThreadPoolFree (pool=0x195b370) at util/threadpool.c:172
#3  0x000000000045846f in virNetServerFree (srv=0x195b280) at rpc/virnetserver.c:757
#4  0x0000000000422337 in main (argc=1, argv=0x7fff94dfbf88) at libvirtd.c:1561
The thing is that virThreadPoolFree asks all threads to quit and waits until they do so. If any of the threads is waiting for an I/O event and the I/O thread quits before signaling the waiting thread, that thread will never quit, since its condition will never be signaled. This bug is easier to spot when one of the threads is inside qemuMigrationWaitForCompletion, which sends commands to the qemu monitor in a loop; so even if the I/O thread signals the right condition before quitting, the migration thread will send another command to qemu and wait on a condition which no one will ever signal.
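For anyone unfamiliar with the shape of this deadlock, here is a minimal pthread sketch of the same pattern (not libvirt code; names and structure are illustrative only). A worker blocks in pthread_cond_wait() for a reply that only the I/O thread can deliver; shutdown stops the I/O thread first, so the final join on the worker never returns, which is analogous to virThreadPoolFree() waiting on the thread stuck in qemuMonitorSend():

/* deadlock-sketch.c: gcc -pthread deadlock-sketch.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool reply_ready = false;   /* would be set by the I/O thread */
static bool io_quit = false;       /* set by the shutdown path */

/* Worker: stand-in for the thread stuck waiting on a monitor reply. */
static void *worker(void *arg)
{
    pthread_mutex_lock(&lock);
    while (!reply_ready)               /* nobody is left to set this */
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* I/O thread: would normally deliver the reply, but is told to quit first. */
static void *io_thread(void *arg)
{
    pthread_mutex_lock(&lock);
    while (!io_quit)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
    return NULL;                       /* exits without reply_ready = true */
}

int main(void)
{
    pthread_t w, io;
    pthread_create(&io, NULL, io_thread, NULL);
    pthread_create(&w, NULL, worker, NULL);
    sleep(1);                          /* let the worker start waiting */

    /* Shutdown: stop the I/O thread before any reply has been delivered. */
    pthread_mutex_lock(&lock);
    io_quit = true;
    pthread_cond_broadcast(&cond);     /* worker wakes, sees no reply, waits again */
    pthread_mutex_unlock(&lock);
    pthread_join(io, NULL);

    /* Stand-in for virThreadPoolFree(): waits for the worker... forever. */
    printf("joining worker (this hangs, like libvirtd on SIGINT)\n");
    pthread_join(w, NULL);
    return 0;
}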
I can reproduce this bug with "kill -SIGINT" but can't reproduce it with "service libvirtd stop".

# virsh managedsave dom & sleep 1; service libvirtd stop
[1] 19235
Stopping libvirtd daemon: error: Failed to save domain dom state
error: End of file while reading data: Input/output error
                                                           [  OK  ]
[1]+  Exit 1                  virsh managedsave dom
==========================
# virsh managedsave dom & sleep 1; kill -SIGINT `pidof libvirtd`
[2] 19067
# ps aux|grep libvirtd
root     18957  1.3  0.1 639660 14468 ?   Sl   06:47   0:00 libvirtd --daemon
root     19073  0.0  0.0 103304   876 pts/0  S+  06:48   0:00 grep libvirtd
# service libvirtd status
libvirtd (pid 18957) is running...
# virsh list --all
error: Failed to reconnect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory
# ps aux|grep virsh
root     19067  0.0  0.0 225332  4324 pts/0  S   06:48   0:00 virsh managedsave dom
Has anyone verified this is still an issue? It hasn't been materially updated in over 4 years.
Yes.
Thank you for reporting this issue to the libvirt project. Unfortunately we have been unable to resolve this issue due to insufficient maintainer capacity and it will now be closed. This is not a reflection on the possible validity of the issue, merely the lack of resources to investigate and address it, for which we apologise.

If you nonetheless feel the issue is still important, you may choose to report it again at the new project issue tracker: https://gitlab.com/libvirt/libvirt/-/issues

The project also welcomes contributions from anyone who believes they can provide a solution.