Bug 638285
Summary: | [libvirt] libvirt crash on multiple migration [Segmentation fault] | |
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | Haim <hateya>
Component: | libvirt | Assignee: | Eric Blake <eblake>
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 6.1 | CC: | antillon.maurizio, bazulay, berrange, cpelland, dallan, danken, dyuan, eblake, gsun, hateya, iheim, jdenemar, mgoldboi, mjenner, plyons, veillard, weizhan, xen-maint, yeylon, yimwang
Target Milestone: | rc | Keywords: | TestBlocker, ZStream
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | libvirt-0.8.6-1.el6 | Doc Type: | Bug Fix
Doc Text: | During migration, an application could query block information on the virtual guest being migrated. This resulted in a race condition that crashed libvirt. libvirt now verifies that a guest exists before attempting to start monitoring operations. | |
Story Points: | --- | |
Clone Of: | | |
: | 647940 (view as bug list) | Environment: |
Last Closed: | 2011-05-19 13:22:00 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 658141 | |
Attachments: | | |
Okay, we reproduced this and I got a reasonable stack trace:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff9d6bfd710 (LWP 12718)]
__pthread_mutex_lock (mutex=0x0) at pthread_mutex_lock.c:50
50        unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
(gdb) p mutex
$1 = (pthread_mutex_t *) 0x0
(gdb) where
#0  __pthread_mutex_lock (mutex=0x0) at pthread_mutex_lock.c:50
#1  0x000000000043d0c1 in qemuDomainObjEnterMonitor (obj=0x7ff9b8016080) at qemu/qemu_driver.c:840
#2  0x0000000000447995 in qemuDomainGetBlockInfo (dom=<value optimized out>, path=<value optimized out>, info=0x7ff9d6bfcaa0, flags=<value optimized out>) at qemu/qemu_driver.c:10180
#3  0x0000003050079960 in virDomainGetBlockInfo (domain=0x7ff9b80897d0, path=0x7ff9bc005c20 "/rhev/data-center/6d849ebf-755f-4552-ad09-9a090cda105d/8c2b9f05-2839-41f3-93ce-aa839a61fa08/images/0926ef59-3d53-4541-97a7-6f9f2bf22565/2220b9a1-30d9-45b4-b71c-284ab17863d3", info=0x7ff9d6bfcaa0, flags=0) at libvirt.c:4664
#4  0x000000000041d832 in remoteDispatchDomainGetBlockInfo (server=<value optimized out>, client=<value optimized out>, conn=0x7ff9c400d430, hdr=<value optimized out>, rerr=0x7ff9d6bfcb20, args=0x7ff9d6bfcc10, ret=0x7ff9d6bfcbb0) at remote.c:6524
#5  0x0000000000429441 in remoteDispatchClientCall (server=0x1081640, client=0x7ff9d8001eb0, msg=0x7ff9bc00df90) at dispatch.c:508
#6  0x0000000000429893 in remoteDispatchClientRequest (server=0x1081640, client=0x7ff9d8001eb0, msg=0x7ff9bc00df90) at dispatch.c:390
#7  0x0000000000418fb8 in qemudWorker (data=0x7ff9d8000950) at libvirtd.c:1564
#8  0x00000030440077e1 in start_thread (arg=0x7ff9d6bfd710) at pthread_create.c:301
#9  0x00000030438e153d in clone ()

So somehow we try to get a domain's block information, but the mutex we try to take is NULL:

(gdb) p *priv
$2 = {jobCond = {cond = {__data = {__lock = 0, __futex = 4, __total_seq = 2,
        __wakeup_seq = 2, __woken_seq = 2, __mutex = 0x7ff9b8016080,
        __nwaiters = 0, __broadcast_seq = 0},
      __size = "\000\000\000\000\004\000\000\000\002\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\200`\001\270\371\177\000\000\000\000\000\000\000\000\000",
      __align = 17179869184}}, jobActive = QEMU_JOB_UNSPECIFIED, jobSignals = 0,
  jobSignalsData = {migrateDowntime = 0}, jobInfo = {type = 0, timeElapsed = 0,
    timeRemaining = 0, dataTotal = 0, dataProcessed = 0, dataRemaining = 0,
    memTotal = 0, memProcessed = 0, memRemaining = 0, fileTotal = 0,
    fileProcessed = 0, fileRemaining = 0}, jobStart = 1288006483341, mon = 0x0,
  monConfig = 0x0, monJSON = 1, nvcpupids = 0, vcpupids = 0x0,
  pciaddrs = 0x7ff9b80166b0, persistentAddrs = 1}
(gdb) p *obj
$3 = {lock = {lock = {__data = {__lock = 2, __count = 0, __owner = 12718,
        __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
      __size = "\002\000\000\000\000\000\000\000\256\061\000\000\001", '\000' <repeats 26 times>,
      __align = 2}}, refs = 1, pid = -1, state = 5, autostart = 0, persistent = 0,
  def = 0x7ff9b8000dc0, newDef = 0x0, snapshots = {objs = 0x7ff9b8016570},
  current_snapshot = 0x0, privateData = 0x7ff9b800dea0,
  privateDataFreeFunc = 0x43ee00 <qemuDomainObjPrivateFree>}
(gdb) p *obj->def
$4 = {virtType = 2, id = -1, uuid = "\217Xf\263\324dJ\265\241K6\344\271\356\213\v",
  name = 0x7ff9b8001070 "qemu-pool-22", description = 0x0, memory = 524288,
  maxmem = 524288, hugepage_backed = 0 '\000', vcpus = 1, cpumasklen = 0,
  cpumask = 0x0, onReboot = 1, onPoweroff = 0, onCrash = 0,
  os = {type = 0x7ff9b8006b50 "hvm", arch = 0x7ff9b800da80 "x86_64",
    machine = 0x7ff9b800dfe0 "rhel6.0.0", nBootDevs = 1, bootDevs = {2, 0, 0, 0},
    init = 0x0, kernel = 0x0, initrd = 0x0, cmdline = 0x0, root = 0x0,
    loader = 0x0, bootloader = 0x0, bootloaderArgs = 0x0},
  emulator = 0x7ff9b800db80 "/usr/libexec/qemu-kvm", features = 1,
  clock = {offset = 2, data = {adjustment = 0, timezone = 0x0}, ntimers = 0,
    timers = 0x0}, ngraphics = 1, graphics = 0x7ff9b800d8d0, ndisks = 2,
  disks = 0x7ff9b8006b30, ncontrollers = 2, controllers = 0x7ff9b800dc60,
  nfss = 0, fss = 0x0, nnets = 1, nets = 0x7ff9b800d870, ninputs = 2,
  inputs = 0x7ff9b800d8f0, nsounds = 0, sounds = 0x0, nvideos = 1,
  videos = 0x7ff9b800de20, nhostdevs = 0, hostdevs = 0x0, nserials = 0,
  serials = 0x0, nparallels = 0, parallels = 0x0, nchannels = 1,
  channels = 0x7ff9b800e000, console = 0x0, seclabel = {model = 0x0,
    label = 0x0, imagelabel = 0x0, type = 0}, watchdog = 0x0,
  memballoon = 0x7ff9b800d720, cpu = 0x7ff9b800d930}
(gdb)

The scenario is VMs being migrated back and forth between 2 hosts while the VDSM daemon polls, monitoring the domains. It seems virDomainGetBlockInfo gets in, but the domain had been migrated away in the meantime, so when the call tries to lock the domain, the lock is NULL. Maybe there is a missing ref, or the lock is being deallocated while there is still a reference on the domain. Note the id=-1, proving the domain has been moved away.

Daniel
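For illustration only, the crash mechanics can be modelled outside libvirt. The sketch below uses invented names (fake_priv, fake_monitor) and deliberately contains the race being described: one thread plays the migration path and tears the monitor down, the other plays qemuDomainGetBlockInfo and locks the monitor mutex without rechecking that the domain is still alive, producing the same pthread_mutex_lock(mutex=0x0) crash as in the trace above.

```c
/* Standalone model of the race; hypothetical names, NOT libvirt code.
 * Build: gcc -pthread race.c -o race   (usually dies with SIGSEGV) */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

struct fake_monitor { pthread_mutex_t lock; };
struct fake_priv    { struct fake_monitor *mon; };

static struct fake_priv priv;

/* Plays the migration path: the guest goes away, the monitor is freed. */
static void *migrate_away(void *arg)
{
    (void)arg;
    free(priv.mon);
    priv.mon = NULL;
    return NULL;
}

/* Plays qemuDomainGetBlockInfo: enters the monitor without re-checking
 * that the domain is still active, so priv.mon may already be NULL. */
static void *query_block_info(void *arg)
{
    (void)arg;
    usleep(1000);                        /* lose the race on purpose */
    pthread_mutex_lock(&priv.mon->lock); /* NULL deref -> SIGSEGV, as in #0 */
    pthread_mutex_unlock(&priv.mon->lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    priv.mon = calloc(1, sizeof(*priv.mon));
    pthread_mutex_init(&priv.mon->lock, NULL);
    pthread_create(&a, NULL, migrate_away, NULL);
    pthread_create(&b, NULL, query_block_info, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```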
Also, the entry point (python) used to migrate the domain is live untunnelled p2p migration:

migrateToURI(duri, libvirt.VIR_MIGRATE_LIVE | libvirt.VIR_MIGRATE_PEER2PEER, None, 0)

Daniel

In qemuDomainGetBlockInfo() there needs to be a call to virDomainObjIsActive() *after* qemuDomainObjBeginJob(), but before qemuDomainObjEnterMonitor(), because there is no guarantee that the guest is still running once the method has acquired the job lock.

Proposed z-stream patch:
http://post-office.corp.redhat.com/archives/rhvirt-patches/2010-October/msg00482.html

*** Bug 647940 has been marked as a duplicate of this bug. ***

Fixed upstream in v0.8.4-236-g054d43f:

commit 054d43f570acf932e169f2463e8958bb19d7e966
Author: Eric Blake <eblake>
Date:   Tue Oct 26 09:31:19 2010 -0600

    qemu: check for vm after starting a job

    https://bugzilla.redhat.com/show_bug.cgi?id=638285 - when migrating a
    guest, it was very easy to provoke a race where an application could
    query block information on a VM that had just been migrated away. Any
    time qemu code obtains a job lock, it must also check that the VM was
    not taken down in the time where it was waiting for the lock.

    * src/qemu/qemu_driver.c (qemudDomainSetMemory)
    (qemudDomainGetInfo, qemuDomainGetBlockInfo): Check that vm still
    exists after obtaining job lock, before starting monitor action.
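The shape of the fix is a liveness check between taking the job lock and entering the monitor. The excerpt-style sketch below is reconstructed from the commit message, not copied from the patch; the error-reporting call and label names follow that era's qemu_driver.c conventions and may differ in detail:

```c
/* Sketch of the fixed qemuDomainGetBlockInfo flow (abridged reconstruction). */
if (qemuDomainObjBeginJob(vm) < 0)
    goto cleanup;

/* The fix: the guest may have been migrated away (or shut down) while this
 * thread was waiting for the job lock, leaving priv->mon NULL. Re-check
 * before touching the monitor. */
if (!virDomainObjIsActive(vm)) {
    qemuReportError(VIR_ERR_OPERATION_INVALID,
                    "%s", _("domain is not running"));
    goto endjob;
}

qemuDomainObjEnterMonitor(vm);
/* ... issue the block-info query on the monitor ... */
qemuDomainObjExitMonitor(vm);

endjob:
    if (qemuDomainObjEndJob(vm) == 0)
        vm = NULL;
```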
Tested it on build:

libvirt-0.8.1-29.el6.x86_64
libvirt-client-0.8.1-29.el6.x86_64
qemu-kvm-0.12.1.2-2.128.el6.x86_64
qemu-img-0.12.1.2-2.128.el6.x86_64
kernel-2.6.32-93.el6.x86_64

Steps:
1. Create 10 VMs on each side.
   # iptables -F
   # setsebool -P virt_use_nfs 1
2. Dispatch the ssh public key of the source host to the target host:
   # ssh-keygen -t rsa
   # ssh-copy-id -i ~/.ssh/id_rsa.pub root@hostIP
3. Start the VMs on each side:
   # for i in {11..20}; do virsh start vm$i; done
   or
   # for i in {1..10}; do virsh start vm$i; done
4. Run concurrent bidirectional migration.
   On server 1:
   # for i in `seq 11 20`; do time virsh migrate --live vm$i qemu+ssh://10.66.93.59/system; virsh list --all; done
   On server 2:
   # for i in `seq 1 10`; do time virsh migrate --live vm$i qemu+ssh://10.66.93.206/system; virsh list --all; done
5. Check the output of step 4; virsh list works fine.

So it seems to work fine on libvirt without VDSM.

Verified PASS without rhevm and VDSM on:

kernel-2.6.32-92.el6.x86_64
libvirt-0.8.6-1.el6.x86_64
qemu-kvm-0.12.1.2-2.128.el6.x86_64

Migrated 13 guests at the same time from source to target and no error occurred.

(In reply to comment #13 and comment #14)

Actually, the bug was exposed by performing domblkinfo while p2p migration was in progress. Also, the suggested "for i in `seq 11 20`;do time virsh migrate..." performs sequential, not concurrent, migrations.

(In reply to comment #15)
> (In reply to comment #13 and comment #14)
>
> Actually, the bug was exposed by performing domblkinfo while p2p migration was
> in progress. Also, the suggested "for i in `seq 11 20`;do time virsh
> migrate..." performs sequential, not concurrent, migrations.

With only concurrent p2p migration, without performing domblkinfo, the result is in comment 14. I forgot to add the steps:

1. ssh-keygen -t rsa
   ssh-copy-id -i ~/.ssh/id_rsa.pub root@{dest ip}
2. for i in {0..10}; do ./migrate-cmd.sh mig$i; done

# cat migrate-cmd.sh
virsh migrate --live --p2p --tunnelled $1 qemu+ssh://{dest ip}/system &

But if I do domblkinfo frequently during migration, libvirtd sometimes dies:

# for i in {0..10}; do virsh domblkinfo mig$i /mnt/mig$i; done
....
# virsh list
error: unable to connect to '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Connection refused
error: failed to connect to the hypervisor

[root@dhcp-93-197 ~]# service libvirtd status
libvirtd dead but pid file exists
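The same domblkinfo pressure can be generated through the public libvirt API instead of virsh. A minimal client along these lines (the domain name mig0 and disk path /mnt/mig0 are the hypothetical ones from the loop above; adjust to your setup) polls virDomainGetBlockInfo while the migrations run in another shell:

```c
/* Minimal libvirt client approximating the "virsh domblkinfo" loop above.
 * Build: gcc poll-blockinfo.c -o poll-blockinfo \
 *            $(pkg-config --cflags --libs libvirt) */
#include <stdio.h>
#include <unistd.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) {
        fprintf(stderr, "failed to connect to qemu:///system\n");
        return 1;
    }

    for (;;) {
        virDomainPtr dom = virDomainLookupByName(conn, "mig0");
        if (dom) {
            virDomainBlockInfo info;
            /* This is the call racing with migration in the crash above. */
            if (virDomainGetBlockInfo(dom, "/mnt/mig0", &info, 0) == 0)
                printf("capacity=%llu allocation=%llu physical=%llu\n",
                       info.capacity, info.allocation, info.physical);
            virDomainFree(dom);
        }
        usleep(100000); /* poll ~10x/sec, like a management daemon would */
    }

    /* not reached */
    virConnectClose(conn);
    return 0;
}
```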
(In reply to comment #16)
> But if during migration do domblkinfo frequently, the libvirtd sometimes will
> die
> # for i in {0..10}; do virsh domblkinfo mig$i /mnt/mig$i; done
> ....
> # virsh list
> error: unable to connect to '/var/run/libvirt/libvirt-sock', libvirtd may need
> to be started: Connection refused
> error: failed to connect to the hypervisor

Could you get a backtrace from libvirtd when it crashed?

- install libvirt-debuginfo
  # ulimit -c unlimited
  # /etc/init.d/libvirtd restart
- do what you need to crash libvirtd
- locate libvirtd's coredump (in the current directory or /)
  # gdb -c core
  and run the following in gdb:
  thread apply all backtrace

Are you sure you have the correct libvirt-debuginfo RPM installed? You shouldn't be seeing all those unresolved symbols in the stack trace.

Hi, Daniel

After I installed some other debuginfo rpms, I got a more usable coredump:

(gdb) thread apply all backtrace

Thread 7 (Thread 9274):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x000000390803c7b6 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:105
#2  0x000000000041b5c5 in qemudWorker (data=0x7fe9e80009e0) at libvirtd.c:1562
#3  0x00000038f80077e1 in start_thread (arg=0x7fe9cebfd710) at pthread_create.c:301
#4  0x00000038f78e153d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 6 (Thread 10923):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x000000390803c7b6 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:105
#2  0x000000000041b5c5 in qemudWorker (data=0x7fe9e8000998) at libvirtd.c:1562
#3  0x00000038f80077e1 in start_thread (arg=0x7fe9eedf1710) at pthread_create.c:301
#4  0x00000038f78e153d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 5 (Thread 9015):
#0  0x00000038f800803d in pthread_join (threadid=140642722195216, thread_return=0x0) at pthread_join.c:89
#1  0x000000000041e570 in main (argc=<value optimized out>, argv=<value optimized out>) at libvirtd.c:3306

Thread 4 (Thread 9273):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x000000390803c7b6 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:105
#2  0x000000000041b5c5 in qemudWorker (data=0x7fe9e80009c8) at libvirtd.c:1562
#3  0x00000038f80077e1 in start_thread (arg=0x7fe9cf5fe710) at pthread_create.c:301
#4  0x00000038f78e153d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3 (Thread 9862):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x000000390803c7b6 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:105
#2  0x000000000041b5c5 in qemudWorker (data=0x7fe9e8000980) at libvirtd.c:1562
#3  0x00000038f80077e1 in start_thread (arg=0x7fe9e61fc710) at pthread_create.c:301
#4  0x00000038f78e153d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 2 (Thread 9272):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x000000390803c7b6 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:105
#2  0x000000000041b5c5 in qemudWorker (data=0x7fe9e80009b0) at libvirtd.c:1562
#3  0x00000038f80077e1 in start_thread (arg=0x7fe9cffff710) at pthread_create.c:301
#4  0x00000038f78e153d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 9016):
#0  virEventMakePollFDs () at event.c:349
#1  virEventRunOnce () at event.c:570
#2  0x000000000041a4d9 in qemudOneLoop () at libvirtd.c:2231
#3  0x000000000041a7cb in qemudRunLoop (opaque=0x8d8640) at libvirtd.c:2340
#4  0x00000038f80077e1 in start_thread (arg=0x7fe9ef7f2710) at pthread_create.c:301
#5  0x00000038f78e153d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

And I found that during tunnelled migration, when I want to see the domain state with a virsh command on the source, the virsh command hangs.

Verified on:

vdsm-4.9-47.el6.x86_64
libvirt-0.8.7-4.el6.x86_64

2 hosts; performed concurrent migration of 10 VMs.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
A race condition where an application could query block information on a virtual guest that had just been migrated away could occur when migrating a guest. As a result, the libvirt service crashed. The libvirt application now verifies that a guest exists before attempting to start any monitoring operations.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-A race condition where an application could query block information on a virtual guest that had just been migrated away could occur when migrating a guest. As a result, the libvirt service crashed. The libvirt application now verifies that a guest exists before attempting to start any monitoring operations.
+During migration, an application could query block information on a the virtual guest being migrated. This resulted in a race condition that crashed libvirt. libvirt now verifies that a guest exists before attempting to start monitoring operations.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-During migration, an application could query block information on a the virtual guest being migrated. This resulted in a race condition that crashed libvirt. libvirt now verifies that a guest exists before attempting to start monitoring operations.
+During migration, an application could query block information on the virtual guest being migrated. This resulted in a race condition that crashed libvirt. libvirt now verifies that a guest exists before attempting to start monitoring operations.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html
Created attachment 450252 [details]
core.dump

Description of problem:

libvirt service crash on multiple migration. gdb output and core dump attached.

Program terminated with signal 11, Segmentation fault.
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
162    62:    movl    (%rsp), %edi

Repro:
1) migrate 8+ vms from source to destination

2.6.32-71.el6.x86_64
libvirt-0.8.1-27.el6.x86_64
vdsm-4.9-17.1.x86_64
device-mapper-multipath-0.4.9-30.el6.x86_64
lvm2-2.02.72-8.el6.x86_64
qemu-kvm-0.12.1.2-2.113.el6.x86_64
iptables-1.4.7-3.el6.x86_64

Happens when using rhevm and vdsm on top. This is consistent (3 times!).