Bug 670848 - [Libvirt] Libvirt daemon crashes on virDomainDefFree at conf/domain_conf.c:793
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.1
Hardware: x86_64
OS: Unspecified
Priority: low
Severity: urgent
Target Milestone: rc
Target Release: 6.1
Assignee: Laine Stump
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-01-19 14:27 UTC by David Naori
Modified: 2011-11-28 07:35 UTC
CC List: 12 users

Fixed In Version: libvirt-0.8.7-10.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-19 13:25:55 UTC
Target Upstream Version:
Embargoed:


Attachments
libvirt and gdb logs. (100.00 KB, application/x-tar), 2011-01-19 14:27 UTC, David Naori
libvirt log. (3.87 MB, application/x-gzip), 2011-01-20 09:34 UTC, David Naori


Links
Red Hat Product Errata RHBA-2011:0596 (normal, SHIPPED_LIVE): libvirt bug fix and enhancement update, last updated 2011-05-18 17:56:36 UTC

Description David Naori 2011-01-19 14:27:27 UTC
Created attachment 474278
libvirt and gdb logs.

Description of problem:
The libvirt daemon crashes while running several VMs using RHEVM
(virDomainDefFree at conf/domain_conf.c:793).


Version-Release number of selected component (if applicable):
- libvirt-0.8.7-2.el6
- vdsm-cli-4.9-43
- qemu-kvm-0.12.1.2-2.129.el6
- RHEL 6.1-based host

Steps to Reproduce (on RHEVM):
1. Create 20+ VMs on a single host.
2. Start and stop the VMs.


Additional info:

libvirt and gdb logs attached.

Comment 2 Daniel Veillard 2011-01-20 08:45:06 UTC
Looking at the stack trace, the sequence seems to be:

qemudDomainBlockStats(domain, "hda") finishes and calls
qemuDomainObjEndJob(obj) on line 7761.

qemuDomainObjEndJob(obj) finishes and calls
virDomainObjUnref(obj) on line 499 of src/qemu/qemu_domain.c.

The domain's reference count drops to 0, so it calls
virDomainObjFree(dom) on line 910,

which starts by calling virDomainDefFree(dom->def) on line 884,

and virDomainDefFree() segfaults at the beginning while dereferencing
the allocated graphics array.

It seems that something managed to free the domain in the meantime.
qemudDomainBlockStats() can take some time to process, so perhaps a
domain kill happened concurrently. Still, the locking should have
guaranteed that the object was preserved; I would guess that something
in the framework is not handling locking/unlocking properly.

Daniel
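
To make the suspected failure concrete, here is a minimal, self-contained C sketch of the reference-counting pattern involved. It is not libvirt source; dom_obj, dom_ref, dom_unref, and the guest name are invented stand-ins for virDomainObj and its ref/unref helpers. A worker must hold its own reference across a long-running job; a single extra unref anywhere breaks the invariant and produces exactly the free-under-the-feet crash described above.

/* Sketch only -- not libvirt code.  A worker pins the object with its
 * own reference before doing slow work, so a concurrent unref (for
 * example, the domain being destroyed) cannot free it mid-use. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct {
    pthread_mutex_t lock;
    int refs;               /* protected by 'lock' */
    char *def;              /* stands in for virDomainDefPtr */
} dom_obj;

static dom_obj *dom_new(const char *def)
{
    dom_obj *d = calloc(1, sizeof(*d));
    pthread_mutex_init(&d->lock, NULL);
    d->refs = 1;            /* the domain list's reference */
    d->def = strdup(def);
    return d;
}

static void dom_ref(dom_obj *d)
{
    pthread_mutex_lock(&d->lock);
    d->refs++;
    pthread_mutex_unlock(&d->lock);
}

static void dom_unref(dom_obj *d)
{
    pthread_mutex_lock(&d->lock);
    int last = (--d->refs == 0);
    pthread_mutex_unlock(&d->lock);
    if (last) {             /* last holder frees everything */
        free(d->def);
        pthread_mutex_destroy(&d->lock);
        free(d);
    }
}

/* Models a slow API call such as qemudDomainBlockStats(). */
static void *worker(void *arg)
{
    dom_obj *d = arg;
    usleep(1000);                           /* long-running job */
    printf("block stats for %s\n", d->def); /* safe: we hold a ref */
    dom_unref(d);                           /* end of job: drop exactly one */
    return NULL;
}

int main(void)
{
    dom_obj *d = dom_new("guest01");
    dom_ref(d);             /* taken on the worker's behalf *before* it runs */
    pthread_t t;
    pthread_create(&t, NULL, worker, d);
    dom_unref(d);           /* e.g. the domain is destroyed and delisted */
    pthread_join(t, NULL);
    return 0;
}

If any code path drops a reference it does not own, refs hits zero early and the worker's final access becomes the use-after-free seen in the backtrace.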

Comment 3 Daniel Veillard 2011-01-20 09:17:50 UTC
BTW, the domain logs are not present in the tarball; for some mysterious
reason it seems to be a tar of just gdb.txt. Can you provide the libvirt
logs as a separate attachment to this bug? Thanks!

Daniel

Comment 4 David Naori 2011-01-20 09:34:50 UTC
Created attachment 474427
libvirt log.

Comment 5 Daniel Berrangé 2011-01-20 11:12:15 UTC
The qemudDomainBlockStats() method looks correct, so I have to assume that elsewhere there is something that is unref'ing the virDomainObj twice.

Is this crash easily reproducible, and if so, does it occur on earlier 0.8.1 builds of libvirt? If not, that would indicate a regression, which would help in tracking down the problem.

Comment 6 Dave Allan 2011-01-20 18:42:57 UTC
Setting needinfo to make sure David Naori sees the question.

Comment 7 Dave Allan 2011-01-21 02:50:49 UTC
We may have a fix.  It fixes a crash caused by starting and stopping large numbers of VMs.  See:

https://www.redhat.com/archives/libvir-list/2011-January/msg00876.html

Comment 8 Daniel Veillard 2011-01-21 08:26:13 UTC
Since the fix is not in the latest build, I have made a set of test RPMs
available at
 http://veillard.com/libvirt/test/

They include the above patch on top of RHEL-6 libvirt-0.8.7-3. Please give
them a try and see whether the issue can still be reproduced with them.

Daniel

Comment 9 Dave Allan 2011-01-21 19:36:00 UTC
David, in the description you say "create 20+ VMs".  How many did you actually create?

Comment 10 Eric Blake 2011-01-21 21:32:06 UTC
I created bug 671564 and bug 671567 to track two separate issues found while investigating event.c.  The first is easily reproducible with creating then destroying 60 VMs, but has a stack trace different than the original report.  The second is something that I have not been able to reproduce (I found it by inspection), but since it involves a data race window, it may be the more likely culprit for fixing this report.  If we can definitively prove that either one of those bugs solves this issue, then we can mark this as duplicate.

Comment 11 David Naori 2011-01-22 19:31:44 UTC
Dave Allan, to answer your question: it was 27 VMs, to be precise.
Daniel, unfortunately I did not check earlier versions.

Comment 12 Daniel Veillard 2011-01-24 01:53:22 UTC
dnaori, can you reproduce the issue with the special build of libvirt
at http://veillard.com/libvirt/test/ ?

Please upgrade the libvirt in your testing environment to that version
and try to reproduce the problem, then report the result.

   Thanks!

Daniel

Comment 13 Dave Allan 2011-01-24 16:43:28 UTC
David, there is an additional fix that Eric produced after Daniel spun that build.  Eric will get you a scratch build with it today.

Comment 14 David Naori 2011-01-25 12:04:24 UTC
So far it seems like this build (http://veillard.com/libvirt/test/) solves the problem.

Comment 15 David Naori 2011-01-25 15:04:35 UTC
I managed to reproduce the crash with both Eric's build and Veillard's.

Comment 16 David Naori 2011-01-25 15:33:20 UTC
Core dump backtrace:
#0  virDomainDefFree (def=0x7f05d4017080) at conf/domain_conf.c:793
#1  0x00007f05ea2bb6a0 in virDomainObjFree (dom=0x7f05d401e6c0) at conf/domain_conf.c:884
#2  virDomainObjUnref (dom=0x7f05d401e6c0) at conf/domain_conf.c:910
#3  0x000000000043fca9 in qemudDomainBlockStats (dom=<value optimized out>, path=0x7f05d00dfe70 "hda", stats=<value optimized out>) at qemu/qemu_driver.c:7767
#4  0x00007f05ea303429 in virDomainBlockStats (dom=0x7f05d00e8860, path=0x7f05d00dfe70 "hda", stats=0x7f05dbffeae0, size=40) at libvirt.c:4501
#5  0x000000000042706a in remoteDispatchDomainBlockStats (server=<value optimized out>, client=<value optimized out>, conn=0x7f05d00009e0,
    hdr=<value optimized out>, rerr=0x7f05dbffeb90, args=0x7f05dbffecd0, ret=0x7f05dbffec70) at remote.c:918
#6  0x000000000042c5ba in remoteDispatchClientCall (server=0x15e1620, client=0x7f05dc0013a0, msg=0x7f05dc001520) at dispatch.c:530
#7  remoteDispatchClientRequest (server=0x15e1620, client=0x7f05dc0013a0, msg=0x7f05dc001520) at dispatch.c:408
#8  0x000000000041c2c8 in qemudWorker (data=0x7f05dc000908) at libvirtd.c:1582
#9  0x0000003e45a077e1 in ?? ()
#10 0x00007f05dbfff710 in ?? ()
#11 0x0000000000000000 in ?? ()

Comment 17 Daniel Veillard 2011-01-26 01:30:46 UTC
Argh ... that's not good ... let's sync on IRC to try to get a live gdb
session for it. In that case it's nearly impossible to guess just from a
stack trace; actually having detailed debug logs of the daemon may be
useful. Let's try to set this up together.

Daniel

Comment 20 Eric Blake 2011-01-28 20:25:07 UTC
(In reply to comment #18)
> these upstream patches (two approved, one
> still pending review):
> https://www.redhat.com/archives/libvir-list/2011-January/msg00985.html
> https://www.redhat.com/archives/libvir-list/2011-January/msg01151.html
> https://www.redhat.com/archives/libvir-list/2011-January/msg01106.html

Bug 673588 now tracks getting those three patches into RHEL, whether or not they prove to be the root cause of this crash.

Comment 21 David Naori 2011-01-30 11:08:02 UTC
Started testing the new scratch build https://brewweb.devel.redhat.com/taskinfo?taskID=3072886, and the crash still occurs.

Attached is the gdb backtrace of the core dump:

#0  0x00007fde21408d5c in virDomainDefFree (def=0x7fddfc001ed0) at conf/domain_conf.c:800
#1  0x00007fde21409315 in virDomainObjFree (dom=0x7fddfc069bb0) at conf/domain_conf.c:891                                                                        
#2  0x00007fde21409457 in virDomainObjUnref (dom=0x7fddfc069bb0) at conf/domain_conf.c:917                                                                       
#3  0x0000000000471993 in qemuDomainObjEndJob (obj=0x7fddfc069bb0) at qemu/qemu_domain.c:499                                                                     
#4  0x0000000000450360 in qemudDomainBlockStats (dom=0x7fde04169b40, path=0x7fde04017ff0 "hda", stats=0x7fde19753a00) at qemu/qemu_driver.c:7821                 
#5  0x00007fde21455849 in virDomainBlockStats (dom=0x7fde04169b40, path=0x7fde04017ff0 "hda", stats=0x7fde19753ab0, size=40) at libvirt.c:4501                   
#6  0x0000000000423ddb in remoteDispatchDomainBlockStats (server=0x1b6d640, client=0x7fde14001240, conn=0x7fde080009e0, hdr=0x7fde14110860, rerr=0x7fde19753c00, 
    args=0x7fde19753bb0, ret=0x7fde19753b50) at remote.c:918                                                                                                     
#7  0x0000000000430e36 in remoteDispatchClientCall (server=0x1b6d640, client=0x7fde14001240, msg=0x7fde140d0850, qemu_protocol=false) at dispatch.c:530          
#8  0x0000000000430a01 in remoteDispatchClientRequest (server=0x1b6d640, client=0x7fde14001240, msg=0x7fde140d0850) at dispatch.c:408                            
#9  0x000000000041d603 in qemudWorker (data=0x7fde140008c0) at libvirtd.c:1582                                                                                   
#10 0x00000030440077e1 in start_thread (arg=0x7fde19754710) at pthread_create.c:301                                                                              
#11 0x00000030438e153d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Comment 23 Eric Blake 2011-02-02 18:34:49 UTC
Using these steps from Wen Congyang:

1. Use gdb to debug libvirtd, and set a breakpoint in the function
   qemuConnectMonitor().
2. Start a VM; libvirtd will stop at the breakpoint in
   qemuConnectMonitor().
3. kill -STOP $(cat /var/run/libvirt/qemu/<domain>.pid)
4. Continue running libvirtd in gdb; libvirtd will block in the
   function qemuMonitorSetCapabilities().
5. kill -9 $(cat /var/run/libvirt/qemu/<domain>.pid)
6. Continue running libvirtd in gdb.

I saw libvirt crash:
11:12:44.882: 17952: error : qemuRemoveCgroup:335 : internal error Unable to find cgroup for windows_2008-32
11:12:44.882: 17952: warning : qemudShutdownVMDaemon:3109 : Failed to remove cgroup for windows_2008-32

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff0aaf700 (LWP 17950)]
0x0000003021675705 in malloc_consolidate () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003021675705 in malloc_consolidate () from /lib64/libc.so.6
#1  0x0000003021677f38 in _int_free () from /lib64/libc.so.6
#2  0x00007ffff79e2d73 in virFree (ptrptr=0x7ffff0aae7a0) at util/memory.c:311
#3  0x000000000041dc75 in qemudClientMessageRelease (client=0x7fffec0012f0, 
    msg=0x7fffe0014e10) at libvirtd.c:2065
#4  0x000000000041dd16 in qemudDispatchClientWrite (client=0x7fffec0012f0)
    at libvirtd.c:2095
#5  0x000000000041dfbe in qemudDispatchClientEvent (watch=8, fd=18, events=2, 
    opaque=0x6fadb0) at libvirtd.c:2165
#6  0x00000000004189ee in virEventDispatchHandles (nfds=7, fds=0x7fffec0011b0)
    at event.c:467
#7  0x0000000000419082 in virEventRunOnce () at event.c:599
#8  0x000000000041e1c1 in qemudOneLoop () at libvirtd.c:2265
#9  0x000000000041e6cd in qemudRunLoop (opaque=0x6fadb0) at libvirtd.c:2375
#10 0x0000003021e077e1 in start_thread () from /lib64/libpthread.so.0
#11 0x00000030216e151d in clone () from /lib64/libc.so.6

which, although a different trace, has all the same earmarks of being a case of the event dispatcher referencing double-freed memory. Then, applying Wen's patch and reproducing the scenario, I no longer see the crash. Therefore, it may be worth merging this bug and bug 673588.
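
As a debugging aid, a double unref like this can be made to fail fast instead of corrupting the heap. The sketch below is plain C, not libvirt code; obj and obj_unref are invented names. Asserting that the count is still positive on every unref aborts at the first erroneous drop, rather than leaving free() to blow up later inside malloc_consolidate() as in the trace above.

/* Sketch only -- not libvirt code.  Keeping the count in the object
 * and asserting on it turns a double unref into an immediate abort
 * at the guilty call site.  (A real multi-threaded version would
 * guard 'refs' with a mutex or use atomics.) */
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int refs;
    void *payload;
} obj;

static void obj_unref(obj *o)
{
    assert(o->refs > 0 && "unref of an object with no references left");
    if (--o->refs == 0) {
        free(o->payload);
        free(o);
    }
}

int main(void)
{
    obj *o = calloc(1, sizeof(*o));
    o->payload = malloc(16);
    o->refs = 1;
    obj_unref(o);      /* fine: last reference, object freed */
    /* obj_unref(o);      a second call would trip the assert --
                          and is a use-after-free besides */
    return 0;
}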

Comment 25 Laine Stump 2011-02-28 18:35:12 UTC
Here is valgrind output from a system that experienced this crash. It shows the thread that previously freed the domain object. I haven't analyzed it yet:

Invalid read of size 4
==13237==    at 0x333084F3A3: virDomainObjUnref (domain_conf.c:979)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  Address 0x4e6cf78 is 40 bytes inside a block of size 104 free'd
==13237==    at 0x4A04D72: free (vg_replace_malloc.c:325)
==13237==    by 0x33308398A8: virFree (memory.c:311)
==13237==    by 0x333084F44F: virDomainObjUnref (domain_conf.c:965)
==13237==    by 0x46BB02: qemuMonitorUnref (qemu_monitor.c:209)
==13237==    by 0x4186FE: virEventCleanupHandles (event.c:538)
==13237==    by 0x418D14: virEventRunOnce (event.c:603)
==13237==    by 0x41B398: qemudOneLoop (libvirtd.c:2238)
==13237==    by 0x41B856: qemudRunLoop (libvirtd.c:2348)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237== 
==13237== Invalid read of size 4
==13237==    at 0x366B60A640: pthread_mutex_unlock (in /lib64/libpthread-2.12.so)
==13237==    by 0x333084F3BF: virDomainObjUnref (domain_conf.c:980)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  Address 0x4e6cf60 is 16 bytes inside a block of size 104 free'd
==13237==    at 0x4A04D72: free (vg_replace_malloc.c:325)
==13237==    by 0x33308398A8: virFree (memory.c:311)
==13237==    by 0x333084F44F: virDomainObjUnref (domain_conf.c:965)
==13237==    by 0x46BB02: qemuMonitorUnref (qemu_monitor.c:209)
==13237==    by 0x4186FE: virEventCleanupHandles (event.c:538)
==13237==    by 0x418D14: virEventRunOnce (event.c:603)
==13237==    by 0x41B398: qemudOneLoop (libvirtd.c:2238)
==13237==    by 0x41B856: qemudRunLoop (libvirtd.c:2348)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237== 
==13237== Invalid read of size 4
==13237==    at 0x366B60A1F0: __pthread_mutex_unlock_full (in /lib64/libpthread-2.12.so)
==13237==    by 0x333084F3BF: virDomainObjUnref (domain_conf.c:980)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  Address 0x4e6cf60 is 16 bytes inside a block of size 104 free'd
==13237==    at 0x4A04D72: free (vg_replace_malloc.c:325)
==13237==    by 0x33308398A8: virFree (memory.c:311)
==13237==    by 0x333084F44F: virDomainObjUnref (domain_conf.c:965)
==13237==    by 0x46BB02: qemuMonitorUnref (qemu_monitor.c:209)
==13237==    by 0x4186FE: virEventCleanupHandles (event.c:538)
==13237==    by 0x418D14: virEventRunOnce (event.c:603)
==13237==    by 0x41B398: qemudOneLoop (libvirtd.c:2238)
==13237==    by 0x41B856: qemudRunLoop (libvirtd.c:2348)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237== 
==13237== Invalid read of size 8
==13237==    at 0x333084F3F7: virDomainObjUnref (domain_conf.c:955)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  Address 0x4e6cf88 is 56 bytes inside a block of size 104 free'd
==13237==    at 0x4A04D72: free (vg_replace_malloc.c:325)
==13237==    by 0x33308398A8: virFree (memory.c:311)
==13237==    by 0x333084F44F: virDomainObjUnref (domain_conf.c:965)
==13237==    by 0x46BB02: qemuMonitorUnref (qemu_monitor.c:209)
==13237==    by 0x4186FE: virEventCleanupHandles (event.c:538)
==13237==    by 0x418D14: virEventRunOnce (event.c:603)
==13237==    by 0x41B398: qemudOneLoop (libvirtd.c:2238)
==13237==    by 0x41B856: qemudRunLoop (libvirtd.c:2348)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237== 
==13237== Invalid read of size 4
==13237==    at 0x333084EC59: virDomainDefFree (domain_conf.c:859)
==13237==    by 0x333084F3FF: virDomainObjUnref (domain_conf.c:955)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  Address 0x4e2fcb8 is 312 bytes inside a block of size 632 free'd
==13237==    at 0x4A04D72: free (vg_replace_malloc.c:325)
==13237==    by 0x33308398A8: virFree (memory.c:311)
==13237==    by 0x333084F1A7: virDomainDefFree (domain_conf.c:945)
==13237==    by 0x333084F3FF: virDomainObjUnref (domain_conf.c:955)
==13237==    by 0x46BB02: qemuMonitorUnref (qemu_monitor.c:209)
==13237==    by 0x4186FE: virEventCleanupHandles (event.c:538)
==13237==    by 0x418D14: virEventRunOnce (event.c:603)
==13237==    by 0x41B398: qemudOneLoop (libvirtd.c:2238)
==13237==    by 0x41B856: qemudRunLoop (libvirtd.c:2348)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237== 
==13237== Invalid read of size 8
==13237==    at 0x333084EC70: virDomainDefFree (domain_conf.c:860)
==13237==    by 0x333084F3FF: virDomainObjUnref (domain_conf.c:955)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  Address 0x4e2fcc0 is 320 bytes inside a block of size 632 free'd
==13237==    at 0x4A04D72: free (vg_replace_malloc.c:325)
==13237==    by 0x33308398A8: virFree (memory.c:311)
==13237==    by 0x333084F1A7: virDomainDefFree (domain_conf.c:945)
==13237==    by 0x333084F3FF: virDomainObjUnref (domain_conf.c:955)
==13237==    by 0x46BB02: qemuMonitorUnref (qemu_monitor.c:209)
==13237==    by 0x4186FE: virEventCleanupHandles (event.c:538)
==13237==    by 0x418D14: virEventRunOnce (event.c:603)
==13237==    by 0x41B398: qemudOneLoop (libvirtd.c:2238)
==13237==    by 0x41B856: qemudRunLoop (libvirtd.c:2348)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237== 
==13237== Invalid read of size 8
==13237==    at 0x333084EC7C: virDomainDefFree (domain_conf.c:860)
==13237==    by 0x333084F3FF: virDomainObjUnref (domain_conf.c:955)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==13237== 
==13237== 
==13237== Process terminating with default action of signal 11 (SIGSEGV)
==13237==  Access not within mapped region at address 0x0
==13237==    at 0x333084EC7C: virDomainDefFree (domain_conf.c:860)
==13237==    by 0x333084F3FF: virDomainObjUnref (domain_conf.c:955)
==13237==    by 0x43FF88: qemudDomainBlockStats (qemu_driver.c:7819)
==13237==    by 0x3330897FE8: virDomainBlockStats (libvirt.c:4518)
==13237==    by 0x427339: remoteDispatchDomainBlockStats (remote.c:918)
==13237==    by 0x42C889: remoteDispatchClientRequest (dispatch.c:530)
==13237==    by 0x41C597: qemudWorker (libvirtd.c:1582)
==13237==    by 0x366B6077E0: start_thread (in /lib64/libpthread-2.12.so)
==13237==    by 0x366B2E5DCC: clone (in /lib64/libc-2.12.so)
==13237==  If you believe this happened as a result of a stack
==13237==  overflow in your program's main thread (unlikely but
==13237==  possible), you can try to increase the size of the
==13237==  main thread stack using the --main-stacksize= flag.
==13237==  The main thread stack size used in this run was 10485760.

Comment 26 Laine Stump 2011-03-01 08:56:51 UTC
Note that the above crash happened while running libvirt-0.8.7-8. Eric says that build has all the fixes discussed above.

Comment 27 Laine Stump 2011-03-01 18:26:43 UTC
I have a local test setup that creates and destroys a transient guest
while running domainblkstats on a block device of the domain, plus an
application called referential, supplied by danpb, that keeps track of
the refcount on every domain object (when libvirtd is run under
referential, that is). So far this has not reproduced the crash.

Now, in a quest to replicate the behavior on the reproducing system more
exactly, I'm wondering exactly which libvirt APIs vdsm calls to "stop
the vms". (I'm calling the destroy API. Maybe I need to be calling
"shutdown"? And in that case, what is the minimal client needed to
respond properly to that?)

Comment 28 Laine Stump 2011-03-03 20:00:53 UTC
Using the output of referential as a guide (along with Dan B's brain), I found two potential races in the code that deals with removing inactive DomainObjs when a domain is shut down. The patches for these have been posted upstream, and after review/push, I will submit rebased versions for the RHEL6 version of libvirt:

https://www.redhat.com/archives/libvir-list/2011-March/msg00112.html
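
The general shape of a fix for such a removal race, as a minimal self-contained C sketch (not the posted patches; remove_inactive_by_name and the vm struct are invented for illustration): every lookup and removal happens under the list lock and re-checks the object's presence and state, so when two shutdown paths race, the loser finds nothing to free.

/* Sketch only -- not the posted patches.  All list traffic happens
 * under list_lock, and removal re-checks both membership and state,
 * making a concurrent duplicate removal a harmless no-op. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct vm {
    const char *name;
    int active;
    struct vm *next;
} vm;

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static vm *domains;

static void remove_inactive_by_name(const char *name)
{
    pthread_mutex_lock(&list_lock);
    for (vm **p = &domains; *p; p = &(*p)->next) {
        if (strcmp((*p)->name, name) == 0) {
            if (!(*p)->active) {   /* re-check state under the lock */
                vm *dead = *p;
                *p = dead->next;
                free(dead);
            }
            break;
        }
    }
    /* Not found: another thread already removed it -- nothing to free. */
    pthread_mutex_unlock(&list_lock);
}

static void *shutdown_path(void *arg)
{
    remove_inactive_by_name(arg);  /* both racers may call this safely */
    return NULL;
}

int main(void)
{
    vm *d = calloc(1, sizeof(*d));
    d->name = "guest01";
    domains = d;

    pthread_t a, b;                /* two shutdown paths racing */
    pthread_create(&a, NULL, shutdown_path, "guest01");
    pthread_create(&b, NULL, shutdown_path, "guest01");
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("no double free");
    return 0;
}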

Comment 29 Jiri Denemark 2011-03-04 15:20:32 UTC
Patches sent to rhvirt-patches: http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-March/msg00100.html

Comment 30 Eric Blake 2011-03-04 19:48:16 UTC
Shoot - I think this introduces a deadlock regression into 'virsh save domain file'.

gdb shows the following backtrace:

Thread 7 (Thread 0x7fffe97fb700 (LWP 29297)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007ffff79e97ef in virCondWait (c=0x6fee68, m=0x6fee40)
    at util/threads-pthread.c:112
#2  0x000000000041cdb9 in qemudWorker (data=0x7fffec000920) at libvirtd.c:1608
#3  0x00000032984077e1 in start_thread (arg=0x7fffe97fb700)
    at pthread_create.c:301
#4  0x0000003297ce5dcd in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 6 (Thread 0x7fffea1fc700 (LWP 29296)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007ffff79e97ef in virCondWait (c=0x6fee68, m=0x6fee40)
    at util/threads-pthread.c:112
#2  0x000000000041cdb9 in qemudWorker (data=0x7fffec000908) at libvirtd.c:1608
#3  0x00000032984077e1 in start_thread (arg=0x7fffea1fc700)
    at pthread_create.c:301
#4  0x0000003297ce5dcd in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 5 (Thread 0x7fffeabfd700 (LWP 29295)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007ffff79e97ef in virCondWait (c=0x6fee68, m=0x6fee40)
    at util/threads-pthread.c:112
#2  0x000000000041cdb9 in qemudWorker (data=0x7fffec0008f0) at libvirtd.c:1608
#3  0x00000032984077e1 in start_thread (arg=0x7fffeabfd700)
    at pthread_create.c:301
#4  0x0000003297ce5dcd in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 4 (Thread 0x7fffeb5fe700 (LWP 29294)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007ffff79e97ef in virCondWait (c=0x6fee68, m=0x6fee40)
    at util/threads-pthread.c:112
#2  0x000000000041cdb9 in qemudWorker (data=0x7fffec0008d8) at libvirtd.c:1608
#3  0x00000032984077e1 in start_thread (arg=0x7fffeb5fe700)
    at pthread_create.c:301
#4  0x0000003297ce5dcd in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3 (Thread 0x7fffebfff700 (LWP 29293)):
#0  __lll_lock_wait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x0000003298409345 in _L_lock_870 () from /lib64/libpthread.so.0
#2  0x0000003298409217 in __pthread_mutex_lock (mutex=0x6fc6a0)
    at pthread_mutex_lock.c:61
#3  0x00007ffff79e9739 in virMutexLock (m=0x6fc6a0)
    at util/threads-pthread.c:80
#4  0x000000000041805d in virEventUpdateTimeoutImpl (timer=1, frequency=0)
    at event.c:247
#5  0x00007ffff79d5212 in virEventUpdateTimeout (timer=1, timeout=0)
    at util/event.c:70
#6  0x00000000004631da in qemuDomainEventQueue (driver=0x73b1a0, 
    event=0x7fffdc011dd0) at qemu/qemu_domain.c:97
#7  0x000000000043ed12 in qemudDomainSaveFlag (driver=0x73b1a0, 
    dom=0x7fffdc001970, vm=0x761950, 
    path=0x7fffdc000e10 "/var/run/libvirt/qemu/fed12.img", compressed=0)
    at qemu/qemu_driver.c:2074
#8  0x000000000043efd2 in qemudDomainSave (dom=0x7fffdc001970, 
    path=0x7fffdc000e10 "/var/run/libvirt/qemu/fed12.img")
    at qemu/qemu_driver.c:2137
#9  0x00007ffff7a40a9d in virDomainSave (domain=0x7fffdc001970, 
    to=0x7fffdc000e10 "/var/run/libvirt/qemu/fed12.img") at libvirt.c:2280
#10 0x0000000000426267 in remoteDispatchDomainSave (server=0x6fee40, 
    client=0x7fffec001190, conn=0x7fffd8000b00, hdr=0x7fffec041430, 
    rerr=0x7fffebffec00, args=0x7fffebffebb0, ret=0x7fffebffeb50)
    at remote.c:2273
#11 0x000000000043061a in remoteDispatchClientCall (server=0x6fee40, 
    client=0x7fffec001190, msg=0x7fffec001420, qemu_protocol=false)
    at dispatch.c:529
#12 0x00000000004301e5 in remoteDispatchClientRequest (server=0x6fee40, 
    client=0x7fffec001190, msg=0x7fffec001420) at dispatch.c:407
#13 0x000000000041ce79 in qemudWorker (data=0x7fffec0008c0) at libvirtd.c:1629
#14 0x00000032984077e1 in start_thread (arg=0x7fffebfff700)
    at pthread_create.c:301
#15 0x0000003297ce5dcd in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 2 (Thread 0x7ffff0aaa700 (LWP 29292)):
#0  __lll_lock_wait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x0000003298409345 in _L_lock_870 () from /lib64/libpthread.so.0
#2  0x0000003298409217 in __pthread_mutex_lock (mutex=0x761950)
    at pthread_mutex_lock.c:61
#3  0x00007ffff79e9739 in virMutexLock (m=0x761950)
    at util/threads-pthread.c:80
#4  0x00007ffff7a0dbd2 in virDomainObjLock (obj=0x761950)
    at conf/domain_conf.c:8447
#5  0x000000000046feb7 in qemuProcessHandleMonitorDestroy (mon=0x7fffdc001a60, 
    vm=0x761950) at qemu/qemu_process.c:599
#6  0x0000000000478893 in qemuMonitorFree (mon=0x7fffdc001a60)
    at qemu/qemu_monitor.c:209
#7  0x0000000000478931 in qemuMonitorUnref (mon=0x7fffdc001a60)
    at qemu/qemu_monitor.c:229
#8  0x000000000047896d in qemuMonitorUnwatch (monitor=0x7fffdc001a60)
    at qemu/qemu_monitor.c:242
#9  0x0000000000418fd4 in virEventCleanupHandles () at event.c:538
#10 0x00000000004192c4 in virEventRunOnce () at event.c:603
#11 0x000000000041e45e in qemudOneLoop () at libvirtd.c:2285
#12 0x000000000041e96a in qemudRunLoop (opaque=0x6fee40) at libvirtd.c:2395
#13 0x00000032984077e1 in start_thread (arg=0x7ffff0aaa700)
    at pthread_create.c:301
#14 0x0000003297ce5dcd in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7ffff7765800 (LWP 29290)):
#0  0x000000329840803d in pthread_join (threadid=140737231103744, 
    thread_return=0x0) at pthread_join.c:89
#1  0x0000000000421ae6 in main (argc=1, argv=0x7fffffffe1f8) at libvirtd.c:3411
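
The two blocked threads above form a classic lock-order inversion: thread 2 holds the event-loop lock and waits for the domain object's lock, while thread 3 holds the domain lock and calls back into the event loop. Below is a minimal, self-contained C sketch of the standard cure, a single global acquisition order; event_loop_lock, domain_lock, and the thread bodies are invented names, and this is not the actual patch.

/* Sketch only -- not libvirt code.  Every thread that needs both
 * locks takes them in the same fixed order, so neither can end up
 * holding one lock while waiting on the other. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t event_loop_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t domain_lock     = PTHREAD_MUTEX_INITIALIZER;

/* Fixed global order: event_loop_lock first, then domain_lock. */
static void *event_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&event_loop_lock);
    pthread_mutex_lock(&domain_lock);      /* same order as save_thread */
    puts("event thread: dispatching handle");
    pthread_mutex_unlock(&domain_lock);
    pthread_mutex_unlock(&event_loop_lock);
    return NULL;
}

static void *save_thread(void *arg)
{
    (void)arg;
    /* The deadlocked code took domain_lock first, then called into the
     * event loop.  Honoring the global order -- or dropping the domain
     * lock before touching the event loop -- avoids the deadlock. */
    pthread_mutex_lock(&event_loop_lock);
    pthread_mutex_lock(&domain_lock);
    puts("save thread: queueing domain event");
    pthread_mutex_unlock(&domain_lock);
    pthread_mutex_unlock(&event_loop_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, event_thread, NULL);
    pthread_create(&b, NULL, save_thread, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}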

Comment 31 Eric Blake 2011-03-07 17:56:14 UTC
This additional patch resolves the deadlock; v0.8.7-9 is DOA with just the first 2 of 3 patches, but 0.8.7-10 should have the third patch and solve this bug.

http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-March/msg00125.html

Comment 33 Haim 2011-03-08 13:57:33 UTC
Verified.

Started libvirtd in gdb and performed several shutdown scenarios (ACPI, kill, kill -9); libvirtd didn't crash.

libvirt-0.8.7-10.el6.x86_64
qemu-kvm-0.12.1.2-2.149.el6.x86_64
vdsm-4.9-52.el6.x86_64

Comment 36 errata-xmlrpc 2011-05-19 13:25:55 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html

