Bug 723811 - [libvirt] libvirt crash when running large number of domains and vdsm is restarted
Summary: [libvirt] libvirt crash when running large number of domains and vdsm is restarted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.2
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Assignee: Michal Privoznik
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-07-21 09:00 UTC by Haim
Modified: 2014-01-13 00:49 UTC
CC: 17 users

Fixed In Version: libvirt-0.9.4-0rc1.1.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-12-06 11:17:19 UTC
Target Upstream Version:


Attachments
libvirt core. (798.69 KB, application/x-xz)
2011-07-21 09:01 UTC, Haim
valgrind logs (611.09 KB, application/octet-stream)
2011-07-26 14:31 UTC, David Naori
Attempt to fix ref counting race condition (1.32 KB, patch)
2011-07-26 14:47 UTC, Daniel Berrangé
relevant part of libvirtd.log (94.21 KB, text/x-log)
2011-07-26 14:54 UTC, David Naori


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:1513 0 normal SHIPPED_LIVE libvirt bug fix and enhancement update 2011-12-06 01:23:30 UTC

Description Haim 2011-07-21 09:00:26 UTC
Description of problem:


The following crash was found on libvirt-0.9.3-7.el6.x86_64.

1) The host runs a large number of domains (127).
2) The vdsm service is restarted.
3) libvirtd crashes with the following backtrace:

#0  _int_malloc (av=0x323938be80, bytes=<value optimized out>) at malloc.c:4439
#1  0x0000003239079718 in __libc_calloc (n=<value optimized out>, elem_size=<value optimized out>) at malloc.c:4065
#2  0x000000323880acdf in _dl_new_object (realname=0x906b90 "/lib64/libgcc_s.so.1", libname=0x3239151d7e "libgcc_s.so.1", type=2, loader=0x0, mode=-1879048191, nsid=0) at dl-object.c:77
#3  0x0000003238806b5e in _dl_map_object_from_fd (name=0x3239151d7e "libgcc_s.so.1", fd=20, fbp=0x7fff64e41090, realname=0x906b90 "/lib64/libgcc_s.so.1", loader=0x0, l_type=2, mode=-1879048191, stack_endp=0x7fff64e413d8, nsid=0)
    at dl-load.c:960
#4  0x0000003238807cf8 in _dl_map_object (loader=0x0, name=0x3239151d7e "libgcc_s.so.1", type=2, trace_mode=0, mode=<value optimized out>, nsid=<value optimized out>) at dl-load.c:2247
#5  0x0000003238812969 in dl_open_worker (a=0x7fff64e415f0) at dl-open.c:226
#6  0x000000323880e0c6 in _dl_catch_error (objname=0x7fff64e41640, errstring=0x7fff64e41638, mallocedp=0x7fff64e4164f, operate=0x3238812840 <dl_open_worker>, args=0x7fff64e415f0) at dl-error.c:178
#7  0x000000323881241a in _dl_open (file=0x3239151d7e "libgcc_s.so.1", mode=-2147483647, caller_dlopen=0x0, nsid=-2, argc=3, argv=<value optimized out>, env=0x7fff64e42ac8) at dl-open.c:555
#8  0x00000032391236f0 in do_dlopen (ptr=0x7fff64e417e0) at dl-libc.c:86
#9  0x000000323880e0c6 in _dl_catch_error (objname=0x7fff64e41800, errstring=0x7fff64e417f8, mallocedp=0x7fff64e4180f, operate=0x32391236b0 <do_dlopen>, args=0x7fff64e417e0) at dl-error.c:178
#10 0x0000003239123847 in dlerror_run (name=<value optimized out>, mode=<value optimized out>) at dl-libc.c:47
#11 __libc_dlopen_mode (name=<value optimized out>, mode=<value optimized out>) at dl-libc.c:160
#12 0x00000032390fbed5 in init () at ../sysdeps/ia64/backtrace.c:41
#13 0x000000323980cb43 in pthread_once () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:104
#14 0x00000032390fbfd4 in __backtrace (array=<value optimized out>, size=64) at ../sysdeps/ia64/backtrace.c:85
#15 0x000000323906fdaf in __libc_message (do_abort=2, fmt=0x3239154ed8 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:166
#16 0x00000032390756b6 in malloc_printerr (action=3, str=0x3239155180 "double free or corruption (!prev)", ptr=<value optimized out>) at malloc.c:6283
#17 0x00007f7ed34239a9 in virFree (ptrptr=0x7fff64e42218) at util/memory.c:310
#18 0x000000000043c11d in virNetServerClientFree (client=0x908050) at rpc/virnetserverclient.c:542
#19 0x0000000000440dad in virNetSocketEventFree (opaque=0x8e2760) at rpc/virnetsocket.c:1151
#20 0x00007f7ed341a087 in virEventPollCleanupHandles () at util/event_poll.c:547
#21 0x00007f7ed341a143 in virEventPollRunOnce () at util/event_poll.c:583
#22 0x00007f7ed34196f7 in virEventRunDefaultImpl () at util/event.c:247
#23 0x000000000043cb9d in virNetServerRun (srv=0x8e1ae0) at rpc/virnetserver.c:662
#24 0x000000000041d968 in main (argc=<value optimized out>, argv=<value optimized out>) at libvirtd.c:1552
(gdb)


This issue reproduces consistently (4 times so far).

Full core dumps were provided to the libvirt development team.

[root@nott-vds3 core]# virsh -r list
Segmentation fault

Comment 1 Haim 2011-07-21 09:01:46 UTC
Created attachment 514167 [details]
libvirt core.

Comment 3 Michal Privoznik 2011-07-21 09:32:31 UTC
I can reproduce the virsh segfault as well. The symptoms are:
- connect to 'qemu:///system'
- no libvirtd running

#0  0x00007ffff75e7eb7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007ffff7af30d5 in virNetClientClose () from /usr/lib64/libvirt.so.0
#2  0x00007ffff7ae5cbf in doRemoteOpen () from /usr/lib64/libvirt.so.0
#3  0x00007ffff7ae7d61 in remoteOpen () from /usr/lib64/libvirt.so.0
#4  0x00007ffff7aa6c81 in do_open () from /usr/lib64/libvirt.so.0
#5  0x00007ffff7aa8f96 in virConnectOpenAuth () from /usr/lib64/libvirt.so.0
#6  0x0000000000425400 in main ()

Comment 4 David Naori 2011-07-21 10:01:06 UTC
Managed to reproduce it.

Thread 9 (Thread 0x7fa8de227700 (LWP 18671)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1  0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117
        ret = <value optimized out>
#2  0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81
        job = <value optimized out>
        pool = 0x15c3bd0
#3  0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
        args = 0x15c3110
#4  0x0000003307e077e1 in start_thread (arg=0x7fa8de227700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7fa8de227700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363258033920, 7441651036830636411, 140736692556288, 140363258034624, 0, 3, -7487319643737745029, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#5  0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.

Thread 8 (Thread 0x7fa8d9c69700 (LWP 16991)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1  0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117
        ret = <value optimized out>
#2  0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81
        job = <value optimized out>
        pool = 0x15c3bd0
#3  0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
        args = 0x15f1b10
#4  0x0000003307e077e1 in start_thread (arg=0x7fa8d9c69700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7fa8d9c69700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363184903936, 7441651036830636411, 140736692556064, 140363184904640, 0, 3, -7487307232356001413, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#5  0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.

Thread 7 (Thread 0x7fa8dc424700 (LWP 18674)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1  0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117
        ret = <value optimized out>
#2  0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81
        job = <value optimized out>
        pool = 0x15c3bd0
#3  0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
        args = 0x15cf450
#4  0x0000003307e077e1 in start_thread (arg=0x7fa8dc424700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7fa8dc424700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363226564352, 7441651036830636411, 140736692556288, 140363226565056, 0, 3, -7487314968128972421, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#5  0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.

Thread 6 (Thread 0x7fa8c3fff700 (LWP 17209)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1  0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117
        ret = <value optimized out>
#2  0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81
        job = <value optimized out>
        pool = 0x15c3bd0
#3  0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
        args = 0x15fa820
#4  0x0000003307e077e1 in start_thread (arg=0x7fa8c3fff700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7fa8c3fff700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140362819565312, 7441651036830636411, 140736692556064, 140362819566016, 0, 3, -7487364332872459909, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#5  0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.

Thread 5 (Thread 0x7fa8dba23700 (LWP 18675)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1  0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117
        ret = <value optimized out>
#2  0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81
        job = <value optimized out>
        pool = 0x15c3bd0
#3  0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
        args = 0x15cf400
#4  0x0000003307e077e1 in start_thread (arg=0x7fa8dba23700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7fa8dba23700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363216074496, 7441651036830636411, 140736692556288, 140363216075200, 0, 3, -7487311945008866949, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#5  0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.

Thread 4 (Thread 0x7fa8dce25700 (LWP 18673)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1  0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117
        ret = <value optimized out>
#2  0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81
        job = <value optimized out>
        pool = 0x15c3bd0
#3  0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
        args = 0x15cf4d0
#4  0x0000003307e077e1 in start_thread (arg=0x7fa8dce25700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7fa8dce25700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363237054208, 7441651036830636411, 140736692556288, 140363237054912, 0, 3, -7487313595350050437, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#5  0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.

Thread 3 (Thread 0x7fa8d921b700 (LWP 17226)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
No locals.
#1  0x0000003307e09345 in _L_lock_870 () from /lib64/libpthread-2.12.so
No symbol table info available.
#2  0x0000003307e09217 in __pthread_mutex_lock (mutex=0x1634d48) at pthread_mutex_lock.c:61
        ignore1 = 128
        ignore2 = 23285064
        ignore3 = -512
        type = <value optimized out>
        id = 17226
#3  0x000000000043b6bd in virNetServerClientLock (client=0x1634d40) at rpc/virnetserverclient.c:108
No locals.
#4  virNetServerClientRef (client=0x1634d40) at rpc/virnetserverclient.c:342
        __func__ = "virNetServerClientRef"
#5  0x000000000043d4ee in virNetServerHandleJob (jobOpaque=<value optimized out>, opaque=<value optimized out>) at rpc/virnetserver.c:127
        srv = 0x15c3ae0
        job = 0x17b53d0
        prog = 0x0
        i = <value optimized out>
        __func__ = "virNetServerHandleJob"
#6  0x00007fa8e4121f3a in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:98
        job = 0x1855d70
        pool = 0x15c3bd0
#7  0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
        args = 0x15fa940
#8  0x0000003307e077e1 in start_thread (arg=0x7fa8d921b700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7fa8d921b700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363174098688, 7441651036830636411, 140736692556064, 140363174099392, 0, 3, -7487308650768950917, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#9  0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.

Thread 2 (Thread 0x7fa8dd826700 (LWP 18672)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1  0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117
        ret = <value optimized out>
#2  0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81
        job = <value optimized out>
        pool = 0x15c3bd0
#3  0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
        args = 0x15cf520
#4  0x0000003307e077e1 in start_thread (arg=0x7fa8dd826700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7fa8dd826700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363247544064, 7441651036830636411, 140736692556288, 140363247544768, 0, 3, -7487316618470155909, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#5  0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.

Thread 1 (Thread 0x7fa8e40b9800 (LWP 18670)):
#0  0x0000003307632a45 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
        resultvar = 0
        pid = <value optimized out>
        selftid = 18670
#1  0x00000033076341ad in abort () at abort.c:121
        act = {__sigaction_handler = {sa_handler = 0, sa_sigaction = 0}, sa_mask = {__val = {18446744073709551615 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x000000330766fdfb in __libc_message (do_abort=2, fmt=0x3307755b58 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:186
        ap = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7fffd0911160, reg_save_area = 0x7fffd0911070}}
        ap_copy = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7fffd0911160, reg_save_area = 0x7fffd0911070}}
        fd = 2
        on_2 = <value optimized out>
        list = <value optimized out>
        nlist = <value optimized out>
        cp = <value optimized out>
        written = <value optimized out>
#3  0x0000003307675716 in malloc_printerr (action=3, str=0x3307755eb8 "free(): corrupted unsorted chunks", ptr=<value optimized out>) at malloc.c:6283
        buf = "0000000001632350"
        cp = <value optimized out>
#4  0x000000330f20961a in _sasl_conn_dispose (conn=0x7fa8d00c2010) at common.c:818
No locals.
#5  0x000000330f20956a in sasl_dispose (pconn=0x7fa8d009ba70) at common.c:803
        result = <value optimized out>
#6  0x000000000044515b in virNetSASLSessionFree (sasl=0x7fa8d009ba70) at rpc/virnetsaslcontext.c:595
No locals.
#7  0x0000000000440c79 in virNetSocketFree (sock=0x18acf40) at rpc/virnetsocket.c:669
        __func__ = "virNetSocketFree"
#8  0x00007fa8e410d087 in virEventPollCleanupHandles () at util/event_poll.c:547
        ff = 0x440d60 <virNetSocketEventFree>
        opaque = 0x18acf40
        i = 132
        gap = <value optimized out>
        __func__ = "virEventPollCleanupHandles"
#9  0x00007fa8e410d143 in virEventPollRunOnce () at util/event_poll.c:583
        fds = 0x0
        ret = <value optimized out>
        timeout = <value optimized out>
        nfds = <value optimized out>
        __func__ = "virEventPollRunOnce"
        __FUNCTION__ = "virEventPollRunOnce"
#10 0x00007fa8e410c6f7 in virEventRunDefaultImpl () at util/event.c:247
        __func__ = "virEventRunDefaultImpl"
#11 0x000000000043cb9d in virNetServerRun (srv=0x15c3ae0) at rpc/virnetserver.c:662
        timerid = -1
        timerActive = 0
        i = <value optimized out>
        __FUNCTION__ = "virNetServerRun"
        __func__ = "virNetServerRun"
#12 0x000000000041d968 in main (argc=<value optimized out>, argv=<value optimized out>) at libvirtd.c:1552
        srv = 0x15c3ae0
        remote_config_file = 0x15c31f0 "/etc/libvirt/libvirtd.conf"
        statuswrite = -1
        ret = 1
        pid_file = 0x15cf5f0 "/var/run/libvirtd.pid"
        sock_file = 0x15cf7e0 "/var/run/libvirt/libvirt-sock"
        sock_file_ro = 0x15cf7b0 "/var/run/libvirt/libvirt-sock-ro"
        timeout = -1
        verbose = 0
        godaemon = 1
        ipsock = 1
        config = 0x15c3890
        privileged = true
        opts = {{name = 0x4c83aa "verbose", has_arg = 0, flag = 0x7fffd0911554, val = 1}, {name = 0x4c83b2 "daemon", has_arg = 0, flag = 0x7fffd0911550, val = 1}, {name = 0x4dcc7e "listen", has_arg = 0, flag = 0x7fffd091154c, val = 1}, 
          {name = 0x4d3e8a "config", has_arg = 1, flag = 0x0, val = 102}, {name = 0x4e625c "timeout", has_arg = 1, flag = 0x0, val = 116}, {name = 0x4ebf80 "pid-file", has_arg = 1, flag = 0x0, val = 112}, {name = 0x4d7185 "version", 
            has_arg = 0, flag = 0x0, val = 129}, {name = 0x4d7372 "help", has_arg = 0, flag = 0x0, val = 63}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}
        __func__ = "main"

Comment 5 Michal Privoznik 2011-07-21 11:12:40 UTC
Although I managed to reproduce the virsh segfault, it was with a build older than this commit:

commit b14800af14f8c13f2ca80f0a912fbb94d68628e8
Author: Guannan Ren <gren>
Date:   Wed Jul 20 15:20:18 2011 +0100

    Don't try to close a NULL virNetClientPtr
    
    * src/rpc/virnetclient.c: Skip close attempt if virNetClientPtr
      is NULL

which solves it. Therefore, I am unable to reproduce the virsh part of this issue (which is distinct from the daemon issue) with the most recent git build.

For tracking purposes, a separate bug was created which covers the virsh part only: https://bugzilla.redhat.com/show_bug.cgi?id=723843

Comment 6 Michal Privoznik 2011-07-21 15:58:47 UTC
Haim,

I've created a scratch build for you. Can you please try to reproduce with it?

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3507924

Comment 7 Daniel Berrangé 2011-07-21 16:13:53 UTC
I believe I see where the flaw is.

Originally, virNetSocket usage was protected by the mutex on virNetClient (or virNetServerClient).

As part of a different fix, I gave virNetSocket an independent lock:

commit 6198f3a1d7466ea05351e130d771b2d200f49793
Author: Daniel P. Berrange <berrange>
Date:   Tue Jul 19 14:00:24 2011 +0100

    Add mutex locking and reference counting to virNetSocket
    
    Remove the need for a virNetSocket object to be protected by
    locks from the object using it, by introducing its own native
    locking and reference counting


The problem is that we have virNetSASLSessionPtr and virNetTLSSessionPtr object references which are held by *both* virNetSocket and virNetClient. Again, those objects relied on virNetClient's mutex to provide safety.

Unfortunately, when we gave virNetSocket its own mutex, this made the virNetSASLSession & virNetTLSSession usage unsafe.

The sequence of events is:

 1. Some thread calls virNetServerClientClose.

 2. This does virNetSocketRemoveIOCallback to unregister the FD.

 3. That thread later calls virNetServerClientFree.

 4. This does virNetSASLSessionFree().

In parallel with step 4, the event loop might finally invoke the async virFreeCallback related to step 2, in this case virNetSocketEventFree.

virNetSocketEventFree calls virNetSASLSessionFree. This can race with step 4, and since virNetSASLSessionPtr has no lock of its own, we have the potential for memory corruption.

To fix this, we need to introduce dedicated mutexes for virNetSASLSessionPtr and virNetTLSSessionPtr.
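A minimal sketch of that approach, giving the session object its own mutex and reference count so that concurrent Ref/Free calls from the socket and the client are serialized, might look like this (the demoSession names are illustrative, not libvirt's actual API):

```c
#include <pthread.h>
#include <stdlib.h>

/* Illustrative session object with its own lock, so that two threads
 * dropping references concurrently cannot both run the disposal path. */
typedef struct {
    pthread_mutex_t lock;   /* protects refs */
    int refs;               /* reference count */
} demoSession;

demoSession *demoSessionNew(void)
{
    demoSession *s = calloc(1, sizeof(*s));
    if (!s)
        return NULL;
    pthread_mutex_init(&s->lock, NULL);
    s->refs = 1;
    return s;
}

void demoSessionRef(demoSession *s)
{
    pthread_mutex_lock(&s->lock);
    s->refs++;
    pthread_mutex_unlock(&s->lock);
}

/* Returns 1 if the object was really disposed, 0 otherwise. */
int demoSessionFree(demoSession *s)
{
    int last;
    if (!s)
        return 0;
    pthread_mutex_lock(&s->lock);
    last = (--s->refs == 0);
    pthread_mutex_unlock(&s->lock);
    if (!last)
        return 0;               /* another holder remains */
    pthread_mutex_destroy(&s->lock);
    free(s);                    /* dispose exactly once */
    return 1;
}
```

With this shape, whichever of the two racing Free calls comes second is the one that disposes; the underlying SASL/TLS state is torn down exactly once.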

Comment 8 Daniel Veillard 2011-07-26 00:53:10 UTC
Committed upstream as 5622830c53b3a78e13c2ff722ebbba6e9c198378

Daniel

Comment 11 David Naori 2011-07-26 14:29:36 UTC
I was able to reproduce a crash with that scenario:

#0  0x0000003307632a45 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00000033076341ad in abort () at abort.c:121
#2  0x000000330766fdfb in __libc_message (do_abort=2, fmt=0x3307755b58 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:186
#3  0x0000003307675716 in malloc_printerr (action=3, str=0x3307755eb8 "free(): corrupted unsorted chunks", ptr=<value optimized out>) at malloc.c:6283
#4  0x000000330f20961a in _sasl_conn_dispose (conn=0x7f94a81292f0) at common.c:818
#5  0x000000330f20956a in sasl_dispose (pconn=0x7f94a80ae398) at common.c:803
#6  0x0000000000445435 in virNetSASLSessionFree (sasl=0x7f94a80ae370) at rpc/virnetsaslcontext.c:729
#7  0x0000000000440c79 in virNetSocketFree (sock=0x1eef7b0) at rpc/virnetsocket.c:669
#8  0x00007f94ce692087 in virEventPollCleanupHandles () at util/event_poll.c:547
#9  0x00007f94ce692143 in virEventPollRunOnce () at util/event_poll.c:583
#10 0x00007f94ce6916f7 in virEventRunDefaultImpl () at util/event.c:247
#11 0x000000000043cb9d in virNetServerRun (srv=0x1ec8ae0) at rpc/virnetserver.c:662
#12 0x000000000041d968 in main (argc=<value optimized out>, argv=<value optimized out>) at libvirtd.c:1552

I wasn't able to reproduce the crash under valgrind, but it found an invalid free():

==32481== Invalid free() / delete / delete[]
==32481==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==32481==    by 0x4E599A8: virFree (memory.c:310)
==32481==    by 0x431B90: remoteClientFreeFunc (remote.c:406)
==32481==    by 0x43C0B0: virNetServerClientFree (virnetserverclient.c:531)
==32481==    by 0x43D5FC: virNetServerHandleJob (virnetserver.c:169)
==32481==    by 0x4E64F39: virThreadPoolWorker (threadpool.c:98)
==32481==    by 0x4E649A1: virThreadHelper (threads-pthread.c:157)
==32481==    by 0x3307E077E0: start_thread (pthread_create.c:301)
==32481==    by 0x33076E68EC: clone (clone.S:115)
==32481==  Address 0x117d5460 is 0 bytes inside a block of size 96 free'd
==32481==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==32481==    by 0x4E599A8: virFree (memory.c:310)
==32481==    by 0x431B90: remoteClientFreeFunc (remote.c:406)
==32481==    by 0x43C0B0: virNetServerClientFree (virnetserverclient.c:531)
==32481==    by 0x440DAC: virNetSocketEventFree (virnetsocket.c:1151)
==32481==    by 0x4E50086: virEventPollCleanupHandles (event_poll.c:547)
==32481==    by 0x4E50142: virEventPollRunOnce (event_poll.c:583)
==32481==    by 0x4E4F6F6: virEventRunDefaultImpl (event.c:247)
==32481==    by 0x43CB9C: virNetServerRun (virnetserver.c:662)
==32481==    by 0x41D967: main (libvirtd.c:1552)

Valgrind logs attached.

Comment 12 David Naori 2011-07-26 14:31:38 UTC
Created attachment 515294 [details]
valgrind logs

Comment 13 Daniel Berrangé 2011-07-26 14:47:10 UTC
Created attachment 515297 [details]
Attempt to fix ref counting race condition

When an incoming RPC message is ready for processing,

  virNetServerClientDispatchRead()

will invoke the 'dispatchFunc' callback. This is set to

  virNetServerDispatchNewMessage

This function puts the message + client in a queue for processing by the thread pool. The thread pool worker function is

  virNetServerHandleJob

The first thing this does is acquire an extra reference on the 'client'.

Unfortunately, between the time the message+client are put on the thread pool queue, and the time the worker runs, the client object may have had its last reference removed.

We clearly need to add the reference to the client object before putting the client on the processing queue.
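That fix can be sketched as a simplified, single-threaded illustration of the dispatch path; all demo names here are hypothetical rather than libvirt's real functions:

```c
#include <stdlib.h>

/* Illustrative client with a reference count. */
typedef struct {
    int refs;
} demoClient;

typedef struct {
    demoClient *client;
} demoJob;

static void demoClientRef(demoClient *c) { c->refs++; }

/* Returns 1 when the last reference is dropped (object "disposed"). */
static int demoClientUnref(demoClient *c)
{
    return --c->refs == 0;
}

/* Dispatcher: take the extra reference while we still know the client
 * is live, BEFORE the job goes on the thread pool queue (the fix). */
static demoJob *demoDispatch(demoClient *c)
{
    demoJob *job = malloc(sizeof(*job));
    demoClientRef(c);
    job->client = c;
    return job;
}

/* Worker: consumes the reference it was handed; it must NOT try to
 * acquire a fresh one, since the client might otherwise already be gone. */
static int demoWorker(demoJob *job)
{
    int disposed = demoClientUnref(job->client);
    free(job);
    return disposed;
}
```

With the reference taken in the dispatcher, the queued job always holds a live client, even if the connection's own last reference is dropped before the worker thread runs.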

Comment 14 David Naori 2011-07-26 14:54:25 UTC
Created attachment 515300 [details]
relevant part of libvirtd.log

Comment 15 Michal Privoznik 2011-07-26 15:38:03 UTC
David,

I've created a scratch build for you:

https://brewweb.devel.redhat.com/taskinfo?taskID=3517682

could you please give it a try?

Comment 16 David Naori 2011-07-27 14:51:26 UTC
(In reply to comment #15)
> David,
> 
> I've created a scratch build for you:
> 
> https://brewweb.devel.redhat.com/taskinfo?taskID=3517682
> 
> could you please give it a try?

Can't reproduce with this build.

Comment 18 Michal Privoznik 2011-07-28 07:12:02 UTC
commit 0302391ee643ad91fdc6d2ecf7e4cf0fc9724516
Author: Daniel P. Berrange <berrange>
Date:   Tue Jul 26 15:49:15 2011 +0100

    Fix race in ref counting when handling RPC jobs
    
    When an incoming RPC message is ready for processing,
    
      virNetServerClientDispatchRead()
    
    will invoke the 'dispatchFunc' callback. This is set to
    
      virNetServerDispatchNewMessage
    
    This function puts the message + client in a queue for processing by the thread
    pool. The thread pool worker function is
    
      virNetServerHandleJob
    
    The first thing this does is acquire an extra reference on the 'client'.
    
    Unfortunately, between the time the message+client are put on the thread pool
    queue, and the time the worker runs, the client object may have had its last
    reference removed.
    
    We clearly need to add the reference to the client object before putting the
    client on the processing queue
    
    * src/rpc/virnetserverclient.c: Add a reference to the client when
      invoking the dispatch function
    * src/rpc/virnetserver.c: Don't acquire a reference to the client
      when in the worker thread

v0.9.4-rc1-6-g0302391

Comment 25 errata-xmlrpc 2011-12-06 11:17:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1513.html

