Hide Forgot
Description of problem: the following case was found on libvirt-0.9.3-7.el6.x86_64. 1) host runs large number of domains (127) 2) vdsm service is restarted 3) libvirt crash on the following: #0 _int_malloc (av=0x323938be80, bytes=<value optimized out>) at malloc.c:4439 #1 0x0000003239079718 in __libc_calloc (n=<value optimized out>, elem_size=<value optimized out>) at malloc.c:4065 #2 0x000000323880acdf in _dl_new_object (realname=0x906b90 "/lib64/libgcc_s.so.1", libname=0x3239151d7e "libgcc_s.so.1", type=2, loader=0x0, mode=-1879048191, nsid=0) at dl-object.c:77 #3 0x0000003238806b5e in _dl_map_object_from_fd (name=0x3239151d7e "libgcc_s.so.1", fd=20, fbp=0x7fff64e41090, realname=0x906b90 "/lib64/libgcc_s.so.1", loader=0x0, l_type=2, mode=-1879048191, stack_endp=0x7fff64e413d8, nsid=0) at dl-load.c:960 #4 0x0000003238807cf8 in _dl_map_object (loader=0x0, name=0x3239151d7e "libgcc_s.so.1", type=2, trace_mode=0, mode=<value optimized out>, nsid=<value optimized out>) at dl-load.c:2247 #5 0x0000003238812969 in dl_open_worker (a=0x7fff64e415f0) at dl-open.c:226 #6 0x000000323880e0c6 in _dl_catch_error (objname=0x7fff64e41640, errstring=0x7fff64e41638, mallocedp=0x7fff64e4164f, operate=0x3238812840 <dl_open_worker>, args=0x7fff64e415f0) at dl-error.c:178 #7 0x000000323881241a in _dl_open (file=0x3239151d7e "libgcc_s.so.1", mode=-2147483647, caller_dlopen=0x0, nsid=-2, argc=3, argv=<value optimized out>, env=0x7fff64e42ac8) at dl-open.c:555 #8 0x00000032391236f0 in do_dlopen (ptr=0x7fff64e417e0) at dl-libc.c:86 #9 0x000000323880e0c6 in _dl_catch_error (objname=0x7fff64e41800, errstring=0x7fff64e417f8, mallocedp=0x7fff64e4180f, operate=0x32391236b0 <do_dlopen>, args=0x7fff64e417e0) at dl-error.c:178 #10 0x0000003239123847 in dlerror_run (name=<value optimized out>, mode=<value optimized out>) at dl-libc.c:47 #11 __libc_dlopen_mode (name=<value optimized out>, mode=<value optimized out>) at dl-libc.c:160 #12 0x00000032390fbed5 in init () at ../sysdeps/ia64/backtrace.c:41 #13 0x000000323980cb43 in pthread_once () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:104 #14 0x00000032390fbfd4 in __backtrace (array=<value optimized out>, size=64) at ../sysdeps/ia64/backtrace.c:85 #15 0x000000323906fdaf in __libc_message (do_abort=2, fmt=0x3239154ed8 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:166 #16 0x00000032390756b6 in malloc_printerr (action=3, str=0x3239155180 "double free or corruption (!prev)", ptr=<value optimized out>) at malloc.c:6283 #17 0x00007f7ed34239a9 in virFree (ptrptr=0x7fff64e42218) at util/memory.c:310 #18 0x000000000043c11d in virNetServerClientFree (client=0x908050) at rpc/virnetserverclient.c:542 #19 0x0000000000440dad in virNetSocketEventFree (opaque=0x8e2760) at rpc/virnetsocket.c:1151 #20 0x00007f7ed341a087 in virEventPollCleanupHandles () at util/event_poll.c:547 #21 0x00007f7ed341a143 in virEventPollRunOnce () at util/event_poll.c:583 #22 0x00007f7ed34196f7 in virEventRunDefaultImpl () at util/event.c:247 #23 0x000000000043cb9d in virNetServerRun (srv=0x8e1ae0) at rpc/virnetserver.c:662 #24 0x000000000041d968 in main (argc=<value optimized out>, argv=<value optimized out>) at libvirtd.c:1552 (gdb) this issue seem to reproduce constantly (4 times). full core dumps provided to libvirt development team. [root@nott-vds3 core]# virsh -r list Segmentation fault
Created attachment 514167 [details] libvirt core.
I can reproduce virsh segfault as well. Symptoms are: - connect to 'qemu:///system' - no libvirtd running #0 0x00007ffff75e7eb7 in pthread_mutex_lock () from /lib64/libpthread.so.0 #1 0x00007ffff7af30d5 in virNetClientClose () from /usr/lib64/libvirt.so.0 #2 0x00007ffff7ae5cbf in doRemoteOpen () from /usr/lib64/libvirt.so.0 #3 0x00007ffff7ae7d61 in remoteOpen () from /usr/lib64/libvirt.so.0 #4 0x00007ffff7aa6c81 in do_open () from /usr/lib64/libvirt.so.0 #5 0x00007ffff7aa8f96 in virConnectOpenAuth () from /usr/lib64/libvirt.so.0 #6 0x0000000000425400 in main ()
Managed to reproduce it. Thread 9 (Thread 0x7fa8de227700 (LWP 18671)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 No locals. #1 0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 ret = <value optimized out> #2 0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81 job = <value optimized out> pool = 0x15c3bd0 #3 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157 args = 0x15c3110 #4 0x0000003307e077e1 in start_thread (arg=0x7fa8de227700) at pthread_create.c:301 __res = <value optimized out> pd = 0x7fa8de227700 now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363258033920, 7441651036830636411, 140736692556288, 140363258034624, 0, 3, -7487319643737745029, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> pagesize_m1 = <value optimized out> sp = <value optimized out> freesize = <value optimized out> #5 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 No locals. Thread 8 (Thread 0x7fa8d9c69700 (LWP 16991)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 No locals. #1 0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 ret = <value optimized out> #2 0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81 job = <value optimized out> pool = 0x15c3bd0 #3 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157 args = 0x15f1b10 #4 0x0000003307e077e1 in start_thread (arg=0x7fa8d9c69700) at pthread_create.c:301 __res = <value optimized out> pd = 0x7fa8d9c69700 now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363184903936, 7441651036830636411, 140736692556064, 140363184904640, 0, 3, -7487307232356001413, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> pagesize_m1 = <value optimized out> sp = <value optimized out> freesize = <value optimized out> #5 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 No locals. Thread 7 (Thread 0x7fa8dc424700 (LWP 18674)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 No locals. #1 0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 ret = <value optimized out> #2 0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81 job = <value optimized out> pool = 0x15c3bd0 #3 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157 args = 0x15cf450 #4 0x0000003307e077e1 in start_thread (arg=0x7fa8dc424700) at pthread_create.c:301 __res = <value optimized out> pd = 0x7fa8dc424700 now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363226564352, 7441651036830636411, 140736692556288, 140363226565056, 0, 3, -7487314968128972421, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> pagesize_m1 = <value optimized out> sp = <value optimized out> freesize = <value optimized out> #5 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 No locals. Thread 6 (Thread 0x7fa8c3fff700 (LWP 17209)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 No locals. #1 0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 ret = <value optimized out> #2 0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81 job = <value optimized out> pool = 0x15c3bd0 #3 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157 args = 0x15fa820 #4 0x0000003307e077e1 in start_thread (arg=0x7fa8c3fff700) at pthread_create.c:301 __res = <value optimized out> pd = 0x7fa8c3fff700 now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140362819565312, 7441651036830636411, 140736692556064, 140362819566016, 0, 3, -7487364332872459909, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> pagesize_m1 = <value optimized out> sp = <value optimized out> freesize = <value optimized out> #5 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 No locals. Thread 5 (Thread 0x7fa8dba23700 (LWP 18675)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 No locals. #1 0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 ret = <value optimized out> #2 0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81 job = <value optimized out> pool = 0x15c3bd0 #3 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157 args = 0x15cf400 #4 0x0000003307e077e1 in start_thread (arg=0x7fa8dba23700) at pthread_create.c:301 __res = <value optimized out> pd = 0x7fa8dba23700 now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363216074496, 7441651036830636411, 140736692556288, 140363216075200, 0, 3, -7487311945008866949, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> pagesize_m1 = <value optimized out> sp = <value optimized out> freesize = <value optimized out> #5 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 No locals. Thread 4 (Thread 0x7fa8dce25700 (LWP 18673)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 No locals. #1 0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 ret = <value optimized out> #2 0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81 job = <value optimized out> pool = 0x15c3bd0 #3 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157 args = 0x15cf4d0 #4 0x0000003307e077e1 in start_thread (arg=0x7fa8dce25700) at pthread_create.c:301 __res = <value optimized out> pd = 0x7fa8dce25700 now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363237054208, 7441651036830636411, 140736692556288, 140363237054912, 0, 3, -7487313595350050437, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> pagesize_m1 = <value optimized out> sp = <value optimized out> freesize = <value optimized out> #5 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 No locals. Thread 3 (Thread 0x7fa8d921b700 (LWP 17226)): #0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136 No locals. #1 0x0000003307e09345 in _L_lock_870 () from /lib64/libpthread-2.12.so No symbol table info available. #2 0x0000003307e09217 in __pthread_mutex_lock (mutex=0x1634d48) at pthread_mutex_lock.c:61 ignore1 = 128 ignore2 = 23285064 ignore3 = -512 type = <value optimized out> id = 17226 #3 0x000000000043b6bd in virNetServerClientLock (client=0x1634d40) at rpc/virnetserverclient.c:108 No locals. #4 virNetServerClientRef (client=0x1634d40) at rpc/virnetserverclient.c:342 __func__ = "virNetServerClientRef" #5 0x000000000043d4ee in virNetServerHandleJob (jobOpaque=<value optimized out>, opaque=<value optimized out>) at rpc/virnetserver.c:127 srv = 0x15c3ae0 job = 0x17b53d0 prog = 0x0 i = <value optimized out> __func__ = "virNetServerHandleJob" #6 0x00007fa8e4121f3a in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:98 job = 0x1855d70 pool = 0x15c3bd0 #7 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157 args = 0x15fa940 #8 0x0000003307e077e1 in start_thread (arg=0x7fa8d921b700) at pthread_create.c:301 __res = <value optimized out> pd = 0x7fa8d921b700 now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363174098688, 7441651036830636411, 140736692556064, 140363174099392, 0, 3, -7487308650768950917, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> pagesize_m1 = <value optimized out> sp = <value optimized out> freesize = <value optimized out> #9 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 No locals. Thread 2 (Thread 0x7fa8dd826700 (LWP 18672)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 No locals. #1 0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 ret = <value optimized out> #2 0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81 job = <value optimized out> pool = 0x15c3bd0 #3 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157 args = 0x15cf520 #4 0x0000003307e077e1 in start_thread (arg=0x7fa8dd826700) at pthread_create.c:301 __res = <value optimized out> pd = 0x7fa8dd826700 now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363247544064, 7441651036830636411, 140736692556288, 140363247544768, 0, 3, -7487316618470155909, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> pagesize_m1 = <value optimized out> sp = <value optimized out> freesize = <value optimized out> #5 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 No locals. Thread 1 (Thread 0x7fa8e40b9800 (LWP 18670)): #0 0x0000003307632a45 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 resultvar = 0 pid = <value optimized out> selftid = 18670 #1 0x00000033076341ad in abort () at abort.c:121 act = {__sigaction_handler = {sa_handler = 0, sa_sigaction = 0}, sa_mask = {__val = {18446744073709551615 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0} sigs = {__val = {32, 0 <repeats 15 times>}} #2 0x000000330766fdfb in __libc_message (do_abort=2, fmt=0x3307755b58 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:186 ap = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7fffd0911160, reg_save_area = 0x7fffd0911070}} ap_copy = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7fffd0911160, reg_save_area = 0x7fffd0911070}} fd = 2 on_2 = <value optimized out> list = <value optimized out> nlist = <value optimized out> cp = <value optimized out> written = <value optimized out> #3 0x0000003307675716 in malloc_printerr (action=3, str=0x3307755eb8 "free(): corrupted unsorted chunks", ptr=<value optimized out>) at malloc.c:6283 buf = "0000000001632350" cp = <value optimized out> #4 0x000000330f20961a in _sasl_conn_dispose (conn=0x7fa8d00c2010) at common.c:818 No locals. #5 0x000000330f20956a in sasl_dispose (pconn=0x7fa8d009ba70) at common.c:803 result = <value optimized out> #6 0x000000000044515b in virNetSASLSessionFree (sasl=0x7fa8d009ba70) at rpc/virnetsaslcontext.c:595 No locals. #7 0x0000000000440c79 in virNetSocketFree (sock=0x18acf40) at rpc/virnetsocket.c:669 __func__ = "virNetSocketFree" #8 0x00007fa8e410d087 in virEventPollCleanupHandles () at util/event_poll.c:547 ff = 0x440d60 <virNetSocketEventFree> opaque = 0x18acf40 i = 132 gap = <value optimized out> __func__ = "virEventPollCleanupHandles" #9 0x00007fa8e410d143 in virEventPollRunOnce () at util/event_poll.c:583 fds = 0x0 ret = <value optimized out> timeout = <value optimized out> nfds = <value optimized out> __func__ = "virEventPollRunOnce" __FUNCTION__ = "virEventPollRunOnce" #10 0x00007fa8e410c6f7 in virEventRunDefaultImpl () at util/event.c:247 __func__ = "virEventRunDefaultImpl" #11 0x000000000043cb9d in virNetServerRun (srv=0x15c3ae0) at rpc/virnetserver.c:662 timerid = -1 timerActive = 0 i = <value optimized out> __FUNCTION__ = "virNetServerRun" __func__ = "virNetServerRun" #12 0x000000000041d968 in main (argc=<value optimized out>, argv=<value optimized out>) at libvirtd.c:1552 srv = 0x15c3ae0 remote_config_file = 0x15c31f0 "/etc/libvirt/libvirtd.conf" statuswrite = -1 ret = 1 pid_file = 0x15cf5f0 "/var/run/libvirtd.pid" sock_file = 0x15cf7e0 "/var/run/libvirt/libvirt-sock" sock_file_ro = 0x15cf7b0 "/var/run/libvirt/libvirt-sock-ro" timeout = -1 verbose = 0 godaemon = 1 ipsock = 1 config = 0x15c3890 privileged = true opts = {{name = 0x4c83aa "verbose", has_arg = 0, flag = 0x7fffd0911554, val = 1}, {name = 0x4c83b2 "daemon", has_arg = 0, flag = 0x7fffd0911550, val = 1}, {name = 0x4dcc7e "listen", has_arg = 0, flag = 0x7fffd091154c, val = 1}, {name = 0x4d3e8a "config", has_arg = 1, flag = 0x0, val = 102}, {name = 0x4e625c "timeout", has_arg = 1, flag = 0x0, val = 116}, {name = 0x4ebf80 "pid-file", has_arg = 1, flag = 0x0, val = 112}, {name = 0x4d7185 "version", has_arg = 0, flag = 0x0, val = 129}, {name = 0x4d7372 "help", has_arg = 0, flag = 0x0, val = 63}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}} __func__ = "main"
Although I managed to reproduce virsh, it was on older than this commit: commit b14800af14f8c13f2ca80f0a912fbb94d68628e8 Author: Guannan Ren <gren> Date: Wed Jul 20 15:20:18 2011 +0100 Don't try to close a NULL virNetClientPtr * src/rpc/virnetclient.c: Skip close attempt if virNetClientPtr is NULL which solve it. Therefore I am unable to reproduce virsh part of this issue (which is something different than daemon issue) with the most recent GIT build. For tracking purpose, there was a bug created which covers the virsh part only: https://bugzilla.redhat.com/show_bug.cgi?id=723843
Haim, I've created scratch build for you. Can you please try to reproduce with it? http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3507924
I believe I see where the flaw is. Originally virNetSocket usage was protected by the mutex on virNetClient (or virNetServerClient) As part of a different fix, I gave virNetSocket an independant lock: commit 6198f3a1d7466ea05351e130d771b2d200f49793 Author: Daniel P. Berrange <berrange> Date: Tue Jul 19 14:00:24 2011 +0100 Add mutex locking and reference counting to virNetSocket Remove the need for a virNetSocket object to be protected by locks from the object using it, by introducing its own native locking and reference counting The problem, is that we have virNetSASLSessionPtr and virNetTLSSessionPtr object references which are held by *both* virNetSocket and virNetClient. Again those objects relied on virNetClient's mutex to provide safety. Unfortunately, when we gave virNetSocket its own mutex, this made the virNetSASLSession & virNetTLSSession usage unsafe. The sequence of events is: 1. Some thread calls virNetServerClientClose 2. This does virNetSocketRemoveIOCallback to unregister the FD 3. That thread later calls virNetServerClientFree 4. THis does virNetSASLSessionFree() In parallel with step 4, the event loop might finally invoke the async virFreeCallback related to step 2. In this case virNetSocketEventFree. virNetSocketEventFree calls virNetSASLSessionFree. This can race with step 4, and since virNetSASLSessionPtr has no lock of its own, we have a potential for bad stuff. To fix this, we need to introduce dedicated mutexes for virNetSASLSessionPtr and virNetTLSSessionPtr.
commited upstream as 5622830c53b3a78e13c2ff722ebbba6e9c198378 Daniel
I was able to reproduce a crash on that scenario: #0 0x0000003307632a45 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00000033076341ad in abort () at abort.c:121 #2 0x000000330766fdfb in __libc_message (do_abort=2, fmt=0x3307755b58 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:186 #3 0x0000003307675716 in malloc_printerr (action=3, str=0x3307755eb8 "free(): corrupted unsorted chunks", ptr=<value optimized out>) at malloc.c:6283 #4 0x000000330f20961a in _sasl_conn_dispose (conn=0x7f94a81292f0) at common.c:818 #5 0x000000330f20956a in sasl_dispose (pconn=0x7f94a80ae398) at common.c:803 #6 0x0000000000445435 in virNetSASLSessionFree (sasl=0x7f94a80ae370) at rpc/virnetsaslcontext.c:729 #7 0x0000000000440c79 in virNetSocketFree (sock=0x1eef7b0) at rpc/virnetsocket.c:669 #8 0x00007f94ce692087 in virEventPollCleanupHandles () at util/event_poll.c:547 #9 0x00007f94ce692143 in virEventPollRunOnce () at util/event_poll.c:583 #10 0x00007f94ce6916f7 in virEventRunDefaultImpl () at util/event.c:247 #11 0x000000000043cb9d in virNetServerRun (srv=0x1ec8ae0) at rpc/virnetserver.c:662 #12 0x000000000041d968 in main (argc=<value optimized out>, argv=<value optimized out>) at libvirtd.c:1552 wasnt able to reproduce the crash under valgrind but it found some invalid free(): ==32481== Invalid free() / delete / delete[] ==32481== at 0x4A0595D: free (vg_replace_malloc.c:366) ==32481== by 0x4E599A8: virFree (memory.c:310) ==32481== by 0x431B90: remoteClientFreeFunc (remote.c:406) ==32481== by 0x43C0B0: virNetServerClientFree (virnetserverclient.c:531) ==32481== by 0x43D5FC: virNetServerHandleJob (virnetserver.c:169) ==32481== by 0x4E64F39: virThreadPoolWorker (threadpool.c:98) ==32481== by 0x4E649A1: virThreadHelper (threads-pthread.c:157) ==32481== by 0x3307E077E0: start_thread (pthread_create.c:301) ==32481== by 0x33076E68EC: clone (clone.S:115) ==32481== Address 0x117d5460 is 0 bytes inside a block of size 96 free'd ==32481== at 0x4A0595D: free (vg_replace_malloc.c:366) ==32481== by 0x4E599A8: virFree (memory.c:310) ==32481== by 0x431B90: remoteClientFreeFunc (remote.c:406) ==32481== by 0x43C0B0: virNetServerClientFree (virnetserverclient.c:531) ==32481== by 0x440DAC: virNetSocketEventFree (virnetsocket.c:1151) ==32481== by 0x4E50086: virEventPollCleanupHandles (event_poll.c:547) ==32481== by 0x4E50142: virEventPollRunOnce (event_poll.c:583) ==32481== by 0x4E4F6F6: virEventRunDefaultImpl (event.c:247) ==32481== by 0x43CB9C: virNetServerRun (virnetserver.c:662) ==32481== by 0x41D967: main (libvirtd.c:1552) attached valgrind logs
Created attachment 515294 [details] valgrind logs
Created attachment 515297 [details] Attempt to fix ref counting race condition When an incoming RPC message is ready for processing, virNetServerClientDispatchRead() will invoke the 'dispatchFunc' callback. This is set to virNetServerDispatchNewMessage This function puts the message + client in a queue for processing by the thread pool. The thread pool worker function is virNetServerHandleJob The first thing this does is acquire an extra reference on the 'client'. Unfortunately, between the time the message+client are put on the thread pool queue, and the time the worker runs, the client object may have had its last reference removed. We clearly need to add the reference to the client object before putting the client on the processing queue
Created attachment 515300 [details] relevent part of libvirtd.log
David, I've created a scratch build for you: https://brewweb.devel.redhat.com/taskinfo?taskID=3517682 could you please give it a try?
(In reply to comment #15) > David, > > I've created a scratch build for you: > > https://brewweb.devel.redhat.com/taskinfo?taskID=3517682 > > could you please give it a try? Can't reproduce with this build.
commit 0302391ee643ad91fdc6d2ecf7e4cf0fc9724516 Author: Daniel P. Berrange <berrange> Date: Tue Jul 26 15:49:15 2011 +0100 Fix race in ref counting when handling RPC jobs When an incoming RPC message is ready for processing, virNetServerClientDispatchRead() will invoke the 'dispatchFunc' callback. This is set to virNetServerDispatchNewMessage This function puts the message + client in a queue for processing by the thread pool. The thread pool worker function is virNetServerHandleJob The first thing this does is acquire an extra reference on the 'client'. Unfortunately, between the time the message+client are put on the thread pool queue, and the time the worker runs, the client object may have had its last reference removed. We clearly need to add the reference to the client object before putting the client on the processing queue * src/rpc/virnetserverclient.c: Add a reference to the client when invoking the dispatch function * src/rpc/virnetserver.c: Don't acquire a reference to the client when in the worker thread v0.9.4-rc1-6-g0302391
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1513.html