| Summary: | [libvirt] libvirt crash when running large number of domains and vdsm is restarted | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Haim <hateya> |
| Component: | libvirt | Assignee: | Michal Privoznik <mprivozn> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.2 | CC: | ajia, berrange, dallan, danken, dnaori, dyuan, hateya, iheim, mgoldboi, mprivozn, mzhan, rwu, vbian, veillard, xhu, yeylon, ykaul |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-0.9.4-0rc1.1.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-12-06 11:17:19 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
Created attachment 514167 [details]
libvirt core.
I can reproduce the virsh segfault as well. Symptoms are:
- connect to 'qemu:///system'
- no libvirtd running

#0 0x00007ffff75e7eb7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1 0x00007ffff7af30d5 in virNetClientClose () from /usr/lib64/libvirt.so.0
#2 0x00007ffff7ae5cbf in doRemoteOpen () from /usr/lib64/libvirt.so.0
#3 0x00007ffff7ae7d61 in remoteOpen () from /usr/lib64/libvirt.so.0
#4 0x00007ffff7aa6c81 in do_open () from /usr/lib64/libvirt.so.0
#5 0x00007ffff7aa8f96 in virConnectOpenAuth () from /usr/lib64/libvirt.so.0
#6 0x0000000000425400 in main ()

Managed to reproduce it.
Thread 9 (Thread 0x7fa8de227700 (LWP 18671)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1 0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117
ret = <value optimized out>
#2 0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81
job = <value optimized out>
pool = 0x15c3bd0
#3 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
args = 0x15c3110
#4 0x0000003307e077e1 in start_thread (arg=0x7fa8de227700) at pthread_create.c:301
__res = <value optimized out>
pd = 0x7fa8de227700
now = <value optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363258033920, 7441651036830636411, 140736692556288, 140363258034624, 0, 3, -7487319643737745029, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <value optimized out>
pagesize_m1 = <value optimized out>
sp = <value optimized out>
freesize = <value optimized out>
#5 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
Thread 8 (Thread 0x7fa8d9c69700 (LWP 16991)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1 0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117
ret = <value optimized out>
#2 0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81
job = <value optimized out>
pool = 0x15c3bd0
#3 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
args = 0x15f1b10
#4 0x0000003307e077e1 in start_thread (arg=0x7fa8d9c69700) at pthread_create.c:301
__res = <value optimized out>
pd = 0x7fa8d9c69700
now = <value optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363184903936, 7441651036830636411, 140736692556064, 140363184904640, 0, 3, -7487307232356001413, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <value optimized out>
pagesize_m1 = <value optimized out>
sp = <value optimized out>
freesize = <value optimized out>
#5 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
Thread 7 (Thread 0x7fa8dc424700 (LWP 18674)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1 0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117
ret = <value optimized out>
#2 0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81
job = <value optimized out>
pool = 0x15c3bd0
#3 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
args = 0x15cf450
#4 0x0000003307e077e1 in start_thread (arg=0x7fa8dc424700) at pthread_create.c:301
__res = <value optimized out>
pd = 0x7fa8dc424700
now = <value optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363226564352, 7441651036830636411, 140736692556288, 140363226565056, 0, 3, -7487314968128972421, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <value optimized out>
pagesize_m1 = <value optimized out>
sp = <value optimized out>
freesize = <value optimized out>
#5 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
Thread 6 (Thread 0x7fa8c3fff700 (LWP 17209)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1 0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117
ret = <value optimized out>
#2 0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81
job = <value optimized out>
pool = 0x15c3bd0
#3 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
args = 0x15fa820
#4 0x0000003307e077e1 in start_thread (arg=0x7fa8c3fff700) at pthread_create.c:301
__res = <value optimized out>
pd = 0x7fa8c3fff700
now = <value optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140362819565312, 7441651036830636411, 140736692556064, 140362819566016, 0, 3, -7487364332872459909, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <value optimized out>
pagesize_m1 = <value optimized out>
sp = <value optimized out>
freesize = <value optimized out>
#5 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
Thread 5 (Thread 0x7fa8dba23700 (LWP 18675)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1 0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117
ret = <value optimized out>
#2 0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81
job = <value optimized out>
pool = 0x15c3bd0
#3 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
args = 0x15cf400
#4 0x0000003307e077e1 in start_thread (arg=0x7fa8dba23700) at pthread_create.c:301
__res = <value optimized out>
pd = 0x7fa8dba23700
now = <value optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363216074496, 7441651036830636411, 140736692556288, 140363216075200, 0, 3, -7487311945008866949, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <value optimized out>
pagesize_m1 = <value optimized out>
sp = <value optimized out>
freesize = <value optimized out>
#5 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
Thread 4 (Thread 0x7fa8dce25700 (LWP 18673)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1 0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117
ret = <value optimized out>
#2 0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81
job = <value optimized out>
pool = 0x15c3bd0
#3 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
args = 0x15cf4d0
#4 0x0000003307e077e1 in start_thread (arg=0x7fa8dce25700) at pthread_create.c:301
__res = <value optimized out>
pd = 0x7fa8dce25700
now = <value optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363237054208, 7441651036830636411, 140736692556288, 140363237054912, 0, 3, -7487313595350050437, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <value optimized out>
pagesize_m1 = <value optimized out>
sp = <value optimized out>
freesize = <value optimized out>
#5 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
Thread 3 (Thread 0x7fa8d921b700 (LWP 17226)):
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
No locals.
#1 0x0000003307e09345 in _L_lock_870 () from /lib64/libpthread-2.12.so
No symbol table info available.
#2 0x0000003307e09217 in __pthread_mutex_lock (mutex=0x1634d48) at pthread_mutex_lock.c:61
ignore1 = 128
ignore2 = 23285064
ignore3 = -512
type = <value optimized out>
id = 17226
#3 0x000000000043b6bd in virNetServerClientLock (client=0x1634d40) at rpc/virnetserverclient.c:108
No locals.
#4 virNetServerClientRef (client=0x1634d40) at rpc/virnetserverclient.c:342
__func__ = "virNetServerClientRef"
#5 0x000000000043d4ee in virNetServerHandleJob (jobOpaque=<value optimized out>, opaque=<value optimized out>) at rpc/virnetserver.c:127
srv = 0x15c3ae0
job = 0x17b53d0
prog = 0x0
i = <value optimized out>
__func__ = "virNetServerHandleJob"
#6 0x00007fa8e4121f3a in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:98
job = 0x1855d70
pool = 0x15c3bd0
#7 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
args = 0x15fa940
#8 0x0000003307e077e1 in start_thread (arg=0x7fa8d921b700) at pthread_create.c:301
__res = <value optimized out>
pd = 0x7fa8d921b700
now = <value optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363174098688, 7441651036830636411, 140736692556064, 140363174099392, 0, 3, -7487308650768950917, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <value optimized out>
pagesize_m1 = <value optimized out>
sp = <value optimized out>
freesize = <value optimized out>
#9 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
Thread 2 (Thread 0x7fa8dd826700 (LWP 18672)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1 0x00007fa8e4121b86 in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117
ret = <value optimized out>
#2 0x00007fa8e4121f73 in virThreadPoolWorker (opaque=0x15c3bd0) at util/threadpool.c:81
job = <value optimized out>
pool = 0x15c3bd0
#3 0x00007fa8e41219a2 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:157
args = 0x15cf520
#4 0x0000003307e077e1 in start_thread (arg=0x7fa8dd826700) at pthread_create.c:301
__res = <value optimized out>
pd = 0x7fa8dd826700
now = <value optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140363247544064, 7441651036830636411, 140736692556288, 140363247544768, 0, 3, -7487316618470155909, 7430941516268640635}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <value optimized out>
pagesize_m1 = <value optimized out>
sp = <value optimized out>
freesize = <value optimized out>
#5 0x00000033076e68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
Thread 1 (Thread 0x7fa8e40b9800 (LWP 18670)):
#0 0x0000003307632a45 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
resultvar = 0
pid = <value optimized out>
selftid = 18670
#1 0x00000033076341ad in abort () at abort.c:121
act = {__sigaction_handler = {sa_handler = 0, sa_sigaction = 0}, sa_mask = {__val = {18446744073709551615 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0}
sigs = {__val = {32, 0 <repeats 15 times>}}
#2 0x000000330766fdfb in __libc_message (do_abort=2, fmt=0x3307755b58 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:186
ap = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7fffd0911160, reg_save_area = 0x7fffd0911070}}
ap_copy = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7fffd0911160, reg_save_area = 0x7fffd0911070}}
fd = 2
on_2 = <value optimized out>
list = <value optimized out>
nlist = <value optimized out>
cp = <value optimized out>
written = <value optimized out>
#3 0x0000003307675716 in malloc_printerr (action=3, str=0x3307755eb8 "free(): corrupted unsorted chunks", ptr=<value optimized out>) at malloc.c:6283
buf = "0000000001632350"
cp = <value optimized out>
#4 0x000000330f20961a in _sasl_conn_dispose (conn=0x7fa8d00c2010) at common.c:818
No locals.
#5 0x000000330f20956a in sasl_dispose (pconn=0x7fa8d009ba70) at common.c:803
result = <value optimized out>
#6 0x000000000044515b in virNetSASLSessionFree (sasl=0x7fa8d009ba70) at rpc/virnetsaslcontext.c:595
No locals.
#7 0x0000000000440c79 in virNetSocketFree (sock=0x18acf40) at rpc/virnetsocket.c:669
__func__ = "virNetSocketFree"
#8 0x00007fa8e410d087 in virEventPollCleanupHandles () at util/event_poll.c:547
ff = 0x440d60 <virNetSocketEventFree>
opaque = 0x18acf40
i = 132
gap = <value optimized out>
__func__ = "virEventPollCleanupHandles"
#9 0x00007fa8e410d143 in virEventPollRunOnce () at util/event_poll.c:583
fds = 0x0
ret = <value optimized out>
timeout = <value optimized out>
nfds = <value optimized out>
__func__ = "virEventPollRunOnce"
__FUNCTION__ = "virEventPollRunOnce"
#10 0x00007fa8e410c6f7 in virEventRunDefaultImpl () at util/event.c:247
__func__ = "virEventRunDefaultImpl"
#11 0x000000000043cb9d in virNetServerRun (srv=0x15c3ae0) at rpc/virnetserver.c:662
timerid = -1
timerActive = 0
i = <value optimized out>
__FUNCTION__ = "virNetServerRun"
__func__ = "virNetServerRun"
#12 0x000000000041d968 in main (argc=<value optimized out>, argv=<value optimized out>) at libvirtd.c:1552
srv = 0x15c3ae0
remote_config_file = 0x15c31f0 "/etc/libvirt/libvirtd.conf"
statuswrite = -1
ret = 1
pid_file = 0x15cf5f0 "/var/run/libvirtd.pid"
sock_file = 0x15cf7e0 "/var/run/libvirt/libvirt-sock"
sock_file_ro = 0x15cf7b0 "/var/run/libvirt/libvirt-sock-ro"
timeout = -1
verbose = 0
godaemon = 1
ipsock = 1
config = 0x15c3890
privileged = true
opts = {{name = 0x4c83aa "verbose", has_arg = 0, flag = 0x7fffd0911554, val = 1}, {name = 0x4c83b2 "daemon", has_arg = 0, flag = 0x7fffd0911550, val = 1}, {name = 0x4dcc7e "listen", has_arg = 0, flag = 0x7fffd091154c, val = 1},
{name = 0x4d3e8a "config", has_arg = 1, flag = 0x0, val = 102}, {name = 0x4e625c "timeout", has_arg = 1, flag = 0x0, val = 116}, {name = 0x4ebf80 "pid-file", has_arg = 1, flag = 0x0, val = 112}, {name = 0x4d7185 "version",
has_arg = 0, flag = 0x0, val = 129}, {name = 0x4d7372 "help", has_arg = 0, flag = 0x0, val = 63}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}
__func__ = "main"
Although I managed to reproduce the virsh crash, it was on a build older than this commit:
commit b14800af14f8c13f2ca80f0a912fbb94d68628e8
Author: Guannan Ren <gren>
Date: Wed Jul 20 15:20:18 2011 +0100
Don't try to close a NULL virNetClientPtr
* src/rpc/virnetclient.c: Skip close attempt if virNetClientPtr
is NULL
which solves it. Therefore I am unable to reproduce the virsh part of this issue (which is distinct from the daemon issue) with the most recent GIT build.
For tracking purposes, a separate bug was created that covers the virsh part only: https://bugzilla.redhat.com/show_bug.cgi?id=723843
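The NULL-guard fix in that commit ("Don't try to close a NULL virNetClientPtr") can be sketched as follows. This is a simplified stand-in, not libvirt's actual code; the struct and function names here are illustrative:

```c
#include <stddef.h>

/* Illustrative model of the fix: virsh segfaulted because the close
 * path dereferenced the client pointer (to take its mutex) without a
 * NULL check.  The fix is to bail out early when the pointer is NULL. */
typedef struct {
    int closed;
} FakeNetClient;

void fake_net_client_close(FakeNetClient *client)
{
    if (!client)            /* the fix: skip the close attempt on NULL */
        return;
    client->closed = 1;     /* real code would lock the mutex and close */
}
```

With this guard, a failed connection attempt (which leaves the client pointer NULL) no longer crashes on cleanup.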
Haim, I've created a scratch build for you. Can you please try to reproduce with it? http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3507924

I believe I see where the flaw is.
Originally, virNetSocket usage was protected by the mutex on virNetClient (or virNetServerClient).
As part of a different fix, I gave virNetSocket an independent lock:
commit 6198f3a1d7466ea05351e130d771b2d200f49793
Author: Daniel P. Berrange <berrange>
Date: Tue Jul 19 14:00:24 2011 +0100
Add mutex locking and reference counting to virNetSocket
Remove the need for a virNetSocket object to be protected by
locks from the object using it, by introducing its own native
locking and reference counting
The problem is that virNetSASLSessionPtr and virNetTLSSessionPtr object references are held by *both* virNetSocket and virNetClient. Those objects, too, relied on virNetClient's mutex to provide safety.
Unfortunately, when we gave virNetSocket its own mutex, this made the virNetSASLSession & virNetTLSSession usage unsafe.
The sequence of events is:
1. Some thread calls virNetServerClientClose.
2. This does virNetSocketRemoveIOCallback to unregister the FD.
3. That thread later calls virNetServerClientFree.
4. This does virNetSASLSessionFree().
In parallel with step 4, the event loop might finally invoke the async virFreeCallback registered in step 2, in this case virNetSocketEventFree. virNetSocketEventFree also calls virNetSASLSessionFree, which can race with step 4; since virNetSASLSessionPtr has no lock of its own, this is a recipe for memory corruption.
To fix this, we need to introduce dedicated mutexes for virNetSASLSessionPtr and virNetTLSSessionPtr.
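As a rough illustration of what a dedicated mutex buys here, a session object carrying its own lock and reference count might look like the sketch below. This is not libvirt's actual implementation; the names and fields are assumptions for the example:

```c
#include <pthread.h>
#include <stdlib.h>

/* Illustrative sketch: the session object owns its own mutex and
 * reference count, so frees racing in from a worker thread and from
 * the event loop serialize on the session itself rather than relying
 * on virNetClient's lock being held. */
typedef struct {
    pthread_mutex_t lock;
    int refs;
} FakeSession;

FakeSession *fake_session_new(void)
{
    FakeSession *s = calloc(1, sizeof(*s));
    if (!s)
        return NULL;
    pthread_mutex_init(&s->lock, NULL);
    s->refs = 1;
    return s;
}

void fake_session_ref(FakeSession *s)
{
    pthread_mutex_lock(&s->lock);
    s->refs++;
    pthread_mutex_unlock(&s->lock);
}

/* Returns 1 when this call actually disposed the session, 0 otherwise. */
int fake_session_unref(FakeSession *s)
{
    pthread_mutex_lock(&s->lock);
    int last = (--s->refs == 0);
    pthread_mutex_unlock(&s->lock);
    if (last) {
        /* only the final unref disposes; real code would call
         * sasl_dispose() here */
        pthread_mutex_destroy(&s->lock);
        free(s);
    }
    return last;
}
```

Because the count is decremented under the session's own lock, two threads calling unref concurrently cannot both see the count hit zero, which is exactly the double-free the crash shows in sasl_dispose().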
Committed upstream as 5622830c53b3a78e13c2ff722ebbba6e9c198378.

Daniel, I was able to reproduce a crash on that scenario:

#0 0x0000003307632a45 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00000033076341ad in abort () at abort.c:121
#2 0x000000330766fdfb in __libc_message (do_abort=2, fmt=0x3307755b58 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:186
#3 0x0000003307675716 in malloc_printerr (action=3, str=0x3307755eb8 "free(): corrupted unsorted chunks", ptr=<value optimized out>) at malloc.c:6283
#4 0x000000330f20961a in _sasl_conn_dispose (conn=0x7f94a81292f0) at common.c:818
#5 0x000000330f20956a in sasl_dispose (pconn=0x7f94a80ae398) at common.c:803
#6 0x0000000000445435 in virNetSASLSessionFree (sasl=0x7f94a80ae370) at rpc/virnetsaslcontext.c:729
#7 0x0000000000440c79 in virNetSocketFree (sock=0x1eef7b0) at rpc/virnetsocket.c:669
#8 0x00007f94ce692087 in virEventPollCleanupHandles () at util/event_poll.c:547
#9 0x00007f94ce692143 in virEventPollRunOnce () at util/event_poll.c:583
#10 0x00007f94ce6916f7 in virEventRunDefaultImpl () at util/event.c:247
#11 0x000000000043cb9d in virNetServerRun (srv=0x1ec8ae0) at rpc/virnetserver.c:662
#12 0x000000000041d968 in main (argc=<value optimized out>, argv=<value optimized out>) at libvirtd.c:1552

I wasn't able to reproduce the crash under valgrind, but it found some invalid free()s:

==32481== Invalid free() / delete / delete[]
==32481==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==32481==    by 0x4E599A8: virFree (memory.c:310)
==32481==    by 0x431B90: remoteClientFreeFunc (remote.c:406)
==32481==    by 0x43C0B0: virNetServerClientFree (virnetserverclient.c:531)
==32481==    by 0x43D5FC: virNetServerHandleJob (virnetserver.c:169)
==32481==    by 0x4E64F39: virThreadPoolWorker (threadpool.c:98)
==32481==    by 0x4E649A1: virThreadHelper (threads-pthread.c:157)
==32481==    by 0x3307E077E0: start_thread (pthread_create.c:301)
==32481==    by 0x33076E68EC: clone (clone.S:115)
==32481==  Address 0x117d5460 is 0 bytes inside a block of size 96 free'd
==32481==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==32481==    by 0x4E599A8: virFree (memory.c:310)
==32481==    by 0x431B90: remoteClientFreeFunc (remote.c:406)
==32481==    by 0x43C0B0: virNetServerClientFree (virnetserverclient.c:531)
==32481==    by 0x440DAC: virNetSocketEventFree (virnetsocket.c:1151)
==32481==    by 0x4E50086: virEventPollCleanupHandles (event_poll.c:547)
==32481==    by 0x4E50142: virEventPollRunOnce (event_poll.c:583)
==32481==    by 0x4E4F6F6: virEventRunDefaultImpl (event.c:247)
==32481==    by 0x43CB9C: virNetServerRun (virnetserver.c:662)
==32481==    by 0x41D967: main (libvirtd.c:1552)

Attached valgrind logs.

Created attachment 515294 [details]
valgrind logs
Created attachment 515297 [details]
Attempt to fix ref counting race condition
When an incoming RPC message is ready for processing,
virNetServerClientDispatchRead()
will invoke the 'dispatchFunc' callback. This is set to
virNetServerDispatchNewMessage
This function puts the message + client in a queue for processing by the thread pool. The thread pool worker function is
virNetServerHandleJob
The first thing this does is acquire an extra reference on the 'client'.
Unfortunately, between the time the message+client are put on the thread pool queue, and the time the worker runs, the client object may have had its last reference removed.
We clearly need to add the reference to the client object before putting the client on the processing queue.
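The reordering described above can be sketched like this. The names are simplified stand-ins, not the actual code in virnetserverclient.c / virnetserver.c:

```c
#include <stdlib.h>

/* Illustrative sketch of the ref-count fix: the dispatcher takes the
 * extra reference on the client *before* queueing the job, so the
 * worker thread can never run against a client whose last reference
 * was already dropped while the job sat in the queue. */
typedef struct {
    int refs;
} FakeClient;

void fake_client_ref(FakeClient *c)
{
    c->refs++;
}

/* Returns 1 when this call freed the client, 0 otherwise. */
int fake_client_unref(FakeClient *c)
{
    if (--c->refs == 0) {
        free(c);
        return 1;
    }
    return 0;
}

typedef struct {
    FakeClient *client;
} FakeJob;

/* Fixed ordering: take the reference at enqueue time, not in the
 * worker.  The buggy version deferred fake_client_ref() to
 * fake_handle_job(), leaving a window where the last reference
 * could be dropped before the worker ran. */
FakeJob *fake_dispatch_new_message(FakeClient *c)
{
    fake_client_ref(c);            /* the fix */
    FakeJob *job = malloc(sizeof(*job));
    job->client = c;
    return job;
}

void fake_handle_job(FakeJob *job)
{
    /* ... process the RPC message ... */
    fake_client_unref(job->client); /* drop the queue's reference */
    free(job);
}
```

The key invariant: between enqueue and worker execution, the queue itself holds a reference, so the client outlives any concurrent virNetServerClientFree from the event loop.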
Created attachment 515300 [details]
relevent part of libvirtd.log
David, I've created a scratch build for you: https://brewweb.devel.redhat.com/taskinfo?taskID=3517682
Could you please give it a try?

(In reply to comment #15)
> David,
>
> I've created a scratch build for you:
>
> https://brewweb.devel.redhat.com/taskinfo?taskID=3517682
>
> could you please give it a try?

Can't reproduce with this build.

commit 0302391ee643ad91fdc6d2ecf7e4cf0fc9724516
Author: Daniel P. Berrange <berrange>
Date: Tue Jul 26 15:49:15 2011 +0100
Fix race in ref counting when handling RPC jobs
When an incoming RPC message is ready for processing,
virNetServerClientDispatchRead()
will invoke the 'dispatchFunc' callback. This is set to
virNetServerDispatchNewMessage
This function puts the message + client in a queue for processing by the thread
pool. The thread pool worker function is
virNetServerHandleJob
The first thing this does is acquire an extra reference on the 'client'.
Unfortunately, between the time the message+client are put on the thread pool
queue, and the time the worker runs, the client object may have had its last
reference removed.
We clearly need to add the reference to the client object before putting the
client on the processing queue
* src/rpc/virnetserverclient.c: Add a reference to the client when
invoking the dispatch function
* src/rpc/virnetserver.c: Don't acquire a reference to the client
when in the worker thread
v0.9.4-rc1-6-g0302391
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1513.html
Description of problem: the following case was found on libvirt-0.9.3-7.el6.x86_64.

1) host runs a large number of domains (127)
2) vdsm service is restarted
3) libvirt crashes with the following backtrace:

#0 _int_malloc (av=0x323938be80, bytes=<value optimized out>) at malloc.c:4439
#1 0x0000003239079718 in __libc_calloc (n=<value optimized out>, elem_size=<value optimized out>) at malloc.c:4065
#2 0x000000323880acdf in _dl_new_object (realname=0x906b90 "/lib64/libgcc_s.so.1", libname=0x3239151d7e "libgcc_s.so.1", type=2, loader=0x0, mode=-1879048191, nsid=0) at dl-object.c:77
#3 0x0000003238806b5e in _dl_map_object_from_fd (name=0x3239151d7e "libgcc_s.so.1", fd=20, fbp=0x7fff64e41090, realname=0x906b90 "/lib64/libgcc_s.so.1", loader=0x0, l_type=2, mode=-1879048191, stack_endp=0x7fff64e413d8, nsid=0) at dl-load.c:960
#4 0x0000003238807cf8 in _dl_map_object (loader=0x0, name=0x3239151d7e "libgcc_s.so.1", type=2, trace_mode=0, mode=<value optimized out>, nsid=<value optimized out>) at dl-load.c:2247
#5 0x0000003238812969 in dl_open_worker (a=0x7fff64e415f0) at dl-open.c:226
#6 0x000000323880e0c6 in _dl_catch_error (objname=0x7fff64e41640, errstring=0x7fff64e41638, mallocedp=0x7fff64e4164f, operate=0x3238812840 <dl_open_worker>, args=0x7fff64e415f0) at dl-error.c:178
#7 0x000000323881241a in _dl_open (file=0x3239151d7e "libgcc_s.so.1", mode=-2147483647, caller_dlopen=0x0, nsid=-2, argc=3, argv=<value optimized out>, env=0x7fff64e42ac8) at dl-open.c:555
#8 0x00000032391236f0 in do_dlopen (ptr=0x7fff64e417e0) at dl-libc.c:86
#9 0x000000323880e0c6 in _dl_catch_error (objname=0x7fff64e41800, errstring=0x7fff64e417f8, mallocedp=0x7fff64e4180f, operate=0x32391236b0 <do_dlopen>, args=0x7fff64e417e0) at dl-error.c:178
#10 0x0000003239123847 in dlerror_run (name=<value optimized out>, mode=<value optimized out>) at dl-libc.c:47
#11 __libc_dlopen_mode (name=<value optimized out>, mode=<value optimized out>) at dl-libc.c:160
#12 0x00000032390fbed5 in init () at ../sysdeps/ia64/backtrace.c:41
#13 0x000000323980cb43 in pthread_once () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:104
#14 0x00000032390fbfd4 in __backtrace (array=<value optimized out>, size=64) at ../sysdeps/ia64/backtrace.c:85
#15 0x000000323906fdaf in __libc_message (do_abort=2, fmt=0x3239154ed8 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:166
#16 0x00000032390756b6 in malloc_printerr (action=3, str=0x3239155180 "double free or corruption (!prev)", ptr=<value optimized out>) at malloc.c:6283
#17 0x00007f7ed34239a9 in virFree (ptrptr=0x7fff64e42218) at util/memory.c:310
#18 0x000000000043c11d in virNetServerClientFree (client=0x908050) at rpc/virnetserverclient.c:542
#19 0x0000000000440dad in virNetSocketEventFree (opaque=0x8e2760) at rpc/virnetsocket.c:1151
#20 0x00007f7ed341a087 in virEventPollCleanupHandles () at util/event_poll.c:547
#21 0x00007f7ed341a143 in virEventPollRunOnce () at util/event_poll.c:583
#22 0x00007f7ed34196f7 in virEventRunDefaultImpl () at util/event.c:247
#23 0x000000000043cb9d in virNetServerRun (srv=0x8e1ae0) at rpc/virnetserver.c:662
#24 0x000000000041d968 in main (argc=<value optimized out>, argv=<value optimized out>) at libvirtd.c:1552
(gdb)

This issue reproduces consistently (4 times so far). Full core dumps were provided to the libvirt development team.

[root@nott-vds3 core]# virsh -r list
Segmentation fault