Description of problem:

We enabled tcmalloc (bug 1213882) in the hope of increasing performance. It turns out this blinds valgrind. To unblind it, you have to run it with --soname-synonyms='somalloc=*tcmalloc*'. See its documentation at <http://www.valgrind.org/docs/manual/manual-core.html#manual-core.rareopts>.

valgrind not working out of the box has already caused trouble:

* Bug 1253276 The assignee observed that the bug report looked like memory corruption and asked the reporter to retry with a malloc debugger, e.g. ElectricFence. The reporter immediately reached for valgrind, obviously without luck. He then had to manually rebuild with ElectricFence, but didn't get useful results.

* Bug 1205100 QE was unable to reproduce. After quite some head-scratching and a lot of time (> 5 weeks), tcmalloc was identified as the culprit, and QE verified with a qemu-kvm manually built without tcmalloc. Testing bits we won't ship instead of the bits we will is suboptimal, of course.

* Bug 1262670 The bug report looked like memory corruption. The assignee immediately reached for valgrind, which came up empty. After some head-scratching, he manually replaced tcmalloc with ElectricFence, which led to the bug. A working valgrind would have led to the bug faster.

* Bug 1264347 QE was unable to reproduce. The assignee (me) wondered why valgrind works in his local build, but not for QE, looked for differences, and found tcmalloc. Experiments showed valgrind works only when tcmalloc is disabled. I looked for a workaround and found --soname-synonyms='somalloc=*tcmalloc*'.

I'm afraid this will continue to be a trap for the unwary. It's very easy to forget the obscure --soname-synonyms option when you're up to your ass in the swamp fighting the alligators. We'll almost certainly waste time debugging with a blind valgrind until someone remembers the need to run it with this option. It's especially bad when we waste a customer's time that way.

Even worse, we risk invalid negative results: we test some patch with valgrind, and it comes up with a clean bill of health, so it must be good, right? Well, only if we didn't forget to run valgrind with that obscure option.

We weren't aware of this issue when we decided to enable tcmalloc (bug 1213882). Now the tradeoffs need to be evaluated anew. Specifically:

* Is the performance gain worth these risks?
* Can we somehow reduce the risks sufficiently?

As is, I am *very* uncomfortable shipping with tcmalloc enabled.

Version-Release number of selected component (if applicable):
tcmalloc got enabled in qemu-kvm-rhev-2.3.0-9.el7.

How reproducible:
Always

Steps to Reproduce:
Reproducer taken from bug 1264347; it requires a qemu-kvm with that bug not yet fixed, e.g. qemu-kvm-rhev-2.3.0-29.el7.

1. Start qemu-kvm under valgrind with QMP on stdin/stdout, e.g.

   $ valgrind qemu-kvm -nodefaults -S -display none -qmp stdio

2. Run qmp_capabilities to enter command mode

   { "execute": "qmp_capabilities" }

3. Run device-list-properties for a CPU device

   { "execute": "device-list-properties", "arguments": { "typename": T } }

   where T is the name of a CPU device such as "qemu64-x86_64-cpu".

4. Run device-list-properties for a CPU device again. The same command as in step 3 is fine.

Actual results:
valgrind doesn't report invalid reads or writes. qemu-kvm crashes (it may not; the crash can be nondeterministic).

Expected results:
valgrind reports invalid reads and writes.

Additional info:
Running valgrind with --soname-synonyms='somalloc=*tcmalloc*' yields the expected results.
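For convenience, here is the workaround applied to the reproducer above. It is just the step-1 command line with the option added; the QMP commands and the CPU type name stay the same:

   $ valgrind --soname-synonyms='somalloc=*tcmalloc*' \
         qemu-kvm -nodefaults -S -display none -qmp stdio

With this, valgrind intercepts tcmalloc's allocation functions and reports the invalid reads and writes as expected.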
The glibc team is still interested in working with the virt team on figuring out exactly where in glibc malloc we're losing performance, and on patching glibc to provide better performance. We recently found a free-list issue which could cause high arena lock contention if the workload creates and destroys threads on a regular basis (as opposed to using long-running threads).

In bug 693262 the average performance gain for tcmalloc was ~4%, so there is a rather small margin there for deciding one way or the other, particularly if valgrind isn't going to work well, and if you see regressions in other use cases (bug 1251353).

I'm not here to judge what the virt team should do; I'm here to help provide options if we want to continue with glibc's allocator and make a few enhancements to get back that ~4%.
We are currently fixing a glibc bug which may be related, particularly if you are switching away from glibc malloc due to high contention. qemu-kvm appears to be affected by this glibc bug when running on top of glibc malloc (not tcmalloc):

<https://sourceware.org/bugzilla/show_bug.cgi?id=19048>
<https://bugzilla.redhat.com/show_bug.cgi?id=1264189>

This bug can result in serious contention because the regular arena selection logic is completely bypassed, and many threads can end up hitting very few arenas (I've seen it go down to a single arena, see below).

I have written a short script which allows checking running processes non-destructively:

<https://sourceware.org/bugzilla/attachment.cgi?id=8718>

The bug is sticky. Once a process is in this state, it will remain in it until it exits, and the script above will show that the arena free list has turned cyclic.

I don't have many VMs to test, but one harmless one (mostly libvirt defaults, with one attached disk in raw format) has run into this bug:

(gdb) print free_list
$1 = (mstate) 0x7ff088000020
(gdb) print free_list->next_free == free_list
$10 = 1

And indeed, threads 2 and 3 share the same arena:

(gdb) thread apply all print __libc_tsd_MALLOC

Thread 5 (Thread 0x7ff16ed5a700 (LWP 20032)):
$5 = (void *) 0x7ff168000020

Thread 4 (Thread 0x7ff16e559700 (LWP 20033)):
$6 = (void *) 0x7ff160000020

Thread 3 (Thread 0x7ff16d9ff700 (LWP 16821)):
$7 = (void *) 0x7ff088000020

Thread 2 (Thread 0x7ff06affd700 (LWP 16822)):
$8 = (void *) 0x7ff088000020

Thread 1 (Thread 0x7ff18531ab00 (LWP 20030)):
$9 = (void *) 0x7ff17b763760 <main_arena>

Threads 2 and 3 are QEMU worker threads. I don't know what these threads are doing, or what the actual impact of increased malloc contention between the two threads is. This was observed with qemu-kvm-1.6.2-13.fc20.x86_64 (which I know is quite old). I can trigger the glibc bug pretty reliably with qemu-kvm-2.3.1-3.fc22.x86_64 while interacting with the VM through virt-manager (note that it's still the qemu-kvm process that runs into this).

It would be interesting to see if any of your performance tests hit this bug. If they do, I can provide fixed glibc scratch builds. If your workload is producer-consumer-style and you hit contention during deallocation, maybe there is something we can do inside glibc malloc to reduce that (while still being conservative enough to qualify for backporting).
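For a quick manual spot check (instead of the attached script), the gdb expressions above can be run against a live process. This is only a sketch: it assumes glibc debuginfo is installed and uses pgrep to pick a qemu-kvm PID, both of which you may need to adjust:

    $ pid=$(pgrep -f qemu-kvm | head -n1)
    $ gdb -batch -p "$pid" \
          -ex 'print free_list' \
          -ex 'print free_list->next_free == free_list' \
          -ex 'thread apply all print __libc_tsd_MALLOC'

A value of 1 for the second expression (a one-element cyclic free list), or several threads printing the same arena pointer, indicates the process has hit the bug.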
I attached a new check script to the upstream glibc bug: <https://sourceware.org/bugzilla/show_bug.cgi?id=19048> This version prints an error if no glibc debugging information is available (the old version would print nothing).
You can use the following options to get I/O patterns that match the interesting ones:

  -drive if=none,id=hd,file=$PATH,aio=native,format=raw
  -object iothread,id=io
  -device virtio-blk-pci,drive=hd,iothread=io

Some worker threads will still be created periodically, but they do not allocate any memory. All memory allocations happen in the thread whose start function is in iothread.c.

What we need is basically a very fast (a few hundred clock cycles), entirely thread-local path for reusing objects that have just been freed and are reallocated.
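Put together with the reproducer-style invocation from the description, a complete command line for exercising this I/O pattern could look roughly like the following; the disk image path is a placeholder, and you still need a guest that actually generates I/O on the virtio disk:

    $ qemu-kvm -nodefaults -display none \
          -drive if=none,id=hd,file=/var/tmp/test.raw,aio=native,format=raw \
          -object iothread,id=io \
          -device virtio-blk-pci,drive=hd,iothread=io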
(In reply to Paolo Bonzini from comment #6)
> You can use the following options to get I/O patterns that match the
> interesting ones:
>
>   -drive if=none,id=hd,file=$PATH,aio=native,format=raw
>   -object iothread,id=io
>   -device virtio-blk-pci,drive=hd,iothread=io
>
> Some worker threads will still be created periodically, but they do not
> allocate any memory. All memory allocations happen in the thread whose
> start function is in iothread.c.

I get the following backtraces:

#0  0x000056042924e4b0 in virtio_blk_alloc_request (s=s@entry=0x56042c90d590) at /usr/src/debug/qemu-2.3.1/hw/block/virtio-blk.c:32
#1  0x000056042924f7e4 in handle_notify (e=0x56042cb25108) at /usr/src/debug/qemu-2.3.1/hw/block/dataplane/virtio-blk.c:107
#2  0x0000560429454e88 in aio_dispatch (ctx=ctx@entry=0x56042b7bcd30) at aio-posix.c:158
#3  0x0000560429455062 in aio_poll (ctx=0x56042b7bcd30, blocking=<optimized out>) at aio-posix.c:248
#4  0x00005604292ec569 in iothread_run (opaque=0x56042b7ba020) at iothread.c:44

#0  0x000056042924e5a0 in virtio_blk_free_request (req=req@entry=0x7f0b58218770) at /usr/src/debug/qemu-2.3.1/hw/block/virtio-blk.c:44
#1  0x000056042924f80a in handle_notify (e=0x56042cb25108) at /usr/src/debug/qemu-2.3.1/hw/block/dataplane/virtio-blk.c:111
#2  0x0000560429454e88 in aio_dispatch (ctx=ctx@entry=0x56042b7bcd30) at aio-posix.c:158
#3  0x0000560429455062 in aio_poll (ctx=0x56042b7bcd30, blocking=<optimized out>) at aio-posix.c:248
#4  0x00005604292ec569 in iothread_run (opaque=0x56042b7ba020) at iothread.c:44

Does this look right? Allocation and deallocation happen on the same thread. Did I pick the right allocation/deallocation functions?
I maintain valgrind and would like to make it so that, at least on Fedora/RHEL/DTS, usage of tcmalloc/jemalloc as a library or statically linked into the executable is detected by valgrind automatically. But what the best way to do that is remains a bit of a question.

valgrind doesn't really know about ELF symbol interposition; it matches the symbols against specific library DT_SONAMEs. There is a simple replacement-matching mechanism to tell valgrind to match the "somalloc" functions against a replacement library soname. As used in the description, --soname-synonyms='somalloc=*tcmalloc*' intercepts the somalloc functions in any library whose soname matches the pattern "*tcmalloc*". To intercept the somalloc functions in the main executable, one would use --soname-synonyms=somalloc=NONE. So if users know which library (or the executable itself) interposes an alternative malloc/free implementation, they can make valgrind aware of it.

The simplest option might be to make valgrind aware of tcmalloc, jemalloc and NONE by default. I think there is a small risk when using NONE if an executable plays tricks interposing some symbols without really overriding them (maybe to keep statistics). But there should not be much/any risk in always intercepting tcmalloc or jemalloc shared library symbols.

A few questions about this bug:

- Should this bug be assigned to valgrind?
- What other packages should we make sure work with this? I found the following, but except for firefox none are in main RHEL, only in Fedora/EPEL; maybe they are in layered products? I might have missed some, since finding packages that statically link tcmalloc/jemalloc is not easy as far as I can tell.
  - qemu-kvm-rhev (this bug)
  - firefox (statically links jemalloc, but might need some other tricks to really work under valgrind. Upstream tells me that only running debug builds under valgrind is supported.)
  - varnish (jemalloc, EPEL)
  - nfs-ganesha (jemalloc, EPEL)
  - ceph (tcmalloc, EPEL)
  - nginx (tcmalloc, EPEL)
  - mongodb (tcmalloc, EPEL)
  - Pound (tcmalloc, EPEL)
  - redis (jemalloc, EPEL)

And a question on how to actually get qemu-kvm-rhev: I have RHEL7 installed on my workstation through RHN with the Employee SKU, but I couldn't figure out how to install qemu-kvm-rhev. So I cheated and pulled it out of brew (which does work, and I can replicate the issue and workarounds).
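For concreteness, these are the two invocations users currently have to know about; the binary names below are just placeholders:

    # alternative allocator provided by a shared library whose soname matches *tcmalloc*
    $ valgrind --soname-synonyms='somalloc=*tcmalloc*' ./app-linked-against-libtcmalloc

    # alternative allocator statically linked into the main executable
    $ valgrind --soname-synonyms=somalloc=NONE ./app-with-builtin-allocator

Making tcmalloc, jemalloc and NONE known by default would remove the need for either option in the common cases.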
Mark, as far as I know tcmalloc is not used outside virt in RHEL itself, only in layered products. Ceph is in Red Hat Storage and MongoDB is in Satellite.
The glibc performance aspect is investigated in bug 1275472.
(In reply to Florian Weimer from comment #12) > The glibc performance aspect is investigated in bug 1275472. OK, then if people don't mind I'll take this as a valgrind enhancement bug to make sure valgrind will recognize alternative malloc implementations by default in the future. I am already working on an (upstream) patch.
I posted an upstream patch: https://bugs.kde.org/show_bug.cgi?id=355188
Patch pushed upstream as valgrind svn r15726 and backported to Fedora rawhide valgrind-3.11.0-5.
Already backported and being tested in Fedora, should be good to go.
Perhaps it would be worth trying to compile QEMU with the --disable-tcmalloc flag, which disables tcmalloc usage. Maybe some tools like sanitizers/valgrind are not that comfortable with a non-standard (non-glibc) memory allocator.
(In reply to Frediano Ziglio from comment #19)
> Perhaps it would be worth trying to compile QEMU with the --disable-tcmalloc
> flag, which disables tcmalloc usage. Maybe some tools like sanitizers/valgrind
> are not that comfortable with a non-standard (non-glibc) memory allocator.

Sorry, wrong bug, please ignore this comment.
(In reply to Frediano Ziglio from comment #20)
> (In reply to Frediano Ziglio from comment #19)
> > Perhaps it would be worth trying to compile QEMU with the --disable-tcmalloc
> > flag, which disables tcmalloc usage. Maybe some tools like sanitizers/valgrind
> > are not that comfortable with a non-standard (non-glibc) memory allocator.
>
> Sorry, wrong bug, please ignore this comment.

No worries. This bug is actually about making sure valgrind automagically detects tcmalloc. The code is already in the valgrind package for Fedora 23. So if your issue was with valgrind, please do try that version to see whether it solves your problem without having to rebuild with --disable-tcmalloc. Thanks.
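As a quick sanity check, you can verify that your valgrind carries the change and then run the original reproducer without any --soname-synonyms option. The package versions below are the ones mentioned earlier in this bug, so treat them as the minimum needed rather than an exact requirement:

    $ rpm -q valgrind
    # Fedora: valgrind-3.11.0-5 or newer; RHEL 7: valgrind-3.11.0-22.el7 or newer
    $ valgrind qemu-kvm -nodefaults -S -display none -qmp stdio

If the automatic detection works, valgrind should report the invalid reads/writes even though qemu-kvm is linked against tcmalloc.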
QA note: the test case in #c18 is wrapmalloc*. Verified for build valgrind-3.11.0-22.el7.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2297.html