Description of problem:

We enabled tcmalloc (bug 1213882) in the hope of increasing performance. It turns out this blinds valgrind. To unblind it, you have to run it with --soname-synonyms='somalloc=*tcmalloc*'. See its documentation at <http://www.valgrind.org/docs/manual/manual-core.html#manual-core.rareopts>.

valgrind not working out of the box has already caused trouble:

* Bug 1253276 The assignee observed that the bug report looked like memory corruption and asked the reporter to retry with a malloc debugger, e.g. ElectricFence. The reporter immediately reached for valgrind, obviously without luck. He then had to manually rebuild with ElectricFence, but didn't get useful results.

* Bug 1205100 QE was unable to reproduce. After quite some head-scratching and a lot of time (> 5 weeks), tcmalloc was identified as the culprit, and QE verified with a qemu-kvm manually built without tcmalloc. Testing bits we won't ship instead of the bits we will is suboptimal, of course.

* Bug 1262670 The bug report looked like memory corruption. The assignee immediately reached for valgrind, which came up empty. After some head-scratching, he manually replaced tcmalloc with ElectricFence, which led to the bug. A working valgrind would have led to the bug faster.

* Bug 1264347 QE was unable to reproduce. The assignee (me) wondered why valgrind works in his local build, but not for QE, looked for differences, and found tcmalloc. Experiments showed valgrind works only when tcmalloc is disabled. I looked for a workaround and found --soname-synonyms='somalloc=*tcmalloc*'.

I'm afraid this will continue to be a trap for the unwary. It's very easy to forget the obscure --soname-synonyms option when you're up to your ass in the swamp fighting the alligators. We'll almost certainly waste time debugging with a blind valgrind until someone remembers the need to run it with this option. It's especially bad when we waste a customer's time that way.

Even worse, we risk invalid negative results: we test some patch with valgrind, and it comes up with a clean bill of health, so it must be good, right? Well, only if we didn't forget to run valgrind with that obscure option.

We weren't aware of this issue when we decided to enable tcmalloc (bug 1213882). Now the tradeoffs need to be evaluated anew. Specifically:

* Is the performance gain worth these risks?
* Can we somehow reduce the risks sufficiently?

As is, I am *very* uncomfortable shipping with tcmalloc enabled.

Version-Release number of selected component (if applicable):
tcmalloc got enabled in qemu-kvm-rhev-2.3.0-9.el7.

How reproducible:
Always

Steps to Reproduce:
Reproducer taken from bug 1264347; it requires a qemu-kvm with that bug not yet fixed, e.g. qemu-kvm-rhev-2.3.0-29.el7.

1. Start qemu-kvm under valgrind with QMP on stdin/stdout, e.g.

   $ valgrind qemu-kvm -nodefaults -S -display none -qmp stdio

2. Run qmp_capabilities to enter command mode

   { "execute": "qmp_capabilities" }

3. Run device-list-properties for a CPU device

   { "execute": "device-list-properties", "arguments": { "typename": T } }

   where T is the name of a CPU device such as "qemu64-x86_64-cpu".

4. Run device-list-properties for a CPU device again. The same command as in step 3 is fine.

Actual results:
valgrind doesn't report invalid reads or writes. qemu-kvm crashes (it may not; the crash can be nondeterministic).

Expected results:
valgrind reports invalid reads and writes.

Additional info:
Running valgrind with --soname-synonyms='somalloc=*tcmalloc*' yields the expected results.
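For convenience, here is the workaround applied to the reproducer above. It is just the step-1 command line with the option added; the QMP commands and the CPU type name stay the same:

   $ valgrind --soname-synonyms='somalloc=*tcmalloc*' \
         qemu-kvm -nodefaults -S -display none -qmp stdio

With this, valgrind intercepts tcmalloc's allocation functions and reports the invalid reads and writes as expected.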
The glibc team is still interested in working with the virt team on figuring out exactly where in glibc malloc we're losing performance, and on patching glibc to provide better performance. We recently found a free-list issue which could cause high arena lock contention if the workload creates and destroys threads on a regular basis (as opposed to using long-running threads).

In bug 693262 the average performance gain for tcmalloc was ~4%, so there is a rather small margin there for deciding one way or the other, particularly if valgrind isn't going to work well, and if you see regressions in other use cases (bug 1251353).

I'm not here to judge what the virt team should do; I'm here to help provide options if we want to continue with glibc's allocator and make a few enhancements to get back that ~4%.
We are currently fixing a glibc bug which may be related, particularly if you are switching away from glibc malloc due to high contention. qemu-kvm appears to be affected by this glibc bug when running on top of glibc malloc (not tcmalloc):

<https://sourceware.org/bugzilla/show_bug.cgi?id=19048>
<https://bugzilla.redhat.com/show_bug.cgi?id=1264189>

This bug can result in serious contention because the regular arena selection logic is completely bypassed, and many threads can end up hitting very few arenas (I've seen it go down to a single arena, see below).

I have written a short script which allows checking running processes non-destructively:

<https://sourceware.org/bugzilla/attachment.cgi?id=8718>

The bug is sticky. Once a process is in this state, it will remain in it until it exits, and the script above will show that the arena free list has turned cyclic.

I don't have many VMs to test, but one harmless one (mostly libvirt defaults, with one attached disk in raw format) has run into this bug:

(gdb) print free_list
$1 = (mstate) 0x7ff088000020
(gdb) print free_list->next_free == free_list
$10 = 1

And indeed, threads 2 and 3 share the same arena:

(gdb) thread apply all print __libc_tsd_MALLOC

Thread 5 (Thread 0x7ff16ed5a700 (LWP 20032)):
$5 = (void *) 0x7ff168000020

Thread 4 (Thread 0x7ff16e559700 (LWP 20033)):
$6 = (void *) 0x7ff160000020

Thread 3 (Thread 0x7ff16d9ff700 (LWP 16821)):
$7 = (void *) 0x7ff088000020

Thread 2 (Thread 0x7ff06affd700 (LWP 16822)):
$8 = (void *) 0x7ff088000020

Thread 1 (Thread 0x7ff18531ab00 (LWP 20030)):
$9 = (void *) 0x7ff17b763760 <main_arena>

Threads 2 and 3 are QEMU worker threads. I don't know what these threads are doing, or what the actual impact of increased malloc contention between the two threads is. This was observed with qemu-kvm-1.6.2-13.fc20.x86_64 (which I know is quite old). I can trigger the glibc bug pretty reliably with qemu-kvm-2.3.1-3.fc22.x86_64 while interacting with the VM through virt-manager (note that it's still the qemu-kvm process that runs into this).

It would be interesting to see if any of your performance tests hit this bug. If they do, I can provide fixed glibc scratch builds. If your workload is producer-consumer-style and you hit contention during deallocation, maybe there is something we can do inside glibc malloc to reduce that (while still being conservative enough to qualify for backporting).
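For a quick manual spot check (instead of the attached script), the gdb expressions above can be run against a live process. This is only a sketch: it assumes glibc debuginfo is installed and uses pgrep to pick a qemu-kvm PID, both of which you may need to adjust:

    $ pid=$(pgrep -f qemu-kvm | head -n1)
    $ gdb -batch -p "$pid" \
          -ex 'print free_list' \
          -ex 'print free_list->next_free == free_list' \
          -ex 'thread apply all print __libc_tsd_MALLOC'

A value of 1 for the second expression (a one-element cyclic free list), or several threads printing the same arena pointer, indicates the process has hit the bug.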
I attached a new check script to the upstream glibc bug: <https://sourceware.org/bugzilla/show_bug.cgi?id=19048> This version prints an error if no glibc debugging information is available (the old version would print nothing).
You can use the following options to get I/O patterns that match the interesting ones:

  -drive if=none,id=hd,file=$PATH,aio=native,format=raw
  -object iothread,id=io
  -device virtio-blk-pci,drive=hd,iothread=io

Some worker threads will still be created periodically, but they do not allocate any memory. All memory allocations happen in the thread whose start function is in iothread.c.

What we need is basically a very fast (a few hundred clock cycles), entirely thread-local path for reusing objects that have just been freed and are reallocated.
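Put together with the reproducer-style invocation from the description, a complete command line for exercising this I/O pattern could look roughly like the following; the disk image path is a placeholder, and you still need a guest that actually generates I/O on the virtio disk:

    $ qemu-kvm -nodefaults -display none \
          -drive if=none,id=hd,file=/var/tmp/test.raw,aio=native,format=raw \
          -object iothread,id=io \
          -device virtio-blk-pci,drive=hd,iothread=io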
(In reply to Paolo Bonzini from comment #6)
> You can use the following options to get I/O patterns that match the
> interesting ones:
>
>   -drive if=none,id=hd,file=$PATH,aio=native,format=raw
>   -object iothread,id=io
>   -device virtio-blk-pci,drive=hd,iothread=io
>
> Some worker threads will still be created periodically, but they do not
> allocate any memory. All memory allocations happen in the thread whose
> start function is in iothread.c.

I get the following backtraces:

#0  0x000056042924e4b0 in virtio_blk_alloc_request (s=s@entry=0x56042c90d590) at /usr/src/debug/qemu-2.3.1/hw/block/virtio-blk.c:32
#1  0x000056042924f7e4 in handle_notify (e=0x56042cb25108) at /usr/src/debug/qemu-2.3.1/hw/block/dataplane/virtio-blk.c:107
#2  0x0000560429454e88 in aio_dispatch (ctx=ctx@entry=0x56042b7bcd30) at aio-posix.c:158
#3  0x0000560429455062 in aio_poll (ctx=0x56042b7bcd30, blocking=<optimized out>) at aio-posix.c:248
#4  0x00005604292ec569 in iothread_run (opaque=0x56042b7ba020) at iothread.c:44

#0  0x000056042924e5a0 in virtio_blk_free_request (req=req@entry=0x7f0b58218770) at /usr/src/debug/qemu-2.3.1/hw/block/virtio-blk.c:44
#1  0x000056042924f80a in handle_notify (e=0x56042cb25108) at /usr/src/debug/qemu-2.3.1/hw/block/dataplane/virtio-blk.c:111
#2  0x0000560429454e88 in aio_dispatch (ctx=ctx@entry=0x56042b7bcd30) at aio-posix.c:158
#3  0x0000560429455062 in aio_poll (ctx=0x56042b7bcd30, blocking=<optimized out>) at aio-posix.c:248
#4  0x00005604292ec569 in iothread_run (opaque=0x56042b7ba020) at iothread.c:44

Does this look right? Allocation and deallocation happen on the same thread. Did I pick the right allocation/deallocation functions?
I maintain valgrind and would like to make it so that, at least on Fedora/RHEL/DTS, usage of tcmalloc/jemalloc as a library or statically linked into the executable is detected by valgrind automatically. But what the best way to do that is remains a bit of a question.

valgrind doesn't really know about ELF symbol interposition; it matches the symbols against specific library DT_SONAMEs. There is a simple replacement-matching mechanism to tell valgrind to match the "somalloc" functions against a replacement library soname. As used in the description, --soname-synonyms='somalloc=*tcmalloc*' intercepts the somalloc functions in any library whose soname matches the pattern "*tcmalloc*". To intercept the somalloc functions in the main executable, one would use --soname-synonyms=somalloc=NONE. So if users know which library (or the executable itself) interposes an alternative malloc/free implementation, they can make valgrind aware of it.

The simplest option might be to make valgrind aware of tcmalloc, jemalloc and NONE by default. I think there is a small risk when using NONE if an executable plays tricks interposing some symbols without really overriding them (maybe to keep statistics). But there should not be much/any risk in always intercepting tcmalloc or jemalloc shared library symbols.

A few questions about this bug:

- Should this bug be assigned to valgrind?
- What other packages should we make sure work with this? I found the following, but except for firefox none are in main RHEL, only in Fedora/EPEL; maybe they are in layered products? I might have missed some, since finding packages that statically link tcmalloc/jemalloc is not easy as far as I can tell.
  - qemu-kvm-rhev (this bug)
  - firefox (statically links jemalloc, but might need some other tricks to really work under valgrind. Upstream tells me that only running debug builds under valgrind is supported.)
  - varnish (jemalloc, EPEL)
  - nfs-ganesha (jemalloc, EPEL)
  - ceph (tcmalloc, EPEL)
  - nginx (tcmalloc, EPEL)
  - mongodb (tcmalloc, EPEL)
  - Pound (tcmalloc, EPEL)
  - redis (jemalloc, EPEL)

And a question on how to actually get qemu-kvm-rhev: I have RHEL7 installed on my workstation through RHN with the Employee SKU, but I couldn't figure out how to install qemu-kvm-rhev. So I cheated and pulled it out of brew (which does work, and I can replicate the issue and workarounds).
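For concreteness, these are the two invocations users currently have to know about; the binary names below are just placeholders:

    # alternative allocator provided by a shared library whose soname matches *tcmalloc*
    $ valgrind --soname-synonyms='somalloc=*tcmalloc*' ./app-linked-against-libtcmalloc

    # alternative allocator statically linked into the main executable
    $ valgrind --soname-synonyms=somalloc=NONE ./app-with-builtin-allocator

Making tcmalloc, jemalloc and NONE known by default would remove the need for either option in the common cases.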
Mark, as far as I know tcmalloc is not used outside virt in RHEL itself, only in layered products. Ceph is in Red Hat Storage and MongoDB is in Satellite.
The glibc performance aspect is investigated in bug 1275472.
(In reply to Florian Weimer from comment #12) > The glibc performance aspect is investigated in bug 1275472. OK, then if people don't mind I'll take this as a valgrind enhancement bug to make sure valgrind will recognize alternative malloc implementations by default in the future. I am already working on an (upstream) patch.
I posted an upstream patch: https://bugs.kde.org/show_bug.cgi?id=355188
Patch pushed upstream as valgrind svn r15726 and backported to Fedora rawhide valgrind-3.11.0-5.
Already backported and being tested in Fedora, should be good to go.
Perhaps it would be worth trying to compile QEMU with the --disable-tcmalloc flag, which disables tcmalloc usage. Maybe some tools like sanitizers/valgrind are not that comfortable with a non-standard (non-glibc) memory allocator.
(In reply to Frediano Ziglio from comment #19)
> Perhaps it would be worth trying to compile QEMU with the --disable-tcmalloc
> flag, which disables tcmalloc usage. Maybe some tools like sanitizers/valgrind
> are not that comfortable with a non-standard (non-glibc) memory allocator.

Sorry, wrong bug, please ignore this comment.
(In reply to Frediano Ziglio from comment #20)
> (In reply to Frediano Ziglio from comment #19)
> > Perhaps it would be worth trying to compile QEMU with the --disable-tcmalloc
> > flag, which disables tcmalloc usage. Maybe some tools like sanitizers/valgrind
> > are not that comfortable with a non-standard (non-glibc) memory allocator.
>
> Sorry, wrong bug, please ignore this comment.

No worries. This bug is actually about making sure valgrind automagically detects tcmalloc. The code is already in the valgrind package for Fedora 23. So if your issue was with valgrind, please do try that version to see whether it solves your problem without having to rebuild with --disable-tcmalloc. Thanks.
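As a quick sanity check, you can verify that your valgrind carries the change and then run the original reproducer without any --soname-synonyms option. The package versions below are the ones mentioned earlier in this bug, so treat them as the minimum needed rather than an exact requirement:

    $ rpm -q valgrind
    # Fedora: valgrind-3.11.0-5 or newer; RHEL 7: valgrind-3.11.0-22.el7 or newer
    $ valgrind qemu-kvm -nodefaults -S -display none -qmp stdio

If the automatic detection works, valgrind should report the invalid reads/writes even though qemu-kvm is linked against tcmalloc.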
QA note: the test case in #c18 is wrapmalloc*. Verified for build valgrind-3.11.0-22.el7.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2297.html