Bug 1452813 - Programs segfault when linked to libtcmalloc: Relink `<...>' with `/lib64/libtcmalloc.so.4' for IFUNC symbol `_ZdlPvm'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: gperftools
Version: rawhide
Hardware: ppc64
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Tom "spot" Callaway
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 1453099
Depends On:
Blocks: TRACKER-bugs-affecting-libguestfs PPCTracker
Reported: 2017-05-19 17:11 UTC by Richard W.M. Jones
Modified: 2017-06-09 19:09 UTC (History)

Fixed In Version: gperftools-2.5.93-1.fc26
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-09 19:09:35 UTC
Type: Bug
Embargoed:


Attachments
build.sh (874 bytes, text/plain)
2017-05-22 22:14 UTC, Richard W.M. Jones


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1453099 0 unspecified CLOSED mongodb-3.4.3-1.fc27 FTBFS: tests fail 2021-02-22 00:41:40 UTC

Internal Links: 1453099

Description Richard W.M. Jones 2017-05-19 17:11:04 UTC
Description of problem:

Simply running /usr/bin/qemu-system-ppc64 -help segfaults.
The host is ppc64 or ppc64le.

See the builds here:
https://koji.fedoraproject.org/koji/taskinfo?taskID=19626405
https://koji.fedoraproject.org/koji/taskinfo?taskID=19626407

checking for qemu-system-ppc64... /usr/bin/qemu-system-ppc64
checking that /usr/bin/qemu-system-ppc64 -help works... no
./configure: line 64597: 22842 Segmentation fault      (core dumped) $QEMU -help 1>&5 2>&1
configure: error: in `/builddir/build/BUILD/libguestfs-1.37.14':
configure: error: /usr/bin/qemu-system-ppc64 -help: command failed.
This could be a very old version of qemu, or qemu might not be
working.

Unfortunately I don't have access to the hosts right now so I
do not know any more details (eg stack trace).

Version-Release number of selected component (if applicable):

qemu 2:2.9.0-1.fc27 (on both architectures)

How reproducible:

100%

Steps to Reproduce:
1. Run: /usr/bin/qemu-system-ppc64 -help

Comment 1 Richard W.M. Jones 2017-05-20 09:27:38 UTC
I was able to reproduce this on emulated hardware.  The problem
happens in an ifunc during ELF relocations:

(gdb) run -help
Starting program: /usr/bin/qemu-system-ppc64 -help

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00003fffb5aa8640 in ?? ()
#2  0x00003fffb5aaa544 in ?? ()
#3  0x00003fffb7fba268 in resolve_ifunc (sym_map=0x3fffb5bb0000, 
    map=<optimized out>, value=70367497069760)
    at ../sysdeps/powerpc/powerpc64/dl-machine.h:666
#4  elf_machine_rela (skip_ifunc=<optimized out>, 
    reloc_addr_arg=0x3fffb5afecd0, version=<optimized out>, 
    sym=<optimized out>, reloc=0x3fffb5aa1a78, map=0x3fffb5bb0000)
    at ../sysdeps/powerpc/powerpc64/dl-machine.h:708
#5  elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=<optimized out>, 
    nrelative=<optimized out>, relsize=<optimized out>, 
    reladdr=<optimized out>, map=<optimized out>) at do-rel.h:137
#6  _dl_relocate_object (scope=0x3fffb5bb0378, reloc_mode=<optimized out>, 
    consider_profiling=<optimized out>) at dl-reloc.c:259
#7  0x00003fffb7fa588c in dl_main (phdr=<optimized out>, 
    phnum=<optimized out>, user_entry=<optimized out>, auxv=<optimized out>)
    at rtld.c:2047
#8  0x00003fffb7fd18b4 in _dl_sysdep_start (start_argptr=<optimized out>, 
    dl_main=0x3fffb7fa27d0 <dl_main>) at ../elf/dl-sysdep.c:253
#9  0x00003fffb7fa1de8 in _dl_start_final (arg=0x3ffffffff270, 
    info=0x3fffffffecd0) at rtld.c:303
#10 0x00003fffb7fa74b4 in _dl_start (arg=0x3ffffffff270) at rtld.c:411
#11 0x00003fffb7fa1578 in _start () from /lib64/ld64.so.2

Comment 2 Richard W.M. Jones 2017-05-20 09:33:50 UTC
Possibly the library which is failing is /lib64/libtcmalloc.so.4

The symbol which is being relocated may be _ZdlPvm.

The last few lines of LD_DEBUG output before the crash are:

     12630:     symbol=_ZdlPvm;  lookup in file=/lib64/libtcmalloc.so.4 [0]
     12630:     binding file /lib64/libtcmalloc.so.4 [0] to /lib64/libtcmalloc.so.4 [0]: normal symbol `_ZdlPvm'

Comment 3 Richard W.M. Jones 2017-05-22 08:18:47 UTC
I bumped the release and rebuilt libtcmalloc
(https://koji.fedoraproject.org/koji/taskinfo?taskID=19684296)
but I already suspect that was the wrong thing to do.  I suspect
we instead need to rebuild the dependent packages (i.e. qemu,
mongodb, and possibly many more).

I will rebuild qemu shortly.

Comment 4 Florian Weimer 2017-05-22 08:26:13 UTC
This looks like bug 1312462 has resurfaced.

Comment 5 Richard W.M. Jones 2017-05-22 08:58:02 UTC
Bumped and rebuilt qemu:
https://koji.fedoraproject.org/koji/taskinfo?taskID=19684916

This is just a test to see if it needs to be rebuilt against
the newer tcmalloc which was released on May 15th.

Comment 6 Richard W.M. Jones 2017-05-22 09:09:41 UTC
*** Bug 1453099 has been marked as a duplicate of this bug. ***

Comment 7 Richard W.M. Jones 2017-05-22 09:27:13 UTC
The qemu rebuild failed on ppc64 when it tried to run
the just-built qemu-system-ppc64 command.  So it's not just a
simple matter of rebuilding dependencies.  It's an actual bug in
tcmalloc.  As Florian notes, it's most likely that bug 1312462 has
reappeared.

Comment 8 Aliaksei Kandratsenka 2017-05-22 10:05:25 UTC
Thanks for cc-ing me in. And even more thanks for quickly importing the release candidate of gperftools.

Yes, I've re-enabled the ifunc-driven runtime switch for sized-delete support in tcmalloc. The change I've made compared to the previous time is to avoid calling any libc functions (like strlen or strcmp). My understanding is that since no calls to any functions should happen, we should be immune to the "ifunc handler cannot call anything" problem. So I am curious what exactly is going on. Perhaps I've missed something. Can you post a symbolized backtrace of the crash?
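
The idea described in this comment (deciding at load time without calling libc functions such as strlen or strcmp) can be illustrated with a hand-rolled environment lookup. This is only a sketch under assumptions, not the gperftools code: the names lookup_env_nolibc and sized_delete_requested and the variable DEMO_SIZED_DELETE are hypothetical, and the environment array is passed in explicitly instead of being read from __environ.

```c
#include <stddef.h>

/* Hypothetical helper: return the value of NAME in ENVP, or NULL.
 * Uses plain loops only, so the source makes no calls to strlen,
 * strcmp or getenv. */
const char *lookup_env_nolibc(char *const *envp, const char *name)
{
    if (envp == NULL || name == NULL)
        return NULL;

    for (; *envp != NULL; envp++) {
        const char *e = *envp;
        const char *n = name;

        /* Advance while the entry matches the requested name. */
        while (*n != '\0' && *e == *n) {
            e++;
            n++;
        }
        /* A match requires the whole name followed by '='. */
        if (*n == '\0' && *e == '=')
            return e + 1;
    }
    return NULL;
}

/* Hypothetical policy: the feature is requested when DEMO_SIZED_DELETE
 * is set to a non-empty value other than "0". Returns 1 or 0. */
int sized_delete_requested(char *const *envp)
{
    const char *v = lookup_env_nolibc(envp, "DEMO_SIZED_DELETE");

    return v != NULL && v[0] != '\0' && !(v[0] == '0' && v[1] == '\0');
}
```

Two caveats apply. A compiler may recognize such loops at some optimization levels and emit library calls anyway, so the generated code would need checking. And avoiding function calls does not remove the dependency on data relocations: a resolver that reads a global such as __environ still relies on the dynamic linker having processed that relocation, which is what the later comments in this bug turn out to be about.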

Comment 9 Aliaksei Kandratsenka 2017-05-22 10:07:05 UTC
Also, the ifunc stuff can be disabled by passing --disable-dynamic-sized-delete-support to configure while we debug.

Comment 10 Aliaksei Kandratsenka 2017-05-22 10:32:19 UTC
Does this patch help: https://gist.github.com/alk/d97b2df483dfc512621385c53bd6f63f ? I suspect it might, but maybe I am too naive anyways.

Comment 11 Richard W.M. Jones 2017-05-22 10:43:22 UTC
(In reply to Aliaksei Kandratsenka from comment #10)
> Does this patch help:
> https://gist.github.com/alk/d97b2df483dfc512621385c53bd6f63f ? I suspect it
> might, but maybe I am too naive anyways.

I will try a few things.  It's rather slow going because I have
to test everything under emulation.

Comment 12 Tom "spot" Callaway 2017-05-22 15:21:25 UTC
(In reply to Richard W.M. Jones from comment #11)
> (In reply to Aliaksei Kandratsenka from comment #10)
> > Does this patch help:
> > https://gist.github.com/alk/d97b2df483dfc512621385c53bd6f63f ? I suspect it
> > might, but maybe I am too naive anyways.
> 
> I will try a few things.  It's rather slow going because I have
> to test everything under emulation.

Richard, do you want me to apply his patch in Comment 10 and do a new build? Alternately, if you need me to pass --disable-dynamic-sized-delete-support for now, let me know.

Comment 13 Richard W.M. Jones 2017-05-22 16:50:09 UTC
Sorry about the delays  - it's very painful building gperftools under
emulation.  However I also used scratch builds in Koji to answer the
questions above:

(In reply to Aliaksei Kandratsenka from comment #9)
> Also ifunc stuff can be disabled by passing
> --disable-dynamic-sized-delete-support to configure while we debug.

Yes, this DOES fix the problem (not surprisingly, really).

Do we lose very much by disabling this?  If C++ code uses
sized delete + tcmalloc, will it fail to {compile|run}?

(In reply to Aliaksei Kandratsenka from comment #10)
> Does this patch help:
> https://gist.github.com/alk/d97b2df483dfc512621385c53bd6f63f ? I suspect it
> might, but maybe I am too naive anyways.

No this patch does NOT fix the problem.

Comment 14 Aliaksei Kandratsenka 2017-05-22 17:19:01 UTC
Thanks. Is it ppc-only now? And is -Wl,-z,now necessary and sufficient (and maybe also relro)?

I'll need some help debugging further. There are no plt calls in the new code and no calls to ifunc-ed functions either. So unless "you cannot call anything at all from an ifunc handler on ppc (must inline everything)" holds, I cannot see how it may fail. So there is some general value in debugging this further, in that it would clarify ifunc semantics.

And yes, I can disable this feature upstream (say whitelist arm64 and x86 where I can test).

Comment 15 Richard W.M. Jones 2017-05-22 18:42:37 UTC
(In reply to Aliaksei Kandratsenka from comment #14)
> Thanks. Is it ppc-only now? And is -Wl,-z,now necessary and sufficient (and
> maybe also relro)?

It is ppc64 and ppc64le only.

It is NOT related to -z now or any other special linker flag.  Merely
linking to -ltcmalloc is sufficient.

I'm not able to reproduce this with the upstream gperftools (from git),
but still trying ...

Comment 16 Richard W.M. Jones 2017-05-22 22:14:56 UTC
Created attachment 1281209
build.sh

Well, I tried to reproduce what we see with the Fedora package
using the upstream git repo, and I cannot reproduce it.

This may or may not be surprising - it may be that the ifunc
problem depends in great detail on some aspect of the precise order
in which the libtcmalloc.so library is linked together at build time.

Anyway, attached is the build.sh script I was using to try to
reproduce this (on ppc64le hardware), in case someone else wants
to have a go.

Comment 17 Aliaksei Kandratsenka 2017-05-22 22:57:16 UTC
Then let's debug the specific crash that triggered this ticket. Is there any way to get symbol names in the crash?

Comment 18 Aliaksei Kandratsenka 2017-05-22 23:01:01 UTC
Hm. So apparently the duplicate ticket #1453099 is crashing on amd64. So perhaps debugging that crash would be easier. Is there some easy way for me to reproduce #1453099, say within docker?

Comment 19 Aliaksei Kandratsenka 2017-05-23 01:55:15 UTC
Thanks again for raising it. I can reproduce the problem on debian sid amd64 by adding LDFLAGS='-Wl,-z,now -Wl,-z,relro' and running unit tests.

The specific issue is that the __environ relocation is not available during ifunc handler invocation.

So this is hopeless indeed. I will disable this feature again. Sorry for the noise (but please consider fixing and expanding scope of ifunc; it would be nice if all normal relocations could be done before ifunc resolutions start).

Comment 20 Richard W.M. Jones 2017-05-23 07:42:44 UTC
Fixed upstream by:

commit f2bae51e7e609855c26095f14ffbb84082694acb
Author: Aliaksey Kandratsenka <alkondratenko>
Date:   Mon May 22 18:58:15 2017 -0700

    Revert "Revert "disable dynamic sized delete support by default""
    
    This reverts commit b82d89cb7c8781a6028f6f5959cabdc5a273aec3.
    
    Dynamic sized delete support relies on ifunc handler being able to
    look up environment variable. The issue is, when stuff is linked with
    -z now linker flags, all relocations are performed early. And sadly
    ifunc relocations are not treated specially. So when ifunc handler
    runs, it cannot rely on any dynamic relocations at all, otherwise
    crash is real possibility. So we cannot afford doing it until (and if)
    ifunc is fixed.
    
    This was brought to my attention by Fedora people at
    https://bugzilla.redhat.com/show_bug.cgi?id=1452813

Spot: You'll have to either disable sized delete on every architecture
or add the above commit.

Comment 21 Florian Weimer 2017-05-23 07:49:16 UTC
(In reply to Aliaksei Kandratsenka from comment #19)
> The specific issue is that the __environ relocation is not available during
> ifunc handler invocation.
> 
> So this is hopeless indeed. I will disable this feature again. Sorry for the
> noise (but please consider fixing and expanding scope of ifunc; it would be
> nice if all normal relocations could be done before ifunc resolutions start).

We can give you a valid __environ relocation (I have a patch for that), but with BIND_NOW, the variable itself will still not have been initialized when the IFUNC resolver runs, so you still won't be able to detect that something has been configured through the process environment.
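
Taken together, the comments above suggest that a decision based on the process environment cannot safely be made inside an IFUNC resolver. One possible alternative, shown here only as a sketch and not as what gperftools did (upstream disabled the feature by default), is to defer the decision to the first call, by which point the process has been fully relocated and initialized. All names below (demo_delete, choose_delete, DEMO_SIZED_DELETE and the two stand-in implementations) are hypothetical.

```c
#include <stdlib.h>

typedef int (*delete_fn)(void *ptr, size_t size);

/* Stand-in implementations. Each returns a tag identifying itself so
 * that the selection can be observed; real code would return nothing. */
int delete_ignoring_size(void *ptr, size_t size)
{
    (void)size;
    free(ptr);
    return 0;
}

int delete_using_size(void *ptr, size_t size)
{
    (void)size; /* a real allocator would use the size hint here */
    free(ptr);
    return 1;
}

/* Pure selection step: pick an implementation from the variable's value.
 * Only the exact value "1" enables the size-aware path. */
delete_fn choose_delete(const char *value)
{
    if (value != NULL && value[0] == '1' && value[1] == '\0')
        return delete_using_size;
    return delete_ignoring_size;
}

int delete_first_call(void *ptr, size_t size);

/* Starts out pointing at the trampoline and is replaced on first use.
 * This sketch is not synchronized; multithreaded code would need an
 * atomic store here. */
static delete_fn current_delete = delete_first_call;

/* Runs on the first call, i.e. after dynamic linking has finished, so
 * calling getenv() is safe here, unlike in an IFUNC resolver. */
int delete_first_call(void *ptr, size_t size)
{
    current_delete = choose_delete(getenv("DEMO_SIZED_DELETE"));
    return current_delete(ptr, size);
}

/* Public entry point: one indirect call through the cached pointer. */
int demo_delete(void *ptr, size_t size)
{
    return current_delete(ptr, size);
}
```

The trade-off in this sketch is an indirect call through a writable function pointer on every invocation, whereas an IFUNC binding is resolved once by the dynamic linker.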

Comment 22 Fedora Update System 2017-05-23 15:24:30 UTC
gperftools-2.5.93-1.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-685f48d47a

Comment 23 Fedora Update System 2017-05-25 19:18:57 UTC
gperftools-2.5.93-1.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-685f48d47a

Comment 24 Fedora Update System 2017-06-09 19:09:35 UTC
gperftools-2.5.93-1.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.

