Bug 1226806

Summary: arm: all programs that link to tcmalloc hang forever on startup
Product: [Fedora] Fedora Reporter: Boris Ranto <branto>
Component: libunwindAssignee: Kyle McMartin <kmcmartin>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 22CC: amit.shah, berrange, branto, cfergeau, crobinso, david, dwmw2, extras-qa, fedora, itamar, kmcmartin, ktdreyer, loic, pbonzini, pbrobinson, peterm, redhat-bugzilla, rjones, scottt.tw, steve, tcallawa, tmraz, virt-maint
Target Milestone: ---   
Target Release: ---   
Hardware: armv7hl   
OS: Unspecified   
Whiteboard:
Fixed In Version: libunwind-1.1-8.fc22 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1222286 Environment:
Last Closed: 2015-06-18 13:20:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1222286    
Bug Blocks: 245418, 910269    

Description Boris Ranto 2015-06-01 07:28:21 UTC
+++ This bug was initially created as a clone of Bug #1222286 +++
... [snipped] ...

FWIW qemu built against this package still seems to hang:

http://koji.fedoraproject.org/koji/taskinfo?taskID=9780467

But I'm trying another build now

--- Additional comment from Fedora Update System on 2015-05-21 22:31:20 EDT ---

Package ceph-0.94.1-3.fc22:
* should fix your issue,
* was pushed to the Fedora 22 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing ceph-0.94.1-3.fc22'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-8189/ceph-0.94.1-3.fc22
then log in and leave karma (feedback).

--- Additional comment from Richard W.M. Jones on 2015-05-29 16:32:28 EDT ---

This has regressed somehow, so it's started to happen again.

https://kojipkgs.fedoraproject.org//work/tasks/1496/9871496/build.log

The build log shows that it is using:

DEBUG util.py:388:   librados2                   armv7hl 1:0.94.1-3.fc23                build 1.5 M

https://kojipkgs.fedoraproject.org//work/tasks/1496/9871496/root.log

I don't understand exactly why, since the patch exists and
is still being applied.  Perhaps the patch is not a complete
fix for the problem.

--- Additional comment from Richard W.M. Jones on 2015-05-29 17:15:11 EDT ---

I can reproduce this on baremetal armv7hl with
librados2-0.94.1-3.fc23.armv7hl
qemu-system-arm-2.3.0-5.fc23.armv7hl
and (as before) simply running 'qemu-system-arm -help'.

Unfortunately gdb doesn't let me get a stack trace.  It simply
ignores ^C.

I was able to use strace, and the process is doing this over and
over again:

gettimeofday({1432934064, 725311}, NULL) = 0
nanosleep({0, 2000001}, NULL)           = 0
gettimeofday({1432934064, 728002}, NULL) = 0
nanosleep({0, 2000001}, NULL)           = 0

--- Additional comment from Richard W.M. Jones on 2015-05-29 17:35:13 EDT ---



--- Additional comment from Boris Ranto on 2015-05-31 22:45:40 EDT ---

Just few thoughts:

The ceph itself should not run the loop, now. Is there any chance that qemu is trying to do that instead?

Can you still reproduce on bare metal with the simple reproducer that you mentioned in Comment 10?

What about the older librados2-0.94.1-2.fc23.armv7hl? The only change between -2 and -3 should be that lttng was disabled but one never knows.

I'm trying to get my hands on an armhfp machine to debug further myself but that may take some time.

--- Additional comment from Boris Ranto on 2015-06-01 03:25:04 EDT ---

I'll answer the questions myself as I've finally got my hands on an armhfp machine where I could reproduce.

The steps in Comment 10 do not reproduce the issue so that bug is indeed fixed. I've tapped into the running process (gdb -p $(pidof qemu-system-arm)) and I didn't see anything ceph (rbd/rados) related in the backtrace.

The backtrace leading to nanosleep was

#0  0xffffffff in nanosleep () at /lib/libpthread.so.0
#1  0xffffffff in base::internal::SpinLockDelay(int volatile*, int, int) () at /lib/libtcmalloc.so.4
#2  0xffffffff in SpinLock::SlowLock() () at /lib/libtcmalloc.so.4
#3  0xffffffff in tcmalloc::ThreadCache::InitModule() () at /lib/libtcmalloc.so.4
#4  0xffffffff in tc_malloc () at /lib/libtcmalloc.so.4
#5  0xffffffff in __fopen_internal () at /lib/libc.so.6
#6  0xffffffff in load_debug_frame () at /lib/libunwind.so.8
#7  0xffffffff in locate_debug_info () at /lib/libunwind.so.8
#8  0xffffffff in _ULarm_dwarf_find_debug_frame () at /lib/libunwind.so.8
#9  0xffffffff in _ULarm_dwarf_callback () at /lib/libunwind.so.8
#10 0xffffffff in dl_iterate_phdr () at /lib/libc.so.6
#11 0xffffffff in _ULarm_find_proc_info () at /lib/libunwind.so.8
#12 0xffffffff in fetch_proc_info () at /lib/libunwind.so.8
#13 0xffffffff in _ULarm_dwarf_find_save_locs () at /lib/libunwind.so.8
#14 0xffffffff in _ULarm_dwarf_step () at /lib/libunwind.so.8
#15 0xffffffff in _ULarm_step () at /lib/libunwind.so.8
#16 0xffffffff in GetStackTrace_libunwind(void**, int, int) () at /lib/libtcmalloc.so.4
#17 0xffffffff in GetStackTrace(void**, int, int) () at /lib/libtcmalloc.so.4
#18 0xffffffff in tcmalloc::PageHeap::GrowHeap(unsigned int) () at /lib/libtcmalloc.so.4
#19 0xffffffff in tcmalloc::PageHeap::New(unsigned int) () at /lib/libtcmalloc.so.4
#20 0xffffffff in tcmalloc::CentralFreeList::Populate() () at /lib/libtcmalloc.so.4
#21 0xffffffff in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) () at /lib/libtcmalloc.so.4
#22 0xffffffff in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () at /lib/libtcmalloc.so.4
#23 0xffffffff in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, unsigned int) () at /lib/libtcmalloc.so.4
#24 0xffffffff in tc_malloc () at /lib/libtcmalloc.so.4
#25 0xffffffff in strdup () at /lib/libc.so.6
#26 0xffffffff in verify_selinuxmnt () at /lib/libselinux.so.1
#27 0xffffffff in init_lib () at /lib/libselinux.so.1
#28 0xffffffff in call_init.part () at /lib/ld-linux-armhf.so.3
#29 0xffffffff in _dl_init_internal () at /lib/ld-linux-armhf.so.3
#30 0xffffffff in _dl_start_user () at /lib/ld-linux-armhf.so.3

The suspects seemed to be tcmalloc (gperftools) or libunwind. While both of these are used by ceph, neither of them is used by the client libraries (rbd/rados) that qemu links to.

I've done some more tests and it seems that tcmalloc (gperftools) is actually the one causing the hang-ups, i.e. similar to comment 10, I did

$ echo 'main(){}' > test.c
$ gcc test.c -ltcmalloc -o test
$ ./test

and the program hanged indefinitely.

I'm moving this bz back to ON_QA (as there actually was a bug in ceph code that was fixed in the -2 release) and I'll clone this bz for the tcmalloc issue.

Comment 1 Boris Ranto 2015-06-01 07:34:19 UTC
Moving to tcmalloc as I forgot to change that when I was cloning the bz...

I'm attaching some more (concise) info:

gperftools version: gperftools-libs-2.4-2.fc23.armv7hl
reproducer steps:

$ echo 'main(){}' > test.c
$ gcc test.c -ltcmalloc -o test
$ ./test

result: ./test hangs indefinitely -- gettimeofday and nanosleep syscalls occur in the (probably) infinite loop
./test backtrace:

#0  0xb6e521e8 in nanosleep () from /lib/libc.so.6
#1  0xb6f2b230 in base::internal::SpinLockDelay(int volatile*, int, int) () from /lib/libtcmalloc.so.4
#2  0xb6f2b0a0 in SpinLock::SlowLock() () from /lib/libtcmalloc.so.4
#3  0xb6f1ea08 in tcmalloc::ThreadCache::InitModule() () from /lib/libtcmalloc.so.4
#4  0xb6f2e3f0 in tc_malloc () from /lib/libtcmalloc.so.4
#5  0xb6e0fbb4 in __fopen_internal () from /lib/libc.so.6
#6  0xb6d590a0 in load_debug_frame () from /lib/libunwind.so.8
#7  0xb6d59bb8 in locate_debug_info () from /lib/libunwind.so.8
#8  0xb6d59d24 in _ULarm_dwarf_find_debug_frame () from /lib/libunwind.so.8
#9  0xb6d5a304 in _ULarm_dwarf_callback () from /lib/libunwind.so.8
#10 0xb6ecbe34 in dl_iterate_phdr () from /lib/libc.so.6
#11 0xb6d563c8 in _ULarm_find_proc_info () from /lib/libunwind.so.8
#12 0xb6d579d4 in fetch_proc_info () from /lib/libunwind.so.8
#13 0xb6d589dc in _ULarm_dwarf_find_save_locs () from /lib/libunwind.so.8
#14 0xb6d59014 in _ULarm_dwarf_step () from /lib/libunwind.so.8
#15 0xb6d55628 in _ULarm_step () from /lib/libunwind.so.8
#16 0xb6f2b684 in GetStackTrace_libunwind(void**, int, int) () from /lib/libtcmalloc.so.4
#17 0xb6f2be90 in GetStackTrace(void**, int, int) () from /lib/libtcmalloc.so.4
#18 0xb6f1bb34 in tcmalloc::PageHeap::GrowHeap(unsigned int) () from /lib/libtcmalloc.so.4
#19 0xb6f1bea0 in tcmalloc::PageHeap::New(unsigned int) () from /lib/libtcmalloc.so.4
#20 0xb6f1a7b4 in tcmalloc::CentralFreeList::Populate() () from /lib/libtcmalloc.so.4
#21 0xb6f1aa04 in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) () from /lib/libtcmalloc.so.4
#22 0xb6f1aab4 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /lib/libtcmalloc.so.4
#23 0xb6f1dd28 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, unsigned int) () from /lib/libtcmalloc.so.4
#24 0xb6f2e05c in tc_malloc () from /lib/libtcmalloc.so.4

Comment 2 Richard W.M. Jones 2015-06-01 08:43:16 UTC
I am able to reproduce this bug as well.

Test:

$ echo 'main(){}' > test.c
$ gcc test.c -ltcmalloc -o test
test.c:1:1: warning: return type defaults to ‘int’ [-Wimplicit-int]
 main(){}
$ ./test
<--- hangs here forever

gperftools-libs-2.4-1.fc22.armv7hl
kernel-4.0.0-0.rc4.git0.1.fc22.armv7hl+lpae

Comment 3 Paolo Bonzini 2015-06-01 14:47:46 UTC
#ifdef __arm__
    // ARM linux doesn't support sys_futex1(void*, int, int, struct timespec*);
    have_futex = 0;
#else
    have_futex = (sizeof (Atomic32) == sizeof (int) &&
                  sys_futex(&x, FUTEX_WAKE, 1, 0) >= 0);
#endif

Comment 4 Tom "spot" Callaway 2015-06-02 14:46:10 UTC
Despite the ugliness of that code that Paolo found, the hang is in libunwind.

On ARM, the default unwind method is for dwarf tables - causing an fopen() with malloc if debug symbols are not part of the executable itself. Forcing to use EXID table method with

export UNW_ARM_UNWIND_METHOD=4

makes the libunwind method work. You can reproduce that with the test case in Comment 2.

Some docs here: https://wiki.linaro.org/KenWerner/Sandbox/libunwind#UNW_ARM_METHOD_EXIDX_0x04

This is in the upstream gperftools issue tracker here:
https://code.google.com/p/gperftools/issues/detail?id=629

I don't think the libunwind upstream considers this a bug. That said, it makes sense to change the unwinding method default on ARM to be Method 4 (at least in Fedora), so that this just works (tm). If someone wants the old broken behavior for some reason, the UNW_ARM_UNWIND_METHOD env variable will let them go back to it.

I've confirmed changing the default also resolves this issue, so I'll push a libunwind update for that.

Comment 5 Fedora Update System 2015-06-02 14:59:30 UTC
libunwind-1.1-8.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/libunwind-1.1-8.fc22

Comment 6 Fedora Update System 2015-06-04 20:14:52 UTC
Package libunwind-1.1-8.fc22:
* should fix your issue,
* was pushed to the Fedora 22 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing libunwind-1.1-8.fc22'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-9378/libunwind-1.1-8.fc22
then log in and leave karma (feedback).

Comment 7 Fedora Update System 2015-06-18 13:20:13 UTC
libunwind-1.1-8.fc22 has been pushed to the Fedora 22 stable repository.  If problems still persist, please make note of it in this bug report.