Bug 844166
Field | Value
---|---
Summary: | Invoking pthread_cancel() from a non-dl_opened object triggers SIGSEGV
Product: | [Fedora] Fedora
Reporter: | Bart Van Assche <bart.vanassche+redhat>
Component: | valgrind
Assignee: | Jakub Jelinek <jakub>
Status: | CLOSED NOTABUG
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | unspecified
Docs Contact: |
Priority: | unspecified
Version: | 17
CC: | dodji, jakub, law, mjw, mjw, pfrankli, schwab
Target Milestone: | ---
Target Release: | ---
Hardware: | x86_64
OS: | Linux
Whiteboard: |
Fixed In Version: |
Doc Type: | Bug Fix
Doc Text: |
Story Points: | ---
Clone Of: |
Environment: |
Last Closed: | 2012-09-26 12:20:17 UTC
Type: | Bug
Description
Bart Van Assche
2012-07-29 13:37:12 UTC
Are you sure this isn't a problem in valgrind itself? I can certainly reproduce the failure when using the instructions above. However, when I run the pthread_cancel_locked test without valgrind, it runs without triggering any kind of fault.

Yes, I'm sure. When running the test under Valgrind, the pthread_cancel() call is intercepted and some intercept code in a Valgrind shared object is invoked. If I interpret the glibc code correctly, there is an assumption in glibc that all shared objects have been loaded via dlopen(). Valgrind loads its own shared objects via mmap().

It sounds like you've already done some analysis here that would save me a lot of time looking at this problem (I'm really not a glibc expert, I'm just filling in). Can you explain where/how glibc makes that assumption?

glibc maintains a table with information about DSOs loaded by dlopen(), namely GL(dl_ns). _dl_map_object() and _dl_map_object_from_fd() look up information in that table. If my interpretation is correct, the problem is that these functions assume that information about the DSO calling into glibc is present in GL(dl_ns). That only holds for DSOs loaded by dlopen() and not for DSOs loaded via mmap().

The failure I'm seeing looks quite different.
Using --vgdb=full --db-attach=yes on a system with debuginfos installed I get the following backtrace:

    #0 _dl_map_object_from_fd (name=name@entry=0x3833c10acf "libgcc_s.so.1", fd=-1, fd@entry=3, fbp=fbp@entry=0x7fefff1c8, realname=0x543d160 "/lib64/libgcc_s.so.1", loader=loader@entry=0x0, l_type=l_type@entry=2, mode=mode@entry=-1879048191, stack_endp=stack_endp@entry=0x7fefff1c0, nsid=nsid@entry=0) at dl-load.c:1580
    #1 0x0000003833408313 in _dl_map_object (loader=0x0, name=name@entry=0x3833c10acf "libgcc_s.so.1", type=type@entry=2, trace_mode=trace_mode@entry=0, mode=mode@entry=-1879048191, nsid=nsid@entry=0) at dl-load.c:2355
    #2 0x0000003833412ffc in dl_open_worker (a=a@entry=0x7fefff768) at dl-open.c:226
    #3 0x000000383340ecb6 in _dl_catch_error (objname=objname@entry=0x7fefff758, errstring=errstring@entry=0x7fefff760, mallocedp=mallocedp@entry=0x7fefff748, operate=operate@entry=0x3833412ed0 <dl_open_worker>, args=args@entry=0x7fefff768) at dl-error.c:178
    #4 0x0000003833412b2c in _dl_open (file=0x3833c10acf "libgcc_s.so.1", mode=-2147483647, caller_dlopen=<optimized out>, nsid=-2, argc=1, argv=0x7fefffbc8, env=0x7fefffbd8) at dl-open.c:652
    #5 0x000000383392d512 in do_dlopen (ptr=ptr@entry=0x7fefff978) at dl-libc.c:89
    #6 0x000000383340ecb6 in _dl_catch_error (objname=0x7fefff958, errstring=0x7fefff968, mallocedp=0x7fefff948, operate=0x383392d4d0 <do_dlopen>, args=0x7fefff978) at dl-error.c:178
    #7 0x000000383392d5d2 in dlerror_run (args=0x7fefff978, operate=0x383392d4d0 <do_dlopen>) at dl-libc.c:48
    #8 __GI___libc_dlopen_mode (name=name@entry=0x3833c10acf "libgcc_s.so.1", mode=mode@entry=-2147483647) at dl-libc.c:165
    #9 0x0000003833c0f704 in pthread_cancel_init () at ../nptl/sysdeps/pthread/unwind-forcedunwind.c:53
    #10 0x0000003833c0c6f3 in pthread_cancel (th=88327936) at pthread_cancel.c:40
    #11 0x0000000004a0b051 in pthread_cancel_intercept (pt_thread=88330656) at drd_pthread_intercepts.c:547
    #12 _vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel (thread=88330656) at drd_pthread_intercepts.c:553
    #13 0x00000000004009c2 in main (argc=1, argv=0x7fefffbc8) at pth_cancel_locked.c:43

We actually fault here:

    _dl_map_object_from_fd (name=name@entry=0x3833c10acf "libgcc_s.so.1", fd=-1, fd@entry=3, fbp=fbp@entry=0x7fefff1c8, realname=0x543d160 "/lib64/libgcc_s.so.1", loader=loader@entry=0x0, l_type=l_type@entry=2, mode=mode@entry=-1879048191, stack_endp=stack_endp@entry=0x7fefff1c0, nsid=nsid@entry=0) at dl-load.c:1580
    1580 l->l_dev = st.st_dev;
    (gdb) x/10i $pc
    => 0x3833406b75 <_dl_map_object_from_fd+2517>: movdqa -0xc0(%rbp),%xmm0
    (gdb) p/x $rbp
    $1 = 0x7fefff138

%rbp is directly derived from %rsp. And looking at %rsp, it's not correctly aligned. So we're trying to issue a movdqa on an address that is not suitably aligned.

%rbp is directly derived from %rsp and AFAICT the wrappers y'all are using are not keeping the stack suitably aligned. At function entry, sp - 0x8 must be 16-byte aligned.

How were your interceptor routines compiled? From the looks of them they are not keeping the stack frame suitably aligned. At function entry, sp - 0x8 must be 16-byte aligned. If we look at the start of the pthread_cancel wrapper:

    (gdb) x/10i _vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel
    0x4a0af70 <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel>: push %rbp
    0x4a0af71 <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel+1>: mov %rsp,%rbp
    0x4a0af74 <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel+4>: mov %rbx,-0x20(%rbp)
    0x4a0af78 <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel+8>: mov %r12,-0x18(%rbp)
    0x4a0af7c <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel+12>: mov %rdi,%rbx
    0x4a0af7f <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel+15>: mov %r13,-0x10(%rbp)
    0x4a0af83 <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel+19>: mov %r15,-0x8(%rbp)
    0x4a0af87 <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel+23>: sub $0x48,%rsp

That series of stack adjustments looks wrong as it's going to leave the stack misaligned.
I'm reassigning to valgrind as the valgrind wrappers are clearly mucking up the stack. We can come back to the question about DSO handling once the stack alignment problem has been resolved.

From the Valgrind Subversion changelog (included in version 3.8.0):

------------------------------------------------------------------------
r12811 | tom | 2012-08-02 09:23:45 +0000 (Thu, 02 Aug 2012) | 20 lines

Ensure CALL_FN_xx macros align the stack properly

The CALL_FN_xx macros in valgrind.h perform function calls by signalling to valgrind using the client request system.

Because they are making function calls which are invisible to the compiler they need to make sure that any stack alignment constraints imposed by the ABI are enforced when making the call.

This commit enforces 16 byte alignment for x86, amd64, ppc32 and ppc64 platforms, and 8 byte alignment for arm platforms.

It does not touch s390x where the ABI requires 8 byte alignment to be maintained at all times, not just when making a function call.

It also does not touch mips32 as I'm not currently aware what if any alignment constraints exist there.

Fixes BZ#304054 and observed alignment faults on amd64 when running the regtests using a valgrind compiled with gcc 4.7 releases.
------------------------------------------------------------------------

So I've reenabled the pthread_cancel() intercept and it's working fine now. Sorry for the confusion.

(In reply to comment #6)
> From the Valgrind Subversion changelog (included in version 3.8.0):
> [...]
> So I've reenabled the pthread_cancel() intercept and it's working fine now.
> Sorry for the confusion.

Are you using the fedora f18/rawhide valgrind 3.8.0 packages?

(In reply to comment #7)
> (In reply to comment #6)
> > From the Valgrind Subversion changelog (included in version 3.8.0):
> > [...]
> > So I've reenabled the pthread_cancel() intercept and it's working fine now.
> > Sorry for the confusion.
>
> Are you using the fedora f18/rawhide valgrind 3.8.0 packages?

ah, apparently not. Sorry for the confusion. I see you just committed:

bart 2012-08-28 18:57:09 +0100 (Tue, 28 Aug 2012)
New Revision: 12907
Log: drd: Re-enable the pthread_cancel() intercept now that the CALL_FN_*() ABI violation has been fixed (r12811).
Modified files: trunk/drd/drd_pthread_intercepts.c

I can backport that to the fedora valgrind package. Will you push it for the 3.8.1 release (~next week)?

Two DRD semaphore fixes will be included in the 3.8.1 release (r12897 and r12898) but there are no plans yet to include r12907 in the 3.8.1 release. That last revision is closer to a new / reintroduced feature than a bug fix.