Bug 844166

Summary:	Invoking pthread_cancel() from a non-dl_opened object triggers SIGSEGV
Product:	[Fedora] Fedora	Reporter:	Bart Van Assche <bart.vanassche+redhat>
Component:	valgrind	Assignee:	Jakub Jelinek <jakub>
Status:	CLOSED NOTABUG	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	17	CC:	dodji, jakub, law, mjw, mjw, pfrankli, schwab
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2012-09-26 12:20:17 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Bart Van Assche 2012-07-29 13:37:12 UTC

Description of problem: invoking pthread_cancel() from an object not loaded with dl_open() triggers SIGSEV.


Version-Release number of selected component (if applicable):

$ rpm -q glibc
glibc-2.15-51.fc17.x86_64


How reproducible: always.


Steps to Reproduce:

svn co -r12799 svn://svn.valgrind.org/valgrind/trunk valgrind
cd valgrind
svn update -r2451 VEX
./autogen.sh && ./configure && make -s && make -s check
./vg-in-place --tool=drd drd/tests/pth_cancel_locked
  

Actual results:

==13809== drd, a thread error detector
==13809== Copyright (C) 2006-2011, and GNU GPL'd, by Bart Van Assche.
==13809== Using Valgrind-3.8.0.SVN and LibVEX; rerun with -h for copyright info
==13809== Command: drd/tests/pth_cancel_locked
==13809== 
==13809== 
==13809== Process terminating with default action of signal 11 (SIGSEGV)
==13809==  General Protection Fault
==13809==    at 0x4006B75: _dl_map_object_from_fd (dl-load.c:1580)
==13809==    by 0x4008312: _dl_map_object (dl-load.c:2355)
==13809==    by 0x4012FFB: dl_open_worker (dl-open.c:226)
==13809==    by 0x400ECB5: _dl_catch_error (dl-error.c:178)
==13809==    by 0x4012B2B: _dl_open (dl-open.c:652)
==13809==    by 0x5184511: do_dlopen (dl-libc.c:89)
==13809==    by 0x400ECB5: _dl_catch_error (dl-error.c:178)
==13809==    by 0x51845D1: __libc_dlopen_mode (dl-libc.c:48)
==13809==    by 0x4E4A703: pthread_cancel_init (unwind-forcedunwind.c:53)
==13809==    by 0x4E476F2: pthread_cancel (pthread_cancel.c:40)
==13809==    by 0x4C2C050: pthread_cancel (drd_pthread_intercepts.c:547)
==13809==    by 0x4009C1: main (pth_cancel_locked.c:43)
==13809== 
==13809== For counts of detected and suppressed errors, rerun with: -v
==13809== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 8)
./vg-in-place: line 31: 13809 Killed                  VALGRIND_LIB="$vgbasedir/.in_place" VALGRIND_LIB_INNER="$vgbasedir/.in_place" "$vgbasedir/coregrind/valgrind" "$@"


Expected results:

==13842== drd, a thread error detector
==13842== Copyright (C) 2006-2011, and GNU GPL'd, by Bart Van Assche.
==13842== Using Valgrind-3.8.0.SVN and LibVEX; rerun with -h for copyright info
==13842== Command: drd/tests/pth_cancel_locked
==13842== 
==13842== Mutex still locked at thread exit: mutex 0x600ee0, recursion count 1, owner 2.
==13842==    at 0x4C2B67D: pthread_join (drd_pthread_intercepts.c:509)
==13842==    by 0x4009D2: main (pth_cancel_locked.c:46)
==13842== mutex 0x600ee0 was first observed at:
==13842==    at 0x4C2C8AF: pthread_mutex_init (drd_pthread_intercepts.c:614)
==13842==    by 0x400977: main (pth_cancel_locked.c:32)
==13842== 
Test finished.
==13842== 
==13842== For counts of detected and suppressed errors, rerun with: -v
==13842== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 8 from 8)

Comment 1 Jeff Law 2012-08-23 16:56:51 UTC

Are you sure this isn't a problem in valgrind itself; I can certainly reproduce the failure when using the instructions above.  However, when I run the pthread_cancel_locked test without valgrind, it runs without triggering any kind of fault.

Comment 2 Bart Van Assche 2012-08-24 13:32:21 UTC

Yes, I'm sure. When running the test under Valgrind the pthread_cancel() call is intercepted and some intercept code in a Valgrind shared object is invoked. If I interpret the glibc code correctly there is an assumption in glibc that all shared objects have been loaded via dl_open(). Valgrind loads its own shared objects via mmap().

Comment 3 Jeff Law 2012-08-24 16:00:21 UTC

It sounds like you've already done some analysis here that would save me a lot of time looking at this problem (I'm really not a glibc expert, I'm just filling in)

Can you explain where/how glibc makes that assumption?

Comment 4 Bart Van Assche 2012-08-24 16:40:16 UTC

glibc maintains a table with information about DSOs loaded by dlopen(), namely GL(dl_ns). _dl_map_object() and _dl_map_object_from_fd() look up information in that table. If my interpretation is correct the problem is that these functions assume that information is present in GL(dl_ns) about the DSO calling into glibc. That only holds for DSOs loaded by dlopen() and not for DSOs loaded via mmap().

Comment 5 Jeff Law 2012-08-27 19:21:25 UTC

The failure I'm seeing looks quite different.  Using --vgdb=full --db-attach=yes on a system with debuginfos installed I get the following backtrace:

#0  _dl_map_object_from_fd (name=name@entry=0x3833c10acf "libgcc_s.so.1", fd=-1, fd@entry=3, 
    fbp=fbp@entry=0x7fefff1c8, realname=0x543d160 "/lib64/libgcc_s.so.1", loader=loader@entry=0x0, 
    l_type=l_type@entry=2, mode=mode@entry=-1879048191, stack_endp=stack_endp@entry=0x7fefff1c0, 
    nsid=nsid@entry=0) at dl-load.c:1580
#1  0x0000003833408313 in _dl_map_object (loader=0x0, name=name@entry=0x3833c10acf "libgcc_s.so.1", 
    type=type@entry=2, trace_mode=trace_mode@entry=0, mode=mode@entry=-1879048191, nsid=nsid@entry=0)
    at dl-load.c:2355
#2  0x0000003833412ffc in dl_open_worker (a=a@entry=0x7fefff768) at dl-open.c:226
#3  0x000000383340ecb6 in _dl_catch_error (objname=objname@entry=0x7fefff758, 
    errstring=errstring@entry=0x7fefff760, mallocedp=mallocedp@entry=0x7fefff748, 
    operate=operate@entry=0x3833412ed0 <dl_open_worker>, args=args@entry=0x7fefff768)
    at dl-error.c:178
#4  0x0000003833412b2c in _dl_open (file=0x3833c10acf "libgcc_s.so.1", mode=-2147483647, 
    caller_dlopen=<optimized out>, nsid=-2, argc=1, argv=0x7fefffbc8, env=0x7fefffbd8)
    at dl-open.c:652
#5  0x000000383392d512 in do_dlopen (ptr=ptr@entry=0x7fefff978) at dl-libc.c:89
#6  0x000000383340ecb6 in _dl_catch_error (objname=0x7fefff958, errstring=0x7fefff968, 
    mallocedp=0x7fefff948, operate=0x383392d4d0 <do_dlopen>, args=0x7fefff978) at dl-error.c:178
#7  0x000000383392d5d2 in dlerror_run (args=0x7fefff978, operate=0x383392d4d0 <do_dlopen>)
    at dl-libc.c:48
#8  __GI___libc_dlopen_mode (name=name@entry=0x3833c10acf "libgcc_s.so.1", 
    mode=mode@entry=-2147483647) at dl-libc.c:165
#9  0x0000003833c0f704 in pthread_cancel_init () at ../nptl/sysdeps/pthread/unwind-forcedunwind.c:53
#10 0x0000003833c0c6f3 in pthread_cancel (th=88327936) at pthread_cancel.c:40
#11 0x0000000004a0b051 in pthread_cancel_intercept (pt_thread=88330656)
    at drd_pthread_intercepts.c:547
#12 _vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel (thread=88330656) at drd_pthread_intercepts.c:553
#13 0x00000000004009c2 in main (argc=1, argv=0x7fefffbc8) at pth_cancel_locked.c:43

We actually fault here:

_dl_map_object_from_fd (name=name@entry=0x3833c10acf "libgcc_s.so.1", fd=-1, fd@entry=3, 
    fbp=fbp@entry=0x7fefff1c8, realname=0x543d160 "/lib64/libgcc_s.so.1", loader=loader@entry=0x0, 
    l_type=l_type@entry=2, mode=mode@entry=-1879048191, stack_endp=stack_endp@entry=0x7fefff1c0, 
    nsid=nsid@entry=0) at dl-load.c:1580
1580      l->l_dev = st.st_dev;
(gdb) x/10i $pc
=> 0x3833406b75 <_dl_map_object_from_fd+2517>:  movdqa -0xc0(%rbp),%xmm0


(gdb) p/x $rbp
$1 = 0x7fefff138

%rbp is directly derived from $rsp.  And looking at %rsp, it's not correctly aligned.  



So we're trying to issue a movdqa on an address that is not suitably aligned.  %rbp is directly derived from rsp and AFIACT the wrappers y'all are using are not keeping the stack suitably aligned.  At function entry sp - 0x8 just be 16 byte aligned.

  

How were your interceptor routines compiled?  From the looks of them they are not keeping the stack frame suitably aligned.  At function entry, the sp - 0x8 must be 16 byte aligned.  

If we look at the start of the pthread_cancel wrapper:

(gdb) x/10i _vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel
   0x4a0af70 <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel>:   push   %rbp
   0x4a0af71 <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel+1>: mov    %rsp,%rbp
   0x4a0af74 <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel+4>: mov    %rbx,-0x20(%rbp)
   0x4a0af78 <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel+8>: mov    %r12,-0x18(%rbp)
   0x4a0af7c <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel+12>:        mov    %rdi,%rbx
   0x4a0af7f <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel+15>:        mov    %r13,-0x10(%rbp)
   0x4a0af83 <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel+19>:        mov    %r15,-0x8(%rbp)
   0x4a0af87 <_vgw00000ZZ_libpthreadZdsoZd0_pthreadZucancel+23>:        sub    $0x48,%rsp

That series of stack adjustments looks wrong as it's going to leave the stack misaligned.

I'm reassigning to valgrind as the valgrind wrappers are clearly mucking up the stack.  We can come back to the question about DSO handling once the stack alignment problem has been resolved.

Comment 6 Bart Van Assche 2012-08-28 18:02:56 UTC

From the Valgrind Subversion changelog (included in version 3.8.0):

------------------------------------------------------------------------
r12811 | tom | 2012-08-02 09:23:45 +0000 (Thu, 02 Aug 2012) | 20 lines

Ensure CALL_FN_xx macros align the stack properly

The CALL_FN_xx macros  in valgrind.h perform function calls by
signalling to valgrind using the client request system. Because
they are making function calls which are invisible to the compiler
they need to make sure that any stack alignment constraints
imposed by the ABI are enforced when making the call.

This commit enforces 16 byte alignment for x86, amd64, ppc32 and
ppc64 platforms, and 8 byte alignment for arm platforms.

It does not touch s390x where the ABI requires 8 byte alignment to
be maintained at all times, not just when making a function call.

It also does not touch mips32 as I'm not currently aware what if
any alignment constraints exist there.

Fixes BZ#304054 and observed alignment faults on amd64 when running
the regtests using a valgrind compiled with gcc 4.7 releases.
------------------------------------------------------------------------

So I've reenabled the pthread_cancel() intercept and it's working fine now. Sorry for the confusion.

Comment 7 Mark Wielaard 2012-08-28 18:25:47 UTC

(In reply to comment #6)
> From the Valgrind Subversion changelog (included in version 3.8.0):
> [...]
> So I've reenabled the pthread_cancel() intercept and it's working fine now.
> Sorry for the confusion.

Are you using the fedora f18/rawhide valgrind 3.8.0 packages?

Comment 8 Mark Wielaard 2012-08-28 18:30:06 UTC

(In reply to comment #7)
> (In reply to comment #6)
> > From the Valgrind Subversion changelog (included in version 3.8.0):
> > [...]
> > So I've reenabled the pthread_cancel() intercept and it's working fine now.
> > Sorry for the confusion.
> 
> Are you using the fedora f18/rawhide valgrind 3.8.0 packages?

ah, apparently not. Sorry for the confusion. I see You just committed:

bart    2012-08-28 18:57:09 +0100 (Tue, 28 Aug 2012)

  New Revision: 12907

  Log:
    drd: Re-enable the pthread_cancel() intercept now that the CALL_FN_*() ABI
    violation has been fixed (r12811).

  Modified files:
    trunk/drd/drd_pthread_intercepts.c

I can backport that to the fedora valgrind package. Will you push it for the 3.8.1 release (~next week).

Comment 9 Bart Van Assche 2012-08-28 18:34:40 UTC

Two DRD semaphore fixes will be included in the 3.8.1 release (r12897 and r12898) but there are no plans yet to include r12907 in the 3.8.1 release. That last revision is closer to a new / reintroduced feature than a bug fix.