Bug 1273103 - valgrind: running programs through /lib64/ld64.so.1 does not work on ppc64
Status: NEW
Product: Fedora
Classification: Fedora
Component: valgrind
Version: 27
Hardware: ppc64 Linux
Priority: medium
Severity: high
Assigned To: Mark Wielaard
QA Contact: Fedora Extras Quality Assurance
Keywords: Reopened
Depends On:
Blocks: PPCTracker
Reported: 2015-10-19 11:26 EDT by Karsten Hopp
Modified: 2017-08-15 05:27 EDT (History)
Doc Type: Bug Fix
Last Closed: 2015-10-26 07:44:48 EDT
Type: Bug
Description Karsten Hopp 2015-10-19 11:26:58 EDT
Description of problem:
+ elf/ld.so --library-path .:elf:nptl:dlfcn /usr/bin/valgrind elf/ld.so --library-path .:elf:nptl:dlfcn /usr/bin/true
==29098== Memcheck, a memory error detector
==29098== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==29098== Using Valgrind-3.11.0.TEST1 and LibVEX; rerun with -h for copyright info
==29098== Command: elf/ld.so --library-path .:elf:nptl:dlfcn /usr/bin/true
==29098== 
==29098== Jump to the invalid address stated on the next line
==29098==    at 0x3640: ???
==29098==  Address 0x3640 is not stack'd, malloc'd or (recently) free'd
==29098== 
==29098== 
==29098== Process terminating with default action of signal 11 (SIGSEGV)
==29098==  Bad permissions for mapped region at address 0x3640
==29098==    at 0x3640: ???
==29098== 
==29098== HEAP SUMMARY:
==29098==     in use at exit: 0 bytes in 0 blocks
==29098==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==29098== 
==29098== All heap blocks were freed -- no leaks are possible
==29098== 
==29098== For counts of detected and suppressed errors, rerun with: -v
==29098== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
/var/tmp/rpm-tmp.3gBBSM: line 116: 29098 Segmentation fault      (core dumped) elf/ld.so --library-path .:elf:nptl:dlfcn /usr/bin/valgrind elf/ld.so --library-path .:elf:nptl:dlfcn /usr/bin/true
error: Bad exit status from /var/tmp/rpm-tmp.3gBBSM (%check)

Version-Release number of selected component (if applicable):
glibc-2.22.90-8.fc24

How reproducible:
always

Steps to Reproduce:
1. ppc-koji build --scratch f24 glibc-2.22.90-8.fc24.src.rpm

Actual results:
http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=2816221
http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=2819401

Expected results:


Additional info:
build log attached
Comment 1 Florian Weimer 2015-10-26 07:44:48 EDT

*** This bug has been marked as a duplicate of bug 1274974 ***
Comment 2 Florian Weimer 2015-10-26 11:45:53 EDT
I went way back in the history, and it seems that this particular check never worked.

In fact, with the F21 glibc (glibc-2.20-8.fc21.ppc64p7), I get the same failure with:

/lib64/ld64.so.1 /usr/bin/valgrind /lib64/ld64.so.1 /usr/bin/true

(That is, same trick as before, but with an installed glibc which generally supports valgrind just fine.)

So either this check is invalid, or it is a valgrind issue.
Comment 3 Carlos O'Donell 2015-10-26 17:02:49 EDT
(In reply to Florian Weimer from comment #2)
> /lib64/ld64.so.1 /usr/bin/valgrind /lib64/ld64.so.1 /usr/bin/true
> So either this check is invalid, or it is a valgrind issue.

I can't see how the glibc check is invalid, but it may be unsupportable by valgrind on ppc64 ELFv1 BE, which is the only remaining architecture with official procedure descriptors.

The "trick" is nothing more than using ld.so as it was intended to be used. It is a valid way to run an application, and the only way to run applications under alternate loader paths for testing. That makes it an important use case for core tools work and for users working with bundled libraries.

On FC23 POWER8 ppc64 BE boxes the simpler test case is this:

/usr/bin/valgrind /lib64/ld-2.22.so

==3866== Memcheck, a memory error detector
==3866== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==3866== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==3866== Command: /lib64/ld-2.22.so
==3866== 
==3866== Jump to the invalid address stated on the next line
==3866==    at 0x34A0: ???
==3866==  Address 0x34a0 is not stack'd, malloc'd or (recently) free'd
==3866== 
==3866== 
==3866== Process terminating with default action of signal 11 (SIGSEGV)
==3866==  Bad permissions for mapped region at address 0x34A0
==3866==    at 0x34A0: ???
==3866== 
==3866== HEAP SUMMARY:
==3866==     in use at exit: 0 bytes in 0 blocks
==3866==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==3866== 
==3866== All heap blocks were freed -- no leaks are possible
==3866== 
==3866== For counts of detected and suppressed errors, rerun with: -v
==3866== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

So it looks like simply trying to run ld.so causes the problem.

The address 0x34a0 is the _start of ld.so.

gdb shows it's 0x00000000200034a0

(gdb) bt
#0  0x00000000200034ac in ._start ()

All normal.

cat /proc/3901/maps
20000000-20040000 r-xp 00000000 fd:00 67167791                           /usr/lib64/ld-2.22.so
20040000-20060000 rw-p 00030000 fd:00 67167791                           /usr/lib64/ld-2.22.so
3fffb7fe0000-3fffb8000000 r-xp 00000000 00:00 0                          [vdso]
3ffffffd0000-400000000000 rw-p 00000000 00:00 0                          [stack]

Shows the kernel put the executable mapping at 0x20000000 which is fine.
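
To make the address arithmetic explicit, here is a minimal Python sketch that takes the r-xp line from the maps output above and translates the unrelocated address 0x34a0 into a runtime address. The helper names (parse_maps_line, runtime_address) are made up for illustration, and the sketch assumes the segment's link-time address equals its file offset, which holds for this text segment because both start at 0.

```python
# Sketch only: translate a file-relative address via one /proc/<pid>/maps line.
# parse_maps_line and runtime_address are illustrative names, not a real API.

def parse_maps_line(line):
    """Return (start, end, perms, file_offset, path) for one maps line."""
    fields = line.split(None, 5)
    start_s, end_s = fields[0].split("-")
    path = fields[5].strip() if len(fields) > 5 else ""
    return int(start_s, 16), int(end_s, 16), fields[1], int(fields[2], 16), path


def runtime_address(line, file_vaddr):
    """Map an unrelocated address into the mapping described by `line`.

    Assumption: link-time address == file offset for this segment.
    """
    start, end, perms, offset, _path = parse_maps_line(line)
    addr = start + (file_vaddr - offset)
    if not (start <= addr < end):
        raise ValueError("address falls outside this mapping")
    return addr, perms


TEXT = ("20000000-20040000 r-xp 00000000 fd:00 67167791"
        "                           /usr/lib64/ld-2.22.so")

addr, perms = runtime_address(TEXT, 0x34A0)
print(hex(addr), perms)  # 0x200034a0 r-xp
```

The result matches the address gdb reports for ._start, and the mapping is executable, which is why the relocated jump target looks fine.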

Dump of assembler code for function ._start:
   0x00000000200034a0 <+0>:	mr      r3,r1
   0x00000000200034a4 <+4>:	li      r4,0
   0x00000000200034a8 <+8>:	stdu    r4,-128(r1)
=> 0x00000000200034ac <+12>:	bl      0x20007b20 <._dl_start>
   0x00000000200034b0 <+16>:	nop
   0x00000000200034b4 <+20>:	b       0x200034d0 <._dl_start_user>
   0x00000000200034b8 <+24>:	.long 0x0
   0x00000000200034bc <+28>:	.long 0xc2440
   0x00000000200034c0 <+32>:	.long 0x0
   0x00000000200034c4 <+36>:	.long 0x18
   0x00000000200034c8 <+40>:	.long 0x65f73
   0x00000000200034cc <+44>:	andis.  r1,r3,29300

The jump target 0x34a0 is the unrelocated address of ._start, so with the 0x20000000 load base applied the jump should be valid.

Note that the entry point for ld.so is 0x50000 which is:

Dump of assembler code for function _start:
   0x0000000020050000 <+0>:	.long 0x0
   0x0000000020050004 <+4>:	.long 0x34a0

Which is the first OPD entry of .opd, the entry for _start.

This is normal so far.
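
For readers less familiar with ELFv1 descriptors, here is a small Python sketch of how the two .long values in the dump combine. It assumes big-endian ppc64, where a function descriptor begins with a 64-bit entry address (the TOC base and environment doublewords follow and are not shown in the dump above). LOAD_BIAS is taken from the maps output; the variable names are just for illustration.

```python
import struct

# The two 32-bit words gdb printed at 0x20050000 (big-endian ppc64).
words = (0x0, 0x34A0)
raw = struct.pack(">II", *words)

# First doubleword of an ELFv1 function descriptor: the entry address.
(entry,) = struct.unpack(">Q", raw)

LOAD_BIAS = 0x20000000  # where the kernel mapped ld.so, per /proc/<pid>/maps

print(hex(entry))              # 0x34a0, the unrelocated value in the file
print(hex(entry + LOAD_BIAS))  # 0x200034a0, the address of ._start in gdb
```

So the value stored in the descriptor is exactly the address valgrind complains about, and it only becomes a usable code address after the load base is added.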

Rebuilding valgrind on ppc64 shows many generic testsuite failures and at least two SIGSEGVs.

e.g.
sh: line 1: 30904 Segmentation fault      (core dumped) VALGRIND_LIB=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place VALGRIND_LIB_INNER=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place /root/rpmbuild/BUILD/valgrind-3.11.0/./coregrind/valgrind --command-line-only=yes --memcheck:leak-check=no --tool=memcheck -q --suppressions=supp_unknown.supp ./supp_unknown > supp_unknown.stdout.out 2> supp_unknown.stderr.out

And some aborts:

sh: line 1: 23853 Aborted                 (core dumped) PATH=/tmp/bruhaha:$PATH VALGRIND_LIB=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place VALGRIND_LIB_INNER=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place /root/rpmbuild/BUILD/valgrind-3.11.0/./coregrind/valgrind --command-line-only=yes --memcheck:leak-check=no --tool=helgrind ./tc22_exit_w_lock > tc22_exit_w_lock.stdout.out 2> tc22_exit_w_lock.stderr.out
*** tc22_exit_w_lock failed (stderr) ***

Though all the none/tests/ppc64 tests pass.

A valgrind expert will have to look.

I took a pass at debugging it, but the amount of infrastructure knowledge required is beyond a simple triage right now. I just wanted to give some confidence that ld.so is OK and that it's somehow related to valgrind.
Comment 4 Carlos O'Donell 2015-10-26 17:06:10 EDT
Is perhaps valgrind relying on ld.so to relocate all the OPDs? But here because it's running ld.so, we don't do that and the kernel does the load + opd to start the application?
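
A toy model of that hypothesis, in Python. This is not valgrind or glibc code, and every name in it is hypothetical. It only shows that reading the entry field of the _start descriptor without adding the load bias yields the exact address from the valgrind report, while adding the bias yields the address gdb shows.

```python
# Hypothetical model of the startup path; none of these names come from
# valgrind, glibc, or the kernel.
LOAD_BIAS = 0x20000000   # mapping base from /proc/<pid>/maps
E_ENTRY = 0x50000        # ELF entry point: address of the _start descriptor

# Unrelocated image contents: descriptor address -> entry field.
image = {0x50000: 0x34A0}


def initial_pc(bias, relocate_entry):
    """Resolve the first instruction address through the descriptor.

    relocate_entry=False models a loader that uses the descriptor's
    entry field as-is, without adding the load bias.
    """
    code = image[E_ENTRY]
    return code + bias if relocate_entry else code


print(hex(initial_pc(LOAD_BIAS, True)))   # 0x200034a0
print(hex(initial_pc(LOAD_BIAS, False)))  # 0x34a0
```

If the hypothesis is right, the second case is what happens under valgrind when the program being run is ld.so itself. That is still an assumption at this point, not a confirmed diagnosis.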
Comment 5 Mark Wielaard 2015-11-11 12:13:14 EST
(In reply to Carlos O'Donell from comment #3)
> Rebuilding valgrind on ppc64 shows many generic testsuite failures and at
> least two SIGSEGV.

That really shouldn't be. The testsuite isn't zero fail, but it should be close. These are the results on ppc64 f22 that I am getting:

== 589 tests, 5 stderr failures, 0 stdout failures, 0 stderrB failures, 1 stdoutB failure, 2 post failures ==
gdbserver_tests/hgtls                    (stdoutB)
memcheck/tests/bug340392                 (stderr)
memcheck/tests/leak_cpp_interior         (stderr)
massif/tests/new-cpp                     (post)
massif/tests/overloaded-new              (post)
helgrind/tests/tc22_exit_w_lock          (stderr)
drd/tests/std_thread                     (stderr)
drd/tests/std_thread2                    (stderr)

This is comparable to other architectures on Fedora.
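
As a quick cross-check of the numbers above, this Python sketch tallies the listed failures by kind and compares them with the summary line. The function names are illustrative only; the input text is copied from the results above.

```python
import re
from collections import Counter

SUMMARY = ("== 589 tests, 5 stderr failures, 0 stdout failures, "
           "0 stderrB failures, 1 stdoutB failure, 2 post failures ==")

FAILED = """\
gdbserver_tests/hgtls                    (stdoutB)
memcheck/tests/bug340392                 (stderr)
memcheck/tests/leak_cpp_interior         (stderr)
massif/tests/new-cpp                     (post)
massif/tests/overloaded-new              (post)
helgrind/tests/tc22_exit_w_lock          (stderr)
drd/tests/std_thread                     (stderr)
drd/tests/std_thread2                    (stderr)
"""


def summary_counts(line):
    """Extract {kind: count} from the '== N tests, ...' summary line."""
    return {kind: int(n) for n, kind in re.findall(r"(\d+) (\w+) failures?", line)}


def listed_counts(text):
    """Count the '(kind)' suffixes on the individual failure lines."""
    return Counter(re.findall(r"\((\w+)\)\s*$", text, flags=re.M))


expected = summary_counts(SUMMARY)
listed = listed_counts(FAILED)
print(all(listed[kind] == n for kind, n in expected.items()))  # True
```

The list and the summary agree: 5 stderr, 1 stdoutB and 2 post failures.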
> e.g.
> sh: line 1: 30904 Segmentation fault      (core dumped)
> VALGRIND_LIB=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place
> VALGRIND_LIB_INNER=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place
> /root/rpmbuild/BUILD/valgrind-3.11.0/./coregrind/valgrind
> --command-line-only=yes --memcheck:leak-check=no --tool=memcheck -q
> --suppressions=supp_unknown.supp ./supp_unknown > supp_unknown.stdout.out 2>
> supp_unknown.stderr.out

That is expected; that test checks that valgrind catches a bad jump. The expected output is:

Process terminating with default action of signal 11 (SIGSEGV)
 Access not within mapped region at address 0x........
   ...
   by 0x........: main (badjump.c:17)

> And some aborts:
> 
> sh: line 1: 23853 Aborted                 (core dumped)
> PATH=/tmp/bruhaha:$PATH
> VALGRIND_LIB=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place
> VALGRIND_LIB_INNER=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place
> /root/rpmbuild/BUILD/valgrind-3.11.0/./coregrind/valgrind
> --command-line-only=yes --memcheck:leak-check=no --tool=helgrind
> ./tc22_exit_w_lock > tc22_exit_w_lock.stdout.out 2>
> tc22_exit_w_lock.stderr.out
> *** tc22_exit_w_lock failed (stderr) ***

This one fails indeed, but not because of the SIGABRT, which is intended to be caught. It tests what happens when a mutex is locked and the process gets a fatal signal.
Comment 6 Mark Wielaard 2015-11-11 12:14:21 EST
(In reply to Carlos O'Donell from comment #4)
> Is perhaps valgrind relying on ld.so to relocate all the OPDs? But here
> because it's running ld.so, we don't do that and the kernel does the load +
> opd to start the application?

That sounds like a very likely cause. Thanks. I'll look into the startup sequence and see if there is anything valgrind can do here.
Comment 7 Jan Kurik 2016-02-24 10:49:02 EST
This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle.
Changing version to '24'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora24#Rawhide_Rebase
Comment 8 Fedora End Of Life 2017-07-25 15:22:45 EDT
This message is a reminder that Fedora 24 is nearing its end of life.
Approximately 2 (two) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 24. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '24'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 24 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
Comment 9 Mark Wielaard 2017-07-27 11:46:14 EDT
This is still an issue.
Comment 10 Jan Kurik 2017-08-15 05:27:17 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 27 development cycle.
Changing version to '27'.
