Bug 1273103
| Summary: | valgrind: running programs through /lib64/ld64.so.1 does not work on ppc64 | ||
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Karsten Hopp <karsten> | 
| Component: | valgrind | Assignee: | Mark Wielaard <mjw> | 
| Status: | CLOSED EOL | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | 
| Severity: | high | Docs Contact: | |
| Priority: | medium | ||
| Version: | 27 | CC: | arjun, codonell, dodji, fweimer, jakub, law, mfabian, mjw, pfrankli, siddhesh | 
| Target Milestone: | --- | Keywords: | Reopened | 
| Target Release: | --- | ||
| Hardware: | ppc64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-11-28 19:40:21 UTC | Type: | Bug | 
| Regression: | --- | Mount Type: | --- | 
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1071880 | ||
| 
        
          Description  Karsten Hopp  2015-10-19 15:26:58 UTC
        
*** This bug has been marked as a duplicate of bug 1274974 ***

I went way back in the history, and it seems that this particular check never worked. In fact, with the F21 glibc (glibc-2.20-8.fc21.ppc64p7), I get the same failure with:

/lib64/ld64.so.1 /usr/bin/valgrind /lib64/ld64.so.1 /usr/bin/true

(That is, same trick as before, but with an installed glibc which generally supports valgrind just fine.)

So either this check is invalid, or it is a valgrind issue.

(In reply to Florian Weimer from comment #2)
> /lib64/ld64.so.1 /usr/bin/valgrind /lib64/ld64.so.1 /usr/bin/true
> So either this check is invalid, or it is a valgrind issue.

I can't see how the glibc check is invalid, but perhaps unsupportable by valgrind on ppc64 ELFv1 BE, which is the only remaining architecture with official procedure descriptors.

The "trick" is nothing more than using ld.so as it was intended to be used, and is a valid way to run an application, and the only way you can run applications under alternate loader paths for testing. Therefore it quickly becomes an important use case for core tools work or users working with bundled libraries.

On FC23 POWER8 ppc64 BE boxes the simpler test case is this:

/usr/bin/valgrind /lib64/ld-2.22.so
==3866== Memcheck, a memory error detector
==3866== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==3866== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==3866== Command: /lib64/ld-2.22.so
==3866==
==3866== Jump to the invalid address stated on the next line
==3866==    at 0x34A0: ???
==3866==  Address 0x34a0 is not stack'd, malloc'd or (recently) free'd
==3866==
==3866==
==3866== Process terminating with default action of signal 11 (SIGSEGV)
==3866==  Bad permissions for mapped region at address 0x34A0
==3866==    at 0x34A0: ???
==3866==
==3866== HEAP SUMMARY:
==3866==     in use at exit: 0 bytes in 0 blocks
==3866==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==3866==
==3866== All heap blocks were freed -- no leaks are possible
==3866==
==3866== For counts of detected and suppressed errors, rerun with: -v
==3866== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

So it looks like simply trying to run ld.so causes the problem.

The address 0x34a0 is the _start of ld.so.

gdb shows it's 0x00000000200034a0

(gdb) bt
#0  0x00000000200034ac in ._start ()

All normal.

cat /proc/3901/maps
20000000-20040000 r-xp 00000000 fd:00 67167791 /usr/lib64/ld-2.22.so
20040000-20060000 rw-p 00030000 fd:00 67167791 /usr/lib64/ld-2.22.so
3fffb7fe0000-3fffb8000000 r-xp 00000000 00:00 0 [vdso]
3ffffffd0000-400000000000 rw-p 00000000 00:00 0 [stack]

Shows the kernel put the executable mapping at 0x20000000, which is fine.

Dump of assembler code for function ._start:
   0x00000000200034a0 <+0>:  mr r3,r1
   0x00000000200034a4 <+4>:  li r4,0
   0x00000000200034a8 <+8>:  stdu r4,-128(r1)
=> 0x00000000200034ac <+12>: bl 0x20007b20 <._dl_start>
   0x00000000200034b0 <+16>: nop
   0x00000000200034b4 <+20>: b 0x200034d0 <._dl_start_user>
   0x00000000200034b8 <+24>: .long 0x0
   0x00000000200034bc <+28>: .long 0xc2440
   0x00000000200034c0 <+32>: .long 0x0
   0x00000000200034c4 <+36>: .long 0x18
   0x00000000200034c8 <+40>: .long 0x65f73
   0x00000000200034cc <+44>: andis. r1,r3,29300

The jump to 0x34a0 (unrelocated) should be valid.

Note that the entry point for ld.so is 0x50000, which is:

Dump of assembler code for function _start:
   0x0000000020050000 <+0>: .long 0x0
   0x0000000020050004 <+4>: .long 0x34a0

Which is the first OPD entry of .opd, the entry for _start.

This is normal so far.

Rebuilding valgrind on ppc64 shows many generic testsuite failures and at least two SIGSEGV. e.g.
sh: line 1: 30904 Segmentation fault (core dumped) VALGRIND_LIB=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place VALGRIND_LIB_INNER=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place /root/rpmbuild/BUILD/valgrind-3.11.0/./coregrind/valgrind --command-line-only=yes --memcheck:leak-check=no --tool=memcheck -q --suppressions=supp_unknown.supp ./supp_unknown > supp_unknown.stdout.out 2> supp_unknown.stderr.out

And some aborts:

sh: line 1: 23853 Aborted (core dumped) PATH=/tmp/bruhaha:$PATH VALGRIND_LIB=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place VALGRIND_LIB_INNER=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place /root/rpmbuild/BUILD/valgrind-3.11.0/./coregrind/valgrind --command-line-only=yes --memcheck:leak-check=no --tool=helgrind ./tc22_exit_w_lock > tc22_exit_w_lock.stdout.out 2> tc22_exit_w_lock.stderr.out
*** tc22_exit_w_lock failed (stderr) ***

Though all the none/tests/ppc64 tests pass.

A valgrind expert will have to look. I took a pass at debugging it, but the amount of infrastructure knowledge required is beyond a simple triage right now. I just wanted to give some confidence that ld.so is OK and that it's somehow related to valgrind.

Is perhaps valgrind relying on ld.so to relocate all the OPDs? But here because it's running ld.so, we don't do that and the kernel does the load + opd to start the application?

(In reply to Carlos O'Donell from comment #3)
> Rebuilding valgrind on ppc64 shows many generic testsuite failures and at
> least two SIGSEGV.

That really shouldn't be. The testsuite isn't zero fail, but it should be close.
These are the results on ppc64 f22 that I am getting:

== 589 tests, 5 stderr failures, 0 stdout failures, 0 stderrB failures, 1 stdoutB failure, 2 post failures ==
gdbserver_tests/hgtls (stdoutB)
memcheck/tests/bug340392 (stderr)
memcheck/tests/leak_cpp_interior (stderr)
massif/tests/new-cpp (post)
massif/tests/overloaded-new (post)
helgrind/tests/tc22_exit_w_lock (stderr)
drd/tests/std_thread (stderr)
drd/tests/std_thread2 (stderr)

Which is comparable to other architectures on fedora.

> e.g.
> sh: line 1: 30904 Segmentation fault (core dumped)
> VALGRIND_LIB=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place
> VALGRIND_LIB_INNER=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place
> /root/rpmbuild/BUILD/valgrind-3.11.0/./coregrind/valgrind
> --command-line-only=yes --memcheck:leak-check=no --tool=memcheck -q
> --suppressions=supp_unknown.supp ./supp_unknown > supp_unknown.stdout.out 2>
> supp_unknown.stderr.out

That is expected; that test checks that valgrind catches a bad jump. The expected output is:

Process terminating with default action of signal 11 (SIGSEGV)
 Access not within mapped region at address 0x........
...
   by 0x........: main (badjump.c:17)

> And some aborts:
>
> sh: line 1: 23853 Aborted (core dumped)
> PATH=/tmp/bruhaha:$PATH
> VALGRIND_LIB=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place
> VALGRIND_LIB_INNER=/root/rpmbuild/BUILD/valgrind-3.11.0/.in_place
> /root/rpmbuild/BUILD/valgrind-3.11.0/./coregrind/valgrind
> --command-line-only=yes --memcheck:leak-check=no --tool=helgrind
> ./tc22_exit_w_lock > tc22_exit_w_lock.stdout.out 2>
> tc22_exit_w_lock.stderr.out
> *** tc22_exit_w_lock failed (stderr) ***

This one fails indeed, but not because of the SIGABRT, which is intended to be caught. It tests what happens when a mutex is locked and the process gets a fatal signal.

(In reply to Carlos O'Donell from comment #4)
> Is perhaps valgrind relying on ld.so to relocate all the OPDs?
> But here because it's running ld.so, we don't do that and the kernel does
> the load + opd to start the application?

That sounds like a very likely cause. Thanks. I'll look into the startup sequence and see if there is anything valgrind can do here.

This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle. Changing version to '24'.

More information and reason for this action is here: https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora24#Rawhide_Rebase

This message is a reminder that Fedora 24 is nearing its end of life. Approximately 2 (two) weeks from now Fedora will stop maintaining and issuing updates for Fedora 24. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '24'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 24 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

This is still an issue.

This bug appears to have been reported against 'rawhide' during the Fedora 27 development cycle. Changing version to '27'.

This message is a reminder that Fedora 27 is nearing its end of life. On 2018-Nov-30 Fedora will stop maintaining and issuing updates for Fedora 27.
It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '27'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 27 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

Yes, this is still a problem. But given that ppc64 has been dropped since f29 (there is only ppc64le now, which doesn't have function descriptors and where things seem to work fine), I am going to close this. |
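For readers following the OPD discussion in comment #3, the arithmetic can be sketched in a few lines of Python. This is only an illustration of the hypothesis raised in the thread (that the function descriptor for _start is still unrelocated when ld.so is run directly), not a description of what valgrind actually does. The addresses come from the gdb and /proc/<pid>/maps output quoted above; the helper names `read_opd_entry` and `resolve_entry` are made up for this sketch, and the TOC and environment doublewords of the descriptor are zero placeholders because the report does not show them.

```python
import struct

# Values taken from the gdb and /proc/<pid>/maps output in comment #3.
LOAD_BIAS = 0x20000000  # where the kernel mapped ld.so
E_ENTRY = 0x50000       # ELF entry point; on ELFv1 this points into .opd

def read_opd_entry(descriptor: bytes) -> int:
    """Return the code address held in the first doubleword of a
    big-endian ELFv1 function descriptor."""
    (code_addr,) = struct.unpack(">Q", descriptor[:8])
    return code_addr

def resolve_entry(descriptor: bytes, load_bias: int, relocated: bool) -> int:
    """Return the address to jump to. If the descriptor has not been
    relocated yet, the load bias still has to be added."""
    code_addr = read_opd_entry(descriptor)
    return code_addr if relocated else code_addr + load_bias

# The two .long values gdb showed at 0x20050000 (0x0, 0x34a0), followed by
# placeholder zeros for the TOC and environment doublewords (assumption).
descriptor = struct.pack(">II", 0x0, 0x34A0) + bytes(16)

print(hex(E_ENTRY + LOAD_BIAS))                                   # descriptor address
print(hex(read_opd_entry(descriptor)))                            # raw value in the descriptor
print(hex(resolve_entry(descriptor, LOAD_BIAS, relocated=False))) # value plus load bias
```

The raw descriptor value is the 0x34a0 that valgrind reports as the invalid jump target, while adding the load bias gives 0x200034a0, the address gdb shows for ._start. That is consistent with the suggestion that the descriptor is being used without the load bias applied, though the thread never confirms the root cause.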