Description of problem:
Debugging multi-threaded 32-bit programs with a 32-bit GDB on ia64, using ia32el, the fix for bug 175083 exposed problems in multi-threaded debugging that, in the end, proved to be caused by waitpid() modifying status even when returning zero.

Before the fix for bug 175083 that resulted in gdb-6.3.0.0-1.130.EL3 (1.129.EL3 did not have the revised fix and did not trigger the problem described here), GDB was simply unable to debug multiple threads on ia32el, because the exception thrown when attempting to write to debug registers effectively disabled threaded debugging.

As soon as that was fixed, GDB would often fail this assertion in linux-nat.c:

  /* We shouldn't end up here unless we want to try again. */
  gdb_assert (status == 0);

It turned out that my_waitpid() was returning 0 but nevertheless modifying *status. Since all my_waitpid() does is repeatedly call waitpid() until it stops returning -1 with errno == EINTR, it became clear that waitpid() is at fault.

I've worked around this problem in gdb-6.3.0.0-1.132.EL3, such that waitpid saves the original status and restores it if it's about to return 0, issuing a warning while at that.

Version-Release number of selected component (if applicable):
ia32el-1.3-1.EL3.ia64
gdb-6.3.0.0-1.130.EL3.i386
gdb-6.3.0.0-1.132.EL3.i386

How reproducible:
Almost every time

Steps to Reproduce:
1. Compile and link for IA32 the print-threads program from the GDB testsuite, as requested in bug 175083
2. Run gdb from 6.3.0.0-1.130.EL3 on IA64
3. Set a breakpoint in print-threads' main()
4. Issue the `run' command, and then `continue' after the breakpoint is hit
5. Repeat with gdb-6.3.0.0-1.132.EL3

Actual results:
After 4., GDB prints an assertion failure error with 1.130, and the waitpid() warning with 1.132.
Expected results:
http://www.opengroup.org/onlinepubs/009695399/functions/waitpid.html says *stat_loc should be set if stat_loc is not a NULL pointer and waitpid() is returning the ID of one of the child processes; if it returns -1 with errno == EINTR, *stat_loc is undefined. Nothing there allows *stat_loc to change when waitpid() returns 0.

I'm not positive that the *status modification by waitpid() is caused by ia32el, but since this never happens on plain ia32, it's my prime suspect.

Additional info:
Unfortunately, using gdb to debug gdb debugging a multi-threaded program doesn't work very well because the ptrace interactions are too complex, so I can't tell for sure what's going on inside waitpid() when it fails as described above.
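Since stepping through gdb under gdb is impractical here, a standalone probe might help narrow down whether waitpid() alone misbehaves under ia32el. The following is only a sketch under assumptions: the function name probe_waitpid_status and the sentinel value are made up for illustration, and it exercises a plain WNOHANG call on an untraced child, which is not the same situation as GDB waiting on ptraced threads. A clean result would therefore not rule out the bug.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical marker value; any pattern waitpid() is unlikely to
   produce by accident will do.  */
#define STATUS_SENTINEL 0x5a5a5a5a

/* Returns 0 if waitpid() returned 0 and left status untouched,
   1 if waitpid() returned 0 but modified status,
   -1 if the probe itself could not be carried out.  */
int
probe_waitpid_status (void)
{
  int status = STATUS_SENTINEL;
  int result;
  pid_t child, ret;

  child = fork ();
  if (child == -1)
    return -1;

  if (child == 0)
    {
      /* Child: stay alive until killed, so the parent's waitpid()
         has no state change to report.  */
      for (;;)
        pause ();
    }

  /* With WNOHANG and a still-running child, waitpid() returns 0.  */
  ret = waitpid (child, &status, WNOHANG);
  if (ret != 0)
    result = -1;
  else
    result = (status != STATUS_SENTINEL);

  /* Clean up the child regardless of the outcome.  */
  kill (child, SIGKILL);
  waitpid (child, NULL, 0);

  return result;
}
```

Built as a 32-bit binary and run both on plain ia32 and under ia32el, a return value of 1 on only one of them would point at the emulation layer.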
Created attachment 131175
patch for bug 195715
Attached is a patch for IA-32 EL V6 (6097 - shipped with RHEL 4 U4 beta, ia32el-1.6-8.1.EL4.ia64.rpm). Could you please give it a try?
Intel will want to do a full test pass on this patch before it ships, if it turns out it does fix your bug. So please do this test and report back results ASAP. Thanks!
I tried to reproduce the bug without luck, so I can't say anything about the patch. aoliva, who reported the problem, is away at summit, so he can't check the patch either.
I *can* reproduce it after all. The problem is that 6.3.0.0-1.130.EL3.1rh doesn't assert, but merely warns, so I overlooked it. But it warns even with the ia32el patch applied.

But it's not that simple:

* ia32el-1.6 won't compile on EL3, due to ancient gcj. I had to port the patch to compile it (I will attach, for reference). It might have introduced more bugs in ia32el.

* the patch can't be applied to ia32el-1.3. I trimmed the parts that seemed irrelevant in context of ia32el-1.3 off the patch. Knowing nothing about internal workings of ia32el, I expect the result is nonsense (only first four chunks of the original patch were left in, the remaining four were trimmed).

Both of these may impact the results. What next?
Created attachment 131760
ia32el-1.6 support for old gcj
OK, we cannot reproduce the failure on EL3 - we will try to understand why our fix doesn't work and provide another fix.
Sorry, I mean "we can reproduce the failure on EL3"
Just to be sure, my output with ia32el-1.6-9.EL3.ia64 is as follows. The warning appears just after the 'run' command.

.qa.[root@ia64-3as root]# gdb-130/usr/bin/gdb ./print-threads
GNU gdb Red Hat Linux (6.3.0.0-1.130.EL3.1rh)
<... Copyright ...>
Using host libthread_db library "/emul/ia32-linux/lib/tls/libthread_db.so.1".

(gdb) break main
Breakpoint 1 at 0x80484d2
(gdb) run
Starting program: /root/print-threads
warning: linux_test_for_tracefork: unexpected result from waitpid (5791, status 0x0)
warning: linux_test_for_tracefork: failed to kill child
(no debugging symbols found)
[Thread debugging using libthread_db enabled]
[New Thread 1074055392 (LWP 5771)]
Error while reading shared library symbols:
Couldn't write debug register: Input/output error.
(no debugging symbols found)
warning: the debug information found in "/usr/lib/debug//lib/ld-2.3.2.so.debug" does not match "/lib/ld-linux.so.2" (CRC mismatch).
(no debugging symbols found)
[Switching to Thread 1074055392 (LWP 5771)]

Breakpoint 1, 0x080484d2 in main ()
(gdb)
This warning you're seeing is one I don't remember having seen before. What I added, and I get it after main actually runs and starts other threads, is:

  warning ("waitpid: non-zero status %x for zero return value", *status);
Aha, so what I found was a completely unrelated waitpid warning. When I check the gdb-132 output, the right warning actually is there, down below, and the patch does fix that! So OK, the patch works for us. I'll leave it to aoliva to decide what to do with the other warning.
Created attachment 131820
Fix for RHEL-3

This patch is applicable to the 1.3 version of ia32el, and appears to fix the abovementioned problem. It's basically a trimmed-down version of the 1.6 patch. I'd appreciate it if Intel looked at it and confirmed that it's not complete nonsense.
Yes, your fix for 1.3 is OK. And IA-32 EL will provide a patch specific to the 1.3 version of ia32el after validation.
Created attachment 131913
patch against ia32el V5 (1.3)

Patch against ia32el V5 (rpm package version 1.3.1)
Thanks. Any news on testing of 1.6 patch?
The 1.6 patch test is finished - I have submitted a new IT (97239) for RHEL 4 U4 which will include IA-32 EL 1.6. But it seems we need an exception for putting it into RHEL 4.4. The following is the IT 97239 information from Gary:

"Would you like to try for an exception in RHEL4 U4 or is this something you'd be comfortable waiting for until RHEL4.5?"

I don't know Gary's role or your role in EL4.4, or who will make the decision, but I myself think the bug is so serious that we need an exception for EL4.4.
Ok, thanks, I didn't know about the IT. Will coordinate with gcase.
"but I myself think the bug is so serious that we need an exception for EL4.4." Typo - it should be "I myself DON'T think....." Sorry for any inconvenience
This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/

If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.