Red Hat Bugzilla – Bug 195715
waitpid() modifies status even when returning zero
Last modified: 2015-05-04 21:32:24 EDT
Description of problem:
Debugging multi-threaded 32-bit programs with a 32-bit GDB on ia64, using
ia32el, the fix for bug 175083 exposed problems in multi-threaded debugging
that, in the end, proved to be caused by waitpid() modifying status even when
returning zero.
Before the fix for bug 175083 that resulted in gdb-18.104.22.168-1.130.EL3 (1.129.EL3
did not have the revised fix and did not trigger the problem described here),
GDB was simply unable to debug multiple threads on ia32el, because the exception
thrown when attempting to write to debug registers effectively disabled threaded
debugging.
As soon as that was fixed, GDB would often fail this assertion in linux-nat.c:
/* We shouldn't end up here unless we want to try again. */
gdb_assert (status == 0);
It turned out that my_waitpid was returning 0 but nevertheless modifying
*status. Since all my_waitpid() does is to repeatedly call waitpid() until it
stops returning -1 with errno == EINTR, it became clear that waitpid() is at
fault. I've worked around this problem in gdb-22.214.171.124-1.132.EL3, such that
my_waitpid() saves the original status and restores it if it's about to return 0,
issuing a warning when it does so.
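A minimal sketch of the kind of workaround described above, for illustration only. It is not the code that went into GDB: the function name waitpid_keep_status and the warning text are hypothetical, and the sketch assumes the caller passes a non-NULL, initialised status.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper, not GDB's actual my_waitpid(): retry waitpid()
   while it fails with EINTR, and make sure *status is left as the caller
   set it whenever the final return value is 0.  */
pid_t
waitpid_keep_status (pid_t pid, int *status, int flags)
{
  int saved;
  pid_t ret;

  assert (status != NULL);      /* This sketch requires a status pointer.  */
  saved = *status;              /* Caller must have initialised *status.  */

  do
    ret = waitpid (pid, status, flags);
  while (ret == -1 && errno == EINTR);

  if (ret == 0 && *status != saved)
    {
      /* waitpid() reported "nothing to report" but still wrote to
         *status: warn and undo the write.  */
      fprintf (stderr,
               "warning: waitpid: status changed to %x for zero return value\n",
               (unsigned int) *status);
      *status = saved;
    }

  return ret;
}
```

With a wrapper like this, a caller that sets status to 0 before the call can keep asserting status == 0 after a zero return, which is the invariant the linux-nat.c assertion relies on.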
Version-Release number of selected component (if applicable):
How reproducible:
Almost every time
Steps to Reproduce:
1. Compile and link for IA32 the print-thread program from the GDB testsuite, as
requested in bug 175083
2. Run gdb from 126.96.36.199-1.130.EL3 on IA64
3. Set a breakpoint in print-thread's main()
4. Issue the `run' command, and then `continue' after the breakpoint is hit
5. Repeat with gdb-188.8.131.52-1.132.EL3
After step 4, GDB prints an assertion failure error with 1.130, and the waitpid()
warning with 1.132.
says *stat_loc should be set if stat_loc is not NULL pointer and
waitpid is returning ID of one of the child processes, or if
returning with -1/EINTR then *stat_loc is undefined.
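The expectation above can be checked in isolation with a small standalone test. This is only a sketch under stated assumptions: it does not involve GDB, ptrace or threads, so it may well pass on a system where the failure reported here still occurs. The function name wnohang_preserves_status and the sentinel value are made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_SENTINEL 0x5a5a5a5a

/* Hypothetical check: returns 1 if waitpid() left *status alone while
   returning 0, 0 if it modified it, and -1 if the check could not be
   set up (fork failed or waitpid() did not return 0).  */
int
wnohang_preserves_status (void)
{
  int status = STATUS_SENTINEL;
  int exit_status = 0;
  int untouched;
  pid_t child, ret, reaped;

  child = fork ();
  if (child == -1)
    return -1;
  if (child == 0)
    {
      /* Child: sleep until the parent kills it, so there is no state
         change for the parent to collect.  */
      for (;;)
        pause ();
    }

  /* The child is still running, so with WNOHANG waitpid() should
     return 0 and, per the expectation above, not touch status.  */
  ret = waitpid (child, &status, WNOHANG);
  untouched = (status == STATUS_SENTINEL);

  /* Clean up: kill the child and reap it, retrying on EINTR.  */
  kill (child, SIGKILL);
  do
    reaped = waitpid (child, &exit_status, 0);
  while (reaped == -1 && errno == EINTR);
  assert (reaped == child);

  if (ret != 0)
    return -1;
  return untouched;
}
```

Building this as a 32-bit binary and running it under ia32el would show whether the plain WNOHANG path is affected; a result of 0 would point at waitpid() itself rather than at GDB.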
I'm not positive that the *status modification by waitpid() is caused by ia32el,
but since this never happens on plain ia32, it's my prime suspect.
Unfortunately, using gdb to debug gdb debugging a multi-threaded program doesn't
work very well because the ptrace interactions are too complex, so I can't tell
for sure what's going on inside waitpid when it fails as described above.
Created attachment 131175 [details]
patch for bug 195715
Attached is a patch for IA-32 EL V6 (6097 - shipped with RHEL 4 U4 beta,
ia32el-1.6-8.1.EL4.ia64.rpm), could you please have a try?
Intel will want to do a full test pass on this patch before it ships, if it
turns out it does fix your bug. So please do this test and report back results
I tried to reproduce the bug without luck, so I can't say anything about the
patch. aoliva, who reported the problem, is away at summit, so he can't check
the patch either.
I *can* reproduce it after all. The problem is that 184.108.40.206-1.130.EL3.1rh
doesn't assert, but merely warns, so I overlooked it. It still warns even with
the ia32el patch applied. But it's not that simple:
* ia32el-1.6 won't compile on EL3, due to ancient gcj. I had to port the patch
to compile it (I will attach, for reference). It might have introduced more
bugs in ia32el.
* the patch can't be applied to ia32el-1.3. I trimmed the parts that seemed
irrelevant in context of ia32el-1.3 off the patch. Knowing nothing about
internal workings of ia32el, I expect the result is nonsense (only first four
chunks of the original patch were left in, the remaining four were trimmed).
Both of these may impact the results. What next?
Created attachment 131760 [details]
ia32el-1.6 support for old gcj
OK, we cannot reproduce the failure on EL3 - we will try to understand why our
fix doesn't work and provide another fix.
Sorry, I mean "we can reproduce the failure on EL3"
Just to be sure, my output with ia32el-1.6-9.EL3.ia64 is as follows. The warning
appears just after the 'run' command.
.qa.[root@ia64-3as root]# gdb-130/usr/bin/gdb ./print-threads
GNU gdb Red Hat Linux (220.127.116.11-1.130.EL3.1rh)
<... Copyright ...>
Using host libthread_db library "/emul/ia32-linux/lib/tls/libthread_db.so.1".
(gdb) break main
Breakpoint 1 at 0x80484d2
Starting program: /root/print-threads
warning: linux_test_for_tracefork: unexpected result from waitpid (5791, status 0x0)
warning: linux_test_for_tracefork: failed to kill child
(no debugging symbols found)
[Thread debugging using libthread_db enabled]
[New Thread 1074055392 (LWP 5771)]
Error while reading shared library symbols:
Couldn't write debug register: Input/output error.
(no debugging symbols found)
warning: the debug information found in "/usr/lib/debug//lib/ld-2.3.2.so.debug"
does not match "/lib/ld-linux.so.2" (CRC mismatch).
(no debugging symbols found)
[Switching to Thread 1074055392 (LWP 5771)]
Breakpoint 1, 0x080484d2 in main ()
The warning you're seeing is one I don't remember having seen before. What I
added, and what I get after main actually runs and starts other threads, is:
warning ("waitpid: non-zero status %x for zero return value",
Aha, so what I found was a completely unrelated waitpid warning. When I check
the gdb-132 output, the right warning actually is there, further down, and the
patch does fix that!
So OK, the patch works for us. I'll leave it to aoliva to decide what with the
Created attachment 131820 [details]
Fix for RHEL-3
This patch is applicable to the 1.3 version of ia32el, and appears to fix the
above-mentioned problem. It's basically a trimmed-down version of the 1.6 patch.
I'd appreciate it if Intel looked at it and confirmed that it's not complete
nonsense.
Yes, your fix for 1.3 is OK.
And IA-32 EL will provide a patch specific for 1.3 version of ia32el after
Created attachment 131913 [details]
patch against ia32el V5 (1.3)
Patch against ia32el V5 (rpm package version 1.3.1)
Thanks. Any news on testing of 1.6 patch?
The 1.6 patch test is finished - I have submitted a new IT (97239) for RHEL 4 U4
which will include IA-32 EL 1.6. But it seems we need an exception for putting
it into RHEL 4.4. The following is the IT 97239 information from Gary:
"Would you like to try for an exception in RHEL4 U4 or is this something you'd
be comfortable waiting for until RHEL4.5?"
I don't know Gary's role or your role in EL4.4, or who will make the
decision, but I myself think the bug is so serious that we need an exception for
Ok, thanks, I didn't know about the IT. Will coordinate with gcase.
"but I myself think the bug is so serious that we need an exception for
Typo - it should be "I myself DON'T think....."
Sorry for any inconvenience
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet that criteria, it is now being closed.
For more information of the RHEL errata support policy, please visit:
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.