Bug 195715 - waitpid() modifies status even when returning zero
Summary: waitpid() modifies status even when returning zero
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: ia32el
Version: 3.0
Hardware: ia64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Petr Machata
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-06-16 18:45 UTC by Alexandre Oliva
Modified: 2015-05-05 01:32 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-10-19 18:43:16 UTC
Target Upstream Version:
Embargoed:


Attachments
patch for bug 195715 (1.95 KB, patch)
2006-06-20 07:25 UTC, Eric Lin
ia32el-1.6 support for old gcj (2.26 KB, patch)
2006-06-29 17:47 UTC, Petr Machata
Fix for RHEL-3 (1.67 KB, patch)
2006-06-30 16:55 UTC, Petr Machata
patch against ia32el V5 (1.3) (1.41 KB, patch)
2006-07-05 04:59 UTC, Eric Lin

Description Alexandre Oliva 2006-06-16 18:45:59 UTC
Description of problem:
When debugging multi-threaded 32-bit programs with a 32-bit GDB on ia64 under
ia32el, the fix for bug 175083 exposed problems in multi-threaded debugging
that, in the end, proved to be caused by waitpid() modifying status even when
returning zero.

Before the fix for bug 175083 that resulted in gdb-6.3.0.0-1.130.EL3 (1.129.EL3
did not have the revised fix and did not trigger the problem described here),
GDB was simply unable to debug multiple threads on ia32el, because the exception
thrown when attempting to write to debug registers effectively disabled threaded
debugging.

As soon as that was fixed, GDB would often fail this assertion in linux-nat.c:
      /* We shouldn't end up here unless we want to try again.  */
      gdb_assert (status == 0);

It turned out that my_waitpid was returning 0 but nevertheless modifying
*status.  Since all my_waitpid() does is to repeatedly call waitpid() until it
stops returning -1 with errno == EINTR, it became clear that waitpid() is at
fault.  I've worked around this problem in gdb-6.3.0.0-1.132.EL3, such that
waitpid saves the original status and restores it if it's about to return 0,
issuing a warning while at that.

Version-Release number of selected component (if applicable):
ia32el-1.3-1.EL3.ia64
gdb-6.3.0.0-1.130.EL3.i386
gdb-6.3.0.0-1.132.EL3.i386

How reproducible:
Almost every time

Steps to Reproduce:
1. Compile and link for IA32 the print-thread program from the GDB testsuite, as
   requested in bug 175083
2. Run gdb from 6.3.0.0-1.130.EL3 on IA64
3. Set a breakpoint in print-thread's main()
4. Issue the `run' command, and then `continue' after the breakpoint is hit
5. Repeat with gdb-6.3.0.0-1.132.EL3

Actual results:
After 4., GDB prints an assertion failure error with 1.130, and the waitpid()
warning with 1.132.

Expected results:
http://www.opengroup.org/onlinepubs/009695399/functions/waitpid.html
says *stat_loc should be set if stat_loc is not a null pointer and
waitpid() returns the ID of one of the child processes; if it
returns -1 with EINTR, *stat_loc is undefined.
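
A minimal single-process probe for this expectation might look like the sketch
below.  It is only an assumption about how one could check the behaviour: the
function name and the sentinel value are invented here, and it does not
exercise the multi-threaded ptrace scenario in which the problem was actually
observed.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical probe.  Returns 1 if waitpid() left *status alone on a
   zero return, 0 if it modified it, and -1 if the probe could not be
   run as intended.  */
static int
status_preserved_on_zero_return (void)
{
  const int sentinel = 0x5a5a;
  int status = sentinel;
  pid_t child = fork ();

  if (child < 0)
    return -1;
  if (child == 0)
    {
      /* Child: block until the parent kills us.  */
      for (;;)
        pause ();
    }

  /* The child is alive and has not changed state, so WNOHANG should
     make waitpid() return 0.  */
  pid_t ret = waitpid (child, &status, WNOHANG);
  int result = (ret == 0) ? (status == sentinel) : -1;

  /* Clean up the child regardless of the outcome.  */
  kill (child, SIGKILL);
  waitpid (child, NULL, 0);
  return result;
}
```

A result of 0 from such a probe would correspond to the behaviour reported
here; 1 is what the reading of the specification above calls for.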

I'm not positive that the *status modification by waitpid() is caused by ia32el,
but since this never happens on plain ia32, it's my prime suspect.

Additional info:
Unfortunately, using gdb to debug gdb debugging a multi-threaded program doesn't
work very well because the ptrace interactions are too complex, so I can't tell
for sure what's going on inside waitpid when it fails as described above.

Comment 1 Eric Lin 2006-06-20 07:25:10 UTC
Created attachment 131175 [details]
patch for bug 195715

Comment 2 Eric Lin 2006-06-20 07:31:39 UTC
Attached is a patch for IA-32 EL V6 (6097 - shipped with RHEL 4 U4 beta,
ia32el-1.6-8.1.EL4.ia64.rpm). Could you please give it a try?



Comment 3 Geoff Gustafson 2006-06-28 14:59:10 UTC
Intel will want to do a full test pass on this patch before it ships, if it
turns out it does fix your bug. So please do this test and report back results
ASAP. Thanks!


Comment 4 Petr Machata 2006-06-29 16:40:37 UTC
I tried to reproduce the bug without luck, so I can't say anything about the
patch. aoliva, who reported the problem, is away at summit, so he can't check
the patch either.

Comment 5 Petr Machata 2006-06-29 17:42:34 UTC
I *can* reproduce it after all.  The problem is that 6.3.0.0-1.130.EL3.1rh
doesn't assert, but merely warns, so I overlooked it.  But it warns even with
the ia32el patch applied.  And it's not that simple:

* ia32el-1.6 won't compile on EL3, due to ancient gcj.  I had to port the patch
to compile it (I will attach, for reference).  It might have introduced more
bugs in ia32el.

* the patch can't be applied to ia32el-1.3.  I trimmed the parts that seemed
irrelevant in context of ia32el-1.3 off the patch.  Knowing nothing about
internal workings of ia32el, I expect the result is nonsense (only first four
chunks of the original patch were left in, the remaining four were trimmed).

Both of these may impact the results.  What next?

Comment 6 Petr Machata 2006-06-29 17:47:19 UTC
Created attachment 131760 [details]
ia32el-1.6 support for old gcj

Comment 7 Eric Lin 2006-06-30 06:31:39 UTC
OK, we cannot reproduce the failure on EL3 - we will try to understand why our
fix doesn't work and provide another fix.

Comment 8 Eric Lin 2006-06-30 06:43:27 UTC
Sorry, I mean "we can reproduce the failure on EL3"

Comment 9 Petr Machata 2006-06-30 11:30:52 UTC
Just to be sure, my output with ia32el-1.6-9.EL3.ia64 is as follows.  The
warning appears just after the 'run' command.

.qa.[root@ia64-3as root]# gdb-130/usr/bin/gdb ./print-threads
GNU gdb Red Hat Linux (6.3.0.0-1.130.EL3.1rh)
<... Copyright ...>
Using host libthread_db library "/emul/ia32-linux/lib/tls/libthread_db.so.1".

(gdb) break main
Breakpoint 1 at 0x80484d2
(gdb) run
Starting program: /root/print-threads
warning: linux_test_for_tracefork: unexpected result from waitpid (5791, status 0x0)
warning: linux_test_for_tracefork: failed to kill child
(no debugging symbols found)
[Thread debugging using libthread_db enabled]
[New Thread 1074055392 (LWP 5771)]
Error while reading shared library symbols:
Couldn't write debug register: Input/output error.
(no debugging symbols found)
warning: the debug information found in "/usr/lib/debug//lib/ld-2.3.2.so.debug"
does not match "/lib/ld-linux.so.2" (CRC mismatch).

(no debugging symbols found)
[Switching to Thread 1074055392 (LWP 5771)]

Breakpoint 1, 0x080484d2 in main ()
(gdb)


Comment 10 Alexandre Oliva 2006-06-30 15:23:33 UTC
This warning you're seeing is one I don't remember having seen before.  What I
added, and what I get after main actually runs and starts other threads, is:

        warning ("waitpid: non-zero status %x for zero return value",
                 *status);



Comment 11 Petr Machata 2006-06-30 16:00:00 UTC
Aha, so what I found was a completely unrelated waitpid warning. When I check
the gdb-132 output, the right warning actually is there, further down, and the
patch does fix that!

So OK, the patch works for us.  I'll leave it to aoliva to decide what to do
with the other warning.

Comment 12 Petr Machata 2006-06-30 16:55:23 UTC
Created attachment 131820 [details]
Fix for RHEL-3

This patch is applicable to the 1.3 version of ia32el, and appears to fix the
abovementioned problem.  It's basically a trimmed-down version of the 1.6 patch.
I'd appreciate it if Intel looked at it and confirmed that it's not complete
nonsense.

Comment 13 Eric Lin 2006-07-03 07:19:19 UTC
Yes, your fix for 1.3 is OK.
And IA-32 EL will provide a patch specific to the 1.3 version of ia32el after
validation.

Comment 14 Eric Lin 2006-07-05 04:59:22 UTC
Created attachment 131913 [details]
patch against ia32el V5 (1.3) 

Patch against ia32el V5 (rpm package version 1.3.1)

Comment 15 Petr Machata 2006-07-05 21:59:11 UTC
Thanks. Any news on testing of 1.6 patch?

Comment 16 Eric Lin 2006-07-06 02:05:41 UTC
The 1.6 patch test is finished - I have submitted a new IT (97239) for RHEL 4 U4
which will include IA-32 EL 1.6. But it seems we need an exception for putting
it into RHEL 4.4. The following is the IT 97239 information from Gary:

"Would you like to try for an exception in RHEL4 U4 or is this something you'd 
be comfortable waiting for until RHEL4.5?"

I don't know Gary's role or your role in EL4.4, or who will make the
decision, but I myself think the bug is so serious that we need an exception for
EL4.4.

Comment 17 Petr Machata 2006-07-06 14:33:20 UTC
Ok, thanks, I didn't know about the IT.  Will coordinate with gcase.

Comment 18 Eric Lin 2006-07-07 00:55:23 UTC
"but I myself think the bug is so serious that we need an exception for 
EL4.4."

Typo -  it should be "I myself DON'T think....." 

Sorry for any inconvenience

Comment 19 RHEL Program Management 2007-10-19 18:43:16 UTC
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.

