Bug 240349 - wait4()'s rusage sometimes wrong when threads present across exec*()
Summary: wait4()'s rusage sometimes wrong when threads present across exec*()
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.5
Hardware: All
OS: Linux
medium
medium
Target Milestone: ---
: ---
Assignee: Red Hat Kernel Manager
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2007-05-16 16:57 UTC by Matt Evans
Modified: 2009-06-19 17:08 UTC (History)
4 users (show)

Fixed In Version: RHBA-2007-0791
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-11-15 16:27:15 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
C source for test case (930 bytes, application/octet-stream)
2007-05-16 16:57 UTC, Matt Evans
no flags Details
patch for fixing de_thread() to properly init signal_struct fields (1.50 KB, patch)
2007-05-31 23:28 UTC, Ernie Petrides
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
IBM Linux Technology Center 34923 0 None None None Never
Red Hat Product Errata RHBA-2007:0791 0 normal SHIPPED_LIVE Updated kernel packages available for Red Hat Enterprise Linux 4 Update 6 2007-11-14 18:25:55 UTC

Description Matt Evans 2007-05-16 16:57:08 UTC
Description of problem:

It seems that the rusage counts are fooled by programs that exec() when pthreads are active; running /
usr/bin/time or bash-builtin 'time' on such a program produces wild results.

Version-Release number of selected component (if applicable):

Observed on RH4u2/ppc64 (2.6.9-28.EL), RH4u3/x86-64 (2.6.9-34.ELsmp), RH4u5/ppc64 
(2.6.9-55.EL).  I have not found this issue on SLES10/ppc64 ( ;] ), RH5/x86-64 (2.6.18-8.el5) or 
(random unpatched kernel) 2.6.19.7/x86-32.

How reproducible:
Easily.

Steps to Reproduce:
1.   Compile attached test case: gcc fork_time_bug.c -o fork_time_bug.ppc -lpthread
2.   (in bash)  "[me ~]$ time ./fork_time_bug.ppc 100"
3.   (I just logged into $random_RH_box and ran that 6 times; it failed 4 out of 6.
  
Actual results:

[matt@rh4u5]$ time ./fork_time_bug.ppc 100

real    0m4.050s
user    (0)+0'.m,'.332s
sys     ./.(-*/)m*.539s

Expected results:
[matt@sles10]$ time ~/tests/fork_time_bug.ppc 100

real    0m0.476s
user    0m0.219s
sys     0m0.135s

Additional info:

The supplied test program executes itself N times - making a pthread before the exec call.

Interestingly, compiling the test for x86-32 and x86-64 and running both on an x86-64 RH4u3 box 
shows the 64bit version exhibits the failure, yes the 32bit version does not.   A 64bit word with top half 
uninitialised perhaps...?

Comment 1 Matt Evans 2007-05-16 16:57:08 UTC
Created attachment 154850 [details]
C source for test case

Comment 2 Ernie Petrides 2007-05-30 00:06:42 UTC
I've easily reproduced this on my x86_64 test machine.  (Thanks for the test
program, although the wait() syscall at the end should have an additional arg.)

By experimenting with the reproducer, I've further discovered that the bogus
cpu usage times are evident in the data returned by both wait4() and getrusage()
syscalls, which suggests that it's the kernel's propagation of "rusage" data
that is faulty.  This data is calculated from fields in the task's
"signal_struct", so another possibility is that there's a problem in the
handling of this thread-shared structure during exec() of a multi-threaded
process.

Also, if the reproducer is changed to allow the sub-thread to complete before
the exec() is initiated, the problem disappears.  This is another indicator
of a possible "signal_struct" handling problem during an exec() syscall.

I'm continuing to investigate.


Comment 3 Ernie Petrides 2007-05-31 22:58:42 UTC
The problem is in de_thread().  When the "signal_struct" is still being used
(by another thread) during an exec(), a new one is allocated.  Many of the
fields in the new one are not initialized, including all of the resource
counter stats.  Thus, whatever garbage is in these memory locations of the
newly allocated "signal_struct" ends up being added into the getrusage() data.

I'll implement and test a patch, which I think only needs to alter de_thread().

Comment 4 Ernie Petrides 2007-05-31 23:28:54 UTC
Created attachment 155849 [details]
patch for fixing de_thread() to properly init signal_struct fields

The attached patch should fix this problem.  Here are my test results:

# time fork_time_bug 100
0.022u 0.079s 0:00.10 90.0%	0+0k 0+0io 0pf+0w
# time fork_time_bug 100
0.026u 0.075s 0:00.10 90.0%	0+0k 0+0io 0pf+0w
# time fork_time_bug 100
0.026u 0.074s 0:00.10 90.0%	0+0k 0+0io 0pf+0w
# time fork_time_bug 100
0.030u 0.067s 0:00.10 90.0%	0+0k 0+0io 0pf+0w
# time fork_time_bug 100
0.024u 0.070s 0:00.10 90.0%	0+0k 0+0io 0pf+0w
#

Comment 5 Ernie Petrides 2007-05-31 23:57:08 UTC
The patch above was posted for internal review on 31-May-2007.

Comment 6 RHEL Program Management 2007-06-04 21:42:06 UTC
This request was evaluated by Red Hat Kernel Team for inclusion in a Red
Hat Enterprise Linux maintenance release, and has moved to bugzilla 
status POST.

Comment 7 Jason Baron 2007-06-28 19:52:11 UTC
committed in stream U6 build 55.14. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 9 IBM Bug Proxy 2007-09-05 14:55:34 UTC
------- Comment From suzukikp.com 2007-09-05 10:52 EDT-------
I have verified the kernel-2.6.9-56 to contain the patch attached here as
"linux-2.6.9-signal-coredump.patch".

Antonio,

Can we close this ?

Thanks

Suzuki

Comment 10 IBM Bug Proxy 2007-09-06 17:05:31 UTC
------- Comment From rosalesa.com 2007-09-06 13:01 EDT-------
(In reply to comment #16)
> I have verified the kernel-2.6.9-56 to contain the patch attached here as
> "linux-2.6.9-signal-coredump.patch".
>
> Antonio,
>
> Can we close this ?
>
> Thanks
>
> Suzuki

As this patch was previously tested to resolve this issue, and the patch has
been confirmed to be in the kernel-2.6.9-56 source I am marking this bug as
verified, and closing.
-thanks.

Comment 12 IBM Bug Proxy 2007-10-04 19:20:08 UTC
------- Comment From bugzilla 2007-10-03 22:35 EST-------
User petrides's account has been disabled, requested by HC

Comment 14 errata-xmlrpc 2007-11-15 16:27:15 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0791.html



Note You need to log in before you can comment on or make changes to this bug.