Description of problem: It seems that the rusage counts are fooled by programs that exec() when pthreads are active; running / usr/bin/time or bash-builtin 'time' on such a program produces wild results. Version-Release number of selected component (if applicable): Observed on RH4u2/ppc64 (2.6.9-28.EL), RH4u3/x86-64 (2.6.9-34.ELsmp), RH4u5/ppc64 (2.6.9-55.EL). I have not found this issue on SLES10/ppc64 ( ;] ), RH5/x86-64 (2.6.18-8.el5) or (random unpatched kernel) 2.6.19.7/x86-32. How reproducible: Easily. Steps to Reproduce: 1. Compile attached test case: gcc fork_time_bug.c -o fork_time_bug.ppc -lpthread 2. (in bash) "[me ~]$ time ./fork_time_bug.ppc 100" 3. (I just logged into $random_RH_box and ran that 6 times; it failed 4 out of 6. Actual results: [matt@rh4u5]$ time ./fork_time_bug.ppc 100 real 0m4.050s user (0)+0'.m,'.332s sys ./.(-*/)m*.539s Expected results: [matt@sles10]$ time ~/tests/fork_time_bug.ppc 100 real 0m0.476s user 0m0.219s sys 0m0.135s Additional info: The supplied test program executes itself N times - making a pthread before the exec call. Interestingly, compiling the test for x86-32 and x86-64 and running both on an x86-64 RH4u3 box shows the 64bit version exhibits the failure, yes the 32bit version does not. A 64bit word with top half uninitialised perhaps...?
Created attachment 154850 [details] C source for test case
I've easily reproduced this on my x86_64 test machine. (Thanks for the test program, although the wait() syscall at the end should have an additional arg.) By experimenting with the reproducer, I've further discovered that the bogus cpu usage times are evident in the data returned by both wait4() and getrusage() syscalls, which suggests that it's the kernel's propagation of "rusage" data that is faulty. This data is calculated from fields in the task's "signal_struct", so another possibility is that there's a problem in the handling of this thread-shared structure during exec() of a multi-threaded process. Also, if the reproducer is changed to allow the sub-thread to complete before the exec() is initiated, the problem disappears. This is another indicator of a possible "signal_struct" handling problem during an exec() syscall. I'm continuing to investigate.
The problem is in de_thread(). When the "signal_struct" is still being used (by another thread) during an exec(), a new one is allocated. Many of the fields in the new one are not initialized, including all of the resource counter stats. Thus, whatever garbage is in these memory locations of the newly allocated "signal_struct" ends up being added into the getrusage() data. I'll implement and test a patch, which I think only needs to alter de_thread().
Created attachment 155849 [details] patch for fixing de_thread() to properly init signal_struct fields The attached patch should fix this problem. Here are my test results: # time fork_time_bug 100 0.022u 0.079s 0:00.10 90.0% 0+0k 0+0io 0pf+0w # time fork_time_bug 100 0.026u 0.075s 0:00.10 90.0% 0+0k 0+0io 0pf+0w # time fork_time_bug 100 0.026u 0.074s 0:00.10 90.0% 0+0k 0+0io 0pf+0w # time fork_time_bug 100 0.030u 0.067s 0:00.10 90.0% 0+0k 0+0io 0pf+0w # time fork_time_bug 100 0.024u 0.070s 0:00.10 90.0% 0+0k 0+0io 0pf+0w #
The patch above was posted for internal review on 31-May-2007.
This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.
committed in stream U6 build 55.14. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
------- Comment From suzukikp.com 2007-09-05 10:52 EDT------- I have verified the kernel-2.6.9-56 to contain the patch attached here as "linux-2.6.9-signal-coredump.patch". Antonio, Can we close this ? Thanks Suzuki
------- Comment From rosalesa.com 2007-09-06 13:01 EDT------- (In reply to comment #16) > I have verified the kernel-2.6.9-56 to contain the patch attached here as > "linux-2.6.9-signal-coredump.patch". > > Antonio, > > Can we close this ? > > Thanks > > Suzuki As this patch was previously tested to resolve this issue, and the patch has been confirmed to be in the kernel-2.6.9-56 source I am marking this bug as verified, and closing. -thanks.
------- Comment From bugzilla 2007-10-03 22:35 EST------- User petrides's account has been disabled, requested by HC
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0791.html