Bug 138730 - LTC12369-In RHEL 3 U4 -- top command gave segmentation fault
Summary: LTC12369-In RHEL 3 U4 -- top command gave segmentation fault
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Dave Anderson
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 168424
 
Reported: 2004-11-10 21:41 UTC by IBM Bug Proxy
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version: RHSA-2006-0144
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-15 15:46:15 UTC
Target Upstream Version:
Embargoed:


Attachments
2.4.21_proc.patch (3.12 KB, patch)
2004-11-10 21:42 UTC, IBM Bug Proxy
"2.4.21_proc.patch2" (2.82 KB, text/plain)
2004-11-11 01:11 UTC, IBM Bug Proxy


Links
System: Red Hat Product Errata
ID: RHSA-2006:0144
Private: 0
Priority: qe-ready
Status: SHIPPED_LIVE
Summary: Moderate: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 7
Last Updated: 2006-03-15 05:00:00 UTC

Description IBM Bug Proxy 2004-11-10 21:41:18 UTC
The following has been reported by IBM LTC:
In RHEL 3 U4 -- top command gave segmentation fault


PROBLEM  DESCRIPTION
---------------------------------------------------------------------------
We were running fsstress and some LTP tests on x335b with RHEL3 U4
(kernel: Linux x335b 2.4.21-21.ELsmp #1 SMP Fri Oct 1 09:28:06 EDT 2004
i686 i686 i386 GNU/Linux), with the top command running at the same
time.  After some 10 to 12 hours, top exited with the message
"segmentation fault".  This defect looks like the one filed against
SUSE for the same problem (bug number 9297).

Mike,
You fixed 
https://bugzilla.linux.ibm.com/show_bug.cgi?id=9297
but that is a 2.6 kernel.
Did you submit a fix to mainline and 2.4 ?  Thanks.


No, I did not attempt to fix this bug in 2.4.  The bug is fixed in
mainline 2.6.  Didn't think to back port 2.6 fix to 2.4.  I can create
a patch for the latest 2.4 kernel and send to Marcelo.

(In reply to comment #2)
> No, I did not attempt to fix this bug in 2.4.  The bug is fixed in
> mainline 2.6.  Didn't think to back port 2.6 fix to 2.4.  I can create
> a patch for the latest 2.4 kernel and send to Marcelo.

Please attach the fix to this bug report and we will let the test team
test it out first.   Thanks.

Mike, 
assigning problem to you since you are providing the fix.  Thanks.

Created an attachment (id=7311)
Patch for kernel-2.4.21-21.EL

I created this patch for the source in kernel-2.4.21-21.EL.src.rpm
that I found on the ftp site.  Hope this is the 'correct' kernel.  I'm
not sure.  Also, I have not tested this as I don't have easy access to
a machine for testing.  Can someone try it to ensure that it does solve
the problem?

I'm attempting to move this to the FIXEDAWAITINGTEST state, due to
there being a patch available.  If this patch doesn't work, or needs
to be for another kernel version, please let me know. 

Dave Barrera,
Please make sure your team tests Mike's patch.
We have a deadline to submit the patch to RH.
Thanks.

Comment 1 IBM Bug Proxy 2004-11-10 21:42:09 UTC
Created attachment 106450 [details]
2.4.21_proc.patch

Comment 2 IBM Bug Proxy 2004-11-10 22:02:51 UTC
----- Additional Comments From dbarrera.com  2004-11-10 16:59 EDT -------
The India team is out on holiday, which presents a problem for us.  We are 
going to try to test it here in Austin. 

Comment 3 IBM Bug Proxy 2004-11-11 01:11:24 UTC
Created attachment 106464 [details]
"2.4.21_proc.patch2"

Comment 4 IBM Bug Proxy 2004-11-11 01:11:45 UTC
----- Additional Comments From mkravetz.com(prefers email via kravetz.com)  2004-11-10 20:09 EDT -------
 
Updated version of the patch

Better version of the patch that will apply with '-p1'.  Note that the code
changes are the same, I just changed the format of the data. 

Comment 5 IBM Bug Proxy 2004-11-11 16:48:12 UTC
----- Additional Comments From dgardnr.com  2004-11-11 11:47 EDT -------
I downloaded the patch and built a kernel with it. I then started top, fsstress
and a couple of other tests. I did not see the message "segmentation fault" because
I hit another bug that I already have open - 11109. That is an assertion in
do_get_write_access caused by fsstress. That problem always occurs for me within an
hour or two. Until that bug is fixed, I will not be able to test this problem. 

Comment 6 IBM Bug Proxy 2004-11-11 18:06:08 UTC
----- Additional Comments From salina.com  2004-11-11 13:08 EDT -------
David,
Thanks for trying.  It also looks like there are other problems besides 11109,
e.g. for ext3:
https://bugzilla.linux.ibm.com/show_bug.cgi?id=11637
We may have to pick up multiple fixes etc. before we can test this one.
Let me know if you are willing to do that and re-test.  Thanks. 

Comment 7 IBM Bug Proxy 2004-11-11 18:28:04 UTC
----- Additional Comments From mkravetz.com(prefers email via kravetz.com)  2004-11-11 13:24 EDT -------
David,

You don't need to run your fsstress tests to recreate/test this problem.  Here
is what you can do.  Use the source code below to build two simple programs:

Source for program fe_long
----------------------------------------
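#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
        pid_t c;

        while (1) {
                c = fork();
                if (c > 0) {
                        /* parent: wait for the child, then loop */
                        (void)wait(NULL);
                } else if (c == 0) {
                        /* child: exec the program with the shorter name */
                        execl("./fe", "./fe", (char *)NULL);
                        _exit(1);       /* exec failed; do not loop in the child */
                }
                /* c < 0: fork failed; retry on the next iteration */
        }
}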

Source for program fe
--------------------------------
#include <stdlib.h>     /* exit() */
#include <unistd.h>
#include <sys/types.h>

int main(void)
{
        exit(0);
}

Note that the program fe_long simply forks and execs the program fe in an
infinite loop.  The key here is generating many instances where a program execs
another program with a shorter name.

After building the programs, start up top with no delay 'top -d 0'.  Then start
up several instances of the program 'fe_long'.  I would suggest 'n instances'
where n is the number of CPUs in the system.  Also note that multiple CPUs are
almost required to recreate/test this problem.  I really wouldn't expect one to
recreate this on a single CPU system.

On a kernel without the fix, you should see top segfault within an hour. 
Hopefully, much sooner (like 5 minutes).  On a kernel with the fix, there should
be no segfault. 
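
If starting the 'n instances' by hand is tedious, a small launcher along the
following lines could do it.  This is only a sketch: the helper names
online_cpus, spawn_instances and reap_all are made up for illustration and are
not part of the patch or of any existing test suite, and
_SC_NPROCESSORS_ONLN is a glibc extension rather than strict POSIX.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Number of online CPUs, never less than 1 (hypothetical helper). */
long online_cpus(void)
{
        long n = sysconf(_SC_NPROCESSORS_ONLN);

        return n > 0 ? n : 1;
}

/*
 * Fork `count` children that each exec `path` (e.g. "./fe_long").
 * Returns the number of children actually started.
 */
int spawn_instances(const char *path, int count)
{
        int started = 0;
        int i;

        for (i = 0; i < count; i++) {
                pid_t pid = fork();

                if (pid < 0)
                        break;          /* fork failed: stop launching */
                if (pid == 0) {
                        execl(path, path, (char *)NULL);
                        _exit(127);     /* only reached if exec failed */
                }
                started++;
        }
        return started;
}

/* Block until every child of this process has exited. */
void reap_all(void)
{
        while (wait(NULL) > 0)
                ;
}
```

A main() would call spawn_instances("./fe_long", (int)online_cpus()) and then
reap_all().  Since fe_long never exits on its own, reap_all() only returns once
the instances have been killed.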

Comment 8 IBM Bug Proxy 2004-11-12 02:23:27 UTC
----- Additional Comments From dgardnr.com  2004-11-11 21:20 EDT -------
I will install a multiple cpu machine and retest the fix. 

Comment 9 IBM Bug Proxy 2004-11-15 04:59:35 UTC
----- Additional Comments From prakapn.com  2004-11-14 23:55 EDT -------
Thanks David! Marking this defect as TESTED. 

Comment 10 IBM Bug Proxy 2004-11-15 13:54:30 UTC
----- Additional Comments From prakapn.com  2004-11-15 00:30 EDT -------
Looks like this fix may not go into RHEL3 U4 since U4 is already closed (see 
10072)? 

Comment 11 Ernie Petrides 2004-11-15 20:28:32 UTC
Last build of U4 was last week.  No fix has yet been committed to U5.


Comment 12 IBM Bug Proxy 2004-11-16 18:49:13 UTC
----- Additional Comments From mkravetz.com(prefers email via kravetz.com)  2004-11-16 13:45 EDT -------
Date: Tue, 16 Nov 2004 08:16:04 -0200
From: Marcelo Tosatti <marcelo.tosatti>
Subject: Re: [PATCH] Task name handling for 2.4
To: Mike Kravetz <kravetz.com>
                                                                                
Mike,
                                                                                
I've saved it to 2.4.29pre.
                                                                                
Thanks
                                                                                
On Fri, Nov 12, 2004 at 09:31:16AM -0800, Mike Kravetz wrote:
> Hi Marcelo,
>
> There is a problem with task name handling in the /proc fs.  See
> http://www.ussg.iu.edu/hypermail/linux/kernel/0407.1/0136.html
> for the patch that eventually made its way into the 2.6 tree.
>
> We now have people experiencing the same problem/bug in 2.4.  Here
> is a patch for 2.4 that implements the same fix.  Please consider
> applying.
>
> Thanks,
> Signed-off-by: Mike Kravetz <kravetz.com>
<snip> 
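
The patch itself is snipped above.  For readers who want the general idea, the
following user-space sketch assumes the fix works by making every read and
write of the task name go through one lock, so that a reader never sees a name
that is half old and half new.  That is an assumption about the approach, not
a copy of the patch: struct task, lock_task, set_comm and get_comm are
hypothetical names, and a C11 atomic_flag spin lock stands in for whatever
lock the kernel uses.

```c
#include <stdatomic.h>
#include <string.h>

#define COMM_LEN 16     /* size of the name buffer, NUL included */

struct task {
        atomic_flag lock;       /* stand-in for a per-task lock */
        char comm[COMM_LEN];
};

static void lock_task(struct task *t)
{
        while (atomic_flag_test_and_set_explicit(&t->lock,
                                                 memory_order_acquire))
                ;       /* spin until the lock is free */
}

static void unlock_task(struct task *t)
{
        atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

/* Writer side (what exec would do): replace the name under the lock. */
void set_comm(struct task *t, const char *name)
{
        lock_task(t);
        strncpy(t->comm, name, COMM_LEN - 1);   /* pads the tail with NULs */
        t->comm[COMM_LEN - 1] = '\0';
        unlock_task(t);
}

/* Reader side (what /proc would do): copy the whole name under the lock. */
void get_comm(char *buf, struct task *t)
{
        lock_task(t);
        memcpy(buf, t->comm, COMM_LEN);
        unlock_task(t);
}
```

The caller passes get_comm() a buffer of at least COMM_LEN bytes and then
formats from that private copy, which cannot change underneath it.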

Comment 13 IBM Bug Proxy 2004-11-16 23:03:57 UTC
----- Additional Comments From salina.com  2004-11-16 17:59 EDT -------

mainline accepted Mike's patch.

Can we please have this committed for U5 if it is too late for U4?   Thanks. 

Comment 14 IBM Bug Proxy 2004-11-28 19:28:14 UTC
----- Additional Comments From mkravetz.com(prefers email via kravetz.com)  2004-11-15 12:00 EDT -------
FYI - On Friday I sent the patch to Marcelo for inclusion in 2.4 mainline.

http://www.ussg.iu.edu/hypermail/linux/kernel/0411.1/1417.html 

Comment 15 IBM Bug Proxy 2005-03-18 09:14:55 UTC
---- Additional Comments From prakapn.com  2005-03-18 04:11 EST -------
Verification is in progress with RHEL3 U5 (2.4.21-31). 

Comment 16 IBM Bug Proxy 2005-03-23 05:46:21 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ACCEPTED                    |CLOSED




------- Additional Comments From prakapn.com  2005-03-23 00:41 EST -------
Verified that top command is stable in RHEL3 U5.
Closing the defect report. 

Comment 19 Ernie Petrides 2005-03-31 03:13:12 UTC
Glen, could you please explain what's going on here?  No fix for this
problem has been committed to U5, so I'm not sure why anyone on your
end attempted to verify that the problem is fixed.

Comment 20 IBM Bug Proxy 2005-03-31 19:59:47 UTC
---- Additional Comments From salina.com  2005-03-31 14:52 EST -------
Sorry, our test team was anxious to test this.   We had tried to request a RHEL 
3 U5 target. 

Comment 21 IBM Bug Proxy 2005-07-11 19:02:24 UTC
---- Additional Comments From corryk.com(prefers email via kevcorry.com)  2005-07-11 14:56 EDT -------
Hi Michael, Prakash, Salina,
Do we know if this patch was picked up for RHEL3-U5? If so, let's set this bug
to "accepted". If not, let's move the target-milestone out to RHEL3-U6. Thanks! 

Comment 22 IBM Bug Proxy 2005-07-11 20:08:27 UTC
---- Additional Comments From salina.com  2005-07-11 16:01 EDT -------
kernel-source-2.4.21-32.EL

which is RHEL 3 U5 kernel, still does not have Mike's patch. 

Comment 23 Ernie Petrides 2005-07-21 23:53:51 UTC
Glen, no kernel fix related to this made it into U6.

Comment 24 IBM Bug Proxy 2005-09-14 00:50:45 UTC
---- Additional Comments From jstultz.com(prefers email via johnstul.com)  2005-09-13 20:47 EDT -------
Any update on this bug? 

Comment 25 IBM Bug Proxy 2005-09-14 02:22:09 UTC
---- Additional Comments From mkravetz.com(prefers email via kravetz.com)  2005-09-13 22:15 EDT -------
Not sure who you are talking to John.  If there is anything else I (as bug
owner) can do to help, let me know.  Patch has been provided and even accepted
in mainline. 

Comment 26 Dave Anderson 2005-09-14 12:40:48 UTC
It was proposed for RHEL-U7 on 9/12.  No work has been done on it as of yet.

Comment 27 Dave Anderson 2005-09-23 19:03:09 UTC
Please download and test the kernel found here:

  http://people.redhat.com/anderson/.BZ_138730

In this location, you will find an i386 smp kernel, the associated kernel
src.rpm, and the patch that was applied:

  kernel-smp-2.4.21-37.3.EL.bz138730.i686.rpm
  kernel-2.4.21-37.3.EL.bz138730.src.rpm
  linux-kernel-test.patch

So far I haven't been able to reproduce the problem with a 4-cpu system
running a kernel without the patch.

Please report your test results back to the Bugzilla.

Thanks,
  Dave Anderson




Comment 28 Dave Anderson 2005-09-23 20:23:53 UTC
BTW, my test consists of running 4 "fe_long" tasks, along with a "top -d 0",
on a 4-cpu box.  It's still running strong on an unpatched kernel for well
over 2 hours.
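
Waiting for top to crash is a slow signal.  One possible alternative is to
poll /proc/<pid>/stat directly and flag any record that does not look sane.
The sketch below covers only the checking step.  This report does not say what
the malformed output looked like, so the rules used here (leading pid, a
parenthesized name of at most 15 bytes to match the 16-byte name buffer of the
2.4 kernels discussed here, then a one-letter state) and the name stat_line_ok
are assumptions.

```c
#include <ctype.h>
#include <string.h>

#define TASK_COMM_MAX 15        /* 16-byte name buffer minus the NUL */

/*
 * Return 1 if `line` looks like "<pid> (<comm>) <state> ..." with a
 * comm of at most TASK_COMM_MAX bytes, 0 otherwise.
 */
int stat_line_ok(const char *line)
{
        const char *p = line;
        const char *open, *close;

        if (!isdigit((unsigned char)*p))
                return 0;
        while (isdigit((unsigned char)*p))
                p++;
        if (p[0] != ' ' || p[1] != '(')
                return 0;
        open = p + 1;

        /* Use the last ')' because the name itself may contain ')'. */
        close = strrchr(open, ')');
        if (close == NULL)
                return 0;
        if ((size_t)(close - open - 1) > TASK_COMM_MAX)
                return 0;
        if (close[1] != ' ' || !isalpha((unsigned char)close[2]))
                return 0;
        return 1;
}
```

A polling loop would read each /proc/<pid>/stat into a buffer, pass it to
stat_line_ok(), and log the raw line whenever the check fails.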

Comment 29 Dave Anderson 2005-09-26 15:02:19 UTC
Two updates:

1. I was able to reproduce the top segfault.
2. But the kernel above (kernel-smp-2.4.21-37.3.EL.bz138730.i686.rpm) may not
   boot!

Apparently 2.4.21-37.1 introduced a patch associated with NX-in-kernel-code
and largepages, that causes some i386 machines to go into an infinite reboot
cycle.  

I will update the kernel in the http://people.redhat.com/anderson directory
listed above.


Comment 30 Dave Anderson 2005-09-26 18:07:16 UTC
I have replaced the test kernel binary, src.rpm and applied patch in: 

  http://people.redhat.com/anderson/.BZ_138730

with:
  
  kernel-smp-2.4.21-37.EL.BZ138730.i686.rpm
  kernel-2.4.21-37.EL.BZ138730.src.rpm
  linux-kernel-test.patch

I am testing it now, but we require the reporting partner's test and buy-in
of the test kernel.




Comment 32 Ernie Petrides 2005-09-30 21:56:41 UTC
Glen, if this bugzilla doesn't need to remain IBM-confidential, then
please uncheck the two "IBM Confidential Group" boxes below.  Thanks.


Comment 33 Ernie Petrides 2005-10-04 18:57:11 UTC
Thanks, Glen.  Completing transition to public bug.

Comment 34 Ernie Petrides 2005-10-04 20:25:22 UTC
*** Bug 162683 has been marked as a duplicate of this bug. ***

Comment 37 Ernie Petrides 2005-10-08 02:16:38 UTC
A fix for this problem has just been committed to the RHEL3 U7
patch pool this evening (in kernel version 2.4.21-37.5.EL).


Comment 38 James Olin Oden 2005-10-12 18:29:22 UTC
Is the fix for this the same patch that is attached to this report?  If not is 
it possible to point me to the patch that was finally used or the SRPM 
containing it.

Thanks...james

Comment 39 Dave Anderson 2005-10-12 18:39:35 UTC
See comment #30, and click on the link.  The patch is "linux-kernel-test.patch".

Comment 44 Red Hat Bugzilla 2006-03-15 15:46:15 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0144.html


