The following has been reported by IBM LTC:
In RHEL 3 U4, the top command gave a segmentation fault

PROBLEM DESCRIPTION
---------------------------------------------------------------------------
We were running fsstress and some LTP tests on x335b with RHEL3 U4 (kernel: Linux x335b 2.4.21-21.ELsmp #1 SMP Fri Oct 1 09:28:06 EDT 2004 i686 i686 i386 GNU/Linux). Meanwhile, we also had the top command running. After some 10 to 12 hours we saw the message "segmentation fault". This defect looks like the defect filed against SUSE with the same problem (bug number 9297).

Mike, You fixed https://bugzilla.linux.ibm.com/show_bug.cgi?id=9297 but that is a 2.6 kernel. Did you submit a fix to mainline and 2.4? Thanks.

No, I did not attempt to fix this bug in 2.4. The bug is fixed in mainline 2.6. Didn't think to back port the 2.6 fix to 2.4. I can create a patch for the latest 2.4 kernel and send it to Marcelo.

(In reply to comment #2)
> No, I did not attempt to fix this bug in 2.4. The bug is fixed in mainline 2.6.
> Didn't think to back port 2.6 fix to 2.4. I can create a patch for the latest
> 2.4 kernel and send to Marcelo.
Please attach the fix to this bug report and we will let the test team test it out first. Thanks.

Mike, assigning problem to you since you are providing the fix. Thanks.

Created an attachment (id=7311)
Patch for kernel-2.4.21-21.EL

I created this patch for the source in kernel-2.4.21-21.EL.src.rpm that I found on the ftp site. Hope this is the 'correct' kernel. I'm not sure. Also, I have not tested this as I don't have easy access to a machine for testing. Can someone try it to ensure that it does solve the problem?

I'm attempting to move this to the FIXEDAWAITINGTEST state, due to there being a patch available. If this patch doesn't work, or needs to be for another kernel version, please let me know.

Dave Barrera, Please make sure your team tests Mike's patch. We have a deadline to submit the patch to RH. Thanks.
Created attachment 106450 [details] 2.4.21_proc.patch
----- Additional Comments From dbarrera.com 2004-11-10 16:59 EDT ------- The India team is out on holiday, which presents a problem for us. We are going to try to test it here in Austin.
Created attachment 106464 [details] "2.4.21_proc.patch2"
----- Additional Comments From mkravetz.com(prefers email via kravetz.com) 2004-11-10 20:09 EDT ------- Updated version of the patch Better version of the patch that will apply with '-p1'. Note that the code changes are the same, I just changed the format of the data.
----- Additional Comments From dgardnr.com 2004-11-11 11:47 EDT ------- I downloaded the patch and built a kernel with it. I then started top, fsstress and a couple of other tests. I did not see the message "segmentation fault" because I hit another bug that I already have open - 11109. That is an assertion in do_get_write_access caused by fsstress. That problem always occurs for me within an hour or two. Until that bug is fixed, I will not be able to test this problem.
----- Additional Comments From salina.com 2004-11-11 13:08 EDT ------- David, Thanks for trying, also looks like there are other problems besides 11109 e.g. for ext3 https://bugzilla.linux.ibm.com/show_bug.cgi?id=11637 We may have to pick up multiple fixes etc. before we can test this one. Let me know if you are willing to do that and re-test. Thanks.
----- Additional Comments From mkravetz.com(prefers email via kravetz.com) 2004-11-11 13:24 EDT -------
David, You don't need to run your fsstress tests to recreate/test this problem. Here is what you can do. Use the source code below to build two simple programs:

Source for program fe_long
----------------------------------------
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t c;

        while (1) {
            c = fork();
            if (c > 0) {
                (void)wait(NULL);       /* parent: reap the child, then loop */
            } else if (c == 0) {
                /* child: exec the program with the shorter name */
                execl("./fe", "./fe", (char *)NULL);
                _exit(127);             /* only reached if execl() fails */
            }
        }
    }

Source for program fe
--------------------------------
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>

    int main(void)
    {
        exit(0);
    }

Note that the program fe_long simply forks and execs the program fe in an infinite loop. The key here is generating many instances where a program execs another program with a shorter name.

After building the programs, start up top with no delay: 'top -d 0'. Then start up several instances of the program 'fe_long'. I would suggest 'n instances' where n is the number of CPUs in the system. Also note that multiple CPUs are almost required to recreate/test this problem. I really wouldn't expect one to recreate this on a single CPU system.

On a kernel without the fix, you should see top segfault within an hour. Hopefully, much sooner (like 5 minutes). On a kernel with the fix, there should be no segfault.
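The "one fe_long instance per CPU" step above could be automated with a small launcher. What follows is only a sketch under stated assumptions, not something posted in this report: the helper names online_cpus(), spawn_workers() and reap_workers() are hypothetical, and _SC_NPROCESSORS_ONLN is a glibc extension rather than a POSIX guarantee.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: number of CPUs currently online, or 1 if unknown.
 * _SC_NPROCESSORS_ONLN is a glibc extension (assumption: Linux/glibc). */
int online_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : 1;
}

/* Hypothetical helper: fork `count` children that each exec `path`.
 * The child pids are stored in `pids`; returns how many were started. */
int spawn_workers(const char *path, int count, pid_t *pids)
{
    int started = 0;

    for (int i = 0; i < count; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            break;
        }
        if (pid == 0) {
            execl(path, path, (char *)NULL);
            _exit(127);                 /* only reached if execl() fails */
        }
        pids[started++] = pid;
    }
    return started;
}

/* Hypothetical helper: wait for every worker in `pids` and return how
 * many of them exited normally with status 0. */
int reap_workers(const pid_t *pids, int count)
{
    int ok = 0;

    for (int i = 0; i < count; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

A main() would call spawn_workers("./fe_long", online_cpus(), pids) while 'top -d 0' runs in another terminal. Because fe_long loops forever, reap_workers() only returns once the workers have been killed, so it is meant for cleanup after the test rather than for the test itself.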
----- Additional Comments From dgardnr.com 2004-11-11 21:20 EDT ------- I will install a multiple cpu machine and retest the fix.
----- Additional Comments From prakapn.com 2004-11-14 23:55 EDT ------- Thanks David! Marking this defect as TESTED.
----- Additional Comments From prakapn.com 2004-11-15 00:30 EDT ------- Looks like this fix may not go into RHEL3 U4, since U4 is already closed (see 10072)?
Last build of U4 was last week. No fix has yet been committed to U5.
----- Additional Comments From mkravetz.com(prefers email via kravetz.com) 2004-11-16 13:45 EDT -------
Date: Tue, 16 Nov 2004 08:16:04 -0200
From: Marcelo Tosatti <marcelo.tosatti>
Subject: Re: [PATCH] Task name handling for 2.4
To: Mike Kravetz <kravetz.com>

Mike,

I've saved it to 2.4.29pre.

Thanks

On Fri, Nov 12, 2004 at 09:31:16AM -0800, Mike Kravetz wrote:
> Hi Marcelo,
>
> There is a problem with task name handling in the /proc fs. See
> http://www.ussg.iu.edu/hypermail/linux/kernel/0407.1/0136.html
> for the patch that eventually made its way into the 2.6 tree.
>
> We now have people experiencing the same problem/bug in 2.4. Here
> is a patch for 2.4 that implements the same fix. Please consider
> applying.
>
> Thanks,
> Signed-off-by: Mike Kravetz <kravetz.com>
<snip>
----- Additional Comments From salina.com 2004-11-16 17:59 EDT ------- Mainline accepted Mike's patch. Can we please have this committed for U5 if it is too late for U4? Thanks.
----- Additional Comments From mkravetz.com(prefers email via kravetz.com) 2004-11-15 12:00 EDT ------- FYI - On Friday I sent the patch to Marcelo for inclusion in 2.4 mainline. http://www.ussg.iu.edu/hypermail/linux/kernel/0411.1/1417.html
---- Additional Comments From prakapn.com 2005-03-18 04:11 EST ------- Verification is in progress with RHEL3 U5 (2.4.21-31).
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ACCEPTED                    |CLOSED

------- Additional Comments From prakapn.com 2005-03-23 00:41 EST -------
Verified that the top command is stable in RHEL3 U5. Closing the defect report.
Glen, could you please explain what's going on here? No fix for this problem has been committed to U5, so I'm not sure why anyone on your end attempted to verify that the problem is fixed.
---- Additional Comments From salina.com 2005-03-31 14:52 EST ------- Sorry, our test team was anxious to test this. We had tried to request a RHEL 3 U5 target.
---- Additional Comments From corryk.com(prefers email via kevcorry.com) 2005-07-11 14:56 EDT ------- Hi Michael, Prakash, Salina, Do we know if this patch was picked up for RHEL3-U5? If so, let's set this bug to "accepted". If not, let's move the target-milestone out to RHEL3-U6. Thanks!
---- Additional Comments From salina.com 2005-07-11 16:01 EDT ------- kernel-source-2.4.21-32.EL which is RHEL 3 U5 kernel, still does not have Mike's patch.
Glen, no kernel fix related to this made it into U6.
---- Additional Comments From jstultz.com(prefers email via johnstul.com) 2005-09-13 20:47 EDT ------- Any update on this bug?
---- Additional Comments From mkravetz.com(prefers email via kravetz.com) 2005-09-13 22:15 EDT ------- Not sure who you are talking to John. If there is anything else I (as bug owner) can do to help, let me know. Patch has been provided and even accepted in mainline.
It was proposed for RHEL-U7 on 9/12. No work has been done on it as of yet.
Please download and test the kernel found here:

  http://people.redhat.com/anderson/.BZ_138730

In this location, you will find an i386 smp kernel, the associated kernel src.rpm, and the patch that was applied:

  kernel-smp-2.4.21-37.3.EL.bz138730.i686.rpm
  kernel-2.4.21-37.3.EL.bz138730.src.rpm
  linux-kernel-test.patch

So far I haven't been able to reproduce the problem with a 4-cpu system running a kernel without the patch. Please report your test results back to the Bugzilla.

Thanks,
Dave Anderson
BTW, my test consists of running 4 "fe_long" tasks, along with a "top -d 0", on a 4-cpu box. It's still running strong on an unpatched kernel for well over 2 hours.
Two updates:

1. I was able to reproduce the top segfault.
2. But the kernel above (kernel-smp-2.4.21-37.3.EL.bz138730.i686.rpm) may not boot! Apparently 2.4.21-37.1 introduced a patch associated with NX-in-kernel-code and largepages, that causes some i386 machines to go into an infinite reboot cycle.

I will update the kernel in the http://people.redhat.com/anderson directory listed above.
I have replaced the test kernel binary, src.rpm and applied patch in:

  http://people.redhat.com/anderson/.BZ_138730

with:

  kernel-smp-2.4.21-37.EL.BZ138730.i686.rpm
  kernel-2.4.21-37.EL.BZ138730.src.rpm
  linux-kernel-test.patch

I am testing it now, but we require the reporting partner's test and buy-in of the test kernel.
Glen, if this bugzilla doesn't need to remain IBM-confidential, then please uncheck the two "IBM Confidential Group" boxes below. Thanks.
Thanks, Glen. Completing transition to public bug.
*** Bug 162683 has been marked as a duplicate of this bug. ***
A fix for this problem has just been committed to the RHEL3 U7 patch pool this evening (in kernel version 2.4.21-37.5.EL).
Is the fix for this the same patch that is attached to this report? If not, is it possible to point me to the patch that was finally used or the SRPM containing it? Thanks...james
See comment #30, and click on the link. The patch is "linux-kernel-test.patch".
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0144.html