Bug 128993

Summary:	kernel panic from handle_IPI
Product:	Red Hat Enterprise Linux 3
Component:	kernel
Version:	3.0
Hardware:	ia64
OS:	Linux
Status:	CLOSED ERRATA
Severity:	high
Priority:	medium
Reporter:	Gary Hade <garyhade>
Assignee:	Larry Woodman <lwoodman>
QA Contact:	Brian Brock <bbrock>
CC:	acjohnso, alex_williamson, glen.foster, gone, grant.grundler, jbaron, lcm, peterm, petrides, tao, tburke
Doc Type:	Bug Fix
Last Closed:	2004-08-19 00:51:02 UTC
Bug Blocks:	116727

Description Gary Hade 2004-08-02 19:57:07 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0)

Description of problem:
While running stress tests on the IBM x450 and x455 we've been 
encountering kernel panics from handle_IPI:
                       ...
[<e00000000440e9a0>] sp=0xe00000005e657930 bsp=0xe00000005e651868
                     ia64_leave_kernel [kernel] 0x0
[<e000000004448800>] sp=0xe00000005e657ad0 bsp=0xe00000005e651808
                     handle_IPI [kernel] 0x1a0


Version-Release number of selected component (if applicable):
kernel-2.4.21-18.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. Subject the system to heavy test load.

Actual Results:  kernel panic

Expected Results:  no kernel panic

Additional info:

Comment 1 Ernie Petrides 2004-08-05 20:23:41 UTC

*** Bug 128724 has been marked as a duplicate of this bug. ***

Comment 2 Ernie Petrides 2004-08-05 20:25:35 UTC

*** Bug 128847 has been marked as a duplicate of this bug. ***

Comment 3 Ernie Petrides 2004-08-05 20:27:37 UTC

*** Bug 129011 has been marked as a duplicate of this bug. ***

Comment 4 Ernie Petrides 2004-08-05 20:30:17 UTC

Adding this to U3 must-fix list.  -ernie

Comment 5 Larry Woodman 2004-08-09 15:40:06 UTC

Gary, do you have a test case that will help me reproduce this problem?

I have tried without success.


Thanks, Larry Woodman

Comment 6 Gary Hade 2004-08-09 18:14:45 UTC

Larry, The test that reproduces the problem is a collection of 
various test suites that are run in parallel to stress the system.  
Unfortunately, some of the test suites in the collection contain IP 
that prevents me from giving it to you.  What have you tried?  Do you 
also have a stress test suite that you have tried on an IBM or other 
ia64 system?

Comment 7 Larry Woodman 2004-08-10 16:23:30 UTC

I think this problem is fixed.  We want to fix it by taking the
spinlock before altering the call_data_struct, rather than by
eliminating the static definition, so that the disk dump still works
properly.

***********************************************************************
--- linux-2.4.21/arch/ia64/kernel/smp.c.orig    2004-08-10 11:09:31.000000000 -0400
+++ linux-2.4.21/arch/ia64/kernel/smp.c 2004-08-10 11:10:10.000000000 -0400
@@ -296,6 +296,8 @@
        if (!cpus)
                return 0;

+       spin_lock_bh(&call_lock);
+
        data.func = func;
        data.info = info;
        atomic_set(&data.started, 0);
@@ -303,8 +305,6 @@
        if (wait > 0)
                atomic_set(&data.finished, 0);

-       spin_lock_bh(&call_lock);
-
        call_data = &data;
        mb();   /* ensure store to call_data precedes setting of IPI_CALL_FUNC */
        send_IPI_allbutself(IPI_CALL_FUNC);
*********************************************************************
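
The ordering change in the patch above can be illustrated outside the
kernel.  What follows is only a minimal userspace sketch under stated
assumptions, not the RHEL3 code: a pthread mutex stands in for
call_lock, the "IPI handler" is simply called inline by the sender
instead of running on other CPUs, and the names call_function,
handle_call, check_owner, run_callers and struct slot are hypothetical.
It shows the one point the patch makes: the lock is taken before the
shared static structure is written, so a second caller cannot overwrite
func/info while an earlier call is still using them.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Simplified userspace analogue of the kernel's call data (assumption). */
struct call_data {
    void (*func)(void *info);
    void *info;
};

static struct call_data data;            /* shared static, as in the patch */
static struct call_data *call_data_ptr;  /* what the handler reads */
static pthread_mutex_t call_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the IPI handler: run the currently published function. */
static void handle_call(void)
{
    struct call_data *d = call_data_ptr;
    assert(d != NULL && d->func != NULL);
    d->func(d->info);
}

/* Take the lock BEFORE filling in the shared structure, then publish it. */
static int call_function(void (*func)(void *), void *info)
{
    pthread_mutex_lock(&call_lock);
    data.func = func;
    data.info = info;
    call_data_ptr = &data;
    handle_call();               /* inline here; other CPUs in the kernel */
    pthread_mutex_unlock(&call_lock);
    return 0;
}

/* Per-thread bookkeeping used to detect a call that saw another caller's info. */
struct slot {
    pthread_t owner;
    long hits;
    long mismatches;
    long iterations;
};

static void check_owner(void *info)
{
    struct slot *s = info;
    if (pthread_equal(pthread_self(), s->owner))
        s->hits++;
    else
        s->mismatches++;
}

static void *caller(void *arg)
{
    struct slot *s = arg;
    s->owner = pthread_self();
    for (long i = 0; i < s->iterations; i++)
        call_function(check_owner, s);
    return NULL;
}

#define MAX_CALLERS 16

/* Returns the number of lost or mismatched calls, or -1 on setup failure. */
long run_callers(int nthreads, long iterations)
{
    pthread_t tids[MAX_CALLERS];
    struct slot slots[MAX_CALLERS];
    long bad = 0;
    int created = 0;

    if (nthreads < 1 || nthreads > MAX_CALLERS)
        return -1;
    for (int i = 0; i < nthreads; i++) {
        slots[i].hits = 0;
        slots[i].mismatches = 0;
        slots[i].iterations = iterations;
        if (pthread_create(&tids[i], NULL, caller, &slots[i]) != 0)
            break;
        created++;
    }
    for (int i = 0; i < created; i++) {
        pthread_join(tids[i], NULL);
        bad += slots[i].mismatches + (iterations - slots[i].hits);
    }
    return created == nthreads ? bad : -1;
}
```

Calling run_callers(4, 1000) starts four concurrent callers; with the
lock taken first, every call should observe its own func/info pair.
Moving the pthread_mutex_lock() call below the two assignments would
recreate the window that the patch closes.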


Please rerun the Busy test with this kernel ASAP so that we can
include this change in RHEL3-U3.  It's located at:

http://people.redhat.com/~lwoodman/IA64/

Larry Woodman

Comment 8 Gary Hade 2004-08-10 18:23:53 UTC

Larry, Our stress tests are running on your debug kernel on two of 
our boxes, an x450 and an x455.  I'll let you know how it goes.

Comment 9 Gary Hade 2004-08-11 16:45:33 UTC

Larry, I think you have this one nailed.  The tests have been running 
continuously on both systems for over 22 hours with no problems.  I 
believe I was previously seeing the panic in less than 4 hours.  I'll 
allow the tests to continue and give you further updates but if you 
absolutely need to get the changes in today I think you're good to 
go.  Thanks.

Comment 10 Larry Woodman 2004-08-11 20:08:41 UTC

Thanks for the quick turnaround, Gary.  Please rerun the test one more
time ASAP with the final version located at:

http://people.redhat.com/~lwoodman/IA64/


We want to build a final U3 kernel as soon as you can confirm this run.

Thanks, Larry Woodman

Comment 11 Gary Hade 2004-08-11 23:49:53 UTC

I installed/booted 2.4.21-18.ia64debugfinal.EL on the same two 
systems and started the tests.  I'll give you a report tomorrow 
morning.

Comment 12 Gary Hade 2004-08-12 18:24:56 UTC

Bad news.  The tests are still running on the x455 but there was an 
Oops on the x450 after I left for home last night.  The tests had run 
almost 5 hours.  The following call trace was copied manually from the 
console.

Call Trace: [<e0000000044158e0>] sp=0xe00000006d107a30
    bsp=0xe00000006d101548  show_stack [kernel] 0x80
[<e0000000044512d0>] sp=0xe00000006d107c00
    bsp=0xe00000006d101520  die [kernel] 0x200
[<e0000000044512d0>] sp=0xe00000006d107c00
    bsp=0xe00000006d1014c8  ia64_do_page_fault [kernel] 0x310
 [<e00000000440e9a0>] sp=0xe00000006d107c90 
    bsp=0xe00000006d1014c8  ia64_leave_kernel [kernel] 0x0 
[<e00000000451bea0>] sp=0xe00000006d107e30
    bsp=0xe00000006d1014c8 buffer_insert_list [kernel] 0xc0

Comment 13 Larry Woodman 2004-08-12 19:41:21 UTC

Gary, can you try to reproduce this OOPS and grab a full
traceback?  There isn't enough information to debug this, but at first
glance it doesn't look anything like the handle_IPI() OOPS that we were
seeing before.

Thanks, Larry

Comment 14 Gary Hade 2004-08-13 00:34:05 UTC

Larry, Earlier today I got a serial line set up on the x450 so we can 
capture the entire Oops output in a scrollable window if this happens 
again.  The tests are running.  Starting tomorrow I will be gone 
until Monday but Chris McDermott is monitoring and will provide the 
Oops output if this reproduces again.

Comment 15 Ernie Petrides 2004-08-13 04:29:31 UTC

Hi, Gary.  We know that Larry has fixed a problem in the x86 and
ia64 versions of smp_call_function() that would account for the
originally reported problem in handle_IPI().  Please assume that
the new problem you've encountered on your x450 is something
different, and open a new bugzilla for that issue.

Thanks.  -ernie

Comment 16 Ernie Petrides 2004-08-13 04:31:08 UTC

A fix for this problem has just been committed to the RHEL3 U3
patch pool this evening (in kernel version 2.4.21-19.EL).

Comment 17 Gary Hade 2004-08-16 16:49:03 UTC

The stress tests continued to run through the weekend on both the 
x450 and x455 with no problems.

Ernie, The problem I reported in comment 12 appears to be (1) very 
difficult to reproduce and (2) unrelated to the handle_IPI() Oops.  
Per your comment 15 request I just submitted Bug 130022 to track this 
apparently new problem.

Comment 18 Ernie Petrides 2004-08-16 21:20:58 UTC

Gary, thanks very much for opening the new bug.  I just assigned
it to Larry Woodman.

Comment 19 Jay Turner 2004-08-19 00:51:02 UTC

Closing out this issue based on confirmation from the original reporter.

Comment 20 John Flanagan 2004-09-02 04:32:11 UTC

An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-433.html