Bug 128993 - kernel panic from handle_IPI
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: ia64 Linux
Priority: medium  Severity: high
Assigned To: Larry Woodman
QA Contact: Brian Brock
Depends On:
Blocks: 116727
Reported: 2004-08-02 15:57 EDT by Gary Hade
Modified: 2007-11-30 17:07 EST

Doc Type: Bug Fix
Last Closed: 2004-08-18 20:51:02 EDT
Description Gary Hade 2004-08-02 15:57:07 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0)

Description of problem:
While running stress tests on the IBM x450 and x455 we've been 
encountering kernel panics from handle_IPI:
                       ...
[<e00000000440e9a0>] sp=0xe00000005e657930 bsp=0xe00000005e651868
                     ia64_leave_kernel [kernel] 0x0
[<e000000004448800>] sp=0xe00000005e657ad0 bsp=0xe00000005e651808
                     handle_IPI [kernel] 0x1a0


Version-Release number of selected component (if applicable):
kernel-2.4.21-18.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. Subject the system to heavy test load.

Actual Results:  kernel panic

Expected Results:  no kernel panic

Additional info:
Comment 1 Ernie Petrides 2004-08-05 16:23:41 EDT
*** Bug 128724 has been marked as a duplicate of this bug. ***
Comment 2 Ernie Petrides 2004-08-05 16:25:35 EDT
*** Bug 128847 has been marked as a duplicate of this bug. ***
Comment 3 Ernie Petrides 2004-08-05 16:27:37 EDT
*** Bug 129011 has been marked as a duplicate of this bug. ***
Comment 4 Ernie Petrides 2004-08-05 16:30:17 EDT
Adding this to U3 must-fix list.  -ernie
Comment 5 Larry Woodman 2004-08-09 11:40:06 EDT
Gary, do you have a test case that will help me reproduce this problem?

I have tried without success.


Thanks, Larry Woodman
Comment 6 Gary Hade 2004-08-09 14:14:45 EDT
Larry, The test that reproduces the problem is a collection of 
various test suites that are run in parallel to stress the system.  
Unfortunately, some of the test suites in the collection contain IP 
that prevents me from giving it to you.  What have you tried?  Do you 
also have a stress test suite that you have tried on an IBM or other 
ia64 system?
  
Comment 7 Larry Woodman 2004-08-10 12:23:30 EDT
I think this problem is fixed.  The way we want to fix it is to take
the spinlock before altering call_data_struct rather than eliminating
the static definition so that the disk dump still works properly.

***********************************************************************
--- linux-2.4.21/arch/ia64/kernel/smp.c.orig    2004-08-10 11:09:31.000000000 -0400
+++ linux-2.4.21/arch/ia64/kernel/smp.c 2004-08-10 11:10:10.000000000 -0400
@@ -296,6 +296,8 @@
        if (!cpus)
                return 0;
 
+       spin_lock_bh(&call_lock);
+
        data.func = func;
        data.info = info;
        atomic_set(&data.started, 0);
@@ -303,8 +305,6 @@
        if (wait > 0)
                atomic_set(&data.finished, 0);
 
-       spin_lock_bh(&call_lock);
-
        call_data = &data;
        mb();   /* ensure store to call_data precedes setting of IPI_CALL_FUNC */
        send_IPI_allbutself(IPI_CALL_FUNC);
*********************************************************************
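
For anyone reading along, here is a rough userspace sketch of why the lock has to be taken before the fields are written. It is not the kernel code: it assumes, as described above, that the call data is a single static instance shared by all callers, and it uses a pthread mutex in place of call_lock. The names fake_smp_call_function, fake_handle_ipi, expect_a, expect_b and run_demo are made up for illustration. Two threads repeatedly publish a func/info pair; because each one holds the lock while filling in the shared struct, the handler never sees one caller's func paired with the other caller's info.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's call_data_struct. */
struct call_data_struct {
    void (*func)(void *info);
    void *info;
    atomic_int started;
    atomic_int finished;
};

static struct call_data_struct data;        /* static: shared by every caller */
static struct call_data_struct *call_data;  /* what the handler dereferences */
static pthread_mutex_t call_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int mismatches;               /* func/info pairs that did not match */

struct tagged { char tag; };
static struct tagged a = { 'A' }, b = { 'B' };

static void expect_a(void *info)
{
    struct tagged *t = info;
    if (t->tag != 'A')
        atomic_fetch_add(&mismatches, 1);
}

static void expect_b(void *info)
{
    struct tagged *t = info;
    if (t->tag != 'B')
        atomic_fetch_add(&mismatches, 1);
}

/* Hypothetical stand-in for handle_IPI(): runs whatever call_data describes. */
static void fake_handle_ipi(void)
{
    struct call_data_struct *d = call_data;
    assert(d != NULL);
    atomic_fetch_add(&d->started, 1);
    d->func(d->info);
    atomic_fetch_add(&d->finished, 1);
}

/* Take the lock first, then fill in the shared struct (the patched ordering). */
static void fake_smp_call_function(void (*func)(void *), void *info)
{
    pthread_mutex_lock(&call_lock);
    data.func = func;
    data.info = info;
    atomic_store(&data.started, 0);
    atomic_store(&data.finished, 0);
    call_data = &data;
    fake_handle_ipi();      /* stands in for sending the IPI and waiting */
    pthread_mutex_unlock(&call_lock);
}

static void *caller_a(void *unused)
{
    (void)unused;
    for (int i = 0; i < 100000; i++)
        fake_smp_call_function(expect_a, &a);
    return NULL;
}

static void *caller_b(void *unused)
{
    (void)unused;
    for (int i = 0; i < 100000; i++)
        fake_smp_call_function(expect_b, &b);
    return NULL;
}

/* Runs two competing callers and returns the number of mismatched pairs. */
int run_demo(void)
{
    pthread_t ta, tb;
    atomic_store(&mismatches, 0);
    pthread_create(&ta, NULL, caller_a, NULL);
    pthread_create(&tb, NULL, caller_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return atomic_load(&mismatches);
}
```

Calling run_demo() from a small main() and building with -pthread should report zero mismatches. Moving pthread_mutex_lock() down to just before the call_data assignment mirrors the pre-patch ordering, where one caller can overwrite data.func or data.info while another caller's request is still being handled.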


Please rerun the Busy test with this kernel ASAP so that we can
include this change in RHEL3-U3.  It's located at:

http://people.redhat.com/~lwoodman/IA64/

Larry Woodman



Comment 8 Gary Hade 2004-08-10 14:23:53 EDT
Larry, Our stress tests are running on your debug kernel on two of 
our boxes, an x450 and an x455.  I'll let you know how it goes.
Comment 9 Gary Hade 2004-08-11 12:45:33 EDT
Larry, I think you have this one nailed.  The tests have been running 
continuously on both systems for over 22 hours with no problems.  I 
believe I was previously seeing the panic in less than 4 hours.  I'll 
allow the tests to continue and give you further updates but if you 
absolutely need to get the changes in today I think you're good to 
go.  Thanks.



    
Comment 10 Larry Woodman 2004-08-11 16:08:41 EDT
Thanks for the quick turnaround, Gary.  Please rerun the test one more
time ASAP with the final version located in:

http://people.redhat.com/~lwoodman/IA64/


We want to build a final U3 kernel as soon as you can confirm this run.

Thanks, Larry Woodman
Comment 11 Gary Hade 2004-08-11 19:49:53 EDT
I installed/booted 2.4.21-18.ia64debugfinal.EL on the same two 
systems and started the tests.  I'll give you a report tomorrow 
morning.
Comment 12 Gary Hade 2004-08-12 14:24:56 EDT
Bad news.  The tests are still running on the x455 but there was an 
Oops on the x450 after I left for home last night.  The tests had run 
almost 5 hours.  The following call trace was copied manually from the 
console.

Call Trace: [<e0000000044158e0>] sp=0xe00000006d107a30
    bsp=0xe00000006d101548  show_stack [kernel] 0x80
[<e0000000044512d0>] sp=0xe00000006d107c00
    bsp=0xe00000006d101520  die [kernel] 0x200
[<e0000000044512d0>] sp=0xe00000006d107c00
    bsp=0xe00000006d1014c8  ia64_do_page_fault [kernel] 0x310
 [<e00000000440e9a0>] sp=0xe00000006d107c90 
    bsp=0xe00000006d1014c8  ia64_leave_kernel [kernel] 0x0 
[<e00000000451bea0>] sp=0xe00000006d107e30
    bsp=0xe00000006d1014c8 buffer_insert_list [kernel] 0xc0

 
Comment 13 Larry Woodman 2004-08-12 15:41:21 EDT
Gary, can you try to reproduce this Oops and grab a full
traceback?  There isn't enough information to debug this, but at first
glance it doesn't look anything like the handle_IPI() Oops that we were
seeing before.

Thanks, Larry
Comment 14 Gary Hade 2004-08-12 20:34:05 EDT
Larry, Earlier today I got a serial line set up on the x450 so we can 
capture the entire Oops output in a scrollable window if this happens 
again.  The tests are running.  Starting tomorrow I will be gone 
until Monday but Chris McDermott is monitoring and will provide the 
Oops output if this reproduces again.
Comment 15 Ernie Petrides 2004-08-13 00:29:31 EDT
Hi, Gary.  We know that Larry has fixed a problem in the x86 and
ia64 versions of smp_call_function() that would account for the
originally reported problem in handle_IPI().  Please assume that
the new problem you've encountered on your x450 is something
different, and open a new bugzilla for that issue.

Thanks.  -ernie
Comment 16 Ernie Petrides 2004-08-13 00:31:08 EDT
A fix for this problem has just been committed to the RHEL3 U3
patch pool this evening (in kernel version 2.4.21-19.EL).
Comment 17 Gary Hade 2004-08-16 12:49:03 EDT
The stress tests continued to run through the weekend on both the 
x450 and x455 with no problems.

Ernie, The problem I reported in comment 12 appears to be (1) very 
difficult to reproduce and (2) unrelated to the handle_IPI() Oops.  
Per your comment 15 request I just submitted Bug 130022 to track this 
apparently new problem.
   
Comment 18 Ernie Petrides 2004-08-16 17:20:58 EDT
Gary, thanks very much for opening the new bug.  I just assigned
it to Larry Woodman.
Comment 19 Jay Turner 2004-08-18 20:51:02 EDT
Closing out this issue based on confirmation from the original reporter.
Comment 20 John Flanagan 2004-09-02 00:32:11 EDT
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-433.html
