From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0)

Description of problem:
While running stress tests on the IBM x450 and x455 we've been encountering kernel panics from handle_IPI:

...
[<e00000000440e9a0>] sp=0xe00000005e657930 bsp=0xe00000005e651868 ia64_leave_kernel [kernel] 0x0
[<e000000004448800>] sp=0xe00000005e657ad0 bsp=0xe00000005e651808 handle_IPI [kernel] 0x1a0

Version-Release number of selected component (if applicable):
kernel-2.4.21-18.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. Subject the system to heavy test load.

Actual Results: kernel panic

Expected Results: no kernel panic

Additional info:
*** Bug 128724 has been marked as a duplicate of this bug. ***
*** Bug 128847 has been marked as a duplicate of this bug. ***
*** Bug 129011 has been marked as a duplicate of this bug. ***
Adding this to U3 must-fix list. -ernie
Gary, do you have a test case that will help me reproduce this problem? I have tried without success. Thanks, Larry Woodman
Larry, The test that reproduces the problem is a collection of various test suites that are run in parallel to stress the system. Unfortunately, some of the test suites in the collection contain IP that prevents me from giving it to you. What have you tried? Do you also have a stress test suite that you have tried on an IBM or other ia64 system?
I think this problem is fixed. The way we want to fix it is to take the spinlock before altering the call_data_struct, rather than eliminating the static definition, so that the disk dump still works properly.

***********************************************************************
--- linux-2.4.21/arch/ia64/kernel/smp.c.orig	2004-08-10 11:09:31.000000000 -0400
+++ linux-2.4.21/arch/ia64/kernel/smp.c	2004-08-10 11:10:10.000000000 -0400
@@ -296,6 +296,8 @@
 	if (!cpus)
 		return 0;
 
+	spin_lock_bh(&call_lock);
+
 	data.func = func;
 	data.info = info;
 	atomic_set(&data.started, 0);
@@ -303,8 +305,6 @@
 	if (wait > 0)
 		atomic_set(&data.finished, 0);
 
-	spin_lock_bh(&call_lock);
-
 	call_data = &data;
 	mb();	/* ensure store to call_data precedes setting of IPI_CALL_FUNC */
 	send_IPI_allbutself(IPI_CALL_FUNC);
***********************************************************************

Please rerun the Busy test with this kernel ASAP so that we can include this change in RHEL3-U3. It's located at:

http://people.redhat.com/~lwoodman/IA64/

Larry Woodman
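For readers following along, here is a minimal user-space sketch of the idea behind the patch. It is not the kernel code: it assumes pthreads in place of CPUs, a mutex in place of spin_lock_bh(), and a direct function call in place of the IPI. The names call_function, bump, caller_main and run_callers are hypothetical. What it shows is that when several callers share one static struct, the lock has to be taken before the first field is written; otherwise one caller can overwrite func/info while another caller's request is still being delivered.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_CALLERS 8

/* Hypothetical stand-in for the kernel's static call_data_struct. */
struct call_data {
    void (*func)(void *);
    void *info;
};

static struct call_data data;   /* one shared instance, as with the static struct */
static pthread_mutex_t call_lock = PTHREAD_MUTEX_INITIALIZER;

/* Callback used by the demo: increments the long that info points to. */
static void bump(void *info)
{
    ++*(long *)info;
}

/* Fill in the shared struct and deliver the call, all under call_lock. */
static void call_function(void (*func)(void *), void *info)
{
    pthread_mutex_lock(&call_lock);   /* lock first, mirroring the patch */
    data.func = func;
    data.info = info;
    data.func(data.info);             /* stand-in for the IPI handler */
    pthread_mutex_unlock(&call_lock);
}

struct caller {
    long counter;
    long iters;
};

static void *caller_main(void *arg)
{
    struct caller *c = arg;
    for (long i = 0; i < c->iters; i++)
        call_function(bump, &c->counter);
    return NULL;
}

/* Runs ncallers threads; returns how many ended with a wrong counter. */
int run_callers(int ncallers, long iters)
{
    pthread_t tid[MAX_CALLERS];
    struct caller c[MAX_CALLERS];
    int bad = 0;

    assert(ncallers > 0 && ncallers <= MAX_CALLERS);
    for (int i = 0; i < ncallers; i++) {
        c[i].counter = 0;
        c[i].iters = iters;
        pthread_create(&tid[i], NULL, caller_main, &c[i]);
    }
    for (int i = 0; i < ncallers; i++) {
        pthread_join(tid[i], NULL);
        if (c[i].counter != iters)
            bad++;
    }
    return bad;
}
```

With the lock taken first, every call bumps the counter its own caller passed in, so run_callers(4, 10000) returns 0. Moving pthread_mutex_lock() below the two assignments recreates the pre-patch ordering, where a call may be delivered with another caller's info pointer.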
Larry, Our stress tests are running on your debug kernel on two of our boxes, an x450 and an x455. I'll let you know how it goes.
Larry, I think you have this one nailed. The tests have been running continuously on both systems for over 22 hours with no problems. I believe I was previously seeing the panic in less than 4 hours. I'll allow the tests to continue and give you further updates but if you absolutely need to get the changes in today I think you're good to go. Thanks.
Thanks for the quick turnaround, Gary. Please rerun the test one more time ASAP with the final version, located at: http://people.redhat.com/~lwoodman/IA64/ We want to build a final U3 kernel as soon as you can confirm this run. Thanks, Larry Woodman
I installed/booted 2.4.21-18.ia64debugfinal.EL on the same two systems and started the tests. I'll give you a report tomorrow morning.
Bad news. The tests are still running on the x455 but there was an Oops on the x450 after I left for home last night. The tests had run almost 5 hours. The following call trace was copied manually from the console.

Call Trace:
[<e0000000044158e0>] sp=0xe00000006d107a30 bsp=0xe00000006d101548 show_stack [kernel] 0x80
[<e0000000044512d0>] sp=0xe00000006d107c00 bsp=0xe00000006d101520 die [kernel] 0x200
[<e0000000044512d0>] sp=0xe00000006d107c00 bsp=0xe00000006d1014c8 ia64_do_page_fault [kernel] 0x310
[<e00000000440e9a0>] sp=0xe00000006d107c90 bsp=0xe00000006d1014c8 ia64_leave_kernel [kernel] 0x0
[<e00000000451bea0>] sp=0xe00000006d107e30 bsp=0xe00000006d1014c8 buffer_insert_list [kernel] 0xc0
Gary, can you try to reproduce this Oops and grab a full traceback? There isn't enough information here to debug it, but at first glance it doesn't look anything like the handle_IPI() Oops that we were seeing before. Thanks, Larry
Larry, Earlier today I got a serial line set up on the x450 so we can capture the entire Oops output in a scrollable window if this happens again. The tests are running. Starting tomorrow I will be gone until Monday but Chris McDermott is monitoring and will provide the Oops output if this reproduces again.
Hi, Gary. We know that Larry has fixed a problem in the x86 and ia64 versions of smp_call_function() that would account for the originally reported problem in handle_IPI(). Please assume that the new problem you've encountered on your x450 is something different, and open a new bugzilla for that issue. Thanks. -ernie
A fix for this problem has just been committed to the RHEL3 U3 patch pool this evening (in kernel version 2.4.21-19.EL).
The stress tests continued to run through the weekend on both the x450 and x455 with no problems.

Ernie, the problem I reported in comment 12 appears to be (1) very difficult to reproduce and (2) unrelated to the handle_IPI() Oops. Per your comment 15 request I just submitted Bug 130022 to track this apparently new problem.
Gary, thanks very much for opening the new bug. I just assigned it to Larry Woodman.
Closing out this issue based on confirmation from the original reporter.
An erratum has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-433.html