438217 – [24][FOCUS] Busy hang running WRT java on RedHat-MRG

Summary: [24][FOCUS] Busy hang running WRT java on RedHat-MRG

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	realtime-kernel
Sub Component:
Version:	beta
Hardware:	ia32e
OS:	All
Priority:	low
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Red Hat Real Time Maintenance
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-03-19 18:33 UTC by IBM Bug Proxy
Modified:	2008-04-07 14:52 UTC (History)
CC List:	0 users
Fixed In Version:	2.6.24.4-30.el5rt
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-04-07 14:52:25 UTC
Target Upstream Version:
Embargoed:

Attachments
Screenshot of panic from rtj-opt6.hursley.ibm.com (540.83 KB, image/jpeg) 2008-03-19 18:33 UTC, IBM Bug Proxy	no flags	Details
Screenshot of panic from rtj-opt22.hursley.ibm.com (583.76 KB, image/jpeg) 2008-03-19 18:33 UTC, IBM Bug Proxy	no flags	Details
Screenshot of the panic submitted by P. N. Stanton (698.53 KB, image/jpeg) 2008-03-19 18:33 UTC, IBM Bug Proxy	no flags	Details
Screenshot of panic from rtj-opt6.hursley.ibm.com (550.17 KB, image/jpeg) 2008-03-19 18:33 UTC, IBM Bug Proxy	no flags	Details
Hiroshi-san's patch for 2.6.24.3-29.el5rt kernel (2.03 KB, text/plain) 2008-03-20 07:57 UTC, IBM Bug Proxy	no flags	Details

Links
System	ID	Private	Priority	Status	Summary	Last Updated
IBM Linux Technology Center	42841	0	None	None	None	Never

Description IBM Bug Proxy 2008-03-19 18:33:17 UTC

=Comment: #0=================================================
P. Thwaite <paul_thwaite.com> - 2008-03-04 11:39 EDT
Problem description:

Bugzilla 42758 was raised recently to cover kernel panics when running 
Realtime java on RH-MRG.  This bug is being opened to cover another problem 
with running Java on RedHat-MRG.

On LS20 and HS21 hardware, when running the same java tests as per bugzilla 
42758, the java process intermittently hangs in a busy loop (100% cpu).  This 
sometimes happens when threads are spawned, or within a few minutes into the 
test.  I say intermittent because we have some tests pass, but the majority do 
fail in this way. 

Issuing a "ps -aux" causes the command to hang (which is a known limitation I 
believe?) and running "chrt -f 99 ps -aux"  usually does run, although I've 
seen that hang too. 

Since we're unable to run java properly, this bug is also blocking java test 
progress. 


Hardware Environment
rtj-opt9.hursley.ibm.com
LS20, 8850-55G, 2 GHz Opteron 270 (dual core), 4GB RAM

rtj-opt28.hursley.ibm.com
HS21, 8853-L6G, 2 x 3.0 GHz Xeon 5160 (dual core EM64T), 4GB RAM


Is this reproducible?
Yes - almost every time. Just running the same tests documented in 42758 will 
cause the busy hang.   

Is the system (not just the application) hung?
No.
    
Did the system produce an OOPS message on the console?
No.

Is the system sitting in a debugger right now?
No.

Additional information:

The test machines are available if required - pls ask for details.
=Comment: #3=================================================
Sripathi Kodi <sripathi.com> - 2008-03-06 10:34 EDT
When the machine hangs:
1) rt-sshd fails to respond
2) Machine doesn't respond to ping
3) sysrq keys don't work.
4) Keyboard at the console doesn't work (for example, caps lock key doesn't work)
=Comment: #5=================================================
Paul A. Clarke <pacman.com> - 2008-03-06 14:04 EDT
I'll start looking into this as well. I'll be out from Saturday thru Tuesday,
but will try to get up to speed quickly.
=Comment: #9=================================================
Paul A. Clarke <pacman.com> - 2008-03-07 11:32 EDT
(In reply to comment #8)
> I'll put it in a loop and see if it finally hangs.

Still running, but I did see this on the console:
Clocksource tsc unstable (delta = 518100895 ns)

=Comment: #10=================================================
P. Thwaite <paul_thwaite.com> - 2008-03-07 11:40 EDT
Paul - as a matter of interest, what hardware type are you using in the runs 
you describe in comment 8 and 9?
=Comment: #11=================================================
Paul A. Clarke <pacman.com> - 2008-03-07 13:08 EDT
(In reply to comment #10)
> Paul - as a matter of interest, what hardware type are you using in the runs 
> you describe in comment 8 and 9?

# dmidecode --string system-product-name
IBM eServer BladeCenter LS20 -[885071U]-
=Comment: #12=================================================
Paul A. Clarke <pacman.com> - 2008-03-07 15:28 EDT
(In reply to comment #8)
> I'll put it in a loop and see if it finally hangs.
> 
> I'm running with the latest kernel, 2.6.24.3-29.el5rt.

Since I'll be away until Wednesday 12 March, I'll provide current status:

I'm up to iteration 58 in my loop, no hangs, and have only observed the two
issues noted in previous comments.

Sripathi, could you restart your efforts with the latest kernel as noted above?
 That kernel includes the fix for bug #42758, so no patching and rebuilding is
required.
=Comment: #13=================================================
John G. Stultz <jstultz.com> - 2008-03-07 21:27 EDT
Might be worth checking if the set_kthread_prio bug is involved here (ie: IRQ
handlers being starved?).
=Comment: #14=================================================
P. Thwaite <paul_thwaite.com> - 2008-03-10 10:44 EDT
Bug 42758 is now fixed (kernel no longer panics).

Whilst testing 42758, we continue to see this bug (which typically causes the 
machine to hang). 

This bug is now the next blocker for JTC RH-MRG testing. 

We are running tests at the moment to determine what test (or set of tests) 
cause the hangs.  Details will be available soon. 

=Comment: #15=================================================
Sripathi Kodi <sripathi.com> - 2008-03-11 03:26 EDT
(In reply to comment #13)
> Might be worth checking if the set_kthread_prio bug is involved here (ie: IRQ
> handlers being starved?).

I am trying this out.
=Comment: #16=================================================
Sripathi Kodi <sripathi.com> - 2008-03-11 03:41 EDT
(In reply to comment #15)
> (In reply to comment #13)
> > Might be worth checking if the set_kthread_prio bug is involved here (ie: IRQ
> > handlers being starved?).
> 
> I am trying this out.

Nope, that fix doesn't seem to help this problem. I can still recreate the
problem. I am using 2.6.24.3-29.el5rt kernel.
=Comment: #17=================================================
Paul A. Clarke <pacman.com> - 2008-03-11 23:10 EDT
(In reply to comment #16)
> Nope, that fix doesn't seem to help this problem. I can still recreate the
> problem. I am using 2.6.24.3-29.el5rt kernel.

I wonder why I can't reproduce it.  I'm on iteration 1123.
=Comment: #18=================================================
Sripathi Kodi <sripathi.com> - 2008-03-12 11:33 EDT
(In reply to comment #17)
> (In reply to comment #16)
> > Nope, that fix doesn't seem to help this problem. I can still recreate the
> > problem. I am using 2.6.24.3-29.el5rt kernel.
> 
> I wonder why I can't reproduce it.  I'm on iteration 1123.

That's a surprise. I have recreated this on two LS20s so far. Most recently I
tried this on (ABAT provisioned) ltcrt16 and recreated it easily. I just
cancelled my job on ltcrt16, so you could try on that very machine and see if it
works.
=Comment: #19=================================================
Sripathi Kodi <sripathi.com> - 2008-03-12 11:35 EDT
I booted with nmi_watchdog=2, verified that NMIs were getting generated and
recreated the problem. I still could not see anything on the console when the
problem happened. It would either mean the system is so badly hosed that it
can't handle NMIs or it is just unable to print anything on the console. I am
hoping that it is the latter and thinking of ways to circumvent it.
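
One way to check that NMIs are being generated is to sample the NMI row of /proc/interrupts twice and compare the per-CPU counts. The following is a minimal sketch, assuming the usual "NMI:" row layout; the helper names and the sample text are made up for illustration and are not taken from the test machines.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts found in /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # reached the trailing description, if any
                counts.append(int(field))
            return counts
    return []


def nmi_deltas(before, after):
    """Per-CPU increase in NMI count between two samples."""
    return [b - a for a, b in zip(nmi_counts(before), nmi_counts(after))]


# Made-up samples for illustration; on a live system read /proc/interrupts twice.
SAMPLE_1 = "      CPU0   CPU1\nNMI:   100    120   Non-maskable interrupts\n"
SAMPLE_2 = "      CPU0   CPU1\nNMI:   160    181   Non-maskable interrupts\n"
DELTAS = nmi_deltas(SAMPLE_1, SAMPLE_2)
```

A delta of zero on a CPU would only show that no NMI was counted there between the two samples; it says nothing about why.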
=Comment: #20=================================================
Paul A. Clarke <pacman.com> - 2008-03-12 11:59 EDT
(In reply to comment #18)
> (In reply to comment #17)
> > (In reply to comment #16)
> > > Nope, that fix doesn't seem to help this problem. I can still recreate the
> > > problem. I am using 2.6.24.3-29.el5rt kernel.
> > 
> > I wonder why I can't reproduce it.  I'm on iteration 1123.
> 
> That's a surprise. I have recreated this on two LS20s so far. Most recently I
> tried this on (ABAT provisioned) ltcrt16 and recreated it easily. I just
> cancelled my job on ltcrt16, so you could try on that very machine and see if it
> works.

Something finally happened, but it's still not hung...I seem to be stuck in an
endless loop, with these appearing continuously on the terminal from which the
tests are running:
20080312-11:57:31 Command ps -aux did not complete in 30 seconds and has been
terminated
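
The message above comes from the test harness giving up on a command after 30 seconds. A rough sketch of that kind of guard follows; it is not the harness's actual code, and the function name and the stand-in command are assumptions for illustration.

```python
import subprocess
import sys


def run_with_timeout(argv, seconds):
    """Run argv; return its exit status, or None if it exceeded the timeout."""
    try:
        done = subprocess.run(argv, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising TimeoutExpired.
        return None
    return done.returncode


# Stand-in for a command that hangs; the real case would be ["ps", "-aux"]
# with a 30 second limit.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(5)"]
```

Note that a child stuck in uninterruptible sleep may not go away when killed, so a guard like this can itself block in that situation.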

I'll take a peek at ltcrt16...you want to look at ltcrt6?  Maybe I set something
up wrong?
=Comment: #21=================================================
Paul A. Clarke <pacman.com> - 2008-03-12 13:43 EDT
(In reply to comment #18)
> (In reply to comment #17)
> > (In reply to comment #16)
> > > Nope, that fix doesn't seem to help this problem. I can still recreate the
> > > problem. I am using 2.6.24.3-29.el5rt kernel.
> > 
> > I wonder why I can't reproduce it.  I'm on iteration 1123.
> 
> That's a surprise. I have recreated this on two LS20s so far. Most recently I
> tried this on (ABAT provisioned) ltcrt16 and recreated it easily. I just
> cancelled my job on ltcrt16, so you could try on that very machine and see if it
> works.

OK, I've grabbed ltcrt16.

The only jtctests data that I can find is in root's home dir.

Sripathi, Paul, Are you running these tests as root?  If so, why?

I wonder if that's the difference between our runs...I'm running as a non-root user.

Anyway, I'll fire up some tests on ltcrt16 and see what happens.
=Comment: #22=================================================
Paul A. Clarke <pacman.com> - 2008-03-12 14:47 EDT
(In reply to comment #21)
> I wonder if that's the difference between our runs...I'm running as a non-root
> user.
> 
> Anyway, I'll fire up some tests on ltcrt16 and see what happens.

I ran once as root and quickly lost control of the machine, including the SOL
session.

I'm up to iteration 6 running as non-root, no hangs so far.
=Comment: #23=================================================
Sripathi Kodi <sripathi.com> - 2008-03-13 01:24 EDT
(In reply to comment #21)
<snip>
> Sripathi, Paul, Are you running these tests as root?  If so, why?
> 
> I wonder if that's the difference between our runs...I'm running as a non-root
> user.

Paul, I may have run the test as root on ltcrt16. I agree that I should not do
that. However, on my local LS20 (llm50.in) I have always run it as a normal user
and recreated the problem pretty consistently. It has never taken more than 5
iterations to recreate the problem.

<snip>

=Comment: #24=================================================
Sripathi Kodi <sripathi.com> - 2008-03-13 01:35 EDT
Discussion from bug 42758 that is relevant here: 

------- Additional Comment #40 From Sripathi Kodi  2008-03-12 12:37 EDT  [reply]
-------     Internal Only

(In reply to comment #39)
> Seems that the machines are not hanging as per 42841 - they're actually
> panicking. RIP is pointing to __spin_lock this time. As before, this is easily
> reproducible with the same tests that found the first panic.
> We didn't spot this at first as the panics are not output over the network or
> serial link - it only appears on the screen. So unfortunately the only output we
> have is what I've been able to photograph from the screen.
> Do you want me to re-open this bug or start a new one?

Is it possible that all hangs seen in 42841 could be explained by this? If that
is possible, it will be useful  to carry out analysis in 42841. If it turns out
to be a totally new problem, we can open a new bug later. We would surely like to see
the photograph, btw.


------- Additional Comment #41 From P. N. Stanton 2008-03-12 13:34 EDT [reply]
------- Internal Only

It could be the same hang as in 42841, but I see kernel messages every time
whereas the comments in 42841 say that no output is produced.
I've attached a photo from rtj-opt6.hursley.ibm.com. Some of the information
looks to have disappeared off the top of the screen, but this is the only one
I've seen that shows a call trace. Apologies for the quality of the photography
- took this with my mobile phone
=Comment: #25=================================================
Sripathi Kodi <sripathi.com> - 2008-03-13 01:36 EDT

Screenshot of the panic submitted by P. N. Stanton

=Comment: #26=================================================
Sripathi Kodi <sripathi.com> - 2008-03-13 01:45 EDT
The screenshot is interesting. The panic seems to be because of an nmi. My guess
is that the system was hung and nmi triggered the panic. This is what I have
been trying to do without success! We are probably seeing a deadlock here. 
Parts of the text from the screenshot:


Pid: 30090, comm: java

default_do_nmi + 0x6c/0x1a6
do_nmi + 0x3e/0x5a
nmi + 0x7f/0x90
spin_lock + 0x1d/0x23
double_lock_balance + 0x57/0x60
push_rt_task + 0xa4/0x20e
push_rt_tasks + 0x14/0x1c  <== Not sure whether it is 0x1c or 0x1e
task_wake_up_rt + 0x26/0x28
wake_up_new_task + 0xa7/0xbc
do_fork + 0x13e/0x20e

Peter, can this be recreated? If yes, it will be great to get a kdump when it
happens. Procedure to set up kdump is here:
http://rt.et.redhat.com/page/RHEL-RT_kdump/kexec I can help with this if needed.
=Comment: #27=================================================
Sripathi Kodi <sripathi.com> - 2008-03-13 06:26 EDT
This time, I used the NMI button on the front panel of the blade to trigger an
NMI when the hang occurred. The bladecenter logs show that NMI was pressed for
the particular blade, but the system did not respond to it. I saw nothing on
console, SOL had stopped working and I got no kdump.
=Comment: #28=================================================
P. N. Stanton <pstanton.com> - 2008-03-13 07:26 EDT
Hardware info for the two machines that we've been seeing these panics on:

rtj-opt6.hursley.ibm.com:
eServer 326m, model number 7969-76G
2 x 2.4 GHz Opteron 280 (dual core), 5 GB RAM

rtj-opt22.hursley.ibm.com:
eServer x3455, model number 7984-52G
2 x 2.6 GHz Opteron 2218 (dual core), 10 GB RAM

Pressing the NMI button on the back of rtj-opt22 produces these kernel messages:

Uhhuh. NMI received for unknown reason 21.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue

The system then continues running normally.

The e326m does not appear to have an NMI button.

Both machines have kdump set up; we are running Java tests to re-create the problem.

I'll attach the other screenshots I've taken - two from rtj-opt6 and one from
rtj-opt22.

=Comment: #29=================================================
P. N. Stanton <pstanton.com> - 2008-03-13 07:27 EDT

Screenshot of panic from rtj-opt6.hursley.ibm.com

=Comment: #30=================================================
P. N. Stanton <pstanton.com> - 2008-03-13 07:27 EDT

Screenshot of panic from rtj-opt6.hursley.ibm.com

=Comment: #31=================================================
P. N. Stanton <pstanton.com> - 2008-03-13 07:28 EDT

Screenshot of panic from rtj-opt22.hursley.ibm.com

=Comment: #32=================================================
Sripathi Kodi <sripathi.com> - 2008-03-13 09:07 EDT
I will try this on our local x3455 machine.
=Comment: #33=================================================
Sripathi Kodi <sripathi.com> - 2008-03-13 09:43 EDT
At last! I recreated the problem on llm55.in and got a kdump. Backtrace looks
like the following. The version of crash on the system could not read the dump
properly. Hence I pulled down latest version of crash from
http://people.redhat.com/~anderson/ and compiled it. I will post my observation
of the dump soon.

crash> bt
PID: 28301  TASK: ffff81022e6f0e80  CPU: 0   COMMAND: "java"
 #0 [ffffffff80a67d00] machine_kexec at ffffffff802246a1
 #1 [ffffffff80a67de0] crash_kexec at ffffffff8026929a
 #2 [ffffffff80a67ea0] die_nmi at ffffffff804a77de
 #3 [ffffffff80a67ed0] nmi_watchdog_tick at ffffffff804a7d23
 #4 [ffffffff80a67f00] default_do_nmi at ffffffff804a7453
 #5 [ffffffff80a67f30] do_nmi at ffffffff804a7dd9
 #6 [ffffffff80a67f50] nmi at ffffffff804a725f
    [exception RIP: __spin_lock+26]
    RIP: ffffffff804a6980  RSP: ffff81022398dde8  RFLAGS: 00200086
    RAX: ffff81022398dfd8  RBX: ffff810001021680  RCX: 0000000000000003
    RDX: 0000000000000000  RSI: ffff810001021680  RDI: ffff810001021680
    RBP: ffff81022398dde8   R8: ffff810001005840   R9: 00000000ffffffff
    R10: 00000000fffffff4  R11: 00000000c9c05c35  R12: ffff810001011680
    R13: ffffffff80a5f680  R14: ffff810001011680  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <exception stack> ---
 #7 [ffff81022398dde8] __spin_lock at ffffffff804a6980
 #8 [ffff81022398ddf0] double_lock_balance at ffffffff802329e0
 #9 [ffff81022398de10] push_rt_task at ffffffff802330e1
#10 [ffff81022398de50] push_rt_tasks at ffffffff8023325f
#11 [ffff81022398de70] task_wake_up_rt at ffffffff80239661
#12 [ffff81022398de80] wake_up_new_task at ffffffff8023a1f9
#13 [ffff81022398deb0] do_fork at ffffffff8023c860
#14 [ffff81022398df40] sys32_clone at ffffffff80229fe0
#15 [ffff81022398df50] ia32_ptregs_common at ffffffff80229cf5
    RIP: 0000000045b7efc8  RSP: 00000000cac0ce68  RFLAGS: 00200296
    RAX: ffffffffffffffda  RBX: 00000000003d0f00  RCX: 00000000a2cdd4b4
    RDX: 00000000a2cddbd8  RSI: 00000000cac0ced4  RDI: 00000000a2cddbd8
    RBP: 00000000cac0cf00   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000078  CS: 0023  SS: 002b
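
The frames below the exception in this dump can be compared mechanically with the trace transcribed from the screenshot in comment #26. The sketch below assumes the "#N [stack] function at address" frame layout shown above and strips leading underscores so that __spin_lock and spin_lock compare equal; the helper name is illustrative.

```python
import re

# One crash "bt" frame: "#7 [ffff81022398dde8] __spin_lock at ffffffff804a6980"
FRAME = re.compile(r"^\s*#(\d+)\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+")


def frame_functions(bt_text):
    """Return the function names of the frames in bt output, in order."""
    funcs = []
    for line in bt_text.splitlines():
        m = FRAME.match(line)
        if m:
            funcs.append(m.group(2).lstrip("_"))
    return funcs


# Frames #7 to #13 copied from the kdump backtrace above.
KDUMP_BT = """\
 #7 [ffff81022398dde8] __spin_lock at ffffffff804a6980
 #8 [ffff81022398ddf0] double_lock_balance at ffffffff802329e0
 #9 [ffff81022398de10] push_rt_task at ffffffff802330e1
#10 [ffff81022398de50] push_rt_tasks at ffffffff8023325f
#11 [ffff81022398de70] task_wake_up_rt at ffffffff80239661
#12 [ffff81022398de80] wake_up_new_task at ffffffff8023a1f9
#13 [ffff81022398deb0] do_fork at ffffffff8023c860
"""

# Function names as transcribed from the screenshot in comment #26.
SCREENSHOT = ["spin_lock", "double_lock_balance", "push_rt_task",
              "push_rt_tasks", "task_wake_up_rt", "wake_up_new_task",
              "do_fork"]

SAME_PATH = frame_functions(KDUMP_BT) == SCREENSHOT
```

Both traces follow the same path from do_fork down to the spinlock, which supports treating the photographed panic and this dump as the same problem.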
=Comment: #34=================================================
John G. Stultz <jstultz.com> - 2008-03-13 12:29 EDT
So looking at the screenshots for rtj-opt6.hursley.ibm.com, it seems we're hung
up on a spinlock in the apic_timer_interrupt code. This is similar to
Sripathi's, but not in the same place.

On rtj-opt22.hursley.ibm.com, it seems it is hung in default_idle, which is an odd
place to hang.
=Comment: #35=================================================
Paul A. Clarke <pacman.com> - 2008-03-13 17:28 EDT
(In reply to comment #33)
> At last! I recreated the problem on llm55.in and got a kdump.

(talked to sripathi on irc...)
Two of the CPUs are stuck in double_lock_balance.  (per sripathi, "that is fine")
One appears to be in schedule.
One is in rb_insert_color:
 #8 [ffff810222db3a30] rb_insert_color at ffffffff8033f253
 #9 [ffff810222db3a60] __enqueue_entity at ffffffff80231a11
#10 [ffff810222db3a70] put_prev_task_fair at ffffffff802397a9
#11 [ffff810222db3a90] __schedule at ffffffff804a4eab
#12 [ffff810222db3b70] schedule at ffffffff804a5527
#13 [ffff810222db3b80] rt_mutex_slowlock at ffffffff804a6055
#14 [ffff810222db3c40] rt_mutex_lock at ffffffff804a5cf2
#15 [ffff810222db3c50] __rt_down_read at ffffffff8025e285
#16 [ffff810222db3c70] rt_down_read at ffffffff8025e29f
#17 [ffff810222db3c80] futex_wait at ffffffff8025aef3
#18 [ffff810222db3e10] do_futex at ffffffff8025c316
#19 [ffff810222db3ef0] compat_sys_futex at ffffffff8025d301
#20 [ffff810222db3f80] cstar_do_call at ffffffff80229a04
    RIP: 00000000ffffe405  RSP: 00000000b19c83c8  RFLAGS: 00200202
    RAX: ffffffffffffffda  RBX: ffffffff80229a04  RCX: 0000000000000000
    RDX: 0000000000000002  RSI: 0000000000000000  RDI: 00000000b19c8b90
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 00000000000000f0  CS: 0023  SS: 002b

=Comment: #36=================================================
Paul A. Clarke <pacman.com> - 2008-03-13 19:32 EDT
traceback indicates that the task which is in rb_insert_color is at
(bt -a)
[exception RIP: rb_insert_color+175]
    RIP: ffffffff8033f253

disassembly with line numbers shows:
(dis -l rb_insert_color)
/usr/src/debug/kernel-2.6.24.3/linux-2.6.24.3.x86_64/lib/rbtree.c: 129
0xffffffff8033f253 <rb_insert_color+175>:       andq   $0xfffffffffffffffe,(%r12)

line 129 of rbtree.c:
128                         rb_set_black(parent);
129                         rb_set_red(gparent);
130                         __rb_rotate_left(gparent, root);

now, at this point in the instruction sequence, I believe:
R15 is "root":
R15: ffff8101300ae708
root is a pointer to the address of the root node of the tree:
crash> rd ffff8101300ae708 1
ffff8101300ae708:  ffff810222d9f358

R12 is "gparent":
R12: ffff810222d9f358

RCX is "parent":
RCX: ffff810222d9f358

let's look at that node:
crash> rd ffff810222d9f358 3
ffff810222d9f358:  ffff810222d9f359 ffff810222d9f358
ffff810222d9f368:  0000000000000000

the node structure is { parent(and color), right, left }
note that the color of the node is stored in the low order bit of the parent
address, and red is 0 and black is 1 (per rbtree.h).

so, this tree is apparently a single node, currently black (from line 128), but
about to be set to red (line 129), right link points to itself, left link is null.

The tree has been corrupted.  The root node's parent should be NULL, and nodes
should not point to themselves.
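
The decoding of the three words read from the dump can be sketched as follows. This is an illustration only, assuming the 2.6.24 layout described above (parent and color in the first word, then right, then left, with the color in the low-order bit and the two low bits masked off to recover the parent); the function names are made up.

```python
RB_RED, RB_BLACK = 0, 1


def decode_rb_node(addr, words):
    """Decode the three words of a struct rb_node located at addr."""
    parent_color, right, left = words
    return {
        "addr": addr,
        "parent": parent_color & ~3,  # parent pointer with the low bits cleared
        "color": parent_color & 1,    # low-order bit: 0 is red, 1 is black
        "right": right,
        "left": left,
    }


def corruption_signs(node):
    """List the self-references that a healthy tree node should never have."""
    signs = []
    if node["parent"] == node["addr"]:
        signs.append("node is its own parent")
    if node["right"] == node["addr"]:
        signs.append("right link points to itself")
    if node["left"] == node["addr"]:
        signs.append("left link points to itself")
    return signs


# Values from "rd ffff810222d9f358 3" above.
NODE = decode_rb_node(0xffff810222d9f358,
                      (0xffff810222d9f359, 0xffff810222d9f358, 0x0))
```

With these values the node decodes as black, its own parent, and its own right child, which matches the reading given above.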

Since the node has been set to red, that likely explains why this function never
finishes, since the loop condition is:

 76         while ((parent = rb_parent(node)) && rb_is_red(parent))
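
A simplified model of why that condition never goes false on this tree: when parent and gparent are the same node, line 128 blackens it and line 129 immediately reddens it again. The sketch below is a toy, not kernel code. It ignores the rotation on line 130 and the red-uncle case, starts from the state just after line 129 has run, and caps the loop so that it terminates; the class and function names are hypothetical.

```python
class Node:
    """Tiny stand-in for struct rb_node: a parent link and a color."""

    def __init__(self):
        self.parent = None
        self.red = False


def loop_continues(node):
    # Mirrors: while ((parent = rb_parent(node)) && rb_is_red(parent))
    parent = node.parent
    return parent is not None and parent.red


def recolor_step(node):
    # Lines 128-129 of rbtree.c: blacken the parent, redden the grandparent.
    parent = node.parent
    gparent = parent.parent
    parent.red = False
    gparent.red = True


# Corrupted shape from the dump: a single node that is its own parent.
bad = Node()
bad.parent = bad
bad.red = True

iterations = 0
while loop_continues(bad) and iterations < 1000:  # cap added so the sketch ends
    recolor_step(bad)
    iterations += 1

# Healthy shape for contrast: red leaf, red parent, black grandparent.
root, mid, leaf = Node(), Node(), Node()
mid.parent, leaf.parent = root, mid
mid.red = leaf.red = True
recolor_step(leaf)
```

In the healthy case one recolor step leaves the parent black and the condition fails; in the self-parented case the cap is the only thing that stops the loop.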

=Comment: #37=================================================
Ankita Garg <ankigarg.com> - 2008-03-14 01:49 EDT
(In reply to comment #33)
> At last! I recreated the problem on llm55.in and got a kdump. Backtrace looks
> like the following. The version of crash on the system could not read the dump
> properly. Hence I pulled down latest version of crash from
> http://people.redhat.com/~anderson/ and compiled it. 

Sripathi, so does that mean we will need to open a new bug to ask RH to ship the
new crash version with MRG?
=Comment: #39=================================================
Sripathi Kodi <sripathi.com> - 2008-03-14 04:33 EDT
I have looked at the dump a bit more. This is the summary of what I have seen.
If anyone is interested in detailed analysis I can put it up.

cpu0: 'java' process pid:28301 is trying to hold the runqueue lock of cpu 1
cpu1: 'java' process pid:14094 with incomplete backtrace!
cpu2: 'softirq-rcu/2' process pid:39 is trying to hold runqueue lock of cpu 3
cpu3: 'java' process pid:14092 is spinning and has the rq lock of rq3. Paul has
found that it contains a corrupted r-b tree, because of which it is spinning
forever.

This can explain why cpu0, cpu2 and cpu3 are not able to make progress. I am not
sure what is happening on cpu1, however. I cannot confirm that it has the
runqueue lock of cpu1. It probably doesn't? Its backtrace is:

PID: 14094  TASK: ffff8100c1c21600  CPU: 1   COMMAND: "java"
 #0 [ffff8100c0181f58] schedule at ffffffff804a520c
 #1 [ffff8100c0181f80] cstar_do_call at ffffffff80229a04
    RIP: 00000000ffffe405  RSP: 00000000badef3c8  RFLAGS: 00200202
    RAX: ffffffffffffffda  RBX: ffffffff80229a04  RCX: 0000000000000000
    RDX: 0000000000000002  RSI: 0000000000000000  RDI: 00000000badefb90
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 00000000000000f0  CS: 0023  SS: 002b

I can't establish who holds cpu1's runqueue lock. I looked at the backtraces of
all other runnable tasks in vain. More info later. 
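
The per-CPU picture above can be written down as a small wait-for table, which makes the open question (who holds rq1) explicit. This is only a restatement of the summary in this comment; the dictionary and function names are made up, and None marks what the dump has not established.

```python
# Lock state as summarised above; None marks the unknown holder of rq1.
HOLDER = {"rq1": None, "rq3": "cpu3"}
WAITING_FOR = {"cpu0": "rq1", "cpu2": "rq3"}
SPINNING = {"cpu3"}  # looping in rb_insert_color, so it never releases rq3


def explain(cpu):
    """Describe what, if anything, stops this CPU from making progress."""
    if cpu in SPINNING:
        held = sorted(lock for lock, owner in HOLDER.items() if owner == cpu)
        return f"{cpu}: spinning, holds " + ", ".join(held)
    lock = WAITING_FOR.get(cpu)
    if lock is None:
        return f"{cpu}: not known to be waiting on a runqueue lock"
    holder = HOLDER.get(lock)
    if holder is None:
        return f"{cpu}: waits on {lock}, holder unknown"
    return f"{cpu}: waits on {lock}, held by {holder}"


REPORT = [explain(cpu) for cpu in ("cpu0", "cpu1", "cpu2", "cpu3")]
```

Only the cpu2 and cpu3 entries form a complete chain; the cpu0 entry stays open until the holder of rq1 is identified.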

We may have to recreate the problem again to confirm the r-b tree corruption.
=Comment: #40=================================================
Sripathi Kodi <sripathi.com> - 2008-03-17 09:04 EDT
I tried to recreate this again on llm55.in to get another core dump. This would
help us reconfirm the r-b tree corruption that Paul has described. However, even
though the system seems to panic when the test is run, it did not trigger a
kdump! I tried tens of times and gave up. I don't have a serial console on this
machine, so I can't get much information from it. I will see if I can do this on
rt-ash instead.
=Comment: #41=================================================
Sripathi Kodi <sripathi.com> - 2008-03-17 09:06 EDT
I had a little chat with Vatsa about this problem today. He will try to get some
of his time for this later tonight/tomorrow. He too feels it will be nice to
recreate this again to confirm that our observations are consistent.
=Comment: #42=================================================
John G. Stultz <jstultz.com> - 2008-03-17 15:46 EDT
Can we mirror this issue to RH?
=Comment: #45=================================================
Sripathi Kodi <sripathi.com> - 2008-03-18 09:21 EDT
I am trying out Hiroshi Shimamoto's patch from LKML ("fix race in schedule"),
because some of the backtraces I saw on later attempts looked suspiciously
similar to the one he has reported. Will report results soon.
=Comment: #48=================================================
Sripathi Kodi <sripathi.com> - 2008-03-18 11:23 EDT
(In reply to comment #45)
> I am trying out Hiroshi Shimamoto's patch from LKML ("fix race in schedule"),
> because some of the backtraces I saw on later attempts looked suspiciously
> similar to the one he has reported. Will report results soon.

Looking good so far in 20 iterations. Running 100 more.
=Comment: #49=================================================
Sripathi Kodi <sripathi.com> - 2008-03-18 11:26 EDT
The patch I am testing is:
http://article.gmane.org/gmane.linux.rt.user/2577

Comment 1 IBM Bug Proxy 2008-03-19 18:33:22 UTC

Created attachment 298564 [details]
Screenshot of panic from rtj-opt6.hursley.ibm.com

Comment 2 IBM Bug Proxy 2008-03-19 18:33:26 UTC

Created attachment 298565 [details]
Screenshot of panic from rtj-opt22.hursley.ibm.com

Comment 3 IBM Bug Proxy 2008-03-19 18:33:30 UTC

Created attachment 298566 [details]
Screenshot of the panic submitted by P. N. Stanton

Comment 4 IBM Bug Proxy 2008-03-19 18:33:34 UTC

Created attachment 298567 [details]
Screenshot of panic from rtj-opt6.hursley.ibm.com

Comment 5 IBM Bug Proxy 2008-03-20 05:57:21 UTC

------- Comment From sripathi.com 2008-03-20 01:49 EDT-------
(In reply to comment #49)
> The patch I am testing is:
> http://article.gmane.org/gmane.linux.rt.user/2577

After hundreds of iterations with this patch I feel confident that the patch fixes
this problem. I have asked Ingo to confirm whether this patch is headed for the next
-rt patch. It is already in mainline.

Comment 6 IBM Bug Proxy 2008-03-20 07:57:15 UTC

Created attachment 298650 [details]
Hiroshi-san's patch for 2.6.24.3-29.el5rt kernel

Attaching Hiroshi-san's patch for 2.6.24.3-29.el5rt kernel.

Comment 7 IBM Bug Proxy 2008-03-20 15:41:45 UTC

------- Comment From matthewclarke.com 2008-03-20 11:32 EDT-------
Peter has installed the new patch '2.6.24.3-29.el5rt.42841' on 3 of our machines.

We ran multiple stress tests on the load that we knew used to cause the kernel
panic, and after 50 iterations no failures were seen.

A substantial run of stress tests has been submitted for a weekend run ->
http://jsvtaxxon.hursley.ibm.com/build_info.php?build_id=19111 to see how it copes.

This looks like the kernel patch has fixed the system hangs that we have been
seeing.

Comment 8 IBM Bug Proxy 2008-03-20 23:57:21 UTC

------- Comment From jstultz.com 2008-03-20 19:50 EDT-------
Clark: Please pick up hiroshi-san's patch for MRG.

Comment 9 IBM Bug Proxy 2008-03-25 10:41:27 UTC

------- Comment From paul_thwaite.com 2008-03-25 06:35 EDT-------
Load Level tests have been running all weekend and no failures have been seen
so far.

All 5 minute tests have passed.
50% of 1 hour tests have passed.
The 3 hour tests are yet to run.

We have 3 machines currently running these tests so it will take a while to
get through them all.

The fix does look good.

Comment 10 Clark Williams 2008-03-25 22:16:05 UTC

Fix picked up with the 2.6.24.4 stable patch which should be rolled out this week. 
I'll change status to MODIFIED and if we're all happy we can close it next week.

Clark

Comment 11 IBM Bug Proxy 2008-03-26 05:33:29 UTC

------- Comment From sripathi.com 2008-03-26 01:28 EDT-------
Yes, from IBM's side we are happy about the fix.

Comment 12 IBM Bug Proxy 2008-04-02 16:17:54 UTC

------- Comment From sripathi.com 2008-04-02 12:08 EDT-------
Verified that the patch is in 2.6.24.4-30.el5rt kernel.
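
A rough way to check whether a given kernel release string is at or beyond the fixed version is sketched below. It assumes release strings of the form seen in this bug (dotted version, dash, build number, suffix) and is not a general RPM version comparison; the function names are illustrative.

```python
import re


def release_key(release):
    """Turn '2.6.24.4-30.el5rt' into (2, 6, 24, 4, 30) for comparison."""
    m = re.match(r"^(\d+(?:\.\d+)*)-(\d+)", release)
    if not m:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    version = tuple(int(part) for part in m.group(1).split("."))
    return version + (int(m.group(2)),)


FIXED_IN = "2.6.24.4-30.el5rt"


def has_fix(release):
    """True if release sorts at or after the version this bug was fixed in."""
    return release_key(release) >= release_key(FIXED_IN)
```

A locally patched test build such as 2.6.24.3-29.el5rt.42841 would sort before the fixed version here even though it carries the patch, so this check only applies to stock releases.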
