=Comment: #0================================================= P. Thwaite <paul_thwaite.com> - 2008-03-04 11:39 EDT Problem description: Bugzilla 42758 was raised recently to cover kernel panics when running Realtime java on RH-MRG. This bug is being opened to cover another problem with running Java on RedHat-MRG. On LS20 and HS21 hardware, when running the same java tests as per bugzilla 42758, the java process intermittently hangs in a busy loop (100% cpu). This sometimes happens when threads are spawned, or within a few minutes into the test. I say intermittent because we have some tests pass, but the majority do fail in this way. Issuing a "ps -aux" causes the command to hang (which is a known limitation I believe?) and running "chrt -f 99 ps -aux" usually does run, although I've seen that hang too. Since we're unable to run java properly, this bug is also blocking java test progress. Hardware Environment rtj-opt9.hursley.ibm.com LS20, 8850-55G, 2 GHz Opteron 270 (dual core), 4GB RAM rtj-opt28.hursley.ibm.com HS21, 8853-L6G, 2 x 3.0 GHz Xeon 5160 (dual core EM64T), 4GB RAM Is this reproducible? Yes - almost every time. Just running the same tests documented in 42758 will cause the busy hang. Is the system (not just the application) hung? No. Did the system produce an OOPS message on the console? No. Is the system sitting in a debugger right now? No. Additional information: The test machines are available if required - pls ask for details. =Comment: #3================================================= Sripathi Kodi <sripathi.com> - 2008-03-06 10:34 EDT When the machine hangs: 1) rt-sshd fails to respond 2) Machine doesn't respond to ping 3) sysrq keys don't work. 4) Keyboard at the console doesn't work (for example, caps lock key doesn't work) =Comment: #5================================================= Paul A. Clarke <pacman.com> - 2008-03-06 14:04 EDT I'll start looking into this as well. 
I'll be out from Saturday thru Tuesday, but will try to get up to speed quickly. =Comment: #9================================================= Paul A. Clarke <pacman.com> - 2008-03-07 11:32 EDT (In reply to comment #8) > I'll put it in a loop and see if it finally hangs. Still running, but I did see this on the console: Clocksource tsc unstable (delta = 518100895 ns) =Comment: #10================================================= P. Thwaite <paul_thwaite.com> - 2008-03-07 11:40 EDT Paul - as a matter of interest, what hardware type are you using in the runs you describe in comment 8 and 9? =Comment: #11================================================= Paul A. Clarke <pacman.com> - 2008-03-07 13:08 EDT (In reply to comment #10) > Paul - as a matter of interest, what hardware type are you using in the runs > you describe in comment 8 and 9? # dmidecode --string system-product-name IBM eServer BladeCenter LS20 -[885071U]- =Comment: #12================================================= Paul A. Clarke <pacman.com> - 2008-03-07 15:28 EDT (In reply to comment #8) > I'll put it in a loop and see if it finally hangs. > > I'm running with the latest kernel, 2.6.24.3-29.el5rt. Since I'll be away until Wednesday 12 March, I'll provide current status: I'm up to iteration 58 in my loop, no hangs, and have only observed the two issues noted in previous comments. Sripathi, could you restart your efforts with the latest kernel as noted above? That kernel includes the fix for bug #42758, so no patching and rebuilding is required. =Comment: #13================================================= John G. Stultz <jstultz.com> - 2008-03-07 21:27 EDT Might be worth checking if the set_kthread_prio bug is involved here (ie: IRQ handlers being starved?). =Comment: #14================================================= P. Thwaite <paul_thwaite.com> - 2008-03-10 10:44 EDT Bug 42758 is now fixed (kernel no longer panics). 
Whilst testing 42758, we continue to see this bug (which typically causes the machine to hang). This bug is now the next blocker for JTC RH-MRG testing. We are running tests at the moment to determine which test (or set of tests) causes the hangs. Details will be available soon.

=Comment: #15=================================================
Sripathi Kodi <sripathi.com> - 2008-03-11 03:26 EDT
(In reply to comment #13)
> Might be worth checking if the set_kthread_prio bug is involved here (ie: IRQ
> handlers being starved?).

I am trying this out.

=Comment: #16=================================================
Sripathi Kodi <sripathi.com> - 2008-03-11 03:41 EDT
(In reply to comment #15)
> (In reply to comment #13)
> > Might be worth checking if the set_kthread_prio bug is involved here (ie: IRQ
> > handlers being starved?).
>
> I am trying this out.

Nope, that fix doesn't seem to help this problem. I can still recreate the problem. I am using 2.6.24.3-29.el5rt kernel.

=Comment: #17=================================================
Paul A. Clarke <pacman.com> - 2008-03-11 23:10 EDT
(In reply to comment #16)
> Nope, that fix doesn't seem to help this problem. I can still recreate the
> problem. I am using 2.6.24.3-29.el5rt kernel.

I wonder why I can't reproduce it. I'm on iteration 1123.

=Comment: #18=================================================
Sripathi Kodi <sripathi.com> - 2008-03-12 11:33 EDT
(In reply to comment #17)
> (In reply to comment #16)
> > Nope, that fix doesn't seem to help this problem. I can still recreate the
> > problem. I am using 2.6.24.3-29.el5rt kernel.
>
> I wonder why I can't reproduce it. I'm on iteration 1123.

That's a surprise. I have recreated this on two LS20s so far. Most recently I tried this on (ABAT provisioned) ltcrt16 and recreated it easily. I just cancelled my job on ltcrt16, so you could try on that very machine and see if it works.
=Comment: #19================================================= Sripathi Kodi <sripathi.com> - 2008-03-12 11:35 EDT I booted with nmi_watchdog=2, verified that NMIs were getting generated and recreated the problem. I still could not see anything on the console when the problem happened. It would either mean the system is so badly hosed that it can't handle NMIs or it is just unable to print anything on the console. I am hoping that it is the latter and thinking of ways to circumvent it. =Comment: #20================================================= Paul A. Clarke <pacman.com> - 2008-03-12 11:59 EDT (In reply to comment #18) > (In reply to comment #17) > > (In reply to comment #16) > > > Nope, that fix doesn't seem to help this problem. I can still recreate the > > > problem. I am using 2.6.24.3-29.el5rt kernel. > > > > I wonder why I can't reproduce it. I'm on iteration 1123. > > Thats a surprise. I have recreated this on two LS20s so far. Most recently I > tried this on (ABAT provisioned) ltcrt16 and recreated it easily. I just > cancelled my job on ltcrt16, so you could try on that very machine and see if it > works. Something finally happened, but it's still not hung...I seem to be stuck in an endless loop, with these appearing continuously on the terminal from which the tests are running: 20080312-11:57:31 Command ps -aux did not complete in 30 seconds and has been terminated I'll take a peek at ltcrt16...you want to look at ltcrt6? Maybe I set something up wrong? =Comment: #21================================================= Paul A. Clarke <pacman.com> - 2008-03-12 13:43 EDT (In reply to comment #18) > (In reply to comment #17) > > (In reply to comment #16) > > > Nope, that fix doesn't seem to help this problem. I can still recreate the > > > problem. I am using 2.6.24.3-29.el5rt kernel. > > > > I wonder why I can't reproduce it. I'm on iteration 1123. > > Thats a surprise. I have recreated this on two LS20s so far. 
Most recently I > tried this on (ABAT provisioned) ltcrt16 and recreated it easily. I just > cancelled my job on ltcrt16, so you could try on that very machine and see if it > works. OK, I've grabbed ltcrt16. The only jtctests data that I can find is in root's home dir. Sripathi, Paul, Are you running these tests as root? If so, why? I wonder if that's the difference between our runs...I'm running as a non-root user. Anyway, I'll fire up some tests on ltcrt16 and see what happens. =Comment: #22================================================= Paul A. Clarke <pacman.com> - 2008-03-12 14:47 EDT (In reply to comment #21) > I wonder if that's the difference between our runs...I'm running as a non-root user. > > Anyway, I'll fire up some tests on ltcrt16 and see what happens. I ran once as root and quickly lost control of the machine, including the SOL session. I'm up to iteration 6 running as non-root, no hangs so far. =Comment: #23================================================= Sripathi Kodi <sripathi.com> - 2008-03-13 01:24 EDT (In reply to comment #21) <snip> > Sripathi, Paul, Are you running these tests as root? If so, why? > > I wonder if that's the difference between our runs...I'm running as a non-root user. Paul, I may have run the test as root on ltcrt16. I agree that I should not do that. However, on my local LS20 (llm50.in) I have always run it as a normal user and recreated the problem pretty consistently. It has never taken more than 5 iterations to recreate the problem. <snip> =Comment: #24================================================= Sripathi Kodi <sripathi.com> - 2008-03-13 01:35 EDT Discussion from bug 42758 that is relevant here: ------- Additional Comment #40 From Sripathi Kodi 2008-03-12 12:37 EDT [reply] ------- Internal Only (In reply to comment #39) > Seems that the machines are not hanging as per 48241 - they're actually > panicking. RIP is pointing to __spin_lock this time. 
As before, this is easily > reproducible with the same tests that found the first panic. > We didn't spot this at first as the panics are not output over the network or > serial link - it only appears on the screen. So unfortunately the only output we > have is what I've been able to photograph from the screen. > Do you want me to re-open this bug or start a new one? Is it possible that all hangs seen in 42841 could be explained by this? If that is possible, it will be useful to carry out analysis in 42841. If it turns out to be a totally new problem, we can open a new bug later. We surely like to see the photograph, btw. ------- Additional Comment #41 From P. N. Stanton 2008-03-12 13:34 EDT [reply] ------- Internal Only It could be the same hang as in 42841, but I see kernel messages every time whereas the comments in 42841 say that no output is produced. I've attached a photo from rtj-opt6.hursley.ibm.com. Some of the information looks to have disappeared off the top of the screen, but this is the only one I've seen that shows a call trace. Apologies for the quality of the photography - took this with my mobile phone =Comment: #25================================================= Sripathi Kodi <sripathi.com> - 2008-03-13 01:36 EDT Screenshot of the panic submitted by P. N. Stanton =Comment: #26================================================= Sripathi Kodi <sripathi.com> - 2008-03-13 01:45 EDT The screenshot is interesting. The panic seems to be because of an nmi. My guess is that the system was hung and nmi triggered the panic. This is what I have been trying to do without success! We are probably seeing a deadlock here. 
Parts of the text from the screenshot: Pid: 30090, comm: java default_do_nmi + 0x6c/0x1a6 do_nmi + 0x3e/0x5a nmi + 0x7f/0x90 spin_lock + 0x1d/0x23 double_lock_balance + 0x57/0x60 push_rt_task + 0xa4/0x20e push_rt_tasks + 0x14/0x1c <== Not sure whether it is 0x1c or 0x1e task_wake_up_rt + 0x26/0x28 wake_up_new_task + 0xa7/0xbc do_fork + 0x13e/0x20e Peter, can this be recreated? If yes, it will be great to get a kdump when it happens. Procedure to set up kdump is here: http://rt.et.redhat.com/page/RHEL-RT_kdump/kexec I can help with this if needed. =Comment: #27================================================= Sripathi Kodi <sripathi.com> - 2008-03-13 06:26 EDT This time, I used the NMI button on the front panel of the blade to trigger an NMI when the hang occurred. The bladecenter logs show that NMI was pressed for the particular blade, but the system did not respond to it. I saw nothing on console, SOL had stopped working and I got no kdump. =Comment: #28================================================= P. N. Stanton <pstanton.com> - 2008-03-13 07:26 EDT Hardware info for the two machines that we've been seeing these panics on: rtj-opt6.hursley.ibm.com: eServer 326m, model number 7969-76G 2 x 2.4 GHz Opteron 280 (dual core), 5 GB RAM rtj-opt22.hursley.ibm.com: eServer x3455, model number 7984-52G 2 x 2.6 GHz Opteron 2218 (dual core), 10 GB RAM Pressing the NMI button on the back of rtj-opt22 produces these kernel messages: Uhhuh. NMI received for unknown reason 21. Do you have a strange power saving mode enabled? Dazed and confused, but trying to continue The system then continues running normally. The e326m does not appear to have an NMI button. Both machines have kdump set up; we are running Java tests to re-create the problem. I'll attach the other screenshots I've taken - two from rtj-opt6 and one from rtj-opt22. =Comment: #29================================================= P. N. 
Stanton <pstanton.com> - 2008-03-13 07:27 EDT
Screenshot of panic from rtj-opt6.hursley.ibm.com

=Comment: #30=================================================
P. N. Stanton <pstanton.com> - 2008-03-13 07:27 EDT
Screenshot of panic from rtj-opt6.hursley.ibm.com

=Comment: #31=================================================
P. N. Stanton <pstanton.com> - 2008-03-13 07:28 EDT
Screenshot of panic from rtj-opt22.hursley.ibm.com

=Comment: #32=================================================
Sripathi Kodi <sripathi.com> - 2008-03-13 09:07 EDT
I will try this on our local x3455 machine.

=Comment: #33=================================================
Sripathi Kodi <sripathi.com> - 2008-03-13 09:43 EDT
At last! I recreated the problem on llm55.in and got a kdump. Backtrace looks like the following. The version of crash on the system could not read the dump properly. Hence I pulled down latest version of crash from http://people.redhat.com/~anderson/ and compiled it. I will post my observation of the dump soon.
crash> bt
PID: 28301  TASK: ffff81022e6f0e80  CPU: 0  COMMAND: "java"
 #0 [ffffffff80a67d00] machine_kexec at ffffffff802246a1
 #1 [ffffffff80a67de0] crash_kexec at ffffffff8026929a
 #2 [ffffffff80a67ea0] die_nmi at ffffffff804a77de
 #3 [ffffffff80a67ed0] nmi_watchdog_tick at ffffffff804a7d23
 #4 [ffffffff80a67f00] default_do_nmi at ffffffff804a7453
 #5 [ffffffff80a67f30] do_nmi at ffffffff804a7dd9
 #6 [ffffffff80a67f50] nmi at ffffffff804a725f
    [exception RIP: __spin_lock+26]
    RIP: ffffffff804a6980  RSP: ffff81022398dde8  RFLAGS: 00200086
    RAX: ffff81022398dfd8  RBX: ffff810001021680  RCX: 0000000000000003
    RDX: 0000000000000000  RSI: ffff810001021680  RDI: ffff810001021680
    RBP: ffff81022398dde8   R8: ffff810001005840   R9: 00000000ffffffff
    R10: 00000000fffffff4  R11: 00000000c9c05c35  R12: ffff810001011680
    R13: ffffffff80a5f680  R14: ffff810001011680  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <exception stack> ---
 #7 [ffff81022398dde8] __spin_lock at ffffffff804a6980
 #8 [ffff81022398ddf0] double_lock_balance at ffffffff802329e0
 #9 [ffff81022398de10] push_rt_task at ffffffff802330e1
#10 [ffff81022398de50] push_rt_tasks at ffffffff8023325f
#11 [ffff81022398de70] task_wake_up_rt at ffffffff80239661
#12 [ffff81022398de80] wake_up_new_task at ffffffff8023a1f9
#13 [ffff81022398deb0] do_fork at ffffffff8023c860
#14 [ffff81022398df40] sys32_clone at ffffffff80229fe0
#15 [ffff81022398df50] ia32_ptregs_common at ffffffff80229cf5
    RIP: 0000000045b7efc8  RSP: 00000000cac0ce68  RFLAGS: 00200296
    RAX: ffffffffffffffda  RBX: 00000000003d0f00  RCX: 00000000a2cdd4b4
    RDX: 00000000a2cddbd8  RSI: 00000000cac0ced4  RDI: 00000000a2cddbd8
    RBP: 00000000cac0cf00   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000078  CS: 0023  SS: 002b

=Comment: #34=================================================
John G.
Stultz <jstultz.com> - 2008-03-13 12:29 EDT So looking at the screenshots for rtj-opt6.hursley.ibm.com, it seems we're hung up on a spinlock in the apic_timer_interrupt code. This is similar to Sripathi's, but not in the same place. On rtj-opt22.hursley.ibm.com, it seems is hung in default_idle, which is an odd place to hang. =Comment: #35================================================= Paul A. Clarke <pacman.com> - 2008-03-13 17:28 EDT (In reply to comment #33) > At last! I recreated the problem on llm55.in and got a kdump. (talked to sripathi on irc...) Two of the CPUs are stuck in double_lock_balance. (per sripathi, "that is fine") One appears to be in schedule. One is in rb_insert_color: #8 [ffff810222db3a30] rb_insert_color at ffffffff8033f253 #9 [ffff810222db3a60] __enqueue_entity at ffffffff80231a11 #10 [ffff810222db3a70] put_prev_task_fair at ffffffff802397a9 #11 [ffff810222db3a90] __schedule at ffffffff804a4eab #12 [ffff810222db3b70] schedule at ffffffff804a5527 #13 [ffff810222db3b80] rt_mutex_slowlock at ffffffff804a6055 #14 [ffff810222db3c40] rt_mutex_lock at ffffffff804a5cf2 #15 [ffff810222db3c50] __rt_down_read at ffffffff8025e285 #16 [ffff810222db3c70] rt_down_read at ffffffff8025e29f #17 [ffff810222db3c80] futex_wait at ffffffff8025aef3 #18 [ffff810222db3e10] do_futex at ffffffff8025c316 #19 [ffff810222db3ef0] compat_sys_futex at ffffffff8025d301 #20 [ffff810222db3f80] cstar_do_call at ffffffff80229a04 RIP: 00000000ffffe405 RSP: 00000000b19c83c8 RFLAGS: 00200202 RAX: ffffffffffffffda RBX: ffffffff80229a04 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 0000000000000000 RDI: 00000000b19c8b90 RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: 00000000000000f0 CS: 0023 SS: 002b =Comment: #36================================================= Paul A. 
Clarke <pacman.com> - 2008-03-13 19:32 EDT
traceback indicates that the task which is in rb_insert_color is at (bt -a)
    [exception RIP: rb_insert_color+175]
    RIP: ffffffff8033f253

disassembly with line numbers shows: (dis -l rb_insert_color)
/usr/src/debug/kernel-2.6.24.3/linux-2.6.24.3.x86_64/lib/rbtree.c: 129
0xffffffff8033f253 <rb_insert_color+175>: andq $0xfffffffffffffffe,(%r12)

line 129 of rbtree.c:
128 rb_set_black(parent);
129 rb_set_red(gparent);
130 __rb_rotate_left(gparent, root);

now, at this point in the instruction sequence, I believe:

R15 is "root":
    R15: ffff8101300ae708
root is a pointer to the address of the root node of the tree:
crash> rd ffff8101300ae708 1
ffff8101300ae708: ffff810222d9f358

R12 is "gparent":
    R12: ffff810222d9f358
RCX is "parent":
    RCX: ffff810222d9f358

lets look at that node:
crash> rd ffff810222d9f358 3
ffff810222d9f358: ffff810222d9f359 ffff810222d9f358
ffff810222d9f368: 0000000000000000

the node structure is { parent(and color), right, left }
note that the color of the node is stored in the low order bit of the parent address, and red is 0 and black is 1 (per rbtree.h).

so, this tree is apparently a single node, currently black (from line 128), but about to be set to red (line 129), right link points to itself, left link is null.

The tree has been corrupted. The root node's parent should be NULL, and nodes should not point to themselves.

Since the node has been set to red, that likely explains why this function never finishes, since the loop condition is:
76 while ((parent = rb_parent(node)) && rb_is_red(parent))

=Comment: #37=================================================
Ankita Garg <ankigarg.com> - 2008-03-14 01:49 EDT
(In reply to comment #33)
> At last! I recreated the problem on llm55.in and got a kdump. Backtrace looks
> like the following. The version of crash on the system could not read the dump
> properly. Hence I pulled down latest version of crash from
> http://people.redhat.com/~anderson/ and compiled it.
Sripathi, so does that mean we will need to open a new bug to ask RH to ship the new crash version with MRG? =Comment: #39================================================= Sripathi Kodi <sripathi.com> - 2008-03-14 04:33 EDT I have looked at the dump a bit more. This is the summary of what I have seen. If anyone is interested in detailed analysis I can put it up. cpu0: 'java' process pid:28301 is trying to hold the runqueue lock of cpu 1 cpu1: 'java' process pid:14094 with incomplete backtrace! cpu2: 'softirq-cru/2' process pid:39 is trying to hold runqueue lock of cpu 3 cpu3: 'java' process pid:14092 is spinning and has the rq lock of rq3. Paul has found that it contains a corrupted r-b tree, because of which it is spinning forever. This can explain why cpu0, cpu2 and cpu3 are not able to make progress. I am not sure what is happening on cpu1, however. I cannot confirm that it has the runqueue lock of cpu1. It probably doesn't? It's backtrace is: PID: 14094 TASK: ffff8100c1c21600 CPU: 1 COMMAND: "java" #0 [ffff8100c0181f58] schedule at ffffffff804a520c #1 [ffff8100c0181f80] cstar_do_call at ffffffff80229a04 RIP: 00000000ffffe405 RSP: 00000000badef3c8 RFLAGS: 00200202 RAX: ffffffffffffffda RBX: ffffffff80229a04 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 0000000000000000 RDI: 00000000badefb90 RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: 00000000000000f0 CS: 0023 SS: 002b I can't establish who holds cpu1's runqueue lock. I looked at the backtraces of all other runnable tasks in vain. More info later. We may have to recreate the problem again to confirm the r-b tree corruption. =Comment: #40================================================= Sripathi Kodi <sripathi.com> - 2008-03-17 09:04 EDT I tried to recreate this again on llm55.in to get another core dump. 
This would help us reconfirm the r-b tree corruption that Paul has described. However, even though the system seems to panic when the test is run, it did not trigger a kdump! I tried 10s of times and gave up. I don't have a serial console on this machine, so I can't get much information from it. I will see if I can do this on rt-ash instead. =Comment: #41================================================= Sripathi Kodi <sripathi.com> - 2008-03-17 09:06 EDT I had a little chat with Vatsa about this problem today. He will try to get some of his time for this later tonight/tomorrow. He too feels it will be nice to recreate this again to confirm that our observations are consistent. =Comment: #42================================================= John G. Stultz <jstultz.com> - 2008-03-17 15:46 EDT Can we mirror this issue to RH? =Comment: #45================================================= Sripathi Kodi <sripathi.com> - 2008-03-18 09:21 EDT I am trying out Hiroshi Shimamoto's patch from LKML ("fix race in schedule"), because some of the backtraces I saw on later attempts looked suspiciously similar to the one he has reported. Will report results soon. =Comment: #48================================================= Sripathi Kodi <sripathi.com> - 2008-03-18 11:23 EDT (In reply to comment #45) > I am trying out Hiroshi Shimamoto's patch from LKML ("fix race in schedule"), > because some of the backtraces I saw on later attempts looked suspiciously > similar to the one he has reported. Will report results soon. Looking good so far in 20 iterations. Running 100 more. =Comment: #49================================================= Sripathi Kodi <sripathi.com> - 2008-03-18 11:26 EDT The patch I am testing is: http://article.gmane.org/gmane.linux.rt.user/2577
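For anyone re-reading the node dump in comment #36, the three words printed by "rd ffff810222d9f358 3" can be decoded with a small user-space sketch like the one below. This is not kernel code: the struct and helper names are hypothetical, the field order follows the { parent(and color), right, left } layout given in that comment, and the ~3 address mask is an assumption based on how rbtree.h of that era packs the parent pointer. It only evaluates the loop condition quoted from line 76 for a node that is its own parent; it does not model the rest of rb_insert_color.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the dumped node: { parent (and color), right, left }. */
struct node_words {
    uint64_t parent_color;
    uint64_t right;
    uint64_t left;
};

#define COLOR_BIT 1ULL    /* low-order bit: 0 = red, 1 = black (per comment #36) */
#define ADDR_MASK (~3ULL) /* assumption: the low two bits are not address bits */

/* Address of the parent with the color bit stripped off. */
static uint64_t parent_addr(const struct node_words *n)
{
    return n->parent_color & ADDR_MASK;
}

static bool is_red(const struct node_words *n)
{
    return (n->parent_color & COLOR_BIT) == 0;
}

/* Same effect as the andq $0xfffffffffffffffe seen at rb_insert_color+175. */
static void set_red(struct node_words *n)
{
    n->parent_color &= ~COLOR_BIT;
}

/* The condition from line 76: parent is non-NULL and parent is red. */
static bool loop_condition(const struct node_words *parent, uint64_t parent_address)
{
    return parent_address != 0 && is_red(parent);
}

/* Walk through the values from comment #36. */
void check_comment36_dump(void)
{
    const uint64_t self = 0xffff810222d9f358ULL;
    struct node_words node = { 0xffff810222d9f359ULL, 0xffff810222d9f358ULL, 0 };

    assert(parent_addr(&node) == self); /* the node is its own parent */
    assert(node.right == self);         /* the right link points back at the node */
    assert(!is_red(&node));             /* currently black (line 128) */
    assert(!loop_condition(&node, parent_addr(&node)));

    set_red(&node);                     /* line 129 */
    assert(loop_condition(&node, parent_addr(&node)));
}
```

Calling check_comment36_dump() from a main() returns silently if the reading in comment #36 holds: the node is its own parent, and once the color bit is cleared the line 76 condition evaluates true for it.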
Created attachment 298564 [details] Screenshot of panic from rtj-opt6.hursley.ibm.com
Created attachment 298565 [details] Screenshot of panic from rtj-opt22.hursley.ibm.com
Created attachment 298566 [details] Screenshot of the panic submitted by P. N. Stanton
Created attachment 298567 [details] Screenshot of panic from rtj-opt6.hursley.ibm.com
------- Comment From sripathi.com 2008-03-20 01:49 EDT------- (In reply to comment #49) > The patch I am testing is: > http://article.gmane.org/gmane.linux.rt.user/2577 After 100s of iterations with this patch I feel confident that the patch fixes this problem. I have asked Ingo to confirm whether this patch is headed to next -rt patch. It is already in mainline.
Created attachment 298650 [details] Hiroshi-san's patch for 2.6.24.3-29.el5rt kernel Attaching Hiroshi-san's patch for 2.6.24.3-29.el5rt kernel.
------- Comment From matthewclarke.com 2008-03-20 11:32 EDT------- Peter has installed the new patch '2.6.24.3-29.el5rt.42841' on 3 of our machines. We ran multiple stress tests on the load that we knew used to cause the kernel panic, and after 50 iterations no failures were seen. A substantial run of stress tests has been submitted for a weekend run -> http://jsvtaxxon.hursley.ibm.com/build_info.php?build_id=19111 to see how it copes. It looks like the kernel patch has fixed the system hangs that we have been seeing.
------- Comment From jstultz.com 2008-03-20 19:50 EDT------- Clark: Please pick up hiroshi-san's patch for MRG.
------- Comment From paul_thwaite.com 2008-03-25 06:35 EDT------- Load Level tests have been running all weekend and no failures have been seen so far. All 5 minute tests have passed. 50% of 1 hour tests have passed. The 3 hour tests are yet to run. We have 3 machines currently running these tests so it will take a while to get through them all. The fix does look good.
Fix picked up with the 2.6.24.4 stable patch which should be rolled out this week. I'll change status to MODIFIED and if we're all happy we can close it next week. Clark
------- Comment From sripathi.com 2008-03-26 01:28 EDT------- Yes, from IBM's side we are happy about the fix.
------- Comment From sripathi.com 2008-04-02 12:08 EDT------- Verified that the patch is in 2.6.24.4-30.el5rt kernel.