Bug 199667
Summary: ext3 file system crashed in my IA64 box

Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: ia64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: medium
Reporter: bibo,mao <bibo.mao>
Assignee: Eric Sandeen <esandeen>
QA Contact: Brian Brock <bbrock>
CC: andrew.patterson, bibo.mao, bjorn.helgaas, jarod, jbaker, jbaron, lori.carlson, lwang, nicholas.dokos, rick.hester
Fixed In Version: RHEL4U4
Doc Type: Bug Fix
Last Closed: 2006-09-20 19:35:45 UTC
Description (bibo,mao, 2006-07-21 09:18:01 UTC)
I debugged this problem by examining register and memory contents. It appears that in journal_dirty_metadata() the statement

    struct journal_head *jh = bh2jh(bh);

can yield a NULL jh. I wrote a patch, and with it the LTP stress test passes, but I am not familiar with the ext3 filesystem, so I do not know whether this is the root cause.

--- linux-2.6.9/fs/jbd/transaction.c.orig	2006-06-30 14:05:58.000000000 +0800
+++ linux-2.6.9/fs/jbd/transaction.c	2006-07-07 02:56:32.000000000 +0800
@@ -1104,13 +1104,15 @@ int journal_dirty_metadata(handle_t *han
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
-	struct journal_head *jh = bh2jh(bh);
+	struct journal_head *jh;
 
-	jbd_debug(5, "journal_head %p\n", jh);
-	JBUFFER_TRACE(jh, "entry");
 	if (is_handle_aborted(handle))
 		goto out;
 
+	jh = journal_add_journal_head(bh);
+	jbd_debug(5, "journal_head %p\n", jh);
+	JBUFFER_TRACE(jh, "entry");
+
 	jbd_lock_bh_state(bh);
 
 	/*
@@ -1154,6 +1156,7 @@ int journal_dirty_metadata(handle_t *han
 		spin_unlock(&journal->j_list_lock);
 out_unlock_bh:
 	jbd_unlock_bh_state(bh);
+	journal_put_journal_head(jh);
 out:
 	JBUFFER_TRACE(jh, "exit");
 	return 0;

I tested on RHEL4-U3; I will also test it on RHEL4-U5.

Created attachment 134829 [details]
nat consumption panic
Created attachment 134830 [details]
another nat consumption panic
This and the previous attachment are panics on a large HP ia64
system (64 socket, dual-core Montecito, 1TB RAM), running the
RHEL4 U3 largesmp kernel and an HP internal I/O stress test.
We can reproduce this failure reliably (within 12 hours), and
it is a serious problem for shipping Montecito servers.
This is the same problem reported in Issue Tracker 100177. That sighting was on a much smaller ia64 system (2-socket dual-core Montecito) running the normal RHEL4 U3 kernel.

Created attachment 134833 [details]
upstream patch to fix JBD race in t_forget list handling

Although this should certainly not lead to a kernel panic, do you know the root cause of the I/O errors reported in the log in comment #7 and comment #8?

    SCSI error : <11 0 4 1> return code = 0x6000000
    end_request: I/O error, dev sdeo, sector 31437897

Does the attached patch help?

I don't know the root cause of the I/O errors preceding the panic. I will put the patch into RHEL4 U3 and try to reproduce the problem. We should have some results by tomorrow. It sounds like any customer-shippable kernel with this fix would be a post-U4 kernel, so we'll want that before doing testing beyond this specific issue.

I have seen the I/O errors before; perhaps the test case performs so many file system operations that the disk hits errors. The I/O errors did not appear again after I switched to a new disk. I applied this patch, but the bug still occurred.

There are two different test cases here, right? The bug was opened for an LTP stress test, while the HP folks are using an internal test. Bibo,Mao had earlier reported that the patch did not help in the LTP case, but it at least seems worth a shot in the HP case. It also looks like different kernel versions are in play here. They may well share the same root cause, but it's not immediately clear yet. (Hm, interesting, though, that all three pasted oopses are down the sys_mkdir path....) Thanks, -Eric

Yes, HP is seeing the problem when using a different test case. HP is using "hazard", an internal I/O stress test. Intel is using an LTP stress test.

Bibo, you reported seeing the problem even with the patch. Can you confirm that the patch you're testing is the one in comment #10 (not the one in comment #2)? How long does it take you to reproduce the problem? The LKML mail here: http://lkml.org/lkml/2006/7/25/61 mentions three days, but it must happen sooner sometimes, if you've already seen it with the comment #10 patch.

HP is currently testing the comment #10 patch. Without the patch, we were seeing failures after 12 hours or so of hazard testing.

Created attachment 134946 [details]
another nat consumption panic, on rx2620 with montecito
We saw this on an rx2620 with two Montecitos, threads enabled, 32GB, about four hours into the rhr2-2.0.3 memory cert test. Note that the backtrace is slightly different (it came through open/create, not mkdir), but the offset into journal_dirty_metadata and the NULL pointer in r18 are identical.
Bjorn, can you grab a vmcore from the 2620?

Bjorn, is the oops in comment #15 on a stock kernel, or patched?

The crash in comment #15 (rx2620 with two Montecitos) is with stock RHEL4 U3. We don't have a vmcore for that crash. When testing RHEL4 U3 + the comment #10 patch on SandDune with 64 Montecitos, we saw the same crash again after about 12 hours of hazard.

Ok, thanks for trying it, anyway. Suggestion: try the following patch from RHEL4U4: Patch1049: linux-2.6.9-ext3-jbd-race.patch. It is supposed to eliminate some races that *might* help with this problem. I've already suggested it to Bjorn, but I thought I'd add a note here to see if Mao Bibo could test as well. He did mention that he would test on U5 (which presumably has the patch - I have not checked), but I don't think he reported results. Out of curiosity, has an upstream kernel been checked with this testcase? (2.6.17, or 2.6.18-rcX?) That might be a lot to ask, but it might help to know whether we should be looking for a problem that has already been fixed.

Bibo, two questions:
1. Do you have a way of reproducing this problem in a short period of time (<<12 hours)?
2. Can you provide an update on how your testing with U4 (for this problem) is progressing?
Regards, Ron

Created attachment 135093 [details]
ext3 patch from RHEL4 U4

Just for convenience, here's the RHEL4U4 patch referred to in comment #23. It's from this upstream LKML discussion: http://lkml.org/lkml/2005/3/8/147

I'm having a bit of a hard time finding my way through the twisty maze of inlines in journal_dirty_metadata, but I believe that the problem at journal_dirty_metadata+0x221/0x540 is around this part of the code:

    if (jh->b_transaction == transaction &&
        jh->b_jlist == BJ_Metadata) {

We oops at:

    3000: 0b 90 80 46 00 21 [MMI] adds r18=32,r35;;

Where did r35 come from? Backing up to the top of the function:

    2df6: f0 00 86 00 42 c0       adds r15=64,r33

r33 was the buffer_head passed in (2nd arg); b_private is at offset 64, and b_private is the jh. Later we do:

    2e06: 30 02 3c 30 20 00       ld8 r35=[r15]

so now the jh is in r35, and if we're oopsing at:

    3000: 0b 90 80 46 00 21 [MMI] adds r18=32,r35;;

we're trying to look at offset 32 from jh, which is jh->b_transaction.

Now, the patch mentioned in comment #23 above and attached in comment #29 had this in Stephen's original analysis: "A truncate running in parallel can lead to journal_unmap_buffer() destroying the jh if it occurs between these two calls." If jh were destroyed and NULL, then this would match what is seen in the oops. We oops just after this line in assembly:

    3000: 0b 90 80 46 00 21 [MMI] adds r18=32,r35;;

and r18 in the oops is:

    r18 : 0000000000000020

b_transaction in a journal head is at offset 0x20 (32). So if "jh" were NULL, and we tried to read jh->b_transaction, we'd try to read memory address 0x20, which would cause this panic. So I think there's a good case to be made that we are seeing a NULL jh here, and if the patch from RHEL4U4 seems to fix it, it probably is indeed the right fix for this case.

Created attachment 135099 [details]
ext3 jbd crashed on RHEL4.3

I tested on RHEL4-U4; the system crashed in a different place.
This time it seems to be the same as the bug reported in https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=158363. I will apply the patch attached there to RHEL4-U4 and test again. I will also modify the LTP test case to check whether it can shorten the time to reproduce this bug.

(In reply to comment #26)
> Bibo,
> Two questions:
> 1. Do you have a way of reproducing this problem in a short period of time (<<12 hours)?

Now I am modifying the LTP test script to check whether it can shorten the time to reproduce this problem.

> 2. Can you provide an update on how your testing with U4 (for this problem) is progressing?

I am testing RHEL4-U4 with the patch attached in https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=158363. I tested on a Madison machine with 4 physical CPUs; the system did not crash. On RHEL4.3, when I compiled the ext3 filesystem built-in and without the CONFIG_DEBUG_SPINLOCK and CONFIG_DEBUG_SPINLOCK_SLEEP options, the system did not crash. But when I compiled the kernel with the default options from the SRPM package on a Montecito with 4 physical dual-core CPUs and hyperthreading enabled, the system crashed, generally within 48 hours.

Just ran into a similar crash on an unmodified RHEL4 U3. This crash is in fs/jbd/transaction.c though:

    diskfs[21164]: bugcheck! 0 [1]
    Assertion failure in do_get_write_access() at fs/jbd/transaction.c:608: "jh->b_next_transaction == ((void *)0)"
    kernel BUG at fs/jbd/transaction.c:608!
    Modules linked in: md5 ipv6 parport_pc lp parport dev_acpi(U) autofs4 sunrpc ds yenta_socket
     pcmcia_core scsi_dump diskdump zlib_deflate mptctl(U) lpfcdfc(U) vfat fat dm_mod button
     ohci_hcd ehci_hcd shpchp tg3 e1000 s2io sg sr_mod ext3 jbd qla2400(U) qla2300(U) qla2xxx(U)
     qla2xxx_conf(U) lpfc(U) scsi_transport_fc cciss mptspi(U) mptscsih(U) mptbase(U) sd_mod scsi_mod

    Pid: 21164, CPU 10, comm: diskfs
    psr : 0000101008126010 ifs : 8000000000000794 ip : [<a0000002000f0b90>] Not tainted
    ip is at do_get_write_access+0xbb0/0x11c0 [jbd]
    unat: 0000000000000000 pfs : 0000000000000794 rsc : 0000000000000003
    rnat: 0000000000000158 bsps: 0000073b2cea0f24 pr  : 0000001805659959
    ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
    csd : 0000000000000000 ssd : 0000000000000000
    b0  : a0000002000f0b90 b6  : a00000010006ff00 b7  : a000000100259f40
    f6  : 1003e0000000000001200 f7  : 1003e8080808080808081
    f8  : 1003e00000000000023dc f9  : 1003e000000000e58b20e
    f10 : 1003e000000003571d99d f11 : 1003e44b831eee7285baf
    r1  : a0000001009add90 r2  : 0000000000000001 r3  : 0000000000100000
    r8  : 0000000000000028 r9  : 0000000000000001 r10 : e00007000002510c
    r11 : 0000000000000003 r12 : e000070fe4537c50 r13 : e000070fe4530000
    r14 : 0000000000004000 r15 : a000000100742ac0 r16 : a000000100742ac8
    r17 : e000070ff405fde8 r18 : e000070ff405802c r19 : e000070000025100
    r20 : e0000700000247c0 r21 : 0000000000000002 r22 : 0000000000000001
    r23 : e000070ff4058040 r24 : e000070000025860 r25 : e000070000025858
    r26 : e000070000025838 r27 : 0000000000000074 r28 : 0000000000000074
    r29 : 00000000ffffffff r30 : e000070ff4058080 r31 : 0000000000000000

    Call Trace:
     [<a000000100016b20>] show_stack+0x80/0xa0 sp=e000070fe45377c0 bsp=e000070fe4531628
     [<a000000100017430>] show_regs+0x890/0x8c0 sp=e000070fe4537990 bsp=e000070fe45315d8
     [<a00000010003dbf0>] die+0x150/0x240 sp=e000070fe45379b0 bsp=e000070fe4531598
     [<a00000010003dd20>] die_if_kernel+0x40/0x60 sp=e000070fe45379b0 bsp=e000070fe4531568
     [<a00000010003dec0>] ia64_bad_break+0x180/0x600 sp=e000070fe45379b0 bsp=e000070fe4531540
     [<a00000010000f540>] ia64_leave_kernel+0x0/0x260 sp=e000070fe4537a80 bsp=e000070fe4531540
     [<a0000002000f0b90>] do_get_write_access+0xbb0/0x11c0 [jbd] sp=e000070fe4537c50 bsp=e000070fe45314a0
     [<a0000002000f1660>] journal_get_write_access+0x60/0xa0 [jbd] sp=e000070fe4537cb0 bsp=e000070fe4531460
     [<a00000020018ca70>] add_dirent_to_buf+0x4f0/0x7c0 [ext3] sp=e000070fe4537cb0 bsp=e000070fe45313d8
     [<a00000020018cee0>] ext3_add_entry+0x1a0/0x1240 [ext3] sp=e000070fe4537cc0 bsp=e000070fe45312c0
     [<a00000020018e2b0>] ext3_add_nondir+0x30/0x100 [ext3] sp=e000070fe4537d90 bsp=e000070fe4531288
     [<a00000020018e5b0>] ext3_create+0x230/0x240 [ext3] sp=e000070fe4537d90 bsp=e000070fe4531238
     [<a000000100146000>] vfs_create+0x260/0x380 sp=e000070fe4537da0 bsp=e000070fe45311d8
     [<a0000001001475a0>] open_namei+0xe20/0xf00 sp=e000070fe4537da0 bsp=e000070fe4531150
     [<a00000010011e380>] filp_open+0x80/0x140 sp=e000070fe4537db0 bsp=e000070fe4531110
     [<a00000010011e9d0>] sys_open+0xd0/0x1a0 sp=e000070fe4537e30 bsp=e000070fe4531090
     [<a00000010000f3e0>] ia64_ret_from_syscall+0x0/0x20 sp=e000070fe4537e30 bsp=e000070fe4531090
     [<a000000000010640>] 0xa000000000010640 sp=e000070fe4538000 bsp=e000070fe4531090

From comment #32 and comment #33, it's my understanding that RHEL4U4 resolves the original issue, thanks to the linux-2.6.9-ext3-jbd-race.patch in U4. However, it sounds like testing then runs into the bug in 158363. Can HP do testing with an RHEL4U4 kernel, with default configuration options, plus the patch in comment #26 in bug 158363*, to see if the test runs reliably with that set of code? Thanks, -Eric
*https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=130424

Sorry, I don't follow you, Eric. Comment 32 says RHEL4U4 didn't crash on Madison. Comment 33 says RHEL4U3 crashed with a different panic (I think this update should probably be on a different bugzilla entry).
Neither one tells me whether RHEL4U4 fixes the original issue on Montecito. That said, my opinion is that you are right, and the patch in comment #29 (also included in RHEL4U4) probably DOES resolve the problem. We've started testing RHEL4U4, and we haven't seen the problem yet. Of course, we are continuing that testing, and our confidence will increase with time. But I would feel a lot better if the author of the patch (Stephen, I think you said) looked at this crash and confirmed that it looks like a manifestation of the bug he fixed. I just don't know enough about the filesystem code to have an opinion.

Bibo, can you report back the following:
1. Which processor you used when you originally reported this bug (Madison, Montecito, or both)
2. Whether you verified U4+patch on Madison, Montecito, or both. Comment #32 (thank you for the response) indicates Madison.

Ugh, sorry, I meant to reference comment #31 and comment #32. Comment #31 says that RHEL4U4 crashed in a -different- place, likely due to the other bug referenced. That's why I was interested in the combination of RHEL4U4 plus the patch from the other bug. Comment #32 says:
1) The stock U4 kernel did not crash on Madison.
2) U3 rebuilt with more debugging options did not crash (on what?) (although the debugging probably changes timing, and could well change or avoid racey problems).
3) Rebuilt "kernel with default option in SRPM package in Montecito with 4 physical cpu, dual-core and hyperthread function, system crashed."
I can't tell what kernel we're talking about in 3), or what hardware was tested. Also, I believe that my analysis in comment #30 makes a very strong case that you are seeing a NULL journal head, which is addressed by the patch in U4.

(In reply to comment #36)
> Bibo,
> Can you report back the following:
> 1. Which processor you used when you originally reported this bug (Madison,
> Montecito or both)

Sorry for the confusion. Originally this bug was produced on a Montecito machine, and the test case passed on a Madison machine. Previously I mainly tested the RHEL4.3 kernel with the default configuration options from the SRPM: Montecito failed and Madison passed. When I changed the kernel options, Montecito also passed; I did not test the changed configuration options on Madison.

> 2. If you verified U4+patch on Madison, Montecito or both. Comment #32 (Thank
> you for the response) indicates Madison

Now I am verifying U4 + patch on a Montecito machine. Comment #32 indicates Montecito; it hit another bug. I did not test the U4 kernel on the Madison machine, because my Madison machine cannot hit this bug; lately I always run the LTP test cases on the Montecito machine.

(In reply to comment #37)
> comment #32 says:
> 1) stock U4 kernel did not crash on madison

Sorry, I did not state it clearly: the stock U3 kernel did not crash on Madison. I did not verify the U4 kernel on Madison.

> 2) Rebuilt U3 with more debugging options did not crash (on what?)
> (although the debugging probably changes timing, and could well change or
> avoid racey problems)

U3 rebuilt with the tiger_defconfig options did not crash on Montecito; lately, in order to reproduce this bug, I always test the stock kernel with default options on Montecito.

> 3) Rebuilt "kernel with default option in SRPM package in Montecito with 4
> physical cpu, dual-core and hyperthread function, system crashed."
> I can't tell what kernel we're talking about in 3), or what hardware was tested.
> Also, I believe that my analysis in comment #30 makes a very strong case that
> you are seeing a null journal head, which is addressed by the patch in U4.

Yes, your analysis in comment #30 is right. I have debugged this NaT consumption problem, and the jh pointer is empty. But I do not know whether the patch in U4 can fix this problem. Now I am testing U4+patch on a Montecito machine.
It seems that U4+patch passed on my Montecito machine; I will test again.

Bibo, thanks for your testing! This is good news. Regards, Ron

We've completed a 96-hour hazard run on RHEL4 U4 with no u320 drives (see note below), with no problem found. This same configuration on RHEL4 U3 typically crashed in 12 hours or so. I'm pretty confident that the patch you identified is the fix, Eric. Can you double-check with Stephen that a null pointer dereference in journal_dirty_metadata() is one manifestation of the bug that he fixed?

When running hazard on RHEL4 U4 with u320 drives, we did see a panic due to a null pointer dereference at scsi_request_fn+0x730. This occurred during a "task abort", and we believe it is related to this issue tracker: https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=93312

Bjorn/Rick, any updates on the ETA for the driver for 93312? Regards, Ron

Re: 93312, we have not identified a root cause of the MPT Fusion driver problems yet.

I ran the LTP stress test cases twice against the kernel rpm package from http://people.redhat.com/~jbaron/rhel4, three days of stress testing each time, and both runs passed. The bug at https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=158363 also did not occur with this kernel version.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Just to make sure everyone is on the same page - I think the consensus is that this particular bug is -already- fixed in RHEL4 U4. HP folks, do you concur?

> Just to make sure everyone is on the same page - I think the consensus
> is that this particular bug is -already- fixed in RHEL4 U4. HP folks,
> do you concur?
Yes.
Closing based on customer feedback that this is resolved in RHEL4U4.