Bug 158363
Summary: | Assert panic in fs/jbd/commit.c:790:journal_commit_transaction() | |
---|---|---|---
Product: | Red Hat Enterprise Linux 4 | Reporter: | Nick Dokos <nicholas.dokos>
Component: | kernel | Assignee: | Eric Sandeen <esandeen>
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.0 | CC: | adaora.onyia, andrew.patterson, bjorn.helgaas, debby.fu, defranco, i-kitayama, jbaker, jburke, john, josef.moellers, lori.carlson, mcoffey, peter.keilty, rick.stern, rwilliamson, sct, staubach, tao
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | RHBA-2007-0304 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2007-05-01 22:54:04 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 150568 | |
Bug Blocks: | 198694, 234547 | |
Attachments: |
|
Description
Nick Dokos
2005-05-20 21:35:09 UTC
Changing arch to "all": I very much doubt that this is ia64-specific.

We have also seen this problem on x86_64. We have a stress test that shows it after 3-10 hours (which is a real pain since we're trying to get a 48hr run).

Question: you say that "this kernel" version includes a fix for "that" problem. What kernel is "this" and what problem is "that"? Is the "that" the 14604 bug? We've been running a late beta of RH4 U1 (I believe the kernel is 2.6.9.-5.46). I should add the trace later today or tomorrow.

Here's the panic info:

Kernel 2.6.9-6.46.ELsmp on an x86_64
pltest24.cup.hp.com login: process `cmquerycl' is using obsolete setsockopt SO_BSDCOMPA
process `cmclconfd' is using obsolete setsockopt SO_BSDCOMPAT
process `cmviewcl' is using obsolete setsockopt SO_BSDCOMPAT
process `hdtd' is using obsolete setsockopt SO_BSDCOMPAT
Assertion failure in journal_commit_transaction() at fs/jbd/commit.c:790: "jh->b_next_transaction == ((void *)0)"
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at commit:790
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: md5 ipv6 parport_pc lp parport pidentd(U) autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core deadman(U) button battery ac uhci_hcd ehci_hcd hw_random tg3 e1000 bonding floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
Pid: 333, comm: kjournald Not tainted 2.6.9-6.46.ELsmp
RIP: 0010:[<ffffffffa008f646>] <ffffffffa008f646>{:jbd:journal_commit_transaction+4006}
RSP: 0018:000001007f02dbb8 EFLAGS: 00010212
RAX: 0000000000000075 RBX: 0000000000000000 RCX: 0000000100000000
RDX: ffffffff803c8b08 RSI: 0000000000000246 RDI: ffffffff803c8b00
RBP: 0000010054821ce8 R08: ffffffff803c8b08 R09: 0000000000000000
R10: ffffffff801ea9ca R11: ffffffff801ea9ca R12: 000001004790f190
R13: 000001007b829ec0 R14: 000001007fdea600 R15: 000001004e282fa8
FS: 0000000000000000(0000) GS:ffffffff804c1700(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000558000 CR3: 0000000000101000 CR4: 00000000000006e0
Process kjournald (pid: 333, threadinfo 000001007f02c000, task 000001007fd18030)
Stack: 0000000000000000 00000f8c00000000 0000010060bca074 0000000000000007 000001004a972920 0000000000001c60 0000000000000000 000001007fd18030 ffffffff8013395e 000001007f02dc30
Call Trace:<ffffffff8013395e>{autoremove_wake_function+0}
<ffffffff801c284e>{avc_node_replace+52}
<ffffffff8013395e>{autoremove_wake_function+0}
<ffffffff80131308>{move_tasks+186}
<ffffffff80130ead>{finish_task_switch+55}
<ffffffff802f830c>{thread_return+42}
<ffffffff8013dc9f>{del_timer+107}
<ffffffffa0091898>{:jbd:kjournald+250}
<ffffffff8013395e>{autoremove_wake_function+0}
<ffffffff8013395e>{autoremove_wake_function+0}
<ffffffffa0091798>{:jbd:commit_timeout+0}
<ffffffff80110c8f>{child_rip+8}
<ffffffffa009179e>{:jbd:kjournald+0}
<ffffffff80110c87>{child_rip+0}
Code: 0f 0b f0 47 09 a0 ff ff ff ff 16 03 4c 89 e7 e8 b0 d4 ff ff
RIP <ffffffffa008f646>{:jbd:journal_commit_transaction+4006} RSP <000001007f02dbb8>
<0>Kernel panic - not syncing: Oops

This bug is present in Fedora Core 3 in kernels 2.6.9-1.724FC3smp, 2.6.9-1.667smp, 2.6.10-1.741_FC3smp.
The 2.6.9-1.667smp message is:

Jun 9 15:37:59 mx kernel: Unable to handle kernel NULL pointer dereference at virtual address 0000000c
Jun 9 15:37:59 mx kernel: printing eip:
Jun 9 15:37:59 mx kernel: 4285e63f
Jun 9 15:37:59 mx kernel: *pde = 00004001
Jun 9 15:37:59 mx kernel: Oops: 0002 [#1]
Jun 9 15:37:59 mx kernel: SMP
Jun 9 15:37:59 mx kernel: Modules linked in: md5 ipv6 autofs4 i2c_dev i2c_core sunrpc button battery ac ohci_hcd e1000 floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod mptscsih mptbase sd_mod scsi_mod
Jun 9 15:37:59 mx kernel: CPU: 0
Jun 9 15:37:59 mx kernel: EIP: 0060:[<4285e63f>] Not tainted VLI
Jun 9 15:37:59 mx kernel: EFLAGS: 00010202 (2.6.9-1.667smp)
Jun 9 15:37:59 mx kernel: EIP is at journal_commit_transaction+0x442/0xfb3 [jbd]
Jun 9 15:37:59 mx kernel: eax: 3c51a74c ebx: 00000000 ecx: 419a4c80 edx: 4075086c
Jun 9 15:37:59 mx kernel: esi: 39ee2a00 edi: 3c51a74c ebp: 41d0d980 esp: 40a7fdf0
Jun 9 15:37:59 mx kernel: ds: 007b es: 007b ss: 0068
Jun 9 15:37:59 mx kernel: Process kjournald (pid: 1253, threadinfo=40a7f000 task=40b6b770)
Jun 9 15:37:59 mx kernel: Stack: 00000000 00000000 00000000 00000000 00000000 00000000 33f972cc 39ee2a00
Jun 9 15:37:59 mx kernel: 3c51a80c 000002de 00000000 40b6b770 0211deca 40a7fe44 40a7fe44 00000100
Jun 9 15:37:59 mx kernel: 000000a0 0000c400 00000000 40b6b770 0211deca 40a7fe44 40a7fe44 00000000
Jun 9 15:37:59 mx kernel: Call Trace:
Jun 9 15:37:59 mx kernel: [<0211deca>] autoremove_wake_function+0x0/0x2d
Jun 9 15:37:59 mx kernel: [<0211deca>] autoremove_wake_function+0x0/0x2d
Jun 9 15:37:59 mx kernel: [<42860e59>] kjournald+0xc7/0x215 [jbd]
Jun 9 15:37:59 mx kernel: [<0211deca>] autoremove_wake_function+0x0/0x2d
Jun 9 15:37:59 mx kernel: [<0211deca>] autoremove_wake_function+0x0/0x2d
Jun 9 15:37:59 mx kernel: [<42860d8c>] commit_timeout+0x0/0x5 [jbd]
Jun 9 15:37:59 mx kernel: [<42860d92>] kjournald+0x0/0x215 [jbd]
Jun 9 15:37:59 mx kernel: [<021041f1>] kernel_thread_helper+0x5/0xb
Jun 9 15:37:59 mx kernel: Code: 00 00 e8 20 93 8f bf 8b 54 24 14 89 d8 e8 8a 0b 00 00 89 f0 e8 1f 9a a5 bf 83 7d 18 00 0f 84 1f 01 00 00 8b 45 18 8b 78 20 8b 1f <f0> ff 43 0c 8b 03 a8 04 74 5a 8b 4c 24 1c 8d 81 e4 00 00 00 e8
Jun 9 16:09:31 mx syslogd 1.4.1: restart.

Exactly the same error screen as John described in the above comment. RHEL ES4 kernel-smp-2.6.9-5.EL. HP DL380 G4 with 6i controller. 3 RAID-0 drives. Dual Xeon 3.2 HT. Busy mail server. Crashes about twice a week on average.

In my case turning journaling off for the /tmp directory was the workaround. I'm speculating the bug has something to do with how Postfix or Cyrus-imapd uses /tmp. Lots of churn, perhaps with small files as might happen in /tmp, might expose this bug.

We have also seen this problem with EL4-U2 and have a crash dump. Would it help if I uploaded the vmcore? Is there a patch available? It's pretty high priority for us.

The problem also occurred in Red Hat 4 Update 2 on ia64, during our stress testing.
It happened twice in a row on one server, each time in under 3 hours of running. It occurred on another
server after a 39-hour run.
kernel: 2.6.9-22.EL
The kernel panic trace shown on the console:
Assertion failure in journal_commit_transaction() at fs/jbd/commit.c:793: "jh->b_next_transaction == ((void *)0)"
kernel BUG at fs/jbd/commit.c:793!
kjournald[266]: bugcheck! 0 [1]
Modules linked in: md5 ipv6 parport_pc lp parport pidentd(U) autofs4 sunrpc ds
yenta_socket pcmcia_core deadman(U) vfat fat dm_multipath button ohci_hcd
ehci_hcd tulip tg3 bonding(U) sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
qla2300 qla2xxx scsi_transport_fc sym53c8xx scsi_transport_spi sd_mod scsi_mod
Pid: 266, CPU 0, comm: kjournald
psr : 0000101008026018 ifs : 8000000000000fa4 ip : [<a0000002000b0a30>]
Not tainted
ip is at journal_commit_transaction+0x110/0x3080 [jbd]
unat: 0000000000000000 pfs : 0000000000000fa4 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr : 0000000000009941
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f csd :
0000000000000000 ssd : 0000000000000000 b0 : a0000002000b0a30 b6 :
a0000001003418a0 b7 : a000000100256b40 f6 : 0fffbccccccccc8c00000 f7 :
0ffdc8dc0000000000000 f8 : 10001e000000000000000 f9 : 10002a000000000000000
f10 : 0fffeb33333332fa80000 f11 : 1003e0000000000000000 r1 : a00000010099d0e0
r2 : 0000000000038606 r3 : a00000010079d740 r8 : 0000000000000023 r9 :
a000000100732ac0 r10 : a000000100732ab8 r11 : a00000010079d328 r12 :
e000004040f47b10 r13 : e000004040f40000 r14 : 0000000000004000 r15 :
a00000010074e540 r16 : 0000000000000001 r17 : 0000000000000538 r18 :
a000000100650198 r19 : a000000100256b40 r20 : c000000084053000 r21 :
0000000000000005 r22 : a0000001007b3b30 r23 : a0000001007b3a40 r24 :
a0000001007b3a40 r25 : a000000100a3d7c8 r26 : 0000ad0e7654423e r27 :
0000001008026018 r28 : 0000000000000000 r29 : 00000000110000c0 r30 :
0000000000000000 r31 : a0000001007b00c0
Call Trace:
[<a000000100016a60>] show_stack+0x80/0xa0
sp=e000004040f47680 bsp=e000004040f410a8
[<a000000100017370>] show_regs+0x890/0x8c0
sp=e000004040f47850 bsp=e000004040f41060
[<a00000010003d7f0>] die+0x150/0x240
sp=e000004040f47870 bsp=e000004040f41020
[<a00000010003d920>] die_if_kernel+0x40/0x60
sp=e000004040f47870 bsp=e000004040f40fe8
[<a00000010003dac0>] ia64_bad_break+0x180/0x600
sp=e000004040f47870 bsp=e000004040f40fc0
[<a00000010000f480>] ia64_leave_kernel+0x0/0x260
sp=e000004040f47940 bsp=e000004040f40fc0
[<a0000002000b0a30>] journal_commit_transaction+0x110/0x3080 [jbd]
sp=e000004040f47b10 bsp=e000004040f40ea0
[<a0000002000b9830>] kjournald+0x170/0x580 [jbd]
sp=e000004040f47d80 bsp=e000004040f40e38
[<a000000100018930>] kernel_thread_helper+0x30/0x60
sp=e000004040f47e30 bsp=e000004040f40e10
[<a000000100008c60>] start_kernel_thread+0x20/0x40
sp=e000004040f47e30 bsp=e000004040f40e10
Kernel panic - not syncing: Fatal exception
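The traces in this report trip the same assertion at different line numbers (commit.c:790 on x86_64, commit.c:793 on ia64), which makes them awkward to compare by eye. The following is a minimal Python sketch, not part of the original report, for pulling the assertion signature out of pasted console text. The function name `assertion_signatures` and the regular expression are illustrative assumptions, based only on the banner format visible in the traces in this report.

```python
import re
from collections import Counter

# Assumed banner format, taken from the console output pasted in this report:
#   Assertion failure in <func>() at <file>:<line>: "<expression>"
ASSERT_RE = re.compile(
    r'Assertion failure in (?P<func>\w+)\(\) at (?P<file>[\w./-]+):(?P<line>\d+): '
    r'"(?P<expr>[^"]*)"'
)


def assertion_signatures(console_text):
    """Return (function, file, expression) for every assertion banner found."""
    # Undo hard wraps that split "->" across two lines ("jh-" / ">b_next...").
    flat = re.sub(r'-\s*\n\s*>', '->', console_text)
    # Collapse the remaining line breaks so a wrapped banner matches as one line.
    flat = re.sub(r'\s*\n\s*', ' ', flat)
    return [(m['func'], m['file'], m['expr']) for m in ASSERT_RE.finditer(flat)]


if __name__ == "__main__":
    x86_64 = ('Assertion failure in journal_commit_transaction() at '
              'fs/jbd/commit.c:790: "jh->b_next_transaction == ((void *)0)"')
    ia64 = ('Assertion failure in journal_commit_transaction() at '
            'fs/jbd/commit.c:793: "jh-\n>b_next_transaction == ((void *)0)"')
    # Count how many pasted traces share each signature.
    for sig, count in Counter(assertion_signatures(x86_64 + "\n" + ia64)).items():
        print(count, sig)
```

The line number is deliberately left out of the signature, since the same check sits on different lines in different kernel builds. Traces with no assertion banner, such as a NULL pointer oops, simply yield no signature and would need separate handling.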
Jan Kara, upstream, has tracked down what we believe to be the cause of this problem. I have reviewed the diagnosis and fix, and Andrew Morton just committed it to the -mm tree for further testing: http://marc.theaimsgroup.com/?l=linux-mm-commits&m=114557249220945&w=2

*** Bug 188620 has been marked as a duplicate of this bug. ***

Created attachment 128909: Proposed patch

I've done as much testing as I reasonably can, given my small systems. I didn't see any failures, but I didn't really expect to see any failures either. Could someone who has been seeing the failure, try a kernel which includes the patch and let me know how it goes?

(In reply to comment #16)
> I've done as much testing as I reasonably can, given my small systems.
> I didn't see any failures, but I didn't really expect to see any failures
> either. Could someone who has been seeing the failure, try a kernel
> which includes the patch and let me know how it goes?

We will try that out and let you know how it goes.

(In reply to comment #17)
> (In reply to comment #16)
> > I've done as much testing as I reasonably can, given my small systems.
> > I didn't see any failures, but I didn't really expect to see any failures
> > either. Could someone who has been seeing the failure, try a kernel
> > which includes the patch and let me know how it goes?
>
> We will try that out and let you know how it goes.

Problem still occurred for us. Here is the stack trace:

Red Hat Enterprise Linux AS release 4 (Nahant Update 3)
Kernel 2.6.9-bz158363.36.EL on an ia64
pltest32.cup.hp.com login: Assertion failure in journal_write_metadata_buffer() at fs/jbd/journal.c:308: "buffer_jbddirty(bh_in)"
kernel BUG at fs/jbd/journal.c:308!
kjournald[456]: bugcheck!
0 [1] Modules linked in: md5 ipv6 parport_pc lp parport pidentd(U) autofs4 sunrpc ds yenta_socket pcmcia_core deadman(U) vfat fat dm_multipath button ohci_hcd ehci_hcd tulip tg3 bonding(U) sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sym53c8xx scsi_transport_spi sd_mod scsi_mod Pid: 456, CPU 2, comm: kjournald psr : 0000101008126030 ifs : 8000000000000896 ip : [<a0000002000c2b90>] Not tainted ip is at journal_write_metadata_buffer+0xb0/0x900 [jbd] unat: 0000000000000000 pfs : 0000000000000896 rsc : 0000000000000003 rnat: 0000000000000534 bsps: a000000100669f58 pr : 0000000000009941 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f csd : 0000000000000000 ssd : 0000000000000000 b0 : a0000002000c2b90 b6 : a00000010034b240 b7 : a00000010025e840 f6 : 0fffbccccccccc8c00000 f7 : 0ffdaf300000000000000 f8 : 10000c000000000000000 f9 : 10002a000000000000000 f10 : 0fffd9999999996900000 f11 : 1003e0000000000000000 r1 : a0000001009bb6e0 r2 : 0000000000015358 r3 : a0000001007bbd40 r8 : 0000000000000024 r9 : a00000010074fbc0 r10 : a00000010074fbb8 r11 : a0000001007bb930 r12 : e00000407f6f7b10 r13 : e00000407f6f0000 r14 : 0000000000004000 r15 : a00000010076b8d0 r16 : 0000000000000001 r17 : 0000000000000534 r18 : a000000100669f58 r19 : a00000010025e840 r20 : c000000084053000 r21 : 0000000000000005 r22 : a0000001007d24c0 r23 : a0000001007d23d0 r24 : a0000001007d23d0 r25 : a000000100a5e2c8 r26 : 00001f9d598b87eb r27 : 0000001008126030 r28 : 0000000000000000 r29 : 00000000110000c0 r30 : 0000000000000000 r31 : a0000001007cea20 Call Trace: [<a000000100016d20>] show_stack+0x80/0xa0 sp=e00000407f6f7680 bsp=e00000407f6f1130 [<a000000100017630>] show_regs+0x890/0x8c0 sp=e00000407f6f7850 bsp=e00000407f6f10e8 [<a00000010003e870>] die+0x150/0x240 sp=e00000407f6f7870 bsp=e00000407f6f10a8 [<a00000010003e9a0>] die_if_kernel+0x40/0x60 sp=e00000407f6f7870 bsp=e00000407f6f1078 [<a00000010003eb40>] ia64_bad_break+0x180/0x600 sp=e00000407f6f7870 
bsp=e00000407f6f1050 [<a00000010000f580>] ia64_leave_kernel+0x0/0x260 sp=e00000407f6f7940 bsp=e00000407f6f1050 [<a0000002000c2b90>] journal_write_metadata_buffer+0xb0/0x900 [jbd] sp=e00000407f6f7b10 bsp=e00000407f6f0f98 [<a0000002000b5ad0>] journal_commit_transaction+0x1110/0x30c0 [jbd] sp=e00000407f6f7b10 bsp=e00000407f6f0ea0 [<a0000002000bd910>] kjournald+0x170/0x580 [jbd] sp=e00000407f6f7d80 bsp=e00000407f6f0e38 [<a000000100018bf0>] kernel_thread_helper+0x30/0x60 sp=e00000407f6f7e30 bsp=e00000407f6f0e10 [<a000000100008c60>] start_kernel_thread+0x20/0x40 sp=e00000407f6f7e30 bsp=e00000407f6f0e10
Kernel panic - not syncing: Fatal exception

[Not sure what happened - I got email asking for a comment from the original reporter - that's me - but I don't see the note in the bugzilla any longer. In any case, ....]

We have seen this problem exactly once, on a pre-U1 kernel. We have run all our usual benchmarks on all the different configurations we usually use and on many kernels up to and including the final U3: we have never seen the problem again. But from what John DeFranco reports, it's still a problem for them.

Created attachment 130424: Proposed patch

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this enhancement by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This enhancement is not yet committed for inclusion in an Update release.

Andrew Patterson saw a similar crash in fs/jbd/transaction.c on RHEL4U3 while running hazard. This is the same crash reported in comment #33 of bug 199667. I'm copying it here because it looks more like this one than 199667. We've been unable to reproduce this assertion failure, so we've only seen it once.

In comment #34 of bug 199667, Eric mentions a patch that might address this assertion failure crash. Most of the comments on this bug (158363) are hidden from me (I can't see anything between 23 and 33), but the patch Eric mentioned does not seem to be in U4. Can you confirm that? Is it in a post-U4 hotfix kernel?

From the comments that I can read, this looks like a fairly reproducible problem seen by several people, but I can't see whether it ever got resolved.

Assertion failure in do_get_write_access() at fs/jbd/transaction.c:608: "jh->b_next_transaction == ((void *)0)"
kernel BUG at fs/jbd/transaction.c:608!
diskfs[21164]: bugcheck! 0 [1]
Assertion failure in do_get_write_access() at fs/jbd/transaction.c:608: "jh->b_next_transaction == ((void *)0)"
kernel BUG at fs/jbd/transaction.c:608!
Modules linked in: md5 ipv6 parport_pc lp parport dev_acpi(U) autofs4 sunrpc ds yenta_socket pcmcia_core scsi_dump diskdump zlib_deflate mptctl(U) lpfcdfc(U) vfat fat dm_mod button ohci_hcd ehci_hcd shpchp tg3 e1000 s2io sg sr_mod ext3 jbd qla2400(U) qla2300(U) qla2xxx(U) qla2xxx_conf(U) lpfc(U) scsi_transport_fc cciss mptspi(U) mptscsih(U) mptbase(U) sd_mod scsi_mod Pid: 21164, CPU 10, comm: diskfs psr : 0000101008126010 ifs : 8000000000000794 ip : [<a0000002000f0b90>] Not tainted ip is at do_get_write_access+0xbb0/0x11c0 [jbd] unat: 0000000000000000 pfs : 0000000000000794 rsc : 0000000000000003 rnat: 0000000000000158 bsps: 0000073b2cea0f24 pr : 0000001805659959 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f csd : 0000000000000000 ssd : 0000000000000000 b0 : a0000002000f0b90 b6 : a00000010006ff00 b7 : a000000100259f40 f6 : 1003e0000000000001200 f7 : 1003e8080808080808081 f8 : 1003e00000000000023dc f9 : 1003e000000000e58b20e f10 : 1003e000000003571d99d f11 : 1003e44b831eee7285baf r1 : a0000001009add90 r2 : 0000000000000001 r3 : 0000000000100000 r8 : 0000000000000028 r9 : 0000000000000001 r10 : e00007000002510c r11 : 0000000000000003 r12 : e000070fe4537c50 r13 : e000070fe4530000 r14 : 0000000000004000 r15 : a000000100742ac0 r16 : a000000100742ac8 r17 : e000070ff405fde8 r18 : e000070ff405802c r19 : e000070000025100 r20 : e0000700000247c0 r21 : 0000000000000002 r22 : 0000000000000001 r23 : e000070ff4058040 r24 : e000070000025860 r25 : e000070000025858 r26 : e000070000025838 r27 : 0000000000000074 r28 : 0000000000000074 r29 : 00000000ffffffff r30 : e000070ff4058080 r31 : 0000000000000000 Call Trace: [<a000000100016b20>] show_stack+0x80/0xa0 sp=e000070fe45377c0 bsp=e000070fe4531628 [<a000000100017430>] show_regs+0x890/0x8c0 sp=e000070fe4537990 bsp=e000070fe45315d8 [<a00000010003dbf0>] die+0x150/0x240 sp=e000070fe45379b0 bsp=e000070fe4531598 [<a00000010003dd20>] die_if_kernel+0x40/0x60 sp=e000070fe45379b0 bsp=e000070fe4531568 [<a00000010003dec0>] 
ia64_bad_break+0x180/0x600 sp=e000070fe45379b0 bsp=e000070fe4531540 [<a00000010000f540>] ia64_leave_kernel+0x0/0x260 sp=e000070fe4537a80 bsp=e000070fe4531540 [<a0000002000f0b90>] do_get_write_access+0xbb0/0x11c0 [jbd] sp=e000070fe4537c50 bsp=e000070fe45314a0 [<a0000002000f1660>] journal_get_write_access+0x60/0xa0 [jbd] sp=e000070fe4537cb0 bsp=e000070fe4531460 [<a00000020018ca70>] add_dirent_to_buf+0x4f0/0x7c0 [ext3] sp=e000070fe4537cb0 bsp=e000070fe45313d8 [<a00000020018cee0>] ext3_add_entry+0x1a0/0x1240 [ext3] sp=e000070fe4537cc0 bsp=e000070fe45312c0 [<a00000020018e2b0>] ext3_add_nondir+0x30/0x100 [ext3] sp=e000070fe4537d90 bsp=e000070fe4531288 [<a00000020018e5b0>] ext3_create+0x230/0x240 [ext3] sp=e000070fe4537d90 bsp=e000070fe4531238 [<a000000100146000>] vfs_create+0x260/0x380 sp=e000070fe4537da0 bsp=e000070fe45311d8 [<a0000001001475a0>] open_namei+0xe20/0xf00 sp=e000070fe4537da0 bsp=e000070fe4531150 [<a00000010011e380>] filp_open+0x80/0x140 sp=e000070fe4537db0 bsp=e000070fe4531110 [<a00000010011e9d0>] sys_open+0xd0/0x1a0 sp=e000070fe4537e30 bsp=e000070fe4531090 [<a00000010000f3e0>] ia64_ret_from_syscall+0x0/0x20 sp=e000070fe4537e30 bsp=e000070fe4531090 [<a000000000010640>] 0xa000000000010640 sp=e000070fe4538000 bsp=e000070fe4531090

The proposed patch (not sure why it wasn't visible, making it so now) is currently planned for inclusion in RHEL4U5. I'll see if I can build a U4 kernel with this patch included, for testing. -Eric

I would be happy to test it out again if you build the kernel.

Seems to me that there are different failures that have been conflated into this bugzilla. So here is an attempt to clarify the situation, reading between the lines in some cases - so approach with caution. There may be more failures in bugzilla #188620 which comment #13 marked as a duplicate but I am not allowed to see that one - why not?

o Assert in journal_commit_transaction() at commit.c:793 "jh->b_next_transaction == 0"
    seen by (opening comment) Nick Dokos on RHEL4 U1 beta (2.6.9-6.37.EL) running AIM7 fserver on 12-way IA64 (Madison) box, 48 GB, 144 disks, with each 72 GB FC disk holding one ext3 filesystem. Repeatable: no
    seen by (comments #2 and #3) Rick Stern on RHEL4 U1 beta (2.6.9-6.46.ELsmp) running unspecified stress test on x86_64 - no other details on hardware configuration. Repeatable: yes
    seen by (comment #8) Debby Fu on RHEL4U2 (2.6.9-22.EL) running unspecified stress test on IA64 (Madison?) - no other details on hardware configuration. Repeatable: yes
o Null pointer at journal_commit_transaction+0x442
    seen by (comment #4) John Van Ostrand on Fedora Core 3 (2.6.9-1.724FC3smp and two more kernels) running ? on ? (perhaps x86_64) Repeatable: Yes(?)
    seen by (comment #5) Rich Williamson on RHEL ES4 (kernel-smp-2.6.9-5.EL) running "busy mail server" on HP DL380 G4, dual Xeon 3.2HT with 6i controller, 3 RAID-0 drives. Repeatable: Yes(?)
o Assert failure in journal_write_metadata_buffer() at journal.c:308 "buffer_jbddirty(bh_in)"
    seen by (comment #20) John DeFranco on RHEL4 U3 (2.6.9-bz158363.36.EL custom kernel - presumably incorporating the proposed fix from comment #15) running ? on IA64 (Madison?) - no other hardware details. Repeatable: ?
o Assert failure in do_get_write_access() at transaction.c:608 "jh->b_next_transaction == 0"
    seen by (comment #6) Andrew Patterson/Bjorn Helgaas on RHEL4U3 (2.6.9-?) running hazard on IA64 (Montecito) - no other hardware details. Repeatable: ?

From these four failures, I believe that the NULL pointer failures are misfiled here - they are probably manifestations of bugzilla #146037. The other three failures may or may not be related. Assuming that they are and that the patch from comment #15 resolves the first and the last (nb: these are big and unproven assumptions), that still leaves the failure in comment #20 (the third in the list above).
But there might be an explanation for this one: there is a patch that is *not* in U3 but is in U4 that addressed a number of different possible races, and the failure happened on U3+patch. I will create an attachment for this patch (it's called "Patch1049: linux-2.6.9-ext3-jbd-race.patch" in the RHEL4 U4 kernel source RPM). Could it be that if John DeFranco applies the patch from comment #15 to RHEL4 U4, his failure goes away? Same question applies to the first set of failures and the last one as well.

Also, if John Van Ostrand and Rich Williamson are running more recent versions of RHEL4 (U4 would be ideal because it includes the above mentioned patch for a race that could lead to null pointers), could they report back? It's possible of course that there are other races that nobody knows about yet. But this should go a long way towards clarifying what the current situation is.

Created attachment 135314: Patch 1049: linux-2.6.9-ext3-jbd-race.patch from RHEL4U4. See comment #40.

Nick, thanks for the recap. I agree that this bugzilla entry has gotten a bit confused with reports of various different problems. Just a note, the patch in Comment #15 is obsolete, replaced by the patch in Comment #26. (Jan Kara, the upstream patch author, revised his original patch, hence the obsoleting of the patch in Comment #15.)

committed in stream U5 build 42.5. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/

Also, RHEL4U4 ia64 RPMs with only the patch in Comment #26 added are available at http://people.redhat.com/esandeen/hp/bz158363/

Rick, John, or Debby, have you had a chance to test the patch? You seem to be the only ones with the recipe to reproduce the assertion failures.

Hi, I just downloaded the kernels and will try a test as soon as possible (in the next day or two).

We have only seen the do_get_write_access() failure once. This was in RHEL4 U3.
Since then we have mostly gone to RHEL4 U4 testing, so it will be difficult/impossible to see if this patch fixes this particular problem.

I have run a test with the ipf kernel provided. The test usually encounters the crash within 3-6 hours. In this case it ran to completion (48 hours). Fix looks ok.

*** Bug 201234 has been marked as a duplicate of this bug. ***

*** Bug 202197 has been marked as a duplicate of this bug. ***

Patch is in -50 and has been verified by two partners.

I have verified the presence of the patch in RHEL4.5snap4 (kernel version 2.6.9-51) and RHEL4.5rc (kernel version 2.6.9-54).

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0304.html

*** Bug 237815 has been marked as a duplicate of this bug. *** |