Bug 158363

Summary:

Assert panic in fs/jbd/commit.c:790:journal_commit_transaction()

Product:

Red Hat Enterprise Linux 4

Reporter:

Nick Dokos <nicholas.dokos>

Component:

kernel

Assignee:

Eric Sandeen <esandeen>

Status:

CLOSED ERRATA

QA Contact:

Brian Brock <bbrock>

Severity:

high

Priority:

high

Version:

4.0

CC:

adaora.onyia, andrew.patterson, bjorn.helgaas, debby.fu, defranco, i-kitayama, jbaker, jburke, john, josef.moellers, lori.carlson, mcoffey, peter.keilty, rick.stern, rwilliamson, sct, staubach, tao

Hardware:

All

OS:

Linux

Fixed In Version:

RHBA-2007-0304

Doc Type:

Bug Fix

Last Closed:

2007-05-01 22:54:04 UTC

Bug Depends On:

150568

Bug Blocks:

198694, 234547

Attachments:

Description	Flags
Proposed patch	none
Proposed patch	none
Patch 1049: linux-2.6.9-ext3-jbd-race.patch from RHEL4U4	none

Description Nick Dokos 2005-05-20 21:35:09 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050414 Firefox/1.0.3

Description of problem:
We are running AIM7 on a 16-cpu, 64Gb IA64 machine with 144 filesystems, each
on a separate 72Gb disk. We have been running in 4-way, 8-way, 12-way and 16-way
configurations. In one instance (the 12-way/48Gb configuration), the AIM7 fserver benchmark caused the following panic to occur:

 ....
 8416    14871.25   61         1.7670   3429.50  39051.42   Sat May 14 01:15:25 2005
10003    14886.18   72         1.4882   4072.11  46541.54   Sat May 14 02:25:49 2005
13474Assertion failure in journal_commit_transaction() at fs/jbd/commit.c:790: "jh->b_next_transaction == ((void *)0)"
kjournald[21279]: bugcheck! 0 [1]
Modules linked in: md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core vfat fat dm_mod button tg3 sg sr_mod ext3 jbd qla2300 qla2xxx scsi_transport_fc sym53c8xx scsi_transport_spi sd_mod scsi_mod

Pid: 21279, CPU 5, comm:            kjournald
psr : 0000101008126010 ifs : 8000000000000fa4 ip  : [<a0000002000b0a10>]    Not tainted
ip is at journal_commit_transaction+0x110/0x3060 [jbd]
unat: 0000000000000000 pfs : 0000000000000fa4 rsc : 0000000000000003
rnat: 0000000000000078 bsps: 00002abf32ad4ed7 pr  : 0000000000009941
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000002000b0a10 b6  : a00000010006b140 b7  : a0000001002398a0
f6  : 0fffbccccccccc8c00000 f7  : 0ffdc8dc0000000000000
f8  : 10001e000000000000000 f9  : 10002a000000000000000
f10 : 0fffeb33333332fa80000 f11 : 1003e0000000000000000
r1  : a00000010096af40 r2  : 000000000002b50e r3  : a00000010076b560
r8  : 0000000000000023 r9  : a0000001007028c0 r10 : a0000001007028b8
r11 : a00000010076b180 r12 : e00000001c067b10 r13 : e00000001c060000
r14 : 0000000000004000 r15 : a0000001009a9008 r16 : a00000010077ccd0
r17 : a00000010077ccd8 r18 : 000000000000b583 r19 : 000000000001ffff
r20 : 0000000000020000 r21 : 0000000000000034 r22 : 0000000000000034
r23 : a0000001009b458c r24 : 000000000000b584 r25 : 000000000002b584
r26 : 000000000000003e r27 : 0000001008126010 r28 : a0000001009b458d
r29 : 000000000000b585 r30 : 0000000000000000 r31 : a00000010077d8d0

Call Trace:
 [<a000000100016a40>] show_stack+0x80/0xa0
                                sp=e00000001c0676c0 bsp=e00000001c0610a8
 [<a000000100017350>] show_regs+0x890/0x8c0
                                sp=e00000001c067890 bsp=e00000001c061060
 [<a00000010003cb90>] die+0x150/0x240
                                sp=e00000001c0678b0 bsp=e00000001c061020
 [<a00000010003ccc0>] die_if_kernel+0x40/0x60
                                sp=e00000001c0678b0 bsp=e00000001c060fe8
 [<a00000010003d110>] ia64_bad_break+0x430/0x4c0
                                sp=e00000001c0678b0 bsp=e00000001c060fc0
 [<a00000010000f480>] ia64_leave_kernel+0x0/0x260
                                sp=e00000001c067940 bsp=e00000001c060fc0
 [<a0000002000b0a10>] journal_commit_transaction+0x110/0x3060 [jbd]
                                sp=e00000001c067b10 bsp=e00000001c060ea0
 [<a0000002000b97f0>] kjournald+0x170/0x560 [jbd]
                                sp=e00000001c067d80 bsp=e00000001c060e38
 [<a000000100018910>] kernel_thread_helper+0x30/0x60
                                sp=e00000001c067e30 bsp=e00000001c060e10
 [<a000000100008c60>] start_kernel_thread+0x20/0x40
                                sp=e00000001c067e30 bsp=e00000001c060e10
Kernel panic - not syncing: Fatal exception

The problem has not arisen in any other configuration. It also has not
happened in a subsequent run on the configuration above.

Version-Release number of selected component (if applicable):
2.6.9-6.37.EL

How reproducible:
Couldn't Reproduce

Steps to Reproduce:
1. Run aim7 fserver on a big machine (see above) and wait.
2. If it does not happen, lather, rinse, repeat.
3. Note that it is *not* easily reproducible.
  

Actual Results:  Panic (see stack trace above).

Expected Results:  No panic

Additional info:

Stephen Tweedie knows about this bug and asked for this bugzilla.
A similar problem was described in #146037. This kernel version
includes a fix for that problem.
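
For anyone collecting further console captures of this panic, the assertion banner can be pulled out mechanically. The following is a minimal sketch, not an official tool: the function name `parse_assertion` and the regular expression are assumptions made for illustration, based only on the banner format visible in the log above. It normalizes whitespace first, because the banner may wrap across lines or be glued to unrelated console output (as with the "13474" benchmark column above).

```python
import re

# Assumed banner format, taken from the console output quoted in this report:
#   Assertion failure in FUNC() at FILE:LINE: "EXPR"
ASSERT_RE = re.compile(
    r'Assertion failure in (?P<func>\w+)\(\) at '
    r'(?P<file>[\w/.]+):(?P<line>\d+): "(?P<expr>[^"]+)"'
)


def parse_assertion(console_text):
    """Return (func, file, line, expr) for the first banner found, or None."""
    # Collapse newlines and runs of spaces so a wrapped banner still matches.
    flat = " ".join(console_text.split())
    match = ASSERT_RE.search(flat)
    if match is None:
        return None
    return (
        match.group("func"),
        match.group("file"),
        int(match.group("line")),
        match.group("expr"),
    )
```

Applied to the panic above, this would yield the function journal_commit_transaction, the file fs/jbd/commit.c, and line 790. A banner that wraps in the middle of the quoted expression is only partly handled: the expression comes back with a stray space at the wrap point.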

Comment 1 Stephen Tweedie 2005-05-20 21:48:39 UTC

Changing arch to "all": I very much doubt that this is ia64-specific.

Comment 2 Rick Stern 2005-05-31 23:52:36 UTC

We have also seen this problem on x86_64.  We have a stress test that shows it
after 3-10 hours (which is a real pain since we're trying to get a 48hr run).

Question: when you say "This kernel version includes a fix for that problem",
which kernel is "this" and which problem is "that"?  Is "that" the 14604 bug?

We've been running a late beta of RH4 U1 (I believe the kernel is 2.6.9.-5.46).

I should add the trace later today or tomorrow.

Comment 3 Rick Stern 2005-06-01 00:24:00 UTC

Here's the panic info:

Kernel 2.6.9-6.46.ELsmp on an x86_64

pltest24.cup.hp.com login: process `cmquerycl' is using obsolete setsockopt
SO_BSDCOMPA
process `cmclconfd' is using obsolete setsockopt SO_BSDCOMPAT
process `cmviewcl' is using obsolete setsockopt SO_BSDCOMPAT
process `hdtd' is using obsolete setsockopt SO_BSDCOMPAT
Assertion failure in journal_commit_transaction() at fs/jbd/commit.c:790:
"jh->b_next_transaction == ((void *)0)"
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at commit:790
invalid operand: 0000 [1] SMP 
CPU 0 
Modules linked in: md5 ipv6 parport_pc lp parport pidentd(U) autofs4 i2c_dev
i2c_core sunrpc ds yenta_socket pcmcia_core deadman(U) button battery ac
uhci_hcd ehci_hcd hw_random tg3 e1000 bonding floppy dm_snapshot dm_zero
dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
Pid: 333, comm: kjournald Not tainted 2.6.9-6.46.ELsmp
RIP: 0010:[<ffffffffa008f646>]
<ffffffffa008f646>{:jbd:journal_commit_transaction+4006}
RSP: 0018:000001007f02dbb8  EFLAGS: 00010212
RAX: 0000000000000075 RBX: 0000000000000000 RCX: 0000000100000000
RDX: ffffffff803c8b08 RSI: 0000000000000246 RDI: ffffffff803c8b00
RBP: 0000010054821ce8 R08: ffffffff803c8b08 R09: 0000000000000000
R10: ffffffff801ea9ca R11: ffffffff801ea9ca R12: 000001004790f190
R13: 000001007b829ec0 R14: 000001007fdea600 R15: 000001004e282fa8
FS:  0000000000000000(0000) GS:ffffffff804c1700(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000558000 CR3: 0000000000101000 CR4: 00000000000006e0
Process kjournald (pid: 333, threadinfo 000001007f02c000, task 000001007fd18030)
Stack: 0000000000000000 00000f8c00000000 0000010060bca074 0000000000000007 
       000001004a972920 0000000000001c60 0000000000000000 000001007fd18030 
       ffffffff8013395e 000001007f02dc30 
Call Trace:<ffffffff8013395e>{autoremove_wake_function+0}
<ffffffff801c284e>{avc_node_replace+52} 
       <ffffffff8013395e>{autoremove_wake_function+0}
<ffffffff80131308>{move_tasks+186} 
       <ffffffff80130ead>{finish_task_switch+55}
<ffffffff802f830c>{thread_return+42} 
       <ffffffff8013dc9f>{del_timer+107} <ffffffffa0091898>{:jbd:kjournald+250} 
       <ffffffff8013395e>{autoremove_wake_function+0}
<ffffffff8013395e>{autoremove_wake_function+0} 
       <ffffffffa0091798>{:jbd:commit_timeout+0} <ffffffff80110c8f>{child_rip+8} 
       <ffffffffa009179e>{:jbd:kjournald+0} <ffffffff80110c87>{child_rip+0} 
       

Code: 0f 0b f0 47 09 a0 ff ff ff ff 16 03 4c 89 e7 e8 b0 d4 ff ff 
RIP <ffffffffa008f646>{:jbd:journal_commit_transaction+4006} RSP <000001007f02dbb8>
 <0>Kernel panic - not syncing: Oops

Comment 4 John Van Ostrand 2005-06-13 16:05:21 UTC

This bug is present in Fedora Core 3 in kernels 2.6.9-1.724FC3smp,
2.6.9-1.667smp, 2.6.10-1.741_FC3smp.

The 2.6.9-1.667smp message is:

Jun  9 15:37:59 mx kernel: Unable to handle kernel NULL pointer dereference
at virtual address 0000000c
Jun  9 15:37:59 mx kernel:  printing eip:
Jun  9 15:37:59 mx kernel: 4285e63f
Jun  9 15:37:59 mx kernel: *pde = 00004001
Jun  9 15:37:59 mx kernel: Oops: 0002 [#1]
Jun  9 15:37:59 mx kernel: SMP
Jun  9 15:37:59 mx kernel: Modules linked in: md5 ipv6 autofs4 i2c_dev
i2c_core sunrpc button battery ac ohci_hcd e1000 floppy sg dm_snapshot
dm_zero dm_mirror ext3 jbd dm_mod mptscsih mptbase sd_mod scsi_mod
Jun  9 15:37:59 mx kernel: CPU:    0
Jun  9 15:37:59 mx kernel: EIP:    0060:[<4285e63f>]    Not tainted VLI
Jun  9 15:37:59 mx kernel: EFLAGS: 00010202   (2.6.9-1.667smp)
Jun  9 15:37:59 mx kernel: EIP is at journal_commit_transaction+0x442/0xfb3
[jbd]
Jun  9 15:37:59 mx kernel: eax: 3c51a74c   ebx: 00000000   ecx: 419a4c80
edx: 4075086c
Jun  9 15:37:59 mx kernel: esi: 39ee2a00   edi: 3c51a74c   ebp: 41d0d980
esp: 40a7fdf0
Jun  9 15:37:59 mx kernel: ds: 007b   es: 007b   ss: 0068
Jun  9 15:37:59 mx kernel: Process kjournald (pid: 1253, threadinfo=40a7f000
task=40b6b770)
Jun  9 15:37:59 mx kernel: Stack: 00000000 00000000 00000000 00000000
00000000 00000000 33f972cc 39ee2a00
Jun  9 15:37:59 mx kernel:        3c51a80c 000002de 00000000 40b6b770
0211deca 40a7fe44 40a7fe44 00000100
Jun  9 15:37:59 mx kernel:        000000a0 0000c400 00000000 40b6b770
0211deca 40a7fe44 40a7fe44 00000000
Jun  9 15:37:59 mx kernel: Call Trace:
Jun  9 15:37:59 mx kernel:  [<0211deca>] autoremove_wake_function+0x0/0x2d
Jun  9 15:37:59 mx kernel:  [<0211deca>] autoremove_wake_function+0x0/0x2d
Jun  9 15:37:59 mx kernel:  [<42860e59>] kjournald+0xc7/0x215 [jbd]
Jun  9 15:37:59 mx kernel:  [<0211deca>] autoremove_wake_function+0x0/0x2d
Jun  9 15:37:59 mx kernel:  [<0211deca>] autoremove_wake_function+0x0/0x2d
Jun  9 15:37:59 mx kernel:  [<42860d8c>] commit_timeout+0x0/0x5 [jbd]
Jun  9 15:37:59 mx kernel:  [<42860d92>] kjournald+0x0/0x215 [jbd]
Jun  9 15:37:59 mx kernel:  [<021041f1>] kernel_thread_helper+0x5/0xb
Jun  9 15:37:59 mx kernel: Code: 00 00 e8 20 93 8f bf 8b 54 24 14 89 d8 e8
8a 0b 00 00 89 f0 e8 1f 9a a5 bf 83 7d 18 00 0f 84 1f 01 00 00 8b 45 18 8b
78 20 8b 1f <f0> ff 43 0c 8b 03 a8 04 74 5a 8b 4c 24 1c 8d 81 e4 00 00 00 e8
Jun  9 16:09:31 mx syslogd 1.4.1: restart.

Comment 5 Rich Williamson 2005-07-31 04:28:27 UTC

Exactly the same error screen as John described in the above comment.

RHEL ES4 kernel-smp-2.6.9-5.EL

HP DL380 G4 with 6i controller. 3 RAID-0 drives. Dual Xeon 3.2 HT.

Busy mail server. Crashes about twice a week on average.

Comment 6 John Van Ostrand 2005-07-31 14:29:13 UTC

In my case, turning journalling off for the /tmp directory was the workaround.
I'm speculating that the bug has something to do with how Postfix or
Cyrus-imapd uses /tmp. Lots of churn, perhaps with small files as typically
happens in /tmp, might expose this bug.

Comment 7 Josef Möllers 2005-10-27 08:54:58 UTC

We have also seen this problem with EL4-U2 and have a crash dump.
Would it help if I uploaded the vmcore?
Is there a patch available?
It's pretty high prio for us.

Comment 8 Debby Fu 2005-12-14 18:18:27 UTC

The problem also occurred in RedHat 4 update 2 ia64, during our stress testing.
It happened twice in a row on one server in runs of less than 3 hours, and
occurred on another server after a 39-hour run.

kernel: 2.6.9-22.EL
The kernel panic trace shown on the console:

Assertion failure in journal_commit_transaction() at fs/jbd/commit.c:793: "jh->b_next_transaction == ((void *)0)"
kernel BUG at fs/jbd/commit.c:793!
kjournald[266]: bugcheck! 0 [1]
Modules linked in: md5 ipv6 parport_pc lp parport pidentd(U) autofs4 sunrpc ds 
yenta_socket pcmcia_core deadman(U) vfat fat dm_multipath button ohci_hcd 
ehci_hcd tulip tg3 bonding(U) sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod 
qla2300 qla2xxx scsi_transport_fc sym53c8xx scsi_transport_spi sd_mod scsi_mod

Pid: 266, CPU 0, comm:            kjournald
psr : 0000101008026018 ifs : 8000000000000fa4 ip  : [<a0000002000b0a30>]    Not tainted
ip is at journal_commit_transaction+0x110/0x3080 [jbd]
unat: 0000000000000000 pfs : 0000000000000fa4 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 0000000000009941
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000002000b0a30 b6  : a0000001003418a0 b7  : a000000100256b40
f6  : 0fffbccccccccc8c00000 f7  : 0ffdc8dc0000000000000
f8  : 10001e000000000000000 f9  : 10002a000000000000000
f10 : 0fffeb33333332fa80000 f11 : 1003e0000000000000000
r1  : a00000010099d0e0 r2  : 0000000000038606 r3  : a00000010079d740
r8  : 0000000000000023 r9  : a000000100732ac0 r10 : a000000100732ab8
r11 : a00000010079d328 r12 : e000004040f47b10 r13 : e000004040f40000
r14 : 0000000000004000 r15 : a00000010074e540 r16 : 0000000000000001
r17 : 0000000000000538 r18 : a000000100650198 r19 : a000000100256b40
r20 : c000000084053000 r21 : 0000000000000005 r22 : a0000001007b3b30
r23 : a0000001007b3a40 r24 : a0000001007b3a40 r25 : a000000100a3d7c8
r26 : 0000ad0e7654423e r27 : 0000001008026018 r28 : 0000000000000000
r29 : 00000000110000c0 r30 : 0000000000000000 r31 : a0000001007b00c0

Call Trace:
 [<a000000100016a60>] show_stack+0x80/0xa0
                                sp=e000004040f47680 bsp=e000004040f410a8  
[<a000000100017370>] show_regs+0x890/0x8c0
                                sp=e000004040f47850 bsp=e000004040f41060  
[<a00000010003d7f0>] die+0x150/0x240
                                sp=e000004040f47870 bsp=e000004040f41020  
[<a00000010003d920>] die_if_kernel+0x40/0x60
                                sp=e000004040f47870 bsp=e000004040f40fe8  
[<a00000010003dac0>] ia64_bad_break+0x180/0x600
                                sp=e000004040f47870 bsp=e000004040f40fc0  
[<a00000010000f480>] ia64_leave_kernel+0x0/0x260
                                sp=e000004040f47940 bsp=e000004040f40fc0  
[<a0000002000b0a30>] journal_commit_transaction+0x110/0x3080 [jbd]
                                sp=e000004040f47b10 bsp=e000004040f40ea0  
[<a0000002000b9830>] kjournald+0x170/0x580 [jbd]
                                sp=e000004040f47d80 bsp=e000004040f40e38  
[<a000000100018930>] kernel_thread_helper+0x30/0x60
                                sp=e000004040f47e30 bsp=e000004040f40e10  
[<a000000100008c60>] start_kernel_thread+0x20/0x40
                                sp=e000004040f47e30 bsp=e000004040f40e10 
Kernel panic - not syncing: Fatal exception

Comment 11 Stephen Tweedie 2006-04-21 14:47:55 UTC

Jan Kara upstream has tracked down what we believe to be the cause of this
problem.  I have reviewed the diagnosis and fix, and Andrew Morton just
committed it to the -mm tree:

http://marc.theaimsgroup.com/?l=linux-mm-commits&m=114557249220945&w=2

for further testing.

Comment 13 Peter Staubach 2006-05-11 18:39:12 UTC

*** Bug 188620 has been marked as a duplicate of this bug. ***

Comment 15 Peter Staubach 2006-05-11 19:25:30 UTC

Created attachment 128909 [details]
Proposed patch

Comment 16 Peter Staubach 2006-05-18 19:45:35 UTC

I've done as much testing as I reasonably can, given my small systems.
I didn't see any failures, but I didn't really expect to see any failures
either.  Could someone who has been seeing the failure, try a kernel
which includes the patch and let me know how it goes?

Comment 17 John DeFranco 2006-05-18 20:27:04 UTC

(In reply to comment #16)
> I've done as much testing as I reasonably can, given my small systems.
> I didn't see any failures, but I didn't really expect to see any failures
> either.  Could someone who has been seeing the failure, try a kernel
> which includes the patch and let me know how it goes?

We will try that out and let you know how it goes.

Comment 20 John DeFranco 2006-05-22 22:55:52 UTC

(In reply to comment #17)
> (In reply to comment #16)
> > I've done as much testing as I reasonably can, given my small systems.
> > I didn't see any failures, but I didn't really expect to see any failures
> > either.  Could someone who has been seeing the failure, try a kernel
> > which includes the patch and let me know how it goes?
> 
> We will try that out and let you know how it goes.

The problem still occurred for us. Here is the stack trace:

Red Hat Enterprise Linux AS release 4 (Nahant Update 3)
Kernel 2.6.9-bz158363.36.EL on an ia64

pltest32.cup.hp.com login: Assertion failure in
journal_write_metadata_buffer() at fs/jbd/journal.c:308:
"buffer_jbddirty(bh_in)"
kernel BUG at fs/jbd/journal.c:308!
kjournald[456]: bugcheck! 0 [1]
Modules linked in: md5 ipv6 parport_pc lp parport pidentd(U) autofs4
sunrpc ds yenta_socket pcmcia_core deadman(U) vfat fat dm_multipath
button ohci_hcd ehci_hcd tulip tg3 bonding(U) sg dm_snapshot dm_zero
dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sym53c8xx
scsi_transport_spi sd_mod scsi_mod

Pid: 456, CPU 2, comm:            kjournald
psr : 0000101008126030 ifs : 8000000000000896 ip  : [<a0000002000c2b90>]
Not tainted
ip is at journal_write_metadata_buffer+0xb0/0x900 [jbd]
unat: 0000000000000000 pfs : 0000000000000896 rsc : 0000000000000003
rnat: 0000000000000534 bsps: a000000100669f58 pr  : 0000000000009941
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000002000c2b90 b6  : a00000010034b240 b7  : a00000010025e840
f6  : 0fffbccccccccc8c00000 f7  : 0ffdaf300000000000000
f8  : 10000c000000000000000 f9  : 10002a000000000000000
f10 : 0fffd9999999996900000 f11 : 1003e0000000000000000
r1  : a0000001009bb6e0 r2  : 0000000000015358 r3  : a0000001007bbd40
r8  : 0000000000000024 r9  : a00000010074fbc0 r10 : a00000010074fbb8
r11 : a0000001007bb930 r12 : e00000407f6f7b10 r13 : e00000407f6f0000
r14 : 0000000000004000 r15 : a00000010076b8d0 r16 : 0000000000000001
r17 : 0000000000000534 r18 : a000000100669f58 r19 : a00000010025e840
r20 : c000000084053000 r21 : 0000000000000005 r22 : a0000001007d24c0
r23 : a0000001007d23d0 r24 : a0000001007d23d0 r25 : a000000100a5e2c8
r26 : 00001f9d598b87eb r27 : 0000001008126030 r28 : 0000000000000000
r29 : 00000000110000c0 r30 : 0000000000000000 r31 : a0000001007cea20

Call Trace:
 [<a000000100016d20>] show_stack+0x80/0xa0
                                sp=e00000407f6f7680 bsp=e00000407f6f1130

 [<a000000100017630>] show_regs+0x890/0x8c0
                                sp=e00000407f6f7850 bsp=e00000407f6f10e8
 [<a00000010003e870>] die+0x150/0x240
                                sp=e00000407f6f7870 bsp=e00000407f6f10a8
 [<a00000010003e9a0>] die_if_kernel+0x40/0x60
                                sp=e00000407f6f7870 bsp=e00000407f6f1078
 [<a00000010003eb40>] ia64_bad_break+0x180/0x600
                                sp=e00000407f6f7870 bsp=e00000407f6f1050
 [<a00000010000f580>] ia64_leave_kernel+0x0/0x260
                                sp=e00000407f6f7940 bsp=e00000407f6f1050
 [<a0000002000c2b90>] journal_write_metadata_buffer+0xb0/0x900 [jbd]
                                sp=e00000407f6f7b10 bsp=e00000407f6f0f98
 [<a0000002000b5ad0>] journal_commit_transaction+0x1110/0x30c0 [jbd]
                                sp=e00000407f6f7b10 bsp=e00000407f6f0ea0
 [<a0000002000bd910>] kjournald+0x170/0x580 [jbd]
                                sp=e00000407f6f7d80 bsp=e00000407f6f0e38
 [<a000000100018bf0>] kernel_thread_helper+0x30/0x60
                                sp=e00000407f6f7e30 bsp=e00000407f6f0e10
 [<a000000100008c60>] start_kernel_thread+0x20/0x40
                                sp=e00000407f6f7e30 bsp=e00000407f6f0e10
Kernel panic - not syncing: Fatal exception

Comment 23 Nick Dokos 2006-05-24 15:22:00 UTC

[Not sure what happened - I got email asking for a comment from
the original reporter - that's me - but I don't see the note in the
bugzilla any longer. In any case, ....]

We have seen this problem exactly once, on a pre-U1 kernel.
We have run all our usual benchmarks on all the different configurations
we usually use and on many kernels up to and including the final U3:
we have never seen the problem again. But from what John DeFranco reports,
it's still a problem for them.

Comment 26 Peter Staubach 2006-06-02 21:00:41 UTC

Created attachment 130424 [details]
Proposed patch

Comment 33 RHEL Program Management 2006-08-16 21:04:48 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this enhancement by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This enhancement is not yet committed for inclusion in an Update
release.

Comment 36 Bjorn Helgaas 2006-08-30 16:43:28 UTC

Andrew Patterson saw a similar crash in fs/jbd/transaction.c on RHEL4U3
while running hazard.  This is the same crash reported in comment #33 of
bug 199667.  I'm copying it here because it looks more like this one than
199667.  We've been unable to reproduce this assertion failure, so we've
only seen it once.

In comment #34 of bug 199667, Eric mentions a patch that might address
this assertion failure crash.  Most of the comments on this bug (158363)
are hidden from me (I can't see anything between 23 and 33), but the patch
Eric mentioned does not seem to be in U4.

Can you confirm that?  Is it in a post-U4 hotfix kernel?

From the comments that I can read, this looks like a fairly reproducible
problem seen by several people, but I can't see whether it ever got
resolved.

Assertion failure in do_get_write_access() at fs/jbd/transaction.c:608: 
"jh->b_next_transaction == ((void *)0)"
kernel BUG at fs/jbd/transaction.c:608!
diskfs[21164]: bugcheck! 0 [1]
Assertion failure in do_get_write_access() at fs/jbd/transaction.c:608: 
"jh->b_next_transaction == ((void *)0)"
kernel BUG at fs/jbd/transaction.c:608!
Modules linked in: md5 ipv6 parport_pc lp parport dev_acpi(U) autofs4 
sunrpc ds yenta_socket pcmcia_core scsi_dump diskdump zlib_deflate mptctl(U) 
lpfcdfc(U) vfat fat dm_mod button ohci_hcd ehci_hcd shpchp tg3 e1000 
s2io sg sr_mod ext3 jbd qla2400(U) qla2300(U) qla2xxx(U) qla2xxx_conf(U) 
lpfc(U) scsi_transport_fc cciss mptspi(U) mptscsih(U) mptbase(U) sd_mod 
scsi_mod

Pid: 21164, CPU 10, comm:               diskfs
psr : 0000101008126010 ifs : 8000000000000794 ip  : [<a0000002000f0b90>]    
Not tainted
ip is at do_get_write_access+0xbb0/0x11c0 [jbd]
unat: 0000000000000000 pfs : 0000000000000794 rsc : 0000000000000003
rnat: 0000000000000158 bsps: 0000073b2cea0f24 pr  : 0000001805659959
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000002000f0b90 b6  : a00000010006ff00 b7  : a000000100259f40
f6  : 1003e0000000000001200 f7  : 1003e8080808080808081
f8  : 1003e00000000000023dc f9  : 1003e000000000e58b20e
f10 : 1003e000000003571d99d f11 : 1003e44b831eee7285baf
r1  : a0000001009add90 r2  : 0000000000000001 r3  : 0000000000100000
r8  : 0000000000000028 r9  : 0000000000000001 r10 : e00007000002510c
r11 : 0000000000000003 r12 : e000070fe4537c50 r13 : e000070fe4530000
r14 : 0000000000004000 r15 : a000000100742ac0 r16 : a000000100742ac8
r17 : e000070ff405fde8 r18 : e000070ff405802c r19 : e000070000025100
r20 : e0000700000247c0 r21 : 0000000000000002 r22 : 0000000000000001
r23 : e000070ff4058040 r24 : e000070000025860 r25 : e000070000025858
r26 : e000070000025838 r27 : 0000000000000074 r28 : 0000000000000074
r29 : 00000000ffffffff r30 : e000070ff4058080 r31 : 0000000000000000

Call Trace:
 [<a000000100016b20>] show_stack+0x80/0xa0
                                sp=e000070fe45377c0 bsp=e000070fe4531628
 [<a000000100017430>] show_regs+0x890/0x8c0
                                sp=e000070fe4537990 bsp=e000070fe45315d8
 [<a00000010003dbf0>] die+0x150/0x240
                                sp=e000070fe45379b0 bsp=e000070fe4531598
 [<a00000010003dd20>] die_if_kernel+0x40/0x60
                                sp=e000070fe45379b0 bsp=e000070fe4531568
 [<a00000010003dec0>] ia64_bad_break+0x180/0x600
                                sp=e000070fe45379b0 bsp=e000070fe4531540
 [<a00000010000f540>] ia64_leave_kernel+0x0/0x260
                                sp=e000070fe4537a80 bsp=e000070fe4531540
 [<a0000002000f0b90>] do_get_write_access+0xbb0/0x11c0 [jbd]
                                sp=e000070fe4537c50 bsp=e000070fe45314a0
 [<a0000002000f1660>] journal_get_write_access+0x60/0xa0 [jbd]
                                sp=e000070fe4537cb0 bsp=e000070fe4531460
 [<a00000020018ca70>] add_dirent_to_buf+0x4f0/0x7c0 [ext3]
                                sp=e000070fe4537cb0 bsp=e000070fe45313d8
 [<a00000020018cee0>] ext3_add_entry+0x1a0/0x1240 [ext3]
                                sp=e000070fe4537cc0 bsp=e000070fe45312c0
 [<a00000020018e2b0>] ext3_add_nondir+0x30/0x100 [ext3]
                                sp=e000070fe4537d90 bsp=e000070fe4531288
 [<a00000020018e5b0>] ext3_create+0x230/0x240 [ext3]
                                sp=e000070fe4537d90 bsp=e000070fe4531238
 [<a000000100146000>] vfs_create+0x260/0x380
                                sp=e000070fe4537da0 bsp=e000070fe45311d8
 [<a0000001001475a0>] open_namei+0xe20/0xf00
                                sp=e000070fe4537da0 bsp=e000070fe4531150
 [<a00000010011e380>] filp_open+0x80/0x140
                                sp=e000070fe4537db0 bsp=e000070fe4531110
 [<a00000010011e9d0>] sys_open+0xd0/0x1a0
                                sp=e000070fe4537e30 bsp=e000070fe4531090
 [<a00000010000f3e0>] ia64_ret_from_syscall+0x0/0x20
                                sp=e000070fe4537e30 bsp=e000070fe4531090
 [<a000000000010640>] 0xa000000000010640
                                sp=e000070fe4538000 bsp=e000070fe4531090

Comment 37 Eric Sandeen 2006-08-30 21:30:18 UTC

The proposed patch (not sure why it wasn't visible, making it so now) is
currently planned for inclusion in RHEL4U5.  I'll see if I can build a U4 kernel
with this patch included, for testing.
-Eric

Comment 38 John DeFranco 2006-08-30 22:57:52 UTC

I would be happy to test it out again if you build the kernel.

Comment 40 Nick Dokos 2006-08-31 18:06:56 UTC

Seems to me that there are different failures that have been conflated
into this bugzilla. So here is an attempt to clarify the situation,
reading between the lines in some cases - so approach with caution.
There may be more failures in bugzilla #188620 which comment #13
marked as a duplicate but I am not allowed to see that one - why not?

o Assert in journal_commit_transaction()
  at commit.c:790 (commit.c:793 on 2.6.9-22.EL)
  "jh->b_next_transaction == 0"

  seen by (opening comment)
       Nick Dokos on RHEL4 U1 beta (2.6.9-6.37.EL)
            running AIM7 fserver on 12-way IA64 (Madison) box,
            48Gb, 144 disks, with each 72Gb FC disk holding one
            ext3 filesystem.

            Repeatable: no

  seen by (comments #2 and #3) 
       Rick Stern on RHEL4 U1 beta (2.6.9-6.46.ELsmp)
            running unspecified stress test on x86_64 -
            no other details on hardware configuration.

            Repeatable: yes

  seen by (comment #8)
       Debby Fu on RHEL4U2 (2.6.9-22.EL)
             running unspecified stress test on IA64 (Madison?) -
             no other details on hardware configuration.

             Repeatable: yes


o Null pointer at journal_commit_transaction+0x442

  seen by (comment #4)
       John Van Ostrand on Fedora Core 3 (2.6.9-1.724FC3smp
                                      and two more kernels)
            running ? on ? (the oops in comment #4 shows 32-bit x86 registers)

            Repeatable: Yes(?)

  seen by (comment #5)
       Rich Williamson on RHEL ES4 (kernel-smp-2.6.9-5.EL)
            running "busy mail server" on HP DL380 G4,
            dual Xeon 3.2HT with 6i controller, 3 RAID-0 drives.

            Repeatable: Yes(?)

o Assert failure in journal_write_metadata_buffer()
  at journal.c:308 "buffer_jbddirty(bh_in)"

  seen by (comment #20)
       John DeFranco on RHEL4 U3 (2.6.9-bz158363.36.EL custom kernel -
                                  presumably incorporating the proposed
                                  fix from comment #15)
             running ? on IA64 (Madison?) - no other hardware details.

             Repeatable: ?

o Assert failure in do_get_write_access()
  at transaction.c:608 "jh->b_next_transaction == 0"

  seen by (comment #36)
       Andrew Patterson/Bjorn Helgaas on RHEL4U3 (2.6.9-?)
              running hazard on IA64 (Montecito) -
              no other hardware details.

              Repeatable: ?


Of these four failures, I believe that the NULL pointer failures are
misfiled here - they are probably manifestations of bugzilla #146037.

The other three failures may or may not be related. Assuming that they
are and that the patch from comment #15 resolves the first and the
last (nb: these are big and unproven assumptions), that still leaves
the failure in comment #20 (the third in the list above). But there
might be an explanation for this one: there is a patch that is *not*
in U3 but is in U4 that addressed a number of different possible races,
and the failure happened on U3+patch. I will create an attachment for this
patch (it's called "Patch1049: linux-2.6.9-ext3-jbd-race.patch" in the
RHEL4 U4 kernel source RPM).

Could it be that if John DeFranco applies the patch from comment #15
to RHEL4 U4, his failure goes away? Same question applies to the first set of
failures and the last one as well.

Also, if John Van Ostrand and Rich Williamson are running more recent
versions of RHEL4 (U4 would be ideal because it includes the above
mentioned patch for a race that could lead to null pointers),
could they report back?

It's possible of course that there are other races that nobody
knows about yet. But this should go a long way towards clarifying
what the current situation is.
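
To keep future reports from being conflated again, the four groups above can be told apart by the text of the console capture alone. The following is a rough sketch under stated assumptions: the group labels and the function name `classify_failure` are hypothetical, invented here for illustration, and the matching strings are simply the banners quoted in the comments referenced above.

```python
# (substring that must appear, label) pairs, checked in order.
# The labels are hypothetical names for the groups in the list above.
ASSERT_GROUPS = [
    ("Assertion failure in journal_commit_transaction()",
     "assert/journal_commit_transaction"),
    ("Assertion failure in journal_write_metadata_buffer()",
     "assert/journal_write_metadata_buffer"),
    ("Assertion failure in do_get_write_access()",
     "assert/do_get_write_access"),
]


def classify_failure(console_text):
    """Map a console capture to one of the four groups, or 'unclassified'."""
    # Undo console line wrapping so a banner split across lines still matches.
    flat = " ".join(console_text.split())
    for needle, label in ASSERT_GROUPS:
        if needle in flat:
            return label
    # The second group is an oops, not an assertion, so it has no banner.
    if "NULL pointer dereference" in flat and "journal_commit_transaction" in flat:
        return "nullptr/journal_commit_transaction"
    return "unclassified"
```

The assertion banners are checked before the NULL pointer case because the trace in comment #20 also mentions journal_commit_transaction in its call chain, although the failing assertion is in journal_write_metadata_buffer().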

Comment 41 Nick Dokos 2006-08-31 18:11:08 UTC

Created attachment 135314 [details]
Patch 1049: linux-2.6.9-ext3-jbd-race.patch from RHEL4U4

See comment#40.

Comment 42 Eric Sandeen 2006-08-31 19:32:22 UTC

Nick, thanks for the recap.  I agree that this bugzilla entry has gotten a bit
confused with reports of various different problems.

Just a note, the patch in Comment #15 is obsolete, replaced by the patch in
Comment #26 (Jan Kara, the upstream patch author, revised his original patch,
hence the obsoleting of the patch in Comment #15.)

Comment 44 Jason Baron 2006-09-01 12:58:21 UTC

committed in stream U5 build 42.5. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/

Comment 45 Eric Sandeen 2006-09-01 13:50:10 UTC

Also, RHEL4U4 ia64 RPMs with only the patch in Comment #26 added are available
at http://people.redhat.com/esandeen/hp/bz158363/

Comment 47 Bjorn Helgaas 2006-09-12 15:56:21 UTC

Rick, John, or Debby, have you had a chance to test the patch?  You
seem to be the only ones with the recipe to reproduce the assertion
failures.

Comment 48 John DeFranco 2006-09-12 21:59:31 UTC

Hi, I just downloaded the kernels and will try a test as soon as possible (in
the next day or two).

Comment 50 Andrew Patterson 2006-10-02 20:35:40 UTC

We have only seen the do_get_write_access() failure once.  This was in RHEL4 U3.
 Since then we have mostly gone to RHEL4 U4 testing, so it will be
difficult/impossible to see if this patch fixes this particular problem.

Comment 51 John DeFranco 2006-10-05 23:21:21 UTC

I have run a test with the ipf kernel provided. The test usually encounters the
crash within 3-6 hours. In this case it ran to completion (48 hours). The fix
looks OK.

Comment 52 Linda Wang 2006-10-20 09:53:47 UTC

*** Bug 201234 has been marked as a duplicate of this bug. ***

Comment 54 Eric Sandeen 2006-11-06 18:45:03 UTC

*** Bug 202197 has been marked as a duplicate of this bug. ***

Comment 59 Mike Gahagan 2007-03-19 19:47:05 UTC

Patch is in -50 and has been verified by two partners.

Comment 61 Nick Dokos 2007-04-24 20:26:00 UTC

I have verified the presence of the patch in RHEL4.5snap4 (kernel version
2.6.9-51) and RHEL4.5rc (kernel version 2.6.9-54).
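
For a quick first check of whether a given RHEL4 kernel is likely to carry the patch, the build number in the release string can be compared against the builds named in this bug. This is a sketch only, and it rests on assumptions: the threshold of 50 comes from comment #59 ("Patch is in -50"), the idea that every later 2.6.9-N build keeps the patch is assumed rather than confirmed here, and the function names are hypothetical. Comment #44 says the patch was committed in U5 build 42.5, so builds between 42.5 and 50 may also contain it; the check below is deliberately conservative. It is not meaningful for Fedora kernels, and custom test kernels return None.

```python
import re

# Assumption: first build this bug explicitly reports as containing the patch
# (comment #59). Later builds are assumed, not confirmed, to keep it.
FIRST_VERIFIED_BUILD = 50


def rhel4_build_number(release):
    """Return N from a '2.6.9-N...' release string, or None if it does not parse."""
    match = re.match(r"2\.6\.9-(\d+)", release)
    return int(match.group(1)) if match else None


def likely_has_fix(release):
    """True/False for a parseable RHEL4 release string, None when unknown."""
    build = rhel4_build_number(release)
    if build is None:
        return None  # e.g. a custom test kernel such as 2.6.9-bz158363.36.EL
    return build >= FIRST_VERIFIED_BUILD
```

A result of True is only a hint; checking the kernel source RPM for the patch, as done above, remains the reliable test.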

Comment 65 Red Hat Bugzilla 2007-05-01 22:54:04 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html

Comment 67 Eric Sandeen 2007-05-31 18:14:38 UTC

*** Bug 237815 has been marked as a duplicate of this bug. ***