Bug 381221

Summary: Assertion failure in journal_start() at fs/jbd/transaction.c:274: 'handle->h_transaction->t_journal == journal'
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.5
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Issue Tracker <tao>
Assignee: Josef Bacik <jbacik>
QA Contact: Martin Jenner <mjenner>
CC: esandeen, jbaron, tao, wmealing
Fixed In Version: RHSA-2008-0665
Doc Type: Bug Fix
Last Closed: 2008-07-24 19:20:44 UTC
Bug Blocks: 439194
Attachments: patch to fix the problem. (flags: none)

Description Issue Tracker 2007-11-13 22:16:20 UTC
Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2007-11-13 22:16:21 UTC
We have seen eight nodes crash on the above assertion failure over two days. These are all relatively recent reinstalls with SLC4, only 1GB of memory (so we assume memory pressure is tight), and ext3 with quotas on 2 file systems.

While we try to find out more about what actually happened and to reproduce this on "vanilla" RHEL4, could you check whether you have seen something similar (and, even better, whether you have a fix)?

According to Google, both CentOS 4 and various upstream kernels have been hit by this in the past. At least one message indicates that ext3 could be doing something wrong under memory pressure; that report was against 2.6.15.1:
http://lkml.org/lkml/2006/2/1/328

Our tracebacks all look like this:

Nov  5 09:50:10 Assertion failure in journal_start() at fs/jbd/transaction.c:274: "handle->h_transaction->t_journal == journal"
Nov  5 09:50:10 ------------[ cut here ]------------
Nov  5 09:50:10 kernel BUG at fs/jbd/transaction.c:274!
Nov  5 09:50:10 invalid operand: 0000 [#1]
Nov  5 09:50:10 SMP
Nov  5 09:50:10 Modules linked in: e7xxx_edac edac_mc libafs(U) autofs4 i2c_dev i2c_core sunrpc md5 ipv6 dm_mirror dm_mod button battery ac uhci_hcd hw_random e100 mii ext3 jbd ata_piix libata sd_mod scsi_mod
Nov  5 09:50:10 CPU:    1
Nov  5 09:50:10 EIP:    0060:[<f882f449>]    Tainted: PF     VLI
Nov  5 09:50:10 EFLAGS: 00010216   (2.6.9-55.0.6.EL.cernsmp)
Nov  5 09:50:10 EIP is at journal_start+0x45/0x9e [jbd]
Nov  5 09:50:10 eax: 00000073   ebx: cf532294   ecx: f1bcbacc   edx: f88361ca
Nov  5 09:50:10 esi: f7f55c00   edi: f1bcb000   ebp: c0330f18   esp: f1bcbac8
Nov  5 09:50:10 ds: 007b   es: 007b   ss: 0068
Nov  5 09:50:10 Process fsprobe (pid: 8989, threadinfo=f1bcb000 task=f70103b0)
Nov  5 09:50:10 Stack: f88361ca f8835d62 f88361b5 00000112 f8836224 f6bcbbc0 f1bcbb1c 0000006d
Nov  5 09:50:10        f886bd89 f6bcbbc0 f1bcbb1c c0171bf0 f6bcbbc0 c0171c85 f3760de8 f3760df0
Nov  5 09:50:10        00000000 c017200a 00000080 00000080 00000080 c1b4fdf0 d1afb9a0 00000000
Nov  5 09:50:10 Call Trace:
Nov  5 09:50:10  [<f886bd89>] ext3_dquot_drop+0x14/0x3b [ext3]
Nov  5 09:50:10  [<c0171bf0>] clear_inode+0xb4/0x102
Nov  5 09:50:10  [<c0171c85>] dispose_list+0x47/0x6d
Nov  5 09:50:10  [<c017200a>] prune_icache+0x193/0x1ec
Nov  5 09:50:10  [<c0172077>] shrink_icache_memory+0x14/0x2b
Nov  5 09:50:10  [<c0149dac>] shrink_slab+0xf8/0x161
Nov  5 09:50:10  [<c014ae19>] try_to_free_pages+0xd5/0x1bb
Nov  5 09:50:10  [<c0144338>] __alloc_pages+0x1bc/0x2a6
Nov  5 09:50:10  [<c015491f>] read_swap_cache_async+0x56/0xa7
Nov  5 09:50:10  [<c014e3bf>] swapin_readahead+0x3b/0x57
Nov  5 09:50:10  [<c014e451>] do_swap_page+0x76/0x2ea
Nov  5 09:50:10  [<c014ed89>] handle_mm_fault+0x116/0x193
Nov  5 09:50:10  [<c014d7f0>] get_user_pages+0x235/0x368
Nov  5 09:50:10  [<c0179c11>] dio_refill_pages+0x7d/0x112
Nov  5 09:50:10  [<c0179cbe>] dio_get_page+0x18/0x4a
Nov  5 09:50:10  [<c017a608>] do_direct_IO+0x5b/0x306
Nov  5 09:50:10  [<c017ab12>] direct_io_worker+0x25f/0x4ee
Nov  5 09:50:10  [<c017b17a>] __blockdev_direct_IO+0x3d9/0x422
Nov  5 09:50:10  [<f8863970>] ext3_direct_io_get_blocks+0x0/0xaa [ext3]
Nov  5 09:50:10  [<f8864616>] ext3_direct_IO+0xef/0x1a5 [ext3]
Nov  5 09:50:10  [<f8863970>] ext3_direct_io_get_blocks+0x0/0xaa [ext3]
Nov  5 09:50:10  [<c0142e5c>] generic_file_direct_IO+0x3c/0x5c
Nov  5 09:50:10  [<c0142027>] generic_file_direct_write+0x51/0x122
Nov  5 09:50:10  [<c0126934>] current_fs_time+0x44/0x4c
Nov  5 09:50:10  [<c0142935>] __generic_file_aio_write_nolock+0x33c/0x3b7
Nov  5 09:50:10  [<c01429e9>] generic_file_aio_write_nolock+0x39/0x7f
Nov  5 09:50:10  [<c0142bd3>] generic_file_aio_write+0x72/0xc6
Nov  5 09:50:10  [<f8861d9e>] ext3_file_write+0x19/0x8b [ext3]
Nov  5 09:50:10  [<c015b95c>] do_sync_write+0x9e/0xcb
Nov  5 09:50:10  [<c02d4732>] schedule+0x84e/0x8ec
Nov  5 09:50:10  [<c01ae2f6>] selinux_file_permission+0x117/0x120
Nov  5 09:50:10  [<c012052d>] autoremove_wake_function+0x0/0x2d
Nov  5 09:50:10  [<c015ba3f>] vfs_write+0xb6/0xe2
Nov  5 09:50:11  [<c015bb09>] sys_write+0x3c/0x62
Nov  5 09:50:11  [<c02d68bf>] syscall_call+0x7/0xb
Nov  5 09:50:11  [<c02d007b>] unix_accept+0x5e/0xd6
Nov  5 09:50:11 Code: ff 74 7d 85 db 74 34 8b 03 39 30 74 29 68 24 62 83 f8 68 12 01 00 00 68 b5 61 83 f8 68 62 5d 83 f8 68 ca 61 83 f8 e8 a9 34 8f c7 <0f> 0b 12 01 b5 61 83 f8 83 c4 14 ff 43 08 eb 43 89 d0 e8 64 ff
Nov  5 09:50:11  <0>Fatal exception: panic in 5 seconds
Nov  5 09:50:16 Kernel panic - not syncing: Fatal exception
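
The assertion in the trace above fires when journal_start() is called while the task already holds a journal handle that belongs to a different journal. One plausible reading of the call trace is that the ext3 direct I/O write path took a page fault, memory reclaim then pruned the inode cache, and ext3_dquot_drop() called journal_start() for an inode on another quota-enabled filesystem. The following is a minimal userspace sketch of that check, not the kernel source: the `model_*` types, `current_handle` (standing in for the per-task handle pointer), and the `start_result` codes are all hypothetical names introduced for illustration, and the model returns an error code where the kernel would hit the J_ASSERT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's journal, transaction and handle. */
struct model_journal { const char *name; };
struct model_transaction { struct model_journal *journal; };
struct model_handle { struct model_transaction *transaction; int ref; };

/* Stands in for the handle the running task currently holds (assumption). */
static struct model_handle *current_handle = NULL;

enum start_result { START_NEW, START_NESTED, START_WRONG_JOURNAL };

/*
 * Model of the nesting rule: a task that already holds a handle may only
 * "start" again on the same journal, in which case the handle is reused.
 */
static enum start_result model_journal_start(struct model_journal *journal,
                                             struct model_transaction *txn,
                                             struct model_handle *fresh)
{
    struct model_handle *h = current_handle;

    if (h != NULL) {
        /* This is the condition the kernel asserts; here we report it. */
        if (h->transaction->journal != journal)
            return START_WRONG_JOURNAL;
        h->ref++;               /* nested start on the same journal */
        return START_NESTED;
    }

    /* No handle held yet: set up a fresh one on this journal. */
    txn->journal = journal;
    fresh->transaction = txn;
    fresh->ref = 1;
    current_handle = fresh;
    return START_NEW;
}

/* Drop one reference; the task holds no handle once the count reaches 0. */
static void model_journal_stop(void)
{
    assert(current_handle != NULL && current_handle->ref > 0);
    if (--current_handle->ref == 0)
        current_handle = NULL;
}
```

In this model, starting on filesystem A and then, before stopping, starting on filesystem B yields START_WRONG_JOURNAL, which corresponds to the crash reported here; a nested start on the same filesystem only bumps the reference count.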


CentOS: 
http://bugs.centos.org/view.php?id=1167 (2.6.9-22.0.1.ELsmp)
http://bugs.centos.org/view.php?id=2077 (2.6.9-42.0.10.ELsmp)
http://www.centos.org/modules/newbb/viewtopic.php?viewmode=flat&topic_id=8779&forum=27 (2.6.9-55.ELsmp)

This event sent from IssueTracker by bbraswel  [Support Engineering Group]
 issue 137165

Comment 2 Josef Bacik 2007-11-15 21:18:13 UTC
Created attachment 260401
patch to fix the problem.

Please have the customer test this patch and verify it works for them.

Comment 12 RHEL Program Management 2008-03-26 18:39:10 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 13 Vivek Goyal 2008-03-27 23:23:09 UTC
Committed in 68.27.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 16 errata-xmlrpc 2008-07-24 19:20:44 UTC
An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html

Comment 18 Josef Bacik 2008-09-12 01:28:00 UTC
*** Bug 461871 has been marked as a duplicate of this bug. ***