Bug 200434

Summary: Kernel panic on ext3 fs jbd checkpoint 361
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: i686
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: medium
Reporter: CI <redhat.com>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Brian Brock <bbrock>
Doc Type: Bug Fix
Last Closed: 2006-07-27 19:37:18 UTC

Description CI 2006-07-27 17:19:14 UTC
Description of problem:

Using iSCSI to mount a LeftHand Networks SAN unit.
Running NFS to share the mounted SAN volume with other machines.
Using the ext3 file system.
The kernel panics infrequently.

Jul 12 04:20:21 fs1 kernel: Assertion failure in log_do_checkpoint() at fs/jbd/checkpoint.c:361: "drop_count != 0 || cleanup_ret != 0"
Jul 12 04:20:21 fs1 kernel: ------------[ cut here ]------------
Jul 12 04:20:21 fs1 kernel: kernel BUG at fs/jbd/checkpoint.c:361!
Jul 12 04:20:21 fs1 kernel: invalid operand: 0000 [#1]
Jul 12 04:20:21 fs1 kernel: SMP
Jul 12 04:20:21 fs1 kernel: Modules linked in: nfs nfsd exportfs lockd md5 ipv6 sunrpc crc32c libcrc32c iscsi_sfnet scsi_transport_iscsi microcode dm_mirror dm_mod button battery ac uhci_hcd ehci_hcd hw_random shpchp e1000 bonding(U) floppy sg ext3 jbd megaraid_mbox megaraid_mm sd_mod scsi_mod
Jul 12 04:20:21 fs1 kernel: CPU:    2
Jul 12 04:20:21 fs1 kernel: EIP:    0060:[<f8832fa7>]    Not tainted VLI
Jul 12 04:20:21 fs1 kernel: EFLAGS: 00010216   (2.6.9-22.ELsmp)
Jul 12 04:20:21 fs1 kernel: EIP is at log_do_checkpoint+0x111/0x14b [jbd]
Jul 12 04:20:21 fs1 kernel: eax: 0000006e   ebx: f6dd7c8c   ecx: f7764ac4   edx: f8836d4c
Jul 12 04:20:21 fs1 kernel: esi: f7e20600   edi: f64eab00   ebp: 00000000   esp: f7764ac0
Jul 12 04:20:21 fs1 kernel: ds: 007b   es: 007b   ss: 0068
Jul 12 04:20:21 fs1 kernel: Process nfsd (pid: 2600, threadinfo=f7764000 task=f701a7b0)
Jul 12 04:20:21 fs1 kernel: Stack: f8836d4c f8835efd f8836d38 00000169 f8836e07 13d7fc8d f6dd7c8c f6dd7c8c
Jul 12 04:20:21 fs1 kernel:        00000000 00000000 f4923c10 c015bff2 00001000 f888f157 00001000 00000246
Jul 12 04:20:21 fs1 kernel:        00000012 f88fde1a 00000000 ef0ca250 02000000 f7716344 00000008 00000000
Jul 12 04:20:21 fs1 kernel: Call Trace:
Jul 12 04:20:21 fs1 kernel:  [<c015bff2>] __bread+0x9/0x1e
Jul 12 04:20:21 fs1 kernel:  [<f888f157>] ext3_get_branch+0x63/0xc6 [ext3]
Jul 12 04:20:21 fs1 kernel:  [<f88fde1a>] e1000_xmit_frame+0x947/0x951 [e1000]
Jul 12 04:20:21 fs1 kernel:  [<c028b041>] qdisc_restart+0x12/0x1ad
Jul 12 04:20:21 fs1 kernel:  [<c027dccb>] dev_queue_xmit+0x1ff/0x207
Jul 12 04:20:21 fs1 kernel:  [<f88ce357>] bond_dev_queue_xmit+0x1d4/0x1db [bonding]
Jul 12 04:20:21 fs1 kernel:  [<f88d12e5>] bond_xmit_roundrobin+0xd3/0xdb [bonding]
Jul 12 04:20:21 fs1 kernel:  [<f8832b60>] __log_wait_for_space+0xbb/0xe5 [jbd]
Jul 12 04:20:21 fs1 kernel:  [<f882f37e>] start_this_handle+0x2e4/0x32a [jbd]
Jul 12 04:20:21 fs1 kernel:  [<c012f294>] in_group_p+0x31/0x58
Jul 12 04:20:21 fs1 kernel:  [<f889bb7c>] ext3_permission+0x0/0x163 [ext3]
Jul 12 04:20:21 fs1 kernel:  [<f889bc66>] ext3_permission+0xea/0x163 [ext3]
Jul 12 04:20:21 fs1 kernel:  [<c02cf6e3>] __cond_resched+0x14/0x39
Jul 12 04:20:21 fs1 kernel:  [<f882f47c>] journal_start+0x78/0x9e [jbd]
Jul 12 04:20:21 fs1 kernel:  [<f888fd67>] ext3_prepare_write+0x32/0xf5 [ext3]
Jul 12 04:20:21 fs1 kernel:  [<c0141205>] generic_file_buffered_write+0x186/0x47c
Jul 12 04:20:21 fs1 kernel:  [<c0126344>] current_fs_time+0x44/0x4c
Jul 12 04:20:21 fs1 kernel:  [<c0141884>] __generic_file_aio_write_nolock+0x389/0x3b7
Jul 12 04:20:21 fs1 kernel:  [<c01419b5>] __generic_file_write_nolock+0x84/0x99
Jul 12 04:20:21 fs1 kernel:  [<c011ffb1>] autoremove_wake_function+0x0/0x2d
Jul 12 04:20:21 fs1 kernel:  [<c0141cc5>] generic_file_writev+0x4f/0xac
Jul 12 04:20:21 fs1 kernel:  [<c0141c76>] generic_file_writev+0x0/0xac
Jul 12 04:20:21 fs1 kernel:  [<c0159c8d>] do_sync_write+0x0/0xc9
Jul 12 04:20:21 fs1 kernel:  [<c015a177>] do_readv_writev+0x19c/0x21d
Jul 12 04:20:21 fs1 kernel:  [<c0159281>] dentry_open+0xf0/0x1a5
Jul 12 04:20:21 fs1 kernel:  [<c015a276>] vfs_writev+0x3e/0x43
Jul 12 04:20:21 fs1 kernel:  [<f8c87e99>] nfsd_write+0xeb/0x289 [nfsd]
Jul 12 04:20:21 fs1 kernel:  [<f8c8e682>] nfsd3_proc_write+0xbf/0xd5 [nfsd]
Jul 12 04:20:21 fs1 kernel:  [<f8c906ab>] nfs3svc_decode_writeargs+0x0/0x243 [nfsd]
Jul 12 04:20:21 fs1 kernel:  [<f8c8467a>] nfsd_dispatch+0xba/0x170 [nfsd]
Jul 12 04:20:21 fs1 kernel:  [<f8be4429>] svc_process+0x41b/0x6ce [sunrpc]
Jul 12 04:20:21 fs1 kernel:  [<f8c8445a>] nfsd+0x1cc/0x332 [nfsd]
Jul 12 04:20:21 fs1 kernel:  [<f8c8428e>] nfsd+0x0/0x332 [nfsd]
Jul 12 04:20:21 fs1 kernel:  [<c01041f1>] kernel_thread_helper+0x5/0xb
Jul 12 04:20:21 fs1 kernel: Code: 89 f0 e8 4c fc ff ff 0b 44 24 10 75 29 68 07 6e 83 f8 68 69 01 00 00 68 38 6d 83 f8 68 fd 5e 83 f8 68 4c 6d 83 f8 e8 6b f3 8e c7 <0f> 0b 69 01 38 6d 83 f8 83 c4 14 39 7e 40 0f 84 09 ff ff ff 8d
Jul 12 04:20:21 fs1 kernel:  <0>Fatal exception: panic in 5 seconds

According to Stephen Tweedie, this should be fixed by the following patch, which
I traced back to Linux kernel 2.6.12.

---------------
commit 00ea81459c279f14a7b344320a71c94f60f88929
Author: Jan Kara <jack>
Date:   Thu Jun 2 14:02:00 2005 -0700

    [PATCH] ext3: fix log_do_checkpoint() assertion failure

    Fix possible false assertion failure in log_do_checkpoint().  We might fail
    to detect that we actually made a progress when cleaning up the checkpoint
    lists if we don't retry after writing something to disk.  The patch was
    confirmed to fix observed assertion failures for several users.

    When we flushed some buffers we need to retry scanning the list.
    Otherwise we can fail to detect our progress.

    Signed-off-by: Jan Kara <jack>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>
------------------
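
The failure mode described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel's jbd code: the names buf_state, scan_list, flush_batch, checkpoint_pass and retry_after_flush are hypothetical, and cleanup_ret is assumed to be 0 (the cleanup step found nothing). The model shows that if every buffer on the list is dirty, one scan only queues writes and drops nothing, so a progress check made right after the flush fails even though work was done. Rescanning after the flush lets the now-clean buffers be dropped and counted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy buffer states; hypothetical, not the kernel's buffer_head/journal_head. */
enum buf_state { BUF_DIRTY, BUF_CLEAN, BUF_DROPPED };

/* One scan over the toy checkpoint list: clean buffers are dropped and
 * counted in *drop_count, dirty buffers are only counted in *batch_count. */
static void scan_list(enum buf_state *bufs, size_t n,
                      int *drop_count, int *batch_count)
{
    for (size_t i = 0; i < n; i++) {
        if (bufs[i] == BUF_CLEAN) {
            bufs[i] = BUF_DROPPED;
            (*drop_count)++;
        } else if (bufs[i] == BUF_DIRTY) {
            (*batch_count)++;
        }
    }
}

/* Model of writing the batch to disk: dirty buffers become clean
 * but stay on the list until a later scan drops them. */
static void flush_batch(enum buf_state *bufs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (bufs[i] == BUF_DIRTY)
            bufs[i] = BUF_CLEAN;
    }
}

/* Returns the value of the progress check, i.e. the analogue of
 * "drop_count != 0 || cleanup_ret != 0" from the assertion message. */
static bool checkpoint_pass(enum buf_state *bufs, size_t n,
                            bool retry_after_flush)
{
    assert(bufs != NULL || n == 0);

    for (;;) {
        int drop_count = 0;
        int batch_count = 0;

        scan_list(bufs, n, &drop_count, &batch_count);

        if (batch_count > 0) {
            flush_batch(bufs, n);
            if (retry_after_flush)
                continue; /* rescan so the flushed buffers are dropped */
        }

        int cleanup_ret = 0; /* assumption: cleanup found nothing to do */
        return drop_count != 0 || cleanup_ret != 0;
    }
}
```

In this model, calling checkpoint_pass on a list of three BUF_DIRTY entries with retry_after_flush set to false returns false, which corresponds to the assertion firing despite real progress. With retry_after_flush set to true, the second scan drops all three buffers and the check returns true.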

The issue is that the current Red Hat Enterprise Linux kernel is only at
2.6.9-34.EL. Has this patch been back-ported to that kernel?


Version-Release number of selected component (if applicable):

Dell PowerEdge 2850
PERC RAID controller
iscsi-initiator-utils-4.0.3.0-2
Linux fs1 2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:32:14 EDT 2005 i686 i686 i386 GNU/Linux

Difficult to reproduce! The kernel panics once every 2 weeks.

Comment 1 Jason Baron 2006-07-27 19:37:18 UTC

*** This bug has been marked as a duplicate of 162814 ***