Bug 161101

Summary: Assertion failure in journal_commit_transaction() at fs/jbd/commit.c:790: "jh->b_next_transaction == ((void *)0)"
Product: Red Hat Enterprise Linux 4 Reporter: Jeff Burke <jburke>
Component: kernel Assignee: Stephen Tweedie <sct>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 4.0 CC: anderson, davej, jbacik, jbaron
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2006-0132 Doc Type: Bug Fix
Last Closed: 2006-03-07 19:09:33 UTC Type: ---
Bug Blocks: 164439, 168429    

Description Jeff Burke 2005-06-20 15:48:51 UTC
Description of problem:
Hardware: Dell PE2850, dual 3.6GHz HT, with 4GB of RAM, running a RHEL4 pre-U2 kernel.
While running LTP, I have seen an intermittent problem.

A test scenario called ltpstress.sh was specially designed to exercise a wide
range of kernel components in parallel with networking and memory management and
to create a high-stress workload on the testing system. ltpstress.sh is also
part of the LTP test suite. The script runs similar test cases in parallel and
different test cases in sequence in order to avoid intermittent failures caused
by running into the same resources or interfering with one another. By default,
this script executes:

   * NFS stress tests
   * Memory management stress tests
   * Filesystem stress tests
   * Math (floating point) tests
   * pthread stress tests
   * Disk I/O tests
   * IPC (pipeio, semaphore) tests
   * System call functional verification tests
   * Networking stress tests


Version-Release number of selected component (if applicable):
First seen on 2.6.9-11.6; reproduced on a test kernel from Dave Anderson:
2.6.9-11.10.GFP_NOFS.2.EL.rootsmp

How reproducible:
intermittent

Steps to Reproduce:
1. On an EM64T system running the RHEL4 x86_64 kernel, install LTP
2. Run the ltpstress.sh script
  
Actual results:
Assertion failure in journal_commit_transaction() at fs/jbd/commit.c:790:
"jh->b_next_transaction == ((void *)0)"
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at commit:790
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: nfs nfsd exportfs lockd parport_pc lp parport netconsole
netdump autofs4 i2c_dev i2c_core sunrpc ds \yenta_socket pcmcia_core dm_mod
button battery ac joydev md5 ipv6 uhci_hcd ehci_hcd hw_random e1000 floppy sg
ext3 jbd \raid1 megaraid_mbox megaraid_mm aic7xxx sd_mod scsi_mod
Pid: 275, comm: kjournald Not tainted 2.6.9-11.10.GFP_NOFS.2.EL.rootsmp
RIP: 0010:[<ffffffffa006a652>]
<ffffffffa006a652>{:jbd:journal_commit_transaction+4006}
RSP: 0000:0000010037e89bb8  EFLAGS: 00010212
RAX: 0000000000000075 RBX: 0000000000000000 RCX: 0000000100000000
RDX: ffffffff803cb088 RSI: 0000000000000246 RDI: ffffffff803cb080
RBP: 0000010118332698 R08: ffffffff803cb088 R09: 0000000000000000
R10: 00000000000000a9 R11: 00000000000000a9 R12: 00000100745e6b68
R13: 000001013f4b14b0 R14: 00000100bff874f0 R15: 000001007c42fbd8
FS:  0000000000000000(0000) GS:ffffffff804c8000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002ab7263000 CR3: 00000000bffa2000 CR4: 00000000000006e0
Process kjournald (pid: 275, threadinfo 0000010037e88000, task 000001013f554030)
Stack: 0000000a894e1393 00000f5c00000000 0000010090d0c0a4 0000000000000001
       000001000d76ab68 0000000000001f92 0000000000000000 000001013f554030
       ffffffff8013393e 0000010037e89c30
Call Trace:<ffffffff8013393e>{autoremove_wake_function+0}
<ffffffff80130b49>{try_to_wake_up+734}
       <ffffffff8013393e>{autoremove_wake_function+0}
<ffffffff80130e85>{finish_task_switch+55}
       <ffffffff802f9c54>{thread_return+42} <ffffffff8013dc8f>{del_timer+107}
       <ffffffffa006c8a4>{:jbd:kjournald+250}
<ffffffff8013393e>{autoremove_wake_function+0}
       <ffffffff8013393e>{autoremove_wake_function+0}
<ffffffffa006c7a4>{:jbd:commit_timeout+0}
       <ffffffff80110c8f>{child_rip+8} <ffffffffa006c7aa>{:jbd:kjournald+0}
       <ffffffff80110c87>{child_rip+0}
                                                                               
                                        
Code: 0f 0b 10 f8 06 a0 ff ff ff ff 16 03 4c 89 e7 e8 a4 d4 ff ff
RIP <ffffffffa006a652>{:jbd:journal_commit_transaction+4006} RSP <0000010037e89bb8>
                                                                               
                                        
Modules linked in: nfs nfsd exportfs lockd parport_pc lp parport netconsole
netdump autofs4 i2c_dev i2c_core sunrpc ds \yenta_socket pcmcia_core dm_mod
button battery ac joydev md5 ipv6 uhci_hcd ehci_hcd hw_random e1000 floppy sg
ext3 jbd \raid1 megaraid_mbox megaraid_mm aic7xxx sd_mod scsi_mod
Pid: 275, comm: kjournald Not tainted 2.6.9-11.10.GFP_NOFS.2.EL.rootsmp
RIP: 0010:[<ffffffffa006a652>]
<ffffffffa006a652>{:jbd:journal_commit_transaction+4006}
RSP: 0000:0000010037e89bb8  EFLAGS: 00010212
RAX: 0000000000000075 RBX: 0000000000000000 RCX: 0000000100000000
RDX: ffffffff803cb088 RSI: 0000000000000246 RDI: ffffffff803cb080
RBP: 0000010118332698 R08: ffffffff803cb088 R09: 0000000000000000
R10: 00000000000000a9 R11: 00000000000000a9 R12: 00000100745e6b68
R13: 000001013f4b14b0 R14: 00000100bff874f0 R15: 000001007c42fbd8
FS:  0000000000000000(0000) GS:ffffffff804c8000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002ab7263000 CR3: 00000000bffa2000 CR4: 00000000000006e0

Call Trace:<ffffffffa006a652>{:jbd:journal_commit_transaction+4006}
       <ffffffff8013393e>{autoremove_wake_function+0}
<ffffffff80130b49>{try_to_wake_up+734}
       <ffffffff8013393e>{autoremove_wake_function+0}
<ffffffff80130e85>{finish_task_switch+55}
       <ffffffff802f9c54>{thread_return+42} <ffffffff8013dc8f>{del_timer+107}
       <ffffffffa006c8a4>{:jbd:kjournald+250}
<ffffffff8013393e>{autoremove_wake_function+0}
       <ffffffff8013393e>{autoremove_wake_function+0}
<ffffffffa006c7a4>{:jbd:commit_timeout+0}
       <ffffffff80110c8f>{child_rip+8} <ffffffffa006c7aa>{:jbd:kjournald+0}
       <ffffffff80110c87>{child_rip+0}
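
As a cross-check on the oops above, the "Code:" bytes can be decoded to recover the line number of the failed assertion. The sketch below assumes that x86_64 kernels of this era emit BUG() as a packed frame consisting of the two-byte ud2 opcode, a 64-bit pointer to the file name string, and a 16-bit line number. That layout is an assumption not stated in this report, and decode_bug_frame is a hypothetical helper name.

```python
import struct

# "Code:" bytes copied from the oops in this report.
CODE = "0f 0b 10 f8 06 a0 ff ff ff ff 16 03 4c 89 e7 e8 a4 d4 ff ff"


def decode_bug_frame(code_hex):
    """Decode the assumed packed frame: ud2 opcode, file-name pointer, line number."""
    raw = bytes.fromhex(code_hex.replace(" ", ""))
    # "<" = little-endian with no padding: 2-byte opcode, u64 pointer, u16 line.
    opcode, filename_ptr, line = struct.unpack_from("<2sQH", raw, 0)
    if opcode != b"\x0f\x0b":
        raise ValueError("bytes do not start with ud2")
    return filename_ptr, line


if __name__ == "__main__":
    filename_ptr, line = decode_bug_frame(CODE)
    print(hex(filename_ptr), line)
```

Under that assumption the decoded line number is 790, matching "Kernel BUG at commit:790", and the decoded pointer lies in the same address range as the jbd module symbols in the call trace.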


Expected results:
The 24-hour stress run should have passed.

Additional info: The full /var/log/messages file and the vmcore 
can be found at
ndnc-1.lab.boston.redhat.com:/var/crash/192.168.79.127-2005-06-19-09:30
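
When matching the call trace in the description against the vmcore, it can help to turn each address/offset pair back into a symbol start address. The sketch below parses a few entries copied from the trace and subtracts the offset from the address. It assumes the offsets are printed in decimal, and the names ENTRY and symbol_bases are hypothetical.

```python
import re

# A few entries copied from the call trace in this report.
TRACE = """
<ffffffffa006a652>{:jbd:journal_commit_transaction+4006}
<ffffffffa006c8a4>{:jbd:kjournald+250}
<ffffffffa006c7a4>{:jbd:commit_timeout+0}
<ffffffffa006c7aa>{:jbd:kjournald+0}
"""

# <address>{:module:symbol+offset}; the ":module:" part is absent for core kernel symbols.
ENTRY = re.compile(r"<([0-9a-f]{16})>\{(?::(\w+):)?(\w+)\+(\d+)\}")


def symbol_bases(trace):
    """Map (module, symbol) to the set of start addresses implied by the trace."""
    bases = {}
    for addr, module, symbol, offset in ENTRY.findall(trace):
        base = int(addr, 16) - int(offset)  # offset assumed to be decimal
        bases.setdefault((module or "kernel", symbol), set()).add(base)
    return bases


if __name__ == "__main__":
    for (module, symbol), addrs in symbol_bases(TRACE).items():
        print(module, symbol, [hex(a) for a in sorted(addrs)])
```

Both kjournald entries resolve to the same start address, which supports the decimal-offset assumption. A symbol that resolved to more than one start address would point to a misread or corrupted trace line.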

Comment 4 Stephen Tweedie 2005-08-18 12:38:27 UTC
I'm looking into this a bit deeper now.  Is this still reproducible on current
kernels?  Have you seen any further oopses like it?  When you say
"intermittent", just how hard is it to reproduce?  How long does it take, what
sort of system are you reproducing it on, and do any systems seem not to show
the problem?


Comment 9 Stephen Tweedie 2005-08-18 14:27:25 UTC
Some additional questions:

Have previous kernels been subjected to the same level of testing as recent
ones, i.e. are we _sure_ this started appearing recently?  (I know for a fact
that earlier pre-U2-beta kernels had a bug which would show up in just the same
way, so I would be really surprised if they passed similar loads reliably.)

How repeatable is it?  You mention at least 2 kernels showing it, but there's
only one oops here --- do you have a vmcore from a mainline build?



Comment 26 Red Hat Bugzilla 2006-03-07 19:09:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html