Description of problem:
On a Dell PE2850 (dual 3.6GHz, HT) with 4GB of RAM, running a RHEL4 pre-U2 kernel, I have seen an intermittent problem while running LTP.

The test scenario ltpstress.sh, which is part of the LTP test suite, was specially designed to exercise a wide range of kernel components in parallel with networking and memory management, and to create a high-stress workload on the system under test. The script runs similar test cases in parallel and different test cases in sequence in order to avoid intermittent failures caused by tests contending for the same resources or interfering with one another. By default, this script executes:
* NFS stress tests
* Memory management stress tests
* Filesystem stress tests
* Math (floating point) tests
* pthread stress tests
* Disk I/O tests
* IPC (pipeio, semaphore) tests
* System call functional verification tests
* Networking stress tests

Version-Release number of selected component (if applicable):
First seen on 2.6.9-11.6; reproduced on a test kernel from Dave Anderson, 2.6.9-11.10.GFP_NOFS.2.EL.rootsmp.

How reproducible:
Intermittent

Steps to Reproduce:
1. On an EM64T system running the RHEL4 x86_64 kernel, install LTP.
2. Run the ltpstress.sh script.

Actual results:
Assertion failure in journal_commit_transaction() at fs/jbd/commit.c:790: "jh->b_next_transaction == ((void *)0)"
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at commit:790
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: nfs nfsd exportfs lockd parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core sunrpc ds \
yenta_socket pcmcia_core dm_mod button battery ac joydev md5 ipv6 uhci_hcd ehci_hcd hw_random e1000 floppy sg ext3 jbd \
raid1 megaraid_mbox megaraid_mm aic7xxx sd_mod scsi_mod
Pid: 275, comm: kjournald Not tainted 2.6.9-11.10.GFP_NOFS.2.EL.rootsmp
RIP: 0010:[<ffffffffa006a652>] <ffffffffa006a652>{:jbd:journal_commit_transaction+4006}
RSP: 0000:0000010037e89bb8 EFLAGS: 00010212
RAX: 0000000000000075 RBX: 0000000000000000 RCX: 0000000100000000
RDX: ffffffff803cb088 RSI: 0000000000000246 RDI: ffffffff803cb080
RBP: 0000010118332698 R08: ffffffff803cb088 R09: 0000000000000000
R10: 00000000000000a9 R11: 00000000000000a9 R12: 00000100745e6b68
R13: 000001013f4b14b0 R14: 00000100bff874f0 R15: 000001007c42fbd8
FS: 0000000000000000(0000) GS:ffffffff804c8000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002ab7263000 CR3: 00000000bffa2000 CR4: 00000000000006e0
Process kjournald (pid: 275, threadinfo 0000010037e88000, task 000001013f554030)
Stack: 0000000a894e1393 00000f5c00000000 0000010090d0c0a4 0000000000000001
       000001000d76ab68 0000000000001f92 0000000000000000 000001013f554030
       ffffffff8013393e 0000010037e89c30
Call Trace:<ffffffff8013393e>{autoremove_wake_function+0}
       <ffffffff80130b49>{try_to_wake_up+734}
       <ffffffff8013393e>{autoremove_wake_function+0}
       <ffffffff80130e85>{finish_task_switch+55}
       <ffffffff802f9c54>{thread_return+42}
       <ffffffff8013dc8f>{del_timer+107}
       <ffffffffa006c8a4>{:jbd:kjournald+250}
       <ffffffff8013393e>{autoremove_wake_function+0}
       <ffffffff8013393e>{autoremove_wake_function+0}
       <ffffffffa006c7a4>{:jbd:commit_timeout+0}
       <ffffffff80110c8f>{child_rip+8}
       <ffffffffa006c7aa>{:jbd:kjournald+0}
       <ffffffff80110c87>{child_rip+0}

Code: 0f 0b 10 f8 06 a0 ff ff ff ff 16 03 4c 89 e7 e8 a4 d4 ff ff
RIP <ffffffffa006a652>{:jbd:journal_commit_transaction+4006} RSP <0000010037e89bb8>
Modules linked in: nfs nfsd exportfs lockd parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core sunrpc ds \
yenta_socket pcmcia_core dm_mod button battery ac joydev md5 ipv6 uhci_hcd ehci_hcd hw_random e1000 floppy sg ext3 jbd \
raid1 megaraid_mbox megaraid_mm aic7xxx sd_mod scsi_mod
Pid: 275, comm: kjournald Not tainted 2.6.9-11.10.GFP_NOFS.2.EL.rootsmp
RIP: 0010:[<ffffffffa006a652>] <ffffffffa006a652>{:jbd:journal_commit_transaction+4006}
RSP: 0000:0000010037e89bb8 EFLAGS: 00010212
RAX: 0000000000000075 RBX: 0000000000000000 RCX: 0000000100000000
RDX: ffffffff803cb088 RSI: 0000000000000246 RDI: ffffffff803cb080
RBP: 0000010118332698 R08: ffffffff803cb088 R09: 0000000000000000
R10: 00000000000000a9 R11: 00000000000000a9 R12: 00000100745e6b68
R13: 000001013f4b14b0 R14: 00000100bff874f0 R15: 000001007c42fbd8
FS: 0000000000000000(0000) GS:ffffffff804c8000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002ab7263000 CR3: 00000000bffa2000 CR4: 00000000000006e0
Call Trace:<ffffffffa006a652>{:jbd:journal_commit_transaction+4006}
       <ffffffff8013393e>{autoremove_wake_function+0}
       <ffffffff80130b49>{try_to_wake_up+734}
       <ffffffff8013393e>{autoremove_wake_function+0}
       <ffffffff80130e85>{finish_task_switch+55}
       <ffffffff802f9c54>{thread_return+42}
       <ffffffff8013dc8f>{del_timer+107}
       <ffffffffa006c8a4>{:jbd:kjournald+250}
       <ffffffff8013393e>{autoremove_wake_function+0}
       <ffffffff8013393e>{autoremove_wake_function+0}
       <ffffffffa006c7a4>{:jbd:commit_timeout+0}
       <ffffffff80110c8f>{child_rip+8}
       <ffffffffa006c7aa>{:jbd:kjournald+0}
       <ffffffff80110c87>{child_rip+0}

Expected results:
The 24-hour stress run should have passed.

Additional info: The full /var/log/messages file and the vmcore can be found at ndnc-1.lab.boston.redhat.com:/var/crash/192.168.79.127-2005-06-19-09:30
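
As a cross-check on the oops, the "Code:" bytes can be decoded by hand. The sketch below is only an illustration and rests on assumptions not stated in the report: that the bytes start at the faulting RIP, and that BUG() on x86_64 kernels of this generation emits a ud2 opcode followed by an 8-byte pointer to the file name string and a 16-bit line number. The function name decode_bug_frame is made up for this example.

```python
import struct

# "Code:" bytes copied from the oops in this report.
CODE = "0f 0b 10 f8 06 a0 ff ff ff ff 16 03 4c 89 e7 e8 a4 d4 ff ff"


def decode_bug_frame(code_hex):
    """Decode a ud2-based BUG() frame from oops code bytes.

    Assumed layout (not confirmed by the report): 2-byte ud2 opcode,
    8-byte little-endian pointer to the file name string,
    2-byte little-endian line number.
    """
    raw = bytes.fromhex(code_hex.replace(" ", ""))
    # "<" selects little-endian with no padding: 2 + 8 + 2 = 12 bytes.
    opcode, name_ptr, line = struct.unpack_from("<2sQH", raw, 0)
    if opcode != b"\x0f\x0b":
        raise ValueError("code bytes do not start with ud2 (0f 0b)")
    return name_ptr, line


name_ptr, line = decode_bug_frame(CODE)
print(f"file name string at {name_ptr:#018x}, line {line}")
```

Under those assumptions the decoded line number is 790, which agrees with "Kernel BUG at commit:790", and the file name pointer falls in the same address range as the jbd module symbols in the call trace.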
I'm looking into this a bit deeper now. Is this still reproducible on current kernels? Have you seen any further oopses like it? When you say "intermittent", just how hard is it to reproduce? How long does it take, what sort of system are you reproducing it on, and do any systems seem not to show the problem?
Some additional questions: Have previous kernels been subjected to the same level of testing as recent ones, i.e. are we _sure_ this started appearing recently? (I know for a fact that earlier pre-U2-beta kernels had a bug which would show up in just the same way, so I would be really surprised if they passed similar loads reliably.) How repeatable is it? You mention at least 2 kernels showing it, but there's only one oops here --- do you have a vmcore from a mainline build?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0132.html