From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7

Description of problem:
Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP:
<ffffffffa006dfc6>{:jbd:journal_dirty_metadata+71}
PML4 15e644067 PGD 2d7f90067 PMD 0
Oops: 0000 [1] SMP
CPU 62
Modules linked in: nfs lockd nfs_acl vfat fat netconsole netdump md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod button battery ac joydev ohci_hcd ehci_hcd tg3 ext3 jbd aic7xxx aacraid sd_mod scsi_mod
Pid: 16551, comm: as Not tainted 2.6.9-27.ELlargesmp
RIP: 0010:[<ffffffffa006dfc6>] <ffffffffa006dfc6>{:jbd:journal_dirty_metadata+71}
<5>audit(1135970141.109:411572): avc: denied { setattr } for pid=18208 comm="randasys" name="[3680023]" dev=pipefs ino=3680023 scontext=root:system_r:unconfined_t tcontext=root:system_r:unconfined_t tclass=fifo_file
audit(1135970141.129:411573): avc: denied { setattr } for pid=18272 comm="randasys" name="[3680106]" dev=pipefs ino=3680106 scontext=root:system_r:unconfined_t tcontext=root:system_r:unconfined_t tclass=fifo_file
RSP: 0018:00000102fba978e8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000104e2c847c0 RCX: 00000104eb988000
RDX: 00000104ef4cb400 RSI: 00000104ec751558 RDI: 000001034fb1c910
RBP: 00000104ec751558 R08: 000000000000384a R09: 00000102fb2726b8
R10: 0000000000000058 R11: 0000000000000058 R12: 0000000000000000
R13: 00000104ef493800 R14: 000001034fb1c910 R15: 00000104eba1e860
FS: 0000002a95584da0(0000) GS:ffffffff804ec580(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 00000004ef40a000 CR4: 00000000000006e0
Process as (pid: 16551, threadinfo 00000102fba96000, task 0000010204c327f0)
Stack: 00000104eb98a280 000000000061b84a 00000104ef51c800 00000104ec751558
00000104ec751558 ffffffffa008236f 000001009bf64ae8 00000104eb988000
00000104eb974400 000000c300000001
Call Trace:
<ffffffffa008236f>{:ext3:ext3_new_block+1058}
<ffffffffa00843ca>{:ext3:ext3_alloc_block+7}
<ffffffffa0085faf>{:ext3:ext3_get_block_handle+881}
<ffffffffa006de73>{:jbd:__journal_file_buffer+384}
<ffffffff8017a7e2>{alloc_buffer_head+49}
<ffffffff8017ae0c>{create_buffers+99}
<ffffffff8017b583>{__block_prepare_write+339}
<5>audit(1135970141.288:411574): avc: denied { setattr } for pid=18272 comm="randasys" name="[3680108]" dev=pipefs ino=3680108 scontext=root:system_r:unconfined_t tcontext=root:system_r:unconfined_t tclass=fifo_file
<ffffffffa0086420>{:ext3:ext3_get_block+0}
<ffffffff8017b83f>{block_prepare_write+26}
<ffffffffa0084781>{:ext3:ext3_prepare_write+101}
<ffffffff8015a78d>{generic_file_buffered_write+440}
<ffffffffa008c568>{:ext3:__ext3_journal_stop+31}
<ffffffff8019608c>{__mark_inode_dirty+40}
<ffffffff8015af5a>{__generic_file_aio_write_nolock+731}
<ffffffff8015b1f8>{generic_file_aio_write_nolock+32}
<ffffffff8015b2c2>{generic_file_aio_write+126}
<ffffffffa0082ee5>{:ext3:ext3_file_write+22}
<ffffffff80177c09>{do_sync_write+173}
<ffffffff80134e36>{autoremove_wake_function+0}
<ffffffff80177d04>{vfs_write+207}
<ffffffff80177dec>{sys_write+69}
<ffffffff801101c6>{system_call+126}

Version-Release number of selected component (if applicable):
kernel-2.6.9-17.EL

How reproducible:
Always

Steps to Reproduce:
1.Running on a 4 node x460 cluster (64 CPU box with 20 gig).
2.Using the RHTS kernel rhts-kernel-tests-1.0-200512221612.x86_64.rpm
3.Run the /mnt/tests/kernel/stress/ibm/pounder test

Actual Results:
Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP:
<ffffffffa006dfc6>{:jbd:journal_dirty_metadata+71}
(same oops, module list, and register dump as in the description above)

Expected Results:
Test should run to completion.

Additional info:
This system is configured for netdump, but so far every netdump stops with a vmcore-incomplete.
Not sure if this is because the system has 20 gig of RAM and a timeout occurs, or because the netdump server can't handle a 20 gig vmcore.
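For anyone triaging the oops in this report, the call trace entries follow a regular `<address>{:module:symbol+offset}` shape, so they can be pulled apart mechanically. The following is only a minimal sketch: the helper name `parse_frames`, the `"kernel"` label for frames without a module prefix, and reading the offset as decimal are assumptions made for illustration, not part of any kernel or crash tooling.

```python
import re

# Matches entries such as <ffffffffa008236f>{:ext3:ext3_new_block+1058}
# and <ffffffff8017a7e2>{alloc_buffer_head+49} (no module prefix).
FRAME_RE = re.compile(
    r"<(?P<addr>[0-9a-f]{16})>"
    r"\{(?::(?P<module>\w+):)?(?P<symbol>\w+)\+(?P<offset>\d+)\}"
)


def parse_frames(oops_text):
    """Return one dict per call-trace entry found in oops_text (hypothetical helper)."""
    frames = []
    for m in FRAME_RE.finditer(oops_text):
        frames.append({
            "addr": int(m.group("addr"), 16),
            # Assumption: frames without a ":module:" prefix are labelled "kernel".
            "module": m.group("module") or "kernel",
            "symbol": m.group("symbol"),
            # Assumption: the offset after "+" is decimal, as the values in this report suggest.
            "offset": int(m.group("offset")),
        })
    return frames


# A few entries copied from the trace in this report.
sample = (
    "Call Trace:<ffffffffa008236f>{:ext3:ext3_new_block+1058} "
    "<ffffffffa00843ca>{:ext3:ext3_alloc_block+7} "
    "<ffffffff8017a7e2>{alloc_buffer_head+49}"
)

for frame in parse_frames(sample):
    print(f"{frame['module']:>6}  {frame['symbol']}+{frame['offset']}")
```

Because the pattern only matches the angle-bracket and brace form, the interleaved audit lines in the console output are skipped without extra filtering.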
I re-ran the test after cleaning up disk space on the netdump server. You can view the vmcore file here:
ndnc-1.lab.boston.redhat.com:/var/crash/192.168.77.110-2006-01-03-14:56
-rw------- 1 netdump netdump 20G Jan 3 19:23 vmcore
-rw------- 1 netdump netdump 1.3M Jan 3 19:24 log
The guys at LLNL may have hit this bug. In researching it, they believe that this problem was fixed upstream in 2.6.12.5. See: https://bugzilla.lustre.org/show_bug.cgi?id=6419
Created attachment 124837 [details] Upstream journal_unmap_buffer-vs-commit race fix.
Created attachment 135303 [details]
upstream patch to fix JBD race in t_forget list handling

Here's the backport of the previously attached patch, from http://lkml.org/lkml/2005/7/11/123 and http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=e6c9f5c1888097c936334bf9740024520ca47b8e
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
What is the NEEDINFO status for? What info do you need, from whom?
I don't need more info other than to know if the patch in Comment #15 helps the customer's or reporter's testcases... At one point needinfo was set to find out which patch the customer had run with, but that was resolved - they had not run with any patch at that point.
Jeff, just a friendly ping on this, the proposed patch is built up in Brew... Thanks, -Eric
Eric, sorry for the delay; the test is running now. I will update when it is finished. Thanks, Jeff
Eric, I was able to successfully run the test. It ran for 24 hours. I also ran this test on several other systems with all of the x86_64 kernel variants. Jeff
Excellent! Thanks Jeff.
Sent to rhkernel-list on 11/7/06
Committed in stream U5 build 42.40. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0304.html