Bug 176738

Summary: Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP: <ffffffffa006dfc6>{:jbd:journal_dirty_metadata+71}
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.3
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Jeff Burke <jburke>
Assignee: Eric Sandeen <esandeen>
QA Contact: Brian Brock <bbrock>
CC: dwmw2, jbaron, lwoodman, sct, staubach
Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Last Closed: 2007-05-02 00:00:32 UTC
Attachments:
Description: Upstream journal_unmap_buffer-vs-commit race fix. Flags: none
Description: upstream patch to fix JBD race in t_forget list handling Flags: none

Description Jeff Burke 2005-12-31 15:13:37 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7

Description of problem:
Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP:
<ffffffffa006dfc6>{:jbd:journal_dirty_metadata+71}
PML4 15e644067 PGD 2d7f90067 PMD 0
Oops: 0000 [1] SMP
CPU 62
Modules linked in: nfs lockd nfs_acl vfat fat netconsole netdump md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod button battery ac joydev ohci_hcd ehci_hcd tg3 ext3 jbd aic7xxx aacraid sd_mod scsi_mod
Pid: 16551, comm: as Not tainted 2.6.9-27.ELlargesmp
RIP: 0010:[<ffffffffa006dfc6>] <ffffffffa006dfc6>{:jbd:journal_dirty_metadata+71}
<5>audit(1135970141.109:411572): avc:  denied  { setattr } for  pid=18208 comm="randasys" name="[3680023]" dev=pipefs ino=3680023 scontext=root:system_r:unconfined_t tcontext=root:system_r:unconfined_t tclass=fifo_file
audit(1135970141.129:411573): avc:  denied  { setattr } for  pid=18272 comm="randasys" name="[3680106]" dev=pipefs ino=3680106 scontext=root:system_r:unconfined_t tcontext=root:system_r:unconfined_t tclass=fifo_file

RSP: 0018:00000102fba978e8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000104e2c847c0 RCX: 00000104eb988000
RDX: 00000104ef4cb400 RSI: 00000104ec751558 RDI: 000001034fb1c910
RBP: 00000104ec751558 R08: 000000000000384a R09: 00000102fb2726b8
R10: 0000000000000058 R11: 0000000000000058 R12: 0000000000000000
R13: 00000104ef493800 R14: 000001034fb1c910 R15: 00000104eba1e860
FS:  0000002a95584da0(0000) GS:ffffffff804ec580(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 00000004ef40a000 CR4: 00000000000006e0
Process as (pid: 16551, threadinfo 00000102fba96000, task 0000010204c327f0)
Stack: 00000104eb98a280 000000000061b84a 00000104ef51c800 00000104ec751558
       00000104ec751558 ffffffffa008236f 000001009bf64ae8 00000104eb988000
       00000104eb974400 000000c300000001
Call Trace:<ffffffffa008236f>{:ext3:ext3_new_block+1058} <ffffffffa00843ca>{:ext3:ext3_alloc_block+7}
       <ffffffffa0085faf>{:ext3:ext3_get_block_handle+881}
       <ffffffffa006de73>{:jbd:__journal_file_buffer+384}
       <ffffffff8017a7e2>{alloc_buffer_head+49} <ffffffff8017ae0c>{create_buffers+99}
       <ffffffff8017b583>{__block_prepare_write+339} <ffffffffa0086420>{:ext3:ext3_get_block+0}
<5>audit(1135970141.288:411574): avc:  denied  { setattr } for  pid=18272 comm="randasys" name="[3680108]" dev=pipefs ino=3680108 scontext=root:system_r:unconfined_t tcontext=root:system_r:unconfined_t tclass=fifo_file
       <ffffffff8017b83f>{block_prepare_write+26} <ffffffffa0084781>{:ext3:ext3_prepare_write+101}
       <ffffffff8015a78d>{generic_file_buffered_write+440}
       <ffffffffa008c568>{:ext3:__ext3_journal_stop+31} <ffffffff8019608c>{__mark_inode_dirty+40}
       <ffffffff8015af5a>{__generic_file_aio_write_nolock+731}
       <ffffffff8015b1f8>{generic_file_aio_write_nolock+32}
       <ffffffff8015b2c2>{generic_file_aio_write+126} <ffffffffa0082ee5>{:ext3:ext3_file_write+22}
       <ffffffff80177c09>{do_sync_write+173} <ffffffff80134e36>{autoremove_wake_function+0}
       <ffffffff80177d04>{vfs_write+207} <ffffffff80177dec>{sys_write+69}
       <ffffffff801101c6>{system_call+126}
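
A minimal sketch for reading the oops above, not part of the original report. It pulls out the faulting symbol, the fault address from CR2, and the general-purpose registers that hold zero. The helper name `parse_oops` and the regular expressions are hypothetical, and the symbol offset is assumed to be printed in decimal. A fault address of 0x20 together with zeroed registers is consistent with reading a field at offset 0x20 through a NULL pointer, but the sketch does not identify which structure or field was involved.

```python
import re

# Excerpt copied from the oops in this report (audit lines left out).
OOPS = """\
Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP:
<ffffffffa006dfc6>{:jbd:journal_dirty_metadata+71}
RAX: 0000000000000000 RBX: 00000104e2c847c0 RCX: 00000104eb988000
R10: 0000000000000058 R11: 0000000000000058 R12: 0000000000000000
CR2: 0000000000000020 CR3: 00000004ef40a000 CR4: 00000000000006e0
"""

# <address>{:module:symbol+offset}, as printed for module symbols in this log.
SYMBOL_RE = re.compile(r"<([0-9a-f]{16})>\{:(\w+):(\w+)\+(\d+)\}")
# NAME: 16-hex-digit value, for general-purpose and control registers.
REG_RE = re.compile(r"\b(R[A-Z0-9]{2}|CR\d):\s*([0-9a-f]{16})\b")


def parse_oops(text):
    """Hypothetical helper: summarize the faulting symbol and registers."""
    addr, module, symbol, offset = SYMBOL_RE.search(text).groups()
    regs = {name: int(value, 16) for name, value in REG_RE.findall(text)}
    return {
        "rip": int(addr, 16),
        "module": module,
        "symbol": symbol,
        "offset": int(offset),  # assumption: offset is decimal in this format
        "fault_addr": regs["CR2"],  # CR2 holds the faulting address on x86_64
        "zero_regs": [n for n, v in regs.items() if n.startswith("R") and v == 0],
    }


info = parse_oops(OOPS)
print(info["module"], info["symbol"], info["offset"])
print(hex(info["fault_addr"]), info["zero_regs"])
```

The same function can be run on a full captured log; lines it does not recognize, such as the interleaved audit messages, are ignored by the patterns.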


Version-Release number of selected component (if applicable):
kernel-2.6.9-17.EL

How reproducible:
Always

Steps to Reproduce:
1. Run on a 4-node x460 cluster (a 64-CPU box with 20 GB of RAM).
2. Use the RHTS kernel tests package rhts-kernel-tests-1.0-200512221612.x86_64.rpm.
3. Run the /mnt/tests/kernel/stress/ibm/pounder test.
  

Actual Results:  Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP:
<ffffffffa006dfc6>{:jbd:journal_dirty_metadata+71}
(The rest of the oops output is identical to the output under "Description of problem" above.)


Expected Results:  The test should run to completion.

Additional info:

This system is configured for netdump, but so far every netdump has stopped with a vmcore-incomplete. I am not sure whether that is because the system has 20 GB of RAM and a timeout occurs, or because the netdump server can't handle a 20 GB vmcore.

Comment 1 Jeff Burke 2006-01-04 18:46:28 UTC
I re-ran the test after cleaning up disk space on the netdump server. 
You can view the vmcore file here: 
ndnc-1.lab.boston.redhat.com:/var/crash/192.168.77.110-2006-01-03-14:56

-rw-------   1 netdump netdump  20G Jan  3 19:23 vmcore
-rw-------   1 netdump netdump 1.3M Jan  3 19:24 log


Comment 2 Ben Woodard 2006-02-14 20:30:01 UTC
The guys at LLNL may have hit this bug. In researching it, they believe that
this problem was fixed upstream in 2.6.12.5. See:
https://bugzilla.lustre.org/show_bug.cgi?id=6419

Comment 8 Stephen Tweedie 2006-02-17 22:09:27 UTC
Created attachment 124837
Upstream journal_unmap_buffer-vs-commit race fix.

Comment 15 Eric Sandeen 2006-08-31 16:19:10 UTC
Created attachment 135303
upstream patch to fix JBD race in t_forget list handling

Here's the backport of the previously attached patch.
From http://lkml.org/lkml/2005/7/11/123 and 
http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=e6c9f5c1888097c936334bf9740024520ca47b8e

Comment 16 RHEL Program Management 2006-09-13 22:03:19 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 17 Guy Streeter 2006-09-13 22:16:54 UTC
What is the NEEDINFO status for? What info do you need, from whom?


Comment 18 Eric Sandeen 2006-09-13 22:33:33 UTC
I don't need more info other than to know if the patch in Comment #15 helps the
customer's or reporter's testcases...

At one point needinfo was set to find out which patch the customer had run with,
but that was resolved - they had not run with any patch at that point.

Comment 24 Eric Sandeen 2006-10-26 14:39:33 UTC
Jeff, just a friendly ping on this, the proposed patch is built up in Brew...
Thanks,
-Eric

Comment 25 Jeff Burke 2006-10-26 16:04:47 UTC
Eric, 
  Sorry for the delay; the test is running now. I will update when it is finished.

Thanks,
Jeff

Comment 26 Jeff Burke 2006-10-27 20:08:15 UTC
Eric,
  I was able to run the test successfully. It ran for 24 hours. I also ran this
test on several other systems with all of the x86_64 kernel variants.

Jeff

Comment 27 Eric Sandeen 2006-10-27 20:27:41 UTC
Excellent!  Thanks Jeff.

Comment 28 Eric Sandeen 2006-11-07 19:17:01 UTC
Sent to rhkernel-list on 11/7/06

Comment 31 Jason Baron 2007-01-10 18:58:59 UTC
Committed in stream U5, build 42.40. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 35 Red Hat Bugzilla 2007-05-02 00:00:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html