Bug 170554

Summary: Kernel OOPs for rpciod process doing a null pointer deref
Product: Red Hat Enterprise Linux 2.1
Component: kernel
Version: 2.1
Hardware: All
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: high
Reporter: Greg Marsden <greg.marsden>
Assignee: Don Howard <dhoward>
QA Contact: Brian Brock <bbrock>
CC: cel, lwoodman, steved, tao
Doc Type: Bug Fix
Bug Blocks: 143573
Last Closed: 2006-09-11 16:26:22 UTC
Attachments:
  - patch to look for null in inode_append_flushd and nfs_delete_inode
  - call panic() instead of printk() in nfs_delete_inode() when the inode has pending rpc requests

Description Greg Marsden 2005-10-12 20:42:44 UTC
Oops: 0000
Kernel 2.4.9-e.49enterprise
nfs_delete_inode: inode 26480102 has pending RPC requests
Unable to handle kernel NULL pointer dereference at virtual address 000000b4
*pde = 32e47001
printing eip:
f8f2a3b8
CPU:    1
EIP:    0010:[<f8f2a3b8>]    Not tainted
EFLAGS: 00010286
EIP is at inode_append_flushd [nfs] 0x38
eax: 00000000   ebx: f3968fe0   ecx: f2a7ec80   edx: f3968ffc
esi: 00000000   edi: 12b68922   ebp: f3968fe0   esp: f3985f34
ds: 0018   es: 0018   ss: 0018
Process rpciod (pid: 950, stackpage=f3985000)
Stack: f2a7ec80 f8f2a64b f2a7ec80 00000000 f66b73f4 f66b73a0 f66b73f4 00000002
       f8f0e169 f66b73a0 f3984000 c0359120 00000000 00000002 c0352900 f2c00360
      < netdump activated - performing handshake with the client. >

Not easily reproducible, but it does recur on some production systems.

Comment 1 Greg Marsden 2005-10-12 20:43:58 UTC
alternate oops:
Kernel 2.4.9-e.40.5.testenterprise
CPU:    2
EIP:    0010:[<f8d023b8>]    Not tainted
EFLAGS: 00010282
EIP is at inode_append_flushd [nfs] 0x38 
eax: 00000000   ebx: f51a1b00   ecx: f14a9740   edx: f51a1b1c
esi: 00000000   edi: 0ce1f957   ebp: f51a1b00   esp: f50d7f34
ds: 0018   es: 0018   ss: 0018
Process rpciod (pid: 817, stackpage=f50d7000)
Stack: f14a9740 f8d0264b f14a9740 00000000 d7c98734 d7c986e0 d7c98734 00000001 
       f8ce6169 d7c986e0 f50d6000 c0357ac0 c0105b4d c96a8000 c0350a00 c96a8360 
       c96a8000 f50d6000 c96a8000 c96a9fbc c0119a65 f50d7fb4 f50d6000 f50d6000 
Call Trace: [<f8d0264b>] nfs_flushd [nfs] 0x11b (0xf50d7f38)
[<f8ce6169>] __rpc_execute [sunrpc] 0xc9 (0xf50d7f54)
[<c0105b4d>] __switch_to [kernel] 0x3d (0xf50d7f64)
[<c0119a65>] schedule [kernel] 0x385 (0xf50d7f84)
[<f8ce65b8>] __rpc_schedule [sunrpc] 0x178 (0xf50d7fa0)
[<f8ce6f29>] rpciod [sunrpc] 0xf9 (0xf50d7fb8)
[<f8cf563c>] rpciod_killer [sunrpc] 0x0 (0xf50d7fbc)
[<c0105856>] arch_kernel_thread [kernel] 0x26 (0xf50d7ff0)
[<f8cf563c>] rpciod_killer [sunrpc] 0x0 (0xf50d7ff4)
[<f8ce6e30>] rpciod [sunrpc] 0x0 (0xf50d7ff8)



Comment 2 Greg Marsden 2005-10-12 20:45:20 UTC
This patch does not resolve the issue:
	OK, here's the patch I was thinking of. I did it just from code inspection,
	and I don't know if it has any negative performance impact, but I'm
	building a kernel with it now.
--- linux/fs/nfs/flushd.c.bak   Wed Aug 18 13:16:43 2004
+++ linux/fs/nfs/flushd.c       Wed Aug 18 13:18:31 2004
@@ -186,6 +186,8 @@ void inode_remove_flushd(struct inode *i
        if (!(NFS_FLAGS(inode) & NFS_INO_FLUSH))
                return;
+       lock_kernel();
+
        q = &cache->inodes;
        while (*q && *q != inode)
                q = &NFS_I(*q)->hash_next;
@@ -194,6 +196,9 @@ void inode_remove_flushd(struct inode *i
                NFS_FLAGS(inode) &= ~NFS_INO_FLUSH;
                iput(inode);
        }
+
+       unlock_kernel();
+
 }
 void inode_schedule_scan(struct inode *inode, unsigned long time)
@@ -255,7 +260,9 @@ nfs_flushd(struct rpc_task *task)
                }
                if (nfs_have_writebacks(inode) || nfs_have_read(inode)) {
+                       lock_kernel();
                        inode_append_flushd(inode);
+                       unlock_kernel();
                        if (time_after(delay, NFS_NEXTSCAN(inode)))
                                delay = NFS_NEXTSCAN(inode);
                }
This event sent from IssueTracker by peterm
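
For reference, the following is a simplified userspace sketch of the serialization idea in
the patch above, with a pthread mutex standing in for lock_kernel()/unlock_kernel(); the
list and function names are illustrative assumptions, not the RHEL source.

#include <pthread.h>
#include <stddef.h>

struct node { struct node *next; };

static struct node *flush_list;
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;    /* stand-in for the BKL */

/* Append in the style of inode_append_flushd(): walk to the tail (or find the
 * node already present) and link it in, all under the lock. */
static void list_append(struct node *n)
{
        struct node **q;

        pthread_mutex_lock(&big_lock);          /* lock_kernel() in the patch */
        for (q = &flush_list; *q && *q != n; q = &(*q)->next)
                ;
        if (*q != n) {
                n->next = NULL;
                *q = n;
        }
        pthread_mutex_unlock(&big_lock);        /* unlock_kernel() */
}

/* Removal in the style of inode_remove_flushd(): find the node and unhook it.
 * Without a common lock, an append walking ->next can race with this unhook. */
static void list_remove(struct node *n)
{
        struct node **q;

        pthread_mutex_lock(&big_lock);
        for (q = &flush_list; *q && *q != n; q = &(*q)->next)
                ;
        if (*q) {
                *q = n->next;
                n->next = NULL;
        }
        pthread_mutex_unlock(&big_lock);
}

int main(void)
{
        struct node a = { NULL }, b = { NULL };

        list_append(&a);
        list_append(&b);
        list_remove(&a);
        return 0;
}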
              


Comment 3 Greg Marsden 2005-10-12 20:45:47 UTC
This from chuck lever...

  RCA shows that the NULL deref comes from NFS_I(). In mainline code of
  this era, this is always a valid address because it points to part of
  the in-core inode. In RHEL AS 2.1, which has Viro's inode shrinkage
  patch, this address could be made invalid by nfs_clear_inode() after
  an iput() call. What I surmise is happening is that we have some
  outstanding writes after a file is closed, and that is exposing a bug.

  A couple of things, then:

  1. I think we have an i_count imbalance somewhere. I don't
  immediately see anything wrong with the existing NFS code base. There
  could be an issue higher up (like in the VFS layer).

  2. This issue is only in RHEL AS 2.1, not in the stock kernels, so I
  think RH should take a hard look at this.
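
To make that lifetime difference concrete, here is a self-contained userspace mock; the
struct layout, field names, and NFS_I() definition below are assumptions for illustration,
not the RHEL AS 2.1 source.

#include <stdio.h>
#include <stdlib.h>

struct mock_inode;

struct nfs_inode_info {
        struct mock_inode *hash_next;
        /* ...flushd bookkeeping, request lists, etc... */
};

struct mock_inode {
        /* Mainline 2.4 keeps nfs_inode_info inside the inode union, so NFS_I()
         * is always a valid address for a live inode.  With the inode-shrinkage
         * patch, the NFS private data is (assumed here) a separately allocated
         * object that nfs_clear_inode() releases on the final iput(); a list
         * that still references the inode then reads freed or NULL data. */
        struct nfs_inode_info *nfs_private;     /* assumed shrinkage-patch layout */
};

#define NFS_I(ino) ((ino)->nfs_private)

int main(void)
{
        struct mock_inode ino;

        ino.nfs_private = calloc(1, sizeof(*ino.nfs_private));

        /* ...the inode is placed on the flushd request list here... */

        /* Final iput() -> nfs_clear_inode(): the private data goes away while
         * the list entry survives. */
        free(ino.nfs_private);
        ino.nfs_private = NULL;

        /* Later, rpciod walks the list; this is the faulting pattern at
         * inode_append_flushd + 0x38, i.e. NFS_I(*q)->hash_next through a
         * stale pointer. */
        if (NFS_I(&ino) == NULL)
                printf("NFS_I() is gone; dereferencing ->hash_next would oops\n");
        return 0;
}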




Comment 5 Steve Dickson 2005-10-13 13:34:05 UTC
For the record, the changelog comment:

* Thu Aug 2 2001 Arjan van de Ven <arjanv>
- fixed bug where inodes were cached WAAY too long (vm bug)

came from a 2.4.7 patch that no longer exists in our pool.

Also, using a mainline 2.4.9 kernel is probably not a
valid reference point, since the VM in our AS21 kernel
is based on 2.4.12 with some "special" VM changes,
from what I've been told...


Comment 7 Don Howard 2005-11-03 00:45:44 UTC
*** Bug 154542 has been marked as a duplicate of this bug. ***

Comment 8 Don Howard 2005-11-03 21:24:36 UTC
I see evidence of a couple of netdumps in posts to bz154542 and it43813, but I
don't have one to look at.

Also, the patch mentioned in comment #2 was not included in pensacola kernels
until e.64.

It looks like folks have applied the patch to test kernels and have found it
lacking.

Can I get either:
1) A netdump demonstrating the bug from stock e.64 (or later stock kernel)
2) A netdump demonstrating the bug from a test kernel + test kernel srpm
   (binary and debug images would be handy too)


Comment 9 Greg Marsden 2005-11-03 21:38:56 UTC
We're running a version of e.49 with the relevant patches applied.

you can find the srpms for those at:
http://oss.oracle.com/kernel/

I'm checking into whether the vmcore for the crashes was saved anywhere, but I'm
doubtful.

Comment 10 Greg Marsden 2005-11-04 00:00:25 UTC
You can find the vmcore at
http://oss.oracle.com/~gmarsden/rpciod/


Comment 12 Don Howard 2005-11-21 22:44:28 UTC
Hi Greg -

I'm unable to load the vmcore from http://oss.oracle.com/~gmarsden/rpciod/
with a kernel image built from
http://oss.oracle.com/kernel/2.4.9-e.49orags1/kernel-2.4.9-e.49orags1.src.rpm

Do you have binary kernel images that match the vmcore?

Comment 13 Greg Marsden 2005-11-30 21:03:47 UTC
Sorry, the best I can do is the binary image, which I've just
uploaded to that same dir. I don't have the -debuginfo packages.


Comment 14 Don Howard 2005-12-02 17:10:34 UTC
That image seems to do the trick.



There appears to be a freed inode on the NFS_REQUESTLIST:

#2 [f4af5ef4] error_code (via page_fault) at c01074de
    EAX: 00000000  EBX: f5907bc0  ECX: f3924200  EDX: f5907bdc  EBP: f5907bc0
    DS:  0018      ESI: 00000000  ES:  0018      EDI: 0d0300e8
    CS:  0010      EIP: f8d083b8  ERR: ffffffff  EFLAGS: 00010282
 #3 [f4af5f30] inode_append_flushd at f8d083b8
 #4 [f4af5f38] nfs_flushd at f8d08646


crash> kmem -s cf3a5e40
CACHE    NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
c7666564 inode_cache              448     108452    211563  23507     4k
SLAB      MEMORY    TOTAL  ALLOCATED  FREE
cf3a5000  cf3a5040      9          8     1
FREE / [ALLOCATED]
   cf3a5e40  (cpu 1 cache)


crash> kmem -S  inode_cache | less
  ...
SLAB      MEMORY    TOTAL  ALLOCATED  FREE
cf3a5000  cf3a5040      9          8     1
FREE / [ALLOCATED]
  [cf3a5040]
  [cf3a5200]
  [cf3a53c0]
  [cf3a5580]
  [cf3a5740]
  [cf3a5900]
  [cf3a5ac0]
  [cf3a5c80]
   cf3a5e40  (cpu 1 cache)


This could be due to a double free of the inode. As suggested, I'll examine
the inode shrinkage patch and look for possible issues with it.

Is there any chance that this can be reproduced with a slab-debugging kernel?



Comment 16 Andrius Benokraitis 2006-01-09 15:19:48 UTC
Don, is the next action for Greg at Oracle or for NetApp to test? What is the
severity of this issue? Thanks!

Comment 17 Don Howard 2006-01-09 17:29:26 UTC
A means to reproduce the oops would be most helpful.

Comment 19 Greg Marsden 2006-01-09 17:59:44 UTC
Unfortunately, I can't really offer a means to reproduce. It's sporadic and
shows up after days to weeks of continuous operation. We also can't deploy the
slab-debug kernel because the problem shows up on random machines, and we don't
have the luxury of deploying it on the thousands of potential machines. Let me
know what you need to debug with, and I'll get that information.

Comment 20 Greg Marsden 2006-01-18 21:06:53 UTC
Created attachment 123404 [details]
patch to look for null in inode_append_flushd and nfs_delete_inode

patch to deal with getting null pointers in inode_append_flushd and
nfs_delete_inode

Comment 21 Don Howard 2006-01-19 19:20:45 UTC
Hi Greg -

A couple of comments:

+       /* First validate the inode passed in */
+       if (!inode || !&(inode)->i_sb || !&(inode)->i_sb->u
+                       || !NFS_REQUESTLIST(inode))

Don't you want to test the value of inode->i_sb / inode->i_sb->u rather than the
address?  Also, note that the inode could be freed after it is placed on the list.
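
To illustrate the first point (hypothetical code, not the patch itself): taking the
address of a member can never yield NULL, so a check on the address is a no-op, while the
intended guard is a check on the stored value.

#include <stdio.h>
#include <stddef.h>

struct super_block { int dummy; };
struct inode { struct super_block *i_sb; };

/* Always 0 for a non-NULL inode: &inode->i_sb is a valid address. */
static int check_by_address(struct inode *inode)
{
        return !&inode->i_sb;
}

/* True when the inode has no superblock pointer set -- presumably what the
 * patch meant to test. */
static int check_by_value(struct inode *inode)
{
        return inode->i_sb == NULL;
}

int main(void)
{
        struct inode ino = { NULL };

        printf("by address: %d, by value: %d\n",
               check_by_address(&ino), check_by_value(&ino));
        return 0;
}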



+       {
+               if (!NFS_I(*q))
+               {
+
+                       printk(KERN_ERR __FILE__ ": caught bad value from &NFS_I(*q)\n");
+
+                       /* Clean up state  -- insert THIS inode
+                        * where the BAD inode was found!
+                        * (dont walk to end of list) */
+                       break;
+               }
+
+               /* This is inode_append_flushd + 0x38 */
                q = &NFS_I(*q)->hash_next;
+       }


Hrm.  Is this safe for a production system? (Would data corruption be worse than
the crash?)   





Comment 22 Greg Marsden 2006-01-19 20:12:46 UTC
Hi Don, thanks for the review. For your first comment, the inode being freed
after it's placed on the list is exactly the problem we're encountering right
now. I should take out the address-of checks; they aren't actually necessary,
given that we haven't seen a null inode actually passed in.

As for the second part, it's provably as safe as the current code incarnation:
1. The oops happens at the line
  q = &NFS_I(*q)->hash_next;
2. At the point where the oops happens, NFS_I(*q) is NULL, so flushd
cannot traverse past that point (and any inodes that used to be reachable via
->hash_next will never be reached).
3. This code checks for a bad value in NFS_I(*q) and inserts the new inode at
that point in the list, simultaneously repairing the inconsistency.

Therefore, the result is no worse than leaving the code path alone.
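
A compact userspace mock of that argument (the names and the simplified insert step are
assumptions, not the attached patch verbatim): the unpatched walk dereferences NFS_I(*q)
unconditionally and oopses on a bad entry, while the checked walk stops at the bad entry
and splices the new inode in there, so it reaches no fewer live entries than the original.

#include <stdio.h>
#include <stddef.h>

struct mock_inode;
struct nfs_info { struct mock_inode *hash_next; };
struct mock_inode { struct nfs_info *priv; };

#define NFS_I(ino) ((ino)->priv)

/* Checked walk: behaves like the original until a bad entry is hit, then
 * repairs the list instead of faulting. */
static void append_checked(struct mock_inode **q, struct mock_inode *inode)
{
        while (*q && *q != inode) {
                if (!NFS_I(*q)) {
                        printf("caught bad value from NFS_I(*q)\n");
                        /* Stop here; everything past this entry was
                         * unreachable anyway. */
                        break;
                }
                /* The unpatched code does this dereference unconditionally
                 * and oopses on the bad entry. */
                q = &NFS_I(*q)->hash_next;
        }
        if (*q != inode) {
                /* Insert the new inode at this position: at the tail in the
                 * normal case, over the bad entry in the repair case. */
                NFS_I(inode)->hash_next = NULL;
                *q = inode;
        }
}

int main(void)
{
        struct nfs_info good_i = { NULL }, new_i = { NULL };
        struct mock_inode good = { &good_i }, bad = { NULL }, fresh = { &new_i };
        struct mock_inode *list = &good;

        good_i.hash_next = &bad;        /* list: good -> bad (already cleared) */
        append_checked(&list, &fresh);  /* repaired list: good -> fresh */
        return 0;
}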

Comment 25 Don Howard 2006-01-27 00:23:10 UTC
I've done some more digging in the vmcore - 

I was able to verify that the inode that nfs_delete_inode() complains about is
the  first item on the request list, and is the one that we trip over in
inode_append_flushd().  

That message points to the source of the problem, so it would be very useful to
see a backtrace from nfs_delete_inode().  Any chance that the debuging patch
above could be modified to generate a backtrace or beter yet, a panic (and
vmcore) in nfs_delete_inode()?




Comment 26 Don Howard 2006-02-01 23:45:34 UTC
Hi Greg -

Is there any chance that the customer could run a kernel with nfs_delete_inode()
instrumented as I've suggested? Normally I'd supply a debug kernel for you, but
it appears that this is happening on a kernel that is custom for Oracle.

A vmcore from the early panic in nfs_delete_inode() should allow me to see the
contents of the offending inode before it's cleared, as well as the stack
leading up to our first indication of trouble.


Comment 27 Greg Marsden 2006-02-04 01:11:40 UTC
That'll require rolling out a new kernel. Not that I'm opposed to it, but it'll
take time.

Comment 28 Don Howard 2006-02-04 02:09:30 UTC
Keep me posted.  I've reached the end of what I can find with the crash in
inode_append_flushd().



Comment 29 Andrius Benokraitis 2006-03-06 16:09:55 UTC
Greg/Don, any updates to this issue?

Comment 31 Bob Johnson 2006-03-29 18:22:11 UTC
Greg, any news ?

Comment 32 Andrius Benokraitis 2006-04-03 18:17:16 UTC
Raising priority to HIGH per NetApp. NetApp is asking what the next actions are
for this. Who has the ball on this? If a vmcore cannot be provided, I'm thinking
we may be at the limit of troubleshooting this. Thoughts?

Comment 33 Don Howard 2006-04-06 18:44:37 UTC
From comment #27, I was under the impression that Oracle would attempt to
capture a vmcore with the kernel instrumented as I suggested in comment #26.
In the vmcores that I've seen thus far, the source of the problem has already
been covered up, as the inode in question has been cleared.

I can look again to see if there are some clues that I've missed.  
Please continue trying to reproduce the problem, and collect a vmcore as requested.

Comment 34 Chuck Lever 2006-04-06 19:40:10 UTC
don, do you have any suggestions about where to place the panic() call to get
the best (earliest) information possible?

Comment 35 Don Howard 2006-04-10 21:32:25 UTC
Created attachment 127576 [details]
call panic() instead of printk() in nfs_delete_inode() when the inode has pending rpc requests.

Here's the patch I was thinking of.
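
Since the attachment body is not reproduced in this report, the following is only a
hypothetical userspace sketch of the idea it describes: replace the existing printk()
warning in nfs_delete_inode() with a panic() so that netdump captures a vmcore while the
offending inode and its pending requests are still intact.

#include <stdio.h>
#include <stdlib.h>

struct mock_inode { unsigned long i_ino; int pending_rpc; };

/* Userspace stand-in for the kernel's panic(): report and stop the world so a
 * dump can be taken. */
static void panic_mock(const char *who, unsigned long ino)
{
        fprintf(stderr, "%s: inode %lu has pending RPC requests\n", who, ino);
        abort();
}

static void nfs_delete_inode_sketch(struct mock_inode *inode)
{
        if (inode->pending_rpc)
                /* Previously a printk(); panicking here preserves the inode
                 * state for the vmcore. */
                panic_mock("nfs_delete_inode", inode->i_ino);
        /* ...normal inode teardown would continue here... */
}

int main(void)
{
        struct mock_inode ok = { 42, 0 };

        nfs_delete_inode_sketch(&ok);   /* no pending requests: returns normally */
        return 0;
}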

Comment 37 Andrius Benokraitis 2006-04-11 14:45:21 UTC
Greg, given there has been little traction on this, it looks as though we are at
the end of what we (Red Hat) can do without a vmcore. If a vmcore cannot be
provided, I'd like to suggest that this bug/issue be closed.

Comment 38 Greg Marsden 2006-04-11 17:25:30 UTC
Yes, I recognize that we need to stick the panic() call in there; I just haven't
had time. Thanks, Don, for uploading the patch. I'll get it onto an appropriate
system once some of the other issues have died down.

Comment 39 Andrius Benokraitis 2006-05-01 04:01:54 UTC
Greg, thanks for the update - please keep us in the loop. - Andrius.

Comment 40 Andrius Benokraitis 2006-06-14 18:57:24 UTC
Greg, keep us posted on your progress with this; a 2.1 maintenance release is
scheduled for the end of June. The next one could be some time later.

Comment 41 Andrius Benokraitis 2006-06-26 15:06:07 UTC
Greg, hope things are going well with this. I'm assuming you won't have anything
to address this in the near future?

Comment 42 Bob Johnson 2006-07-24 17:00:04 UTC
Greg, still awaiting Oracle feedback, thanks.

Comment 44 Andrius Benokraitis 2006-08-01 20:42:52 UTC
Action for NetApp: Oracle has stated they do not have the customer
infrastructure available to test this issue. Can this be tested by you all? If
not, this issue will have to be closed. Waiting on NetApp feedback.

Comment 45 Andrius Benokraitis 2006-08-29 14:12:51 UTC
NetApp: Last call on this item. This issue will be closed as WONTFIX on 5-Sep-06
if there are no plans to address this from NetApp.

Comment 48 Andrius Benokraitis 2006-09-11 16:28:48 UTC
Closing this issue as WONTFIX because neither NetApp, Oracle, nor Red Hat can
provide QA/verification for it.