Bug 170554
Summary: Kernel OOPs for rpciod process doing a null pointer deref

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 2.1 | Reporter: | Greg Marsden <greg.marsden> |
| Component: | kernel | Assignee: | Don Howard <dhoward> |
| Status: | CLOSED WONTFIX | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | CC: | cel, lwoodman, steved, tao |
| Version: | 2.1 | Hardware: | All |
| Target Milestone: | --- | OS: | Linux |
| Target Release: | --- | Whiteboard: | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2006-09-11 16:26:22 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | Bug Depends On: | |
| Bug Blocks: | 143573 | Attachments: | 123404, 127576 (described in the comments below) |
Description (Greg Marsden, 2005-10-12 20:42:44 UTC):
alternate oops:

```
Kernel 2.4.9-e.40.5.testenterprise
CPU:    2
EIP:    0010:[<f8d023b8>]    Not tainted
EFLAGS: 00010282
EIP is at inode_append_flushd [nfs] 0x38
eax: 00000000   ebx: f51a1b00   ecx: f14a9740   edx: f51a1b1c
esi: 00000000   edi: 0ce1f957   ebp: f51a1b00   esp: f50d7f34
ds: 0018   es: 0018   ss: 0018
Process rpciod (pid: 817, stackpage=f50d7000)
Stack: f14a9740 f8d0264b f14a9740 00000000 d7c98734 d7c986e0 d7c98734 00000001
       f8ce6169 d7c986e0 f50d6000 c0357ac0 c0105b4d c96a8000 c0350a00 c96a8360
       c96a8000 f50d6000 c96a8000 c96a9fbc c0119a65 f50d7fb4 f50d6000 f50d6000
Call Trace:
 [<f8d0264b>] nfs_flushd [sunrpc] 0x11b (0xf50d7f38)
 [<f8ce6169>] __rpc_execute [sunrpc] 0xc9 (0xf50d7f54)
 [<c0105b4d>] __switch_to [kernel] 0x3d (0xf50d7f64)
 [<c0119a65>] schedule [kernel] 0x385 (0xf50d7f84)
 [<f8ce65b8>] __rpc_schedule [sunrpc] 0x178 (0xf50d7fa0)
 [<f8ce6f29>] rpciod [sunrpc] 0xf9 (0xf50d7fb8)
 [<f8cf563c>] rpciod_killer [sunrpc] 0x0 (0xf50d7fbc)
 [<c0105856>] arch_kernel_thread [kernel] 0x26 (0xf50d7ff0)
 [<f8cf563c>] rpciod_killer [sunrpc] 0x0 (0xf50d7ff4)
 [<f8ce6e30>] rpciod [sunrpc] 0x0 (0xf50d7ff8)
```

this patch does not resolve the issue:

OK, here's the patch I was thinking of. I did it just from code inspection, and I don't know if it has any negative performance impact, but I'm building a kernel with it now.

```diff
--- linux/fs/nfs/flushd.c.bak	Wed Aug 18 13:16:43 2004
+++ linux/fs/nfs/flushd.c	Wed Aug 18 13:18:31 2004
@@ -186,6 +186,8 @@ void inode_remove_flushd(struct inode *i
 	if (!(NFS_FLAGS(inode) & NFS_INO_FLUSH))
 		return;
 
+	lock_kernel();
+
 	q = &cache->inodes;
 	while (*q && *q != inode)
 		q = &NFS_I(*q)->hash_next;
@@ -194,6 +196,9 @@ void inode_remove_flushd(struct inode *i
 		NFS_FLAGS(inode) &= ~NFS_INO_FLUSH;
 		iput(inode);
 	}
+
+	unlock_kernel();
+
 }
 
 void inode_schedule_scan(struct inode *inode, unsigned long time)
@@ -255,7 +260,9 @@ nfs_flushd(struct rpc_task *task)
 	}
 
 	if (nfs_have_writebacks(inode) || nfs_have_read(inode)) {
+		lock_kernel();
 		inode_append_flushd(inode);
+		unlock_kernel();
 		if (time_after(delay, NFS_NEXTSCAN(inode)))
 			delay = NFS_NEXTSCAN(inode);
 	}
```

This event sent from IssueTracker by peterm

This from Chuck Lever...

RCA shows that the NULL deref comes from NFS_I(). In mainline code of this era, this is always a valid address because it points to part of the in-core inode. In RHEL AS 2.1, which has Viro's inode shrinkage patch, this address could be made invalid by nfs_clear_inode() after an iput() call. What I surmise is happening is that we have some outstanding writes after a file is closed, and that is exposing a bug. A couple of things, then:

1. I think we have an i_count imbalance somewhere. I don't immediately see anything wrong with the existing NFS code base; there could be an issue higher up (like in the VFS layer).
2. This issue is only in RHEL AS 2.1, not in the stock kernels, so I think RH should take a hard look at this.

For the record, the changelog comment:

```
* Thu Aug 2 2001 Arjan van de Ven <arjanv>
- fixed bug where inodes were cached WAAY too long (vm bug)
```

came from a 2.4.7 patch that no longer exists in our pool. Also, using a mainline 2.4.9 kernel is probably not a valid reference point, since the VM in our AS21 kernel is based on 2.4.12 with some "special" VM changes, from what I've been told...

*** Bug 154542 has been marked as a duplicate of this bug. ***

I see evidence of a couple of netdumps in posts to bz154542 and it43813, but I don't have one to look at. Also, the patch mentioned in comment #2 was not included in Pensacola kernels until e.64.
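[Editor's note: for orientation, a minimal sketch of the macro at the center of the RCA. The definition shown is the mainline 2.4 one (the RHEL AS 2.1 shrinkage patch changes this arrangement); the list walk is quoted from the flushd.c patch above.]

```c
/* Mainline 2.4: the NFS-private data is embedded in the in-core inode,
 * so NFS_I(inode) stays valid for as long as the inode itself exists. */
#define NFS_I(inode)	(&(inode)->u.nfs_i)

/* With the inode-shrinkage patch, the private data can be torn down by
 * nfs_clear_inode() after a final iput(), so the flushd list walk
 * dereferences a NULL/stale pointer: */
while (*q && *q != inode)
	q = &NFS_I(*q)->hash_next;	/* faults at inode_append_flushd + 0x38 */
```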
It looks like folks have applied the patch to test kernels and have found it lacking. Can I get either:

1) A netdump demonstrating the bug from stock e.64 (or a later stock kernel)
2) A netdump demonstrating the bug from a test kernel + the test kernel srpm (binary and debug images would be handy too)

We're running a version of e.49 with relevant patches applied. You can find the srpms for those at: http://oss.oracle.com/kernel/ I'm checking into whether the vmcore for the crashes was saved anywhere, but I'm doubtful.

You can find the vmcore at http://oss.oracle.com/~gmarsden/rpciod/

Hi Greg - I'm unable to load the vmcore from http://oss.oracle.com/~gmarsden/rpciod/ with a kernel image built from http://oss.oracle.com/kernel/2.4.9-e.49orags1/kernel-2.4.9-e.49orags1.src.rpm Do you have binary kernel images that match the vmcore?

Sorry, the best I can do is the binary image, which I've just uploaded to that same dir. I don't have the -debuginfo packages.

That image seems to do the trick. There appears to be a freed inode on the NFS_REQUESTLIST:

```
#2 [f4af5ef4] error_code (via page_fault) at c01074de
   EAX: 00000000  EBX: f5907bc0  ECX: f3924200  EDX: f5907bdc  EBP: f5907bc0
   DS:  0018      ESI: 00000000  ES:  0018      EDI: 0d0300e8
   CS:  0010      EIP: f8d083b8  ERR: ffffffff  EFLAGS: 00010282
#3 [f4af5f30] inode_append_flushd at f8d083b8
#4 [f4af5f38] nfs_flushd at f8d08646

crash> kmem -s cf3a5e40
CACHE     NAME           OBJSIZE  ALLOCATED   TOTAL  SLABS  SSIZE
c7666564  inode_cache        448     108452  211563  23507     4k
SLAB      MEMORY      TOTAL  ALLOCATED  FREE
cf3a5000  cf3a5040        9          8     1
FREE / [ALLOCATED]
  cf3a5e40  (cpu 1 cache)

crash> kmem -S inode_cache | less
...
SLAB      MEMORY      TOTAL  ALLOCATED  FREE
cf3a5000  cf3a5040        9          8     1
FREE / [ALLOCATED]
  [cf3a5040]  [cf3a5200]  [cf3a53c0]  [cf3a5580]
  [cf3a5740]  [cf3a5900]  [cf3a5ac0]  [cf3a5c80]
   cf3a5e40  (cpu 1 cache)
```

This could be due to a double free of the inode. As suggested, I'll examine the inode shrinkage patch and look for possible issues with it. Is there any chance that this can be reproduced with a slab-debugging kernel?

Don, is the next action for Greg at Oracle or for NetApp to test? What is the severity of this issue? Thanks!

A means to reproduce the oops would be most helpful.

Unfortunately, I can't really offer a means to reproduce. It's sporadic and shows up after days-to-weeks of continuous operation. We also can't deploy the slab-debug kernel, because the problem shows up on random machines and we don't have the luxury of deploying it on all 1000s of potential machines. Let me know what you need to debug with, and I'll get that information.
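[Editor's note: before the mitigation patch below, a hedged reconstruction of the failure sequence implied by the analysis so far. Only the last line is tree code (the walk quoted from flushd.c above); the rest is commentary, not code from the kernel.]

```c
inode_append_flushd(inode);	/* inode chained on cache->inodes */
iput(inode);			/* final reference dropped; with the
				 * inode-shrinkage patch, nfs_clear_inode()
				 * frees the NFS-private data and the inode
				 * returns to inode_cache (the kmem output
				 * above shows the object FREE) */
/* rpciod's next list walk then trips over the stale entry: */
q = &NFS_I(*q)->hash_next;	/* NULL deref at inode_append_flushd + 0x38 */
```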
Created attachment 123404 [details]
patch to look for null in inode_append_flushd and nfs_delete_inode

patch to deal with getting null pointers in inode_append_flushd and nfs_delete_inode
Hi Greg - A couple of comments:

```diff
+	/* First validate the inode passed in */
+	if (!inode || !&(inode)->i_sb || !&(inode)->i_sb->u
+	    || !NFS_REQUESTLIST(inode))
```

Don't you want to test the value of inode->i_sb / inode->i_sb->u rather than the address? Also, note that the inode could be freed after it is placed on the list.

```diff
+	{
+		if (!NFS_I(*q))
+		{
+			printk(KERN_ERR __FILE__ ": caught bad value from &NFS_I(*q)\n");
+
+			/* Clean up state -- insert THIS inode
+			 * where the BAD inode was found!
+			 * (dont walk to end of list) */
+			break;
+		}
+
 		/* This is inode_append_flushd + 0x38 */
 		q = &NFS_I(*q)->hash_next;
+	}
```

Hrm. Is this safe for a production system? (Would data corruption be worse than the crash?)

Hi Don, thanks for the review. For your first comment: the inode being freed after it's placed on the list is exactly the problem we're encountering right now. I should take out the address check; it's not actually necessary, given that we haven't seen a null inode actually passed in.

As for the second part, it's provably as safe as the current code incarnation:

1. The oops happens at the line `q = &NFS_I(*q)->hash_next;`.
2. At the point where the oops happens, NFS_I(*q) is NULL, so flushd cannot traverse past that point (and any inodes which used to be pointed to by ->hash_next will never be reached).
3. This code checks for a bad value in NFS_I(*q) and inserts the new inode at that point in the list, simultaneously repairing the inconsistency.

Therefore, the result is no worse than leaving the code path alone.

I've done some more digging in the vmcore - I was able to verify that the inode that nfs_delete_inode() complains about is the first item on the request list, and is the one that we trip over in inode_append_flushd(). That message points to the source of the problem, so it would be very useful to see a backtrace from nfs_delete_inode(). Any chance that the debugging patch above could be modified to generate a backtrace, or better yet, a panic (and vmcore), in nfs_delete_inode()?

Hi Greg - Is there any chance that the customer could run a kernel with nfs_delete_inode() instrumented as I've suggested? Normally I'd supply a debug kernel for you, but it appears that this is happening on a kernel that is custom for Oracle. A vmcore from the early panic in nfs_delete_inode() should allow me to see the contents of the offending inode before it's cleared, as well as the stack leading up to our first indication of trouble.

That'll require rolling out a new kernel. Not that I'm opposed to it, but it'll take time. Keep me posted.

I've reached the end of what I can find with the crash in inode_append_flush().

Greg/Don, any updates to this issue?

Greg, any news?

Raising priority to HIGH per NetApp.

NetApp is asking what the next actions are for this. Who has the ball on this? If a vmcore cannot be provided, I'm thinking we may be at the limit of troubleshooting this. Thoughts?

From comment #27, I was under the impression that Oracle would attempt to capture a vmcore instrumented as I suggested in comment #26. In the vmcores that I've seen thus far, the source of the problem has already been covered up, as the inode in question has been cleared. I can look again to see if there are some clues that I've missed. Please continue trying to reproduce the problem, and collect a vmcore as requested.

Don, do you have any suggestions about where to place the panic() call to get the best (earliest) information possible?

Created attachment 127576 [details]
call panic() instead of printk() in nfs_delete_inode() when the inode has pending rpc requests.
Here's the patch I was thinking of.
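[Editor's note: a hedged sketch of what the attachment describes -- promoting the existing warning in nfs_delete_inode() to a panic() so a vmcore is captured while the offending inode is still intact. The function body is approximated from the mainline 2.4 tree; attachment 127576 is the authoritative patch.]

```c
static void nfs_delete_inode(struct inode *inode)
{
	dprintk("NFS: delete_inode(%x/%ld)\n", inode->i_dev, inode->i_ino);

	/* Previously a printk(KERN_ERR ...).  Panicking here produces the
	 * earliest possible vmcore, before clear_inode() wipes the state
	 * that later analysis would need. */
	if (nfs_have_writebacks(inode) || nfs_have_read(inode))
		panic("nfs_delete_inode: inode %ld has pending RPC requests\n",
		      inode->i_ino);

	clear_inode(inode);
}
```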
Greg, given there has been little traction on this, it looks as though we are at the end of what we (Red Hat) can do without a vmcore. If a vmcore cannot be provided, I'd like to suggest that this bug/issue be closed.

Yes, I recognize that we need to stick the panic() call in there, I just haven't had time. Thanks, Don, for uploading the patch; I'll get it onto an appropriate system once some of the other issues have died down.

Greg, thanks for the update - please keep us in the loop. - Andrius.

Greg, keep us posted on your progress with this; a 2.1 maintenance release is scheduled for the end of June. The next one could be some time later.

Greg, hope things are going well with this. I'm assuming you won't have anything to address this in the near future?

Greg, still awaiting Oracle feedback, thanks.

Action for NetApp: Oracle has stated they do not have the customer infrastructure available to test this issue. Can this be tested by you all? If not, this issue will have to be closed.

Waiting on NetApp feedback.

NetApp: Last call on this item. This issue will be closed as WONTFIX on 5-Sep-06 if there are no plans to address this from NetApp.

Closing issue as WONTFIX, due to the fact that neither NetApp, Oracle, nor RH can provide QA/verification for this.