Bug 1312451

Summary: Oops crash in shrink_dcache_parent() while deleting a d_lru dentry
Product: Red Hat Enterprise Linux 5 Reporter: Stanislav Saner <ssaner>
Component: kernel Assignee: Denys Vlasenko <dvlasenk>
kernel sub component: File Systems QA Contact: Filesystem QE <fs-qe>
Status: CLOSED EOL Docs Contact:
Severity: high    
Priority: unspecified CC: dhoward, jpittman, nmurray
Version: 5.11   
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-02-15 07:28:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Stanislav Saner 2016-02-26 17:33:57 UTC
Description of problem:
-----------------------

Several crashes have occurred on different systems running the RHEL 5 kernel 2.6.18-400.1.1.el5. Each crash is in an inlined list_del() called from one of several functions that manipulate the dcache. The crashes started after an upgrade from kernel 2.6.18-308.el5 to kernel 2.6.18-400.1.1.el5 and persist with 2.6.18-407.el5.

This problem may be related to Bug 1198315.
However, I am logging it separately for two reasons:

a) Bug 1198315 states that the problem was introduced by the patch linux-2.6-fs-dcache-fix-dentry-loop-detection-deadlock.patch for BZ717959, which led to the erratum http://rhn.redhat.com/errata/RHSA-2012-0150.html, i.e. the RHEL 5.8 kernel -308.
The customer stated that they did not see these problems while running kernel 2.6.18-308.el5.

b) To document the kernel stack traces, which appear to share the same root cause.
This will allow the Bugzilla search engine to highlight this bug as a match.



The crashes:
------------
1. crash in shrink_dcache_parent(), dereferencing a NULL pointer while deleting a dentry from the d_lru list

crash> bt
PID: 2109   TASK: ffff8116bc2c2040  CPU: 9   COMMAND: "java"
 #0 [ffff81172fcc9bb0] crash_kexec at ffffffff800b156c
 #1 [ffff81172fcc9c70] __die at ffffffff80065137
 #2 [ffff81172fcc9cb0] do_page_fault at ffffffff80067430
 #3 [ffff81172fcc9da0] error_exit at ffffffff8005ddf9
    [exception RIP: shrink_dcache_parent+0x67]
    RIP: ffffffff8004d7b5  RSP: ffff81172fcc9e58  RFLAGS: 00010217
    RAX: ffff81055d731d60  RBX: ffff8106f61733d8  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffff81172fcc9e58  RDI: ffff8106f6173418
    RBP: ffff8106f6173500   R8: ffff81058e430000   R9: ffff81172fcc9cf8
    R10: 0000000000000003  R11: ffffffff8002cb43  R12: ffff81083d5b5228
    R13: ffff81083d5b5228  R14: 00000000000000b1  R15: 00002ab1e06a4800
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff81172fcc9e90] dentry_unhash at ffffffff800506c7
 #5 [ffff81172fcc9ea0] vfs_rmdir at ffffffff8004a654
 #6 [ffff81172fcc9ec0] do_rmdir at ffffffff800ede2c
 #7 [ffff81172fcc9f80] system_call at ffffffff8005d116



2. crash in inlined list_del_init()/__list_del() called from prune_dcache() 

crash> bt
PID: 1340   TASK: ffff81180f9617b0  CPU: 1   COMMAND: "kswapd0"
 #0 [ffff810c0f393ac0] crash_kexec at ffffffff800b155c
 #1 [ffff810c0f393b80] __die at ffffffff80065137
 #2 [ffff810c0f393bc0] do_page_fault at ffffffff80067430
 #3 [ffff810c0f393cb0] error_exit at ffffffff8005ddf9
    [exception RIP: prune_dcache+109]
    RIP: ffffffff8002ea91  RSP: ffff810c0f393d60  RFLAGS: 00010207
    RAX: 0000000000000000  RBX: ffff810af1fe5f10  RCX: 0000000000000064
    RDX: ffffffff8032f320  RSI: ffff810b77c2c190  RDI: ffff810af1fe5ed8
    RBP: ffff810af1fe5ed0   R8: ffff810c0f393d40   R9: 0000000000004f62
    R10: 00000000017aabff  R11: 00000000000002c0  R12: 0000000000000000
    R13: 000000000000002d  R14: 0000000000000000  R15: 0000000000000b00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff810c0f393d88] shrink_dcache_memory at ffffffff800f145e
 #5 [ffff810c0f393d98] shrink_slab at ffffffff8003f836
 #6 [ffff810c0f393dd8] kswapd at ffffffff80057f84
 #7 [ffff810c0f393ee8] kthread at ffffffff80032c56
 #8 [ffff810c0f393f48] kernel_thread at ffffffff8005dfc1


This crash involves a linked list of dentries: either a dispose_list passed as an argument to prune_dcache() or dentry_unused. In this case it was dentry_unused.
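
For readers less familiar with how a dentry sits on such a list, here is a minimal userspace sketch of the embedded-list-node idea. The struct toy_dentry layout and the names toy_list_entry() and dentry_from_lru() are hypothetical stand-ins, not the RHEL 5 definitions; only the technique (recovering the containing object from its d_lru node, as the kernel's list_entry() does) is the point.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical, heavily reduced stand-in for struct dentry. */
struct toy_dentry {
    int d_count;
    struct list_head d_lru;  /* node that links the dentry into an LRU list */
    const char *d_name;
};

/* Same idea as the kernel's list_entry(): step back from the embedded
 * node to the start of the structure that contains it. */
#define toy_list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Map a d_lru node found while walking a list back to its dentry. */
struct toy_dentry *dentry_from_lru(struct list_head *node)
{
    return toy_list_entry(node, struct toy_dentry, d_lru);
}

/* Usage example: the node address maps back to the owning object. */
void toy_entry_example(void)
{
    struct toy_dentry d = { 1, { NULL, NULL }, "example" };

    assert(dentry_from_lru(&d.d_lru) == &d);
}
```

Because the list code only ever sees the embedded node, a stale or zeroed next/prev pair inside one dentry is enough to make an unrelated list walker fault.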


3. crash in do_lookup()/dput()/list_del(), with dput() inlined. No dump was provided, only console messages:


Oops: 0000 [1] SMP
...
Process rsync (pid: 14016, threadinfo ffff810378922000, task ffff81043fc92040)
Stack:  ffff8102fedaba28 ffffffff88859f03 00000000439207c0 ffff8102fedaba28
ffff8101439207c0 0000000000000001 ffff8102b500e020 ffffffff8885a02b
ffff810311c97a80 ffff810025715c48 0000000000000001 ffff810088d84558
Call Trace:
[<ffffffff88859f03>] :nfs:nfs_access_get_cached+0xde/0x108
[<ffffffff8885a02b>] :nfs:nfs_permission+0xfe/0x1ce
[<ffffffff8000d034>] do_lookup+0x8f/0x24b
[<ffffffff8000daf6>] permission+0x81/0xc8
[<ffffffff80009a22>] __link_path_walk+0x173/0xf39
[<ffffffff8000c77f>] _atomic_dec_and_lock+0x23/0x57
[<ffffffff8000ebb3>] link_path_walk+0x45/0xb8
[<ffffffff8000ce24>] do_path_lookup+0x294/0x311
[<ffffffff800129f3>] getname+0x15b/0x1c2
[<ffffffff80023f9b>] __user_walk_fd+0x37/0x4c
[<ffffffff8003f5c5>] vfs_lstat_fd+0x18/0x47
[<ffffffff8002b237>] sys_newlstat+0x19/0x31
[<ffffffff8005d116>] system_call+0x7e/0x83


This again involves a linked list of dentries.


4. crash in list_del() called from shrink_dcache_for_umount_subtree(). No dump was provided, only console messages:

list_del corruption. next->prev should be ffff8111a79b3938, but was (null)
Kernel BUG at lib/list_debug.c:70
invalid opcode: 0000 [1] SMP 
...
Process umount.nfs (pid: 50151, threadinfo ffff81091b7fc000, task ffff810655aee080)
Stack:  ffff8111a79b3b70 ffffffff800f0fbc ffff81175eae6c00 ffffffff8afbeb40
0000000000000000 ffffffff800f14af ffff81175eae6c00 ffffffff800e87ad
0000000000000050 ffffffff8afbeb00 0000000000000000 ffffffff800e88db
Call Trace:
[<ffffffff800f0fbc>] shrink_dcache_for_umount_subtree+0x1a0/0x222
[<ffffffff800f14af>] shrink_dcache_for_umount+0x37/0x45
[<ffffffff800e87ad>] generic_shutdown_super+0x1b/0xfb
[<ffffffff800e88db>] kill_anon_super+0x9/0x35
[<ffffffff8af88c0f>] :nfs:nfs_kill_super+0x8c/0x9f
[<ffffffff8006458b>] __down_write_nested+0x12/0x92
[<ffffffff800e898c>] deactivate_super+0x6a/0x82
[<ffffffff800f389a>] sys_umount+0x245/0x27b
[<ffffffff80012d39>] __fput+0x191/0x1bd
[<ffffffff8002d0fe>] mntput_no_expire+0x19/0x89
[<ffffffff80024205>] filp_close+0x5c/0x64
[<ffffffff8005d116>] system_call+0x7e/0x83


Again, this involves a linked list of dentries.
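
The "list_del corruption" message in crash 4 comes from a consistency check made before the entry is unlinked. The following is a userspace sketch of that kind of check, loosely modelled on the message text above. It is not the code in lib/list_debug.c: the name checked_list_del(), the error-return convention (the kernel calls BUG() instead), and clearing the pointers to NULL afterwards are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

void list_head_init(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Insert item directly after head. */
void list_add_head(struct list_head *item, struct list_head *head)
{
    item->next = head->next;
    item->prev = head;
    head->next->prev = item;
    head->next = item;
}

/*
 * Checked unlink (hypothetical helper). Returns 0 on success. Returns -1
 * and leaves the list untouched if a neighbour does not point back at
 * entry, writing a diagnostic into msg. Like the check it imitates, it
 * assumes entry->next and entry->prev themselves are valid pointers.
 */
int checked_list_del(struct list_head *entry, char *msg, size_t len)
{
    if (entry->prev->next != entry) {
        snprintf(msg, len,
                 "list_del corruption. prev->next should be %p, but was %p",
                 (void *)entry, (void *)entry->prev->next);
        return -1;
    }
    if (entry->next->prev != entry) {
        snprintf(msg, len,
                 "list_del corruption. next->prev should be %p, but was %p",
                 (void *)entry, (void *)entry->next->prev);
        return -1;
    }
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = NULL;  /* assumption: the kernel stores poison values here */
    entry->prev = NULL;
    return 0;
}

/* Usage example: a neighbour whose prev was cleared is detected. */
void checked_del_example(void)
{
    struct list_head head, a, b;
    char msg[128] = "";

    list_head_init(&head);
    list_add_head(&b, &head);
    list_add_head(&a, &head);   /* list is now head -> a -> b */
    b.prev = NULL;              /* simulate the corruption seen in crash 4 */
    assert(checked_list_del(&a, msg, sizeof msg) == -1);
}
```

Without such a check, the unlink writes through next->prev and prev->next unconditionally, which is consistent with the plain page faults seen in crashes 1 and 2.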




Version-Release number of selected component (if applicable):
-------------------------------------------------------------
kernel  2.6.18-400.1.1.el5.x86_64
        2.6.18-407.el5


How reproducible:
-----------------
Not reproducible at will; the crashes happen during normal production use.


Steps to Reproduce: 
-------------------
N/A

Actual results: 
---------------
The system crashes with a NULL pointer dereference when deleting dentries from a doubly-linked list. This suggests that the dcache_lock spinlock protection was broken at some point after kernel 2.6.18-308.el5.
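
To make the locking expectation concrete, here is a minimal userspace sketch of the invariant that appears to be violated: every path that takes a dentry off an LRU list does so under one lock and re-checks, under that lock, that the entry is still linked. The spinlock built on atomic_flag is only a stand-in for dcache_lock, and toy_lock(), toy_unlock() and lru_remove_locked() are hypothetical names. This is not the RHEL 5 dcache code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Stand-in for dcache_lock: a trivial test-and-set spinlock. */
static atomic_flag toy_dcache_lock = ATOMIC_FLAG_INIT;

void toy_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&toy_dcache_lock,
                                             memory_order_acquire))
        ;  /* spin until the holder clears the flag */
}

void toy_unlock(void)
{
    atomic_flag_clear_explicit(&toy_dcache_lock, memory_order_release);
}

void list_head_init(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

bool list_is_empty(const struct list_head *h)
{
    return h->next == h;
}

/* Insert item directly after head. */
void list_add_head(struct list_head *item, struct list_head *head)
{
    item->next = head->next;
    item->prev = head;
    head->next->prev = item;
    head->next = item;
}

/*
 * Take an entry off its LRU list if it is still linked. The check and
 * the unlink happen under the same lock, and the node is re-initialised
 * to point at itself, so a second caller sees it as already removed.
 * Returns true only for the caller that actually unlinked the entry.
 */
bool lru_remove_locked(struct list_head *d_lru)
{
    bool removed = false;

    toy_lock();
    if (!list_is_empty(d_lru)) {
        d_lru->next->prev = d_lru->prev;
        d_lru->prev->next = d_lru->next;
        list_head_init(d_lru);
        removed = true;
    }
    toy_unlock();
    return removed;
}

/* Usage example: the second removal attempt is a harmless no-op. */
void lru_remove_example(void)
{
    struct list_head lru, d;

    list_head_init(&lru);
    list_head_init(&d);
    list_add_head(&d, &lru);
    assert(lru_remove_locked(&d));
    assert(!lru_remove_locked(&d));
}
```

If any path unlinks or reinitialises a node without holding the lock, a concurrent walker can observe a half-updated or NULL next/prev pair, which would match the faults reported above.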


Expected results:
-----------------


Additional info:
----------------
Locations of the crash dumps and a detailed analysis will follow in a subsequent comment.