Description of problem:
-----------------------
Several crashes on different systems running RHEL 5 kernel 2.6.18-400.1.1.el5,
always in an inlined list_del() called from various functions manipulating
dcache. Crashes started after upgrade from kernel 2.6.18-308.el5 to kernel
2.6.18-400.1.1.el5 and persist with 2.6.18-407.el5.

There is a possibility of this problem being related to Bug 1198315.
However I am logging this for 2 reasons:

a) Bug 1198315 states the problem was introduced in patch
   linux-2.6-fs-dcache-fix-dentry-loop-detection-deadlock.patch for BZ717959
   leading to errata http://rhn.redhat.com/errata/RHSA-2012-0150.html
   which is RHEL 5.8 kernel -308.
   Customer stated they have not seen these problems when running kernel
   2.6.18-308.el5.

b) to document kernel stack traces which seem to be due to the same root cause.
   This will allow the Bugzilla search engine to highlight this Bug as a match.

The crashes:
------------
1. crash in shrink_dcache_parent() while deleting a d_lru dentry, dereferencing
   a NULL pointer

crash> bt
PID: 2109   TASK: ffff8116bc2c2040  CPU: 9   COMMAND: "java"
 #0 [ffff81172fcc9bb0] crash_kexec at ffffffff800b156c
 #1 [ffff81172fcc9c70] __die at ffffffff80065137
 #2 [ffff81172fcc9cb0] do_page_fault at ffffffff80067430
 #3 [ffff81172fcc9da0] error_exit at ffffffff8005ddf9
    [exception RIP: shrink_dcache_parent+0x67]
    RIP: ffffffff8004d7b5  RSP: ffff81172fcc9e58  RFLAGS: 00010217
    RAX: ffff81055d731d60  RBX: ffff8106f61733d8  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffff81172fcc9e58  RDI: ffff8106f6173418
    RBP: ffff8106f6173500   R8: ffff81058e430000   R9: ffff81172fcc9cf8
    R10: 0000000000000003  R11: ffffffff8002cb43  R12: ffff81083d5b5228
    R13: ffff81083d5b5228  R14: 00000000000000b1  R15: 00002ab1e06a4800
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff81172fcc9e90] dentry_unhash at ffffffff800506c7
 #5 [ffff81172fcc9ea0] vfs_rmdir at ffffffff8004a654
 #6 [ffff81172fcc9ec0] do_rmdir at ffffffff800ede2c
 #7 [ffff81172fcc9f80] system_call at ffffffff8005d116

2. crash in inlined list_del_init()/__list_del() called from prune_dcache()

crash> bt
PID: 1340   TASK: ffff81180f9617b0  CPU: 1   COMMAND: "kswapd0"
 #0 [ffff810c0f393ac0] crash_kexec at ffffffff800b155c
 #1 [ffff810c0f393b80] __die at ffffffff80065137
 #2 [ffff810c0f393bc0] do_page_fault at ffffffff80067430
 #3 [ffff810c0f393cb0] error_exit at ffffffff8005ddf9
    [exception RIP: prune_dcache+109]
    RIP: ffffffff8002ea91  RSP: ffff810c0f393d60  RFLAGS: 00010207
    RAX: 0000000000000000  RBX: ffff810af1fe5f10  RCX: 0000000000000064
    RDX: ffffffff8032f320  RSI: ffff810b77c2c190  RDI: ffff810af1fe5ed8
    RBP: ffff810af1fe5ed0   R8: ffff810c0f393d40   R9: 0000000000004f62
    R10: 00000000017aabff  R11: 00000000000002c0  R12: 0000000000000000
    R13: 000000000000002d  R14: 0000000000000000  R15: 0000000000000b00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff810c0f393d88] shrink_dcache_memory at ffffffff800f145e
 #5 [ffff810c0f393d98] shrink_slab at ffffffff8003f836
 #6 [ffff810c0f393dd8] kswapd at ffffffff80057f84
 #7 [ffff810c0f393ee8] kthread at ffffffff80032c56
 #8 [ffff810c0f393f48] kernel_thread at ffffffff8005dfc1

This concerns a linked list of dentries (either a dispose_list given as an
argument to prune_dcache or dentry_unused), here it was dentry_unused.

3. crash in do_lookup()/dput()/list_del() with dput inlined.
   No dump provided, only console messages

Oops: 0000 [1] SMP
...
Process rsync (pid: 14016, threadinfo ffff810378922000, task ffff81043fc92040)
Stack:  ffff8102fedaba28 ffffffff88859f03 00000000439207c0 ffff8102fedaba28
 ffff8101439207c0 0000000000000001 ffff8102b500e020 ffffffff8885a02b
 ffff810311c97a80 ffff810025715c48 0000000000000001 ffff810088d84558
Call Trace:
 [<ffffffff88859f03>] :nfs:nfs_access_get_cached+0xde/0x108
 [<ffffffff8885a02b>] :nfs:nfs_permission+0xfe/0x1ce
 [<ffffffff8000d034>] do_lookup+0x8f/0x24b
 [<ffffffff8000daf6>] permission+0x81/0xc8
 [<ffffffff80009a22>] __link_path_walk+0x173/0xf39
 [<ffffffff8000c77f>] _atomic_dec_and_lock+0x23/0x57
 [<ffffffff8000ebb3>] link_path_walk+0x45/0xb8
 [<ffffffff8000ce24>] do_path_lookup+0x294/0x311
 [<ffffffff800129f3>] getname+0x15b/0x1c2
 [<ffffffff80023f9b>] __user_walk_fd+0x37/0x4c
 [<ffffffff8003f5c5>] vfs_lstat_fd+0x18/0x47
 [<ffffffff8002b237>] sys_newlstat+0x19/0x31
 [<ffffffff8005d116>] system_call+0x7e/0x83

Concerning dentries linked list again.

4. crash in list_del called from shrink_dcache_for_umount_subtree(),
   no dump given, only console messages

list_del corruption. next->prev should be ffff8111a79b3938, but was (null)
Kernel BUG at lib/list_debug.c:70
invalid opcode: 0000 [1] SMP
...
Process umount.nfs (pid: 50151, threadinfo ffff81091b7fc000, task ffff810655aee080)
Stack:  ffff8111a79b3b70 ffffffff800f0fbc ffff81175eae6c00 ffffffff8afbeb40
 0000000000000000 ffffffff800f14af ffff81175eae6c00 ffffffff800e87ad
 0000000000000050 ffffffff8afbeb00 0000000000000000 ffffffff800e88db
Call Trace:
 [<ffffffff800f0fbc>] shrink_dcache_for_umount_subtree+0x1a0/0x222
 [<ffffffff800f14af>] shrink_dcache_for_umount+0x37/0x45
 [<ffffffff800e87ad>] generic_shutdown_super+0x1b/0xfb
 [<ffffffff800e88db>] kill_anon_super+0x9/0x35
 [<ffffffff8af88c0f>] :nfs:nfs_kill_super+0x8c/0x9f
 [<ffffffff8006458b>] __down_write_nested+0x12/0x92
 [<ffffffff800e898c>] deactivate_super+0x6a/0x82
 [<ffffffff800f389a>] sys_umount+0x245/0x27b
 [<ffffffff80012d39>] __fput+0x191/0x1bd
 [<ffffffff8002d0fe>] mntput_no_expire+0x19/0x89
 [<ffffffff80024205>] filp_close+0x5c/0x64
 [<ffffffff8005d116>] system_call+0x7e/0x83

Again concerning dentries linked list

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
kernel 2.6.18-400.1.1.el5.x86_64
2.6.18-407.el5

How reproducible:
-----------------
Not at will, crashes happen during normal production

Steps to Reproduce:
-------------------
N/A

Actual results:
---------------
System crash due to NULL pointer dereference when deleting dentries from a
doubly-linked list. This is suggestive of dcache_lock spinlock protection
being broken somewhere past the kernel 2.6.18-308.el5

Expected results:
-----------------

Additional info:
----------------
Locations of the crash dumps and detailed analysis will be in subsequent comment.
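
Illustration of the failure mode (not taken from the RHEL 5 sources):
the sketch below is a simplified userspace model of a kernel-style circular
doubly-linked list, showing why a neighbour pointer that was reset without
proper locking makes a later list_del either fault on a NULL pointer or trip
a "list_del corruption" check such as the one reported in crash 4. The helper
name checked_list_del() and its return convention are assumptions made for
this example only.

```c
#include <stdio.h>
#include <stddef.h>

/* Simplified userspace model of a circular doubly-linked list node. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head that points at itself in both directions. */
void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Link entry in just before head, i.e. at the tail of the list. */
void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/*
 * Hypothetical helper for this example: unlink entry only if both
 * neighbours still point back at it. Returns 0 on success, -1 if the
 * list looks corrupted. An unchecked delete would instead write through
 * entry->next and entry->prev directly, which faults if either pointer
 * was set to NULL by another, unsynchronised, list operation.
 */
int checked_list_del(struct list_head *entry)
{
    if (entry->prev == NULL || entry->prev->next != entry) {
        fprintf(stderr, "list_del corruption. prev->next should be %p\n",
                (void *)entry);
        return -1;
    }
    if (entry->next == NULL || entry->next->prev != entry) {
        fprintf(stderr, "list_del corruption. next->prev should be %p\n",
                (void *)entry);
        return -1;
    }
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = NULL;
    entry->prev = NULL;
    return 0;
}
```

Deleting an entry from an intact list succeeds; if a neighbour's back pointer
is cleared first (standing in for a racing update made without the list lock
held), the same call reports corruption instead of unlinking.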