* This appears to be a regression caused by the patch applied to fix the problem in bz 526612.

A user reported a machine hang with kswapd running at 100%.

crash> bt 561
PID: 561  TASK: ffff81087fff57a0  CPU: 5  COMMAND: "kswapd0"
 #0 [ffff81011de29f20] crash_nmi_callback at ffffffff8007af27
 #1 [ffff81011de29f40] do_nmi at ffffffff8006588a
 #2 [ffff81011de29f50] nmi at ffffffff80064eef
    [exception RIP: .text.lock.spinlock+5]
    RIP: ffffffff80064bff  RSP: ffff81087fac9d58  RFLAGS: 00000202
    RAX: 00000000ffffffff  RBX: ffff81011dc6be00  RCX: 0000000000000064
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: ffffffff803f1480
    RBP: 00000000ffd1bab0  R8:  ffff81087fac9d58  R9:  0101010101010101
    R10: 0000000000000000  R11: 00000000ffd1bab0  R12: 0000000000000000
    R13: 0000000000000080  R14: 0000000000000000  R15: 0000000000000080
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #3 [ffff81087fac9d58] .text.lock.spinlock at ffffffff80064bff (via _spin_lock)
 #4 [ffff81087fac9d58] prune_dcache at ffffffff8002e73d
 #5 [ffff81087fac9d88] shrink_dcache_memory at ffffffff800eb979
 #6 [ffff81087fac9d98] shrink_slab at ffffffff8003f462
 #7 [ffff81087fac9dd8] kswapd at ffffffff80057e80
 #8 [ffff81087fac9ee8] kthread at ffffffff80032894
 #9 [ffff81087fac9f48] kernel_thread at ffffffff8005dfb1

The kswapd process was waiting for the dcache_lock spinlock. Along with kswapd, 109 other processes were waiting for the same dcache_lock, called from prune_dcache() via cond_resched_lock().
crash> bt 904
PID: 904  TASK: ffff8104be6920c0  CPU: 4  COMMAND: "yum-updatesd-he"
 #0 [ffff8102fa15db18] schedule at ffffffff80062f96
 #1 [ffff8102fa15dbf0] __cond_resched at ffffffff8008e8e5
 #2 [ffff8102fa15dc00] cond_resched_lock at ffffffff8003b2cd
 #3 [ffff8102fa15dc20] prune_dcache at ffffffff8002e74e
 #4 [ffff8102fa15dc50] shrink_dcache_memory at ffffffff800eb979
 #5 [ffff8102fa15dc60] shrink_slab at ffffffff8003f462
 #6 [ffff8102fa15dca0] try_to_free_pages at ffffffff800cbad1
 #7 [ffff8102fa15dd30] __alloc_pages at ffffffff8000f450
 #8 [ffff8102fa15dda0] __handle_mm_fault at ffffffff80008e44
 #9 [ffff8102fa15de60] do_page_fault at ffffffff80066b55
#10 [ffff8102fa15df50] error_exit at ffffffff8005dde9
    RIP: 00002b2c11c08d9f  RSP: 00007fffd4a229b0  RFLAGS: 00010206
    RAX: 000000000000001b  RBX: 000000000000002c  RCX: 0000000000000037
    RDX: 0000000000000000  RSI: 000000001fdb4000  RDI: 000000000000002c
    RBP: 0000000000000005  R8:  fefefefefefefeff  R9:  0000000000000000
    R10: 0000000000000000  R11: 000000000000000a  R12: 000000000000000a
    R13: 00002b2c11eb41c0  R14: 000000001fbd3f08  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

The trigger appears to be a negative dentry_stat->nr_unused, which tracks the total number of dentries on the dentry_unused list (dentries are linked onto that list through dentry->d_lru).

crash> p dentry_stat
dentry_stat = $7 = {
  nr_dentry = 4037,
  nr_unused = -3032498,
  age_limit = 45,
  want_pages = 0,
  dummy = {0, 0}
}

The negative nr_unused value feeds into the calculation of the unsigned variable total_scan in shrink_slab(). total_scan is used in the while condition that decides how many times the shrinker for that cache is called. When nr_unused is negative, the conversion to an unsigned type yields a huge total_scan, and as a result shrink_dcache_memory()->prune_dcache() is called an enormous number of times. The last patch that changed how the dentry_stat counter is used was made in bz 526612.
We passed this report to Lachlan, the author of the patch for bz 526612, who confirmed that it is caused by the same patch:

-- I think I see what's going on here - the accounting for nr_unused is getting out of whack. Most of the code in dcache.c assumes that the dentry->lru list pointers (when in use) always refer to the dentry_unused list and therefore do nr_unused++/-- when adding or removing a dentry from the list. The new changes I added overload the dentry->lru pointers to put the dentry on a private list which doesn't use the nr_unused++/-- accounting. It's possible for a dentry to be hashed into the filesystem and be on either of these lists. When select_parent() traverses all the dentries hashed into the filesystem and finds one on an lru list, it assumes it is the dentry_unused list, removes it, and does nr_unused--. But due to the cond_resched_lock(dcache_lock) in prune_dcache() it's possible to expose a dentry that's on a private lru list to select_parent(). It removes it from the list and does an erroneous nr_unused--. I can fix this easily by doing nr_unused accounting for the private list too. I'll work up a patch. --

We are currently testing this patch in a test kernel given to the user.
Created attachment 439909 [details]
Patch to fix nr_unused counting. Patch by lmcilroy.