Description of problem:

Customer is seeing regular soft lockup messages due to contention on the dcache_lock.

Linux 2.6.18-92.el5 #1 SMP x86_64
Intel(R) Xeon(R) CPU X7350 @ 2.93GHz
HP DL580G5
NAS Storage (EMC)

They use NAS-attached storage across 8 servers. These servers hold the customer's multimedia data, so there are many upload and download operations and a lot of files are left behind. They manually remove unnecessary files every day. Because there are so many small files, it sometimes takes more than half a day to remove them all. After this removal job, they sometimes see the following messages in /var/log/messages, and afterwards socket communication fails intermittently:

BUG: soft lockup - CPU#10 stuck for 10s! [M1IF:3779]

Call Trace:
 [<ffffffff8000c31a>] _atomic_dec_and_lock+0x39/0x57
 [<ffffffff8000d052>] dput+0x2c/0x114
 [<ffffffff80022b12>] path_release+0xc/0x24
 [<ffffffff800285ff>] vfs_stat_fd+0x3c/0x4a
 [<ffffffff8003aa41>] hrtimer_try_to_cancel+0x4a/0x53
 [<ffffffff8005a21d>] hrtimer_cancel+0xc/0x16
 [<ffffffff80063c5d>] do_nanosleep+0x47/0x70
 [<ffffffff8005a10a>] hrtimer_nanosleep+0x58/0x118
 [<ffffffff800233f9>] sys_newstat+0x19/0x31
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

I checked each of these functions and found they were all stuck trying to take the dcache_lock. Sometimes the system recovers without any serious problem, but other times it hangs or the applications suffer communication problems. In a typical day they remove about 25 million files, roughly 100 GB in total.

Version-Release number of selected component (if applicable):
RHEL5.2 kernel-2.6.18-92.EL

How reproducible:
Customer sees these soft-lockup messages almost daily.

Steps to Reproduce:
Still working on a test case.
I think this call trace is key to this problem:

 [<ffffffff8002e7d7>] prune_dcache+0x2d/0x149
 [<ffffffff8004cd30>] shrink_dcache_parent+0x3f/0xe1
 [<ffffffff8004cd0d>] shrink_dcache_parent+0x1c/0xe1
 [<ffffffff883f541b>] :nfs:nfs_lookup_revalidate+0x35c/0x416
 [<ffffffff88403931>] :nfs:nfs3_proc_access+0x12f/0x18e
 [<ffffffff80064c47>] .text.lock.kernel_lock+0x5/0x1f
 [<ffffffff80009523>] __d_lookup+0xb0/0xff
 [<ffffffff8000cc5e>] do_lookup+0x17d/0x1d4
 [<ffffffff80009918>] __link_path_walk+0x3a6/0xf42
 [<ffffffff8000e782>] link_path_walk+0x5c/0xe5
 [<ffffffff8000c965>] do_path_lookup+0x270/0x2e8
 [<ffffffff800234db>] __path_lookup_intent_open+0x56/0x97
 [<ffffffff8001a954>] open_namei+0x73/0x6d5
 [<ffffffff8002732a>] do_filp_open+0x1c/0x38
 [<ffffffff80019720>] do_sys_open+0x44/0xbe
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

What's happening here is that an NFS client has tried to look up a file and either the lookup failed or the inode was found to be stale. When this happens the inode is revalidated, and if it is a directory all of its dentries are invalidated via shrink_dcache_parent() and prune_dcache(). Both of these functions can hold the dcache_lock for extended periods if there are a significant number of dentries to be scanned.

From the sosreport we can see that there are 1.2 million dentries in the cache:

$ cat slabinfo | grep dentry_cache
dentry_cache  1204176 1204956  216  18  1 : tunables  120  60  8 : slabdata  66912  66942  360

This issue looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=450194, which didn't get resolved but does provide a few more clues, including this patch in mainline:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=da3bbdd4632c0171406b2677e31494afa5bde2f8

It also looks to be the same as https://bugzilla.redhat.com/show_bug.cgi?id=413731, which does have a patch that just might work too.
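For a quick check of dentry cache pressure without parsing slabinfo, the standard /proc/sys/fs/dentry-state interface can be read directly. This is a sketch of how one could watch the cache size while reproducing; the first field is the total number of dentries and the second is the number of unused dentries on the LRU list.

```shell
# Report the total dentry count and the unused (LRU) dentry count.
# On an affected system the unused count climbs into the millions
# during the mass-removal workload.
awk '{ printf "total=%s unused=%s\n", $1, $2 }' /proc/sys/fs/dentry-state
```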
I've narrowed the problem down to this specific bit of code.

In shrink_dcache_parent() we call select_parent() to move all the child dentries of the parent dentry onto the end of the unused dentry list. It also returns the number of dentries moved. This count can be very high (as high as 50,000 has been seen before).

void shrink_dcache_parent(struct dentry * parent)
{
	int found;

	while ((found = select_parent(parent)) != 0)
		prune_dcache(found, parent->d_sb);
}

We then call into prune_dcache() and start scanning backwards from the end of the unused dentry list (where select_parent() just put some dentries).

static void prune_dcache(int count, struct super_block *sb)
{
	spin_lock(&dcache_lock);
	for (; count ; count--) {
		struct dentry *dentry;
		struct list_head *tmp;
		struct rw_semaphore *s_umount;

		cond_resched_lock(&dcache_lock);

		tmp = dentry_unused.prev;
		if (sb) {
			/* Try to find a dentry for this sb, but don't try
			 * too hard, if they aren't near the tail they will
			 * be moved down again soon
			 */
			int skip = count;
			while (skip && tmp != &dentry_unused &&
			    list_entry(tmp, struct dentry, d_lru)->d_sb != sb) {
				skip--;
				tmp = tmp->prev;
			}
		}

The outer for loop executes count times and periodically releases the dcache_lock in cond_resched_lock(), so I think that part is playing fair and is not the problem. The problem is the inner while loop, which can traverse a long list without ever releasing the dcache_lock.

In between calling select_parent() and prune_dcache() (and also while prune_dcache() drops the dcache_lock) more dentries can be put on the list. If the system is removing many files it could easily add another 50,000 or so onto the end of the unused list. This can result in the inner loop traversing a list of count dentries for every iteration of the outer loop, so the total number of iterations is closer to count squared. It's no wonder this bit of code is getting very busy.
There are many approaches that could be taken to fix this, including the fix from https://bugzilla.redhat.com/show_bug.cgi?id=413731, which has select_parent() compile a private list of dentries for prune_dcache() to purge. That way the unused list never has to be scanned again, which removes the inner while loop entirely. I'm sure this will fix the issue.
Created attachment 363274 [details] Fix from RHEL3 This is the patch that was used to fix the same problem in RHEL3. I've just updated it to 2.6.18-92.
Looking at the patch that was tried in BZ 450194 I can see why it didn't work. The patch was from here: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=da3bbdd4632c0171406b2677e31494afa5bde2f8 While this patch removes the problematic inner while loop from prune_dcache(), it replaces it with another one in __shrink_dcache_sb(). That list will probably be shorter, though, since it only deals with dentries for a single filesystem rather than the global unused list.
This patch fixes the problem with the patch used in BZ 450194; together, these two patches should be enough to fix this issue. http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f3c6ba986ab4527b6dfacf9f3b9e40f72466a8b2
Might also need this patch: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=c490d79bb70c549e096099576b1df40a8810b0d8
I have a reliable test case now.

The test requires multiple NFS client filesystems, each big enough to create about 100,000 files. The NFS clients must mount different exports, since I found that if I mount the same export many times the test doesn't work (because the NFS clients all share the same superblock value). The NFS clients and server can be on the same system. Here's what I did:

I used a 300GB disk and split it up into 8 logical partitions. I chose 8 partitions because I have 8 cores in my test system and I wanted one for each core. This allows all the cores to contend on the dcache_lock.

Disk /dev/sdb: 300.0 GB, 300000000000 bytes
255 heads, 63 sectors/track, 36472 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       36472   292961308+   5  Extended
/dev/sdb5               1        4500    36146187   83  Linux
/dev/sdb6            4501        9000    36146218+  83  Linux
/dev/sdb7            9001       13500    36146218+  83  Linux
/dev/sdb8           13501       18000    36146218+  83  Linux
/dev/sdb9           18001       22500    36146218+  83  Linux
/dev/sdb10          22501       27000    36146218+  83  Linux
/dev/sdb11          27001       31500    36146218+  83  Linux
/dev/sdb12          31501       36000    36146218+  83  Linux

Make all the filesystems:

for i in `seq 5 12`
do
	mkfs.ext3 /dev/sdb$i
done

Mount the filesystems:

for i in `seq 5 12`
do
	mount /dev/sdb$i /mnt/sdb$i
	mount -t nfs 10.64.176.213:/mnt/sdb$i /mnt/nfs$i
done

For each filesystem we need about 100,000 files. Fewer might work but I haven't found the magic number yet. ext3 is limited to about 32,000 subdirectories per directory, so I create 1,000 directories and then a further 100 in each one. All these files are under a root directory called 'test' that the test will operate on to trigger the problem. This script creates the files on the NFS server since that's quicker. The test does not remove the files, so this step only has to be done once.

for i in `seq 5 12`
do
	cd /mnt/sdb$i
	mkdir test
	cd test
	for d in `seq -w 1 1000`
	do
		mkdir dir_$d
		if [ $? -ne 0 ]
		then
			exit 1
		fi
		for j in `seq -w 1 100`
		do
			mkdir dir_$d/dir_$j
			if [ $? -ne 0 ]
			then
				exit 1
			fi
		done
	done
done

The next step is to prime the dentry cache with the NFS client filesystems. This needs to be done each time the test is run.

for i in `seq 5 12`
do
	( cd /mnt/nfs$i; find . > /dev/null ) &
done
wait

The 'wait' lets me know when it's done.

The final step is to get all the NFS clients to call shrink_dcache_parent() at the same time on the root 'test' directory - this will cause each NFS client to purge all of its 100,000 dentries. Getting all the clients to call shrink_dcache_parent() from nfs_lookup_revalidate() is too difficult - it requires timing and luck. Instead we use the fact that the NFS rename operation calls shrink_dcache_parent() on the old file. Fire off all 8 processes into the background and sleep for a few seconds to ensure they all hit the rename at the same time. I rename the test directory back again so it's ready to run again.

for i in `seq 5 12`
do
	( cd /mnt/nfs$i; sleep 5; mv test test1; mv test1 test ) &
done

This test will cause my system to lock up for up to 30 minutes and generate over 600 soft lockup warnings.
I ran this test with the attached patch (adapted from RHEL3) and there were no soft lockup messages.
After running the test a few times with the RHEL3 patch the system eventually became unresponsive. There were still no soft lockup messages, though. It appears the system ran short on free memory, since kswapd was chewing up 100% CPU before top stopped updating, and slabtop showed over 800,000 entries in each of the nfs_inode_cache, ext3_inode_cache and buffer_head caches. The dentry cache was down to 2,340 entries, so that got purged successfully. I don't know if this is a new problem introduced by the patch or an existing bug. I'll try the patches from mainline and see if they work better.
Created attachment 364551 [details] Per-superblock dentry list patch This is the result of merging these fixes from mainline:

1. d52b908646b88cb1952ab8c9b2d4423908a23f11
2. 4a0962abd187df29b7d1378b2f372a55667d54c0
3. 24c32d733dd44dbc5b9dcd0b8de58e16fdbeac76
4. 85864e103850962389d77467391bc3da021d070f
5. 6eac3f93f5e6b7256fb20b7608d62ec192da12be
6. 37c42524d6090644206ae6d310d7e830bd3ccb47
7. da3bbdd4632c0171406b2677e31494afa5bde2f8
8. f3c6ba986ab4527b6dfacf9f3b9e40f72466a8b2
9. c490d79bb70c549e096099576b1df40a8810b0d8

Patch 7 provides the per-sb dentry list support, patches 8 and 9 are fixes to that change, and the rest are minor dependencies to avoid hand-merging changes.

This patch survives 5 consecutive runs of the test case without any soft lockups or hard lockups (as was seen with the RHEL3 patch). But after running the tests I am unable to unmount the local disks, so I suspect a reference count leak somewhere.
I ran the test case against the latest kernel, 2.6.18-169, and had some success. I don't see any changes to select_parent() or prune_dcache() in that kernel that would fix this problem so I'm surprised by this result.
The reason the test case is passing on more recent kernels is this patch, which was added in 2.6.18-153:

changelog: - [sched] fix cond_resched_softirq() offset (Jesse Larrew) [496935]
patch file: linux-2.6-sched-fix-cond_resched_softirq-offset.patch

It fixes a bug in cond_resched_lock() that prevented the current thread from releasing the spin lock and scheduling out when it should. This function is called from prune_dcache():

static void prune_dcache(int count, struct super_block *sb)
{
	spin_lock(&dcache_lock);
	for (; count ; count--) {
		struct dentry *dentry;
		struct list_head *tmp;
		struct rw_semaphore *s_umount;

		cond_resched_lock(&dcache_lock);   <----

		tmp = dentry_unused.prev;
		if (sb) {
			/* Try to find a dentry for this sb, but don't try
			 * too hard, if they aren't near the tail they will
			 * be moved down again soon
			 */
			int skip = count;
			while (skip && tmp != &dentry_unused &&
			    list_entry(tmp, struct dentry, d_lru)->d_sb != sb) {
				skip--;
				tmp = tmp->prev;
			}
		}

cond_resched_lock() exists to work around code that holds a spinlock for too long, and I had assumed it worked as advertised. So the outer for loop in prune_dcache() has in fact contributed to this soft lockup issue. Since the inner while loop (which has no cond_resched_lock()) still exists in the latest RHEL5 kernel, I suspect it is still possible to get these soft lockups, but they are less likely to occur.
While the test case is not causing soft lockups on my test system with recent RHEL5 kernels there are noticeable pauses where the system is not responsive. I think on a system with much more RAM that can cache more dentries there is a chance the inner while loop in prune_dcache() can still cause soft lockups so I think we still need the per-superblock dentry list patch.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
*** Bug 450194 has been marked as a duplicate of this bug. ***
Gave this patch to LLNL for testing as well. They were experiencing the same issue but I was not able to come up with a reproducer and they found a workaround.
Created attachment 368975 [details] Working fix from RHEL3 This is the fix from RHEL3 for the same bug with some minor changes. The fix in RHEL3 has a bug in it that can leak dentries and cause a hard lockup. This problem has been fixed in this patch.
in kernel-2.6.18-174.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
*** Bug 452628 has been marked as a duplicate of this bug. ***
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html