Description of problem:

Customer is seeing regular soft lockup messages due to contention on the dcache_lock.

Linux 2.6.18-92.el5 #1 SMP x86_64
Intel(R) Xeon(R) CPU X7350 @ 2.93GHz
HP DL580G5
NAS Storage (EMC)

They use NAS-attached storage across 8 servers. These servers hold the customer's multimedia data, so there are many upload and download operations and a lot of files are left behind. They manually remove unnecessary files every day. Because there are so many small files, it sometimes takes more than half a day to remove them all. After this removal job, they sometimes see the following messages in /var/log/messages, and afterwards socket communication fails intermittently:

BUG: soft lockup - CPU#10 stuck for 10s! [M1IF:3779]

Call Trace:
 [<ffffffff8000c31a>] _atomic_dec_and_lock+0x39/0x57
 [<ffffffff8000d052>] dput+0x2c/0x114
 [<ffffffff80022b12>] path_release+0xc/0x24
 [<ffffffff800285ff>] vfs_stat_fd+0x3c/0x4a
 [<ffffffff8003aa41>] hrtimer_try_to_cancel+0x4a/0x53
 [<ffffffff8005a21d>] hrtimer_cancel+0xc/0x16
 [<ffffffff80063c5d>] do_nanosleep+0x47/0x70
 [<ffffffff8005a10a>] hrtimer_nanosleep+0x58/0x118
 [<ffffffff800233f9>] sys_newstat+0x19/0x31
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

I checked each of these functions and found they were all stuck trying to take the dcache_lock. Sometimes the system recovers without any serious problem, but other times it hangs or the applications suffer communication problems. In a typical day they remove about 25 million files, roughly 100 GB in total.

Version-Release number of selected component (if applicable):
RHEL5.2 kernel-2.6.18-92.EL

How reproducible:
Customer sees these soft-lockup messages almost daily.

Steps to Reproduce:
Still working on a test case.
I think this call trace is key to this problem:

 [<ffffffff8002e7d7>] prune_dcache+0x2d/0x149
 [<ffffffff8004cd30>] shrink_dcache_parent+0x3f/0xe1
 [<ffffffff8004cd0d>] shrink_dcache_parent+0x1c/0xe1
 [<ffffffff883f541b>] :nfs:nfs_lookup_revalidate+0x35c/0x416
 [<ffffffff88403931>] :nfs:nfs3_proc_access+0x12f/0x18e
 [<ffffffff80064c47>] .text.lock.kernel_lock+0x5/0x1f
 [<ffffffff80009523>] __d_lookup+0xb0/0xff
 [<ffffffff8000cc5e>] do_lookup+0x17d/0x1d4
 [<ffffffff80009918>] __link_path_walk+0x3a6/0xf42
 [<ffffffff8000e782>] link_path_walk+0x5c/0xe5
 [<ffffffff8000c965>] do_path_lookup+0x270/0x2e8
 [<ffffffff800234db>] __path_lookup_intent_open+0x56/0x97
 [<ffffffff8001a954>] open_namei+0x73/0x6d5
 [<ffffffff8002732a>] do_filp_open+0x1c/0x38
 [<ffffffff80019720>] do_sys_open+0x44/0xbe
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

What's happening here is that an NFS client has tried to look up a file and either the lookup failed or the inode was found to be stale. When this happens the inode is revalidated, and if it is a directory all of its dentries are invalidated via shrink_dcache_parent() and prune_dcache(). Both of these functions can hold the dcache_lock for extended periods if there are a significant number of dentries to be scanned.

From the sosreport we can see that there are 1.2 million dentries in the cache:

$ cat slabinfo | grep dentry_cache
dentry_cache  1204176 1204956  216  18  1 : tunables  120  60  8 : slabdata  66912  66942  360

This issue looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=450194, which didn't get resolved but does provide a few more clues, including this patch in mainline:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=da3bbdd4632c0171406b2677e31494afa5bde2f8

It also looks to be the same as https://bugzilla.redhat.com/show_bug.cgi?id=413731, which does have a patch that just might work too.
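For a quick check of dentry cache pressure without parsing slabinfo, the standard /proc/sys/fs/dentry-state interface can be read directly. This is a sketch of how one could watch the cache size while reproducing; the first field is the total number of dentries and the second is the number of unused dentries on the LRU list.

```shell
# Report the total dentry count and the unused (LRU) dentry count.
# On an affected system the unused count climbs into the millions
# during the mass-removal workload.
awk '{ printf "total=%s unused=%s\n", $1, $2 }' /proc/sys/fs/dentry-state
```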
I've narrowed the problem down to this specific bit of code.

In shrink_dcache_parent() we call select_parent() to move all the child dentries of the parent dentry onto the end of the unused dentry list. It also returns the number of dentries moved. This count can be very high (as high as 50,000 has been seen before).

void shrink_dcache_parent(struct dentry * parent)
{
	int found;

	while ((found = select_parent(parent)) != 0)
		prune_dcache(found, parent->d_sb);
}

We then call into prune_dcache() and start scanning backwards from the end of the unused dentry list (where select_parent() just put some dentries).

static void prune_dcache(int count, struct super_block *sb)
{
	spin_lock(&dcache_lock);
	for (; count ; count--) {
		struct dentry *dentry;
		struct list_head *tmp;
		struct rw_semaphore *s_umount;

		cond_resched_lock(&dcache_lock);

		tmp = dentry_unused.prev;
		if (sb) {
			/* Try to find a dentry for this sb, but don't try
			 * too hard, if they aren't near the tail they will
			 * be moved down again soon
			 */
			int skip = count;
			while (skip && tmp != &dentry_unused &&
			    list_entry(tmp, struct dentry, d_lru)->d_sb != sb) {
				skip--;
				tmp = tmp->prev;
			}
		}

The outer for loop executes count times and periodically releases the dcache_lock in cond_resched_lock(), so I think that part is playing fair and is not the problem. The problem is the inner while loop, which can traverse a long list without ever releasing the dcache_lock.

In between calling select_parent() and prune_dcache() (and also while prune_dcache() drops the dcache_lock) more dentries can be put on the list. If the system is removing many files it could easily add another 50,000 or so onto the end of the unused list. This can result in the inner loop traversing a list of count dentries for every iteration of the outer loop, so the total number of iterations is closer to count squared. It's no wonder this bit of code is getting very busy.
There are many approaches that could be taken to fix this, including the fix from https://bugzilla.redhat.com/show_bug.cgi?id=413731, which has select_parent() compile a private list of dentries for prune_dcache() to purge. That way the unused list never has to be scanned again, which removes the inner while loop entirely. I'm sure this will fix the issue.
Created attachment 363274 [details] Fix from RHEL3 This is the patch that was used to fix the same problem in RHEL3. I've just updated it to 2.6.18-92.
Looking at the patch that was tried in BZ 450194 I can see why it didn't work. The patch was from here: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=da3bbdd4632c0171406b2677e31494afa5bde2f8 While this patch removes the problematic inner while loop from prune_dcache(), it replaces it with another one in __shrink_dcache_sb(). That list will probably be shorter, though, since it only deals with dentries for a single filesystem rather than the global unused list.
This patch fixes the problem with the patch used in BZ 450194; together, these two patches should be enough to fix this issue. http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f3c6ba986ab4527b6dfacf9f3b9e40f72466a8b2
Might also need this patch: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=c490d79bb70c549e096099576b1df40a8810b0d8
I have a reliable test case now.

The test requires multiple NFS client filesystems, each big enough to create about 100,000 files. The NFS clients must mount different exports, since I found that if I mount the same export many times the test doesn't work (because the NFS clients all share the same superblock value). The NFS clients and server can be on the same system. Here's what I did:

I used a 300GB disk and split it up into 8 logical partitions. I chose 8 partitions because I have 8 cores in my test system and I wanted one for each core. This allows all the cores to contend on the dcache_lock.

Disk /dev/sdb: 300.0 GB, 300000000000 bytes
255 heads, 63 sectors/track, 36472 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       36472   292961308+   5  Extended
/dev/sdb5               1        4500    36146187   83  Linux
/dev/sdb6            4501        9000    36146218+  83  Linux
/dev/sdb7            9001       13500    36146218+  83  Linux
/dev/sdb8           13501       18000    36146218+  83  Linux
/dev/sdb9           18001       22500    36146218+  83  Linux
/dev/sdb10          22501       27000    36146218+  83  Linux
/dev/sdb11          27001       31500    36146218+  83  Linux
/dev/sdb12          31501       36000    36146218+  83  Linux

Make all the filesystems:

for i in `seq 5 12`
do
	mkfs.ext3 /dev/sdb$i
done

Mount the filesystems:

for i in `seq 5 12`
do
	mount /dev/sdb$i /mnt/sdb$i
	mount -t nfs 10.64.176.213:/mnt/sdb$i /mnt/nfs$i
done

For each filesystem we need about 100,000 files. Fewer might work but I haven't found the magic number yet. ext3 is limited to about 32,000 subdirectories per directory, so I create 1,000 directories and then a further 100 in each one. All these files are under a root directory called 'test' that the test will operate on to trigger the problem. This script creates the files on the NFS server since that's quicker. The test does not remove the files, so this step only has to be done once.

for i in `seq 5 12`
do
	cd /mnt/sdb$i
	mkdir test
	cd test
	for d in `seq -w 1 1000`
	do
		mkdir dir_$d
		if [ $? -ne 0 ]
		then
			exit 1
		fi
		for j in `seq -w 1 100`
		do
			mkdir dir_$d/dir_$j
			if [ $? -ne 0 ]
			then
				exit 1
			fi
		done
	done
done

The next step is to prime the dentry cache with the NFS client filesystems. This needs to be done each time the test is run.

for i in `seq 5 12`
do
	( cd /mnt/nfs$i; find . > /dev/null ) &
done
wait

The 'wait' lets me know when it's done.

The final step is to get all the NFS clients to call shrink_dcache_parent() at the same time on the root 'test' directory - this will cause each NFS client to purge all of its 100,000 dentries. Getting all the clients to call shrink_dcache_parent() from nfs_lookup_revalidate() is too difficult - it requires timing and luck. Instead we use the fact that the NFS rename operation calls shrink_dcache_parent() on the old file. Fire off all 8 processes into the background and sleep for a few seconds to ensure they all hit the rename at the same time. I rename the test directory back again so it's ready to run again.

for i in `seq 5 12`
do
	( cd /mnt/nfs$i; sleep 5; mv test test1; mv test1 test ) &
done

This test will cause my system to lock up for up to 30 minutes and generate over 600 soft lockup warnings.
I ran this test with the attached patch (adapted from RHEL3) and there were no soft lockup messages.
After running the test a few times with the RHEL3 patch the system eventually became unresponsive. There were still no soft lockup messages, though. It appears the system ran short on free memory, since kswapd was chewing up 100% CPU before top stopped updating, and slabtop showed over 800,000 entries in each of the nfs_inode_cache, ext3_inode_cache and buffer_head caches. The dentry cache was down to 2,340 entries, so that got purged successfully. I don't know if this is a new problem introduced by the patch or an existing bug. I'll try the patches from mainline and see if they work better.
Created attachment 364551 [details] Per-superblock dentry list patch This is the result of merging these fixes from mainline:

1. d52b908646b88cb1952ab8c9b2d4423908a23f11
2. 4a0962abd187df29b7d1378b2f372a55667d54c0
3. 24c32d733dd44dbc5b9dcd0b8de58e16fdbeac76
4. 85864e103850962389d77467391bc3da021d070f
5. 6eac3f93f5e6b7256fb20b7608d62ec192da12be
6. 37c42524d6090644206ae6d310d7e830bd3ccb47
7. da3bbdd4632c0171406b2677e31494afa5bde2f8
8. f3c6ba986ab4527b6dfacf9f3b9e40f72466a8b2
9. c490d79bb70c549e096099576b1df40a8810b0d8

Patch 7 provides the per-sb dentry list support, patches 8 and 9 are fixes to that change, and the rest are minor dependencies to avoid hand-merging changes.

This patch survives 5 consecutive runs of the test case without any soft lockups or hard lockups (as was seen with the RHEL3 patch). But after running the tests I am unable to unmount the local disks, so I suspect a reference count leak somewhere.
I ran the test case against the latest kernel, 2.6.18-169, and had some success. I don't see any changes to select_parent() or prune_dcache() in that kernel that would fix this problem so I'm surprised by this result.
The reason the test case is passing on more recent kernels is this patch, which was added in 2.6.18-153:

changelog: - [sched] fix cond_resched_softirq() offset (Jesse Larrew) [496935]
patch file: linux-2.6-sched-fix-cond_resched_softirq-offset.patch

It fixes a bug in cond_resched_lock() that prevented the current thread from releasing the spin lock and scheduling out when it should. This function is called from prune_dcache():

static void prune_dcache(int count, struct super_block *sb)
{
	spin_lock(&dcache_lock);
	for (; count ; count--) {
		struct dentry *dentry;
		struct list_head *tmp;
		struct rw_semaphore *s_umount;

		cond_resched_lock(&dcache_lock);   <----

		tmp = dentry_unused.prev;
		if (sb) {
			/* Try to find a dentry for this sb, but don't try
			 * too hard, if they aren't near the tail they will
			 * be moved down again soon
			 */
			int skip = count;
			while (skip && tmp != &dentry_unused &&
			    list_entry(tmp, struct dentry, d_lru)->d_sb != sb) {
				skip--;
				tmp = tmp->prev;
			}
		}

cond_resched_lock() exists to work around code that holds a spinlock for too long, and I had assumed it worked as advertised. So the outer for loop in prune_dcache() has in fact contributed to this soft lockup issue. Since the inner while loop (which has no cond_resched_lock()) still exists in the latest RHEL5 kernel, I suspect it is still possible to get these soft lockups, but they are less likely to occur.
While the test case is not causing soft lockups on my test system with recent RHEL5 kernels there are noticeable pauses where the system is not responsive. I think on a system with much more RAM that can cache more dentries there is a chance the inner while loop in prune_dcache() can still cause soft lockups so I think we still need the per-superblock dentry list patch.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
*** Bug 450194 has been marked as a duplicate of this bug. ***
Gave this patch to LLNL for testing as well. They were experiencing the same issue but I was not able to come up with a reproducer and they found a workaround.
Created attachment 368975 [details] Working fix from RHEL3 This is the fix from RHEL3 for the same bug with some minor changes. The fix in RHEL3 has a bug in it that can leak dentries and cause a hard lockup. This problem has been fixed in this patch.
in kernel-2.6.18-174.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
*** Bug 452628 has been marked as a duplicate of this bug. ***
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html