Bug 554574
| Summary: | soft lockups in dcache operations due to shrink_dcache_sb() hogging dcache_lock | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Lachlan McIlroy <lmcilroy> |
| Component: | kernel | Assignee: | Eric Sandeen <esandeen> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 5.5 | CC: | anton, esandeen, ikent, jbacik, jlayton, kgraham, rwheeler, steved, vgaikwad |
| Target Milestone: | rc | | |
| Target Release: | 5.5 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-03-15 02:20:23 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | slabinfo before-and-after soft lockup (385557); Patch to fix spinlock contention in shrink_dcache_sb() (385567) | | |
Description
Lachlan McIlroy, 2010-01-12 02:08:15 UTC

Created attachment 385557 [details]: slabinfo before-and-after soft lockup
Caught another one from a system still running the 2.6.18-174.el5 test kernel. Attached is a capture of slabinfo from before (trimmed to ~10 minutes prior, but I have data going further back if needed) and after the soft lockup was detected.
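A capture like the attached one can be collected with a small loop; the sketch below is only illustrative (the interval and log path are arbitrary, not what was actually used for the attached data):

```sh
#!/bin/bash
# Sample the dentry/inode slab counters and dentry/inode state once a
# minute so there is before/after data when a soft lockup hits.
while :; do
    date
    grep '^inod\|den' /proc/slabinfo
    cat /proc/sys/fs/inode-state /proc/sys/fs/dentry-state
    echo
    sleep 60
done >> /var/tmp/dcache-slabinfo.log
```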
There's almost 60 million entries in the dentry cache - no wonder it's having problems.

I have a test kernel with version kernel-2.6.18-183.el5.it345190.1 that includes a fix for spinlock contention on the dcache lock in the shrink_dcache_sb() function. Can you give it a try? The kernel is available from here: http://people.redhat.com/~lmcilroy/bz554574/

Lachlan

Created attachment 385567 [details]: Patch to fix spinlock contention in shrink_dcache_sb().
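For orientation, the single-pass idea discussed below has roughly the following shape. This is an illustrative sketch written as if inside fs/dcache.c on the 2.6.18 code base, not the attached patch; helper behaviour (prune_one_dentry() returning with dcache_lock re-taken) is assumed from those sources and details are simplified:

```c
/*
 * Sketch only: walk the global dentry_unused LRU once under dcache_lock,
 * splice this superblock's unused dentries onto a private list, then
 * prune from that private list.  dentry_unused, dentry_stat and
 * prune_one_dentry() are local to fs/dcache.c.
 */
static void shrink_dcache_sb_single_pass(struct super_block *sb)
{
	LIST_HEAD(sb_unused);
	struct dentry *dentry, *next;

	spin_lock(&dcache_lock);

	/* Pass over dentry_unused exactly once. */
	list_for_each_entry_safe(dentry, next, &dentry_unused, d_lru) {
		if (dentry->d_sb == sb)
			list_move_tail(&dentry->d_lru, &sb_unused);
	}

	/* Prune from the private list; no restart of the global scan. */
	while (!list_empty(&sb_unused)) {
		dentry = list_entry(sb_unused.prev, struct dentry, d_lru);
		list_del_init(&dentry->d_lru);
		dentry_stat.nr_unused--;

		spin_lock(&dentry->d_lock);
		if (atomic_read(&dentry->d_count)) {
			/* Re-referenced in the meantime; leave it alone. */
			spin_unlock(&dentry->d_lock);
			continue;
		}
		prune_one_dentry(dentry);	/* drops d_lock, re-takes dcache_lock */
		cond_resched_lock(&dcache_lock);
	}

	spin_unlock(&dcache_lock);
}
```

The point of the private list is that once this superblock's dentries are off the global LRU, dropping dcache_lock in prune_one_dentry() or cond_resched_lock() no longer forces a rescan of the whole dentry_unused list.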
Ok, there are two fixes for this issue:

1) easy to backport but possibly not complete for huge numbers of entries:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=37c42524d6090644206ae6d310d7e830bd3ccb47

2) complete but large backport:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=da3bbdd4632c0171406b2677e31494afa5bde2f8

The second patch should solve the issue completely. Ccing some FS folks.

I tried pushing patch 2 into RHEL 5.4 to fix BZ 526612 but it got knocked back in review due to the kABI changes to the superblock structure. Patch 1 should help in most cases but there is still the risk that another shrink_dcache_sb() will race when the dcache_lock is released in prune_one_dentry() or cond_resched_lock(). In that case it won't be a few new dentries we have to skip but a whole filesystem's worth that has just been moved to the tail. The patch I attached to this BZ doesn't have that problem since it only scans the dentry_unused list once.

It'd be a bit of a hack, but we could possibly put the list in/on the ext2/ext3/ext4 private superblocks instead, with appropriate tests etc., to help resolve the issue for those filesystems...

Well, let me look at Lachlan's proposed patch again; Josef reminds me that we can probably play tricks with #ifdefs etc., right...

Ok, I should have reread the whole thread through again before commenting. It seems that we mostly psyched ourselves out on Lachlan's big patch, but there are probably some valid concerns that are hard to pin down, as they relate to what 3rd party filesystems -could- possibly do...

Josef, I think Lachlan's patch attached in comment #4 is a forward-port of what you did for RHEL 3 - would you mind looking at it with an eye towards this problem on RHEL 5? It seems to me that it should certainly help us be more efficient in shrink_dcache_sb(), at least.

Do we have a decent testcase for this problem so that we can evaluate the effectiveness of the patch?

Thanks,
-Eric

(In reply to comment #3)
> There's almost 60 million entries in the dentry cache - no wonder it's having
> problems.

Right, but:

inode_cache        6186     6335
dentry_cache   59704364 59704488

What's going on here? Looks like there are truck loads of dentrys with no inode.

(In reply to comment #11)
> Right, but:
>
> inode_cache        6186     6335
> dentry_cache   59704364 59704488
>
> What's going on here?
> Looks like there are truck loads of dentrys with no inode.

Good point. I suspect these dentries have been put on the dentry_unused list after the file was deleted (and therefore unlinked from the dentry). Or maybe we have a leak.

(In reply to comment #8)
> It'd be a bit of a hack, but we could possibly put the list in/on the
> ext2/ext3/ext4 private superblocks instead, with appropriate tests etc., to help
> resolve the issue for those filesystems...

It would have to work for NFS too since this issue is primarily seen with NFS when using autofs.

(In reply to comment #10)
> Ok, I should have reread the whole thread through again before commenting. It
> seems that we mostly psyched ourselves out on Lachlan's big patch, but there
> are probably some valid concerns that are hard to pin down, as they relate to
> what 3rd party filesystems -could- possibly do...
>
> Josef, I think Lachlan's patch attached in comment #4 is a forward-port of what
> you did for RHEL 3 - would you mind looking at it with an eye towards this
> problem on RHEL 5? It seems to me that it should certainly help us be more
> efficient in shrink_dcache_sb(), at least.

There's a patch for this in RHEL 3? You might be thinking of https://bugzilla.redhat.com/show_bug.cgi?id=413731, which got forward ported to RHEL 5 in https://bugzilla.redhat.com/show_bug.cgi?id=526612. It's a very similar problem but in a different bit of code.

> Do we have a decent testcase for this problem so that we can evaluate the
> effectiveness of the patch?

I haven't had any luck yet. It may be that I simply don't have enough memory to cache so many dentries, or maybe we need to find a way to get multiple threads to call shrink_dcache_sb() simultaneously.

(In reply to comment #12)
> Good point. I suspect these dentries have been put on the dentry_unused list
> after the file was deleted (and therefore unlinked from the dentry). Or maybe
> we have a leak.

It looks suspect: shrink_dcache_sb() -> prune_one_dentry() unconditionally calls d_free() after dentry_iput(), which is when the dentry becomes negative during the release process (I'm assuming that the rcu list is actually getting a chance to free these guys after d_free() is called). We'd need to look at all other cases, but I believe negative dentrys are only possible where the dentry has a positive reference count, is not actually instantiated, and the reference count is not decremented.

Do we have any chance of finding out more about these negative dentrys? A kdump perhaps, or some info gathering printks in a test kernel?

(In reply to comment #14)
> > Do we have a decent testcase for this problem so that we can evaluate the
> > effectiveness of the patch?
>
> I haven't had any luck yet. It may be that I simply don't have enough memory
> to cache so many dentries or maybe we need to find a way to get multiple
> threads to call shrink_dcache_sb simultaneously.

It would be useful to inspect the inode_cache and dentry_cache usage when trying to reproduce this. If we don't see anomalous values there, we probably haven't actually reproduced the problem.

> Do we have any chance of finding out more about these negative
> dentrys? A kdump perhaps or some info gathering printks in a
> test kernel?

We don't have a vmcore from this customer. I have already provided one test kernel (with the proposed fix attached to this BZ) but I don't think we've got any feedback yet.

Could we get the information with a systemtap module?
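Whether via SystemTap or the info-gathering printks suggested above, the data wanted here is easy to describe. A rough sketch of what such a counter might look like in a test kernel follows; it is illustrative only (hypothetical helper name, written as if inside fs/dcache.c, not a patch anyone posted to this BZ):

```c
/*
 * Illustrative info-gathering sketch for a test kernel: sample the head
 * of dentry_unused and report how many entries are negative, how many
 * still have a nonzero d_count, and whether the negative ones share one
 * filesystem type.  The walk is bounded so dcache_lock is not held for
 * the whole ~60M-entry list.
 */
static void dump_dentry_unused_stats(void)
{
	struct dentry *dentry;
	const char *first_fs = NULL;
	unsigned long total = 0, negative = 0, busy = 0, same_fs = 0;

	spin_lock(&dcache_lock);
	list_for_each_entry(dentry, &dentry_unused, d_lru) {
		if (total++ >= 1000000)		/* a sample is enough */
			break;
		if (atomic_read(&dentry->d_count))
			busy++;
		if (dentry->d_inode)
			continue;
		negative++;
		if (!first_fs)
			first_fs = dentry->d_sb->s_type->name;
		if (first_fs == dentry->d_sb->s_type->name)
			same_fs++;
	}
	spin_unlock(&dcache_lock);

	printk(KERN_INFO "dcache: dentry_unused sample=%lu negative=%lu "
	       "busy=%lu, %lu negative on '%s'\n",
	       total, negative, busy, same_fs,
	       first_fs ? first_fs : "none");
}
```

Even a bounded sample like this would answer whether the entries are truly negative and whether one filesystem type dominates.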
(In reply to comment #17)
> > Do we have any chance of finding out more about these negative
> > dentrys? A kdump perhaps or some info gathering printks in a
> > test kernel?
>
> We don't have a vmcore from this customer. I have already provided one test
> kernel (with the proposed fix attached to this BZ) but I don't think we've got
> any feedback yet.
>
> Could we get the information with a systemtap module?

Don't know, I haven't really worked with systemtap. What I think would be useful to know is whether these dentrys are really negative, what their d_count is and what fs they belong to, to see if they all belong to a particular fs.

(In reply to comment #17)
> We don't have a vmcore from this customer. I have already provided one test
> kernel (with the proposed fix attached to this BZ) but I don't think we've got
> any feedback yet.

This sounds a lot like RHEL-4 bug 501565, except it's dentrys instead of inodes. If we think that reducing the list traversal size is all that needs to be done to fix this then the proposed fix here doesn't seem quite enough. To reduce the dentry list size without still needing a full list traversal we would need to add a per-superblock list for the unused dentrys, similar to the inode list of bug 501565. Of course that can never fly because it's a kABI breaker, as are the inode list patches for 501565.

(In reply to comment #14)
> > Josef, I think Lachlan's patch attached in comment #4 is a forward-port of what
> > you did for RHEL 3 - would you mind looking at it with an eye towards this
> > problem on RHEL 5? It seems to me that it should certainly help us be more
> > efficient in shrink_dcache_sb(), at least.
>
> There's a patch for this in RHEL 3?

I guess I was thinking of the fix for #413731, which is a similar flavor but not quite the same. In re: to c#10, it's like the fix for RHEL 3 (which was already forward ported to RHEL 5), but it covers the other side of things, when the VM decides to clear things out. Since it's a proven method of mitigating the unmount problem I don't see why it wouldn't also be an effective way of fixing this particular problem, tho I think I'd rather wait for feedback from the customer since we can't easily reproduce this problem.

> tho I think I'd rather wait for feedback from the customer since we
> can't easily reproduce this problem.

Present. Sorry for leaving this hanging.

> I suspect these dentries have been put on the dentry_unused list after the
> file was deleted (and therefore unlinked from the dentry). Or maybe
> we have a leak.

It seems suspect. The machine I was reproducing on has ~72k inodes in use across 3 ext3's. The NFS filesystem with every home directory adds ~1.1m. Over 50m of transient usage seems fishy at best. What data are best to confirm/deny this and move forward? With or without the patched kernel from comment 3 (which I shamefully haven't tried yet)?

For what it's worth, the workload from comment 0 and comment 1 is a hacked-up Nagios instance that will tickle the lockup on restart.

We also had a cluster of Apache instances doing mostly static file delivery off NFS with bug 526612 updates that are also still seeing this. The pattern is consistent: ~60m entries in dentry_cache without a comparable inode_cache (as a point of reference, these hosts do have about 28m inodes in mounted filesystems).
$ grep '^inod\|den' /proc/slabinfo
inode_cache         1677     1694   560  7 1 : tunables  54 27 8 : slabdata     242     242 0
dentry_cache    65085024 65096424   216 18 1 : tunables 120 60 8 : slabdata 3616468 3616468 0
$
$ grep '^inod\|den' /proc/slabinfo
inode_cache         1689     1708   560  7 1 : tunables  54 27 8 : slabdata     244     244 0
dentry_cache    62927063 63320094   216 18 1 : tunables 120 60 8 : slabdata 3517783 3517783 0
$
$ grep '^inod\|den' /proc/slabinfo
inode_cache         1694     1694   560  7 1 : tunables  54 27 8 : slabdata     242     242 0
dentry_cache    59596668 59596668   216 18 1 : tunables 120 60 8 : slabdata 3310926 3310926 0
$

(In reply to comment #23)
> For what it's worth, the workload from comment 0 and comment 1 is a hacked-up
> Nagios instance that will tickle the lockup on restart.
>
> We also had a cluster of Apache instances doing mostly static file delivery off
> NFS with bug 526612 updates that are also still seeing this. The pattern is
> consistent: ~60m entries in dentry_cache without a comparable inode_cache (as a
> point of reference, these hosts do have about 28m inodes in mounted filesystems).
>
> $ grep '^inod\|den' /proc/slabinfo
> inode_cache 1677 1694 560 7 1 : tunables 54 27 8 : slabdata 242 242 0
> dentry_cache 65085024 65096424 216 18 1 : tunables 120 60 8 : slabdata 3616468 3616468 0
> $

There has to be something broken here, and it sounds like you're saying the NFS client may be a candidate, but we still need to confirm that. I think that at this stage we are still trying to narrow down potential places to look for the problem.

This bug is fairly recent so there must have been a point in the past when this wasn't happening, or other machines with older kernels not showing the problem?

If this is something that happens after a dentry has been made positive and is now on the way to being destroyed then there are a fairly small number of places to look in the VFS. It could be a filesystem causing it but that would be more likely to result in an oops at some point before the cache grows this large. It is more likely dentrys are not being instantiated upon creation and the reference count is not being decremented for some reason. A missing dput() in a patch could cause this.

There are a couple of things we need to do. Look at the kernel version history to see if we can identify a time where this started to happen. Find out what we have in the dentry cache: check if the dentrys are really negative, what their d_count is, if there is a particular fs they belong to and if they are hashed or not. Probably the best way to do this is to reproduce on a test machine and get a kdump so we can have a look.

(In reply to comment #24)
> > $ grep '^inod\|den' /proc/slabinfo
> > inode_cache 1677 1694 560 7 1 : tunables 54 27 8 : slabdata 242 242 0
> > dentry_cache 65085024 65096424 216 18 1 : tunables 120 60 8 : slabdata 3616468 3616468 0
> > $

Oh ... hang on, maybe I'm talking nonsense. What are the values in /proc/sys/fs/dentry-state and /proc/sys/fs/inode-state?

> What are the values in /proc/sys/fs/dentry-state [...]

$ cat /proc/sys/fs/{inode,dentry}-state
44763 12 0 0 0 0 0
64803227 64786025 45 0 0 0
$ grep '^inod\|den' /proc/slabinfo
inode_cache         1745     1785   560  7 1 : tunables  54 27 8 : slabdata     255     255 0
dentry_cache    64799130 64845090   216 18 1 : tunables 120 60 8 : slabdata 3602505 3602505 0
$

> This bug is fairly recent so there must have been a point in the past when this
> wasn't happening, or other machines with older kernels not showing the problem?
These are actually complete workload groups -- there's the one loner, and the 3 covered in comment 23 are all of them. These just started getting load ~8 months ago and have seen the lockups since. (Most machines with similar workloads in our environment are still RHEL 4, which might explain the latent discovery.)

> it sounds like you're saying the NFS client may be a candidate

Only because it's the only file system getting meaningful activity (...and since you're the one asking, Ian, it's all autofs'y as well).

(In reply to comment #26)
> > What are the values in /proc/sys/fs/dentry-state [...]
>
> $ cat /proc/sys/fs/{inode,dentry}-state
> 44763 12 0 0 0 0 0
> 64803227 64786025 45 0 0 0

Phew, we're still in business.

> $ grep '^inod\|den' /proc/slabinfo
> inode_cache 1745 1785 560 7 1 : tunables 54 27 8 : slabdata 255 255 0
> dentry_cache 64799130 64845090 216 18 1 : tunables 120 60 8 : slabdata 3602505 3602505 0
> $
>
> > This bug is fairly recent so there must have been a point in the past when this
> > wasn't happening, or other machines with older kernels not showing the problem?
>
> These are actually complete workload groups -- there's the one loner, and the 3
> covered in comment 23 are all of them. These just started getting load ~8 months
> ago and have seen the lockups since. (Most machines with similar workloads in our
> environment are still RHEL 4, which might explain the latent discovery.)

Bummer.

> > it sounds like you're saying the NFS client may be a candidate
>
> Only because it's the only file system getting meaningful activity (...and since
> you're the one asking, Ian, it's all autofs'y as well).

Hahaha, yep it could be autofs, and if we can verify that then that's fine and we should be able to fix it. In the kernel you're using, in fact since kernel rev 106, autofs dentrys exist and are negative only during directory creation (between the kernel lookup and the following kernel mkdir); they never go negative after that until they are released, assuming the VFS isn't doing something it isn't supposed to. So that narrows down the code to check quite a lot. I'm fairly confident it isn't autofs but I'll certainly have a look.

I see FS guys are paying attention to this. Reassigning to default.

Jiri, Lachlan has been working on the fix for this & keeping the others in the FS team involved as needed. Moving it back to him, thanks!

This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.6 and Red Hat does not plan to fix this issue in the currently developed update. Contact your manager or support representative in case you need to escalate this bug.

We don't have enough information to make progress here. Please reopen if this continues to be an issue and we can try to gather fresh data. Thanks!