Bug 554574
| Summary: | soft lockups in dcache operations due to shrink_dcache_sb() hogging dcache_lock | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Lachlan McIlroy <lmcilroy> |
| Component: | kernel | Assignee: | Eric Sandeen <esandeen> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 5.5 | CC: | anton, esandeen, ikent, jbacik, jlayton, kgraham, rwheeler, steved, vgaikwad |
| Target Milestone: | rc | | |
| Target Release: | 5.5 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-03-15 02:20:23 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | slabinfo before-and-after soft lockup (385557); Patch to fix spinlock contention in shrink_dcache_sb() (385567) | | |
Description
Lachlan McIlroy, 2010-01-12 02:08:15 UTC

Created attachment 385557 [details]: slabinfo before-and-after soft lockup
Caught another one from a system still running the 2.6.18-174.el5 test kernel. Attached is a capture of slabinfo from before (trimmed to ~10 minutes prior, but I have data going further back if needed) and after the soft lockup was detected.
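A capture like the attached one can be collected with a small loop; the sketch below is only illustrative (the interval and log path are arbitrary, not what was actually used for the attached data):

```sh
#!/bin/bash
# Sample the dentry/inode slab counters and dentry/inode state once a
# minute so there is before/after data when a soft lockup hits.
while :; do
    date
    grep '^inod\|den' /proc/slabinfo
    cat /proc/sys/fs/inode-state /proc/sys/fs/dentry-state
    echo
    sleep 60
done >> /var/tmp/dcache-slabinfo.log
```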
There's almost 60 million entries in the dentry cache - no wonder it's having problems.

I have a test kernel with version kernel-2.6.18-183.el5.it345190.1 that includes a fix for spinlock contention on the dcache lock in the shrink_dcache_sb() function. Can you give it a try? The kernel is available from here: http://people.redhat.com/~lmcilroy/bz554574/

Lachlan

Created attachment 385567 [details]: Patch to fix spinlock contention in shrink_dcache_sb().
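For orientation, the single-pass idea discussed below has roughly the following shape. This is an illustrative sketch written as if inside fs/dcache.c on the 2.6.18 code base, not the attached patch; helper behaviour (prune_one_dentry() returning with dcache_lock re-taken) is assumed from those sources and details are simplified:

```c
/*
 * Sketch only: walk the global dentry_unused LRU once under dcache_lock,
 * splice this superblock's unused dentries onto a private list, then
 * prune from that private list.  dentry_unused, dentry_stat and
 * prune_one_dentry() are local to fs/dcache.c.
 */
static void shrink_dcache_sb_single_pass(struct super_block *sb)
{
	LIST_HEAD(sb_unused);
	struct dentry *dentry, *next;

	spin_lock(&dcache_lock);

	/* Pass over dentry_unused exactly once. */
	list_for_each_entry_safe(dentry, next, &dentry_unused, d_lru) {
		if (dentry->d_sb == sb)
			list_move_tail(&dentry->d_lru, &sb_unused);
	}

	/* Prune from the private list; no restart of the global scan. */
	while (!list_empty(&sb_unused)) {
		dentry = list_entry(sb_unused.prev, struct dentry, d_lru);
		list_del_init(&dentry->d_lru);
		dentry_stat.nr_unused--;

		spin_lock(&dentry->d_lock);
		if (atomic_read(&dentry->d_count)) {
			/* Re-referenced in the meantime; leave it alone. */
			spin_unlock(&dentry->d_lock);
			continue;
		}
		prune_one_dentry(dentry);	/* drops d_lock, re-takes dcache_lock */
		cond_resched_lock(&dcache_lock);
	}

	spin_unlock(&dcache_lock);
}
```

The point of the private list is that once this superblock's dentries are off the global LRU, dropping dcache_lock in prune_one_dentry() or cond_resched_lock() no longer forces a rescan of the whole dentry_unused list.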
Ok, there are two fixes for this issue:

1) easy to backport but possibly not complete for huge numbers of entries:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=37c42524d6090644206ae6d310d7e830bd3ccb47

2) complete but large backport:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=da3bbdd4632c0171406b2677e31494afa5bde2f8

The second patch should solve the issue completely. Ccing some FS folks.

I tried pushing patch 2 into RHEL 5.4 to fix BZ 526612 but it got knocked back in review due to the kABI changes to the superblock structure. Patch 1 should help in most cases but there is still the risk that another shrink_dcache_sb() will race when the dcache_lock is released in prune_one_dentry() or cond_resched_lock(). In that case it won't be a few new dentries we have to skip but a whole filesystem's worth that has just been moved to the tail. The patch I attached to this BZ doesn't have that problem since it only scans the dentry_unused list once.

It'd be a bit of a hack, but we could possibly put the list in/on the ext2/ext3/ext4 private superblocks instead, with appropriate tests etc., to help resolve the issue for those filesystems...

Well, let me look at Lachlan's proposed patch again; Josef reminds me that we can probably play tricks with #ifdefs etc., right...

Ok, I should have reread the whole thread through again before commenting. It seems that we mostly psyched ourselves out on Lachlan's big patch, but there are probably some valid concerns that are hard to pin down, as they relate to what 3rd party filesystems -could- possibly do...

Josef, I think Lachlan's patch attached in comment #4 is a forward-port of what you did for RHEL 3 - would you mind looking at it with an eye towards this problem on RHEL 5? It seems to me that it should certainly help us be more efficient in shrink_dcache_sb(), at least.

Do we have a decent testcase for this problem so that we can evaluate the effectiveness of the patch?

Thanks,
-Eric

(In reply to comment #3)
> There's almost 60 million entries in the dentry cache - no wonder it's having
> problems.

Right, but:

inode_cache        6186     6335
dentry_cache   59704364 59704488

What's going on here? Looks like there are truck loads of dentrys with no inode.

(In reply to comment #11)
> Right, but:
>
> inode_cache        6186     6335
> dentry_cache   59704364 59704488
>
> What's going on here?
> Looks like there are truck loads of dentrys with no inode.

Good point. I suspect these dentries have been put on the dentry_unused list after the file was deleted (and therefore unlinked from the dentry). Or maybe we have a leak.

(In reply to comment #8)
> It'd be a bit of a hack, but we could possibly put the list in/on the
> ext2/ext3/ext4 private superblocks instead, with appropriate tests etc., to help
> resolve the issue for those filesystems...

It would have to work for NFS too since this issue is primarily seen with NFS when using autofs.

(In reply to comment #10)
> Ok, I should have reread the whole thread through again before commenting. It
> seems that we mostly psyched ourselves out on Lachlan's big patch, but there
> are probably some valid concerns that are hard to pin down, as they relate to
> what 3rd party filesystems -could- possibly do...
>
> Josef, I think Lachlan's patch attached in comment #4 is a forward-port of what
> you did for RHEL 3 - would you mind looking at it with an eye towards this
> problem on RHEL 5? It seems to me that it should certainly help us be more
> efficient in shrink_dcache_sb(), at least.

There's a patch for this in RHEL 3? You might be thinking of https://bugzilla.redhat.com/show_bug.cgi?id=413731, which got forward ported to RHEL 5 in https://bugzilla.redhat.com/show_bug.cgi?id=526612. It's a very similar problem but in a different bit of code.

> Do we have a decent testcase for this problem so that we can evaluate the
> effectiveness of the patch?

I haven't had any luck yet. It may be that I simply don't have enough memory to cache so many dentries, or maybe we need to find a way to get multiple threads to call shrink_dcache_sb() simultaneously.

(In reply to comment #12)
> Good point. I suspect these dentries have been put on the dentry_unused list
> after the file was deleted (and therefore unlinked from the dentry). Or maybe
> we have a leak.

It looks suspect: shrink_dcache_sb() -> prune_one_dentry() unconditionally calls d_free() after dentry_iput(), which is when the dentry becomes negative during the release process (I'm assuming that the rcu list is actually getting a chance to free these guys after d_free() is called). We'd need to look at all other cases, but I believe negative dentrys are only possible where the dentry has a positive reference count, is not actually instantiated, and the reference count is not decremented.

Do we have any chance of finding out more about these negative dentrys? A kdump perhaps, or some info gathering printks in a test kernel?

(In reply to comment #14)
> > Do we have a decent testcase for this problem so that we can evaluate the
> > effectiveness of the patch?
>
> I haven't had any luck yet. It may be that I simply don't have enough memory
> to cache so many dentries or maybe we need to find a way to get multiple
> threads to call shrink_dcache_sb simultaneously.

It would be useful to inspect the inode_cache and dentry_cache usage when trying to reproduce this. If we don't see anomalous values there, we probably haven't actually reproduced the problem.

> Do we have any chance of finding out more about these negative
> dentrys? A kdump perhaps or some info gathering printks in a
> test kernel?

We don't have a vmcore from this customer. I have already provided one test kernel (with the proposed fix attached to this BZ) but I don't think we've got any feedback yet.

Could we get the information with a systemtap module?
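Whether via SystemTap or the info-gathering printks suggested above, the data wanted here is easy to describe. A rough sketch of what such a counter might look like in a test kernel follows; it is illustrative only (hypothetical helper name, written as if inside fs/dcache.c, not a patch anyone posted to this BZ):

```c
/*
 * Illustrative info-gathering sketch for a test kernel: sample the head
 * of dentry_unused and report how many entries are negative, how many
 * still have a nonzero d_count, and whether the negative ones share one
 * filesystem type.  The walk is bounded so dcache_lock is not held for
 * the whole ~60M-entry list.
 */
static void dump_dentry_unused_stats(void)
{
	struct dentry *dentry;
	const char *first_fs = NULL;
	unsigned long total = 0, negative = 0, busy = 0, same_fs = 0;

	spin_lock(&dcache_lock);
	list_for_each_entry(dentry, &dentry_unused, d_lru) {
		if (total++ >= 1000000)		/* a sample is enough */
			break;
		if (atomic_read(&dentry->d_count))
			busy++;
		if (dentry->d_inode)
			continue;
		negative++;
		if (!first_fs)
			first_fs = dentry->d_sb->s_type->name;
		if (first_fs == dentry->d_sb->s_type->name)
			same_fs++;
	}
	spin_unlock(&dcache_lock);

	printk(KERN_INFO "dcache: dentry_unused sample=%lu negative=%lu "
	       "busy=%lu, %lu negative on '%s'\n",
	       total, negative, busy, same_fs,
	       first_fs ? first_fs : "none");
}
```

Even a bounded sample like this would answer whether the entries are truly negative and whether one filesystem type dominates.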
(In reply to comment #17)
> > Do we have any chance of finding out more about these negative
> > dentrys? A kdump perhaps or some info gathering printks in a
> > test kernel?
>
> We don't have a vmcore from this customer. I have already provided one test
> kernel (with the proposed fix attached to this BZ) but I don't think we've got
> any feedback yet.
>
> Could we get the information with a systemtap module?

Don't know, I haven't really worked with systemtap. What I think would be useful to know is whether these dentrys are really negative, what their d_count is and what fs they belong to, to see if they all belong to a particular fs.

(In reply to comment #17)
> We don't have a vmcore from this customer. I have already provided one test
> kernel (with the proposed fix attached to this BZ) but I don't think we've got
> any feedback yet.

This sounds a lot like RHEL-4 bug 501565, except it's dentrys instead of inodes. If we think that reducing the list traversal size is all that needs to be done to fix this then the proposed fix here doesn't seem quite enough. To reduce the dentry list size without still needing a full list traversal we would need to add a per-superblock list for the unused dentrys, similar to the inode list of bug 501565. Of course that can never fly because it's a kABI breaker, as are the inode list patches for 501565.

(In reply to comment #14)
> > Josef, I think Lachlan's patch attached in comment #4 is a forward-port of what
> > you did for RHEL 3 - would you mind looking at it with an eye towards this
> > problem on RHEL 5? It seems to me that it should certainly help us be more
> > efficient in shrink_dcache_sb(), at least.
>
> There's a patch for this in RHEL 3?

I guess I was thinking of the fix for #413731, which is a similar flavor but not quite the same. In re: to c#10, it's like the fix for RHEL 3 (which was already forward ported to RHEL 5), but it covers the other side of things, when the VM decides to clear things out. Since it's a proven method of mitigating the unmount problem I don't see why it wouldn't also be an effective way of fixing this particular problem, tho I think I'd rather wait for feedback from the customer since we can't easily reproduce this problem.

> tho I think I'd rather wait for feedback from the customer since we
> can't easily reproduce this problem.

Present. Sorry for leaving this hanging.

> I suspect these dentries have been put on the dentry_unused list after the
> file was deleted (and therefore unlinked from the dentry). Or maybe
> we have a leak.

It seems suspect. The machine I was reproducing on has ~72k inodes in use across 3 ext3's. The NFS filesystem with every home directory adds ~1.1m. Over 50m of transient usage seems fishy at best. What data are best to confirm/deny this and move forward? With or without the patched kernel from comment 3 (which I shamefully haven't tried yet)?

For what it's worth, the workload from comment 0 and comment 1 is a hacked-up Nagios instance that will tickle the lockup on restart.

We also had a cluster of Apache instances doing mostly static file delivery off NFS with bug 526612 updates that are also still seeing this. The pattern is consistent: ~60m entries in dentry_cache without a comparable inode_cache (as a point of reference, these hosts do have about 28m inodes in mounted filesystems).
$ grep '^inod\|den' /proc/slabinfo
inode_cache         1677     1694   560  7 1 : tunables  54 27 8 : slabdata     242     242 0
dentry_cache    65085024 65096424   216 18 1 : tunables 120 60 8 : slabdata 3616468 3616468 0
$
$ grep '^inod\|den' /proc/slabinfo
inode_cache         1689     1708   560  7 1 : tunables  54 27 8 : slabdata     244     244 0
dentry_cache    62927063 63320094   216 18 1 : tunables 120 60 8 : slabdata 3517783 3517783 0
$
$ grep '^inod\|den' /proc/slabinfo
inode_cache         1694     1694   560  7 1 : tunables  54 27 8 : slabdata     242     242 0
dentry_cache    59596668 59596668   216 18 1 : tunables 120 60 8 : slabdata 3310926 3310926 0
$

(In reply to comment #23)
> For what it's worth, the workload from comment 0 and comment 1 is a hacked-up
> Nagios instance that will tickle the lockup on restart.
>
> We also had a cluster of Apache instances doing mostly static file delivery off
> NFS with bug 526612 updates that are also still seeing this. The pattern is
> consistent: ~60m entries in dentry_cache without a comparable inode_cache (as a
> point of reference, these hosts do have about 28m inodes in mounted filesystems).
>
> $ grep '^inod\|den' /proc/slabinfo
> inode_cache 1677 1694 560 7 1 : tunables 54 27 8 : slabdata 242 242 0
> dentry_cache 65085024 65096424 216 18 1 : tunables 120 60 8 : slabdata 3616468 3616468 0
> $

There has to be something broken here, and it sounds like you're saying the NFS client may be a candidate, but we still need to confirm that. I think that at this stage we are still trying to narrow down potential places to look for the problem.

This bug is fairly recent so there must have been a point in the past when this wasn't happening, or other machines with older kernels not showing the problem?

If this is something that happens after a dentry has been made positive and is now on the way to being destroyed then there are a fairly small number of places to look in the VFS. It could be a filesystem causing it but that would be more likely to result in an oops at some point before the cache grows this large. It is more likely dentrys are not being instantiated upon creation and the reference count is not being decremented for some reason. A missing dput() in a patch could cause this.

There are a couple of things we need to do. Look at the kernel version history to see if we can identify a time where this started to happen. Find out what we have in the dentry cache: check if the dentrys are really negative, what their d_count is, if there is a particular fs they belong to and if they are hashed or not. Probably the best way to do this is to reproduce on a test machine and get a kdump so we can have a look.

(In reply to comment #24)
> > $ grep '^inod\|den' /proc/slabinfo
> > inode_cache 1677 1694 560 7 1 : tunables 54 27 8 : slabdata 242 242 0
> > dentry_cache 65085024 65096424 216 18 1 : tunables 120 60 8 : slabdata 3616468 3616468 0
> > $

Oh ... hang on, maybe I'm talking nonsense. What are the values in /proc/sys/fs/dentry-state and /proc/sys/fs/inode-state?

> What are the values in /proc/sys/fs/dentry-state [...]

$ cat /proc/sys/fs/{inode,dentry}-state
44763 12 0 0 0 0 0
64803227 64786025 45 0 0 0
$ grep '^inod\|den' /proc/slabinfo
inode_cache         1745     1785   560  7 1 : tunables  54 27 8 : slabdata     255     255 0
dentry_cache    64799130 64845090   216 18 1 : tunables 120 60 8 : slabdata 3602505 3602505 0
$

> This bug is fairly recent so there must have been a point in the past when this
> wasn't happening, or other machines with older kernels not showing the problem?
These are actually complete workload groups -- there's the one loner, and the 3 covered in comment 23 are all of them. These just started getting load ~8 months ago and have seen the lockups since. (Most machines with similar workloads in our environment are still RHEL 4, which might explain the latent discovery.)

> it sounds like you're saying the NFS client may be a candidate

Only because it's the only file system getting meaningful activity (...and since you're the one asking, Ian, it's all autofs'y as well).

(In reply to comment #26)
> > What are the values in /proc/sys/fs/dentry-state [...]
>
> $ cat /proc/sys/fs/{inode,dentry}-state
> 44763 12 0 0 0 0 0
> 64803227 64786025 45 0 0 0

Phew, we're still in business.

> $ grep '^inod\|den' /proc/slabinfo
> inode_cache 1745 1785 560 7 1 : tunables 54 27 8 : slabdata 255 255 0
> dentry_cache 64799130 64845090 216 18 1 : tunables 120 60 8 : slabdata 3602505 3602505 0
> $
>
> > This bug is fairly recent so there must have been a point in the past when this
> > wasn't happening, or other machines with older kernels not showing the problem?
>
> These are actually complete workload groups -- there's the one loner, and the 3
> covered in comment 23 are all of them. These just started getting load ~8 months
> ago and have seen the lockups since. (Most machines with similar workloads in our
> environment are still RHEL 4, which might explain the latent discovery.)

Bummer.

> > it sounds like you're saying the NFS client may be a candidate
>
> Only because it's the only file system getting meaningful activity (...and since
> you're the one asking, Ian, it's all autofs'y as well).

Hahaha, yep it could be autofs, and if we can verify that then that's fine and we should be able to fix it. In the kernel you're using, in fact since kernel rev 106, autofs dentrys exist and are negative only during directory creation (between the kernel lookup and the following kernel mkdir); they never go negative after that until they are released, assuming the VFS isn't doing something it isn't supposed to. So that narrows down the code to check quite a lot. I'm fairly confident it isn't autofs but I'll certainly have a look.

I see FS guys are paying attention to this. Reassigning to default.

Jiri, Lachlan has been working on the fix for this & keeping the others in the FS team involved as needed. Moving it back to him, thanks!

This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.6 and Red Hat does not plan to fix this issue in the currently developed update. Contact your manager or support representative in case you need to escalate this bug.

We don't have enough information to make progress here. Please reopen if this continues to be an issue and we can try to gather fresh data. Thanks!