From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Description of problem:
The nr_dentry entry in /proc/sys/fs/dentry-state decreases under heavy load from certain programs and eventually becomes negative. The decrease is generally continuous, but is interrupted by spikes from time to time:

2004-03-05 07:00:42 -626 2424 45 0 0 0
2004-03-05 07:00:52 278 3263 45 0 0 0

and a little later:

2004-03-05 07:03:07 2078 5019 45 0 0 0
2004-03-05 07:03:17 -3219 42 45 0 0 0

We've seen values down to -13,000,000 after three days at load 10-15. 'sar -v' also reports unbelievable values in its dentunusd column: 4294963724, clearly a sign error.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-9.0.1.EL

How reproducible:
Always

Steps to Reproduce:
Unfortunately this is only reproducible by running a proprietary program of our own making. The program is multi-threaded and communicates through sockets with a perl script. From time to time the program flushes large amounts of data to disk. The problem only seems to show up when we're running several instances of the program simultaneously and bringing the machine to a load of 10-15. But it is reproducible every time we try.

Actual Results:
The nr_dentry value in /proc/sys/fs/dentry-state goes negative; load falls to 2-5 seemingly without any reason; communication between the binary program and the perl scripts slows to a crawl; netstat shows all sockets as having the same inode number; 'netstat -p' loses the program information; 'sar -v' reports bogus values for dentunusd; and generally the machine behaves oddly. Note that we're not sure that all of the above symptoms are related to the dentry readings, but after inspecting the whole /proc hierarchy this was the only odd reading we could find.
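To illustrate the sign error mentioned above, here is a minimal Python sketch. It assumes the dentunusd figure is a negative count printed as an unsigned 32-bit integer, and that each logged sample consists of a date, a time, and the six dentry-state fields, of which the first two are nr_dentry and nr_unused. The helper names `as_signed32` and `parse_sample` are made up for this example.

```python
from typing import Tuple


def as_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a two's-complement signed one."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & (1 << 31) else value


def parse_sample(line: str) -> Tuple[str, int, int]:
    """Split a logged sample into (timestamp, nr_dentry, nr_unused).

    Assumed layout: date, time, then the six dentry-state fields.
    """
    date, time, *fields = line.split()
    return f"{date} {time}", int(fields[0]), int(fields[1])


print(as_signed32(4294963724))  # -3572
print(parse_sample("2004-03-05 07:03:17 -3219 42 45 0 0 0"))
```

Under that assumption, the value reported by 'sar -v' corresponds to -3572, which is in the same range as the negative readings in the samples.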
Expected Results:
Contrary to all the available documentation, which says that nr_dentry should be zero most of the time, the code in fs/dcache.c shows that nr_dentry should have a positive value, and certainly not a negative one.

Additional info:
We've found a patch from Andrew Morton for 2.5.47 that probably addresses the problem we are seeing. An excerpt from the accompanying message reads:

    The patch also arranges (awkwardly) for all modifications of dentry_stat.nr_dentry to occur inside dcache_lock - it was racy.

The patch set can be found here: http://www.zip.com.au/~akpm/linux/patches/2.5/old/2.5.47.tar.gz (the inode-reclaim-balancing.patch patch).

The message in the patch set does not include the excerpt above (we're pretty sure it's the same patch, however); instead, the excerpt can be found here: http://www.linuxhq.com/kernel/changelog/v2.5/48/ . Search for 'nr_dentry'. This seems to be the changelog for the kernel proper between v2.5.47 and v2.5.48.
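As a rough illustration of how an unlocked counter could drift downward, the following Python sketch replays one hand-picked bad interleaving of an increment and a decrement. It is not kernel code: the function names are hypothetical, the interleaving is simulated deterministically rather than produced by real threads, and whether this exact pattern occurs in fs/dcache.c is an assumption.

```python
def unlocked_pair(counter: int) -> int:
    """One allocation and one free racing on the counter without a lock.

    Both "CPUs" load the old value before either stores, so the
    increment is overwritten and lost.
    """
    cpu_a = counter      # CPU A loads the counter (about to increment)
    cpu_b = counter      # CPU B loads the same value (about to decrement)
    counter = cpu_a + 1  # CPU A stores its result
    counter = cpu_b - 1  # CPU B stores, clobbering CPU A's update
    return counter


def serialized_pair(counter: int) -> int:
    """The same two updates when each read-modify-write runs to completion,
    which is what doing them inside one lock would enforce."""
    counter = counter + 1
    counter = counter - 1
    return counter


racy = safe = 0
for _ in range(1000):
    racy = unlocked_pair(racy)
    safe = serialized_pair(safe)

print(racy, safe)  # -1000 0
```

Each pair leaves the true number of dentries unchanged, yet the unlocked counter loses one per collision. If collisions become more frequent as load rises, that would fit the problem only showing up at load 10-15.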
This is fixed in RHEL4 beta 1. All of the above symptoms have vanished completely.
RHEL3 is now closed.
Created attachment 134710: updated patch

This patch is a backport of this one from upstream: http://lkml.org/lkml/2004/9/19/9

It looks rather innocuous, but we'll need to determine whether it meets the criticality threshold for 3.9.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
From what I can tell, nr_dentry isn't actually used for anything in the VM, so the fact that it goes negative shouldn't have any impact on the box. It just means that the nr_dentry counter is off when you look at it in /proc. The patch is probably pretty harmless, but the fix seems to be cosmetic.
Closing based on last comment.
As an aside: the issue in Bug #117561 is purely cosmetic; within the kernel, nr_dentry is only ever set and never read (except to produce the /proc output). Bug #117400 is more worrisome, since the VM makes decisions based on that number.
*sigh* Firefox tabs strike again; that comment hit the wrong bug.
Created attachment 146939: possible debug patch for nr_unused race

Also as an aside... We think the problem in Bug #117400 is likely that something is calling dget_locked without holding the dcache_lock. This patch is a possible way to check for this, though it won't tell you if someone else is holding the dcache_lock when this task calls it. Still, it might be a good way to track down how it is occurring.
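For readers without access to the attachment, the idea of such a check can be sketched in user space. This is a Python analogy only, not the attached patch: `dcache_lock`, `Dentry` and `dget_locked` here are stand-ins named after the kernel symbols, and the check is assumed to be a simple "is the lock held by anybody" test.

```python
import threading

dcache_lock = threading.Lock()  # stand-in for the kernel spinlock


class Dentry:
    """Minimal stand-in carrying only a reference count."""

    def __init__(self) -> None:
        self.d_count = 0


def dget_locked(dentry: Dentry) -> Dentry:
    """Take a reference; the caller is expected to hold dcache_lock."""
    # Lock.locked() only reports that *somebody* holds the lock, not that
    # the current thread does, so a caller racing with a legitimate
    # lock holder would slip through this check.
    if not dcache_lock.locked():
        raise RuntimeError("dget_locked called without dcache_lock held")
    dentry.d_count += 1
    return dentry


d = Dentry()
with dcache_lock:
    dget_locked(d)      # correct usage: passes the check

try:
    dget_locked(d)      # buggy caller: lock not held
except RuntimeError as exc:
    print(exc)
```

The check catches a caller that never takes the lock, but, as noted above, it stays silent when another task happens to hold the lock at that moment.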