Bug 327591

Summary: NFS: file replaced on server with new inode, client sees old inode
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.5
Hardware: All
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: low
Reporter: jas
Assignee: Jeff Layton <jlayton>
QA Contact: Martin Jenner <mjenner>
CC: andriusb, chet.burgess, drobinso, jeffschroeder, jwilson, liko, staubach, steved
Doc Type: Bug Fix
Last Closed: 2008-01-21 12:32:57 EST
Attachments: attempted reproducer

Description jas 2007-10-11 08:57:33 EDT
Description of problem:

At some point since upgrading to Red Hat Enterprise 4.5 (from 4.4), we started
experiencing a strange cache issue on two of our NFS clients -- namely, on our
web server and our undergraduate time sharing server.  

What ends up happening is that a user makes a change to some files on the NFS
client, and then at times, one of the other two NFS clients ends up seeing the
OLD copy of the file/directory.  The copy on the file server is completely
valid, as is the copy on our other NFS clients.  If you rename the file or
directory on the file server, the bad NFS client now sees the "new" copy, but
if you rename it back, the client sees the old copy.  The new copy on these
clients is in essence hidden because the cache on the NFS client refuses to
update.  In at least one instance, when this problem occurred on a directory,
recopying the entire directory ABOVE the problem directory fixed the problem.
Rebooting the client obviously fixes the problem as well.

All of our other machines are also running an identical copy of RHEL 4.5 and we
don't seem to be experiencing this problem elsewhere.  Whether the problem is
load related, we do not know. 

RHEL 4.5 came with kernel 2.6.9-55.EL.
Since then, we have upgraded to 2.6.9-55.0.2.EL.

Unfortunately, we haven't paid enough attention to the details of the system at
each instance of the problem, or it might have been easier to isolate which
kernel update caused the issue.  However, we do believe the problem started
occurring some time after upgrading from CentOS 4.4 to 4.5 or after the kernel
update.  Until we made this change, we believe that the problem never
occurred.

When the problem occurs, our users end up getting flustered, and contacting our
technical team.  In total, the problem has occurred 5 times, 2 of which occurred
yesterday.

The NFS server is still running Linux kernel 2.4.32, but we don't believe this
is an issue since the problem was not occurring before 4.5.

The NFS clients are running:

nfs-utils-1.0.6-80.EL4
nfs-utils-lib-1.0.6-8

(It's not clear whether this is a kernel issue or an nfs-utils issue.)

How reproducible:

We cannot determine how to reproduce the problem.  It occurred twice yesterday
though.

We are very anxious to solve the problem.  Please ask for any other necessary
details.
Comment 1 jas 2007-10-11 09:20:09 EDT
A few items that I left out:

The last two times the problem occurred:

1) A faculty member uploaded a NEW assignment file.  (As a result, nobody would
have even opened this file yet.)

2) A faculty member changed the contents of a directory, removing 2 tar files,
and creating a couple of new directories.  On the bad NFS client, you still see
the tar files.

It looks like NFS cache consistency is not always guaranteed as per this article:

http://nfs.sourceforge.net/#faq_a8

... but in these two particular cases, only one client was making changes.
(In both cases, it is possible that students were accessing other files in the
directory at the same time.)

Our NFS mounts are as follows:
server:/obj/dept /cs/dept nfs hard,rw,intr,bg,nfsvers=3,tcp 0 0

On the NFS server, this is exported like this:

/obj/dept client.cs.yorku.ca(rw,sync) ....



Comment 2 jas 2007-10-11 16:31:50 EDT
There are several cache-related issues referred to on this page:

http://www.linux-nfs.org/Linux-2.6.x/2.6.21/

One in particular ...

http://www.linux-nfs.org/Linux-2.6.x/2.6.21/linux-2.6.21-021-fix_readdir_stale_cache.dif

... sounds like a definite possibility.

The test case does produce the anticipated results.

Might this be the problem?

Comment 3 Jeff Layton 2007-10-29 12:11:20 EDT
> Might this be the problem?

Possibly. There were a number of NFS cache-consistency patches that went into
4.5, and it's possible that one of them has introduced a regression.

Without some idea of what's happening on these boxes when this occurs, it's
going to be very hard to track this down. My suggestion is to work on finding a
way to reproduce this problem.

> If you rename the file or directory on the file server, the bad NFS client now 
> sees the "new" copy, but if you rename it back, the client sees the old copy.

That's just bizarre. NFS operates on filehandles, so I have to wonder if the
filehandle is changing when the rename occurs (or something has gone very
strange with the lookup cache).

Comment 4 jas 2007-10-29 13:06:24 EDT
We haven't been aware of the problem again since October 11.  This isn't to say
that it hasn't happened, but usually, if one of our users experiences the
problem, we're the first to know about it.  Interestingly enough, perhaps
summoned by my fear after reading your message, the problem happened again, and
this time for me!  I was working on a temporary server that I have allocated to
phpBB3 testing.  An Apache2 web server was running on the box.  I decided to
completely re-install phpBB3.

On MY host, I changed into the directory containing phpBB (a directory that is
NFS-shared out to all clients), and just did:

mv forum forum.orig

On all hosts, forum.orig appeared fine.

I then re-uncompressed/untarred phpBB3 on MY host, and renamed "phpBB3" to
"forum".  

Now, on all hosts EXCEPT for the test host:

forum was inode 20432204 and forum.orig was 16860794.

On the test host, both the inode for forum and forum.orig were 16860794!!

Just for the heck of it, I DELETED the directory "forum" on "phpbb".  Of course,
deleting this copy shouldn't really be a problem.  I got an error that the
directory could not be deleted because it was not empty.  At this point, doing
an ls -i revealed that "forum" was now pointing to inode 20432204!

I have tried to repeat this again and again, but it's not happening.  This must
have something to do with something being cached/open in Apache, and then having
it moved out from underneath.  There is no other explanation, since I did all
the work on MY host, yet the problem only occurred on the host that was running
the web server.
Comment 5 Jeff Layton 2007-10-29 13:26:01 EDT
Which host was seeing the problem? The test host?

If inode numbers are changing when you rename the file then I tend to think that
something is very odd on the server. You may want to do some network captures of
file renames and have a look at them with wireshark. If the inode numbers are
changing when you rename then that may help show it.

You mentioned that the server is running kernel 2.4.32. What architecture is it
(x86, x86_64, etc), and what is the underlying filesystem that's being exported
here (ext3, xfs, reiser?).

Comment 6 jas 2007-10-29 14:36:05 EDT
The host seeing the problem is the test host, not the host where I was actually
doing the rename...

Just to be clear...

On my host, I renamed forum to forum.orig.  
forum.orig maintained the original inode number on all hosts INCLUDING the test
host.
The problem, which sounds like an NFS client cache issue, is that when I
uncompressed/untarred the file again (creating the phpBB3 directory) and renamed
phpBB3 to "forum" on my BOX, all the other boxes saw the "real" "new" forum with
the NEW inode, whereas the "test box" saw the "old" copy of forum (the inode of
what is now forum.orig).
The test box could see any other directory that I created, but it was "stuck" on
the forum directory itself.
In all cases, (including previous cases), the copy of the directory on the
server is fine.  

The x86 2.4.32 server "could" be the problem, but I highly doubt it since none
of our Solaris boxes have ever experienced this problem, and none of our
machines experienced this before the upgrade to 4.5.
The filesystem is ext3.
Comment 7 Jeff Schroeder 2007-10-30 12:20:35 EDT
We have seen the exact same issue using AS4.5 and nfs. Our NFS server is a
NetApp Filer and the issue only seems to be coming from 4.5 clients. Here is
some associated version info from an affected client:

rpm -qa nfs*; uname -r
nfs-utils-1.0.6-80.EL4
nfs-utils-lib-1.0.6-8
2.6.9-55.ELsmp


We change a symlink to point from one file to another, and on the affected
clients, sometimes it shows the symlink pointing to files it hasn't pointed to
for several weeks.  This is a serious regression.
Comment 8 Jeff Layton 2007-10-30 12:54:49 EDT
Given that this seems to occur only sporadically, it sounds like a race of some
sort. We're going to have to learn something about the conditions under which it
occurs. Since we don't have a way to reproduce this (unless one of you does), what
might be best for the time being is to collect some nfs_debug info when this
problem occurs. When you find a host that is behaving this way:

# echo 32767 > /proc/sys/sunrpc/nfs_debug

...then try to access the inode that's having problems:

# stat /problem/file/or/directory

...after that, turn off debugging:

# echo 0 > /proc/sys/sunrpc/nfs_debug

...and capture dmesg output and attach it here. Also, get the output from the
stat command, and also stat the directory above it. The idea is that we want to
capture what the client is doing at that time and perhaps why it's not
revalidating the inode correctly.

It would also be helpful to try touching the parent directory of the file you're
having trouble with and see if that causes it revalidate the entry. If there's a
simple way to force a revalidation of the file then that might also tell us
something about why it's not being revalidated to begin with.

Finally, if you have a support contract, please open a case with our support
organization so that this can be prioritized as a customer impacting issue. They
may also be able to help you track down a reproducer (or gather more info about
when and why this occurs).
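The capture steps above can be wrapped in one small function (a sketch; run it
as root on the affected client, and note that the control file lives under
/proc/sys/sunrpc/. The second argument exists only so the function can be
dry-run against a scratch file instead of the real procfs entry):

```shell
#!/bin/sh
# Sketch: the nfs_debug capture steps from this comment as one function.
capture_nfs_debug() {
    target="$1"
    ctl="${2:-/proc/sys/sunrpc/nfs_debug}"   # overridable for dry runs
    echo 32767 > "$ctl"                      # enable all NFS client debug flags
    stat "$target"                           # poke the problem inode
    stat "$(dirname "$target")"              # ...and its parent directory
    echo 0 > "$ctl"                          # turn debugging back off
    dmesg                                    # kernel log to attach to this bug
}
# e.g.: capture_nfs_debug /problem/file/or/directory > capture.txt 2>&1
```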


Comment 9 Jeff Schroeder 2007-10-30 15:01:53 EDT
echo 32767 > /proc/sys/sunrpc/nfs_debug; stat /ptx/SYS/conf/software.conf >
stat.txt; echo 0 > /proc/sys/sunrpc/nfs_debug; dmesg -c > dmesg.txt

...

cat stat.txt; echo ------------------------------; cat dmesg.txt
  File: `/ptx/SYS/conf/software.conf' ->
`/software/scripts/conf/Persephone/software.conf-rhea-2.0'
  Size: 56              Blocks: 0          IO Block: 32768  symbolic link
Device: 1fh/31d Inode: 3115833     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (  500/     ptx)   Gid: (  500/     ptx)
Access: 2007-10-30 07:46:26.302902000 -0700
Modify: 2007-08-01 23:20:45.743523000 -0700
Change: 2007-10-16 22:48:11.198072000 -0700
------------------------------
NFS call  access
NFS: nfs_update_inode(0:1f/104 ct=1 info=0x6)
NFS reply access, status = 0
NFS: revalidating (0:1f/2670994)
NFS call  getattr
NFS reply getattr
NFS: nfs_update_inode(0:1f/2670994 ct=1 info=0x6)
NFS: (0:1f/2670994) revalidation complete
NFS call  access
NFS reply access, status = 0
NFS: revalidating (0:1f/3115833)
NFS call  getattr
NFS reply getattr
NFS: nfs_update_inode(0:1f/3115833 ct=2 info=0x6)
NFS: (0:1f/3115833) revalidation complete
NFS: dentry_delete(conf/software.conf, 8)
NFS: dentry_delete(conf/software.conf, 8)


The symlink should be pointing to:
/software/scripts/conf/Persephone/software.conf-shoulao-1.0.  "mount -o remount
/nfs/share" is a quick hack to refresh the NFS cache and clear out the problem.
Comment 10 Jeff Layton 2007-10-30 15:15:56 EDT
Very good info. I forgot to ask for one other bit though...

Could you run a stat against this symlink from a "working" host (or from the
server itself) and post the output here? It would be good to know what (if any)
attributes have changed.

From what you've posted already, the client tries to revalidate the directory:

NFS: revalidating (0:1f/3115833)

It does a getattr call against the inode:

NFS call  getattr
NFS reply getattr

It then calls nfs_update_inode to interpret the results:

NFS: nfs_update_inode(0:1f/3115833 ct=2 info=0x6)
NFS: (0:1f/3115833) revalidation complete

...and it looks like it probably decides that the cached copy is OK (hard to
tell, the printk's aren't that comprehensive here). Though if this inode had
changed (and the filehandle didn't change), then the mtime would almost
certainly have changed, and that should have given us another printk.

Also, what's the uptime on this client?
Comment 11 Jeff Schroeder 2007-10-30 15:23:21 EDT
So on the broken client, the uptime is 34 days. This is when we deployed 4.5 and
rebooted to use the new kernel. Since the server is a NetApp Filer (a BSD
derivative), stat doesn't work on it.

From a client with no issues, here is the stat:
stat /ptx/SYS/conf/software.conf
  File: `/ptx/SYS/conf/software.conf' ->
`/software/scripts/conf/Persephone/software.conf-shoulao-1.0'
  Size: 59              Blocks: 0          IO Block: 32768  symbolic link
Device: 1fh/31d Inode: 2973033     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (  500/     ptx)   Gid: (  500/     ptx)
Access: 2007-10-30 12:12:15.598210000 -0700
Modify: 2007-10-15 19:18:13.930958000 -0700
Change: 2007-10-16 22:48:11.241073000 -0700
Comment 12 Jeff Layton 2007-10-30 15:44:35 EDT
Ok thanks. The main thing here is that the inode number has changed:

2973033 on the good client vs. 3115833 on the bad one

...and evidently, the old inode is still valid on the server (or you'd be
getting an ESTALE back from it when the getattr call is done). So somehow the
symlink was deleted and recreated but the client is not noticing that it needs
to redo the lookup. I'll look through the code and see if I can determine how
this might occur (and most importantly) see if I can come up with a way to
reproduce this.

One other thing that might be helpful -- if you've not already remounted the dir
or rebooted the bad client, can you do a "stat" on the parent directory of the
symlink from both the good and bad hosts? I'd like to see if their mtimes are
identical, in particular.
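To make that comparison easy, the two fields that matter here (inode number and
mtime) can be pulled for a path and its parent in a diff-friendly form (a
sketch; `stat -c` is GNU coreutils):

```shell
#!/bin/sh
# Sketch: print inode number and mtime (epoch seconds) for a path and its
# parent directory, so output from a good and a bad client can be diffed.
stat_inode_mtime() {
    stat -c 'inode=%i mtime=%Y %n' "$1" "$(dirname "$1")"
}
# On a client from this report:
#   stat_inode_mtime /ptx/SYS/conf/software.conf
```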
Comment 13 Jeff Schroeder 2007-10-30 16:33:20 EDT
Well, we have this problem intermittently throughout a cluster, so there are
still a few affected hosts.  We try to minimize downtime, so just remounting on
a hot system works.

Here is the basic workflow... system A and system B read the symlink that is
changed on system C. Everything is mounting a /vol/sharename on filer 1. From
system C, it is moved aside to ${name}.old and recreated pointing to a new
configuration file. Sometimes, system A or system B doesn't recognize anything
has changed. This might happen on 4 out of 50 systems.
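The swap described above can be sketched as follows, together with a variant
that replaces the link in a single atomic rename(2) and so never leaves a
window where the name is missing (paths and names are illustrative, and
`mv -T` is GNU coreutils):

```shell
#!/bin/sh
# Sketch of the symlink swap from the workflow above, in a throwaway
# directory, plus a race-free variant. Names and targets are illustrative.
cd "$(mktemp -d)" || exit 1
new_target=/software/scripts/conf/Persephone/software.conf-shoulao-1.0
ln -s /some/old/target software.conf        # starting state: the old symlink

# As in the workflow above: move the old link aside, then recreate it.
mv software.conf software.conf.old
ln -s "$new_target" software.conf

# Race-free variant: build under a temporary name, then rename over the old
# link; rename(2) is atomic, so readers never see the name missing.
ln -s "$new_target" software.conf.tmp
mv -T software.conf.tmp software.conf       # GNU mv -T: rename, don't descend
```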

From a bad client
  File: `/ptx/SYS/conf'
  Size: 4096            Blocks: 8          IO Block: 32768  directory
Device: 1fh/31d Inode: 2670994     Links: 2
Access: (0755/drwxr-xr-x)  Uid: (  500/     ptx)   Gid: (  500/     ptx)
Access: 2007-10-30 07:46:26.295902000 -0700
Modify: 2007-10-16 22:48:11.241073000 -0700
Change: 2007-10-16 22:48:11.241073000 -0700

And a good client:
  File: `/ptx/SYS/conf'
  Size: 4096            Blocks: 8          IO Block: 32768  directory
Device: 1fh/31d Inode: 2670994     Links: 2
Access: (0755/drwxr-xr-x)  Uid: (  500/     ptx)   Gid: (  500/     ptx)
Access: 2007-10-30 07:46:26.295902000 -0700
Modify: 2007-10-16 22:48:11.241073000 -0700
Change: 2007-10-16 22:48:11.241073000 -0700

That looks pretty worthless to me.
Comment 14 Jeff Layton 2007-10-30 18:19:16 EDT
> That looks pretty worthless to me.

Not to me. It tells me that both clients have identical attributes cached for
the directory. So the question becomes:

Why did one client decide to redo the lookup (and get a new filehandle) for the
software.conf symlink and the other did not?

I'll have a look over the code and see if I can determine a situation where
that might occur.

When you open the support case for this, then please make note that you've also
reported this via the BZ here.
Comment 15 Jeff Schroeder 2007-10-30 18:32:11 EDT
From a previous comment that points to this link...
http://www.linux-nfs.org/Linux-2.6.x/2.6.21/linux-2.6.21-021-fix_readdir_stale_cache.dif
The tar command in this test script does indeed fail on affected clients with
this error:
tar: newdir/innerdir/innerfile: file changed as we read it

If you remount the affected shares or try this script on an AS4.4 server, it
works as expected. So the script at that link is how to reproduce the bug reliably.

Does the patch in that link make any sense as to fixing this issue? Would it be
worth my time to unroll an srpm for the 4.5 kernel, have the spec file apply
that patch and reroll it? Surely I could just run that test case script and see
if it fails or not.


Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---

 fs/nfs/dir.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index d971547..e59fd31 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -865,6 +865,10 @@ static int nfs_dentry_delete(struct dentry *dentry)
 static void nfs_dentry_iput(struct dentry *dentry, struct inode *inode)
 {
 	nfs_inode_return_delegation(inode);
+	if (S_ISDIR(inode->i_mode))
+		/* drop any readdir cache as it could easily be old */
+		NFS_I(inode)->cache_validity |= NFS_INO_INVALID_DATA;
+
 	if (dentry->d_flags & DCACHE_NFSFS_RENAMED) {
 		lock_kernel();
 		drop_nlink(inode);
Comment 16 jas 2007-10-30 21:43:54 EDT
Interesting that it doesn't work on 4.4...

I had e-mailed Neil Brown about my particular problem, and he didn't think that
his patch (above) would solve it, but it is very interesting that the results
under 4.4 and 4.5 are different...

Comment 17 Jeff Schroeder 2007-10-30 23:08:46 EDT
This was not an issue until we updated from 4.4 to 4.5.
Comment 18 Jeff Layton 2007-10-31 06:51:38 EDT
4.5 included a rather large patch to fix a different issue with cache
consistency (particularly when dealing with files that were mmap'ed). It's
certainly possible that that patch introduced the bug that's mentioned in
comment #2. For now I'd recommend opening a separate BZ for that since we don't
know that that's related to this issue.

In looking over Neil's description of that patch, I also don't think it's the
same issue. In his case, it was a matter of keeping inode info from a
READDIRPLUS cached for too long.

In this one, I think the problem is related to this dentry not being flushed out
properly when the parent directory changes. Moving the symlink should cause the
mtime on the directory to change, and that should force a new lookup for the
dentry. That seems to not be happening here for some reason.

That said, I could be wrong. If the filehandle for the symlink is staying the
same while the fileid is changing, then it's possible that Neil's patch may be
what's needed. In that situation though, this printk should be firing:

                printk(KERN_ERR "%s: inode number mismatch\n"
                       "expected (%s/0x%Lx), got (%s/0x%Lx)\n",
                       __FUNCTION__,
                       inode->i_sb->s_id, (long long)nfsi->fileid,
                       inode->i_sb->s_id, (long long)fattr->fileid);

...so if you're not seeing any of those in the logs then I'm more skeptical of
this being the problem.

The patch in comment #2 looks sane enough, though. Testing it might not be a bad
idea in the absence of a better understanding of this problem. At the very least
it should rule it out as the issue you're seeing.
Comment 19 Jeff Layton 2007-10-31 06:52:50 EDT
It might also be worthwhile testing some of the kernels on my people page:

http://people.redhat.com/jlayton

...particularly if you have a way to reproduce this.
Comment 20 Jeff Layton 2007-10-31 15:09:06 EDT
Created attachment 244831 [details]
attempted reproducer

Spent a bit of time trying to reproduce this today (with no luck so far). I was
hoping that running this script, which does something like what Jeff described,
would reproduce it.

I was able to reproduce something that looked like it when mounting a server
with an ext3 filesystem underlying it, but not when the filesystem was xfs. The
difference there is that xfs has sub-second timestamps but ext3 does not.

When the directory is on ext3 and is revalidated and then changed within the
same second, the client doesn't have a way to know that its cache is invalid
and will keep using the cached dir contents.

xfs (and Netapp's OnTap) have timestamp granularity of 1ms or less. That's
small enough to generally prevent races like this. Timestamp granularity might
explain the problem seen by the original reporter (since he's using a linux
server with ext3), but I don't think it can explain the situation where the
server is a Netapp.

One thing that might be helpful is to know whether you can "fix" this problem
by touching the parent directory of the dentries showing the problem. If so,
then a race of this sort might help explain things.

So in Jeff's case, when this problem occurs, if you do this on a "working" client:

    $ touch /ptx/SYS/conf

...does the problem on the "bad" client clear itself up?
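The granularity race described above can be illustrated locally (a toy model of
the mtime comparison only, not of the NFS client itself; GNU touch/stat
assumed):

```shell
#!/bin/sh
# Toy model: stat's %Y reports whole seconds, so two events 0.6 s apart but
# within the same second carry the same mtime, and an mtime comparison cannot
# see the second change. This models only the timestamp comparison.
d=$(mktemp -d)
touch -d '2007-10-16 22:48:11.200' "$d"   # client revalidates; caches mtime
cached=$(stat -c %Y "$d")
touch -d '2007-10-16 22:48:11.800' "$d"   # rename happens in the same second
now=$(stat -c %Y "$d")
[ "$cached" = "$now" ] && echo "change invisible at one-second granularity"
```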
Comment 21 Jeff Schroeder 2007-10-31 16:10:48 EDT
I touched the parent directory from a "working" client and ran stat on the same
directory on a "bad" client to make sure it picked up that the directory had
been touched. It did.

Now on the "bad client" the symlink has not changed and is still pointing to the
wrong place.

Do you want me to open a new bug about this?
Comment 22 Jeff Layton 2007-10-31 18:26:00 EDT
On the bug with the known reproducer:

http://www.linux-nfs.org/Linux-2.6.x/2.6.21/linux-2.6.21-021-fix_readdir_stale_cache.dif

...yes, please do.

Let's keep this bug to deal with these dentries that aren't properly getting new
lookups.

Comment 23 Jeff Layton 2007-10-31 21:26:19 EDT
I have to wonder if this patch may be relevant to this issue:

http://linux.bkbits.net:8080/linux-2.6/?PAGE=gnupatch&REV=1.1966.3.7

...though given that we don't understand this problem well, I can't offer any
evidence that it is. If you have an environment where you can reproduce this
readily, it might be worthwhile to test:

1) one of the test kernels on my people page:

http://people.redhat.com/jlayton

2) the same kernel with a backported version of the patch in comment #2

3) the same kernel with a backport of the patch in the link above

...though the most effective way to get this resolved is going to be to find a
way to reproduce the problem reliably. With that I should be able to track down
the cause and fix it.
Comment 24 jas 2007-11-01 09:41:38 EDT
I wonder if this problem I am having right now is the same as the original
problem.  It actually seems a little worse! 

I have been working on setting up phpbb and storing my instructions in a text
file called "phpbb" in a directory called "php" in my home directory.  On the
file server, and on my local host, the file looks like this in the directory
listing:

23331609 -rw-------    1 jas  tech  4335 Oct 29 16:03 phpbb

... but on one host, that one file in the directory listing appears like this:

       ? ?---------    ? ?    ?        ?            ? phpbb

All the other files in the directory are displaying fine.  Actually, here are
the other files with inodes if it helps:

23330845 drwxr-xr-x    3 jas  tech  4096 Nov  1 09:29 .
 1278095 drwx--x--x  128 jas  tech 12288 Nov  1 09:24 ..
23331617 -rw-r--r--    1 jas  tech   801 Oct 31 13:22 README
23331610 -rw-r--r--    1 root tech  5361 Oct 30 09:52 auth_cse.php
23331740 -rw-r--r--    1 jas  tech   427 Oct 30 14:49 config.php
23331572 -rw-r--r--    1 jas  tech  4491 Oct 29 16:06 cse.php
       ? ?---------    ? ?    ?        ?            ? phpbb
 1278932 -rw-r--r--    1 root tech 36100 Oct 29 16:05 phpbb.gz
23331141 -rw-r--r--    1 jas  tech 10023 Oct 31 13:31 phpbb.php
23331140 -rwxr-xr-x    1 jas  tech  3102 Oct 31 13:31 phpbbsync
23331615 -rwxr-xr-x    1 jas  tech   880 Oct 31 16:13 test
23331614 -rw-r--r--    1 jas  tech   862 Oct 31 16:07 test.php
23331587 -rw-r--r--    1 jas  tech   801 Oct 31 16:04 test.php~
23331613 -rwxr-xr-x    1 jas  tech   870 Oct 31 16:05 test~
 6292183 drwxr-xr-x    2 root tech  4096 Oct 29 16:08 tmp

On the host that is having trouble:

# echo 32767 > /proc/sys/sunrpc/nfs_debug
# stat /cs/home/jas/php/phpbb
stat: cannot stat `/cs/home/home/php/phpbb': No such file or directory
# echo 0 > /proc/sys/sunrpc/nfs_debug

dmesg output:

NFS call  access
NFS: nfs_update_inode(0:13/2801958 ct=1 info=0x6)
NFS reply access, status = 0
NFS call  access
NFS: nfs_update_inode(0:1b/23330845 ct=1 info=0x6)
NFS reply access, status = 0
NFS: dentry_delete(php/phpbb, 8)
NFS: revalidating (0:13/12714036)
NFS call  getattr
NFS reply getattr
NFS: nfs_update_inode(0:13/12714036 ct=1 info=0x6)
NFS: (0:13/12714036) revalidation complete
NFS: revalidating (0:13/12714036)
NFS call  getattr
NFS reply getattr
NFS: nfs_update_inode(0:13/12714036 ct=1 info=0x6)
NFS: (0:13/12714036) revalidation complete
NFS: revalidating (0:13/4980787)
NFS call  getattr
NFS reply getattr
NFS: nfs_update_inode(0:13/4980787 ct=1 info=0x6)
NFS: (0:13/4980787) revalidation complete
NFS: revalidating (0:13/4980787)
NFS call  getattr
NFS reply getattr
NFS: nfs_update_inode(0:13/4980787 ct=1 info=0x6)
NFS: (0:13/4980787) revalidation complete

Now I run a stat on the parent directory (from the bad host):

# stat /cs/home/jas/php
  File: `/cs/home/jas/php'
  Size: 4096            Blocks: 8          IO Block: 8192   directory
Device: 1bh/27d Inode: 23330845    Links: 3
Access: (0755/drwxr-xr-x)  Uid: ( 1004/     jas)   Gid: ( 1000/    tech)
Access: 2007-11-01 09:12:24.000000000 -0400
Modify: 2007-11-01 09:10:03.000000000 -0400
Change: 2007-11-01 09:10:03.000000000 -0400

I touch the parent directory (from the bad host):

% touch /cs/home/jas/php
steel 352 % stat /cs/home/jas/php/phpbb
stat: cannot stat `/cs/home/jas/php/phpbb': No such file or directory

Now I stat the file from a working host:

hop 301 % stat /cs/home/jas/php/phpbb
  File: `/cs/home/jas/php/phpbb'
  Size: 4335            Blocks: 16         IO Block: 8192   regular file
Device: 1bh/27d Inode: 23331609    Links: 1
Access: (0600/-rw-------)  Uid: ( 1004/     jas)   Gid: ( 1000/    tech)
Access: 2007-11-01 09:11:01.000000000 -0400
Modify: 2007-10-29 16:03:28.000000000 -0400
Change: 2007-10-29 16:03:28.000000000 -0400

I stat the parent directory from a working host:

hop 302 % stat /cs/home/jas/php
  File: `/cs/home/jas/php'
  Size: 4096            Blocks: 8          IO Block: 8192   directory
Device: 1bh/27d Inode: 23330845    Links: 3
Access: (0755/drwxr-xr-x)  Uid: ( 1004/     jas)   Gid: ( 1000/    tech)
Access: 2007-11-01 09:29:21.000000000 -0400
Modify: 2007-11-01 09:29:08.000000000 -0400
Change: 2007-11-01 09:29:08.000000000 -0400

I touch the file on the good client:

hop 312 % touch phpbb

I try to stat the file on the bad client:

steel 356 % stat /cs/home/jas/php/phpbb
stat: cannot stat `/cs/home/jas/php/phpbb': No such file or directory

uptime on the bad client is:  09:31:37 up 99 days, 23:54

I will keep the system in its current state so that you can ask for other details.
Comment 25 Jeff Layton 2007-11-01 10:38:09 EDT
The output with the ???'s for that dirent indicates that we have a stale dentry
in this directory that's not being flushed out correctly. 

This bug may be related:

https://bugzilla.redhat.com/show_bug.cgi?id=228801


The patch to fix that should be in the 4.6 kernels. The kernels on my people
page also have them if you wish to test those.

Also, with a long uptime like this, I suspect that the problem could also be an
issue with jiffies wraparound (bug 247905, which is unfortunately marked
private). Since touching the dir doesn't clear it up, that may very well be the
case.

bug 247905 is on the 4.7 proposed list. If I get some time later today, I may
pull some of these patches into my test kernels.
Comment 26 Jeff Layton 2007-11-01 10:44:51 EDT
Do you have netdump or diskdump configured for this client? If you can collect a
vmcore from it then I might be able to get a bit more info about why this dentry
isn't getting flushed out.
Comment 27 jas 2007-11-01 10:56:21 EDT
Unfortunately, we do not have netdump/diskdump configured, and I couldn't
configure it on any one system without rebooting since I would need to add
another disk which would make the problem go away.  I'm also not sure re:
company policy on submission of a core (which could contain other confidential
info).. :( Is there anything else I can do?
If the last bug that I found is different, I will wait until I find the first
bug again, and I will provide more details...

Comment 28 jas 2007-11-01 13:06:26 EDT
Oh you'll love this one... I "created" /cs/home/jas/php/phpbb on the bad host
and inserted "hello world<enter>" at the top.  The original file showing on the
other hosts started like this:
phpbb setup on webapp<enter>

Now, the file everywhere (including the bad host) sees:

hello world<enter>
on webapp<enter>

... and the rest of the file!
Comment 29 Jeff Layton 2007-11-01 13:25:43 EDT
WRT company policy on sending cores. We have a confidential FTP server for
receiving them. Our support folks can fill you in on the details if you open a
case with them.

Aside from that, you can try running crash or another kernel debugger on the
live kernel and see if you can get at the dentry that is showing problems, and
its parent. From that you can check it against the logic in nfs_check_verifier()
and see why the directory data isn't getting invalidated properly (which is
what I suspect is happening).
Comment 30 Jeff Layton 2007-11-01 13:28:50 EDT
The behavior when creating the file makes some sense:

1) Client thinks file doesn't exist, but it does on server.
2) Client opens file with O_CREAT, server just sends back filehandle of existing
file (probably no O_EXCL flag was used)
3) Client does a write and server dutifully overwrites the bytes at the
beginning of the file.

...not unexpected given the situation, I don't think. The crucial bit is that
the negative dentry is sticking around when it shouldn't be, and that seems to
be occurring because the client isn't noticing that the directory has changed.
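The three steps above can be demonstrated with local files: the shell's `<>`
redirection opens with O_CREAT but without O_EXCL or O_TRUNC, just like the
client's create, so the write lands over the start of the existing file (a
sketch in a throwaway directory; it reproduces the spliced content from
comment #28):

```shell
#!/bin/sh
# Sketch: "create" over an existing file without O_EXCL/O_TRUNC and watch the
# write overwrite only the leading bytes, leaving the tail intact.
d=$(mktemp -d)
printf 'phpbb setup on webapp\n' > "$d/phpbb"   # the file as the server has it
printf 'hello world\n' 1<> "$d/phpbb"           # open(O_CREAT|O_RDWR), write at offset 0
cat "$d/phpbb"                                  # hello world / on webapp
```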
Comment 31 Jeff Schroeder 2007-11-01 13:41:24 EDT
Jeff, I created a new bug and hope you don't mind that it is assigned to you.
Bug #362241
Comment 32 jas 2007-11-01 14:22:59 EDT
I can't run crash on the kernel now since, after re-writing the file, the
problem is gone ;)
Comment 33 Jeff Layton 2007-11-01 14:33:07 EDT
Ok, that makes sense. The create would have instantiated the dentry and so you
wouldn't have the negative dentry stuck there anymore. You could have probably
gotten similar results by doing a 'touch phpbb' on the bad client.
Comment 34 Jeff Layton 2007-11-02 07:39:32 EDT
I've added some new test kernels to my people page:

http://people.redhat.com/jlayton

...these are essentially 4.6 kernels with some extra patches. Of interest here
are these two patches:

247905	0019-BZ-247905-NFS-Attribute-timeout-handling-and-wrapp.patch
327591	0020-BZ-327591-NFS-discard-pagecache-data-for-dirs-on-d.patch

...the patch for bug 247905 is a patch that's already proposed for inclusion in
4.7, and the other one is the fix for the known reproducer in the patch description.

Testing these might be helpful, but this is less than ideal since we have no
reproducer. If the problem goes away we can't be sure what fixed it (or even if
it is fixed).

At this point, the thing that might be most helpful when this occurs is to force
a crash on an affected client and collect a coredump. From that I may be able to
drill down into it and determine why the directory isn't being revalidated properly.

If you do this, please also log into a "working" client and collect:

- output from stat on the bad entry itself (the one not showing up correctly
on the client)
- output from stat on the parent directory of the "bad" dentry
- presuming that the old file was moved when this occurs, a stat against the
original file

...that info will be needed to compare with the info that the malfunctioning
client has about these dentries/inodes.
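A small helper along these lines could collect the requested info in one pass on each machine (hypothetical; the function name is ours):

```python
import os

# Hypothetical helper: stat a path and every parent directory up to the
# root in one pass, matching the collection steps described above. Run it
# on both the affected client and a healthy one and diff the results.
def stat_with_parents(path):
    results = []
    while path and path != "/":
        st = os.stat(path)
        results.append((path, st.st_ino, st.st_nlink, st.st_mtime))
        path = os.path.dirname(path)
    return results
```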
Comment 35 Jeff Layton 2007-11-02 14:02:10 EDT
*** Bug 362241 has been marked as a duplicate of this bug. ***
Comment 36 Jon Wilson 2007-11-09 02:54:24 EST
We are seeing very similar issues. We have a cluster of RHEL3 and RHEL4
machines. We made a small but critical filesystem change over the weekend, and
all the RHEL4.5 machines saw the old layout, not the new one.

Note that this filesystem is read-only everywhere on the cluster, except for one
data processing machine.

First the RHEL3 client:

[root@rhel3.client dat]# pwd
/data03/datfix/dat
[root@rhel3.client dat]# stat .
  File: `.'
  Size: 4096            Blocks: 8          IO Block: 8192   directory
Device: 13h/19d Inode: 115835091   Links: 14
Access: (2750/drwxr-s---)  Uid: ( 1009/ UNKNOWN)   Gid: (  820/raw)
Access: 2007-11-09 18:29:31.000000000 +1100
Modify: 2007-11-04 12:44:40.000000000 +1100
Change: 2007-11-04 13:20:45.000000000 +1100

Then the RHEL4.5 client.

[root@rhel4.client dat]# pwd
/data03/datfix/dat
[root@rhel4.client dat]# stat .
  File: `.'
  Size: 4096            Blocks: 8          IO Block: 8192   directory
Device: ah/10d  Inode: 105873409   Links: 14   
Access: (2750/drwxr-s---)  Uid: ( 1009/ UNKNOWN)   Gid: (  820/raw)
Access: 2007-11-09 18:31:36.000000000 +1100
Modify: 2007-11-04 12:59:37.000000000 +1100
Change: 2007-11-04 13:20:55.000000000 +1100

Comment 37 Jeff Layton 2007-11-19 07:17:12 EST
Setting to needinfo for a vmcore or results from the test kernel.
Comment 38 Jeff Schroeder 2007-11-21 14:31:45 EST
So this is an ultra-simplified view of our setup:

NetApp Filer(Volume A)---snapmirror---> NetApp Filer (Readonly Volume A)
              |                                      /   |         |
              |                                     /   Server C   |
              |                               Server B             |
           Server A                                          Server D

So we will say that on Server A we do something like mkdir /nfs/dir1. A week
later on Server A, we do mkdir /nfs/dir2. Now after dir2 was created and the
snapmirror has updated, servers B and C see dir2 on their readonly mounts.
Server D can NOT see this directory. mount -o remount /nfspath generally works,
but yesterday, we had to actually umount and then mount to get the cache to update.

This seems to be quite similar to Jon's issue and setup. Perhaps it has
something to do with the readonly mount. Note that we mount the r/o mount as r/w
from the linux clients.

For good reason, management is not ok with running test kernels on transactional
production hosts. Since this is completely non-deterministic and happening in
production, our solution is to rollback to AS4.4 and rekickstart hundreds of
boxes (thank god for Sun's ILOMs).

We do still have the 2 original hosts you were helping me troubleshoot marked
for maintenance and not being used. Since a vmcore can contain sensitive memory
and perhaps things like ssh keys, I'd rather not upload it publicly.

Where can I upload a vmcore from the affected host you helped us with earlier?
Does redhat have an sftp server for this or something like that?
Comment 39 Jeff Layton 2007-11-23 07:25:18 EST
The only means I know of is via non-encrypted FTP. I believe our support
organization has some other methods of transferring cores in a more secure way.
Comment 40 Jeff Layton 2007-11-30 14:02:04 EST
bug 401551 may also be related to this issue, but it's again hard to tell
without some conclusive info.
Comment 41 Chet Burgess 2007-12-07 23:50:41 EST
I recently ran into an NFS directory caching problem on 5.0 as well as 5.1. It
appears that I have encountered bug 364351.

In researching my problem I found bug 327591 and by extension bug 364351. I used
the test script documented in bug 327591 and found at the following URL.
 http://www.linux-nfs.org/Linux-2.6.x/2.6.21/linux-2.6.21-021-fix_readdir_stale_cache.dif


I was able to reproduce the problem on my 5.0 and 5.1 boxes, but not on my 3.8
boxes. I went ahead and tested your RHEL5 test kernels found at the following URL.

http://people.redhat.com/jlayton

I can confirm that the following kernels work in my dev environment and that
they fix bug 364351.

kernel-PAE-2.6.18-58.el5.jtltest.17
kernel-2.6.18-58.el5.jtltest.17
kernel-2.6.18-58.el5.jtltest.17.x86_64

I am cross posting this to both bug 364351 and bug 327591 as they are
possibly related and the kernels tested above contain potential fixes for both
bugs, though no test case exists for bug 327591.
Comment 43 Andrius Benokraitis 2008-01-07 11:40:26 EST
So is this an issue in both RHEL 4 (originally reported) or just RHEL 5?
Comment 44 Andrius Benokraitis 2008-01-07 11:43:40 EST
Whoops, forget that - I now see the RHEL 4.7 bug 247905, and the 5.2 bug 364351.
Comment 45 Jeff Layton 2008-01-07 14:24:41 EST
> In researching my problem I found bug 327591 and by extension bug 364351. I
> used the test script documented in bug 327591 and found at the following
> URL.
> 
>
http://www.linux-nfs.org/Linux-2.6.x/2.6.21/linux-2.6.21-021-fix_readdir_stale_cache.dif

Aye, that's the rub. We can easily reproduce *a* problem on both RHEL4 and
RHEL5, and that patch fixes it. I'm proposing that patch for the next set of
updates (see bug 364351 and bug 364361).

The problem is that we have no way to know whether the problem originally
reported in this BZ has anything to do with either bug 247905 or bug 364351.
I'll also note that there's another patch that might be related in bug 401551,
but I'm planning to devel_nack that patch at this point since it doesn't seem
to fix the problem reported in that BZ.

I've asked the original reporters to produce a core dump if possible, but no one
has come across with one as of yet. In fact, I'm going to set this back to
needinfo since we still don't have a clear understanding of the problem reported
in this bug and I need something to go on at this point. A way to reproduce this
at will would be ideal, but a crash dump may also be helpful.


Comment 46 jas 2008-01-07 14:31:56 EST
Unfortunately (I guess), it hasn't happened again.  It's very, very strange.
Comment 47 Jeff Layton 2008-01-07 14:42:03 EST
Ok, setting this to NEEDINFO until it does...
Comment 48 jas 2008-01-15 13:04:14 EST
Ok.  Finally, it has happened again.  

I haven't had time to look into setting up kernel dumps since this would have
required setting this up on every machine in our department "just in case", and
my schedule hasn't allowed me this time.

Here are the present details.  I can send the output of commands that you like...

I modified /cs/local/lib/udb/get-classlist on an NFS client (hop) today:

% ls -ali /cs/local/lib/udb/get-classlist
19858066 -rwxr-x---  1 root tech 433 Jan 15 11:49 /cs/local/lib/udb/get-classlist

From all of our other NFS clients you see the old copy:

% ls -ali /cs/local/lib/udb/get-classlist
19858379 -rwxr-x---  1 root tech 590 Jan  2  2007 /cs/local/lib/udb/get-classlist

Comment 49 Jeff Layton 2008-01-15 14:02:13 EST
What kernel is on this client? How long has it been up?

> I haven't had time to look into setting up kernel dumps since this would have
> required setting this up on every machine in our department "just in case",
> and my schedule hasn't allowed me this time.

I don't think we can tell much here without a core. The other alternative is to
run crash on the live kernel, but that's really too much to do by "remote
control", not to mention that the data structures in the live kernel might
change while you're looking at the problem.

Though if you want to take a crack at it, get the debuginfo packages for your
kernel (talk to your support person, I'm not sure if they're available through
RHN), install the "crash" kernel debugger on it and then run crash on the live
kernel. Dave Anderson's people page has some good documentation for starters:

    http://people.redhat.com/anderson

The idea will be to track down the inode that's not showing the changes, and its
parent inode. From that, you'll want to walk through the validation logic in
nfs_revalidate_inode and nfs_update_inode, track down the corresponding data
structures in kernel memory and see if you can tell why things aren't being
revalidated like they should.

Let me know if you have questions or get a core...
Comment 50 Peter Staubach 2008-01-15 14:54:22 EST
Regarding Comment #48, yes, please, do post the commands required to get
to this state.

I am wondering whether some changes for bz157266,

        https://bugzilla.redhat.com/show_bug.cgi?id=157266

might resolve this.  That bugzilla references metadata cache handling
and better handles stale file handles and situations where a file
has changed state in some unexpected fashion.
Comment 51 jas 2008-01-15 15:33:19 EST
Hi Peter.

I think my message wasn't clear.  I was saying that I could send the output of
any commands that you would like me to try to provide details necessary to solve
the problem.  I only edited the file to get to this particular state, and can't
really provide other feedback.  I am looking into whether I can provide a kernel
dump if diskdump/netdump were not previously installed.


Comment 52 Peter Staubach 2008-01-15 15:56:13 EST
Oops, sorry, I misread your comment.  Perhaps too much hopeful thinking
on my part...  :-)
Comment 53 Jeff Layton 2008-01-16 07:12:45 EST
To clarify, when this problem occurs, here's what I need:

1) run the "stat" command against the "problem" file, and its parent directory,
and its parent directory, etc. all the way back to the mountpoint on the client.
This needs to be done on the machine showing the problem and from another
machine that isn't (the server is fine for the machine that isn't). The idea is
that I need to understand what the client thinks about the file's metadata and
what the real metadata should be. Collect the output from both machines.

2) After that, collect a coredump. This means setting up either netdump or
diskdump, and forcing a crash. The easiest way to crash the box is to do:

     # echo c > /proc/sysrq-trigger

...it's highly recommended that you test your ability to collect crash dumps
before relying on them. Getting them working correctly can be tricky. I
recommend getting our support people involved as well. They've helped a lot of
people set them up.
Comment 54 Jeff Layton 2008-01-16 07:16:49 EST
My hope there is that I'll be able to track down the inode that's showing the
issue and the NFS-specific inode info and can determine why the file isn't being
revalidated as it should be.

...actually, I have a 1.5 here as well:

1.5) just before crashing the box, do this:

# echo 32767 > /proc/sys/sunrpc/nfs_debug

...and then do a final:

# stat /path/to/problem/file

...this should dump a bit of NFS debug info into the ring buffer. The core
should then have that info and that should help me to track it down the inode
inside the coredump. It may also give me a hint as to why the inode isn't being
revalidated.
Comment 55 jas 2008-01-16 10:19:48 EST
As it happens, I can now provide everything that you have requested including
the core.

on server experiencing the problem:

# cd /cs/local/lib/udb
# stat get-classlist
  File: `get-classlist'
  Size: 590             Blocks: 8          IO Block: 8192   regular file
Device: 14h/20d Inode: 19858379    Links: 1
Access: (0750/-rwxr-x---)  Uid: (    0/    root)   Gid: ( 1000/    tech)
Access: 2008-01-16 03:24:53.000000000 -0500
Modify: 2007-01-02 13:12:09.000000000 -0500
Change: 2008-01-15 11:49:16.000000000 -0500
# cd ..
# stat udb
  File: `udb'
  Size: 4096            Blocks: 8          IO Block: 8192   directory
Device: 14h/20d Inode: 19858363    Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: ( 1000/    tech)
Access: 2008-01-16 03:14:13.000000000 -0500
Modify: 2008-01-15 13:12:24.000000000 -0500
Change: 2008-01-15 13:12:24.000000000 -0500
# cd ..
# stat lib
  File: `lib'
  Size: 36864           Blocks: 72         IO Block: 8192   directory
Device: 14h/20d Inode: 6505094     Links: 125
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: ( 1000/    tech)
Access: 2008-01-16 03:08:29.000000000 -0500
Modify: 2007-12-11 11:24:47.000000000 -0500
Change: 2007-12-11 11:24:47.000000000 -0500
# cd ..
# stat local
  File: `local'
  Size: 4096            Blocks: 8          IO Block: 8192   directory
Device: 14h/20d Inode: 14385919    Links: 19
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    1/     bin)
Access: 2008-01-16 03:08:22.000000000 -0500
Modify: 2007-05-17 09:08:36.000000000 -0400
Change: 2007-05-17 09:08:36.000000000 -0400
# cd ..
# stat cs
  File: `cs'
  Size: 4096            Blocks: 8          IO Block: 8192   directory
Device: 13h/19d Inode: 2801958     Links: 9
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2008-01-16 09:17:05.000000000 -0500
Modify: 2008-01-02 11:05:48.000000000 -0500
Change: 2008-01-02 11:05:48.000000000 -0500

On another server not experiencing the problem:

# cd /cs/local/lib/udb
# stat get-classlist
  File: `get-classlist'
  Size: 433             Blocks: 8          IO Block: 8192   regular file
Device: 14h/20d Inode: 19858066    Links: 1
Access: (0750/-rwxr-x---)  Uid: (    0/    root)   Gid: ( 1000/    tech)
Access: 2008-01-16 03:24:53.000000000 -0500
Modify: 2008-01-15 11:49:16.000000000 -0500
Change: 2008-01-15 11:49:16.000000000 -0500
# cd ..
# stat udb
  File: `udb'
  Size: 4096            Blocks: 8          IO Block: 8192   directory
Device: 14h/20d Inode: 19858363    Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: ( 1000/    tech)
Access: 2008-01-16 09:42:34.000000000 -0500
Modify: 2008-01-15 13:12:24.000000000 -0500
Change: 2008-01-15 13:12:24.000000000 -0500
# cd ..
# stat lib
  File: `lib'
  Size: 36864           Blocks: 72         IO Block: 8192   directory
Device: 14h/20d Inode: 6505094     Links: 125
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: ( 1000/    tech)
Access: 2008-01-16 03:08:29.000000000 -0500
Modify: 2007-12-11 11:24:47.000000000 -0500
Change: 2007-12-11 11:24:47.000000000 -0500
# cd ..
# stat local
  File: `local'
  Size: 4096            Blocks: 8          IO Block: 8192   directory
Device: 14h/20d Inode: 14385919    Links: 19
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    1/     bin)
Access: 2008-01-16 03:08:22.000000000 -0500
Modify: 2007-05-17 09:08:36.000000000 -0400
Change: 2007-05-17 09:08:36.000000000 -0400
# cd ..
# stat cs
  File: `cs'
  Size: 4096            Blocks: 8          IO Block: 8192   directory
Device: 13h/19d Inode: 2801958     Links: 9
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2008-01-16 09:17:05.000000000 -0500
Modify: 2008-01-02 11:05:48.000000000 -0500
Change: 2008-01-02 11:05:48.000000000 -0500

On both systems in /etc/fstab:

forest-mrpriv:/obj/cs   /cs nfs hard,rw,intr,nfsvers=3,tcp 0 0
forest-mrpriv:/obj/local-intel-linux /cs/local nfs hard,rw,intr,nfsvers=3,tcp 0 0

On the bad system:

# echo 32767 >! /proc/sys/sunrpc/nfs_debug
# stat /cs/local/lib/udb/get-classlist
# sysctl -w kernel.sysrq=1
# echo c >! /proc/sysrq-trigger

Where do I securely upload the 4.2 GB core and associated log file?


Comment 56 Jeff Layton 2008-01-16 10:35:59 EST
Please bzip2 the core file and log, and follow these instructions:

To upload file(s):
>lftp dropbox.redhat.com
>cd incoming
>put unique-filename
Or
>mput unique-filename1 unique-filename2 ... unique-filenameX

...after it's uploaded, email me the exact filenames that they were given (it's
a blind directory and I can't easily list the files within it).
Comment 57 jas 2008-01-16 14:22:20 EST
As you are aware, the core file and log have been uploaded, and you have been
e-mailed as instructed.  Let the debugging begin :) :)

Comment 58 Jeff Layton 2008-01-16 14:48:12 EST
I've unzipped the core -- can you elaborate where I can get the sources, kernel
package and debuginfo for this kernel?

     2.6.9-55.0.2.EL.CSE.smp

...in particular, the i686 version. Note that this kernel is unsupported by Red
Hat, but presuming that it doesn't have any NFS related patches, I'd still like
to have a look and see if I can determine the cause of this problem.

Even without that, there's some interesting info in the logs:

All of the containing directory info seems to be identical on both machines. The
main difference seems to be the file itself. The inode number has changed, which
suggests that the file was silly-deleted or renamed and then replaced with a new
copy.

There are several cases where the "udb" directory is revalidated, but the cache
is not purged. This one is interesting:

NFS: nfs_do_filldir() filling ended @ cookie 4824; returning = -22
NFS: nfs_update_inode(0:14/19858363 ct=1 info=0x6)
NFS: dentry_delete(udb/get-classlist, 8)
NFS: readdir_search_pagecache() searching for cookie 4824
NFS: find_dirent_page() searching directory page 0
NFS: found cookie 12
...
NFS: revalidating (0:14/19858363)
NFS call  getattr
NFS reply getattr
NFS: nfs_update_inode(0:14/19858363 ct=1 info=0x6)
NFS: (0:14/19858363) revalidation complete
NFS: readdir_search_pagecache() searching for offset 0
NFS: find_dirent_page() searching directory page 0
NFS: found cookie 12 at index 0

...so here we're doing a GETATTR against the directory to validate the info we
have in cache. We're not invalidating the cache however, so that suggests that
the directory contents in pagecache are tagged with the current timestamp.
Obviously, however, the contents are old.

There are a couple of possibilities here:

1) Given that this is a 32-bit kernel, this bug is a possibility:

https://bugzilla.redhat.com/show_bug.cgi?id=228801

...especially if the host has been up for a long time and this directory is only
infrequently accessed.

2) another possibility is that we're looking at a race of some sort where a
readdirplus was done against this directory just before the old inode was
removed and this inode added to it. If all of that happened within the same
second, then the client would have no way to know that its cached info was out
of date. This is a limitation of ext3, so a different server filesystem that
supports nanosecond timestamps might be worth considering.

3) finally, we can't rule out bug 364361 here. The log never shows any getattrs
being done against either inode of get-classlist. I also never see any
readdirplus calls being done. This suggests that the inode is being repopulated
out of the cached readdirplus information. It's not clear to me that that info
would be getting flushed out on a kernel with the patch in 364361, but it's
possible it would help.
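The one-second race in possibility (2) can be sketched like this (a toy model of ours, not client code):

```python
# Toy model of why one-second timestamp granularity on the server
# filesystem (e.g. ext3) can hide a change: if the directory is cached and
# then modified within the same second, the revalidation comparison sees
# identical mtimes and keeps the stale cache.
def cache_is_stale(cached_mtime, server_mtime):
    return cached_mtime != server_mtime

t = 1200420744                      # whole seconds: ext3 granularity
assert not cache_is_stale(t, t)     # change in the same second goes unseen
assert cache_is_stale(t, t + 1)     # a later (or sub-second) mtime would show it
```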
Comment 59 jas 2008-01-16 15:42:05 EST
Although we do recompile the kernel, it is only because we enable a couple of
built-in modules that are not in the default configuration, like ieee1394.  We
don't change the source at all. We used to do this for reiserfs as well, but we
don't use reiser any more.   For your purposes, you can assume it is identical to:

kernel-smp-2.6.9-55.0.2.EL.i686.rpm

.. which was, I think, what came with 4.5...

I can send you the debuginfo.rpm from the compile if you like.

As for uptime, the host was up around 27 days, I believe.

Comment 60 jas 2008-01-16 15:44:01 EST
By the way, the file get-classlist was updated by loading it into emacs/vi (I
can't remember which I used), making a few changes, and then saving.

Comment 61 Jeff Layton 2008-01-17 06:48:18 EST
Changing tagline to something more descriptive...
Comment 62 Jeff Layton 2008-01-17 07:16:43 EST
Yes, please do upload the debuginfo RPM. There are differences in memory
locations with this kernel and the stock debuginfo won't work. You can just FTP
as you did the core and post the filename here.

Comment 63 jas 2008-01-17 08:55:21 EST
jas327591.tar.bz2 uploaded 
(including kernel, devel, debuginfo)


Comment 64 Jeff Layton 2008-01-17 10:03:20 EST
Thanks. Got the debuginfo and I'm now able to look in the core. Some notes...

Here's the mount in question:

crash> mount /cs/local
VFSMOUNT SUPERBLK TYPE   DEVNAME                              DIRNAME
f6c63e00 c31ad200 nfs    forest-mrpriv:/obj/local-intel-linux /cs/local 

...this tells us the top level dentry in the superblock:

crash> struct super_block.s_root c31ad200
  s_root = 0xf6c2a214,

..."lib" should be on the d_subdirs list:

crash> struct dentry.d_subdirs 0xf6c2a214
  d_subdirs = {
    next = 0xdbcb979c, 
    prev = 0xf6703c5c
  }, 

walking through the list, the dentry addresses are:

lib = 0xf67e217c
udb = 0xf14af63c
get-classlist = 0xf1d9f63c

Comment 65 Jeff Layton 2008-01-17 10:20:33 EST
...inode for get-classlist is here:

crash> struct dentry.d_inode 0xf1d9f63c
  d_inode = 0xf1748e8c, 

...abusing the "list" functionality in crash's struct command gives us this:

crash> struct nfs_inode -l nfs_inode.vfs_inode 0xf1748e8c

...some interesting fields in the nfs_inode:

  cache_validity = 0, 
  readdir_timestamp = 0, 
  read_cache_jiffies = 2338890832, 
  attrtimeo = 60000, 
  attrtimeo_timestamp = 2338890832, 
  change_attr = 0, 
  last_updated = 2338890832, 
  cache_change_attribute = 2262627382, 
  data_updates = {
    counter = 0
  }, 

...

    i_ino = 19858379, 

...at this point I need to look over the revalidation logic and make sense of
what we're seeing here. I'll also probably need to look over similar info for
the "udb" directory.
Comment 66 Jeff Layton 2008-01-17 10:47:18 EST
For udb directory:

  cache_validity = 0, 
  readdir_timestamp = 2273924542, 
  read_cache_jiffies = 2338941777, 
  attrtimeo = 60000, 
  attrtimeo_timestamp = 2338890515, 
  change_attr = 17388445753563687752, 
  last_updated = 2338941777, 
  cache_change_attribute = 2273924542, 
  data_updates = {
    counter = 0
  }, 


....many of the values here are in units of jiffies -- at the time of the core:

crash> p jiffies
jiffies = $1 = 2339201735

...from the info in the core:

        DATE: Wed Jan 16 10:10:54 2008
      UPTIME: 27 days, 01:51:41

Comment 67 Jeff Layton 2008-01-18 09:29:48 EST
So what I want to do is to determine from the timestamps in the inode fields, at
exactly what wallclock times we did the last readdir on this directory and
revalidated these inodes.

So I need to establish the exact timeofday at boot time, so we can interpret the
values in jiffies to wallclock times.

current time at time of crash:

crash> p xtime
xtime = $1 = {
  tv_sec = 1200496254, 
  tv_nsec = 661257126
}

Add wall_to_monotonic to determine uptime:

crash> p wall_to_monotonic
wall_to_monotonic = $2 = {
  tv_sec = -1198157670, 
  tv_nsec = 332117792
}

So that gives boot time as roughly this many seconds ago:

tv_sec = 1200496254-1198157670 + (661257126-332117792)/10^9 = 2338584.329139334

...which should be ~ this many jiffies:

2338584329.139334000

...but we're off by quite a bit here from the jiffies value in the kernel. This
implies that we have 617405 more timer interrupts than we should (a little over
10 minutes).

617405.86066600

...so perhaps wall_to_monotonic was skewed? I see that ntpd is running, so it's
possible that it's causing things not to add up exactly. I'll need to ponder how
best to make sense of the jiffies values above. It may be best to not worry
about wall_to_monotonic and to simply try to work backward from the time of the
crash.
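The arithmetic above can be rechecked mechanically (assuming HZ=1000 on this kernel, i.e. 1 jiffy per millisecond, which the figures imply):

```python
# Rechecking the boot-time arithmetic from the core values quoted above.
xtime_sec, xtime_nsec = 1200496254, 661257126      # xtime at crash
w2m_sec, w2m_nsec = -1198157670, 332117792         # wall_to_monotonic

uptime = (xtime_sec + w2m_sec) + (xtime_nsec - w2m_nsec) / 1e9
expected_jiffies = uptime * 1000                   # HZ = 1000 assumed
actual_jiffies = 2339201735
extra_ticks = actual_jiffies - expected_jiffies    # ~617406: a bit over 10 min
```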
Comment 68 Jeff Layton 2008-01-18 09:51:05 EST
I'm guessing you're in Eastern timezone too, since the mtime that you posted on
the file and in the core works out exactly to my local timezone...

Working backward from jiffies at time of crash. The readdir_timestamp is set to
jiffies just before a readdir/readdirplus call goes out onto the wire:

(2339201735-2273924542)/1000 = 65277.19300000000000000000

1200496254-65277 = 1200430977

$ perl -e 'print localtime(1200430977) . "\n";'
Tue Jan 15 16:02:57 2008

...so that's when the last readdir was actually done on the wire. Well after the
mtime on the directory, which is:

[jlayton@tleilax ~]$ perl -e 'print localtime(1200420744) . "\n";'
Tue Jan 15 13:12:24 2008
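Working that back-calculation in code (HZ=1000 assumed; the UTC result below corresponds to the 16:02:57 EST shown above, EST being UTC-5):

```python
import datetime

# Back-calculating the wallclock time of the last on-the-wire readdir from
# the jiffies values quoted in the comments above.
jiffies_at_crash = 2339201735
readdir_timestamp = 2273924542        # jiffies when the readdir hit the wire
xtime_at_crash = 1200496254           # wallclock seconds at crash time

seconds_ago = (jiffies_at_crash - readdir_timestamp) // 1000
readdir_time = xtime_at_crash - seconds_ago
when = datetime.datetime.fromtimestamp(readdir_time, datetime.timezone.utc)
```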
Comment 69 Jeff Layton 2008-01-18 11:32:27 EST
Just a note that the patch that was originally proposed for bug 401551 might be
of use here. I need to check some other metadata first and see if I can verify
that...

Comment 70 Jeff Layton 2008-01-18 14:50:00 EST
So this is interesting...

The dentry's d_time field is set by nfs_renew_times(). If I read it correctly,
then it gets set to jiffies whenever the dentry is revalidated or created.

crash> struct dentry.d_time 0xf1d9f63c
    d_time = 1518062219,

...which is:

    Sun Jan  6 22:05:14 2008

...so the dentry has existed since at least then and hasn't had its time renewed
since.

What I'd like to do now is have a look at the readdirplus data in pagecache and
see if we can confirm that the OTW readdirplus call has the new fileid for the
file. This is tricky though since the pages in question are indexed by a radix
tree and I'm not sure how to walk a radix tree in crash...
Comment 71 Jeff Layton 2008-01-21 10:11:45 EST
Ok, this is odd...

crash> struct inode.i_nlink,i_dentry 0xf1748e8c
  i_nlink = 1, 
  i_dentry = {
    next = 0xcfb03384, 
    prev = 0xf1d9f67c
  }, 

crash> list 0xcfb03384
cfb03384
f1d9f67c
f1748e9c

crash> struct dentry.d_name -l dentry.d_alias f1d9f67c
  d_name = {
    hash = 1734851129, 
    len = 13, 
    name = 0xf1d9f6b0 "get-classlist"
  }, 
crash> struct dentry.d_name -l dentry.d_alias cfb03384
  d_name = {
    hash = 1903515488, 
    len = 14, 
    name = 0xcfb033b8 "get-classlist~"
  }, 

...so i_nlink=1, which means that the inode shouldn't be hardlinked, but there
are 2 dentries attached to it. This seems significant, but I'm not quite sure
how this would occur...
Comment 72 Jeff Layton 2008-01-21 12:00:01 EST
Presuming that the contents of this directory have not changed, can you run a
stat against the get-classlist~ file on the server and post the output here?

Comment 73 Jeff Layton 2008-01-21 12:02:23 EST
As a note, I'm starting to think that this might be another manifestation of bug
228801. Comment #24, in particular, seems to show the exact same symptoms, but
you also seem to be occasionally getting stale positive dentries.
Comment 74 Jeff Layton 2008-01-21 12:32:57 EST
Yes, I think the patch for 228801 would probably fix this. We're obviously
occasionally getting a valid dcache lookup hit, so we're probably passing this
dentry into nfs_lookup_revalidate() at least occasionally. That function does this:

        /* Force a full look up iff the parent directory has changed */
        if (nfs_check_verifier(dir, dentry)) {
                if (nfs_lookup_verify_inode(inode, nd))
                        goto out_zap_parent;
                goto out_valid;
        }

The patch for bug 228801 changes nfs_check_verifier. In particular, it adds this
condition:

+			|| verf != NFS_I(dir)->cache_change_attribute)

...from what I can tell, if this dentry passed through this function at the time
of the core, this function would have returned 0 and we wouldn't be blindly
accepting the dentry. It would instead be calling nfs_cached_lookup() on it and
we'd be getting a new dentry (if I'm interpreting this correctly).
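If I'm reading it right, the effect of the added condition can be modeled in miniature (a sketch of ours, not kernel code, plugging in the values pulled from this core):

```python
# Miniature model of the check the bug 228801 patch adds to
# nfs_check_verifier(): the dentry's cached verifier must still match the
# parent directory's cache_change_attribute, otherwise the dentry can't be
# trusted and a fresh lookup is forced.
def verifier_ok(dentry_d_time, dir_cache_change_attribute):
    return dentry_d_time == dir_cache_change_attribute

# values observed in the core: the stuck get-classlist dentry vs. the udb
# directory's cache_change_attribute
stale = verifier_ok(1518062219, 2273924542)    # mismatch: force fresh lookup
fresh = verifier_ok(2273924542, 2273924542)    # match: cache can be trusted
```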

I'm going to close this as a dupe of bug 228801. I recommend updating to the 4.6
kernel and seeing if you're still able to reproduce this. If so, then please
reopen this bug and I'll have a second look.



*** This bug has been marked as a duplicate of 228801 ***