Description of problem:
In RHTS we have observed an NFS symlink/readlink test failure on the same host in two different runs: once with the 74.EL kernel and once with the 76.EL kernel. Captured logs:

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=3261981
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=3422373

Version-Release number of selected component (if applicable):
Observed in 74.EL and 76.EL

How reproducible:
Observed twice so far.

Steps to Reproduce:
1. rhts connectathon tests
What kind of server was this? Looking at the capture, the problem appears to be due to packet loss:

21864 21.806849 10.16.65.101 -> 10.16.46.228 NFS V2 LOOKUP Call, DH:0x8033d14a/file.8
21865 21.807040 10.16.46.228 -> 10.16.65.101 NFS V2 LOOKUP Reply (Call In 21864), FH:0x5d005c02

...we do a lookup and file.8 is there.

21866 21.807066 10.16.65.101 -> 10.16.46.228 NFS V2 READLINK Call, FH:0x5d005c02
21867 21.807247 10.16.46.228 -> 10.16.65.101 NFS V2 READLINK Reply (Call In 21866) Path:/this/is/a/symlink8

...a readlink call that's probably unrelated.

21868 21.807272 10.16.65.101 -> 10.16.46.228 NFS V2 REMOVE Call, DH:0x8033d14a/file.8
21869 22.906839 10.16.65.101 -> 10.16.46.228 NFS [RPC retransmission of #21868] V2 REMOVE Call, DH:0x8033d14a/file.8
21870 22.907050 10.16.46.228 -> 10.16.65.101 NFS V2 REMOVE Reply (Call In 21868) Error:NFSERR_NOENT

...then we do a remove, get no reply, and resend the call. What appears to have happened is that the server received the call, did the remove, and the reply got lost somewhere along the way. I had thought that the fact that we're using the same XID should make the server just ignore the retransmission (and replay the cached reply), but maybe it doesn't?
As best I can tell, the client is doing the right thing here. The problem appears to be server side. Let me know what kind of servers these are -- I'm guessing RHEL3/4 according to the logs but I'd like to know what kernels they're running. Maybe I can set up another reproducer simulating extreme packet loss and see if we can reproduce this...
Thanks Jeff... I'm pretty sure that this isn't a client-side bug. For now, let's call this a server-side bug and move this to being a RHEL4.6.z problem. The server is running:

kernel-smp-2.6.9-67.0.3.EL

...the problem may also be in RHEL3, but I'll focus on RHEL4 for now. This bug seems somewhat similar to bug 430946, but that problem should be specific to READDIR calls. This, however, looks pretty clearly to be a problem with REMOVEs. Still, it would be good to know whether this problem is reproducible with a 4.7-ish kernel.

I took a quick stab at reproducing this and it didn't happen; I'll plan to give it another shot when I get the chance. My guess is that the problem is something wrong with duplicate reply cache handling, but nothing's really standing out at me so far...
It would be interesting to know what the load on the server(s) looked like when the problem occurred. I wonder if there was enough load from other clients to cause the cache entry for the REMOVE to be recycled too early.

It also seems odd to me that the client is issuing a READLINK request at all. In my normal testing, the client does not issue any READLINK requests because the symbolic links are all cached. This is a separate issue, though, I believe.
That may be the case, though there was only a 1s delay between the original call and the retransmission. That seems like an awfully short time to live in the cache... (todo: check the conditions that cause the cache to cycle out old entries)

Another possibility is that there was some sort of problem allocating for or encoding the reply after the unlink was already done. It looks like nfsd_dispatch calls this in those situations:

    nfsd_cache_update(rqstp, RC_NOCACHE, NULL);

...so the reply wouldn't get cached (or sent), but the file would have still been removed.
Handing this off to Bruce since this is more his area. Please hand it back to me if you don't have the cycles to deal with it...

Occasionally we get RHTS (test suite) failures that appear to be due to the server attempting to re-perform a call instead of sending a reply out of the DRC. I'm assuming this means that the server is flushing entries out of the DRC too quickly, or that the reply never made it into the cache in the first place for some reason. We've definitely seen this on RHEL4 servers. It's not clear whether we've seen it on anything more recent.
If the problem is an allocation or encoding error, turning on nfsd svc debugging (rpcdebug -m nfsd -s svc) should result in errors being logged in that case. Can this be reliably reproduced, e.g. by running that single test repeatedly in a tight loop?

It doesn't look to me like entries should drop out of the cache so fast, but I need to review the code more carefully.
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested review is now End of Life. Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via the appropriate support channels and provide additional supporting details about the importance of this issue.