Description of problem:
In RHTS we have observed an NFS symlink/readlink test failure on the same host in two different runs: once with the 74.EL kernel and once with the 76.EL kernel. Captured logs:

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=3261981
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=3422373

Version-Release number of selected component (if applicable):
Observed in 74.EL and 76.EL

How reproducible:
Observed twice so far.

Steps to Reproduce:
1. rhts connectathon tests
What kind of server was this? Looking at the capture, the problem appears to be due to packet loss:

21864 21.806849 10.16.65.101 -> 10.16.46.228 NFS V2 LOOKUP Call, DH:0x8033d14a/file.8
21865 21.807040 10.16.46.228 -> 10.16.65.101 NFS V2 LOOKUP Reply (Call In 21864), FH:0x5d005c02

...we do a lookup and file.8 is there.

21866 21.807066 10.16.65.101 -> 10.16.46.228 NFS V2 READLINK Call, FH:0x5d005c02
21867 21.807247 10.16.46.228 -> 10.16.65.101 NFS V2 READLINK Reply (Call In 21866) Path:/this/is/a/symlink8

...a readlink call that's probably unrelated.

21868 21.807272 10.16.65.101 -> 10.16.46.228 NFS V2 REMOVE Call, DH:0x8033d14a/file.8
21869 22.906839 10.16.65.101 -> 10.16.46.228 NFS [RPC retransmission of #21868] V2 REMOVE Call, DH:0x8033d14a/file.8
21870 22.907050 10.16.46.228 -> 10.16.65.101 NFS V2 REMOVE Reply (Call In 21868) Error:NFSERR_NOENT

...then we do a remove, get no reply, and resend the call. What appears to have happened is that the server received the call, did the remove, and the reply got lost somewhere along the way. I had thought that the fact that we're using the same XID should make the server just ignore the retransmission (and replay the cached reply), but maybe it doesn't?
As best I can tell, the client is doing the right thing here. The problem appears to be server side. Let me know what kind of servers these are -- I'm guessing RHEL3/4 according to the logs but I'd like to know what kernels they're running. Maybe I can set up another reproducer simulating extreme packet loss and see if we can reproduce this...
Thanks Jeff... I'm pretty sure that this isn't a client-side bug. For now, let's call this a server-side bug and move this to being a RHEL4.6.z problem. The server is running:

kernel-smp-2.6.9-67.0.3.EL

...the problem may also be in RHEL3, but I'll focus on RHEL4 for now. This bug seems somewhat similar to bug 430946, but that problem should be specific to READDIR calls. This, however, looks pretty clearly to be a problem with REMOVEs. Still, it would be good to know whether this problem is reproducible with a 4.7-ish kernel.

I took a quick stab at reproducing this and it didn't happen; I'll plan to give it another shot when I get the chance. My guess is that the problem is something wrong with duplicate reply cache handling, but nothing's really standing out at me so far...
It would be interesting to know what the load on the server(s) looked like when the problem occurred. I wonder if there was enough load from other clients to cause the cache entry for the REMOVE to be recycled too early.

It also seems odd to me that the client is issuing a READLINK request at all. In my normal testing, the client does not issue any READLINK requests because the symbolic links are all cached. This is a separate issue, though, I believe.
That may be the case, though there was only a 1s delay between the original call and the retransmission. That seems like an awfully short time to live in the cache... (todo: check the conditions that cause the cache to cycle out old entries)

Another possibility is that there was some sort of problem allocating for or encoding the reply after the unlink was already done. It looks like nfsd_dispatch calls this in those situations:

    nfsd_cache_update(rqstp, RC_NOCACHE, NULL);

...so the reply wouldn't get cached (or sent), but the file would have still been removed.
Handing this off to Bruce since this is more his area. Please hand it back to me if you don't have the cycles to deal with it...

Occasionally we get RHTS (test suite) failures that appear to be due to the server attempting to re-perform a call instead of sending a reply out of the DRC. I'm assuming this means that the server is flushing entries out of the DRC too quickly, or that the reply never made it into the cache in the first place for some reason. We've definitely seen this on RHEL4 servers. It's not clear whether we've seen it on anything more recent.
If the problem is an allocation or encoding error, turning on nfsd svc debugging (rpcdebug -m nfsd -s svc) should result in errors being logged in that case. Can this be reliably reproduced, e.g. by running that single test repeatedly in a tight loop?

It doesn't look to me like entries should drop out of the cache so fast, but I need to review the code more carefully.
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested review is now End of Life. Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via the appropriate support channels and provide additional supporting details about the importance of this issue.