+++ This bug was initially created as a clone of Bug #428720 +++ Description of problem: Connectathon NFSv2 udp base test06 connectathon fails when running against a RHEL4.U6 server. Version-Release number of selected component (if applicable): 2.6.18-64.el5 How reproducible: Intermittently Steps to Reproduce: 1. Run the Connectathon test in RHTS 2. with RHEL5.1 base distro and 2.6.18-64-el5 Actual results: We saw this issue two different kernels. With -64,-67 -64 log http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=1539980 -67 log http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=1583755 While running the rhel4-nfs NFSv2 udp base test06 connectathon test we see these errors: ./test6: readdir ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read '.' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read '..' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.87' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.88' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.89' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.90' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.91' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.92' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.93' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.94' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.95' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.96' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.97' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.98' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.99' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.100' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.101' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.102' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.103' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.104' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.105' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.106' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.107' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.108' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.109' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.110' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.111' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.112' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.113' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.114' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.115' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.116' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.117' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.118' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.119' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.120' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.121' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.122' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.123' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.124' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.125' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.126' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.127' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.128' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.129' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.130' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.131' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.132' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.133' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.134' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.135' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.136' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.137' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.138' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.139' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.140' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.141' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.142' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.143' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.144' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.145' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.146' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.147' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.148' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.149' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.150' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.151' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.152' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.153' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.154' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.155' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.156' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.157' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.158' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.159' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.160' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.161' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.162' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.163' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.164' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.165' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.166' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.167' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.168' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.169' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.170' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.171' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.172' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.173' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.174' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.175' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.176' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.177' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.178' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.179' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.180' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.181' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.182' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.183' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.184' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.185' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.186' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.187' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.188' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.189' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.190' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.191' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.192' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.193' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.194' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.195' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.196' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.197' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.198' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.199' dir entry, pass 87 ./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) Test failed with 115 errors basic tests failed Looking at the latest failure it was between: ----- start: Fri Jan 11 10:38:13 EST 2008 ----- and ----- end: Fri Jan 11 10:38:50 EST 2008 ----- Looking at the server log files during that time I see the following: Jan 11 10:38:11 intel-d3c4102-01 mountd[4041]: authenticated unmount request from ixen-02.rhts.boston.redhat.com:920 for /export/home (/export/home) Jan 11 10:38:13 intel-d3c4102-01 mountd[4041]: authenticated mount request from hp-olympia1.rhts.boston.redhat.com:637 for /export/home (/export/home) Jan 11 10:38:17 intel-d3c4102-01 mountd[4041]: authenticated unmount request from hp-matterhorn1.rhts.boston.redhat.com:787 for /export/home (/export/home) Jan 11 10:38:20 intel-d3c4102-01 mountd[4041]: authenticated mount request from intel-s6e5231-01.rhts.boston.redhat.com:1022 for /export/home (/export/home) Jan 11 10:38:20 intel-d3c4102-01 mountd[4041]: authenticated mount request from ixen-02.rhts.boston.redhat.com:600 for /export/home (/export/home) Jan 11 10:38:26 intel-d3c4102-01 mountd[4041]: authenticated unmount request from sun-x4600-01.rhts.boston.redhat.com:731 for /export/home (/export/home) Jan 11 10:38:27 intel-d3c4102-01 mountd[4041]: authenticated mount request from hp-matterhorn1.rhts.boston.redhat.com:894 for /export/home (/export/home) Jan 11 10:38:31 intel-d3c4102-01 mountd[4041]: authenticated unmount request from intel-s6e5231-01.rhts.boston.redhat.com:708 for /export/home (/export/home) Jan 11 10:38:36 intel-d3c4102-01 mountd[4041]: authenticated mount request from sun-x4600-01.rhts.boston.redhat.com:838 for /export/home (/export/home) Jan 11 10:38:40 intel-d3c4102-01 mountd[4041]: authenticated mount request from intel-s6e5231-01.rhts.boston.redhat.com:815 for /export/home (/export/home) Jan 11 10:38:47 intel-d3c4102-01 mountd[4041]: authenticated unmount request from dell-pesc430-02.rhts.boston.redhat.com:939 for /export/home (/export/home) Jan 11 10:38:48 intel-d3c4102-01 mountd[4041]: authenticated unmount request from hp-matterhorn1.rhts.boston.redhat.com:648 for /export/home (/export/home) Jan 11 10:38:53 intel-d3c4102-01 mountd[4041]: authenticated unmount request from ixen-02.rhts.boston.redhat.com:854 for /export/home (/export/home) Jan 11 10:38:58 intel-d3c4102-01 mountd[4041]: authenticated mount request from dell-pesc430-02.rhts.boston.redhat.com:632 for /export/home (/export/home) Jan 11 10:38:58 intel-d3c4102-01 rpc.statd[3630]: Received erroneous SM_UNMON request from 127.0.0.1 for 10.12.5.42 Jan 11 10:38:59 intel-d3c4102-01 mountd[4041]: authenticated mount request from hp-matterhorn1.rhts.boston.redhat.com:756 for /export/home (/export/home) The client ia64 that was running the test was hp-olympia1.rhts.boston.redhat.com has address 10.12.4.185 I did not see anything obvious that happened on the server during that moment in time. The client i386 that had this issue was hp-dl585-01.rhts.boston.redhat.com Expected results: This should pass without errors. Additional info: The first time it happened it was on a i386 kernel. The second time was on a ia64 kernel. Here is a link to a packet capture while the error happened. http://rhts.lab.boston.redhat.com/testlogs/13197/45855/357857/rhel4-nfs_nfsvers=2_udp_base.cap.bz2 -- Additional comment from jlayton on 2008-01-15 10:03 EST -- I'll try to set up a test rig today and see if I can run this test in a loop and reproduce this. -- Additional comment from jlayton on 2008-01-15 14:37 EST -- Ran the test in a loop for several hours and never hit the problem once. Looking at the capture is interesting though. If I understand the test correctly, it creates 200 files in a directory, and then removes the files one by one, doing a readdir in between each remove and checking that the directory contents are what is expected. The capture shows everything going along well until after file.86 is removed. After that, we do a READDIR call. It looks like the first part of the READDIR response is lost, however and we get a UDP packet that contains what looks like the last part of the READDIR response (starting at ~file.145). Presumably, the client just discards this. After this are 2 retransmissions of the READDIR request, and finally a READDIR reply, which just contains the first 70 bytes of the packet. It ends just after the NFS status code in the header (which was NFS_OK). Wireshark says it's a malformed packet. Judging from the results of the test though, it looks like maybe the client accepted the packet and took it to mean that the directory was totally empty. Maybe the nfs2 XDR decoder is borked here? This packet should probably have been rejected... -- Additional comment from staubach on 2008-01-15 14:49 EST -- The NFSv2 code may not handle a borked response correctly, but the real questions are, I think, why the server generated such a packet and what happened to the rest of the first response or whether the server was doing the right things but the client mangled these two packets. This sure looks like either a lower level networking transport implementation issue or an issue where the NFS server manipulated, ie. mangled/freed/etc, the response data structures before the network transport implementation was done with them. -- Additional comment from jlayton on 2008-01-15 15:01 EST -- Indeed. I'm thinking we're looking at (at least) 2 bugs here. One on the client and one on the server. I wonder if we could reproduce the server bug with pynfs or something -- make it so that the first readdir response on a directory is dropped and then see how the response looks after the request is retransmitted... -- Additional comment from jlayton on 2008-01-16 09:34 EST -- After having a closer look, I think that Peter's theory about the server generating a bogus reply is probably the most likely. The UDP wrapper on the packet has the correct length field, so the packet is consistent from the standpoint of the UDP layer. The NFS field is definitely truncated though. After the status code, there's nothing. So this does seem like it's probably a server bug. One that is likely triggered by spotty network connectivity causing retransmits. On the client side, the NFS readdir xdr (nfs_xdr_readdirres) decoder has this: if ((status = ntohl(*p++))) return -nfs_stat_to_errno(status); hdrlen = (u8 *) p - (u8 *) iov->iov_base; if (iov->iov_len < hdrlen) { printk(KERN_WARNING "NFS: READDIR reply header overflowed:" "length %d > %Zu\n", hdrlen, iov->iov_len); return -errno_NFSERR_IO; } else if (iov->iov_len != hdrlen) { ... after setting "status" p should be beyond the end of the packet, and I think we would have fallen into the first case of the following if statement. Do we happen to have dmesg on the server at the time this occured? It would be good to confirm the theory since that KERN_WARNING message should have popped up. -- Additional comment from jburke on 2008-01-16 10:30 EST -- Jeff, Just a point of observation. This has never been seen at all up until recently. So far it has only been seen with a RHEL4 Server and RHEL5 Client. We have only just started testing RHEL5.2 kernels. It has never been seen with a RHEL5.1-Z Client. Jeff -- Additional comment from jlayton on 2008-01-16 11:04 EST -- Sorry, my last post should have read: "Do we happen to have logs from the client when this occurred?" kernel logs, in particular... -- Additional comment from dzickus on 2008-01-16 11:09 EST -- Below in the Bug Info section (and here) you will find the URL to the logs http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=1539980 Scroll to the bottom to see the console logs -- Additional comment from jlayton on 2008-01-16 13:09 EST -- Thanks. Hmm not quite what I was expecting to see, but similar: NFS: readdir reply truncated! ...so it did detect a problem with a readdir response at some point. Most likely, the problem is with the response packet we got. Here's the relevant code from nfs_xdr_readdirres: out: kunmap_atomic(kaddr, KM_USER0); return nr; short_pkt: entry[0] = entry[1] = 0; /* truncate listing ? */ if (!nr) { printk(KERN_NOTICE "NFS: readdir reply truncated!\n"); entry[1] = 1; } goto out; ...so nr was 0, which I think is the number of entries. I wonder if we should also be setting this inside that if statement? nr = -errno_NFSERR_IO; It seems like there's clearly a problem with the response. Returning zero from this function doesn't seem to be considered an error and I think it's what makes higher levels consider this to be a empty dir. -- Additional comment from dzickus on 2008-01-22 10:33 EST -- Is this problem solved by DaveM's solution in bz 429417 comment #5? -- Additional comment from jlayton on 2008-01-22 11:09 EST -- I think probably not (though if DaveM wants to chime in and tell me otherwise, I'd be happy to listen). As best I can tell in this case, the server sent us a bogus readdir response in an RPC retransmission, and the client accepted it as a valid response. I think this is at least 2 bugs, really: A) the server probably has a bug that causes it to send these bogus responses. Peter's theory was that the server might be prematurely releasing a buffer and that makes it send a packet with a status code but no actual entries to the client. I've been considering ways to reproduce this. We might be able to do it by selectively filtering packets. Something like this: 1) client does a readdir request 2) server sends a readdir response. We block the response here. 3) client retransmits the request, we block the retransmissions for a bit, hopefully giving the server to do whatever it does that's wrong here 4) unblock traffic -- hopefully the server will send along a truncated response ...we might even get lucky and force a crash if we turn on memory poisoning here too. B) client side bug, the capture shows the bogus readdir response. I think the client accepted the response and interpreted it to mean that the directory was empty (no entries at all, not even "." or ".."). The function that interprets this is nfs_xdr_readdirres(). The logic in this function is a bit...obtuse. This particular packet seems to be a very special case. We see this message in the client's log: NFS: readdir reply truncated! So we do know is that we ended up in this section of code: out: kunmap_atomic(kaddr, KM_USER0); return nr; short_pkt: entry[0] = entry[1] = 0; /* truncate listing ? */ if (!nr) { printk(KERN_NOTICE "NFS: readdir reply truncated!\n"); entry[1] = 1; } goto out; ...obviously, this function returned 0 here. My thinking is that we might should consider a patch like this here: diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c index 2cd35f2..e72d37b 100644 --- a/fs/nfs/nfs2xdr.c +++ b/fs/nfs/nfs2xdr.c @@ -474,6 +474,7 @@ nfs_xdr_readdirres(struct rpc_rqst *req, u32 *p, void *dummy) if (!nr) { printk(KERN_NOTICE "NFS: readdir reply truncated!\n"); entry[1] = 1; + nr = -errno_NFSERR_IO; } goto out; err_unmap: ...instead of returning 0 entries, we return -EIO and (hopefully) the request would be retransmitted. Peter, any thoughts on that patch? Or am I misinterpreting the readdir response xdr code? -- Additional comment from staubach on 2008-01-22 13:36 EST -- Well, I don't think that I would be quite so quick to dismiss the possibility that it is the bug from 429417. I don't know that it is, but it sounds like perhaps it could be. On the client side, it seems that the client is not being quite as defensive as it could be. The suggestion to return the EIO error seems reasonable to me. -- Additional comment from jlayton on 2008-01-22 14:21 EST -- I could be very wrong here, but here's why I think that's not it... If I'm reading bug 429417 correctly, the problem is that the UDP layer is occasionally getting SKB's that have inconsistent size fields. I don't think that would affect the host's ability to collect captures, however. The packet capture here seem to pretty clearly indicate that the server sent the client a bad packet. I suppose the problem could ultimately be a manifestation of that problem on the server, but I'm not sure. I've not really looked at the server side of this yet... Either way, we should probably go ahead and clone this bug for RHEL4 and start treating the server side as a separate issue.
*** This bug has been marked as a duplicate of 430946 ***