Bug 429734 - Connectathon RHEL5 client to RHEL4 server, Connectathon failure
Connectathon RHEL5 client to RHEL4 server, Connectathon failure
Status: CLOSED DUPLICATE of bug 430946
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel (Show other bugs)
4.6
All Linux
medium Severity medium
: rc
: ---
Assigned To: Jeff Layton
Martin Jenner
http://rhts.lab.boston.redhat.com/cgi...
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2008-01-22 14:22 EST by Jeff Layton
Modified: 2008-02-02 09:49 EST (History)
7 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-02-02 09:49:28 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Jeff Layton 2008-01-22 14:22:58 EST
+++ This bug was initially created as a clone of Bug #428720 +++

Description of problem:
Connectathon NFSv2 udp base test06 connectathon fails when running against a
RHEL4.U6 server.

Version-Release number of selected component (if applicable):
2.6.18-64.el5

How reproducible:
Intermittently

Steps to Reproduce:
1. Run the Connectathon test in RHTS
2. with RHEL5.1 base distro and 2.6.18-64-el5
  
Actual results:
We saw this issue two different kernels. With -64,-67
-64 log
http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=1539980
-67 log
http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=1583755

While running the rhel4-nfs NFSv2 udp base test06 connectathon test we see these
errors:
./test6: readdir
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read '.' dir entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read '..' dir entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.87' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.88' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.89' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.90' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.91' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.92' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.93' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.94' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.95' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.96' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.97' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.98' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.99' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.100' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.101' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.102' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.103' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.104' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.105' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.106' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.107' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.108' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.109' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.110' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.111' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.112' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.113' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.114' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.115' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.116' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.117' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.118' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.119' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.120' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.121' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.122' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.123' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.124' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.125' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.126' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.127' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.128' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.129' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.130' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.131' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.132' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.133' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.134' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.135' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.136' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.137' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.138' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.139' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.140' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.141' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.142' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.143' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.144' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.145' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.146' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.147' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.148' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.149' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.150' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.151' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.152' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.153' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.154' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.155' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.156' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.157' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.158' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.159' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.160' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.161' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.162' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.163' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.164' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.165' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.166' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.167' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.168' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.169' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.170' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.171' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.172' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.173' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.174' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.175' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.176' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.177' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.178' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.179' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.180' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.181' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.182' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.183' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.184' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.185' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.186' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.187' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.188' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.189' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.190' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.191' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.192' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.193' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.194' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.195' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.196' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.197' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.198' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) didn't read expected 'file.199' dir
entry, pass 87
	./test6: (/mnt/rhel4-nfs/hp-dl585-01.test) Test failed with 115 errors
basic tests failed

Looking at the latest failure it was between:
----- start: Fri Jan 11 10:38:13 EST 2008 -----
and
----- end: Fri Jan 11 10:38:50 EST 2008 -----

Looking at the server log files during that time I see the following:
Jan 11 10:38:11 intel-d3c4102-01 mountd[4041]: authenticated unmount request
from ixen-02.rhts.boston.redhat.com:920 for /export/home (/export/home)
Jan 11 10:38:13 intel-d3c4102-01 mountd[4041]: authenticated mount request from
hp-olympia1.rhts.boston.redhat.com:637 for /export/home (/export/home)
Jan 11 10:38:17 intel-d3c4102-01 mountd[4041]: authenticated unmount request
from hp-matterhorn1.rhts.boston.redhat.com:787 for /export/home (/export/home)
Jan 11 10:38:20 intel-d3c4102-01 mountd[4041]: authenticated mount request from
intel-s6e5231-01.rhts.boston.redhat.com:1022 for /export/home (/export/home)
Jan 11 10:38:20 intel-d3c4102-01 mountd[4041]: authenticated mount request from
ixen-02.rhts.boston.redhat.com:600 for /export/home (/export/home)
Jan 11 10:38:26 intel-d3c4102-01 mountd[4041]: authenticated unmount request
from sun-x4600-01.rhts.boston.redhat.com:731 for /export/home (/export/home)
Jan 11 10:38:27 intel-d3c4102-01 mountd[4041]: authenticated mount request from
hp-matterhorn1.rhts.boston.redhat.com:894 for /export/home (/export/home)
Jan 11 10:38:31 intel-d3c4102-01 mountd[4041]: authenticated unmount request
from intel-s6e5231-01.rhts.boston.redhat.com:708 for /export/home (/export/home)
Jan 11 10:38:36 intel-d3c4102-01 mountd[4041]: authenticated mount request from
sun-x4600-01.rhts.boston.redhat.com:838 for /export/home (/export/home)
Jan 11 10:38:40 intel-d3c4102-01 mountd[4041]: authenticated mount request from
intel-s6e5231-01.rhts.boston.redhat.com:815 for /export/home (/export/home)
Jan 11 10:38:47 intel-d3c4102-01 mountd[4041]: authenticated unmount request
from dell-pesc430-02.rhts.boston.redhat.com:939 for /export/home (/export/home)
Jan 11 10:38:48 intel-d3c4102-01 mountd[4041]: authenticated unmount request
from hp-matterhorn1.rhts.boston.redhat.com:648 for /export/home (/export/home)
Jan 11 10:38:53 intel-d3c4102-01 mountd[4041]: authenticated unmount request
from ixen-02.rhts.boston.redhat.com:854 for /export/home (/export/home)
Jan 11 10:38:58 intel-d3c4102-01 mountd[4041]: authenticated mount request from
dell-pesc430-02.rhts.boston.redhat.com:632 for /export/home (/export/home)
Jan 11 10:38:58 intel-d3c4102-01 rpc.statd[3630]: Received erroneous SM_UNMON
request from 127.0.0.1 for 10.12.5.42
Jan 11 10:38:59 intel-d3c4102-01 mountd[4041]: authenticated mount request from
hp-matterhorn1.rhts.boston.redhat.com:756 for /export/home (/export/home)

The client ia64 that was running the test was 
hp-olympia1.rhts.boston.redhat.com has address 10.12.4.185
I did not see anything obvious that happened on the server during that moment in
time.

The client i386 that had this issue was hp-dl585-01.rhts.boston.redhat.com

Expected results:
This should pass without errors.

Additional info:
The first time it happened it was on a i386 kernel. The second time was on a
ia64 kernel.

Here is a link to a packet capture while the error happened.
http://rhts.lab.boston.redhat.com/testlogs/13197/45855/357857/rhel4-nfs_nfsvers=2_udp_base.cap.bz2

-- Additional comment from jlayton@redhat.com on 2008-01-15 10:03 EST --
I'll try to set up a test rig today and see if I can run this test in a loop and
reproduce this.


-- Additional comment from jlayton@redhat.com on 2008-01-15 14:37 EST --
Ran the test in a loop for several hours and never hit the problem once.

Looking at the capture is interesting though. If I understand the test
correctly, it creates 200 files in a directory, and then removes the files one
by one, doing a readdir in between each remove and checking that the directory
contents are what is expected.

The capture shows everything going along well until after file.86 is removed.
After that, we do a READDIR call. It looks like the first part of the READDIR
response is lost, however and we get a UDP packet that contains what looks like
the last part of the READDIR response (starting at ~file.145). Presumably, the
client just discards this.

After this are 2 retransmissions of the READDIR request, and finally a READDIR
reply, which just contains the first 70 bytes of the packet. It ends just after
the NFS status code in the header (which was NFS_OK). Wireshark says it's a
malformed packet. Judging from the results of the test though, it looks like
maybe the client accepted the packet and took it to mean that the directory was
totally empty.

Maybe the nfs2 XDR decoder is borked here? This packet should probably have been
rejected...


-- Additional comment from staubach@redhat.com on 2008-01-15 14:49 EST --
The NFSv2 code may not handle a borked response correctly, but the real
questions are, I think, why the server generated such a packet and what
happened to the rest of the first response or whether the server was
doing the right things but the client mangled these two packets.

This sure looks like either a lower level networking transport
implementation issue or an issue where the NFS server manipulated,
ie. mangled/freed/etc, the response data structures before the network
transport implementation was done with them.

-- Additional comment from jlayton@redhat.com on 2008-01-15 15:01 EST --
Indeed. I'm thinking we're looking at (at least) 2 bugs here. One on the client
and one on the server. I wonder if we could reproduce the server bug with pynfs
or something -- make it so that the first readdir response on a directory is
dropped and then see how the response looks after the request is retransmitted...


-- Additional comment from jlayton@redhat.com on 2008-01-16 09:34 EST --
After having a closer look, I think that Peter's theory about the server
generating a bogus reply is probably the most likely. The UDP wrapper on the
packet has the correct length field, so the packet is consistent from the
standpoint of the UDP layer. The NFS field is definitely truncated though. After
the status code, there's nothing.

So this does seem like it's probably a server bug. One that is likely triggered
by spotty network connectivity causing retransmits.

On the client side, the NFS readdir xdr (nfs_xdr_readdirres) decoder has this:

        if ((status = ntohl(*p++)))
                return -nfs_stat_to_errno(status);

        hdrlen = (u8 *) p - (u8 *) iov->iov_base;
        if (iov->iov_len < hdrlen) {
                printk(KERN_WARNING "NFS: READDIR reply header overflowed:"
                                "length %d > %Zu\n", hdrlen, iov->iov_len);
                return -errno_NFSERR_IO;
        } else if (iov->iov_len != hdrlen) {
...

after setting "status" p should be beyond the end of the packet, and I think we
would have fallen into the first case of the following if statement. Do we
happen to have dmesg on the server at the time this occured? It would be good to
confirm the theory since that KERN_WARNING message should have popped up.


-- Additional comment from jburke@redhat.com on 2008-01-16 10:30 EST --
Jeff,
   Just a point of observation. This has never been seen at all up until
recently. So far it has only been seen with a RHEL4 Server and RHEL5 Client.

   We have only just started testing RHEL5.2 kernels. It has never been seen
with a RHEL5.1-Z Client.

Jeff


-- Additional comment from jlayton@redhat.com on 2008-01-16 11:04 EST --
Sorry, my last post should have read:

"Do we happen to have logs from the client when this occurred?"

kernel logs, in particular...



-- Additional comment from dzickus@redhat.com on 2008-01-16 11:09 EST --
Below in the Bug Info section (and here) you will find the URL to the logs
http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=1539980

Scroll to the bottom to see the console logs

-- Additional comment from jlayton@redhat.com on 2008-01-16 13:09 EST --
Thanks. Hmm not quite what I was expecting to see, but similar:

     NFS: readdir reply truncated!

...so it did detect a problem with a readdir response at some point. Most
likely, the problem is with the response packet we got. Here's the relevant code
from nfs_xdr_readdirres:

 out:
        kunmap_atomic(kaddr, KM_USER0);
        return nr;
 short_pkt:
        entry[0] = entry[1] = 0;
        /* truncate listing ? */
        if (!nr) {
                printk(KERN_NOTICE "NFS: readdir reply truncated!\n");
                entry[1] = 1;
        }
        goto out;


...so nr was 0, which I think is the number of entries. I wonder if we should
also be setting this inside that if statement?

    nr = -errno_NFSERR_IO;

It seems like there's clearly a problem with the response. Returning zero from
this function doesn't seem to be considered an error and I think it's what makes
higher levels consider this to be a empty dir.


-- Additional comment from dzickus@redhat.com on 2008-01-22 10:33 EST --

Is this problem solved by DaveM's solution in bz 429417 comment #5?

-- Additional comment from jlayton@redhat.com on 2008-01-22 11:09 EST --
I think probably not (though if DaveM wants to chime in and tell me otherwise,
I'd be happy to listen).

As best I can tell in this case, the server sent us a bogus readdir response in
an RPC retransmission, and the client accepted it as a valid response. I think
this is at least 2 bugs, really:

A) the server probably has a bug that causes it to send these bogus responses.
Peter's theory was that the server might be prematurely releasing a buffer and
that makes it send a packet with a status code but no actual entries to the client.

I've been considering ways to reproduce this. We might be able to do it by
selectively filtering packets. Something like this:

1) client does a readdir request
2) server sends a readdir response. We block the response here.
3) client retransmits the request, we block the retransmissions for a bit,
hopefully giving the server to do whatever it does that's wrong here
4) unblock traffic -- hopefully the server will send along a truncated response

...we might even get lucky and force a crash if we turn on memory poisoning here
too.

B) client side bug, the capture shows the bogus readdir response. I think the
client accepted the response and interpreted it to mean that the directory was
empty (no entries at all, not even "." or "..").

The function that interprets this is nfs_xdr_readdirres(). The logic in this
function is a bit...obtuse. This particular packet seems to be a very special
case. We see this message in the client's log:

     NFS: readdir reply truncated!

So we do know is that we ended up in this section of code:

 out:
        kunmap_atomic(kaddr, KM_USER0);
        return nr;
 short_pkt: 
        entry[0] = entry[1] = 0;
        /* truncate listing ? */
        if (!nr) {
                printk(KERN_NOTICE "NFS: readdir reply truncated!\n");
                entry[1] = 1;
        }
        goto out;

...obviously, this function returned 0 here. My thinking is that we might should
consider a patch like this here:

diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c
index 2cd35f2..e72d37b 100644
--- a/fs/nfs/nfs2xdr.c
+++ b/fs/nfs/nfs2xdr.c
@@ -474,6 +474,7 @@ nfs_xdr_readdirres(struct rpc_rqst *req, u32 *p, void *dummy)
        if (!nr) {
                printk(KERN_NOTICE "NFS: readdir reply truncated!\n");
                entry[1] = 1;
+               nr = -errno_NFSERR_IO;
        }
        goto out;
 err_unmap:

...instead of returning 0 entries, we return -EIO and (hopefully) the request
would be retransmitted.

Peter, any thoughts on that patch? Or am I misinterpreting the readdir response
xdr code?


-- Additional comment from staubach@redhat.com on 2008-01-22 13:36 EST --
Well, I don't think that I would be quite so quick to dismiss the
possibility that it is the bug from 429417.  I don't know that it
is, but it sounds like perhaps it could be.

On the client side, it seems that the client is not being quite as
defensive as it could be.  The suggestion to return the EIO error
seems reasonable to me.

-- Additional comment from jlayton@redhat.com on 2008-01-22 14:21 EST --
I could be very wrong here, but here's why I think that's not it...

If I'm reading bug 429417 correctly, the problem is that the UDP layer is
occasionally getting SKB's that have inconsistent size fields. I don't think
that would affect the host's ability to collect captures, however. The packet
capture here seem to pretty clearly indicate that the server sent the client a
bad packet.

I suppose the problem could ultimately be a manifestation of that problem on the
server, but I'm not sure. I've not really looked at the server side of this yet...

Either way, we should probably go ahead and clone this bug for RHEL4 and start
treating the server side as a separate issue.
Comment 1 Jeff Layton 2008-02-02 09:49:28 EST

*** This bug has been marked as a duplicate of 430946 ***

Note You need to log in before you can comment on or make changes to this bug.