Bug 112766 - e.34 kernel breaks UDP NFS badly
Summary: e.34 kernel breaks UDP NFS badly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: All
OS: Linux
high
high
Target Milestone: ---
Assignee: Steve Dickson
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 107565
TreeView+ depends on / blocked
 
Reported: 2003-12-30 23:18 UTC by Joshua Jensen
Modified: 2007-11-30 22:06 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-02-03 20:01:53 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
tcpdump.bin.gz (5.75 KB, application/octet-stream)
2003-12-31 18:17 UTC, Joshua Jensen
no flags Details
tcpdump file from RHL7.3 box on same subnet as AS2.1 client machine (4.72 KB, application/octet-stream)
2004-01-02 22:22 UTC, Joshua Jensen
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2004:044 0 normal SHIPPED_LIVE Moderate: kernel security update 2004-02-03 05:00:00 UTC

Description Joshua Jensen 2003-12-30 23:18:39 UTC
Description of problem:

Same mount point, same file, very different I/O. rsize and wsize are
not specified at all:

mount option of nfsvers=3,tcp gives 1.04MB/s throughput
mount option of nfsvers=3,udp gives 121kB/s throughput
mount option of nfsvers=3     gives 120kB/s throughput

mount option of nfsvers=2,tcp gives 1.01MB/s throughput
mount option of nfsvers=2,udp gives 78kB/s throughput
mount option of nfsvers=2     gives 81kB/s throughput

NFS version 2 AND 3 are clearly broken in e.34 when using UDP.  Since
UDP is the default, THIS IS A SERIOUS BUG!

On a RHL7.3-based machine, running the 2.4.20-27.7 kernel, I get about
the same numbers as a "good" run of the above... same file, same
network... TCP and UDP give me about the same throughput:  1MB/s


Version-Release number of selected component (if applicable):

e.34 kernel on AS2.1


How reproducible:

Install AS2.1 Update3, with e34 kernel
Perform tests on a clean low traffic network
  
Actual results:

NFS IO differs by large amounts... TCP is MUCH better than UDP

Expected results:

NFS IO should differ by a much smaller ammount... TCP isn't that much
better than UDP on a clean network

Additional info:

No sysctl.conf tweaking or ipchains or iptables rules.  This terrible
I/O with NFS UDP is true for more than just one driver... tests above
are on eepro100 and e100 and tg3 drivers.  Test was run using rsync of
a large file on a NetApps NFS server.

Note that this badly breaks installations over NFS as well... 20
minute installs now take over 2 hours.

Comment 1 Alan Cox 2003-12-31 14:51:44 UTC
Three pieces of information would be useful

1. Does a straight file copy show the same problem
2. NFS numbers with rsize=1024,wsize=1024,udp - this will avoid ip
level fragmentation, drops through congestion on cards/hubs etc and
makes it easier to work out where the problem might be
3. A tcpdump of a _short_ section of a problem UDP transfer

You specify 3 different drivers, I assume same sort of numbers with each 


Comment 2 Joshua Jensen 2003-12-31 18:15:41 UTC
Alan,

1)  I'm using rsync instead of cp only because of --progress...
however  I took the time to find "pv" at
http://www.ivarch.com/programs/pv.shtml
"cat /mount/path/largefile | pv -pr > /dev/null" shows the same numbers
Also, NFS kickstarts are miserably slow.

2)  With "rsize=1024,wsize=1024,udp" I get 15-18kB/s with nfsvers 2 and 3

3)  tcpdump with above params and nfsvers 3 attached

Comment 3 Joshua Jensen 2003-12-31 18:17:24 UTC
Created attachment 96734 [details]
tcpdump.bin.gz

tcpdump of problem transfer... from the beginning to 439 packets into it

Comment 4 Frank Hirtz 2004-01-01 08:53:12 UTC
I was somewhat alarmed at this report and thus attempted to replicate.
I've not been successful in that I don't see the discrepancy mentioned :

First run against a RHEL 3 NFS server (RHEL 2.1 does not serve TCP):

[root@cube tmp]# uname -r
2.4.9-e.34enterprise
[root@cube tmp]# lsmod
Module                  Size  Used by    Not tainted
nfs                    98592   4  (autoclean)
iscsi                  40704   0  (unused)
nfsd                   80704   8  (autoclean)
lockd                  61184   1  (autoclean) [nfs nfsd]
sunrpc                 84464   1  (autoclean) [nfs nfsd lockd]
e100                   58308   1
appletalk              29676   0  (autoclean)
ipx                    25492   0  (autoclean)
iptable_filter          2912   0  (autoclean) (unused)
ip_tables              14688   1  [iptable_filter]
usb-uhci               26948   0  (unused)
usbcore                68896   1  [usb-uhci]
ext3                   70944   3
jbd                    55444   3  [ext3]
raid0                   4960   2
aic7xxx               127200   6
ips                    43840   0  (unused)
sd_mod                 13920   6
scsi_mod              126908   4  [iscsi aic7xxx ips sd_mod]
[root@cube tmp]# vim /etc/fstab
[root@cube tmp]# mount buildv2
[root@cube tmp]# mount buildv3
[root@cube tmp]# mount buildtcp
[root@cube tmp]# mount | grep build
192.168.24.170:/tmp on /tmp/buildv2 type nfs
(rw,vers=2,addr=192.168.24.170)
192.168.24.170:/tmp on /tmp/buildv3 type nfs
(rw,vers=3,addr=192.168.24.170)
192.168.24.170:/tmp on /tmp/buildtcp type nfs
(rw,vers=3,tcp,addr=192.168.24.170)
[root@cube tmp]# cat /proc/mounts | grep build
192.168.24.170:/tmp /tmp/buildv2 nfs
rw,v2,rsize=8192,wsize=8192,hard,udp,lock,addr=192.168.24.170 0 0
192.168.24.170:/tmp /tmp/buildv3 nfs
rw,v3,rsize=8192,wsize=8192,hard,udp,lock,addr=192.168.24.170 0 0
192.168.24.170:/tmp /tmp/buildtcp nfs
rw,v3,rsize=8192,wsize=8192,hard,tcp,lock,addr=192.168.24.170 0 0
[root@cube tmp]# dd if=/dev/zero of=/tmp/testfile1 bs=1M count=100
100+0 records in
100+0 records out
[root@cube tmp]# dd if=/dev/zero of=/tmp/testfile2 bs=1M count=100
100+0 records in
100+0 records out
[root@cube tmp]# dd if=/dev/zero of=/tmp/testfile3 bs=1M count=100
100+0 records in
100+0 records out
[root@cube tmp]# ls -lh testfile*
-rw-r--r--    1 root     root         100M Jan  1 02:23 testfile1
-rw-r--r--    1 root     root         100M Jan  1 02:25 testfile2
-rw-r--r--    1 root     root         100M Jan  1 02:28 testfile3
[root@cube tmp]# time cp testfile1 buildv2/

real    1m35.831s
user    0m0.120s
sys     0m2.390s
[root@cube tmp]# time cp testfile2 buildv3/

real    1m38.132s
user    0m0.110s
sys     0m2.900s
[root@cube tmp]# time cp testfile3 buildtcp/

real    1m42.955s
user    0m0.030s
sys     0m2.810s
[root@cube tmp]# rm -rf testfile*

<touch the testfiles on the remote server to invalidate the cached
attributes>

[root@cube tmp]# time cp buildv2/testfile1 .

real    1m41.538s
user    0m0.100s
sys     0m3.500s
[root@cube tmp]# time cp buildv3/testfile2 .

real    1m42.124s
user    0m0.120s
sys     0m3.520s
[root@cube tmp]# time cp buildtcp/testfile3 .

real    1m58.503s
user    0m0.110s
sys     0m3.400s

I also ran the tests against a netapp filer that we have internally
with similar results (server name changed):

[root@cube tmp]# mount homev2
[root@cube tmp]# mount homev3
[root@cube tmp]# mount hometcp
[root@cube tmp]# ls
buildtcp  buildv3      hometcp  homev3          MQSeries.18831 
screens    testfile2
buildv2   environment  homev2   MQSeries.18773  orbit-root     
testfile1  testfile3
[root@cube tmp]# mount | grep home
FOO.redhat.com:/vol/home2/support/fhirtz/test on /tmp/homev2 type nfs
(rw,vers=2,addr=172.16.52.114)
FOO.redhat.com:/vol/home2/support/fhirtz/test on /tmp/homev3 type nfs
(rw,vers=3,addr=172.16.52.114)
FOO.redhat.com:/vol/home2/support/fhirtz/test on /tmp/hometcp type nfs
(rw,tcp,vers=3,addr=172.16.52.114)
[root@cube tmp]# cat /proc/mounts | grep home
FOO.redhat.com:/vol/home2/support/fhirtz/test /tmp/homev2 nfs
rw,v2,rsize=8192,wsize=8192,hard,udp,lock,addr=vader.corp.redhat.com 0 0
FOO.redhat.com:/vol/home2/support/fhirtz/test /tmp/homev3 nfs
rw,v3,rsize=32768,wsize=32768,hard,udp,lock,addr=vader.corp.redhat.com 0 0
FOO.redhat.com:/vol/home2/support/fhirtz/test /tmp/hometcp nfs
rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,addr=vader.corp.redhat.com 0 0
[root@cube tmp]# time cp testfile1 homev2/

real    1m33.835s
user    0m0.120s
sys     0m2.400s
[root@cube tmp]# time cp testfile2 homev3/

real    1m30.642s
user    0m0.020s
sys     0m2.260s
[root@cube tmp]# time cp testfile3 hometcp/

real    1m35.579s
user    0m0.030s
sys     0m1.990s

[root@cube tmp]# time cp homev2/testfile1 .

real    1m39.772s
user    0m0.140s
sys     0m3.570s
[root@cube tmp]# time cp homev3/testfile2 .

real    3m38.785s
user    0m0.180s
sys     0m4.060s
[root@cube tmp]# time cp hometcp/testfile3 .

real    1m55.839s
user    0m0.170s
sys     0m2.820s

At this point, I was a bit concerned on the poor showing of v3 UDP
with default settings, so I set the r/wsize to a somewhat more stable
8k instead of the negotiated 32k. 32k requires quite the network for
good performance and apparently my link to the filer is not ideal for
this...

[root@cube tmp]# rm -f testfile*
[root@cube tmp]# umount home*
[root@cube tmp]# vim /etc/fstab
[root@cube tmp]# cat /proc/mounts | grep home
FOO.redhat.com:/vol/home2/support/fhirtz/test /tmp/homev3 nfs
rw,v3,rsize=8192,wsize=8192,hard,udp,lock,addr=vader.corp.redhat.com 0 0
[root@cube tmp]# mount homev3/
[root@cube tmp]# time cp homev3/testfile2 .

real    1m40.481s
user    0m0.120s
sys     0m3.460s

The adjustment to 8k appears to have cleared up the discrepancy here
(the RHEL3 system negotiated to 8k by default). 



Comment 5 Frank Hirtz 2004-01-02 16:19:38 UTC
The below is a test of the above "cat" test case on a filer:

[root@cube tmp]# uname -a
Linux cube 2.4.9-e.34enterprise #1 SMP Wed Dec 10 16:42:39 EST 2003
i686 unknown
[root@cube tmp]# lsmod
Module                  Size  Used by    Not tainted
nfs                    98592   7  (autoclean)
iscsi                  40704   0  (unused)
nfsd                   80704   8  (autoclean)
lockd                  61184   1  (autoclean) [nfs nfsd]
sunrpc                 84464   1  (autoclean) [nfs nfsd lockd]
e100                   58308   1 
appletalk              29676   0  (autoclean)
ipx                    25492   0  (autoclean)
iptable_filter          2912   0  (autoclean) (unused)
ip_tables              14688   1  [iptable_filter]
usb-uhci               26948   0  (unused)
usbcore                68896   1  [usb-uhci]
ext3                   70944   3 
jbd                    55444   3  [ext3]
raid0                   4960   2 
aic7xxx               127200   6 
ips                    43840   0  (unused)
sd_mod                 13920   6 
scsi_mod              126908   4  [iscsi aic7xxx ips sd_mod]
[root@cube tmp]# cat /proc/mounts | grep home
FOO.redhat.com:/vol/home2/support/fhirtz/test /tmp/hometcp nfs
rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,addr=FOO.redhat.com 0 0
FOO.redhat.com:/vol/home2/support/fhirtz/test /tmp/homev2 nfs
rw,v2,rsize=8192,wsize=8192,hard,udp,lock,addr=FOO.redhat.com 0 0
FOO.redhat.com:/vol/home2/support/fhirtz/test /tmp/homev3 nfs
rw,v3,rsize=8192,wsize=8192,hard,udp,lock,addr=FOO.redhat.com 0 0
[root@cube tmp]# ls -lh homev2/testfile1
-rw-r--r--    1 nfsnobod nfsnobod     100M Jan  1 03:15 homev2/testfile1
[root@cube tmp]# time cat homev2/testfile1  > testfile

real    1m41.176s
user    0m0.000s
sys     0m3.360s
[root@cube tmp]# time cat homev3/testfile1  > testfile

real    1m40.635s
user    0m0.030s
sys     0m3.560s
[root@cube tmp]# time cat hometcp/testfile1  > testfile

real    1m55.329s
user    0m0.020s
sys     0m2.710s

It would appear to not differ from the "cp" test case in that it does
not replicate the reported issue.


Comment 6 Eric Hagberg 2004-01-02 17:38:35 UTC
Have you attempted to replicate this against a NetApp nfs server, as
reported in the original request? It does seem to have the problem
described if you run the tests against a netapp, at least I was able
to see similar throughput reduction over udp.

Comment 7 Frank Hirtz 2004-01-02 17:57:37 UTC
The second instance in the above and the last entry were against a
netapp filer, though I'm not sure of the model. Which filer are you
seeing this against? 

Comment 8 Neil Horman 2004-01-02 18:59:16 UTC
I've repeated most of the tests Frank ran on the same equipment using
rsync rather than cp and have come up with the same results:
---------------------------------------------------------------------
100MB file transferred over NFS via rsync --progress to NFS server
(NFS write commands)

rsize=1024,wsize=1024
104857600 100%  450.78kB/s    0:03:47

rsize=2048,wsize=2048
104857600 100%  622.50kB/s    0:02:44

rsize=4096,wsize=4096
104857600 100%  865.20kB/s    0:01:58

rsize=8192,wsize=8192
104857600 100%  967.20kB/s    0:01:45

100MB file transferred over NFS via rsync --progress from NFS server
(NFS read commands)

rsize=8192,wsize=8192
104857600 100%    1.07MB/s    0:01:33

rsize=4096,wsize=4096
104857600 100%    1.01MB/s    0:01:39

rsize=2048,wsize=2048
104857600 100%  752.81kB/s    0:02:15

rsize=1024,wsize=1024
104857600 100%  458.76kB/s    0:03:43
----------------------------------------------------------------------

I don't currently have access to a netapp filer, so unfortunately I
cannot reproduce that portion of the environment,

1) The client and the server are across at least 1 network hop.  Is it
possible this is simply a problem with a intermediate device?  Perhaps
some QOS/rate limiting based on packet size?  If u.34 negotiates a
different write size than other kernels, this may explain the
performance difference.

2) There is a very uniform behavior in the attached tcpdump.  The
server responds to each NFS request in approximately 6/100ths of a
second (60 milliseconds).  A trace of the above tests shows a typical
response time of only a few milliseconds (for files cached in ram). 
Granted that we cannot guarantee files will be cached in ram, but
still, almost a 30x decrease in speed for every packet (rather than
just the first few until readahead on the filer caches the file) seems
excessive.  

I think if Joshua can take a trace of a non e.34 system accessing the
filer for comparison, that would be helpful.  Also, if he can attempt
to access the netapp device from an e.34 machine attached to the same
subnet as the filer, that would help to eliminate the network as a
culprit. 


Comment 9 Eric Hagberg 2004-01-02 19:03:07 UTC
using version 6.4 R1 D5 (as told to me by one of our storage folks)

Comment 10 Kevin Krafthefer 2004-01-02 19:18:05 UTC
For what it's worth, I recently saw nfsd die outright on an rhel3
machine that was being mounted from rhel3 clients via nfs over udp.
The clients reported a number of "nfs server not responding" prior to
the nfsd stoppage. The network traffic looks clean at the time leading
up the outage and the server itself didn't report anything worthwhile.
This seems hardware independent and can be reproduced by doing a find
from a client on a ~1TB nfs mount (even happens with the noatime option).

Comment 11 Joshua Jensen 2004-01-02 22:21:18 UTC
The netapps server I'm pointing at is running version 6.3.3

Attached is a tcpdump file from a RHL7.3 (with all errata) box on the
same subnet as the problematic AS2.1 client machine, with these mount
options: nfsvers=3,udp,rsize=1024,wsize=1024


Comment 12 Joshua Jensen 2004-01-02 22:22:15 UTC
Created attachment 96747 [details]
tcpdump file from RHL7.3 box on same subnet as AS2.1 client machine

Comment 13 Steve Dickson 2004-01-03 20:48:43 UTC
What does nfsstat -rc show a large number of retrans? 
Also does ifconfig -i show that the interface is drop packets?
Finally is there router or switch in the network that 
could be dropping packets? 


Comment 14 Joshua Jensen 2004-01-05 16:50:54 UTC
Hi Steve... in answer to your questions, ifconfig is showing 0 packets
dropped, however, nfsstat -cr is showing about 3000 calls, and 1200
retrans.  Those numbers are all with r/w size of 8192... r/w size of
1024 and 4096 shows all of 2 retrans.

This seems out of the ordinary, as my RHL7.3 box on the same subnet
shows something like 120 retrans for 8192 r/w size from the same
length of time.

So yes, there could be a router or something on the network that is
dropping some UDP packets, but only from the AS2.1 box... RHL7.3 isn't
having any problems.

Comment 15 Brian Long 2004-01-06 12:14:37 UTC
Please try to duplicate this over a 60+ms WAN link, like the one we
are using internally.  NFS/UDP/WAN has never been a reliable
connection whether it be Linux or Solaris.  This is why we set the
option "nfs.udp.xfersize" to 8192 on all our Netapp filers.  All our
Solaris clients default to TCP, but Linux and other clients default to
UDP.

What is the current plan of attack?  Is someone looking at the NFS
code changes between e.25 and e.34 to see what changed?  Do you plan
to issue an errata, maybe e.36, to fix this issue?

Comment 21 Brian Long 2004-01-19 17:15:38 UTC
PLEASE update this bug with status.

Comment 22 Brian Long 2004-01-21 12:38:13 UTC
In case it was not obvious, Joshua logged this bug on behalf of Cisco
Systems which has roughly 50 AS2.1 entitlements.  This bug is critical
for our environment and I have not seen a public response about a
planned fix.  Please update this bug with status today.  Thank you.

Comment 23 Frank Hirtz 2004-01-21 15:44:44 UTC
I'm not the operative engineer on this queue, but have been working
this issue on the behalf of some of my clients. Thus, I can not
comment on the technical disposition of this case but I can comment on
the proper process around customer issues. 

The short of it is that Bugzilla is not a customer support tool. It's
an engineering tool used to gather and catagorize issues. There is no
SLA implied or other with bugzilla logged issues. 
The 50 AS entitlements come with channels into the Red Hat support
organization where this issue can be handled by a support engineer
who's function is to work with the customer and engineering to make
sure that the issue gets resolved and that there is the appropriate
level of responsiveness. The support channels do indeed have service
response guidelines, and they would be the ideal place to pursue this
issue. The support channels can be found/accessed at the following URL:

http://www.redhat.com/support

Please feel free to log technical issues in the directly in the
bugzilla system, but important issues should be funneled through our
support organization.

Comment 24 Jason Baron 2004-01-29 16:06:54 UTC
This issue is resovled in the latest test kernel, which is queued for
immediate release. in the meantime, if anybody wants to test out the
fix the 'unofficial' kernel is at:

http://people.redhat.com/jbaron/.private/stable/2.4.9-e.37/

Comment 25 Mark J. Cox 2004-02-03 20:01:54 UTC
An errata has been issued which should help the problem described in this bug report. 
This report is therefore being closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, please follow the link below. You may reopen 
this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-044.html



Note You need to log in before you can comment on or make changes to this bug.