Description of problem:
Same mount point, same file, very different I/O. rsize and wsize are not specified at all:

mount option of nfsvers=3,tcp gives 1.04MB/s throughput
mount option of nfsvers=3,udp gives 121kB/s throughput
mount option of nfsvers=3     gives 120kB/s throughput
mount option of nfsvers=2,tcp gives 1.01MB/s throughput
mount option of nfsvers=2,udp gives 78kB/s throughput
mount option of nfsvers=2     gives 81kB/s throughput

NFS versions 2 AND 3 are clearly broken in e.34 when using UDP. Since UDP is the default, THIS IS A SERIOUS BUG!

On a RHL7.3-based machine running the 2.4.20-27.7 kernel, I get about the same numbers as a "good" run of the above... same file, same network... TCP and UDP give me about the same throughput: 1MB/s.

Version-Release number of selected component (if applicable):
e.34 kernel on AS2.1

How reproducible:
Install AS2.1 Update 3 with the e.34 kernel. Perform the tests on a clean, low-traffic network.

Actual results:
NFS I/O differs by large amounts... TCP is MUCH better than UDP.

Expected results:
NFS I/O should differ by a much smaller amount... TCP isn't that much better than UDP on a clean network.

Additional info:
No sysctl.conf tweaking, no ipchains or iptables rules. This terrible I/O with NFS over UDP holds for more than just one driver... the tests above were run with the eepro100, e100, and tg3 drivers. The test was run using rsync of a large file from a NetApp NFS server. Note that this badly breaks installations over NFS as well... 20-minute installs now take over 2 hours.
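For reference, each throughput figure above came from a sequence along these lines (the filer name, export path, and file name here are placeholders, not the actual ones used):

  mount -t nfs -o nfsvers=3,udp filer:/vol/export /mnt/test
  rsync --progress /mnt/test/largefile /tmp/
  umount /mnt/test

Only the nfsvers/proto mount options were varied between runs; the throughput numbers quoted are from rsync's --progress output.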
Three pieces of information would be useful:

1. Does a straight file copy show the same problem?

2. NFS numbers with rsize=1024,wsize=1024,udp - this avoids IP-level fragmentation, drops due to congestion on cards/hubs, etc., and makes it easier to work out where the problem might be.

3. A tcpdump of a _short_ section of a problem UDP transfer.

You specify 3 different drivers; I assume you see the same sort of numbers with each?
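For reference, a minimal way to collect (2) and (3) might look something like this (server name, export, and mount point are placeholders):

  mount -t nfs -o nfsvers=3,udp,rsize=1024,wsize=1024 filer:/vol/export /mnt/test
  tcpdump -s 0 -c 500 -w nfs-udp-1k.bin host filer and port 2049 &
  cp /mnt/test/largefile /tmp/

tcpdump's -c flag ends the capture after a fixed number of packets, which keeps the attachment down to a short section of the transfer.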
Alan,

1) I'm using rsync instead of cp only because of --progress... however, I took the time to find "pv" at http://www.ivarch.com/programs/pv.shtml

   "cat /mount/path/largefile | pv -pr > /dev/null"

   shows the same numbers. Also, NFS kickstarts are miserably slow.

2) With "rsize=1024,wsize=1024,udp" I get 15-18kB/s with nfsvers 2 and 3.

3) A tcpdump with the above params and nfsvers 3 is attached.
Created attachment 96734 [details] tcpdump.bin.gz tcpdump of problem transfer... from the beginning to 439 packets into it
I was somewhat alarmed at this report and thus attempted to replicate. I've not been successful in that I don't see the discrepancy mentioned:

First run against a RHEL 3 NFS server (RHEL 2.1 does not serve TCP):

[root@cube tmp]# uname -r
2.4.9-e.34enterprise
[root@cube tmp]# lsmod
Module                  Size  Used by    Not tainted
nfs                    98592   4  (autoclean)
iscsi                  40704   0  (unused)
nfsd                   80704   8  (autoclean)
lockd                  61184   1  (autoclean) [nfs nfsd]
sunrpc                 84464   1  (autoclean) [nfs nfsd lockd]
e100                   58308   1
appletalk              29676   0  (autoclean)
ipx                    25492   0  (autoclean)
iptable_filter          2912   0  (autoclean) (unused)
ip_tables              14688   1  [iptable_filter]
usb-uhci               26948   0  (unused)
usbcore                68896   1  [usb-uhci]
ext3                   70944   3
jbd                    55444   3  [ext3]
raid0                   4960   2
aic7xxx               127200   6
ips                    43840   0  (unused)
sd_mod                 13920   6
scsi_mod              126908   4  [iscsi aic7xxx ips sd_mod]
[root@cube tmp]# vim /etc/fstab
[root@cube tmp]# mount buildv2
[root@cube tmp]# mount buildv3
[root@cube tmp]# mount buildtcp
[root@cube tmp]# mount | grep build
192.168.24.170:/tmp on /tmp/buildv2 type nfs (rw,vers=2,addr=192.168.24.170)
192.168.24.170:/tmp on /tmp/buildv3 type nfs (rw,vers=3,addr=192.168.24.170)
192.168.24.170:/tmp on /tmp/buildtcp type nfs (rw,vers=3,tcp,addr=192.168.24.170)
[root@cube tmp]# cat /proc/mounts | grep build
192.168.24.170:/tmp /tmp/buildv2 nfs rw,v2,rsize=8192,wsize=8192,hard,udp,lock,addr=192.168.24.170 0 0
192.168.24.170:/tmp /tmp/buildv3 nfs rw,v3,rsize=8192,wsize=8192,hard,udp,lock,addr=192.168.24.170 0 0
192.168.24.170:/tmp /tmp/buildtcp nfs rw,v3,rsize=8192,wsize=8192,hard,tcp,lock,addr=192.168.24.170 0 0
[root@cube tmp]# dd if=/dev/zero of=/tmp/testfile1 bs=1M count=100
100+0 records in
100+0 records out
[root@cube tmp]# dd if=/dev/zero of=/tmp/testfile2 bs=1M count=100
100+0 records in
100+0 records out
[root@cube tmp]# dd if=/dev/zero of=/tmp/testfile3 bs=1M count=100
100+0 records in
100+0 records out
[root@cube tmp]# ls -lh testfile*
-rw-r--r--    1 root     root         100M Jan  1 02:23 testfile1
-rw-r--r--    1 root     root         100M Jan  1 02:25 testfile2
-rw-r--r--    1 root     root         100M Jan  1 02:28 testfile3
[root@cube tmp]# time cp testfile1 buildv2/

real    1m35.831s
user    0m0.120s
sys     0m2.390s
[root@cube tmp]# time cp testfile2 buildv3/

real    1m38.132s
user    0m0.110s
sys     0m2.900s
[root@cube tmp]# time cp testfile3 buildtcp/

real    1m42.955s
user    0m0.030s
sys     0m2.810s
[root@cube tmp]# rm -rf testfile*

<touch the testfiles on the remote server to invalidate the cached attributes>

[root@cube tmp]# time cp buildv2/testfile1 .

real    1m41.538s
user    0m0.100s
sys     0m3.500s
[root@cube tmp]# time cp buildv3/testfile2 .

real    1m42.124s
user    0m0.120s
sys     0m3.520s
[root@cube tmp]# time cp buildtcp/testfile3 .
real    1m58.503s
user    0m0.110s
sys     0m3.400s

I also ran the tests against a netapp filer that we have internally, with similar results (server name changed):

[root@cube tmp]# mount homev2
[root@cube tmp]# mount homev3
[root@cube tmp]# mount hometcp
[root@cube tmp]# ls
buildtcp  buildv3      hometcp  homev3          MQSeries.18831  screens    testfile2
buildv2   environment  homev2   MQSeries.18773  orbit-root      testfile1  testfile3
[root@cube tmp]# mount | grep home
FOO.redhat.com:/vol/home2/support/fhirtz/test on /tmp/homev2 type nfs (rw,vers=2,addr=172.16.52.114)
FOO.redhat.com:/vol/home2/support/fhirtz/test on /tmp/homev3 type nfs (rw,vers=3,addr=172.16.52.114)
FOO.redhat.com:/vol/home2/support/fhirtz/test on /tmp/hometcp type nfs (rw,tcp,vers=3,addr=172.16.52.114)
[root@cube tmp]# cat /proc/mounts | grep home
FOO.redhat.com:/vol/home2/support/fhirtz/test /tmp/homev2 nfs rw,v2,rsize=8192,wsize=8192,hard,udp,lock,addr=vader.corp.redhat.com 0 0
FOO.redhat.com:/vol/home2/support/fhirtz/test /tmp/homev3 nfs rw,v3,rsize=32768,wsize=32768,hard,udp,lock,addr=vader.corp.redhat.com 0 0
FOO.redhat.com:/vol/home2/support/fhirtz/test /tmp/hometcp nfs rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,addr=vader.corp.redhat.com 0 0
[root@cube tmp]# time cp testfile1 homev2/

real    1m33.835s
user    0m0.120s
sys     0m2.400s
[root@cube tmp]# time cp testfile2 homev3/

real    1m30.642s
user    0m0.020s
sys     0m2.260s
[root@cube tmp]# time cp testfile3 hometcp/

real    1m35.579s
user    0m0.030s
sys     0m1.990s
[root@cube tmp]# time cp homev2/testfile1 .

real    1m39.772s
user    0m0.140s
sys     0m3.570s
[root@cube tmp]# time cp homev3/testfile2 .

real    3m38.785s
user    0m0.180s
sys     0m4.060s
[root@cube tmp]# time cp hometcp/testfile3 .

real    1m55.839s
user    0m0.170s
sys     0m2.820s

At this point, I was a bit concerned about the poor showing of v3 UDP with default settings, so I set the r/wsize to a somewhat more stable 8k instead of the negotiated 32k. 32k requires quite the network for good performance, and apparently my link to the filer is not ideal for this...

[root@cube tmp]# rm -f testfile*
[root@cube tmp]# umount home*
[root@cube tmp]# vim /etc/fstab
[root@cube tmp]# cat /proc/mounts | grep home
FOO.redhat.com:/vol/home2/support/fhirtz/test /tmp/homev3 nfs rw,v3,rsize=8192,wsize=8192,hard,udp,lock,addr=vader.corp.redhat.com 0 0
[root@cube tmp]# mount homev3/
[root@cube tmp]# time cp homev3/testfile2 .

real    1m40.481s
user    0m0.120s
sys     0m3.460s

The adjustment to 8k appears to have cleared up the discrepancy here (the RHEL3 system negotiated to 8k by default).
The below is a test of the above "cat" test case on a filer:

[root@cube tmp]# uname -a
Linux cube 2.4.9-e.34enterprise #1 SMP Wed Dec 10 16:42:39 EST 2003 i686 unknown
[root@cube tmp]# lsmod
Module                  Size  Used by    Not tainted
nfs                    98592   7  (autoclean)
iscsi                  40704   0  (unused)
nfsd                   80704   8  (autoclean)
lockd                  61184   1  (autoclean) [nfs nfsd]
sunrpc                 84464   1  (autoclean) [nfs nfsd lockd]
e100                   58308   1
appletalk              29676   0  (autoclean)
ipx                    25492   0  (autoclean)
iptable_filter          2912   0  (autoclean) (unused)
ip_tables              14688   1  [iptable_filter]
usb-uhci               26948   0  (unused)
usbcore                68896   1  [usb-uhci]
ext3                   70944   3
jbd                    55444   3  [ext3]
raid0                   4960   2
aic7xxx               127200   6
ips                    43840   0  (unused)
sd_mod                 13920   6
scsi_mod              126908   4  [iscsi aic7xxx ips sd_mod]
[root@cube tmp]# cat /proc/mounts | grep home
FOO.redhat.com:/vol/home2/support/fhirtz/test /tmp/hometcp nfs rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,addr=FOO.redhat.com 0 0
FOO.redhat.com:/vol/home2/support/fhirtz/test /tmp/homev2 nfs rw,v2,rsize=8192,wsize=8192,hard,udp,lock,addr=FOO.redhat.com 0 0
FOO.redhat.com:/vol/home2/support/fhirtz/test /tmp/homev3 nfs rw,v3,rsize=8192,wsize=8192,hard,udp,lock,addr=FOO.redhat.com 0 0
[root@cube tmp]# ls -lh homev2/testfile1
-rw-r--r--    1 nfsnobod nfsnobod     100M Jan  1 03:15 homev2/testfile1
[root@cube tmp]# time cat homev2/testfile1 > testfile

real    1m41.176s
user    0m0.000s
sys     0m3.360s
[root@cube tmp]# time cat homev3/testfile1 > testfile

real    1m40.635s
user    0m0.030s
sys     0m3.560s
[root@cube tmp]# time cat hometcp/testfile1 > testfile

real    1m55.329s
user    0m0.020s
sys     0m2.710s

It would appear not to differ from the "cp" test case, in that it does not replicate the reported issue.
Have you attempted to replicate this against a NetApp NFS server, as reported in the original request? It does seem to show the problem described if you run the tests against a NetApp; at least I was able to see a similar throughput reduction over UDP.
The second instance in the above and the last entry were against a netapp filer, though I'm not sure of the model. Which filer are you seeing this against?
I've repeated most of the tests Frank ran on the same equipment, using rsync rather than cp, and have come up with the same results:

---------------------------------------------------------------------
100MB file transferred over NFS via rsync --progress to the NFS server
(NFS write commands)

rsize=1024,wsize=1024   104857600 100%  450.78kB/s  0:03:47
rsize=2048,wsize=2048   104857600 100%  622.50kB/s  0:02:44
rsize=4096,wsize=4096   104857600 100%  865.20kB/s  0:01:58
rsize=8192,wsize=8192   104857600 100%  967.20kB/s  0:01:45

100MB file transferred over NFS via rsync --progress from the NFS server
(NFS read commands)

rsize=8192,wsize=8192   104857600 100%    1.07MB/s  0:01:33
rsize=4096,wsize=4096   104857600 100%    1.01MB/s  0:01:39
rsize=2048,wsize=2048   104857600 100%  752.81kB/s  0:02:15
rsize=1024,wsize=1024   104857600 100%  458.76kB/s  0:03:43
----------------------------------------------------------------------

I don't currently have access to a netapp filer, so unfortunately I cannot reproduce that portion of the environment.

1) The client and the server are at least one network hop apart. Is it possible this is simply a problem with an intermediate device? Perhaps some QOS/rate limiting based on packet size? If e.34 negotiates a different write size than other kernels, this may explain the performance difference.

2) There is very uniform behavior in the attached tcpdump. The server responds to each NFS request in approximately 6/100ths of a second (60 milliseconds). A trace of the above tests shows a typical response time of only a few milliseconds (for files cached in RAM). Granted, we cannot guarantee files will be cached in RAM, but still, an almost 30x slowdown on every request (rather than just the first few, until readahead on the filer caches the file) seems excessive.

I think if Joshua can take a trace of a non-e.34 system accessing the filer for comparison, that would be helpful. Also, if he can access the netapp device from an e.34 machine attached to the same subnet as the filer, that would help to eliminate the network as a culprit.
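To put numbers on the per-request latency mentioned in (2), something as simple as the following against both captures is enough to eyeball the gap between each request and its reply (the capture file name is a placeholder):

  tcpdump -tt -r trace.bin -n port 2049 | head -50

With -tt the timestamps print as raw seconds, so the delta between an NFS call from the client and the following reply from the filer is the per-call response time.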
using version 6.4 R1 D5 (as told to me by one of our storage folks)
For what it's worth, I recently saw nfsd die outright on an RHEL3 machine that was being mounted from RHEL3 clients via NFS over UDP. The clients reported a number of "nfs server not responding" messages prior to the nfsd stoppage. The network traffic looks clean in the time leading up to the outage, and the server itself didn't report anything worthwhile. This seems hardware-independent and can be reproduced by doing a find from a client on a ~1TB NFS mount (it even happens with the noatime option).
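The reproducer is essentially just a full walk of the export from a client, along the lines of (the mount point is a placeholder):

  find /mnt/export > /dev/null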
The NetApp server I'm pointing at is running version 6.3.3.

Attached is a tcpdump file from a RHL7.3 (with all errata) box on the same subnet as the problematic AS2.1 client machine, with these mount options:

nfsvers=3,udp,rsize=1024,wsize=1024
Created attachment 96747 [details] tcpdump file from RHL7.3 box on same subnet as AS2.1 client machine
Does nfsstat -rc show a large number of retrans? Also, does ifconfig show that the interface is dropping packets? Finally, is there a router or switch in the network that could be dropping packets?
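For reference, the counters in question can be pulled with something like the following (the interface name is a placeholder):

  nfsstat -rc
  ifconfig eth0
  netstat -s | grep -i -E 'fragment|reassembl'

The last command shows IP fragment statistics, which is where drops of large (fragmented) UDP read/write RPCs tend to show up.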
Hi Steve... in answer to your questions: ifconfig is showing 0 packets dropped; however, nfsstat -cr is showing about 3000 calls and 1200 retrans. Those numbers are with an r/w size of 8192... r/w sizes of 1024 and 4096 show all of 2 retrans. This seems out of the ordinary, as my RHL7.3 box on the same subnet shows something like 120 retrans for an 8192 r/w size over the same length of time. So yes, there could be a router or something on the network that is dropping some UDP packets, but only from the AS2.1 box... RHL7.3 isn't having any problems.
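Given those retrans numbers, an interim client-side workaround would be to pin the mounts to TCP, or to a smaller UDP transfer size, explicitly in /etc/fstab, e.g. (server and export are placeholders):

  filer:/vol/export  /mnt/data  nfs  nfsvers=3,tcp,rsize=8192,wsize=8192  0 0

This sidesteps the problematic UDP path until a fixed kernel is available.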
Please try to duplicate this over a 60+ms WAN link, like the one we are using internally. NFS over UDP across a WAN has never been a reliable combination, whether the client is Linux or Solaris. This is why we set the option "nfs.udp.xfersize" to 8192 on all our NetApp filers. All our Solaris clients default to TCP, but Linux and other clients default to UDP.

What is the current plan of attack? Is someone looking at the NFS code changes between e.25 and e.34 to see what changed? Do you plan to issue an errata kernel, maybe e.36, to fix this issue?
PLEASE update this bug with status.
In case it was not obvious, Joshua logged this bug on behalf of Cisco Systems which has roughly 50 AS2.1 entitlements. This bug is critical for our environment and I have not seen a public response about a planned fix. Please update this bug with status today. Thank you.
I'm not the operative engineer on this queue, but have been working this issue on behalf of some of my clients. Thus, I cannot comment on the technical disposition of this case, but I can comment on the proper process around customer issues. The short of it is that Bugzilla is not a customer support tool. It's an engineering tool used to gather and categorize issues. There is no SLA, implied or otherwise, on Bugzilla-logged issues. The 50 AS entitlements come with channels into the Red Hat support organization, where this issue can be handled by a support engineer whose function is to work with the customer and engineering to make sure that the issue gets resolved and that there is the appropriate level of responsiveness. The support channels do have service response guidelines, and they would be the ideal place to pursue this issue. The support channels can be found/accessed at the following URL:

http://www.redhat.com/support

Please feel free to log technical issues directly in the Bugzilla system, but important issues should be funneled through our support organization.
This issue is resolved in the latest test kernel, which is queued for immediate release. In the meantime, if anybody wants to test out the fix, the 'unofficial' kernel is at:

http://people.redhat.com/jbaron/.private/stable/2.4.9-e.37/
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2004-044.html