Description of Problem: Redhat 7.3: nfs writes very slow. Version-Release number of selected component (if applicable): Kernel version 2.4.18-4 How Reproducible: Always. Steps to Reproduce: 1. Upgrade Dell Dimension 4100 800Mhz PIII with 3COM 3C905-C-TXM 10/100 ethernet card from Redhat 7.1 to 7.3. Mount user files via nfs from a Network Appliance file server. 2. Try to run vi or mail on a 1.7 mb mail file. Delete some messages, and try to write the file out. 3. After 5 or 6 minutes, the file will be written out. During this time, other commands on the machine such as ps aux, dmesg, and top will hang. After it writes file out, wakes back up, dmesg and /var/log/messages will show repeated messages of form nfs: server sinagua not responding, still trying nfs: server sinagua OK Actual Results: 5 or 6 minutes to write out 1.7 mb file to nfs file server. Expected Results: 2 or 3 seconds. Additional Information: I've observed this on the first 2 machines that i've put 7.3 on - Both are Dell Dimension 4100 800Mhz PIII 3COM 3C905-C-TXM 10/100 ethernet I tried downloading, building and using kernel version 2.5.15 from kernel.org, (the latest development version), and i don't seem to have this problem with that kernel. However, i have some 200 machines to upgrade to 7.3, and i dont know if i want to put 2.5.15 on all of them. Let me know if there is additional information that i can provide. Phil Kaslo Dept. of Computer Science 520-621/2760 University of Arizona phil.edu Tucson, Ariz. 85721
I think I ran into this myself, except that the result was much more disastrous. Can you check on this info on your box, and see if your case matches? I installed (fresh install, not upgrade) 7.3 on a Dell Optiplex GX-150, which is similar to your system. Notably, it has the same Ethernet card: 3C905C (Vortex Boomerang.) On previous RH versions (i.e. 7.1 and 7.2), NFS worked just fine. However, it fails on 7.3, with the following symptoms: - works fine for short reads - longer reads (or writes) take a very very long time to complete Unfortunately, eventually my network support people shut off my ethernet port, because my box was basically DOS flooding the NFS server! After investigation, it turns out that somehow my 3c905C is experiencing A LOT of collisions, on a switched network. The collisions are causing dropped TCP/IP packets. With TCP, this just reduces throughput, but on UDP (which NFS uses,) it generated so many retries and resends between my box and the server that I was flooding the NFS server. (The server was complaining of "dropped IP fragments".) I have noticed that the NFS issue is a red herring, in my case: it's a symptom of this ethernet collision business (I think.) The real question is what's up with ethernet. This is the same hardware, and as near as I can tell it's the same 3c59x.c driver version, so I'm not sure what changed. Also, I have used 2.4.18 on 7.2, so I don't know what it is about RH's 2.4.18-4 kernel in 7.3 that is causing this. I tried creating a slimmed down 2.4.18-4custom kernel with all multicast and other advanced TCP/IP options disabled, with no effect. My network analyst also had me try forcing the hardware into 10baseT/half duplex mode, just in case the collisions were the result of autonegotiation failures. No effect there, either. The odd thing is that this is a switched network, meaning that I should never see ethernet collisions on my interface, yet ifconfig reports exactly that. Also, the network guy says that he is seeing FAR more collisions on his end than I see on my end. So, that's my brain dump. Phil: can you poke around and see if this is the same problem you are having? (i.e. look for collisions via ifconfig.) If it's not, I need to enter another bug. :)
Oh, forgot to mention another couple details. I did a "modprobe 3c59x debug=7; ifup eth0" to see what the driver had to say. I got lots of messages like this: eth0: vortex_error(): status 0xe081 eth0: vortex_error(): status 0xe481 (I think -- I'm reciting from memory. But the errors are 0xe081 and 0xe481.) One error seemed to correspond with each collision recorded by the device. Also, the ratio of 0xe081 to 0xe481 errors is around 5:1. - Dan
Yea - the networking people here have also encouraged me to not do this (run Redhat 7.3 with the 2.4.18 kernel, as an nfs client, on our network.). Doing the nfs writes of a 2 - 3 mb mail file appears to kill the response on our main nfs server (a Network Appliance) for other users as well as myself. (Before today i was just noticing it on the box i was running 7.3 on, but today it appears that this is affecting other people.) ifconfig eth0 on the client gives me this: osp 80 ifconfig eth0 eth0 Link encap:Ethernet HWaddr 00:01:03:24:6B:91 inet addr:192.12.69.241 Bcast:192.12.69.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:30970 errors:0 dropped:0 overruns:0 frame:0 TX packets:2556692 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:100 RX bytes:13536614 (12.9 Mb) TX bytes:3673708034 (3503.5 Mb) Interrupt:3 Base address:0xdc00 The machine is on a switched network port (on a Cisco catalyst 5000). We ran a capture on that switched port using an Etherpeek sniffer for several minutes. It shows a lot of fragment errors between my machine and the file server: The message "An IP datagram has been fragmented by the host application or a router, and one of the fragments is missing". 10,555 errors on 11327 packets comprising 16,050,566 bytes. (The host and the file server are on the same network (192.12.69.0), so traffic is not going thru a router.) So it sounds like client NFS is broken in 7.3. The Cisco switch is not showing errors on the port my machine is connected to, but there are collisions on the port that the file server is connected to. Thanks for your help. Let me know if there is more information that i can provide. Phil Kaslo
I did run " modprobe 3c59x debug=7 ; ifup eth0 " on the client. /var/log/messages does give me a lot of output of the form May 15 16:18:16 osprey kernel: eth0: vortex_error(), status=0xe481 May 15 16:18:24 osprey kernel: eth0: vortex_error(), status=0xe081 May 15 16:18:43 osprey kernel: eth0: vortex_error(), status=0xe481 May 15 16:19:16 osprey last message repeated 2 times Phil
Just a quick me too. Only I wish you'd set the priority much higher (to whatever the highest level is). I'm 99% sure it's an NFS issue. I have a RH7.3 client with a RH7.2 server and a Solaris 8 server. If I untar a 1.5M file from the Solaris 8 server onto the RH7.3 client, I get ~20,000 client rpc retransmits (according to nfsstat). If I untar the same file from the RH7.2 server onto the RH7.3 client, I get an average of 0.5 client rpc retransmits. Interestingly enough, if I untar the 1.5M file from the Solaris 8 server onto a RH7.2 client, I get ~100 retransmits (versus an average of 0.5 if going from the RH7.2 server to the RH7.2 client). If there some setting we should be using?
This looks much more like the 3c59x driver is broken for some cards than NFS problems
Sorry, but I'm not using 3COM cards. The RH7.3 client is using an Intel Ethernet Pro 100 (according to /proc/pci). The RH7.2 server is using an Intel 82544EI Gigabit Ethernet Controller. NFS works fine between the RH7.3 client and RH7.2 server, but has serious problems between the RH7.3 client and the Solaris 8 server. All previous version of RH (from 6.2-7.2) have basically no problems with the Solaris 8 server (I get some retransmits, but no where near the amount that I get with RH 7.3). ...dave
What rsize/wsize are you using? Does the problem go away if you set rsize=1024 and wsize=1024 on the client? Is there any packet loss when pinging the solaris server with ping -s 8300 or ping -s 4300? This sounds like two seperate bugs and should probably be entered seperately -- it's easier to mark bugs as duplicate than to deal with two threads of information in a single bug. For the person with the 3com, double check the duplex of the link and try the above. You may need to force full duplex on 3com driver with full_duplex=1 and try again.
I tried putting '-rsize=1024,-wsize=1024' on the lines for autofs in /etc/auto.master, and 'options 3c59x full_duplex=1' in /etc/modules.conf, and rebooting. The cisco switch indicates the port that it is connected to is at auto-full duplex, and auto-100 speed. I again tried a copy of a 2 mb mail file, and edit of it using mail, d, and q. It still takes about 80 seconds to write it out, during which commands hang, and after, dmesg reports nfs: server sinagua not responding, still trying nfs: server sinagua ok' I can't keep running these tests in a production environment here, because of the effect it has on our main nfs file server, and on other users. Phil
What about the results of the pings? How much packet loss is being seen?
When i start a ping sinagua (a Network Appliance running NetApp Release 6.1.l), then run mail, d, q, as above, i get ping -s 8300 sinagua ... --- sinagua.cs.arizona.edu ping statistics --- 22 packets transmitted, 19 received, 13% loss, time 21193ms rtt min/avg/max/mdev = 2.279/2.452/3.832/0.379 ms ping -s 4300 sinagua ... --- sinagua.cs.arizona.edu ping statistics --- 24 packets transmitted, 21 received, 12% loss, time 23198ms rtt min/avg/max/mdev = 1.497/1.654/4.019/0.531 ms Phil
Ok, the [rw]size=1024 fixed my problem. I tried various sizes. It looks like RH changed the default from (I believe) 4096 to 32768??? I have no trouble up to 8192. I start to see retransmits climb at 16384, but everything is still usable. When I tried 32768 I see 10's of thousands of retransmits when untar'ing a 1.5M file. So, did RH up the default to 32768? BTW, I have 0 packet loss (on over a 1000 iterations) when I'm not trying to access the NFS server (I didn't try it when I do access the NFS server). ...dave
When i try to ping , ping -s 8300, ping -s 4300 sinagua (the netapp) when not doing that test (mail, d, q), on the client, i get 0% packet loss. Phil
FYI, the NFS server I referred to in my case is also a Network Appliance, and I see the same symptoms. Not sure what my rsize and wsize are offhand. Also, Alan: as near as I can tell (according to version) the 3c59x driver is the same as it was in 7.2, and it's the same hardware. Not sure whether the collisions are actually relevant or not, but I would guess that ethernet has to be playing a role somewhere or Phil and I would not be getting the same errors from 3c59x. Also, I'm on 10bT/half whereas Phil's on 100bT/full, which may explain why I see actual collisions and he doesn't. I may attempt to build a stock 2.4.18 (which worked on 7.2) on my box tomorrow and see if anything changes. Like Phil, though, I can't test this much, since I'm in a production environment. Thanks, guys... - Dan
I'd just like to add one more "me too" and share my experiences. I originally suspected the 3c59x driver included with RH 7.3, so I built the driver from scyld.com, as well as compared the driver that was included with RH 7.2, and got the same results. I then reasoned that my card (3c905B) might have gone bad and replaced it with an Intel EtherExpress Pro card. So I've experienced this with both NICs. Also, I've tried tuning the NFS options (I had previously used the default settings, which always worked find) by setting rsize and wsize to 8192, and increasing the timeo value. The increase of the timeo value (I've now got it set to 20, which is 2 seconds) has done the most to make my performance acceptable, but I can still trigger the problem by using NFS to write or read a large file. Also, I am using my RH 7.3 client to access a Solaris 8 NFS server. Today, I built 2.4.19-pre8 from kernel.org, and I'm still seeing some retransmissions, but nowhere near as many as I was with the RedHat-provided 2.4.18-4 kernel. I'm also going to build 2.4.18 stock and see what results I get. I'll post another update if 2.4.18 behaves any differently than 2.4.19-pre8, but for now, I'm thinking that something is amiss with 2.4.18-4.
Just wondering if this problem is related to that in bug # 64921, NFS version 3 hangs?
I think bug # 64921 is related. I tried the suggestion in that thread -- setting localoptions='nfsvers=2' in /etc/init.d/autofs. That appears to fix the problem for me. Thanks, guys. Phil
I know this is pretty much a pain in the ass but it might be worth the time and pain to patch in a good number of neil's and trond's server and client patches for 2.4.18 - they are all linked from nfs.sourceforge.net. Neil and Trond can both give a good set of recommendations on which patches to use and these problems only appear to be reported on red hat's kernels not stock ones patched with neil's patches.
As a follow-up, after working with my network people some more, it looks like all that stuff I spouted about collisions and stuff is nonsense, and not related to this. (i.e. I downgraded to RH 7.2 and checked and the collisions are still there after all.) Sorry, my bad. I'll try the fixes mentioned throughout this thread -- esp. Phil's since he seems to have a similar configuration to me. Not sure when I'll get time to do that though. Thanks, - Dan
This problem also affects me on a Dell Inspiron 7500 with a 3Com 3CCFE575CT PC card when mounting filesystems from Solaris 8 and Irix 6.5 servers. My NFS read performance is OK, but NFS writes are absolutely horrible -- a 56Mb file which I could write in about 13 seconds under RedHat 7.2 would now take well over an hour if I had the patience to let it complete. Other network writes are OK, e.g., I can write just fine to a remote system via ftp. After reading some of the preceding comments on this topic I was able to devise two independent workarounds: (1) Change NFS_MAX_FILE_IO_BUFFER_SIZE in /usr/src/linux/include/linux/nfs_fs.h from 32768 to 8192. Then recompile a new kernel. (2) Edit the localoptions line in /etc/rc.d/init.d/autofs to read localoptions='rsize=8192,wsize=8192' This does not require a kernel rebuild but its effectiveness is limited to filesystems mounted via the automounter. Filesystems which are mounted via 'mount' must have '-o rsize=8192,wsize=8192' specified as arguments to the mount command. Either of these two solutions will work independently for me, and I am currently using (2) alone as it's easier to maintain. I should note that we do have other 7.3 systems at our site which do *not* exhibit this problem.
I tried that suggestion (2) (2) Edit the localoptions line in /etc/rc.d/init.d/autofs to read localoptions='rsize=8192,wsize=8192' That works for me also. Thanks. Phil
A fix is being made to the kernel. The default [rw]size will be set to 8KB, but can still be configured to 32KB by the user. We're also adding several nfs patches.
On my system, changing the default size (NFS_DEF_FILE_IO_BUFFER_SIZE in nfs_fs.h, currently set at 4096) did absolutely nothing. I had to change the *maximum* size (NFS_MAX_FILE_IO_BUFFER_SIZE) to fix the problem, diminishing it from 32768 to 8192. My incredibly uninformed guess is that although my system has a default size of 4096, it somehow negotiates the larger 32768 transfer size with the NFS server regardless of its 4096 default, and then the whole thing hangs. I had to limit the maximum size to inhibit this behavior (or explicitly specify 8192 at mount time as the transfer size).
My "me,too" is different hw (ne2k-pci) and custom kernel (2.4.18) in both cases: -Server 7.3 and client 7.2: speed is ok -Both 7.3: starting and quitting pine with 5 (!) small mails takes forever. I will do a recheck to see what changed during the upgrade, but I guess it's clear the hw is not the culprit and maybe even RH's own kernels (I was not using them before upgrade and I'm not now, still the problem popped up after the upgrade).
Please excuse my previous unreliable bugreport and comment: I found out it was due to traffic-shaping the wrong interface (of all things!). Never do critical changes late at night, right after an upgrade. Never do... Never do...
*** This bug has been marked as a duplicate of 64921 ***
In the for what it's worth dept. We have been running 7.2 for some time. A recent thunderstorm took out a box. I replaced it with a higher speed box. Put the hard drive in the new box and kudzu did its thing. It always took about 25 seconds to print a ps page. Now its 1.5 minutes. Hmmm, lpr and lpd with debug show fast execution. Its the net(local) somehow. Local print from that box with the same info, 10 seconds! What tripped my trigger, when the power wiped out the linux box, started printing to a winders box via samba, wow, much faster. I figured it must be in the permission checking and such, but 1.5 minutes, hmmmm. Have beat that to death, ready to give up. Created a new lpd.conf or is that perm,whatever, default allow. Still 1.5 minutes. I tried the print to port from other boxes, without much luck, filter errors, and such, prolly just throwing too much at the problem at once to see the problem, obvious stupidity. Will load a fresh copy of Suse 9.1 on it and see what happens.