Red Hat Bugzilla – Bug 64984
Redhat 7.3: nfs writes very slow.
Last modified: 2014-01-21 17:48:03 EST
Description of Problem: Redhat 7.3: nfs writes very slow.
Version-Release number of selected component (if applicable):
Kernel version 2.4.18-4
How Reproducible: Always.
Steps to Reproduce:
1. Upgrade a Dell Dimension 4100 800 MHz PIII with a
3COM 3C905-C-TXM 10/100 ethernet card from Red Hat 7.1 to 7.3.
Mount user files via nfs from a Network Appliance file server.
2. Try to run vi or mail on a 1.7 MB mail file. Delete some
messages, and try to write the file out.
3. After 5 or 6 minutes, the file will be written out. During this
time, other commands on the machine such as ps aux, dmesg, and top
will hang. After the file is written out and the machine wakes back up,
dmesg and /var/log/messages will show repeated messages of the form
nfs: server sinagua not responding, still trying
nfs: server sinagua OK
Actual Results: 5 or 6 minutes to write out a 1.7 MB file to the nfs file server.
Expected Results: 2 or 3 seconds.
I've observed this on the first 2 machines that I've put 7.3 on -
Dell Dimension 4100 800Mhz PIII
3COM 3C905-C-TXM 10/100 ethernet
I tried downloading, building, and using kernel version 2.5.15 from
kernel.org (the latest development version), and I don't seem to
have this problem with that kernel.
However, I have some 200 machines to upgrade to 7.3, and I don't know
if I want to put 2.5.15 on all of them.
Let me know if there is additional information that I can provide.
Dept. of Computer Science 520-621/2760
University of Arizona firstname.lastname@example.org
Tucson, Ariz. 85721
I think I ran into this myself, except that the result was
much more disastrous. Can you check on this info on your box,
and see if your case matches?
I installed (fresh install, not upgrade) 7.3 on a Dell
Optiplex GX-150, which is similar to your system. Notably, it
has the same Ethernet card: 3C905C (Vortex Boomerang.)
On previous RH versions (i.e. 7.1 and 7.2), NFS worked just
fine. However, it fails on 7.3, with the following symptoms:
- works fine for short reads
- longer reads (or writes) take a very, very long time to complete
Unfortunately, eventually my network support people shut off
my ethernet port, because my box was basically DOS flooding
the NFS server!
After investigation, it turns out that somehow my 3c905C is
experiencing A LOT of collisions, on a switched network. The
collisions are causing dropped TCP/IP packets. With TCP, this
just reduces throughput, but with UDP (which NFS uses) it
generated so many retries and resends between my box and the
server that I was flooding the NFS server. (The server was
complaining of "dropped IP fragments".)
I have noticed that the NFS issue is a red herring, in my
case: it's a symptom of this ethernet collision business (I
think.) The real question is what's up with ethernet.
This is the same hardware, and as near as I can tell it's the
same 3c59x.c driver version, so I'm not sure what changed.
Also, I have used 2.4.18 on 7.2, so I don't know what it is
about RH's 2.4.18-4 kernel in 7.3 that is causing this.
I tried creating a slimmed-down 2.4.18-4custom kernel with all
multicast and other advanced TCP/IP options disabled, with no effect.
My network analyst also had me try forcing the hardware into
10baseT/half duplex mode, just in case the collisions were the
result of autonegotiation failures. No effect there, either.
The odd thing is that this is a switched network, meaning that
I should never see ethernet collisions on my interface, yet
ifconfig reports exactly that. Also, the network guy says
that he is seeing FAR more collisions on his end than I see on mine.
So, that's my brain dump. Phil: can you poke around and see
if this is the same problem you are having? (i.e. look for
collisions via ifconfig.) If it's not, I need to enter
another bug. :)
Oh, forgot to mention another couple details. I did a
"modprobe 3c59x debug=7; ifup eth0" to see what the driver had
to say. I got lots of messages like this:
eth0: vortex_error(): status 0xe081
eth0: vortex_error(): status 0xe481
(I think -- I'm reciting from memory. But the errors are
0xe081 and 0xe481.)
One error seemed to correspond with each collision recorded by
the device. Also, the ratio of 0xe081 to 0xe481 errors is
Yeah - the networking people here have also encouraged me not to do this
(run Red Hat 7.3 with the 2.4.18 kernel, as an nfs client, on our network).
Doing nfs writes of a 2-3 MB mail file appears to kill the response on our
main nfs server (a Network Appliance) for other users as well as for me.
(Before today I was just noticing it on the box I was running 7.3 on,
but today it appears that this is affecting other people.)
ifconfig eth0 on the client gives me this:
osp 80 ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:01:03:24:6B:91
inet addr:220.127.116.11 Bcast:18.104.22.168 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:30970 errors:0 dropped:0 overruns:0 frame:0
TX packets:2556692 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:13536614 (12.9 Mb) TX bytes:3673708034 (3503.5 Mb)
Interrupt:3 Base address:0xdc00
The machine is on a switched network port (on a Cisco catalyst 5000).
We ran a capture on that switched port using an Etherpeek sniffer for
several minutes. It shows a lot of fragment errors between my machine
and the file server, with the message
"An IP datagram has been fragmented by the host application or a
router, and one of the fragments is missing".
10,555 errors on 11,327 packets comprising 16,050,566 bytes.
(The host and the file server are on the same network (22.214.171.124),
so traffic is not going thru a router.)
So it sounds like client NFS is broken in 7.3.
The Cisco switch is not showing errors on the port my machine is connected to,
but there are collisions on the port that the file server is connected to.
Thanks for your help. Let me know if there is more information that I can provide.
I did run " modprobe 3c59x debug=7 ; ifup eth0 " on the client.
/var/log/messages does give me a lot of output of the form
May 15 16:18:16 osprey kernel: eth0: vortex_error(), status=0xe481
May 15 16:18:24 osprey kernel: eth0: vortex_error(), status=0xe081
May 15 16:18:43 osprey kernel: eth0: vortex_error(), status=0xe481
May 15 16:19:16 osprey last message repeated 2 times
Just a quick me too. Only I wish you'd set the priority much higher (to
whatever the highest level is). I'm 99% sure it's an NFS issue. I have a RH7.3
client with a RH7.2 server and a Solaris 8 server. If I untar a 1.5M file from
the Solaris 8 server onto the RH7.3 client, I get ~20,000 client rpc retransmits
(according to nfsstat). If I untar the same file from the RH7.2 server onto the
RH7.3 client, I get an average of 0.5 client rpc retransmits.
Interestingly enough, if I untar the 1.5M file from the Solaris 8 server onto a
RH7.2 client, I get ~100 retransmits (versus an average of 0.5 if going from the
RH7.2 server to the RH7.2 client).
Is there some setting we should be using?
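The retransmit counts quoted above come from nfsstat's client RPC statistics. A minimal sketch of pulling that counter out follows; the here-doc output is a made-up sample for illustration only -- on a real client you would pipe `nfsstat -rc` itself:

```shell
#!/bin/sh
# Sketch: extract the client RPC retransmit count that nfsstat reports.
# The sample text below stands in for real output; on a live client use:
#   retrans=$(nfsstat -rc | awk 'NR==3 {print $2}')
sample='Client rpc stats:
calls      retrans    authrefrsh
2556692    20117      2556692'
# Line 3 holds the counters; field 2 is the retransmit count.
retrans=$(printf '%s\n' "$sample" | awk 'NR==3 {print $2}')
echo "client rpc retransmits: $retrans"
```

Snapshotting this counter before and after untarring the same file against each server is one way to reproduce the per-server comparison described above.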
This looks much more like the 3c59x driver is broken for some cards than an NFS problem.
Sorry, but I'm not using 3COM cards. The RH7.3 client is using an Intel
Ethernet Pro 100 (according to /proc/pci). The RH7.2 server is using an Intel
82544EI Gigabit Ethernet Controller. NFS works fine between the RH7.3 client
and RH7.2 server, but has serious problems between the RH7.3 client and the
Solaris 8 server. All previous version of RH (from 6.2-7.2) have basically no
problems with the Solaris 8 server (I get some retransmits, but no where near
the amount that I get with RH 7.3).
What rsize/wsize are you using? Does the problem go away if you set rsize=1024
and wsize=1024 on the client? Is there any packet loss when pinging the solaris
server with ping -s 8300 or ping -s 4300? This sounds like two separate bugs
and should probably be entered separately -- it's easier to mark bugs as
duplicate than to deal with two threads of information in a single bug.
For the person with the 3com, double check the duplex of the link and try the
above. You may need to force full duplex on 3com driver with full_duplex=1 and
I tried putting '-rsize=1024,-wsize=1024' on the lines for autofs in
and 'options 3c59x full_duplex=1' in /etc/modules.conf, and rebooting.
The Cisco switch indicates that the port it is connected to is at auto-full
duplex and auto-100 speed.
I again tried copying a 2 MB mail file and editing it using mail, d, and q.
It still takes about 80 seconds to write it out, during which commands hang,
and afterwards dmesg shows
nfs: server sinagua not responding, still trying
nfs: server sinagua OK
I can't keep running these tests in a production environment here, because
of the effect it has on our main nfs file server, and on other users.
What about the results of the pings? How much packet loss is being seen?
When I start a ping to sinagua (a Network Appliance running
NetApp Release 6.1.1), then run mail, d, q, as above, I get
ping -s 8300 sinagua
--- sinagua.cs.arizona.edu ping statistics ---
22 packets transmitted, 19 received, 13% loss, time 21193ms
rtt min/avg/max/mdev = 2.279/2.452/3.832/0.379 ms
ping -s 4300 sinagua
--- sinagua.cs.arizona.edu ping statistics ---
24 packets transmitted, 21 received, 12% loss, time 23198ms
rtt min/avg/max/mdev = 1.497/1.654/4.019/0.531 ms
Ok, the [rw]size=1024 fixed my problem. I tried various sizes. It looks like
RH changed the default from (I believe) 4096 to 32768??? I have no trouble up
to 8192. I start to see retransmits climb at 16384, but everything is still
usable. When I tried 32768, I see tens of thousands of retransmits when
untarring a 1.5M file. So, did RH up the default to 32768?
BTW, I have 0 packet loss (over 1000 iterations) when I'm not trying to
access the NFS server (I didn't try it while accessing the NFS server).
When I try ping, ping -s 8300, and ping -s 4300 to sinagua (the NetApp)
while not doing that test (mail, d, q) on the client, I get 0% packet loss.
FYI, the NFS server I referred to in my case is also a Network
Appliance, and I see the same symptoms. Not sure what my rsize
and wsize are offhand.
Also, Alan: as near as I can tell (according to version) the
3c59x driver is the same as it was in 7.2, and it's the same
hardware. Not sure whether the collisions are actually
relevant or not, but I would guess that ethernet has to be
playing a role somewhere or Phil and I would not be getting the
same errors from 3c59x. Also, I'm on 10bT/half whereas Phil's
on 100bT/full, which may explain why I see actual collisions
and he doesn't.
I may attempt to build a stock 2.4.18 (which worked on 7.2) on
my box tomorrow and see if anything changes. Like Phil,
though, I can't test this much, since I'm in a production environment.
I'd just like to add one more "me too" and share my experiences. I originally
suspected the 3c59x driver included with RH 7.3, so I built the driver from
scyld.com, as well as compared the driver that was included with RH 7.2, and
got the same results. I then reasoned that my card (3c905B) might have gone
bad and replaced it with an Intel EtherExpress Pro card. So I've experienced
this with both NICs. Also, I've tried tuning the NFS options (I had
previously used the default settings, which always worked fine) by setting
rsize and wsize to 8192, and increasing the timeo value. The increase of the
timeo value (I've now got it set to 20, which is 2 seconds) has done the most
to make my performance acceptable, but I can still trigger the problem by
using NFS to write or read a large file. Also, I am using my RH 7.3 client to
access a Solaris 8 NFS server. Today, I built 2.4.19-pre8 from kernel.org,
and I'm still seeing some retransmissions, but nowhere near as many as I was
with the RedHat-provided 2.4.18-4 kernel. I'm also going to build 2.4.18
stock and see what results I get. I'll post another update if 2.4.18 behaves
any differently than 2.4.19-pre8, but for now, I'm thinking that something is
amiss with 2.4.18-4.
Just wondering if this problem is related to that in bug # 64921, NFS version 3
I think bug # 64921 is related.
I tried the suggestion in that thread --
setting localoptions='nfsvers=2' in /etc/init.d/autofs.
That appears to fix the problem for me.
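A sketch of that workaround follows. For safety it patches a local sample copy of the init script; on a real client you would edit /etc/init.d/autofs itself and then run `service autofs restart`:

```shell
#!/bin/sh
# Sketch: force the automounter to request NFSv2, per the bug 64921 thread.
# We patch a sample copy here; the real file is /etc/init.d/autofs.
cat > autofs.sample <<'EOF'
#!/bin/bash
localoptions=''
EOF
sed "s/^localoptions=.*/localoptions='nfsvers=2'/" autofs.sample > autofs.patched
grep '^localoptions' autofs.patched
```

After editing the real script, restarting the autofs service makes subsequent automounts use NFSv2.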
I know this is pretty much a pain in the ass, but it might be worth the time and
pain to patch in a good number of Neil's and Trond's server and client patches
for 2.4.18 - they are all linked from nfs.sourceforge.net. Neil and Trond can
both give a good set of recommendations on which patches to use, and these
problems only appear to be reported on Red Hat's kernels, not stock ones patched
with Neil's patches.
As a follow-up, after working with my network people some more, it looks like all
that stuff I spouted about collisions and stuff is nonsense, and not related to
this. (i.e. I downgraded to RH 7.2 and checked and the collisions are still there
after all.) Sorry, my bad.
I'll try the fixes mentioned throughout this thread -- esp. Phil's, since he seems
to have a similar configuration to mine. Not sure when I'll get time to do that.
PC card when mounting filesystems from Solaris 8 and Irix 6.5 servers.
My NFS read performance is OK, but NFS writes are absolutely horrible --
a 56 MB file which I could write in about 13 seconds under Red Hat 7.2 would
now take well over an hour if I had the patience to let it complete.
Other network writes are OK, e.g., I can write just fine to a remote system
After reading some of the preceding comments on this topic I was able to devise
two independent workarounds:
(1) Change NFS_MAX_FILE_IO_BUFFER_SIZE in /usr/src/linux/include/linux/nfs_fs.h
from 32768 to 8192. Then recompile a new kernel.
(2) Edit the localoptions line in /etc/rc.d/init.d/autofs to read
    localoptions='rsize=8192,wsize=8192'
This does not require a kernel rebuild, but its effectiveness is limited
to filesystems mounted via the automounter. Filesystems which are
mounted via 'mount' must have '-o rsize=8192,wsize=8192' specified as
arguments to the mount command.
Either of these two solutions will work independently for me, and I am
currently using (2) alone as it's easier to maintain. I should note that
we do have other 7.3 systems at our site which do *not* exhibit this problem.
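For non-automounted filesystems, the equivalent mount invocation and fstab entry look roughly like the fragment below; the server name and paths are hypothetical examples, not taken from this report:

```shell
# Example only -- server name and mount points are hypothetical.
mount -t nfs -o rsize=8192,wsize=8192 sinagua:/vol/home /mnt/home

# or persistently, via an /etc/fstab entry:
# sinagua:/vol/home  /mnt/home  nfs  rsize=8192,wsize=8192  0  0
```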
I tried that suggestion (2) -- editing the localoptions line in
/etc/rc.d/init.d/autofs. That works for me also.
A fix is being made to the kernel. The default [rw]size will be set to 8KB, but
can still be configured to 32KB by the user. We're also adding several nfs patches.
On my system, changing the default size (NFS_DEF_FILE_IO_BUFFER_SIZE in
nfs_fs.h, currently set at 4096) did absolutely nothing. I had to change
the *maximum* size (NFS_MAX_FILE_IO_BUFFER_SIZE) to fix the problem,
diminishing it from 32768 to 8192.
My incredibly uninformed guess is that although my system has a default size
of 4096, it somehow negotiates the larger 32768 transfer size with the
NFS server regardless of its 4096 default, and then the whole thing hangs.
I had to limit the maximum size to inhibit this behavior (or explicitly
specify 8192 at mount time as the transfer size).
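The cap change described above amounts to a one-line header edit followed by a kernel rebuild. A sketch, applied here to a sample of the header line for safety (on a real box you would edit /usr/src/linux/include/linux/nfs_fs.h and then rebuild, e.g. `make dep bzImage modules modules_install`):

```shell
#!/bin/sh
# Sketch: lower NFS_MAX_FILE_IO_BUFFER_SIZE from 32768 to 8192.
# Shown against a sample of the #define line; the real file is
# /usr/src/linux/include/linux/nfs_fs.h, and a kernel rebuild must follow.
line='#define NFS_MAX_FILE_IO_BUFFER_SIZE 32768'
patched=$(printf '%s\n' "$line" | sed 's/32768/8192/')
printf '%s\n' "$patched"
```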
My "me, too" is different hw (ne2k-pci) and a custom kernel (2.4.18) in both cases:
- Server 7.3 and client 7.2: speed is ok.
- Both 7.3: starting and quitting pine with 5 (!) small mails takes forever.
I will do a recheck to see what changed during the upgrade, but I guess it's
clear the hw is not the culprit, and maybe not even RH's own kernels (I was not
using them before the upgrade and I'm not now, yet the problem popped up after
Please excuse my previous unreliable bug report and comment: I found
out it was due to traffic-shaping the wrong interface (of all things!).
Never do critical changes late at night, right after an upgrade.
*** This bug has been marked as a duplicate of 64921 ***
In the for-what-it's-worth dept.: We have been running 7.2 for some
time. A recent thunderstorm took out a box. I replaced it with a
higher-speed box, put the hard drive in the new box, and kudzu did
its thing. It always took about 25 seconds to print a ps page; now
it's 1.5 minutes. Hmmm, lpr and lpd with debug show fast execution.
It's the net (local) somehow. A local print from that box with the same
info takes 10 seconds! What tripped my trigger: when the power wiped out
the linux box, I started printing to a Windows box via samba -- wow, much
faster. I figured it must be in the permission checking and such,
but 1.5 minutes, hmmmm. Have beat that to death, ready to give up.
Created a new lpd.conf (or is that the perms file, whatever) with default allow.
Still 1.5 minutes. I tried the print-to-port from other boxes,
without much luck -- filter errors and such; probably just throwing too
much at the problem at once to see it. Obvious stupidity.
Will load a fresh copy of SuSE 9.1 on it and see what happens.