Red Hat Bugzilla – Bug 65069
Copy 300k data crashed NetApp
Last modified: 2007-04-18 12:42:37 EDT
Copying about 300k of data from local disk to a NetApp crashed the NetApp:
# ls -l /usr/local/bin
-rwxr-xr-x 1 root root 300969 May 16 14:27 rstlistend
lrwxrwxrwx 1 root root 25 May 16 14:27 rstterm ->
# cp -af rst* ~/tmp/
My home dir is mounted as
filerdude:/vol/vol0/home0/hjl /home/hjl nfs
rw,v3,rsize=32768,wsize=32768,hard,udp,lock,addr=filerdude 0 0
The Linux NFS client machine was transmitting about 10 MB/s over a 100 Mb/s interface
to the NetApp when it happened.
Created attachment 57648
Kernel NFS debug message
I would like to know what version of Data ONTAP your NetApp
was running when it crashed (sudo rsh NETAPP sysconfig).
Red Hat 7.3, NFS version 3, kernel 2.4.18-3 and/or 2.4.18-4, e100 driver.
I am experiencing similar problems, except that our NetApps are just brought to
their knees. Whenever we do any significant I/O via NFS to a NetApp or Solaris 2.6
system, the server slows to a crawl because the 7.3 box just keeps hammering it.
If I force the mount to NFS version 2, the problem goes away. Also note that
the same client hardware running Red Hat 7.2 has no problems doing NFS version
3 to either NetApp or Solaris 2.6.
We have the same problem when writing from any RH-7.3 client to any non-Linux
NFS server (Sparc Solaris 8, NetApp F760 6.1.1R2, NetApp F840 6.2R1). The write
process hangs, and after a while the NFS server starts to fail. The write
process cannot be killed on the client because it is in the "D" state, but
it dies after about 10 minutes. The client starts writing again if I start
tcpdump on it; if I stop tcpdump, the write process hangs again within a couple
of seconds; when I resume tcpdump, the writes resume and I can kill the process.
This happens only on RH-7.3. RH 7.2, 7.1, 6.2 and Mandrake 8.x work without any
problems.
mount -o wsize=8192 fixed the problem.
By default, wsize=32768:
# grep nfs /proc/mounts
vega:/vol/v0/home/user /home/user nfs
rw,v3,rsize=32768,wsize=32768,hard,udp,lock,addr=nfsserver 0 0
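To check which mounts on a client are still using the large default, something along these lines may help. It is only a sketch: `nfs_large_wsize` is a made-up helper name, and it assumes each entry is on a single line in the /proc/mounts layout shown above (device, mount point, type, options).

```shell
# Hypothetical helper: print mount point and wsize for every NFS-over-UDP
# mount whose wsize exceeds the given limit (default 8192).
# Reads /proc/mounts-style lines on stdin.
nfs_large_wsize() {
    awk -v limit="${1:-8192}" '
        $3 == "nfs" {
            udp = 0; wsize = 0
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) {
                if (opts[i] == "udp") udp = 1
                # "wsize=" is 6 characters, so the value starts at position 7
                if (index(opts[i], "wsize=") == 1) wsize = substr(opts[i], 7) + 0
            }
            if (udp && wsize > limit) print $2, wsize
        }'
}

# Usage (on the client):
#   nfs_large_wsize 8192 < /proc/mounts
```

Any mount point it lists is a candidate for remounting with wsize=8192.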
We have had the same results here at Stanford. One only needs to force NFS v2
or drop the w/rsize to 8k to solve it. By default, however, the buffers are set
to 32K, which is too high a default. This throttles the NetApp in some as yet
unknown way (it is unresponsive to all other requests, and the client that is
doing a sustained write goes up to 100% utilization of its network interface).
More information to back up email@example.com -- this obviously doesn't
affect only NetApps. All NFS servers I have tested this against (Solaris, IRIX)
seem to be affected to some extent. We have put out a notice to our Stanford
users warning of this issue, as it looks just like a DoS attack.
*** This bug has been marked as a duplicate of 64921 ***
This sounds remarkably like the problem of GigE to Fast Ethernet thru a switch.
The large rsize & wsize of 32K get dropped for lack of buffer space moving from Gig to Fast Ethernet.
The smaller rsize & wsize can squeeze thru with only slightly decreased performance.
This is a real pain, but could the root cause be that GigE is not using flow control properly?
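As a rough illustration of why 32K transfers over UDP are fragile on such a path, the fragment count per write can be worked out as below. This is a sketch under assumptions not stated in this report: a standard 1500-byte Ethernet MTU, a 20-byte IPv4 header without options, and an 8-byte UDP header. RPC/NFS header bytes are ignored, so the result is a lower bound. `udp_fragments` is a made-up helper name.

```shell
# Hypothetical helper: minimum number of IP fragments needed to carry one
# NFS-over-UDP write of $1 data bytes on a link with MTU $2 (default 1500).
udp_fragments() {
    size=$1
    mtu=${2:-1500}
    # Payload per fragment: MTU minus the 20-byte IPv4 header,
    # rounded down to a multiple of 8 as IP fragmentation requires.
    per_frag=$(( (mtu - 20) / 8 * 8 ))
    # Add the 8-byte UDP header, then round up to whole fragments.
    echo $(( (size + 8 + per_frag - 1) / per_frag ))
}

udp_fragments 32768   # prints 23
udp_fragments 8192    # prints 6
```

Each 32K write therefore reaches the switch as a back-to-back burst of at least 23 frames, and losing any one of them loses the whole datagram, so the client has to retransmit all 32K. An 8K write is only about 6 frames.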