Copying about 300k of data from local disk to the NetApp crashed the NetApp:

# ls -l /usr/local/bin
-rwxr-xr-x 1 root root 300969 May 16 14:27 rstlistend
lrwxrwxrwx 1 root root     25 May 16 14:27 rstterm -> /usr/local/bin/rs
# cp -af rst* ~/tmp/

My home dir is mounted as:

filerdude:/vol/vol0/home0/hjl /home/hjl nfs rw,v3,rsize=32768,wsize=32768,hard,udp,lock,addr=filerdude 0 0

The Linux NFS client machine was transmitting about 10MB/s over its 100Mb interface to the NetApp when it happened.
Created attachment 57648 [details] Kernel NFS debug message
I would like to know what version of Data ONTAP your NetApp was running when it crashed. (sudo rsh NETAPP sysconfig) Thanks, Quentin
Red Hat 7.3, NFS version 3, kernel 2.4.18-3 and/or 2.4.18-4, e100 driver. I am experiencing similar problems, except our NetApps are just brought to their knees. Whenever we do any significant I/O via NFS to a NetApp or Solaris 2.6 system, the server slows to a crawl because the 7.3 box just keeps hammering it. If I force the mount to NFS version 2, the problem goes away. Also note that the same client hardware running Red Hat 7.2 has no problems doing NFS version 3 to either NetApp or Solaris 2.6.
We get the same problem when writing from any RH 7.3 client to any non-Linux NFS server (Sparc Solaris 8, NetApp F760 6.1.1R2, NetApp F840 6.2R1). The write process hangs, and after a while the NFS server starts to fail. There is no way to kill the write process on the client because it is in the "D" state, but after about 10 minutes it dies on its own. The client starts writing again if I start tcpdump on it; if I stop tcpdump, the write process hangs again within a couple of seconds. If I resume tcpdump, writes resume and I can then kill the process. This happens only on RH 7.3; RH 7.2, 7.1, 6.2 and Mandrake 8.x work without any problems.
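For anyone trying to reproduce the tcpdump observation above, something like the following can be used to watch the NFS traffic while the write hangs (a sketch; the interface name eth0 and the client hostname are placeholders for your setup):

```shell
# Watch NFS-over-UDP traffic (port 2049) between client and server.
# If the client is stalled, you should see the stream of WRITE
# requests stop; on RH 7.3 merely running this capture (which puts
# the interface in promiscuous mode) appears to unstick the writes.
tcpdump -i eth0 -n udp port 2049
```

Note that tcpdump putting the NIC into promiscuous mode is itself a clue: if capturing traffic changes the client's behavior, the problem likely sits in the driver or interrupt path rather than in the NFS protocol exchange.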
mount -o wsize=8192 fixed the problem.
By default wsize=32768:

# grep nfs /proc/mounts
vega:/vol/v0/home/user /home/user nfs rw,v3,rsize=32768,wsize=32768,hard,udp,lock,addr=nfsserver 0 0
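To spell out the workaround: remount the share with 8K buffers instead of the 32K default, either on the command line or in /etc/fstab (a sketch; the server name, export path, and mount point below are placeholders taken from the example above):

```shell
# Remount with 8K read/write buffers instead of the 32K default
umount /home/user
mount -t nfs -o rw,nfsvers=3,rsize=8192,wsize=8192,hard,udp \
    vega:/vol/v0/home/user /home/user

# Or make it permanent in /etc/fstab:
# vega:/vol/v0/home/user  /home/user  nfs  rw,nfsvers=3,rsize=8192,wsize=8192,hard,udp  0 0
```

Forcing the mount back to NFS version 2 (nfsvers=2) works around the problem as well, at some cost in throughput.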
We have had the same results here at Stanford. One only needs to force NFS v2 or drop the r/wsize to 8k to work around it. But by default the buffers are set to 32K, which is too high a default. This throttles the NetApp in some as-yet-unknown way: it becomes unresponsive to all other requests, and the client doing the sustained write drives its network interface to 100% utilization.
More information to back up jlittle.edu -- this obviously doesn't affect only NetApps. All NFS servers I have tested this against (Solaris, IRIX) seem to be affected to some extent. We have put out a notice to our Stanford users warning of this issue, as it looks just like a DoS attack.
*** This bug has been marked as a duplicate of 64921 ***
This sounds remarkably like the problem of GigE to Fast Ethernet through a switch. With the large rsize and wsize of 32K, packets get dropped for lack of buffer space when moving from Gigabit to Fast Ethernet; the smaller rsize/wsize can squeeze through with only slightly decreased performance. This is a real pain, but could the root cause be that GigE is not using flow control properly?
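One way to check the flow-control theory is to look at whether pause frames are actually negotiated on the GigE side (a sketch assuming an ethtool-capable driver; eth0 is a placeholder for the interface facing the switch):

```shell
# Show the current pause (flow control) settings negotiated
# between the NIC and the switch
ethtool -a eth0

# If the switch supports it, try enabling RX/TX pause explicitly
ethtool -A eth0 rx on tx on
```

If pause is off on either side, a burst of 32K NFS writes fanned out as back-to-back UDP fragments would have nothing to slow it down before the switch's Gig-to-Fast buffer overflows, which would fit the symptoms reported above.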