Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 65069

Summary: Copy 300k data crashed NetApp
Product: [Retired] Red Hat Linux
Reporter: hjl
Component: kernel
Assignee: Ben LaHaise <bcrl>
Status: CLOSED DUPLICATE
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 7.3
CC: gedetil, gerry.morong, jenson, jortega, karlamrhein, kresa, quentin.fennessy, sasha, sysadmin
Hardware: i386
OS: Linux
Doc Type: Bug Fix
Last Closed: 2002-05-28 21:47:37 UTC
Attachments: Kernel NFS debug message

Description hjl 2002-05-16 22:36:17 UTC
Copying about 300 kB of data from the local disk to a NetApp filer

# ls -l /usr/local/bin
-rwxr-xr-x    1 root     root       300969 May 16 14:27 rstlistend
lrwxrwxrwx    1 root     root           25 May 16 14:27 rstterm ->
/usr/local/bin/rs
# cp -af rst* ~/tmp/

crashed the NetApp. My home directory is mounted as

filerdude:/vol/vol0/home0/hjl /home/hjl nfs
rw,v3,rsize=32768,wsize=32768,hard,udp,lock,addr=filerdude 0 0

The Linux NFS client machine was transmitting about 10 MB/s over a 100 Mb/s
interface to the NetApp when it happened.

Comment 1 hjl 2002-05-16 22:36:55 UTC
Created attachment 57648 [details]
Kernel NFS debug message

Comment 2 Quentin Fennessy 2002-05-17 12:37:30 UTC
I would like to know what version of Data ONTAP your NetApp
was running when it crashed. (sudo rsh NETAPP sysconfig)

Thanks
Quentin

Comment 3 gerry.morong 2002-05-17 18:44:37 UTC
Red Hat 7.3  NFS Version 3  kernel 2.4.18-3 and/or 2.4.18-4  e100 driver.

I am experiencing similar problems, except that our NetApps are just brought to
their knees. Whenever we do any significant I/O via NFS to a NetApp or Solaris 2.6
system, the servers slow to a crawl because the 7.3 box just keeps hammering them.
If I force the mount to NFS version 2, the problem goes away. Also note that
the same client hardware running Red Hat 7.2 has no problem doing NFS version
3 to either NetApp or Solaris 2.6.


Comment 4 Need Real Name 2002-05-18 05:02:23 UTC
We get the same problem when writing from any RH 7.3 client to any non-Linux
NFS server (Sparc Solaris 8, NetApp F760 6.1.1R2, NetApp F840 6.2R1). The write
process hangs, and after a while the NFS server starts to fail. There is no way
to kill the write process on the client, because it is in the "D" state, but
after about 10 minutes it dies on its own. The client starts writing again if I
run tcpdump on it; if I stop tcpdump, the write hangs again within a couple of
seconds, and if I resume tcpdump the writes resume and I can kill the process.
It happens only on RH 7.3. RH 7.2, 7.1, 6.2 and Mandrake 8.x work without any problems.

Comment 5 Need Real Name 2002-05-18 05:15:51 UTC
mount -o wsize=8192 fixed the problem.
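A sketch of how the workaround above could be applied in full, reusing the server name and export path from the original report (the nfsvers/udp options simply restate the mount shown there):

```shell
# Remount the share with an 8 KB write buffer instead of the 32 KB
# default that the Red Hat 7.3 client negotiates by default.
# "filerdude" and the export path are taken from the report above.
mount -t nfs -o rw,nfsvers=3,rsize=8192,wsize=8192,hard,udp \
    filerdude:/vol/vol0/home0/hjl /home/hjl

# Or make it permanent in /etc/fstab:
# filerdude:/vol/vol0/home0/hjl /home/hjl nfs rw,nfsvers=3,rsize=8192,wsize=8192,hard,udp 0 0
```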

Comment 6 Need Real Name 2002-05-18 06:02:09 UTC
By default, wsize=32768:

# grep nfs /proc/mounts
vega:/vol/v0/home/user /home/user nfs
rw,v3,rsize=32768,wsize=32768,hard,udp,lock,addr=nfsserver 0 0

Comment 7 Need Real Name 2002-05-20 17:49:15 UTC
We have had the same results here at Stanford. One only needs to force NFS v2
or drop wsize/rsize to 8K to solve it. But by default the client sets the buffers
to 32K, which is too high a default. This throttles the NetApp in some as-yet
unknown way: it becomes unresponsive to all other requests, and the client
doing a sustained write drives its network interface to 100% utilization.

Comment 8 Need Real Name 2002-05-21 18:23:47 UTC
More information to back up jlittle.edu -- this obviously doesn't
affect only NetApps. All NFS servers I have tested this against (Solaris, IRIX)
seem to be affected to some extent. We have put out a notice to our Stanford
users warning of this issue, as it looks just like a DoS attack.

Comment 9 Ben LaHaise 2002-05-28 21:48:06 UTC

*** This bug has been marked as a duplicate of 64921 ***

Comment 10 Need Real Name 2002-05-28 22:14:55 UTC
This sounds remarkably like the problem of GigE to Fast Ethernet through a switch.

The large rsize and wsize of 32K gets dropped for lack of buffer space moving from Gigabit to Fast Ethernet.

The smaller rsize/wsize can squeeze through with only slightly decreased performance.

This is a real pain, but could the root cause be that GigE is not using flow control properly?
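One way to probe the flow-control hypothesis is to inspect pause-frame negotiation on the GigE host. A sketch, assuming the interface is eth0 and that the NIC driver supports ethtool's pause controls (both are assumptions, and ethtool needs root):

```shell
# Show whether IEEE 802.3x pause-frame (flow control) negotiation is
# enabled; without it, a GigE sender can overrun the switch buffers
# feeding a Fast Ethernet port, dropping large UDP fragments.
ethtool -a eth0

# Enable RX/TX pause-frame support if the NIC and switch allow it.
ethtool -A eth0 rx on tx on
```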