Bug 65069 - Copy 300k data crashed NetApp
Status: CLOSED DUPLICATE of bug 64921
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Platform: i386 Linux
Priority: medium  Severity: high
Assigned To: Ben LaHaise
QA Contact: Brian Brock
Reported: 2002-05-16 18:36 EDT by hjl
Modified: 2007-04-18 12:42 EDT (History)
9 users on CC

Doc Type: Bug Fix
Last Closed: 2002-05-28 17:47:37 EDT

Attachments
Kernel NFS debug message (226.13 KB, text/plain)
2002-05-16 18:36 EDT, hjl

Description hjl 2002-05-16 18:36:17 EDT
When I was copying 300k of data from the local disk to the NetApp,

# ls -l /usr/local/bin
-rwxr-xr-x    1 root     root       300969 May 16 14:27 rstlistend
lrwxrwxrwx    1 root     root           25 May 16 14:27 rstterm ->
/usr/local/bin/rs
# cp -af rst* ~/tmp/

it crashed the NetApp. My home directory is mounted as

filerdude:/vol/vol0/home0/hjl /home/hjl nfs
rw,v3,rsize=32768,wsize=32768,hard,udp,lock,addr=filerdude 0 0

The Linux NFS client machine was transmitting about 10 MB/s over a 100 Mb/s
interface to the NetApp when it happened.
Comment 1 hjl 2002-05-16 18:36:55 EDT
Created attachment 57648 [details]
Kernel NFS debug message
Comment 2 Quentin Fennessy 2002-05-17 08:37:30 EDT
I would like to know what version of Data ONTAP your NetApp
was running when it crashed. (sudo rsh NETAPP sysconfig)

Thanks
Quentin
Comment 3 gerry.morong 2002-05-17 14:44:37 EDT
Red Hat 7.3  NFS Version 3  kernel 2.4.18-3 and/or 2.4.18-4  e100 driver.

I am experiencing similar problems, except our NetApps are just brought to
their knees. Whenever we do any significant I/O via NFS to a NetApp or Solaris
2.6 system, the systems slow to a crawl because the 7.3 box just keeps
hammering them.
If I force the mount to be NFS version 2 the problem goes away.  Also note that 
the same client hardware running Red Hat 7.2 has no problems doing NFS version 
3 to either NetApp or Solaris 2.6.
Comment 4 Need Real Name 2002-05-18 01:02:23 EDT
We got the same problem when writing from any RH-7.3 client to any non-Linux
NFS server (Sparc Solaris 8, NetApp F760 6.1.1R2, NetApp F840 6.2R1). The
write process hangs, then after a while the NFS server starts to fail. There
is no way to kill the write process on the client, because it is in the "D"
state, but after about 10 minutes it will die. The client will start to write
if I start tcpdump on it; if I stop tcpdump, the write process hangs again
within a couple of seconds; if I resume tcpdump, writes resume and I can kill
the process. This happens only on RH-7.3. RH 7.2, 7.1, 6.2 and Mandrake 8.x
work without any problems.
Comment 5 Need Real Name 2002-05-18 01:15:51 EDT
mount -o wsize=8192 fixed the problem.
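A complete mount invocation for this workaround might look like the following sketch. The server name and paths are placeholders, and 8192 is the write size reporters in this thread found safe; the exact option spelling can vary across mount/kernel versions:

```shell
# One-off mount with a reduced write size (and matching read size):
mount -t nfs -o rw,nfsvers=3,rsize=8192,wsize=8192,hard,udp \
    filer:/vol/vol0/home0/user /home/user

# Equivalent /etc/fstab entry:
# filer:/vol/vol0/home0/user  /home/user  nfs  rw,nfsvers=3,rsize=8192,wsize=8192,hard,udp  0 0
```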
Comment 6 Need Real Name 2002-05-18 02:02:09 EDT
by default wsize=32768

# grep nfs /proc/mounts
vega:/vol/v0/home/user /home/user nfs
rw,v3,rsize=32768,wsize=32768,hard,udp,lock,addr=nfsserver 0 0
Comment 7 Need Real Name 2002-05-20 13:49:15 EDT
We have had the same results here at Stanford. One only needs to force the
mount to NFS v2 or drop the rsize/wsize to 8k to work around it. But by
default it sets the buffers to 32K, which is too high a default. This
throttles the NetApp in some as-yet-unknown way (it becomes unresponsive to
all other requests, and the client that is doing a sustained write drives its
network interface to 100% utilization).
Comment 8 Need Real Name 2002-05-21 14:23:47 EDT
More information to back up jlittle@cs.stanford.edu -- this obviously doesn't
affect only NetApps. All NFS servers I have tested this against (Solaris, IRIX)
seem to be affected to some extent. We have put out a notice to our Stanford
users warning of this issue, as it looks just like a DoS attack.
Comment 9 Ben LaHaise 2002-05-28 17:48:06 EDT

*** This bug has been marked as a duplicate of 64921 ***
Comment 10 Need Real Name 2002-05-28 18:14:55 EDT
This sounds remarkably like the problem of GigE to Fast Ethernet through a switch.

With the large rsize and wsize of 32K, packets get dropped for lack of buffer space moving from Gigabit to Fast Ethernet.

The smaller rsize/wsize can squeeze through with only slightly decreased performance.

This is a real pain, but could the root cause be that GigE is not using flow control properly?
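As a way to check the flow-control theory above, the negotiated pause-frame (802.3x flow control) settings on a Linux interface can be inspected with ethtool. A sketch only: `eth0` is a placeholder interface name, and not all NIC drivers support these options:

```shell
# Show the current pause (flow control) parameters for the interface:
ethtool -a eth0

# Enable RX/TX pause frames, if the driver supports it:
ethtool -A eth0 rx on tx on
```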

