Copying about 300k of data from local disk to the NetApp crashed the NetApp:

# ls -l /usr/local/bin
-rwxr-xr-x 1 root root 300969 May 16 14:27 rstlistend
lrwxrwxrwx 1 root root     25 May 16 14:27 rstterm -> /usr/local/bin/rs
# cp -af rst* ~/tmp/

My home dir is mounted as:

filerdude:/vol/vol0/home0/hjl /home/hjl nfs rw,v3,rsize=32768,wsize=32768,hard,udp,lock,addr=filerdude 0 0

The Linux NFS client machine was transmitting about 10MB/s over its 100Mb interface to the NetApp when it happened.
Created attachment 57648 [details] Kernel NFS debug message
I would like to know what version of Data ONTAP your NetApp was running when it crashed. (sudo rsh NETAPP sysconfig) Thanks, Quentin
Red Hat 7.3, NFS version 3, kernel 2.4.18-3 and/or 2.4.18-4, e100 driver. I am experiencing similar problems, except our NetApps are just brought to their knees. Whenever we do any significant I/O via NFS to a NetApp or Solaris 2.6 system, the server slows to a crawl because the 7.3 box just keeps hammering it. If I force the mount to NFS version 2, the problem goes away. Also note that the same client hardware running Red Hat 7.2 has no problems doing NFS version 3 to either NetApp or Solaris 2.6.
We get the same problem when writing from any RH 7.3 client to any non-Linux NFS server (Sparc Solaris 8, NetApp F760 6.1.1R2, NetApp F840 6.2R1). The write process hangs, and after a while the NFS server starts to fail. There is no way to kill the write process on the client because it is in the "D" state, but after about 10 minutes it dies on its own. The client starts writing again if I start tcpdump on it; if I stop tcpdump, the write process hangs again within a couple of seconds. If I resume tcpdump, writes resume and I can then kill the process. This happens only on RH 7.3; RH 7.2, 7.1, 6.2 and Mandrake 8.x work without any problems.
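For anyone trying to reproduce the tcpdump observation above, something like the following can be used to watch the NFS traffic while the write hangs (a sketch; the interface name eth0 and the client hostname are placeholders for your setup):

```shell
# Watch NFS-over-UDP traffic (port 2049) between client and server.
# If the client is stalled, you should see the stream of WRITE
# requests stop; on RH 7.3 merely running this capture (which puts
# the interface in promiscuous mode) appears to unstick the writes.
tcpdump -i eth0 -n udp port 2049
```

Note that tcpdump putting the NIC into promiscuous mode is itself a clue: if capturing traffic changes the client's behavior, the problem likely sits in the driver or interrupt path rather than in the NFS protocol exchange.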
mount -o wsize=8192 fixed the problem.
By default wsize=32768:

# grep nfs /proc/mounts
vega:/vol/v0/home/user /home/user nfs rw,v3,rsize=32768,wsize=32768,hard,udp,lock,addr=nfsserver 0 0
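To spell out the workaround: remount the share with 8K buffers instead of the 32K default, either on the command line or in /etc/fstab (a sketch; the server name, export path, and mount point below are placeholders taken from the example above):

```shell
# Remount with 8K read/write buffers instead of the 32K default
umount /home/user
mount -t nfs -o rw,nfsvers=3,rsize=8192,wsize=8192,hard,udp \
    vega:/vol/v0/home/user /home/user

# Or make it permanent in /etc/fstab:
# vega:/vol/v0/home/user  /home/user  nfs  rw,nfsvers=3,rsize=8192,wsize=8192,hard,udp  0 0
```

Forcing the mount back to NFS version 2 (nfsvers=2) works around the problem as well, at some cost in throughput.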
We have had the same results here at Stanford. One only needs to force NFS v2 or drop the r/wsize to 8k to work around it. But by default the buffers are set to 32K, which is too high a default. This throttles the NetApp in some as-yet-unknown way: it becomes unresponsive to all other requests, and the client doing the sustained write drives its network interface to 100% utilization.
More information to back up jlittle.edu -- this obviously doesn't affect only NetApps. All NFS servers I have tested this against (Solaris, IRIX) seem to be affected to some extent. We have put out a notice to our Stanford users warning of this issue, as it looks just like a DoS attack.
*** This bug has been marked as a duplicate of 64921 ***
This sounds remarkably like the problem of GigE to Fast Ethernet through a switch. With the large rsize and wsize of 32K, packets get dropped for lack of buffer space when moving from Gigabit to Fast Ethernet; the smaller rsize/wsize can squeeze through with only slightly decreased performance. This is a real pain, but could the root cause be that GigE is not using flow control properly?
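One way to check the flow-control theory is to look at whether pause frames are actually negotiated on the GigE side (a sketch assuming an ethtool-capable driver; eth0 is a placeholder for the interface facing the switch):

```shell
# Show the current pause (flow control) settings negotiated
# between the NIC and the switch
ethtool -a eth0

# If the switch supports it, try enabling RX/TX pause explicitly
ethtool -A eth0 rx on tx on
```

If pause is off on either side, a burst of 32K NFS writes fanned out as back-to-back UDP fragments would have nothing to slow it down before the switch's Gig-to-Fast buffer overflows, which would fit the symptoms reported above.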