114021 – NFS read stalls in e.34 and e.35 kernels

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	2.1
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-01-21 15:03 UTC by Need Real Name
Modified:	2007-11-30 22:06 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-04-15 15:00:58 UTC
Target Upstream Version:
Embargoed:

Description Need Real Name 2004-01-21 15:03:33 UTC

Description of problem:

Copying a file from an NFS mount to local disk exhibits stalls as long
as 5 seconds:

[relevant output from strace -tt cp /path/to/nfs/file /tmp]
11:44:01.195024 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
11:44:06.789237 fstat64(4, {st_mode=S_IFREG|0644, st_size=1347584, ...}) = 0

These stalls ultimately cause the copy of a 1.44MB floppy disk image
to take about a minute and a half to complete.  The strace -r output
did not show any unusual timings for the system calls, but the -tt
wall-clock output did show these long delays between read() and the
subsequent fstat64()/_llseek()/fcntl64()/write() calls.
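For anyone repeating this analysis, the gaps in the -tt output can be picked out mechanically rather than by eye. The following is only a minimal sketch: the function name find_stalls and the 1-second threshold are assumptions made for illustration, it relies on the -tt timestamp format shown in the excerpt above, and it does not handle a trace that crosses midnight.

```python
import re
from datetime import datetime

# Matches the start of an `strace -tt` line: "HH:MM:SS.usec syscall(".
TS = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) (\w+)\(")

def find_stalls(lines, threshold=1.0):
    """Return (gap_seconds, previous_call, next_call) for every gap
    between consecutive timestamped lines that is >= threshold."""
    stalls = []
    prev = None
    for line in lines:
        m = TS.match(line)
        if not m:
            continue  # wrapped continuation line or other non-syscall output
        stamp = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        name = m.group(2)
        if prev is not None:
            gap = (stamp - prev[0]).total_seconds()
            if gap >= threshold:
                stalls.append((gap, prev[1], name))
        prev = (stamp, name)
    return stalls

# The two lines quoted in this report, with the read buffer shortened.
sample = [
    '11:44:01.195024 read(3, "\\0\\0\\0"..., 4096) = 4096',
    '11:44:06.789237 fstat64(4, {st_mode=S_IFREG|0644, st_size=1347584, ...}) = 0',
]

for gap, before, after in find_stalls(sample):
    print(f"{gap:.3f}s between {before}() and {after}()")
```

Run against the two lines quoted above, this reports a single gap of roughly 5.6 seconds between read() and fstat64(). To scan a full trace, pass the lines of a saved strace log instead of the sample list.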

Version-Release number of selected component (if applicable):

kernel-2.4.9-e.34smp
kernel-2.4.9-e.35smp

How reproducible:

In our testing, this was 100% reproducible with both kernels over UDP,
using the default mount [rw]size as well as 4096, 8192, and 32768.  This
was against a NetApp 960 running OnTap 6.4.2P6.  During these tests, the
filer was showing ~10% CPU utilization, and the interface we were
testing against was not heavily used.  We did not test the TCP transport.

Steps to Reproduce:
1. mount filer:/volX/export /mnt/filer
2. time cp /mnt/filer/file /tmp
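
The copy in step 2 can also be timed per read, which shows where the stalls fall instead of only the total. This is a rough sketch, not the procedure used for this report: the function name copy_with_read_timing, the 4096-byte chunk size (chosen to match the read() size in the strace output), and the 1-second stall threshold are all assumptions.

```python
import sys
import time

def copy_with_read_timing(src, dst, chunk=4096, stall=1.0):
    """Copy src to dst in fixed-size reads and record slow reads.

    Returns (bytes_copied, elapsed_seconds, slow_reads), where slow_reads
    is a list of (file_offset, read_seconds) for reads taking >= stall.
    """
    total = 0
    slow_reads = []
    start = time.monotonic()
    # buffering=0 so each read() call maps to one read on the file itself.
    with open(src, "rb", buffering=0) as fin, open(dst, "wb") as fout:
        while True:
            t0 = time.monotonic()
            data = fin.read(chunk)
            took = time.monotonic() - t0
            if not data:
                break  # end of file
            if took >= stall:
                slow_reads.append((total, took))
            fout.write(data)
            total += len(data)
    elapsed = time.monotonic() - start
    return total, elapsed, slow_reads

if __name__ == "__main__" and len(sys.argv) == 3:
    copied, secs, slow = copy_with_read_timing(sys.argv[1], sys.argv[2])
    rate = copied / secs if secs > 0 else float("inf")
    print(f"{copied} bytes in {secs:.2f}s ({rate:.0f} bytes/s)")
    for offset, took in slow:
        print(f"read at offset {offset} took {took:.2f}s")
```

Given the source and destination paths as two arguments (for example the /mnt/filer/file and /tmp paths from the steps above), it prints the overall rate followed by one line per slow read. Reads served from the client's page cache will not stall, so a freshly mounted export gives the most representative result.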
  
Actual results:

Observe the throughput being much lower than it should be, and the
above strace results.

Expected results:

The file should copy in a time that is reasonable given the networking
speeds of the two hosts.

Additional info:

Falling back to the e.30smp kernel allows us to get consistent timings
in this simple test.

Comment 1 Mike Cooling 2004-01-29 21:33:51 UTC

I too am seeing atrocious performance with the e.35smp kernel. I too am
running a NetApp at release 6.4.2P6 (but it's an F740). Load average
reaches bursts of up to 140, and Apache response times are bad. I am
reverting to the e.30 kernel tomorrow.

Comment 2 Jason Baron 2004-04-15 15:00:58 UTC

This has long been fixed in the current erratum, e.38.

Comment 3 Mike Cooling 2004-04-15 16:08:48 UTC

I disagree. It is much better but performance is still much worse than
e.30, and I have already opened up support issue #312835 and returned
to e.30 once again.
