Bug 114021

Summary:	NFS read stalls in e.34 and e.35 kernels
Product:	Red Hat Enterprise Linux 2.1	Reporter:	Need Real Name <aander07>
Component:	kernel	Assignee:	Arjan van de Ven <arjanv>
Status:	CLOSED ERRATA	QA Contact:
Severity:	high	Docs Contact:
Priority:	medium
Version:	2.1	CC:	cooling, kambiz, riel
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-04-15 15:00:58 UTC	Type:	---

Description Need Real Name 2004-01-21 15:03:33 UTC

Description of problem:

Copying a file from an NFS mount to local disk exhibits stalls as long
as 5 seconds:

[relevant output from strace -tt cp /path/to/nfs/file /tmp]
11:44:01.195024 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
11:44:06.789237 fstat64(4, {st_mode=S_IFREG|0644, st_size=1347584, ...}) = 0

These stalls ultimately cause the copy of a 1.44MB floppy disk image
to take about a minute and a half to complete.  The strace -r output
did not show any unusual timings for the system calls, but the -tt
wall-clock output did show these long delays between read() and the
subsequent fstat64()/_llseek()/fcntl64()/write() calls.
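
The gaps described above can be picked out of a saved strace -tt log
mechanically. The following is a minimal sketch, not part of the
original report: the function name find_gaps and the 1-second
threshold are illustrative assumptions, and it assumes every system
call line starts with an HH:MM:SS.microseconds timestamp.

```python
import re
from datetime import datetime

# Leading "HH:MM:SS.microseconds" timestamp as printed by strace -tt.
TS = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6})\s+(.*)$")


def find_gaps(lines, threshold=1.0):
    """Return (gap_seconds, previous_call, next_call) for each pair of
    consecutive timestamped lines at least `threshold` seconds apart.

    `find_gaps` and `threshold` are hypothetical names for this sketch.
    """
    gaps = []
    prev_time = prev_text = None
    for line in lines:
        m = TS.match(line)
        if not m:
            continue  # wrapped continuation line without a timestamp
        t = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        if prev_time is not None:
            delta = (t - prev_time).total_seconds()
            if delta < 0:
                delta += 86400  # the trace crossed midnight
            if delta >= threshold:
                gaps.append((delta, prev_text, m.group(2)))
        prev_time, prev_text = t, m.group(2)
    return gaps


# The two calls quoted in this report (read buffer shortened here).
sample = [
    '11:44:01.195024 read(3, "\\0\\0\\0"..., 4096) = 4096',
    "11:44:06.789237 fstat64(4, {st_mode=S_IFREG|0644, "
    "st_size=1347584, ...}) = 0",
]

for gap, before, after in find_gaps(sample):
    print(f"{gap:.3f}s between {before.split('(')[0]} "
          f"and {after.split('(')[0]}")
```

Because -tt stamps each call on entry, a gap between two lines covers
the duration of the first call plus any time spent in user space before
the second one.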

Version-Release number of selected component (if applicable):

kernel-2.4.9-e.34smp
kernel-2.4.9-e.35smp

How reproducible:

In our testing, this was 100% reproducible with both kernels, using
the default UDP mount [rw]size as well as [rw]size set to 4096, 8192,
and 32768.  The server was a NetApp 960 running OnTap 6.4.2P6.  During
these tests, the filer showed ~10% CPU utilization, and the interface
we were testing against was not heavily used.  We did not test TCP
transport.

Steps to Reproduce:
1. mount filer:/volX/export /mnt/filer
2. time cp /mnt/filer/file /tmp
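
For a finer-grained view than "time cp" gives, the reads can be timed
one by one. This is a rough sketch under assumptions, not something
that was run for this report: time_reads, chunk_size and
stall_threshold are hypothetical names, and it assumes the stall shows
up as a slow read() on the NFS-mounted file.

```python
import time


def time_reads(path, chunk_size=4096, stall_threshold=1.0):
    """Read `path` in chunks and time each read.

    Returns (total_bytes, total_seconds, stalls), where stalls is a list
    of (offset, seconds) for reads that took at least `stall_threshold`
    seconds. All names here are illustrative, not from the report.
    """
    stalls = []
    total = 0
    start = time.monotonic()
    # buffering=0 so each f.read() maps to one read() on the file.
    with open(path, "rb", buffering=0) as f:
        while True:
            t0 = time.monotonic()
            data = f.read(chunk_size)
            elapsed = time.monotonic() - t0
            if not data:
                break
            if elapsed >= stall_threshold:
                stalls.append((total, elapsed))
            total += len(data)
    return total, time.monotonic() - start, stalls
```

It would be called as time_reads("/mnt/filer/file") against the mount
from step 1. A second run on the same client may be served from the
page cache, so remounting between runs gives more comparable numbers.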
  
Actual results:

Throughput is much lower than it should be, and strace shows the
stalls described above.

Expected results:

The file should copy in a time that is reasonable given the networking
speeds of the two hosts.

Additional info:

Falling back to the e.30smp kernel allows us to get consistent timings
in this simple test.

Comment 1 Mike Cooling 2004-01-29 21:33:51 UTC

I too am seeing atrocious performance with the e.35smp kernel. I am
also running a NetApp at release 6.4.2P6 (but it's an F740). Load
average reaches bursts of up to 140, and Apache response times are
bad. I am reverting to the e.30 kernel tomorrow.

Comment 2 Jason Baron 2004-04-15 15:00:58 UTC

This has long been fixed in the current erratum, e.38.

Comment 3 Mike Cooling 2004-04-15 16:08:48 UTC

I disagree. It is much better, but performance is still much worse
than with e.30. I have already opened support issue #312835 and
returned to e.30 once again.