Bug 114021

Summary: NFS read stalls in e.34 and e.35 kernels
Product: Red Hat Enterprise Linux 2.1
Reporter: Need Real Name <aander07>
Component: kernel
Assignee: Arjan van de Ven <arjanv>
Status: CLOSED ERRATA
Severity: high
Priority: medium
Version: 2.1
CC: cooling, kambiz, riel
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2004-04-15 15:00:58 UTC

Description Need Real Name 2004-01-21 15:03:33 UTC
Description of problem:

Copying a file from an NFS mount to local disk exhibits stalls as long
as 5 seconds:

[relevant output from strace -tt cp /path/to/nfs/file /tmp]
11:44:01.195024 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
11:44:06.789237 fstat64(4, {st_mode=S_IFREG|0644, st_size=1347584, ...}) = 0

These stalls ultimately cause the copy of a 1.44MB floppy disk image
to take about a minute and a half to complete.  The strace -r output
did not show any unusual timings for the system calls, but the -tt
wall-clock output did show these long delays between read() and the
subsequent fstat64()/_llseek()/fcntl64()/write() calls.
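
The wall-clock gaps described above can be picked out of a saved
strace -tt log mechanically. The following is only a rough sketch in
Python, not something used in the original testing: the helper name
find_stalls and the 1-second threshold are assumptions, the read()
buffer in the sample is abbreviated, and timestamps that wrap past
midnight are not handled.

```python
import re
from datetime import datetime

# Leading wall-clock stamp and syscall name, as printed by `strace -tt`.
STAMP = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6})\s+(\w+)\(")

def find_stalls(lines, threshold=1.0):
    """Return (gap_seconds, previous_call, next_call) for gaps >= threshold.

    Hypothetical helper; the threshold is an arbitrary choice.
    """
    stalls = []
    prev_time = prev_call = None
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue  # wrapped continuation line or unrelated output
        now = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        call = m.group(2)
        if prev_time is not None:
            gap = (now - prev_time).total_seconds()
            if gap >= threshold:
                stalls.append((gap, prev_call, call))
        prev_time, prev_call = now, call
    return stalls

# The two lines from the report (read() buffer shortened here).
sample = [
    '11:44:01.195024 read(3, "\\0\\0\\0"..., 4096) = 4096',
    '11:44:06.789237 fstat64(4, {st_mode=S_IFREG|0644, st_size=1347584, ...}) = 0',
]
for gap, before, after in find_stalls(sample):
    print(f"{gap:.6f}s between {before}() and {after}()")
```

On the two lines quoted in this report, this yields a gap of roughly
5.59 seconds between read() and fstat64().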

Version-Release number of selected component (if applicable):

kernel-2.4.9-e.34smp
kernel-2.4.9-e.35smp

How reproducible:

In our testing, this was 100% reproducible with both kernels over UDP
mounts, using the default [rw]size as well as [rw]size set to 4096,
8192, and 32768.  This was against a NetApp 960 running OnTap 6.4.2P6.
During these tests, the filer was showing ~10% CPU utilization, and the
interface we were testing against was not heavily used.  We did not
test TCP transport.

Steps to Reproduce:
1. mount filer:/volX/export /mnt/filer
2. time cp /mnt/filer/file /tmp
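
For comparing kernels, it may help to turn the timing in step 2 into a
throughput figure. A minimal sketch in Python follows; the function
names are hypothetical, and the 1474560-byte size assumes the standard
1440 KiB image for a 1.44MB floppy, which the report does not state
explicitly. The 90 seconds is the "about a minute and a half" from the
description. Note that repeated runs may be served from the client
cache unless the export is remounted between runs.

```python
import os
import shutil
import time

def timed_copy(src, dst):
    """Copy src to dst and return (bytes_copied, elapsed_seconds)."""
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    return os.path.getsize(dst), elapsed

def throughput_kib_per_s(size_bytes, seconds):
    """Average throughput in KiB/s for a transfer of the given size."""
    return size_bytes / 1024 / seconds

# Assumed figures: 1474560-byte floppy image, roughly 90 seconds to copy.
print(throughput_kib_per_s(1474560, 90))  # 16.0
```

Under those assumptions the stalled copy averages about 16 KiB/s. To
measure a real run, something like timed_copy("/mnt/filer/file",
"/tmp/file") could be used, with the paths taken from the steps above.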
  
Actual results:

Observe the throughput being much lower than it should be, and the
above strace results.

Expected results:

The file should copy in a time that is reasonable given the networking
speeds of the two hosts.

Additional info:

Falling back to the e.30smp kernel allows us to get consistent timings
in this simple test.

Comment 1 Mike Cooling 2004-01-29 21:33:51 UTC
I too am seeing atrocious performance with the e.35smp kernel. I too am
running a NetApp at release 6.4.2P6 (but it's an F740). Load average
reaches bursts of up to 140, and Apache response times are bad. I am
reverting back to the e.30 kernel tomorrow.

Comment 2 Jason Baron 2004-04-15 15:00:58 UTC
This has long been fixed in the current erratum, e.38.

Comment 3 Mike Cooling 2004-04-15 16:08:48 UTC
I disagree. It is much better but performance is still much worse than
e.30, and I have already opened up support issue #312835 and returned
to e.30 once again.