391021 – Slow performance writing to NetApp filer

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	8
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Steve Dickson
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-11-19 20:09 UTC by Joshua Baker-LePain
Modified:	2009-02-25 00:45 UTC
CC List:	10 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-01-09 07:27:53 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
tshark capture of NFS traffic (13.79 MB, application/x-bzip) 2007-11-20 17:02 UTC, Joshua Baker-LePain
tshark capture of NFS traffic from CentOS-5 client (7.96 MB, application/x-bzip) 2007-11-20 18:54 UTC, Joshua Baker-LePain

Description Joshua Baker-LePain 2007-11-19 20:09:34 UTC

Description of problem:
We see very slow performance writing from Fedora 7 and 8 clients to our NetApp
FAS3020 filer.  I've tested with dd (from /dev/zero), bonnie++, iozone, tar and
simple cp.  In all cases, write speed is limited to ~5MB/s on a relatively
unloaded filer.  Tests on identical hardware running CentOS 5 achieve ~50MB/s. 
This is not a generic NFS problem, as pointing the Fedora boxes at a Panasas
filer and running similar tests yields write speeds on the order of 40MB/s.

Note that read performance is not similarly affected.

I've tested with 2 different client platforms:

1) HP DL140 G3 (dual quad-core Xeon) with NetXtreme BCM5721 NIC (tg3 driver)
running x86_64 OS
2) Dell PowerEdge 1855 (dual NetBurst Xeon) with Intel 82546GB NIC (e1000
driver) running i386 OS

All testing was done using a standard MTU of 1500 and a wide variety of NFS
mount options, none of which made any difference.  All networking hardware is
from Foundry.

Version-Release number of selected component (if applicable):
Most recent tested is 2.6.23.1-42.fc8.

How reproducible:
Every time.

Steps to Reproduce:
1.  Install Fedora 7 or 8.
2.  Write to NetApp filer.
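
For step 2, a minimal sketch of a dd-based write test like the one mentioned in the description. It assumes GNU dd (for `bs=1M` and `conv=fsync`); the function name `write_test`, the file name `ddtest.bin`, and the mount point `/mnt/netapp` in the usage comment are hypothetical and do not come from the original report.

```shell
# write_test DIR [COUNT_MB]
# Writes COUNT_MB MiB of zeros (default 200) to DIR/ddtest.bin.
# GNU dd prints the elapsed time and throughput on stderr when it finishes.
write_test() {
    dir=$1
    count=${2:-200}
    # conv=fsync flushes the file before dd exits, so the reported rate
    # reflects data sent to the server rather than the client's page cache.
    dd if=/dev/zero of="$dir/ddtest.bin" bs=1M count="$count" conv=fsync
}

# Usage (hypothetical NFS mount point):
#   write_test /mnt/netapp 200 && rm -f /mnt/netapp/ddtest.bin
```

Running the same function against the same export from a Fedora client and a CentOS client should give directly comparable figures, since only the client changes.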
  
Actual results:
Watch it crawl along.

Expected results:
Performance on par with CentOS on the same hardware.

Additional info:

Comment 1 Joshua Baker-LePain 2007-11-20 17:02:33 UTC

Created attachment 265131
tshark capture of NFS traffic

I've attached a tshark capture of a simple 'cp' of a 200MB file of random data
from a f8 client to the NetApp.  The tshark command used was 'tshark -w
/tmp/bz391021 -s 192 -i eth1 host netapp', and the NFS mount options used were
'rsize=32768,wsize=32768,hard,intr'.

Comment 2 Steve Dickson 2007-11-20 18:32:32 UTC

Hmm... it appears you are having a large number
of TCP retransmissions, which would explain the
slowness... Now the question is why?

Has there been a recent network topology change of some
kind? Maybe a misconfigured router, or has a new NIC
been added to the mix? Are both sides using
the same duplex mode?
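
One hedged way to confirm the retransmissions from the client side, independent of the capture, is to compare the kernel's TCP counters before and after a test write. The sketch below assumes a Linux client, where `/proc/net/snmp` exposes a `RetransSegs` counter; the function name `retrans_segs` is illustrative only.

```shell
# retrans_segs [FILE]
# Prints the cumulative count of retransmitted TCP segments.
# FILE defaults to /proc/net/snmp, which holds two "Tcp:" lines:
# a header naming the fields, then a line with their values.
retrans_segs() {
    awk '
        $1 == "Tcp:" && !have_header {
            # Header line: remember the column of each field name.
            for (i = 1; i <= NF; i++) col[$i] = i
            have_header = 1
            next
        }
        $1 == "Tcp:" { print $(col["RetransSegs"]) }
    ' "${1:-/proc/net/snmp}"
}

# Usage: sample the counter around a test write.
#   before=$(retrans_segs)
#   ... run the write test ...
#   after=$(retrans_segs)
#   echo "retransmitted segments: $((after - before))"
```

The counter is system-wide rather than per connection, so the comparison is only meaningful on an otherwise quiet client.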

Comment 3 Joshua Baker-LePain 2007-11-20 18:53:29 UTC

There was a network reorganization several months ago (before I got here) that
involved adding several VLANs and ACLs.  However, the bulk of the clients (which
are still running FC4 -- we're evaluating which distro to move them to) still
perform just fine.  It's only F7 and F8 that I've hit this issue with.  Also, if
it were a network topology problem, wouldn't we expect CentOS-5 to have the same
problem?  I'll attach another tshark capture of the same test from a CentOS-5
client identical (in the same rack even) to the f8 client above.

And, yes, the f8 client reports 1000Mb/s full duplex, as does everywhere between
it and the NetApp.

Comment 4 Joshua Baker-LePain 2007-11-20 18:54:12 UTC

Created attachment 265211
tshark capture of NFS traffic from CentOS-5 client

Comment 5 Steve Dickson 2007-11-20 19:35:15 UTC

> 'tshark -w /tmp/bz391021 -s 192 -i eth1 host netapp'
                                   ^^^^^^^
Why the -i flag? That stops me from seeing any NFS traffic.

> Also, if it were a network topology problem, wouldn't we expect CentOS-5 to 
> have the same problem?
Well CentOS is based off of RHEL and Fedora is much closer to
the upstream kernel (at the time). Since Upstream changes 
quite a bit more than RHEL (or CentOS), regressions
are always a possibility.

Comment 6 Joshua Baker-LePain 2007-11-20 21:42:40 UTC

Erm, doesn't -i just tell tshark which interface to listen on?  I guess it is
redundant, since eth1 is the only active non-loopback interface on these
systems.  But it shouldn't stop you from seeing *any* traffic between the client
and the netapp.

Let me know if you need me to re-run the packet captures (and any other flags I
may need to use).

Comment 7 Chuck Ebbert 2007-11-20 22:00:47 UTC

Can you set up a port mirror and capture the traffic from there for the slow
client? Packets might be getting delayed in the client's network adapter before
delivery.

Also, try turning off TCP window scaling on the client. You will probably need to
unmount and remount the filesystem after doing that.

  sysctl net.ipv4.tcp_window_scaling=0

Comment 8 Steve Dickson 2007-11-20 22:09:27 UTC

Oops... I meant the '-s 192' flag... my bad... 

Looking at the second post (Comment #4) I still can't
see any NFS traffic... Could you please drop the -s 
flag?

tia...

Comment 9 Joshua Baker-LePain 2007-11-20 23:00:54 UTC

Ah, that makes sense.  Sorry about that.  '-s 192' is a tcpdump holdover, where
it defaulted to only grabbing 68 bytes and actually recommended 192 for looking
at NFS traffic.

I've redone the captures, but they're too big to upload to bugzilla.  I've put
them at:

Fedora 8, tcp_window_scaling=1
http://www.duke.edu/~jlb17/bz391021.f8.ws1.bz2

Fedora 8, tcp_window_scaling=0
http://www.duke.edu/~jlb17/bz391021.f8.ws0.bz2

CentOS 5, tcp_window_scaling=1
http://www.duke.edu/~jlb17/bz391021.c5.ws1.bz2

On f8, turning off TCP window scaling made no difference in the write speed.

Comment 10 Steve Dickson 2007-11-21 00:26:19 UTC

Wow, they are huge... Just out of curiosity (without posting them),
could you capture a trace using CentOS to see if the traces
are as large?

Comment 11 Jon Stanley 2007-11-21 00:37:20 UTC

The CentOS capture is up there - third one in the list.  I'm guessing that the
huge increase in size is due to the dropping of the -s flag as requested, and
the fact that there is 200MB of 'random' data being copied there - probably not
too compressible.

Comment 12 Joshua Baker-LePain 2007-11-21 00:51:30 UTC

As pointed out, the CentOS trace is posted and it is large as well (although it
is a fair bit smaller than the Fedora traces).  I tried reducing the size of the
test file even further to get the trace size down, but going much smaller really
starts to introduce too much measurement error, IMO.

If the big size of the compressed trace files is an issue, let me know and I can
try sourcing the test files from /dev/zero rather than /dev/urandom -- those
might compress a bit better.
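
As a rough illustration of why the data source matters for the size of the compressed traces (the captures carry the file payload once `-s` is dropped), the sketch below compresses 1 MiB from each source and prints the resulting sizes. It uses gzip as a stand-in for bzip2, purely as an assumption about what is installed; `compressed_size` is a hypothetical helper name.

```shell
# compressed_size SOURCE BYTES
# Reads BYTES bytes from SOURCE and prints their size after gzip compression.
compressed_size() {
    head -c "$2" "$1" | gzip -c | wc -c | tr -d ' '
}

zero_size=$(compressed_size /dev/zero 1048576)
rand_size=$(compressed_size /dev/urandom 1048576)
echo "1 MiB of zeros compresses to $zero_size bytes"
echo "1 MiB of random data compresses to $rand_size bytes"
```

Zeros should compress far better than random data, which does not shrink meaningfully, so a trace of a /dev/zero-sourced file would likely be much smaller to upload.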

Comment 13 Christopher Brown 2008-01-23 23:40:56 UTC

Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug and will try and assist you in resolving it if I can.

There hasn't been much activity on this bug for a while. Could you tell me if
you are still having problems with the latest kernel?

If the problem no longer exists then please close this bug or I'll do so in a
few days if there is no additional information lodged.

Comment 14 Joshua Baker-LePain 2008-01-25 17:26:53 UTC

The lack of activity, you may note, was due to the Fedora folks losing interest,
not me.  ;)  In any case, I haven't been able to track down any spare boxes to
install the most recent kernel on (all my test boxes got put into production),
but I will continue to try to do so and update again sometime next week.

Thanks.

Comment 15 Christopher Brown 2008-01-28 10:30:34 UTC

Okay Joshua, thanks for updating this anyway. We're just in the process of
prodding bugs that haven't seen much change - it may generate some renewed
interest. The 2.6.24 kernel has been released and updates-testing should have it
soon so it might be an idea to test with this when it arrives...?

Cheers
Chris

Comment 16 Joshua Baker-LePain 2008-02-04 23:12:53 UTC

I just tested with the most recent released kernel for F8 (2.6.23.14-107.fc8),
and this problem still exists.  It's a *bit* better (I get 10MB/s rather than
5), but still much slower than CentOS-5.  If/when a 2.6.24 kernel comes down the
pike, I'll try to give that a shot as well.

Comment 17 Jeff Burke 2008-07-15 13:44:22 UTC

Joshua,
   For our MRG product we create a "realtime" kernel variant,
2.6.24-7.72.el5rt; for debug purposes we also build a "vanilla" kernel variant,
2.6.24-7.72.el5rtvanilla. These kernels can be installed on top of Red Hat
Enterprise Server 5.2.

   If you would like to try the 2.6.24-7.72rtvanilla for debug purposes, let me
know and I can put it up on my people page.

Jeff

Comment 18 Peter Robinson 2008-08-11 17:00:37 UTC

Out of interest, what version of Data ONTAP are you running on the FAS3020? We have a lot of these devices and have found that in some circumstances NetApp bug 226424 causes some interesting NFS performance issues.
http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=226424
(you need a NetApp login to access this BTW).

Comment 19 Joshua Baker-LePain 2008-08-11 17:15:02 UTC

At the time, we were running 7.0.3 (yeah, a bit old) on the FAS3020.  Since then and for unrelated reasons, we've upgraded to a FAS3070 which is running 7.2.4.  Looks like it's time to test again, using F9 this time.

There's not much detail in that NetApp bug, but it *still* seems odd to me that it would affect Fedora so much differently than RHEL/CentOS.

Comment 20 Bug Zapper 2008-11-26 08:35:03 UTC

This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 8 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 21 Bug Zapper 2009-01-09 07:27:53 UTC

Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 22 Joshua Baker-LePain 2009-02-25 00:45:06 UTC

Just to close the loop, I finally got a chance to test with Fedora 10, and this appears to have resolved itself.  *shrug*
