Bug 213377 - Slow network performance with e1000 driver and SMB traffic
Status: CLOSED INSUFFICIENT_DATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 8
Hardware: i686 Linux
Priority: medium
Severity: high
Assigned To: Jeff Layton
QA Contact: Brian Brock
Reported: 2006-11-01 02:36 EST by Josh P.
Modified: 2014-06-18 03:35 EDT
Doc Type: Bug Fix
Last Closed: 2008-11-25 07:23:55 EST
Attachments
screenshot of system monitor when issue is taking place (69.58 KB, image/png)
2006-11-01 02:36 EST, Josh P.
lspci -vvv output (11.49 KB, application/octet-stream)
2008-02-12 22:41 EST, Josh P.
modinfo e1000 output (4.81 KB, application/octet-stream)
2008-02-12 22:42 EST, Josh P.
sysctl -a output (22.27 KB, application/octet-stream)
2008-02-12 22:42 EST, Josh P.

Description Josh P. 2006-11-01 02:36:17 EST
Description of problem:
Slow network performance with e1000 driver and SMB traffic

Version-Release number of selected component (if applicable): Core 6.0

How reproducible: 100% reproducible

Load Core 6, use e1000 driver with Intel Corporation 82540EM Gigabit Ethernet
Controller.
 
Actual results: VERY slow performance.  ~2.5MB/s?  Check screenshot for graph.

Additional info: System is a Dell 400SC with an on board e1000 adapter.  I also
tried to load the Intel e1000 driver from Intel's website (7.2.9) and had the
same result.  Is it some I/O scheduling issue?  The target host is an
OpenSolaris 10 box with a standard samba setup.  Other OSs, e.g. OS X,
performed adequately against the same server.  Data is multiple 800MB .wmv files.

lspci:

[root@wkst1 tmp]# lspci
00:00.0 Host bridge: Intel Corporation 82875P/E7210 Memory Controller Hub (rev 02)
00:01.0 PCI bridge: Intel Corporation 82875P Processor to AGP Controller (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev 02)
00:1d.3 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #4 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2)
00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
00:1f.5 Multimedia audio controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) AC'97 Audio Controller (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation NV43 [GeForce 6600/GeForce 6600 GT] (rev a2)
02:01.0 SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (rev 01)
02:01.1 SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (rev 01)
02:0c.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)

iostat shows nothing useful.  Attached .png shows the odd I/O behaviour.
Comment 1 Josh P. 2006-11-01 02:36:17 EST
Created attachment 139946 [details]
screenshot of system monitor when issue is taking place
Comment 2 Sammy 2007-01-04 19:53:11 EST
I can confirm frequent hangs with the same adapter on FC6 + updates.
The funny thing is that the system was working fine until I
switched from a static IP to DHCP. I will try to get another
static IP and try it again tomorrow.
Comment 3 Sammy 2007-01-05 09:41:26 EST
It is working great with a static IP.
FYI
Comment 4 Jesse Brandeburg 2007-11-11 01:55:03 EST
Switching from DHCP to a static address turns off the timestamping option in the
packets.  I wonder if we can confirm this is related to the problem somehow.
Maybe do the AF_PACKET thing and turn on the timestamping option on your socket
while using a static address and see if the problem comes back?

see this thread for reference:
http://marc.info/?l=linux-netdev&m=118888412614400&w=2

Comment 5 Josh P. 2007-11-15 00:10:54 EST
Confirmed this is still an issue with FC8.

Setting to a static IP doesn't help.

Disabling TCP timestamping doesn't make a difference.

(sysctl -w net.ipv4.tcp_timestamps=0)
Comment 6 Jon Stanley 2007-12-31 16:39:53 EST
Changing AssignedTo and version to reflect that this is an F8 bug
Comment 7 John W. Linville 2008-01-07 15:01:21 EST
Could you try a rawhide kernel?  I believe the patch in the thread from 
comment 4 should already be available in those kernels...
Comment 8 Christopher Brown 2008-02-03 15:36:22 EST
Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

As the original reporter has not added any information in a few months, this bug
has been marked for closure. Please could a rawhide kernel be tested as per
comment #7. Otherwise this bug will be closed as we are unable to troubleshoot
without further testing.

Cheers
Chris
Comment 9 Josh P. 2008-02-05 18:16:28 EST
I am the original reporter and I will test this fix within the next week so
please leave it open for now.

Sorry for the bug spam.
Comment 10 Josh P. 2008-02-12 22:41:59 EST
Created attachment 294721 [details]
lspci -vvv output
Comment 11 Josh P. 2008-02-12 22:42:32 EST
Created attachment 294722 [details]
modinfo e1000 output
Comment 12 Josh P. 2008-02-12 22:42:49 EST
Created attachment 294723 [details]
sysctl -a output
Comment 13 Josh P. 2008-02-12 22:45:27 EST
Latest rawhide exhibits the same behavior.  I am able to get only about 3MB/s on
this gigabit link.  Had to boot with enforcing=0 to get X to come up; unrelated,
but just a bit of info.

[root@sc400 ~]# uname -a
Linux sc400 2.6.24.1-26.fc9 #1 SMP Fri Feb 8 19:56:42 EST 2008 i686 i686 i386
GNU/Linux
Comment 14 Trevor Cordes 2008-02-16 11:30:29 EST
For years I've gotten horrible upload performance to samba (box A) from XP (box
B) on my main fileserver box (F8 now).  I also have another box C running F8 that
I can test with using samba and nfs.  The samba hardware is reasonably fast (P4
2.4/533, 3GB RAM dc, Intel server-class NIC).  The XP box is a P4 3.0 with an
Intel high-end desktop adapter.  Box C is a PD 3.0 with the same NIC.  Jumbo
frames are enabled on all NICs.  The switch is a Linksys SRW2016 gigabit webview
with jumbo enabled.

Here are some tests copying a music folder (7 files, 558MB total); results are
in seconds, so lower is better:

secs
16 B->C samba -> saved to scsi (no raid)
51 B->A samba -> saved to raid6
36 A->C nfs   -> saved to scsi (no raid)
30 C->A samba -> saved to shm test share
25 C->A samba -> saved to raid6
40 C->A nfs   -> saved to raid6

Notice the vast difference in speeds copying the same files (from 10MB/s to
35MB/s)!  Why can the XP box move data so fast to one samba box (16) and so slow
to the other (51)?  Sure, C is a faster box than A, but not by that much, and
during my tests the box was not loaded.
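
For reference, the timings in the table above convert to throughput as follows. This is only a small Python sketch, assuming every run moved the same 558MB folder and treating "MB" as a plain unit, since the comment does not say whether MB or MiB is meant:

```python
# Convert the copy times from the table above into average throughput.
# Assumption: every run copied the same 558 MB folder.
TOTAL_MB = 558

# (seconds, description) as listed in the table in this comment
runs = [
    (16, "B->C samba -> scsi (no raid)"),
    (51, "B->A samba -> raid6"),
    (36, "A->C nfs   -> scsi (no raid)"),
    (30, "C->A samba -> shm test share"),
    (25, "C->A samba -> raid6"),
    (40, "C->A nfs   -> raid6"),
]


def throughput_mb_s(total_mb, seconds):
    """Average throughput in MB/s for total_mb transferred in seconds."""
    return total_mb / seconds


for seconds, label in runs:
    print(f"{seconds:3d} s  {throughput_mb_s(TOTAL_MB, seconds):5.1f} MB/s  {label}")
```

The fastest run (16 s) works out to roughly 35MB/s and the slowest (51 s) to roughly 11MB/s, which matches the range quoted above.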

I'll watch this bug with great interest.
Comment 15 John W. Linville 2008-04-29 15:59:52 EDT
Well, it has been a while...is this issue still relevant?
Comment 16 Josh P. 2008-04-30 03:16:56 EDT
It is still an issue as of 60 days ago when I last blew away everything on that
hardware and retested with a fresh FC9 and a rawhide kernel.  I have not heard
about anything reported subsequently that might fix it.

I'm willing to run several, even MANY tests on this hardware (I can redeploy
its tasks to another machine) in order to help track this down.  I am happy to
use this machine to test any fixes for this issue, but it seems that we only
have people coming in every few months just to ask if this bug is still here.
No one has indicated that they want this issue solved; just that they want the
issue gone.

If I'm not providing adequate data to solve the issue, please inform me of that
fact and I can gather whatever information is necessary.
Comment 17 John W. Linville 2008-04-30 11:15:00 EDT
Can you replicate this behavior with some other adapter (e.g. using the tg3 
driver)?  It is not at all clear to me how SMB traffic would trigger a 
slowdown on a specific adapter or driver.
Comment 18 Jesse Brandeburg 2008-04-30 14:28:27 EDT
I'll try not to ignore this bug any more.  It doesn't easily reproduce here, so
we're left asking you for lots of data, sorry:

Can you tell us what TCP or UDP ports on the samba machine end up being used
(netstat -anlp during slowness)?  I'm also interested in a 10000 packet tcpdump
while it is being slow (tcpdump -i ethX -c 10000 -w e1000smb.tcpdump).  Windows
will transparently try port 445 TCP (cifs) and then fall back to port 139 (tcp
or udp smbfs).

if you're willing to repro again, you can try turning off, with ethtool:
1) tso (ethtool -K ethX tso off)
2) tx checksum offload (ethtool -K ethX tx off)
3) rx checksum offload (ethtool -K ethX rx off)

and see if that changes anything.  Have you checked ethtool -S ethX output on
the samba side (during slowness) to see if it reports anything?  That output
would be good to attach.

We actually found a long standing bug in e1000 recently where UDP checksum
result 0000 was being treated incorrectly, but that wouldn't apply here unless
you were actually ending up using UDP.

So, bolapara, your test is to transfer 800MB of data *from* linux F8 samba to
solaris?  Is the result repeatable when talking to another linux samba or
windows system?  Are you sure you're not being disk limited due to slow disk
access or io-scheduler problems?

I would ask if you can repro the issue using netperf or nuttcp or even
dbench/tbench (from the samba test suite).  The goal here is to get the disk
reads out of the picture.
Comment 19 Josh P. 2008-05-01 23:35:09 EDT
Note:  In my previous comment (#16) I meant FC8 not FC9.

After my testing today, I'm starting to believe that the issue is basically
this:  Gnome's performance on Fedora with regards to network operations is poor
and System Monitor's default update interval in Fedora is too fine.  Read below
for details.  The problem with this conclusion is that while I was able to
replicate the harmonic, and the performance in some of the scenarios was poor,
these tests are going a lot better than the last couple times I replicated it. 
There are some variables to consider here which I documented below but there are
some differences in the testing this time around that I cannot change at this
time.  There is also potential that something was fixed in patches between the
time I last tested this and now that helped the performance a bit.

I reloaded the box with FC8 and all updates.  While the updates were loading I
noticed the system still exhibited the same odd harmonic as the screen shot of
System Monitor that I included with the original defect.  After the patch
download/install, it updated fairly quickly and I rebooted.

The test is this:  I am copying a 1GB file FROM my Fedora machine to my file server.

I created a 1GB file from /dev/zero.  I then cat'd that file to /dev/null to get
it in the system cache (System has 2GB of RAM).  I then performed my first test
using smbclient.  I had used only the Gnome interface for the original issue
submission and subsequent reproductions, so this is a variable.

Performance for the smbclient test was decent, but the harmonic was still
clearly visible.  With the System Monitor update interval at the default 0.5s, I
would see each spike at about ~60MB/s and then the next update would be 0b/s. 
Final throughput was 12260KB/s.

Second test was doing an scp.  Those of you commenting that the protocol
shouldn't matter were correct; I saw the same harmonic and the final throughput
was 12.2MB/s.

Using Gnome, I again saw the same harmonic.  Gnome doesn't show throughput but
with a stopwatch I saw that it took 161s to copy the 1GB file; a measly
6.36MB/s.  Peaks were much lower at about 28MB/s, but they seemed to be spaced
the same as in the scp and smbclient tests.

Notes:

Dell PowerEdge 400SC specifications:  3.2GHz P4, 2GB RAM, U160 SCSI 10k RPM hard
drive.

Disk performance of Fedora machine:  This system has a 10k RPM SCSI U160 drive
attached to an Adaptec 39160.  After I cat'd the test file to /dev/null to get
it into cache, I ran 'dd' three times to time how long it took to send it to
/dev/null.  192MB/s, 201MB/s, and 189MB/s were the test results.  I don't think
disk performance is the bottleneck here.
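
The cached-read check described above can be approximated without dd. The following Python sketch is a stand-in, not the commands actually used for this report: the helper names `write_zero_file` and `timed_read` are hypothetical, and a 64 MiB file is assumed in place of the 1GB test file so the example runs quickly.

```python
import os
import tempfile
import time


def write_zero_file(path, size_bytes, block_size=1 << 20):
    """Create a file of size_bytes filled with zeros (like reading /dev/zero)."""
    block = bytes(block_size)
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = block[:min(block_size, remaining)]
            f.write(chunk)
            remaining -= len(chunk)


def timed_read(path, block_size=1 << 20):
    """Read the whole file and discard the data; return (bytes_read, seconds)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            data = f.read(block_size)
            if not data:
                break
            total += len(data)
    return total, time.perf_counter() - start


if __name__ == "__main__":
    size = 64 * 1024 * 1024  # assumption: small stand-in for the 1GB file
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "testfile")
        write_zero_file(path, size)
        timed_read(path)  # first pass pulls the file into the page cache
        for _ in range(3):
            nbytes, secs = timed_read(path)
            print(f"{nbytes / secs / 1e6:.0f} MB/s")
```

If the three timed passes report rates far above what the network delivers, the local read path is unlikely to be the limiting factor, which is the same conclusion drawn from the dd numbers above.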

Additional tests:  I used my Ubuntu 8.04 laptop (Dell Latitude D600: 1.6GHz
Pentium M, 2GB of RAM, tg3 GbE).  Lowering the System Monitor update interval to
0.5s (default was 1s) demonstrated the same harmonic!  This machine did the
smbclient test at 10135KB/s.  It's a significantly less powerful machine than the
400SC although it does have GbE.  scp throughput was 12.2MB/s.  Gnome's
performance was the best at 12.5MB/s.  Since this is 8.04 please remember that
we are using gvfs here.  Is this the proof that the performance problem was
originally just Gnome?

Also tested scp from Leopard (iMac G5 2.0GHz) and it performed nearly as poorly
as Gnome on Fedora:  9.2MB/s

Another variable from the original tests:  My file server is now running Ubuntu
Server and not Solaris.  My apologies for this variable but my file server is a
'production' machine.  If we need further retests with Solaris as the fileserver
OS, I can see if I can find a solution to replicate the original scenario more
closely.

Last note:  With the exception of the Fedora Gnome test where I used a stopwatch
to figure out the throughput, I generally relied on the tool/application to give
me my throughput numbers.  I don't know if they all report MB/s or some report
MiB/s or what.  Just an FYI.
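
To show how much the MB/s versus MiB/s question matters, here is a short Python sketch using the stopwatch figure from the Gnome test. It assumes the "1GB" file is exactly 2**30 bytes, which the comment does not state:

```python
# Assumption: the "1GB" test file is 1 GiB (2**30 bytes); the exact size
# is not given in the comment.
FILE_BYTES = 2**30
ELAPSED_S = 161  # stopwatch time for the Gnome copy

bytes_per_s = FILE_BYTES / ELAPSED_S

# Same transfer, expressed in decimal and binary units.
print(f"{bytes_per_s / 1e6:.2f} MB/s  (1 MB  = 10**6 bytes)")
print(f"{bytes_per_s / 2**20:.2f} MiB/s (1 MiB = 2**20 bytes)")
```

Under that assumption the 6.36 figure above corresponds to MiB/s, and the same copy is about 6.67 MB/s in decimal units, a difference of roughly 5%. That is too small to explain the gaps between the tools.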
Comment 20 Jesse Brandeburg 2008-05-02 12:40:44 EDT
If you're having a problem with the "harmonic", it is because the stats are
updated only every two seconds in the driver.  Not really a bug, just the way
things always have been.
We actually have patched e1000 for this:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ef90e4eca9fcade05dd03f853df75cf459a75422;hp=419886927796dfeca87c1fd11d1fe2ed442103cc
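
The effect of a two-second stats update on a monitor that samples every 0.5s can be illustrated with a small simulation. This Python sketch is a simplified model, not driver code: it assumes a perfectly steady transfer, a byte counter that only advances at each two-second update, and a monitor that divides the counter delta by its own sample interval. The function name `sampled_rates` is made up for the example.

```python
def sampled_rates(true_rate, stats_period, sample_interval, duration):
    """Rates a monitor would display when it reads a counter that only
    advances every stats_period seconds (simplified model)."""
    rates = []
    previous = 0.0
    steps = int(round(duration / sample_interval))
    for i in range(1, steps + 1):
        t = i * sample_interval
        # Counter value as of the most recent stats update at or before t.
        last_update = (t // stats_period) * stats_period
        counter = true_rate * last_update
        rates.append((counter - previous) / sample_interval)
        previous = counter
    return rates


# 12.2 MB/s steady transfer, stats every 2 s, monitor sampling every 0.5 s
for rate in sampled_rates(12.2, 2.0, 0.5, 4.0):
    print(f"{rate:6.1f} MB/s")
```

In this model the monitor shows three samples of zero followed by one spike at four times the true rate, while the average over a full period still equals the true rate. That is roughly the spike-then-zero pattern reported in comment 19.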


As for the performance problems, I'm glad you were able to eliminate the disk.
The G5 probably isn't a good comparison; it's slow, especially with scp.

were you able to try any of the offload disable options I suggested?

smbclient is not a very fast program, I would expect the mount -t cifs
//file-server/share /localmnt to be the fastest.

just as a reference, I have two linux boxes here connected over a netgear
gigabit switch, they are all PCIe + e1000e however.

can you please yum install nuttcp and try it, just so we can compare with a
network benchmark?

Are you satisfied with the current levels of performance?  It almost seems like
that from your post.
Comment 21 Jesse Brandeburg 2008-05-02 13:30:42 EDT
forgot to post my nuttcp results:
on ich9:
# nuttcp -S

[root@ich9b ~]# nuttcp ich9
 1119.2995 MB /  10.03 sec =  936.2377 Mbps 6 %TX 17 %RX
Comment 22 Josh P. 2008-05-11 23:16:10 EDT
Below are my results using iperf.  All tests were run individually.

192.168.1.213 is a Dell Latitude D600 running Ubuntu 8.04.
192.168.1.250 is the Dell PowerEdge 400SC running Fedora 8.

josh@fs:~$ iperf -s -p 9999
------------------------------------------------------------
Server listening on TCP port 9999
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.168.1.105 port 9999 connected with 192.168.1.213 port 54297
[  4]  0.0-10.0 sec    676 MBytes    565 Mbits/sec
[  5] local 192.168.1.105 port 9999 connected with 192.168.1.213 port 54298
[  5]  0.0-10.0 sec    675 MBytes    565 Mbits/sec
[  4] local 192.168.1.105 port 9999 connected with 192.168.1.250 port 52264
[  4]  0.0-10.0 sec    976 MBytes    816 Mbits/sec
[  5] local 192.168.1.105 port 9999 connected with 192.168.1.250 port 52265
[  5]  0.0-10.0 sec    972 MBytes    812 Mbits/sec
[  4] local 192.168.1.105 port 9999 connected with 192.168.1.250 port 52266
[  4]  0.0-10.0 sec    964 MBytes    806 Mbits/sec

I don't know what to say now.  Clearly these results point away from the network
being the issue here...

I'm still not happy with the SMB/CIFS performance of this box but now I'm not
sure who is to blame here for its poor performance.
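
To put the iperf numbers next to the file-copy numbers from comment 19, the following Python sketch converts the results above from Mbits/sec to MB/s. It assumes decimal units (8 bits per byte, no protocol overhead accounted for) and uses the 12.2MB/s scp figure from comment 19 as the comparison point:

```python
def mbits_to_mbytes_per_s(mbits_per_s):
    """Convert megabits per second to megabytes per second (decimal units)."""
    return mbits_per_s / 8.0


# iperf results copied from the output above, in Mbits/sec
iperf_mbits = {
    "D600  (192.168.1.213)": [565, 565],
    "400SC (192.168.1.250)": [816, 812, 806],
}

# scp throughput reported for both machines in comment 19
FILE_COPY_MB_S = 12.2


def summarize(results_mbits, copy_mb_s):
    """Return (raw network MB/s, ratio of raw network rate to file-copy rate)."""
    avg_mbits = sum(results_mbits) / len(results_mbits)
    net_mb_s = mbits_to_mbytes_per_s(avg_mbits)
    return net_mb_s, net_mb_s / copy_mb_s


for host, results in iperf_mbits.items():
    net_mb_s, ratio = summarize(results, FILE_COPY_MB_S)
    print(f"{host}: {net_mb_s:6.1f} MB/s raw TCP, {ratio:.1f}x the scp rate")
```

By this estimate the 400SC moves about 100MB/s of raw TCP traffic, several times what scp or smbclient achieved on the same link, which is consistent with the conclusion that the NIC and driver are not the limiting factor.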
Comment 23 Jeff Layton 2008-07-03 09:50:21 EDT
It's not clear to me what you're actually using to bench out SMB performance
here. It looks like you're just using userspace programs to do it (e.g.
smbclient or the gnome VFS smb stuff).

Have you done any testing with the in-kernel CIFS client? For instance:

mount -t cifs //server/share /mnt/cifs

...and then do your testing from there. That might give better results, and
would give us an indication of whether the problem is something confined to
userspace or in-kernel.

Comment 24 Andy Gospodarek 2008-10-07 10:08:15 EDT
Sounds like this is more of a fs issue, so I'm going to watch it, but assign it back to Jeff.
Comment 25 Jeff Layton 2008-10-08 08:26:05 EDT
Setting to NEEDINFO pending response to question in comment #23.
Comment 26 Jeff Layton 2008-11-25 07:23:55 EST
No response in several months to request to test using CIFS. Closing case. Please reopen if this is still an issue for you.
