|Summary:||Slow network performance with e1000 driver and SMB traffic|
|Product:||[Fedora] Fedora||Reporter:||Josh P. <bolapara>|
|Component:||kernel||Assignee:||Jeff Layton <jlayton>|
|Status:||CLOSED INSUFFICIENT_DATA||QA Contact:||Brian Brock <bbrock>|
|Version:||8||CC:||agospoda, cebbert, chris.brown, davej, jesse.brandeburg, jonstanley, steved, trevor, umar, wtogami|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2008-11-25 12:23:55 UTC||Type:||---|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Cloudforms Team:||---||Target Upstream Version:|
Description Josh P. 2006-11-01 07:36:17 UTC
Description of problem:
Slow network performance with e1000 driver and SMB traffic

Version-Release number of selected component (if applicable):
Core 6.0

How reproducible:
100% reproducible. Load Core 6 and use the e1000 driver with an Intel Corporation 82540EM Gigabit Ethernet Controller.

Actual results:
VERY slow performance, ~2.5MB/s. Check screenshot for graph.

Additional info:
System is a Dell 400SC with an onboard e1000 adapter. I also tried loading the Intel e1000 driver from Intel's website (7.2.9) and had the same result. Is it some I/O scheduling issue? The target host is an OpenSolaris 10 box with a standard samba setup. Other OSs, e.g. OS X, performed adequately against the same server. Data is multiple 800MB .wmv files.

lspci:
[root@wkst1 tmp]# lspci
00:00.0 Host bridge: Intel Corporation 82875P/E7210 Memory Controller Hub (rev 02)
00:01.0 PCI bridge: Intel Corporation 82875P Processor to AGP Controller (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev 02)
00:1d.3 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #4 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2)
00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
00:1f.5 Multimedia audio controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) AC'97 Audio Controller (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation NV43 [GeForce 6600/GeForce 6600 GT] (rev a2)
02:01.0 SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (rev 01)
02:01.1 SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (rev 01)
02:0c.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)

iostat shows nothing useful. Attached .png shows the odd I/O behaviour.
Comment 1 Josh P. 2006-11-01 07:36:17 UTC
Created attachment 139946 [details] screenshot of system monitor when issue is taking place
Comment 2 Sammy 2007-01-05 00:53:11 UTC
I can confirm frequent hangs with the same adapter on FC6+updates. The funny thing is that the system was working fine until I switched from a static IP to DHCP. I will try to get another static IP and test again tomorrow.
Comment 3 Sammy 2007-01-05 14:41:26 UTC
It is working great with a static IP. FYI
Comment 4 Jesse Brandeburg 2007-11-11 06:55:03 UTC
Switching from DHCP to a static address turns off the timestamping option in the packets. I wonder if we can confirm this is related to the problem somehow. Maybe do the AF_PACKET thing and turn on the timestamping option on your socket while using a static address, and see if the problem comes back? See this thread for reference: http://marc.info/?l=linux-netdev&m=118888412614400&w=2
Comment 5 Josh P. 2007-11-15 05:10:54 UTC
Confirmed this is still an issue with FC8. Setting a static IP doesn't help, and disabling TCP timestamps (sysctl -w net.ipv4.tcp_timestamps=0) doesn't make a difference.
Comment 6 Jon Stanley 2007-12-31 21:39:53 UTC
Changing AssignedTo and version to reflect that this is an F8 bug
Comment 7 John W. Linville 2008-01-07 20:01:21 UTC
Could you try a rawhide kernel? I believe the patch in the thread from comment 4 should already be available in those kernels...
Comment 8 Christopher Brown 2008-02-03 20:36:22 UTC
Hello, I'm reviewing this bug as part of the kernel bug triage project, an attempt to isolate current bugs in the Fedora kernel. http://fedoraproject.org/wiki/KernelBugTriage As the original reporter has not added any information in a few months, this bug has been marked for closure. Could a rawhide kernel please be tested as per comment #7? Otherwise this bug will be closed, as we are unable to troubleshoot without further testing. Cheers, Chris
Comment 9 Josh P. 2008-02-05 23:16:28 UTC
I am the original reporter and I will test this fix within the next week so please leave it open for now. Sorry for the bug spam.
Comment 13 Josh P. 2008-02-13 03:45:27 UTC
Latest rawhide exhibits the same behavior; I am able to get only about 3MB/s on this gigabit link. Had to boot with enforcing=0 to get X to come up (unrelated, but just a bit of info).

[root@sc400 ~]# uname -a
Linux sc400 18.104.22.168-26.fc9 #1 SMP Fri Feb 8 19:56:42 EST 2008 i686 i686 i386 GNU/Linux
Comment 14 Trevor Cordes 2008-02-16 16:30:29 UTC
For years I've gotten horrible upload performance to samba (box A) from XP (box B) on my main fileserver box (F8 now). I also have another box C running F8 that I can test with, using both samba and nfs. The samba hardware is reasonably fast (P4 2.4/533, 3GB RAM dc, Intel server-class NIC). The XP box is a P4 3.0 with an Intel high-end desktop adapter. Box C is a PD 3.0 with the same NIC. Jumbo frames are enabled on all NICs. The switch is a Linksys SRW2016 gigabit webview with jumbo enabled.

Here are some tests copying a music folder (7 files, 558MB total); results in seconds, so lower is better:

16s  B->C samba -> saved to scsi (no raid)
51s  B->A samba -> saved to raid6
36s  A->C nfs   -> saved to scsi (no raid)
30s  C->A samba -> saved to shm test share
25s  C->A samba -> saved to raid6
40s  C->A nfs   -> saved to raid6

Notice the vast difference in speeds copying the same files (from 10MB/s to 35MB/s)! Why can the XP box move data so fast to one samba box (16s) and so slow to the other (51s)? Sure, C is a faster box than A, but not by that much, and during my tests the box is not loaded. I'll watch this bug with great interest.
Comment 15 John W. Linville 2008-04-29 19:59:52 UTC
Well, it has been a while...is this issue still relevant?
Comment 16 Josh P. 2008-04-30 07:16:56 UTC
It is still an issue as of 60 days ago, when I last blew away everything on that hardware and retested with a fresh FC9 and a rawhide kernel. I have not heard about anything reported subsequently that might fix it. I'm willing to run several, even MANY tests on this hardware (I can redeploy its tasks to another machine) in order to help track this down. I am happy to use this machine to test any fixes for this issue, but it seems we only have people coming in every few months just to ask if the bug is still here. No one has indicated that they want this issue solved, just that they want the issue gone. If I'm not providing adequate data to solve it, please tell me and I can gather whatever information is necessary.
Comment 17 John W. Linville 2008-04-30 15:15:00 UTC
Can you replicate this behavior with some other adapter (e.g. using the tg3 driver)? It is not at all clear to me how SMB traffic would trigger a slowdown on a specific adapter or driver.
Comment 18 Jesse Brandeburg 2008-04-30 18:28:27 UTC
I'll try not to ignore this bug any more. It doesn't easily reproduce here, so we're left asking you for lots of data, sorry:

Can you tell us what TCP or UDP ports on the samba machine end up being used (netstat -anlp during slowness)? I'm also interested in a 10000-packet tcpdump while it is being slow (tcpdump -i ethX -c 10000 -w e1000smb.tcpdump). Windows will transparently try port 445 TCP (cifs) and then fall back to port 139 (tcp or udp smbfs).

If you're willing to repro again, you can try turning off, with ethtool:
1) tso (ethtool -K ethX tso off)
2) tx checksum offload (ethtool -K ethX tx off)
3) rx checksum offload (ethtool -K ethX rx off)
and see if that changes anything.

Have you checked ethtool -S ethX output on the samba side (during slowness) to see if it reports anything? That output would be good to attach. We actually found a long-standing bug in e1000 recently where a UDP checksum result of 0000 was being treated incorrectly, but that wouldn't apply here unless you were actually ending up using UDP.

So, bolapara, your test is to transfer 800MB of data *from* linux F8 samba to Solaris? Is the result repeatable when talking to another linux samba or windows system? Are you sure you're not being disk-limited due to slow disk access or io-scheduler problems? Can you repro the issue using netperf or nuttcp, or even dbench/tbench (from the samba test suite)? The goal here is to get the disk reads out of the picture.
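The last point above, getting disk reads out of the picture, can be sketched with a memory-to-memory TCP transfer in the spirit of nuttcp or netperf. This is a hypothetical minimal harness, not one of the tools named in the comment; it pushes zeros from RAM over a local socket and reports throughput, so disk never enters the measurement:

```python
import socket
import threading
import time

def _sink(server_sock, counter):
    """Accept one connection and count bytes until the sender closes."""
    conn, _ = server_sock.accept()
    while True:
        chunk = conn.recv(65536)
        if not chunk:
            break
        counter[0] += len(chunk)
    conn.close()

def measure_throughput(seconds=2.0, host="127.0.0.1"):
    """Blast zero-filled buffers over a TCP socket for `seconds`
    and return throughput in MB/s. Run the sink on the remote box
    instead of a thread to test a real link."""
    srv = socket.socket()
    srv.bind((host, 0))
    srv.listen(1)
    counter = [0]
    t = threading.Thread(target=_sink, args=(srv, counter))
    t.start()

    cli = socket.create_connection(srv.getsockname())
    payload = b"\0" * 65536  # data comes from memory, not disk
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        cli.sendall(payload)
    cli.close()

    t.join()
    srv.close()
    return counter[0] / seconds / 1e6
```

If this harness (or nuttcp itself) shows near line rate while the SMB copy stays at ~3MB/s, the NIC and driver are effectively exonerated.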
Comment 19 Josh P. 2008-05-02 03:35:09 UTC
Note: In my previous comment (#16) I meant FC8, not FC9.

After my testing today, I'm starting to believe the issue is basically this: Gnome's performance on Fedora with regard to network operations is poor, and System Monitor's default update interval in Fedora is too fine. Read below for details. The problem with this conclusion is that while I was able to replicate the harmonic, and performance in some of the scenarios was poor, these tests went a lot better than the last couple of times I replicated it. There are some variables to consider, documented below, and some differences in this round of testing that I cannot change at this time. It's also possible that something fixed in patches between the time I last tested and now helped performance a bit.

I reloaded the box with FC8 and all updates. While the updates were loading I noticed the system still exhibited the same odd harmonic as the System Monitor screenshot I included with the original defect. After the patch download/install it updated fairly quickly, and I rebooted.

The test is this: I am copying a 1GB file FROM my Fedora machine to my file server. I created a 1GB file from /dev/zero, then cat'd it to /dev/null to get it into the system cache (the system has 2GB of RAM).

First test: smbclient. I had used only the Gnome interface for the original submission and subsequent reproductions, so this is a variable. Performance was decent, but the harmonic was still clearly visible. With the System Monitor update interval at the default 0.5s, each spike would hit about ~60MB/s and the next update would show 0b/s. Final throughput was 12260KB/s.

Second test: scp. Those of you commenting that the protocol shouldn't matter were correct; I saw the same harmonic, and final throughput was 12.2MB/s.

Third test: Gnome. I again saw the same harmonic. Gnome doesn't show throughput, but with a stopwatch I measured 161s to copy the 1GB file; a measly 6.36MB/s. Peaks were much lower at about 28MB/s, but they seemed spaced the same as in the scp and smbclient tests.

Notes: Dell PowerEdge 400SC specifications: 3.2GHz P4, 2GB RAM, 10k RPM U160 SCSI drive attached to an Adaptec 39160. After cat'ing the test file to /dev/null to get it into cache, I ran 'dd' three times to time sending it to /dev/null: 192MB/s, 201MB/s, and 189MB/s. I don't think disk performance is the bottleneck here.

Additional tests: I used my Ubuntu 8.04 laptop (Dell Latitude D600, 1.6GHz Pentium M, 2GB of RAM, tg3 GbE). Lowering the System Monitor update interval to 0.5s (default was 1s) demonstrated the same harmonic! This machine did the smbclient test at 10135KB/s; it's a significantly lower-powered machine than the 400SC, although it does have GbE. scp throughput was 12.2MB/s. Gnome's performance was the best at 12.5MB/s. Since this is 8.04, please remember that we are using gvfs here. Is this proof that the performance problem was originally just Gnome? I also tested scp from Leopard (iMac G5 2.0GHz) and it performed nearly as poorly as Gnome on Fedora: 9.2MB/s.

Another variable from the original tests: my file server is now running Ubuntu Server, not Solaris. My apologies for this variable, but my file server is a 'production' machine. If we need further retests with Solaris as the fileserver OS, I can see if I can find a way to replicate the original scenario more closely.

Last note: with the exception of the Fedora Gnome test, where I used a stopwatch to figure out the throughput, I generally relied on the tool/application to give me my throughput numbers. I don't know whether they all report MB/s or some report MiB/s. Just an FYI.
Comment 20 Jesse Brandeburg 2008-05-02 16:40:44 UTC
If you're having a problem with the "harmonic", it is because the stats are updated only every two seconds in the driver. Not really a bug, just the way things have always been. We have actually patched e1000 for this: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ef90e4eca9fcade05dd03f853df75cf459a75422;hp=419886927796dfeca87c1fd11d1fe2ed442103cc

As for the performance problems, I'm glad you were able to eliminate the disk. The G5 probably isn't a good comparison; it's slow, especially with scp. Were you able to try any of the offload-disable options I suggested? smbclient is not a very fast program; I would expect mount -t cifs //file-server/share /localmnt to be the fastest. Just as a reference, I have two linux boxes here connected over a netgear gigabit switch; they are all PCIe + e1000e, however. Can you please yum install nuttcp and try it, just so we can compare with a network benchmark? Are you satisfied with the current levels of performance? It almost seems like that from your post.
Comment 21 Jesse Brandeburg 2008-05-02 17:30:42 UTC
Forgot to post my nuttcp results:

on ich9:
# nuttcp -S

[root@ich9b ~]# nuttcp ich9
1119.2995 MB / 10.03 sec = 936.2377 Mbps 6 %TX 17 %RX
Comment 22 Josh P. 2008-05-12 03:16:10 UTC
Below are my results using iperf. All tests were run individually. 192.168.1.213 is a Dell Latitude D600 running Ubuntu 8.04. 192.168.1.250 is the Dell PowerEdge 400SC running Fedora 8.

josh@fs:~$ iperf -s -p 9999
------------------------------------------------------------
Server listening on TCP port 9999
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 192.168.1.105 port 9999 connected with 192.168.1.213 port 54297
[ 4]  0.0-10.0 sec   676 MBytes   565 Mbits/sec
[ 5] local 192.168.1.105 port 9999 connected with 192.168.1.213 port 54298
[ 5]  0.0-10.0 sec   675 MBytes   565 Mbits/sec
[ 4] local 192.168.1.105 port 9999 connected with 192.168.1.250 port 52264
[ 4]  0.0-10.0 sec   976 MBytes   816 Mbits/sec
[ 5] local 192.168.1.105 port 9999 connected with 192.168.1.250 port 52265
[ 5]  0.0-10.0 sec   972 MBytes   812 Mbits/sec
[ 4] local 192.168.1.105 port 9999 connected with 192.168.1.250 port 52266
[ 4]  0.0-10.0 sec   964 MBytes   806 Mbits/sec

I don't know what to say now. Clearly these results point away from the network being the issue here... I'm still not happy with the SMB/CIFS performance of this box, but now I'm not sure what is to blame for its poor performance.
Comment 23 Jeff Layton 2008-07-03 13:50:21 UTC
It's not clear to me what you're actually using to benchmark SMB performance here. It looks like you're just using userspace programs to do it (e.g. smbclient or the gnome VFS smb stuff). Have you done any testing with the in-kernel CIFS client? For instance: mount -t cifs //server/share /mnt/cifs ...and then do your testing from there. That might give better results, and would give us an indication of whether the problem is confined to userspace or is in-kernel.
Comment 24 Andy Gospodarek 2008-10-07 14:08:15 UTC
Sounds like this is more of a fs issue, so I'm going to watch it, but assign it back to Jeff.
Comment 25 Jeff Layton 2008-10-08 12:26:05 UTC
Setting to NEEDINFO pending response to question in comment #23.
Comment 26 Jeff Layton 2008-11-25 12:23:55 UTC
No response in several months to request to test using CIFS. Closing case. Please reopen if this is still an issue for you.