Bug 119055 - RHEL 3.0 v2.4.21 is dropping frames when reaching 70 MB on the network link.
Summary: RHEL 3.0 v2.4.21 is dropping frames when reaching 70 MB on the network link.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: John W. Linville
QA Contact: Brian Brock
URL:
Whiteboard:
Duplicates: 119066
Depends On:
Blocks:
Reported: 2004-03-24 15:52 UTC by Heather Conway
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-01-10 15:23:48 UTC
Target Upstream Version:
Embargoed:


Attachments
sysreport for test system (263.14 KB, application/octet-stream)
2004-09-28 19:51 UTC, Frank Jansen
lspci from Scott Magill (1.35 KB, text/plain)
2004-12-13 18:40 UTC, John W. Linville

Description Heather Conway 2004-03-24 15:52:19 UTC
Description of problem:
RHEL 3.0 v2.4.21 drops frames once traffic on the link reaches 70 MB/s.
We tried the 3Com 3C996-SX with the BCM5700 v7.1.22 driver, and also an
Intel long-wave gigabit Ethernet card, which confirmed the frame drops.

Unless we missed them, there are no messages in /var/log/messages
indicating that frames are being dropped. We also couldn't find any
dropped TCP frames or frame retransmissions in the /proc interfaces,
but we know for certain that frames are being dropped from analyzing
the iSCSI trace captured with the Analyzer.

FTP has also been used as a test, simply transferring a large file in
binary mode.
The iSCSI driver is the Cisco-based iSCSI driver version 3.4.0.0 (the
latest driver for the 2.4.x kernel).


Comment 1 Arjan van de Ven 2004-03-24 15:54:05 UTC
"The 3Com 3C996-SX with BCM5700 v7.1.22 driver  has been tried"
clearly extremely unsupported.

Comment 2 Arjan van de Ven 2004-03-24 16:51:56 UTC
*** Bug 119066 has been marked as a duplicate of this bug. ***

Comment 3 Heather Conway 2004-04-09 15:21:44 UTC
b

Comment 4 Heather Conway 2004-05-21 15:14:18 UTC
This is still an issue with the network stack regardless of the 
driver used - reopening the issue.

Comment 5 Tim Burke 2004-06-02 18:39:41 UTC
Can you please provide a detailed test case / reproducer which does
not involve the unsupported iscsi module?


Comment 7 Tom Coughlan 2004-06-15 21:32:02 UTC
We talked about this on a con-call with EMC on 6/8/04. I requested
more detailed information. 

You say that the kernel is dropping packets. Please explain what you
mean by this.  
How are you detecting dropped packets? 
Do you know which component in the network is dropping packets? 
What is the network hardware configuration? 
What else is running on the network at the time?

Please reply. Thanks.

Comment 8 Heather Conway 2004-07-01 14:38:11 UTC
Configuration (the one we use to reproduce the dropped TCP packets
with FTP). Please note that this is a clean network with no other
traffic on the VLAN.

Host A ---> Analyzer ---> 6509 ---- 6509 ---> Analyzer ---> Host B

To answer the below questions:

1.       We detect dropped TCP packets when the receiving host fails
to acknowledge packets sent by the transmitting host within a
reasonable time, which triggers the transmitting host to retransmit
the unacknowledged packets.
2.       Using the Analyzers in the above configuration, we verified
that the suspected dropped TCP packets were safely delivered to the
receiving host. This confirms that the packets arrive at the
receiving host and are not dropped along the way.

3.       See above diagram.



Comment 9 Heather Conway 2004-07-01 14:46:22 UTC
We were using two types of gig e adapters:

1:  3Com 3C996 Gigabit Fiber-SX Server NIC
2:  On board copper gig e adapter (BROADCOM)

The tg3 driver was used for both. We also tested with the bcm57xx driver.

Comment 11 John W. Linville 2004-09-09 17:58:53 UTC
Heather, have you tried using any non-BCM5700 cards in this setup? 
What were the results?

Do we have any indications of whether the packets are being dropped by
the hardware (i.e. not being pulled off fast enough) versus being
dropped somewhere in the kernel's networking stack?

Could you attach the results of running sysreport to this bug?  Thanks!

Comment 12 Frank Jansen 2004-09-28 19:44:45 UTC
This has also been tried with the Intel Longwave NIC.  The results 
were the same in that frames are being dropped.

The hardware was ruled out by running a non-Linux OS with the same 
hardware; under this OS no frame drop was encountered at data rates 
of 115MB/s.

If beneficial to the troubleshooting effort, we can provide a trace 
that shows the frames being dropped.

sysreport output will be attached asap.

Comment 13 Frank Jansen 2004-09-28 19:51:01 UTC
Created attachment 104457 [details]
sysreport for test system

Comment 14 John W. Linville 2004-09-29 14:40:00 UTC
Could I also see the output from running ethtool on the interface in
question while running this test (i.e. "ethtool ethX" where ethX is
the slow NIC)?

Comment 17 John W. Linville 2004-11-16 22:09:34 UTC
"The hardware was ruled out by running a non-Linux OS with the same 
hardware; under this OS no frame drop was encountered at data rates 
of 115MB/s."

Can you confirm that this was done with the exact same box (rather
than a "twin")?

What is the PCI bus speed (33/66) and width (32/64) the card is
running at?  Or is it PCI-X?

Are sender and receiver systems the same (or similar)?  If not, how do
they differ?

I can recreate the 70MB/s speed limit, but I do not see any signs of
dropped frames.  I think my sender may not be able to push frames any
faster -- working on that...

Comment 18 John W. Linville 2004-12-03 21:16:35 UTC
In my previous tests (w/ 70MB/s speed limit), both machines had 32-bit
PCI NICs.  I have since tracked-down some 64-bit PCI machines and
re-tested.

The machines were a 1.6GHz P4 and a (6-way) 866MHz PIII.  The P4 was
running FC2 and had a tg3, while the PIII was running RHEL3 and had an
e1000.

Running ttcp repeatedly between the two machines showed performance
reliably above 95MB/s with many runs as fast as 115MB/s.

Since these machines are so CPU-poor compared to the machines I was
using previously, I have to attribute the performance gap solely to
the PCI bus-widths involved.
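
For context on the bus-width argument, the theoretical peak rates of parallel PCI can be worked out from bus width and clock. The sketch below is illustrative only: the function name is hypothetical, the clock values are the nominal PCI figures rather than measurements from the machines in this bug, and real throughput is lower because of arbitration overhead and other devices sharing the bus.

```python
# Theoretical peak bandwidth of parallel PCI: bus width (bytes) x clock (MHz).
# These are nominal figures, not measurements from the systems in this bug.

GIGE_LINE_RATE_MB_S = 1000 / 8  # 1 Gbit/s expressed in decimal MB/s


def pci_peak_mb_s(width_bits, clock_mhz):
    """Return the theoretical peak transfer rate in MB/s (decimal megabytes)."""
    return (width_bits / 8) * clock_mhz


if __name__ == "__main__":
    for width_bits, clock_mhz in [(32, 33.33), (32, 66.66), (64, 33.33), (64, 66.66)]:
        peak = pci_peak_mb_s(width_bits, clock_mhz)
        headroom = peak / GIGE_LINE_RATE_MB_S
        print(f"{width_bits}-bit @ {clock_mhz:.0f} MHz: peak {peak:6.1f} MB/s, "
              f"{headroom:.2f}x gigabit line rate")
```

On these nominal figures a 32-bit/33 MHz bus peaks at about 133 MB/s, barely above the 125 MB/s gigabit line rate, which is at least consistent with the 32-bit systems topping out well below wire speed.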

Comment 19 John W. Linville 2004-12-03 21:18:54 UTC
FWIW, the previous testing (w/ the 32-bit PCI NICs) involved a 3.2GHz
P4 and a 3.4GHz P4/EM64T.

Comment 20 Tom Coughlan 2004-12-07 20:01:28 UTC
John,

I had a con-call with EMC today. They say that the frame drops are
very easy to see with a network analyzer. They would like to know
whether you have tried an analyzer, since other methods like ethtool
may not detect the problem. They said they will send an ethtool trace,
but feel that a network analyzer would be a more direct way to see the
problem. 

Back in comment 10 they were able to use "netstat -s" to see
Tcpdsackoldsent and Tcpdsackofosent counters increasing. These are
presumed to correspond to the frame drops. Have you checked that? 

EMC also notes that this problem can be worked around by setting
tcp_low_latency. Make sure it is zero to reproduce the problem.

Thanks.

Tom
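
The counter and sysctl checks mentioned above could be scripted roughly as follows. This is a sketch under assumptions: it assumes the counters the thread calls Tcpdsackoldsent and Tcpdsackofosent appear as TCPDSACKOldSent and TCPDSACKOfoSent on the TcpExt rows of /proc/net/netstat, and that tcp_low_latency is exposed at /proc/sys/net/ipv4/tcp_low_latency on the kernel under test. All function names are hypothetical.

```python
import time

DSACK_COUNTERS = ["TCPDSACKOldSent", "TCPDSACKOfoSent"]  # assumed field names


def parse_proc_net_table(text):
    """Parse /proc/net/netstat (or /proc/net/snmp) text into {prefix: {name: int}}."""
    tables = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Rows come in pairs: "Prefix: name1 name2 ..." then "Prefix: v1 v2 ...".
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        value_prefix, numbers = values.split(":", 1)
        if prefix != value_prefix:
            raise ValueError(f"mismatched rows: {prefix!r} vs {value_prefix!r}")
        tables[prefix] = dict(zip(names.split(), (int(n) for n in numbers.split())))
    return tables


def counter_delta(before, after, prefix, names):
    """Return how much each named counter grew between two parsed snapshots."""
    return {n: after[prefix].get(n, 0) - before[prefix].get(n, 0) for n in names}


def read_text(path):
    with open(path) as fh:
        return fh.read()


def read_tcp_low_latency(path="/proc/sys/net/ipv4/tcp_low_latency"):
    """Return the current tcp_low_latency setting (0 is needed to reproduce)."""
    return int(read_text(path).strip())


def sample_dsack(interval_s, path="/proc/net/netstat"):
    """Snapshot the DSACK counters, wait, snapshot again, and return the deltas."""
    before = parse_proc_net_table(read_text(path))
    time.sleep(interval_s)
    after = parse_proc_net_table(read_text(path))
    return counter_delta(before, after, "TcpExt", DSACK_COUNTERS)
```

Calling `sample_dsack(10)` on the receiving host while the FTP transfer is running would show whether either counter moved during that window; a non-zero delta only indicates duplicate or out-of-order segments were seen, not where the original frames were lost.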

Comment 21 John W. Linville 2004-12-07 21:02:35 UTC
I do not have a network analyzer.  But, I really don't think it is
necessary in this case -- no one questions whether or not the frames
are being lost.  The question is how they are being lost.

I have not seen the Tcpdsackoldsent and Tcpdsackofosent counters
increasing in my scenarios.  Presumably they are increasing as a
by-product of some frames being dropped, causing later frames that get
received to appear to be out of order.  After frames start getting
lost, much hilarity ensues...

I'm not sure why tcp_low_latency would affect the problem, but it
defaults to zero anyway.  So, I should have had no problem reproducing
the results described.  I'll have to ponder why/how this helps.

I have not received any of the information requested about the
machines involved regarding the PCI bus width/speed.  It may be
helpful to have that information.  AFAIK, this information is not
available from the output of sysreport.

Comment 23 John W. Linville 2004-12-13 14:52:22 UTC
I repeated the test with a 32-bit PCI NIC on one end and a 64-bit PCI
NIC on the other.  Not only did I recreate the 70MB/sec speed limit, I
also got the Tcpdsackoldsent counter to increment.

Please verify that 64-bit NICs (in 64-bit slots) are being used on
both sides of the connection.

Comment 25 John W. Linville 2004-12-13 18:39:08 UTC
On Mon, Dec 13, 2004 at 01:11:58PM -0500, magill, scott (BMC Eng) wrote:

> What is the exact Ethtool command you would like me to run while
> frame drops occur?

Scott,

I've kinda changed theories now.  Still, it wouldn't hurt to have
the output.  A simple "ethtool eth0" (adjusted appropriately for the
interface involved) would suffice.  While you are at it, perhaps an
"ethtool -S eth0" and maybe "ethtool -g eth0" would be good as well.
Please include the output from each end of the connection.

Are these cards in question actual NICs?  Or on-board adapters?
If the former, what I think is most important would be a visual
verification that each end of the connection is a 64-bit NIC plugged
into a 64-bit slot.

Thanks,

John
--
John W. Linville
linville
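
The three ethtool invocations requested above could be gathered in one pass on each end of the connection. The following is only a convenience sketch: the helper names are hypothetical, the interface name is a placeholder, and it assumes ethtool is on the PATH and that the driver supports the -S and -g queries.

```python
import subprocess


def ethtool_commands(interface):
    """Build the three ethtool invocations requested in this comment."""
    return [
        ["ethtool", interface],        # link settings (speed, duplex, autoneg)
        ["ethtool", "-S", interface],  # NIC/driver statistics
        ["ethtool", "-g", interface],  # RX/TX ring parameters
    ]


def collect(interface):
    """Run each command and return {command string: combined output}."""
    reports = {}
    for cmd in ethtool_commands(interface):
        key = " ".join(cmd)
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, check=False)
            reports[key] = result.stdout + result.stderr
        except FileNotFoundError:
            reports[key] = "ethtool not found on PATH"
    return reports
```

Running `collect("eth0")` once before and once during the transfer, on both sender and receiver, gives outputs that can be compared side by side.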


Comment 26 John W. Linville 2004-12-13 18:40:34 UTC
Created attachment 108464 [details]
lspci from Scott Magill

Comment 27 John W. Linville 2004-12-13 18:56:52 UTC
Are the cards in question "normal" 32-bit PCI cards?  The card edge
looks something like this:

_____     _                 _     ____
     |   | |               | |   |
      ---   ---------------   ---

Whereas the 64-bit PCI cards look more like this:

_____     _                 _     _              ____
     |   | |               | |   | |            |
      ---   ---------------   ---   ------------

Please also verify that the 64-bit cards are plugged into 64-bit
slots.  This can be identified by the back part of the card edge
hanging over the back of the slot's connector.

I am having trouble recreating the problem unless at least one side
of the connection is using 32-bit cards (or 64-bit cards plugged into
32-bit slots).


Comment 28 John W. Linville 2004-12-17 01:59:30 UTC
Looking at the lspci file contained within the sysreport results from
comment 13, the capabilities of the PCI bridges on the box in question
show a "64bit-".  This indicates a lack of 64-bit capability.

So, it certainly looks to me like the performance problems are related
to the (effective?) bus width of the NICs involved.  Please refer back
to comment 18 where testing showed that even much slower CPUs achieved
much higher network performance when using 64-bit (rather than 32-bit)
NICs.

Please also refer back to comment 23 where further testing showed that
the combination of 32-bit and 64-bit NICs even showed the
Tcpdsackoldsent counter incrementing, replicating one of the more
curious aspects of this defect.

I have little doubt that a mismatch in NIC bandwidth between sender
and receiver is at the root of this problem.
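
The "64bit-" check on the lspci output can be approximated with a small parser. This is a sketch that assumes `lspci -vv` text in which PCI-X status lines follow a "Capabilities: ... PCI-X ..." line and carry a `64bit+` or `64bit-` flag; the exact layout varies between pciutils versions, and the function name is hypothetical. Flags printed on other capability lines (for example message-signalled interrupts, where 64bit refers to addressing) are deliberately ignored.

```python
import re

FLAG = re.compile(r"\b64bit([+-])")


def pcix_64bit_flags(lspci_vv_text):
    """Map each device header line to its PCI-X 64bit flags (True means '+')."""
    flags = {}
    device = None
    in_pcix = False
    for line in lspci_vv_text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: start of a new device record.
            device = line.strip()
            flags[device] = []
            in_pcix = False
            continue
        stripped = line.strip()
        if stripped.startswith("Capabilities:"):
            # Only status lines under a PCI-X capability are of interest.
            in_pcix = "PCI-X" in stripped
            continue
        if in_pcix and device is not None:
            flags[device].extend(sign == "+" for sign in FLAG.findall(stripped))
    return flags
```

A device or bridge whose list contains only False values reports no 64-bit PCI-X capability, which matches the reading of the sysreport given in this comment; an empty list simply means no PCI-X capability was listed for that device.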

Comment 29 John W. Linville 2005-01-10 15:23:48 UTC
I'm going to close this one.  Please re-open if the behaviour can be
demonstrated using exclusively 64-bit hardware.

