Red Hat Bugzilla – Bug 119055
RHEL 3.0 (kernel 2.4.21) is dropping frames when reaching 70 MB/s on the network link.
Last modified: 2007-11-30 17:07:01 EST
Description of problem:
RHEL 3.0 (kernel 2.4.21) is dropping frames when the link reaches 70 MB/s.
The 3Com 3C996-SX with the BCM5700 v7.1.22 driver has been tried, and an
Intel long-wave gigabit Ethernet card has also been used to confirm
the frame drops.
There are no messages in /var/log/messages indicating that frames are
being dropped, unless we missed them. We couldn't find any dropped or
retransmitted TCP frames in the /proc interfaces either, but we know
for sure that frames are being dropped from analyzing the iSCSI trace
captured with the analyzer.
FTP was also used as a test, to simply transfer a big file.
The iSCSI driver is the Cisco based iSCSI driver version 18.104.22.168 (the
latest driver for 2.4.x kernel).
Version-Release number of selected component (if applicable):
Steps to Reproduce:
"The 3Com 3C996-SX with BCM5700 v7.1.22 driver has been tried" -- that
driver is clearly unsupported.
*** Bug 119066 has been marked as a duplicate of this bug. ***
This is still an issue with the network stack regardless of the
driver used - reopening the issue.
Can you please provide a detailed test case / reproducer which does
not involve the unsupported iscsi module?
We talked about this on a con-call with EMC on 6/8/04. I requested
more detailed information.
You say that the kernel is dropping packets. Please explain what you
mean by this.
How are you detecting dropped packets?
Do you know which component in the network is dropping packets?
What is the network hardware configuration?
What else is running on the network at the time?
Please reply. Thanks.
Configuration: (this is the configuration we use to duplicate the
problem of dropped TCP packets with FTP). Please note, this is a clean
network with no other traffic on the VLAN.
Host A ----------> Analyzer --> 6509 ---- 6509 ---> Analyzer ------>
To answer the above questions:
1. We detect dropped TCP packets when the receiving host fails to
acknowledge, within a reasonable timeframe, the TCP packets sent by the
transmitting host, which triggers the transmitting host to retransmit
the unacknowledged TCP packets.
2. Using the analyzers in the above configuration, we verified that the
suspected dropped TCP packets were safely transmitted to the receiving
host. This confirms that the suspect packets arrive at the receiving
host and are not dropped along the transmission path.
3. See the diagram above.
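The retransmission-based detection in answer 1 can also be approximated from the host side with `netstat -s`. A minimal sketch follows; the sample output is inlined (the counter text, including net-tools' historical "retransmited" spelling, is an assumption about this system's net-tools version) so the extraction is self-contained -- on a live host you would snapshot the counter before and after the FTP transfer and compare:

```shell
#!/bin/sh
# Sketch: read the TCP retransmission counter from `netstat -s` output.
# A captured sample is inlined here; on a live host, replace
# netstat_sample with:  netstat -s
netstat_sample() {
cat <<'EOF'
Tcp:
    158970 segments send out
    217 segments retransmited
EOF
}
retrans=$(netstat_sample | awk '/segments retransmited/ {print $1}')
echo "retransmitted segments: $retrans"
```

A steadily climbing delta during a transfer on an otherwise idle VLAN points the same direction as the analyzer trace.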
We were using two types of gigabit Ethernet adapters:
1: 3Com 3C996 Gigabit Fiber-SX Server NIC
2: On-board copper gigabit adapter (Broadcom)
The tg3 driver was used for both. We also used the bcm57xx driver for the test.
Heather, have you tried using any non-BCM5700 cards in this setup?
What were the results?
Do we have any indications of whether the packets are being dropped by
the hardware (i.e. not being pulled off fast enough) versus being
dropped somewhere in the kernel's networking stack?
Could you attach the results of running sysreport to this bug? Thanks!
This has also been tried with the Intel long-wave NIC. The results
were the same: frames are being dropped.
The hardware was ruled out by running a non-Linux OS with the same
hardware; under this OS no frame drop was encountered at data rates
If beneficial to the troubleshooting effort, we can provide a trace
that shows the frames being dropped.
sysreport output will be attached asap.
Created attachment 104457 [details]
sysreport for test system
Could I also see the output from running ethtool on the interface in
question while running this test (i.e. "ethtool ethX" where ethX is
the slow NIC)?
"The hardware was ruled out by running a non-Linux OS with the same
hardware; under this OS no frame drop was encountered at data rates
Can you confirm that this was done with the exact same box (rather
than a "twin")?
What is the PCI bus speed (33/66) and width (32/64) the card is
running at? Or is it PCI-X?
Are sender and receiver systems the same (or similar)? If not, how do they differ?
I can recreate the 70MB/s speed limit, but I do not see any signs of
dropped frames. I think my sender may not be able to push frames any
faster -- working on that...
In my previous tests (w/ the 70MB/s speed limit), both machines had 32-bit
PCI NICs. I have since tracked down some 64-bit PCI machines and
repeated the test.
The machines were a 1.6GHz P4 and a (6-way) 866MHz PIII. The P4 was
running FC2 and had a tg3, while the PIII was running RHEL3 and had an
Running ttcp repeatedly between the two machines showed performance
reliably above 95MB/s with many runs as fast as 115MB/s.
Since these machines are so CPU-poor compared to the machines I was
using previously, I have to attribute the performance gap solely to
the PCI bus-widths involved.
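As a sanity check on that attribution, the theoretical peak bandwidth of each bus variant can be computed directly (clock rate times bus width in bytes; sustained throughput is lower still because the bus is shared and carries protocol overhead):

```shell
# Theoretical peak PCI bandwidth = clock (MHz) x width (bytes).
# A 32-bit/33 MHz bus peaks at ~132 MB/s shared across every device on
# it, which makes a ~70 MB/s sustained network ceiling plausible.
awk 'BEGIN {
    printf "32-bit/33MHz peak: %d MB/s\n", 33 * 4
    printf "64-bit/66MHz peak: %d MB/s\n", 66 * 8
}'
```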
FWIW, the previous testing (w/ the 32-bit PCI NICs) involved a 3.2GHz
P4 and a 3.4GHz P4/EM64T.
I had a con-call with EMC today. They say that the frame drops are
very easy to see with a network analyzer. They would like to know
whether you have tried an analyzer, since other methods like ethtool
may not detect the problem. They said they will send an ethtool trace,
but feel that a network analyzer would be a more direct way to see the
problem.
Back in comment 10 they were able to use "netstat -s" to see
Tcpdsackoldsent and Tcpdsackofosent counters increasing. These are
presumed to correspond to the frame drops. Have you checked that?
EMC also notes that this problem can be worked around by setting
tcp_low_latency. Make sure it is zero to reproduce the problem.
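Both conditions above -- the DSACK counters and the tcp_low_latency setting -- can be checked quickly. A sketch, with sample `netstat -s` lines inlined so the parsing is self-contained (the line text is what net-tools prints for these counters; on a live host swap the heredoc for a real `netstat -s`, and check the sysctl via `cat /proc/sys/net/ipv4/tcp_low_latency`, which should read 0):

```shell
#!/bin/sh
# Sketch: extract the counters corresponding to Tcpdsackoldsent /
# Tcpdsackofosent from `netstat -s` output. Sample lines are inlined;
# replace netstat_sample with:  netstat -s
netstat_sample() {
cat <<'EOF'
TcpExt:
    12 DSACKs sent for old packets
    3 DSACKs sent for out of order packets
EOF
}
old_sent=$(netstat_sample | awk '/DSACKs sent for old packets/ {print $1}')
ofo_sent=$(netstat_sample | awk '/DSACKs sent for out of order packets/ {print $1}')
echo "TCPDSACKOldSent=$old_sent TCPDSACKOfoSent=$ofo_sent"
```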
I do not have a network analyzer. But, I really don't think it is
necessary in this case -- no one questions whether or not the frames
are being lost. The question is how they are being lost.
I have not seen the Tcpdsackoldsent and Tcpdsackofosent counters
increasing in my scenarios. Presumably they are increasing as a
by-product of some frames being dropped, causing later frames that get
received to appear to be out of order. After frames start getting
lost, much hilarity ensues...
I'm not sure why tcp_low_latency would affect the problem, but it
defaults to zero anyway. So, I should have had no problem reproducing
the results described. I'll have to ponder why/how this helps.
I have not received any of the information requested about the
machines involved regarding the PCI bus width/speed. It may be
helpful to have that information. AFAIK, this information is not
available from the output of sysreport.
I repeated the test with a 32-bit PCI NIC on one end and a 64-bit PCI
NIC on the other. Not only did I recreate the 70MB/sec speed limit, I
also got the Tcpdsackoldsent counter to increment.
Please verify that 64-bit NICs (in 64-bit slots) are being used on
both sides of the connection.
On Mon, Dec 13, 2004 at 01:11:58PM -0500, magill, scott (BMC Eng) wrote:
> What is the exact Ethtool command you would like me to run while
I've kinda changed theories now. Still, it wouldn't hurt to have
the output. A simple "ethtool eth0" (adjusted appropriately for the
interface involved) would suffice. While you are at it, perhaps an
"ethtool -S eth0" and maybe "ethtool -g eth0" would be good as well.
Please include the output from each end of the connection.
Are these cards in question actual NICs? Or on-board adapters?
If the former, what I think is most important would be a visual
verification that each end of the connection is a 64-bit NIC plugged
into a 64-bit slot.
John W. Linville
Created attachment 108464 [details]
lspci from Scott Magill
Are the cards in question "normal" 32-bit PCI cards? The card edge
looks something like this:
_____ _ _ ____
| | | | | |
--- --------------- ---
Whereas the 64-bit PCI cards look more like this:
_____ _ _ _ ____
| | | | | | | |
--- --------------- --- ------------
Please also verify that the 64-bit cards are plugged into 64-bit
slots. This can be identified by the back part of the card edge
hanging over the back of the slot's connector.
I am having trouble recreating the problem unless at least one side
of the connection is using 32-bit cards (or 64-bit cards plugged into
32-bit slots).
Looking at the lspci file contained within the sysreport results from
comment 13, the capabilities of the PCI bridges on the box in question
show a "64bit-". This indicates a lack of 64-bit capability.
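That lspci check can be scripted. This sketch greps for the `64bit` capability flag in `lspci -vv` output; a sample status line from a non-64-bit-capable bridge is inlined (the exact line layout is an assumption based on typical lspci output -- on the real host pipe `lspci -vv` in instead):

```shell
#!/bin/sh
# Sketch: look for the 64bit capability flag in lspci -vv output.
# "64bit+" means 64-bit capable; "64bit-" means not.
lspci_sample() {
cat <<'EOF'
Capabilities: [68] PCI-X bridge device
        Status: Dev=00:1f.0 64bit- 133MHz- SCD- USC-
EOF
}
if lspci_sample | grep -q '64bit+'; then
    cap="64-bit capable"
else
    cap="not 64-bit capable"
fi
echo "$cap"
```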
So, it certainly looks to me like the performance problems are related
to the (effective?) bus width of the NICs involved. Please refer back
to comment 18, where testing showed that even much slower CPUs achieved
much higher network performance when using 64-bit (rather than 32-bit)
NICs.
Please also refer back to comment 23, where further testing showed that
the combination of 32-bit and 64-bit NICs even showed the
Tcpdsackoldsent counter incrementing, replicating one of the more
curious aspects of this defect.
I have little doubt that a mismatch in NIC bandwidth between sender
and receiver is at the root of this problem.
I'm going to close this one. Please re-open if the behaviour can be
demonstrated using exclusively 64-bit hardware.