Bug 124057

Summary: tg3 driver intermittently blocking with certain switches
Product: [Fedora] Fedora
Component: kernel
Version: 2
Hardware: i686
OS: Linux
Status: CLOSED NEXTRELEASE
Severity: high
Priority: medium
Reporter: Richard Lloyd <rkl>
Assignee: David Miller <davem>
QA Contact: Brian Brock <bbrock>
CC: djuran, jgarzik, john, shishz, tokarek
Doc Type: Bug Fix
Last Closed: 2005-04-16 06:03:03 UTC

Description Richard Lloyd 2004-05-23 16:06:02 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.6) Gecko/20040116

Description of problem:
I have a couple of Dell PowerEdge servers with dual Broadcom BCM5704
gigabit network cards in them running Fedora Core 2. On both machines,
if I wire the card to a particular switch and use the standard "tg3"
driver, I get intermittent blocking of the traffic (anything between
10 seconds and 10 minutes) and then resumption of the traffic, plus
lots of receive errors (no transmit errors though) reported by "ifconfig".

If I wire the cards to a different switch, there are no
network errors! A different card (an Intel one) in the same machines
works fine with any switch it's plugged into, suggesting some
interaction between the tg3 driver and certain brands of switch.

Version-Release number of selected component (if applicable):
kernel-2.6.5-1.358smp

How reproducible:
Always

Steps to Reproduce:
1. Install a Broadcom NetXtreme BCM5704 gigabit card.
2. Make sure it uses the FC2 "tg3" driver (/etc/modprobe.conf).
3. Plug the network connection into a switch (brand will matter - more info to
follow).
4. Get a large (hundreds of MB) file from another machine onto the machine
with the Broadcom card.
5. Check "ifconfig <device>" (e.g. eth0) to see if there are any errors.

Actual Results:  "ifconfig <device>" reported a lot of receive errors (inc.
frame problems):
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:3101000 errors:377040 dropped:0 overruns:0 frame:5231
TX packets:3946056 errors:0 dropped:0 overruns:0 carrier:0

Expected Results:  "ifconfig <device>" should report no errors.

Note the lack of transmit errors as well. However, an identical card
attached to a different switch works fine (as do other cards - e.g.
using the "e1000" driver - attached to any of the switches I've
tried). My guess is that the combination of the tg3 driver and
particular brands of switch just don't talk correctly to each other.
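
To put a number on the receive errors, here is a minimal Python sketch that parses the two counter lines quoted above and works out the ratio of errors to packets in each direction. The helper names (parse_counters, error_rate) are made up for illustration, and the regex assumes the old net-tools "name:value" layout shown in this report; newer ifconfig versions print a different format.

```python
import re

# Counter lines copied from the ifconfig output quoted in this report.
SAMPLE = """\
RX packets:3101000 errors:377040 dropped:0 overruns:0 frame:5231
TX packets:3946056 errors:0 dropped:0 overruns:0 carrier:0
"""


def parse_counters(text):
    """Return {'RX': {...}, 'TX': {...}} from old-style ifconfig counter lines."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"\s*(RX|TX) packets:(\d+)(.*)", line)
        if not match:
            continue  # skip flag lines, byte counters, and so on
        direction, packets, rest = match.groups()
        fields = {"packets": int(packets)}
        # Remaining fields look like "errors:377040 dropped:0 ...".
        for name, value in re.findall(r"(\w+):(\d+)", rest):
            fields[name] = int(value)
        stats[direction] = fields
    return stats


def error_rate(fields):
    """Ratio of the error counter to the packet counter for one direction."""
    return fields["errors"] / fields["packets"] if fields["packets"] else 0.0


stats = parse_counters(SAMPLE)
rx_rate = error_rate(stats["RX"])
tx_rate = error_rate(stats["TX"])
print(f"RX error ratio: {rx_rate:.2%}, TX error ratio: {tx_rate:.2%}")
```

On the figures above this gives an RX ratio of roughly 12% against zero on TX, which matches the receive-only nature of the problem.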

Additional info:

I've been trying all sorts of params to "ethtool" to see if anything
helps and it doesn't. My only saving grace here is that there's a
third card in the two machines ("Intel Corp. 82545GM Gigabit Ethernet
Controller") which works perfectly with the same switches ("e1000"
driver this time) that I had trouble with when using the "tg3" driver.
I'll see if I can dig out some info on the brand of switch that's
causing the problem and add it to this bug report (I'm not at work at the
moment, so I can't post it yet). It's bad enough that I don't think I
can recommend Broadcom cards to anyone using FC2...

Comment 1 Richard Lloyd 2004-05-26 21:51:39 UTC
Just a note that the BCM5704 card has no errors when attached to an
Edimax switch (but wrongly auto-negotiates with it - I've added a comment
about this in another bug that was closed, but may need to be reopened).
The problems I reported seem to occur only between that card and a
3com switch, so we're now using the Intel card I mentioned with the
3com switch instead and that works with no errors.

Comment 2 Ryan Tokarek 2004-06-01 19:41:51 UTC
I get a similar problem on a Dell PowerEdge 2550 (933MHz PIII w 1.5GB
RAM). With a bunch of network activity, the network drops out after
5-10 seconds on the gigabit ethernet port. With little to no activity,
it still happens, but takes 10-20 minutes. `ifdown eth1 ; ifup eth1`
works to bring the network back up for a short while. 

The ethernet chipset that exhibits the problem is "Broadcom
Corporation NetXtreme BCM5700 Gigabit Ethernet (rev 10)" from lspci.
The tg3 module is loaded. The ethernet port is plugged into a gigabit
switch (3Com) and negotiates at 1Gb. 

Let me know if I should provide more information.

-Ryan

dmesg reports:
irq 11: nobody cared! (screaming interrupt?)
Call Trace:
 [<021070c9>] __report_bad_irq+0x2b/0x67
 [<02107161>] note_interrupt+0x43/0x66
 [<02107327>] do_IRQ+0x109/0x169
 [<0223007b>] sock_ioctl+0x13e/0x280
 [<0211af64>] __do_softirq+0x2c/0x73
 [<021078f5>] do_softirq+0x46/0x4d
 =======================
 [<0210737b>] do_IRQ+0x15d/0x169
 [<0210403b>] default_idle+0x23/0x26
 [<0210408c>] cpu_idle+0x1f/0x34
 [<02318612>] start_kernel+0x174/0x176

handlers:
[<62c7da6a>] (tg3_interrupt+0x0/0xe8 [tg3])
Disabling IRQ #11
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
tg3: eth1: Link is up at 1000 Mbps, full duplex.
tg3: eth1: Flow control is on for TX and on for RX.
eth1: no IPv6 routers present
irq 11: nobody cared! (screaming interrupt?)
[repeating ....]
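
For anyone comparing logs from several machines, a rough Python sketch like the one below can pull the interesting events out of dmesg text: the "nobody cared" IRQ complaints, the IRQs that got disabled, and the tg3_stop_block timeouts. The function name summarize is hypothetical, the sample is a trimmed copy of the log above, and treating the ofs= values as hexadecimal is an assumption based on the "c00" entry.

```python
import re

# Trimmed copy of the dmesg excerpt above (call trace omitted).
SAMPLE = """\
irq 11: nobody cared! (screaming interrupt?)
handlers:
[<62c7da6a>] (tg3_interrupt+0x0/0xe8 [tg3])
Disabling IRQ #11
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
tg3: eth1: Link is up at 1000 Mbps, full duplex.
irq 11: nobody cared! (screaming interrupt?)
"""


def summarize(log_text):
    """Collect unhandled-IRQ reports, disabled IRQs and tg3 stop-block offsets."""
    unhandled = re.findall(r"^irq (\d+): nobody cared!", log_text, re.MULTILINE)
    disabled = re.findall(r"^Disabling IRQ #(\d+)", log_text, re.MULTILINE)
    timeouts = re.findall(r"tg3_stop_block timed out, ofs=([0-9a-f]+)", log_text)
    return {
        "unhandled_irqs": [int(n) for n in unhandled],
        "disabled_irqs": [int(n) for n in disabled],
        # Assumption: offsets are printed in hex without a 0x prefix.
        "stop_block_offsets": [int(ofs, 16) for ofs in timeouts],
    }


summary = summarize(SAMPLE)
print(summary)
```

Feeding it the full output of "dmesg" rather than the trimmed sample should work the same way, since lines that match none of the patterns are simply ignored.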


Comment 3 Richard Lloyd 2004-06-01 20:08:00 UTC
Another solution I've found (if you've got only BCM57XX cards and no
alternative) is to switch to the BCM5700 "official" driver from
http://www.broadcom.com/drivers/downloaddrivers.php but remember that
you'll need the kernel source and gcc toolchain installed before you
try to build that driver from source on FC2.

Other things you'll need to know about the BCM5700 driver (version
7.1.22):

*  You need to change line 1763 of src/b57um.c to read:

                              dev->name, smp_processor_id());

   [i.e. change hard_smp_processor_id() to smp_processor_id()]

* "make install" the driver so that it's installed in the right place
and a "depmod -a" command is correctly issued.

* Edit /etc/modprobe.conf to change tg3 references to bcm5700.

* At this point, the easiest thing to do is to reboot to pick up the
new driver. You might be able to pick it up with something like
"init 1" followed by "init 3", but no guarantees (cos I haven't tried
that - I chickened out and rebooted). A simple "/etc/init.d/network
restart" may not be good enough, BTW.
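
The /etc/modprobe.conf edit in the steps above can be sketched in Python as follows. This is only an illustration: the sample file contents are hypothetical (a real modprobe.conf will differ), switch_driver is a made-up helper name, and the code works on a string rather than touching the real file, so check the result and keep a backup before writing anything to disk.

```python
import re

# Hypothetical file contents for illustration; a real /etc/modprobe.conf will differ.
SAMPLE_CONF = """\
alias eth0 e1000
alias eth1 tg3
alias eth2 tg3
alias scsi_hostadapter megaraid
"""


def switch_driver(conf_text, old="tg3", new="bcm5700"):
    """Rewrite 'alias ethN <old>' lines to use <new>, leaving other lines alone."""
    pattern = re.compile(
        rf"^([ \t]*alias[ \t]+eth\d+[ \t]+){re.escape(old)}([ \t]*)$",
        re.MULTILINE,
    )
    # Keep the 'alias ethN ' prefix and any trailing spaces, swap only the driver name.
    return pattern.sub(lambda m: m.group(1) + new + m.group(2), conf_text)


new_conf = switch_driver(SAMPLE_CONF)
print(new_conf)
```

Only the ethN alias lines naming tg3 are changed, so an Intel card on e1000 in the same machine keeps its entry.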

I've done this on a couple of PowerEdges and the errors with the tg3
driver went away when the bcm5700 driver was used and things are
looking good now. You should note that (now correctly closed) bug
#124857 has more info about why you have to edit the driver source -
see https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=124857

Comment 4 Dave Jones 2005-04-16 06:03:03 UTC
Fedora Core 2 has now reached end of life, and no further updates will be
provided by Red Hat.  The Fedora legacy project will be producing further kernel
updates for security problems only.

If this bug has not been fixed in the latest Fedora Core 2 update kernel, please
try to reproduce it under Fedora Core 3, and reopen if necessary, changing the
product version accordingly.

Thank you.