Bug 216246 - bug in tg3 3.52 driver - both RHES 3 2.4.X kernel and RHES 4 2.6.X kernel
Summary: bug in tg3 3.52 driver - both RHES 3 2.4.X kernel and RHES 4 2.6.X kernel
Keywords:
Status: CLOSED DUPLICATE of bug 208922
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i686
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Red Hat Kernel Manager
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-11-17 23:14 UTC by Scott Ramshaw
Modified: 2007-11-30 22:07 UTC (History)
0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-11-22 17:55:07 UTC
Target Upstream Version:
Embargoed:



Description Scott Ramshaw 2006-11-17 23:14:00 UTC
Description of problem:
Warm reboots cause the OS to see only one NIC, and the NIC it sees alternates
between NIC 1 and NIC 2 on every warm reboot.  A cold boot works fine.

Version-Release number of selected component (if applicable):


How reproducible:
Always, 100% of the time.

Steps to Reproduce:
1. Install any recent kernel with the 3.52 tg3 driver.
2. Reboot.
3. Walk out to the server, since it won't come back online, and investigate!
  
Actual results:
tg3.c:v3.52RH (Mar 06, 2006)
divert: allocating divert_blk for eth0
eth0: Tigon3 [partno(BCM95703A30) rev 1002 PHY(5703)] (PCIX:100MHz:64-bit)
10/100/1000BaseT Ethernet 00:02:55:b7:b5:d3
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[0]
eth0: dma_rwctrl[769f4000]
tg3: Could not obtain valid ethernet address, aborting.
ip_tables: (C) 2000-2002 Netfilter core team
tg3: eth0: Link is up at 10 Mbps, half duplex.
tg3: eth0: Flow control is off for TX and off for RX.

Expected results:
tg3.c:v3.43RH (Oct 24, 2005)
divert: allocating divert_blk for eth0
eth0: Tigon3 [partno(BCM95703A30) rev 1002 PHY(5703)] (PCIX:100MHz:64-bit)
10/100/1000BaseT Ethernet 00:02:55:b7:b5:d3
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[0] 
eth0: dma_rwctrl[769f4000]
divert: allocating divert_blk for eth1
eth1: Tigon3 [partno(BCM95703A30) rev 1002 PHY(5703)] (PCIX:100MHz:64-bit)
10/100/1000BaseT Ethernet 00:02:55:b7:b5:d4
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[0] 
eth1: dma_rwctrl[769f4000]
ip_tables: (C) 2000-2002 Netfilter core team
tg3: eth0: Link is up at 10 Mbps, half duplex.
tg3: eth0: Flow control is off for TX and off for RX.

Additional info:
We have encountered an issue with some of the recent kernel updates for RHES 3
and RHES 4 related to the tg3 driver. 

The symptoms are identical on both operating systems.  When the machine does a
"warm" reboot via the reboot command or shutdown -r now, it detects only 1 of
the 2 network cards on the system when it boots back up.  Even stranger, the one
NIC that is detected swaps on every warm reboot, so eth0 becomes eth1 and eth1
does not exist; on the next warm reboot they swap again, eth1 becomes eth0, and
eth1 is still not detected.

A "cold" reboot, performed by halting the machine, turning the power off for 30
seconds, then powering up, causes both NICs to be detected properly, every time.
We can consistently reproduce this problem with the tg3 v3.52RH driver: the
error occurs on every warm boot and never on a cold boot.  We were not able to
reproduce the problem at all with tg3 3.43RH; that driver seems fine.

RHES 3 problem kernels:
RHSA-2006:0710 - kernel-2.4.21-47.0.1.EL
RHSA-2006:0437 - kernel-2.4.21-47.EL
(NOTE: the driver version is tg3.c:v3.52RH in both of the above kernels; this
is where the problem starts.)

Last working kernel that cannot reproduce the problem:
RHSA-2006:0144 - kernel-2.4.21-40.EL
tg3 driver version: tg3.c:v3.43RH

Further note:  We tried Broadcom's SRPM, version 3.66d from
www.broadcom.com/support/ethernet_nic/netxtreme.php, and built it successfully
for kernel-2.4.21-47.0.1.EL and for RHES 4 kernel-2.6.9-42.0.3.EL.  It also
solved the problem.  However, this replaced the tg3.o module in /lib/modules,
so future kernel versions will cause the problem to recur unless they contain a
working driver.

This is especially disastrous for data center models such as ours, where
customers do remote reboots: their machines now will not come back online
without human intervention.

RHES 4:
The tg3 driver was updated to 3.43 in kernel-2.6.9-34.EL (RHSA-2006:0132).  This
kernel and its modules _do_ work properly.

The next kernel release is RHSA-2006:0493, kernel-2.6.9-34.0.1, with tg3 driver
3.43-rh.  It works.

The next kernel is RHSA-2006:0574, kernel-2.6.9-34.0.2.EL, with tg3 driver
v3.43-rh.  It works.

The next kernel is RHSA-2006:0575, where the tg3 driver is updated to 3.52-rh.
This and all later kernels contain 3.52, and we can consistently reproduce the
problem with them.

With driver version 3.52, on both 2.4 kernels with RHES 3 and 2.6 kernels with
RHES 4, warm reboots cause the OS on the IBM 305 to see only one of our NICs,
and which one it sees swaps on every warm reboot.
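
The driver version history above can be checked against a machine's boot log
with a short script.  This is only a sketch: the function names and the status
table are hypothetical, and the table records nothing more than the outcomes
observed in this report (3.43 works, 3.52 fails on warm reboot, Broadcom's 3.66d
works).  It assumes the driver banner has the form shown in the dmesg output in
this report, e.g. "tg3.c:v3.52RH (Mar 06, 2006)".

```python
import re

# Outcomes as observed in this bug report only; not an authoritative list.
REPORTED_STATUS = {
    "3.43": "works",
    "3.52": "fails on warm reboot",
    "3.66d": "works (Broadcom SRPM build)",
}

# Matches the tg3 banner and captures the version without the RH suffix.
# An optional trailing lowercase letter (as in 3.66d) is kept, unless it is
# the start of an "rh" suffix.
BANNER_RE = re.compile(r"tg3\.c:v(\d+\.\d+(?:(?!rh)[a-z])?)")


def tg3_version(dmesg_text):
    """Return the tg3 driver version found in dmesg text, or None."""
    match = BANNER_RE.search(dmesg_text)
    return match.group(1) if match else None


def reported_status(dmesg_text):
    """Look up the outcome this report observed for the detected version."""
    return REPORTED_STATUS.get(tg3_version(dmesg_text), "unknown")
```

Feeding it the output of dmesg (for example, saved to a file after boot) would
flag a machine running 3.52 before a customer triggers a remote reboot.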

When it fails, dmesg shows something like this.  Note the invalid MAC address
error:

tg3.c:v3.52RH (Mar 06, 2006)
divert: allocating divert_blk for eth0
eth0: Tigon3 [partno(BCM95703A30) rev 1002 PHY(5703)] (PCIX:100MHz:64-bit)
10/100/1000BaseT Ethernet 00:02:55:b7:b5:d3
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[0]
eth0: dma_rwctrl[769f4000]
tg3: Could not obtain valid ethernet address, aborting.
ip_tables: (C) 2000-2002 Netfilter core team
tg3: eth0: Link is up at 10 Mbps, half duplex.
tg3: eth0: Flow control is off for TX and off for RX.
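
A failed probe like the one above can be detected automatically after each
reboot.  The following is a minimal sketch, assuming the log lines look exactly
like the ones quoted in this report; the function name, the returned keys, and
the default of two expected NICs are hypothetical choices for illustration.

```python
import re

# The abort message exactly as it appears in the failing dmesg output.
ABORT_MSG = "tg3: Could not obtain valid ethernet address, aborting."

# Matches per-interface probe lines such as "eth0: Tigon3 [partno(...)".
NIC_RE = re.compile(r"\b(eth\d+): Tigon3 \[")


def check_tg3_probe(dmesg_text, expected_nics=2):
    """Summarize which tg3 interfaces were probed and whether one aborted."""
    nics = sorted(set(NIC_RE.findall(dmesg_text)))
    aborted = ABORT_MSG in dmesg_text
    return {
        "nics": nics,
        "aborted": aborted,
        "ok": (not aborted) and len(nics) >= expected_nics,
    }
```

Run against saved dmesg output, a result with "ok" set to False indicates the
machine came up in the one-NIC state described here and needs a cold boot.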

Please release updated kernels for both OSes with updated working tg3 drivers!

Comment 1 Andy Gospodarek 2006-11-22 17:55:07 UTC

*** This bug has been marked as a duplicate of 208922 ***

Comment 2 Scott Ramshaw 2006-11-22 21:00:13 UTC
Bug 208922 only addresses the issue on RHES 4 with the 2.6 kernel, but the bug
also exists on the last few RHES 3 2.4.X kernels, as stated in this bug.  Please
re-evaluate.

