Bug 99333

Summary: (NET TG3) bcm5701/SX @ 33Mhz PCI doesn't init after system reset
Product: Red Hat Enterprise Linux 2.1
Reporter: Glen A. Foster <glen.foster>
Component: kernel
Assignee: Jason Baron <jbaron>
Status: CLOSED WORKSFORME
QA Contact: Brian Brock <bbrock>
Severity: high
Docs Contact:
Priority: high
Version: 2.1
CC: jgarzik, knoel, riel, shillman, tao
Target Milestone: ---   
Target Release: ---   
Hardware: ia64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2003-09-03 14:46:18 UTC Type: ---
Bug Depends On:    
Bug Blocks: 87937    

Description Glen A. Foster 2003-07-17 19:25:47 UTC
Intro: This defect is 1 of 3 critical defects found at HP running in system
validation test configurations.   Information from an internal HP defect
database will be cut-and-pasted for more background information.

Description of problem: After a system reset on an Everest (rx5670) with Madison
processors, the tg3 driver (for the Broadcom 5701 Fibre (SX) card) reports the
card as initialized, but in fact it's not (e.g., ping fails).

Version-Release number of selected component (if applicable):
# rpm -q kernel-smp
kernel-smp-2.4.18-e.31

How reproducible: very

Steps to Reproduce:
1. Perform a system reset (via firmware)
2. Watch the GigE/SX card report the interface as up while ping fails
3. Reboot WITHOUT resetting system firmware and watch it work
    
Additional info: I will post the HP internal details of this defect (EM28) in
separate attachments.

Comment 1 Glen A. Foster 2003-07-17 19:27:45 UTC
======= ORIGINAL defect report =========

In environment: EV01 in SV
Bug reported on: Tuesday 06/03 2003 at 15:59 154
Abe revision: 20.01
-------------------------------------------------------
SYMPTOMS:

During RHAS 2.1 testing on Madison for Everest, the system hung. After a
reboot, the OS found all the cards, but the GigE-SX in slot 7 had no connection.
Sometimes another reboot fixed the problem and the connection came back.
Sometimes I had to pull out the cable and plug it in again. Pulling the cable
out always fixes the problem. I saw this problem several times; it comes and
goes.

-------------------------------------------------------
CONDITIONS WHICH RELIABLY REPRODUCE BUG SYMPTOMS:

1 Way Everest Madison with 1G memory.
SFW 3.1, BMC 1.30 and MP E.02.10
IO config:
slot 2: VGA/USB
slot 4: GigE-SX
slot 5: U160x2
slot 6: HVD/FW
slot 7: GigE-SX
slot 8: 10/100BT
slot 9: U160x2
slot 10: U160
slot 11: U160
slot 12: 10/100BT

Comment 2 Glen A. Foster 2003-07-17 19:30:29 UTC
======== UPDATED INFO 09-Jul-2003 ========

Updated on: Wednesday 07/09 2003 at 15:28 190
Abe revision: 20.02
-------------------------------------------------------
UPDATE:

With the RHAS 2.1 Q2 update, the reboot hang and the GigE-SX loss of connection
still happen. Notice that the card is running at 33MHz and shares its bus with
the HVD/FW card. The log follows.

Jul  8 18:35:04 ev01 kernel: tg3.c:v1.4c (Feb 18, 2003)
Jul  8 18:35:04 ev01 kernel: PCI: Found IRQ 56 for device 21:04.0
Jul  8 18:35:04 ev01 kernel: eth0: Tigon3 [partno(A6794-60001) rev 0105 PHY(5701)] (PCI:66MHz:64-bit) 10/100/1000BaseT Ethernet 00:30:6e:28:17:6b
Jul  8 18:35:04 ev01 kernel: PCI: Found IRQ 68 for device 80:01.0
Jul  8 18:35:04 ev01 kernel: eth1: Tigon3 [partno(A6847-60001) rev 0105 PHY(serdes)] (PCI:66MHz:64-bit) 10/100/1000BaseT Ethernet 00:04:76:df:be:0d
Jul  8 18:35:04 ev01 kernel: PCI: Found IRQ 92 for device e0:02.0
Jul  8 18:35:04 ev01 kernel: eth2: Tigon3 [partno(A6847-60101) rev 0105 PHY(serdes)] (PCI:33MHz:64-bit) 10/100/1000BaseT Ethernet 00:30:6e:49:98:7d
Jul  8 18:35:04 ev01 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Jul  8 18:35:04 ev01 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Jul  8 18:35:04 ev01 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Jul  8 18:35:04 ev01 kernel: tg3: eth1: Flow control is off for TX and off for RX.
Jul  8 18:35:04 ev01 kernel: tg3: eth2: Link is up at 1000 Mbps, full duplex.
Jul  8 18:35:04 ev01 kernel: tg3: eth2: Flow control is off for TX and off for RX.
tg3.c:v1.4c (Feb 18, 2003)
PCI: Found IRQ 56 for device 21:04.0
divert: allocating divert_blk for eth0
eth0: Tigon3 [partno(A6794-60001) rev 0105 PHY(5701)] (PCI:66MHz:64-bit) 10/100/1000BaseT Ethernet 00:30:6e:28:17:6b
PCI: Found IRQ 68 for device 80:01.0
divert: allocating divert_blk for eth1
eth1: Tigon3 [partno(A6847-60001) rev 0105 PHY(serdes)] (PCI:66MHz:64-bit) 10/100/1000BaseT Ethernet 00:04:76:df:be:0d
PCI: Found IRQ 92 for device e0:02.0
divert: allocating divert_blk for eth2
eth2: Tigon3 [partno(A6847-60101) rev 0105 PHY(serdes)] (PCI:33MHz:64-bit) 10/100/1000BaseT Ethernet 00:30:6e:49:98:7d
tg3: eth0: Link is up at 100 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
tg3: eth1: Link is up at 1000 Mbps, full duplex.
tg3: eth1: Flow control is off for TX and off for RX.
tg3: eth2: Link is up at 1000 Mbps, full duplex.
tg3: eth2: Flow control is off for TX and off for RX.
eepro100.c:v1.09j-t 9/29/99 Donald Becker http://www.scyld.com/network/eepro100.html
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <saw@saw.sw.com.sg> and others




ev01 -> ifconfig eth2
eth2      Link encap:Ethernet  HWaddr 00:30:6E:49:98:7D  
          inet addr:10.20.90.121  Bcast:10.20.90.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:22 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:0 (0.0 b)  TX bytes:1408 (1.3 Kb)
          Interrupt:92 
          
          
          
ev01 -> ping 10.20.90.69
PING 10.20.90.69 (10.20.90.69) from 10.20.90.121 : 56(84) bytes of data.
From 10.20.90.121: Destination Host Unreachable
From 10.20.90.121: Destination Host Unreachable
From 10.20.90.121: Destination Host Unreachable


ev01 -> route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.10.14.0      0.0.0.0         255.255.255.0   U     0      0        0 eth3
10.20.91.0      0.0.0.0         255.255.255.0   U     0      0        0 eth1
10.10.15.0      0.0.0.0         255.255.255.0   U     0      0        0 eth4
10.20.90.0      0.0.0.0         255.255.255.0   U     0      0        0 eth2
10.0.0.0        0.0.0.0         255.255.254.0   U     0      0        0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U     0      0        0 lo
0.0.0.0         10.0.0.1        0.0.0.0         UG    0      0        0 eth0
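
The pattern in the ifconfig output above (interface UP and RUNNING, TX counter
advancing, RX counter stuck at zero) can be checked mechanically. The following
is only a minimal sketch, not a tool used in this investigation: the function
names are hypothetical, and the sample text is the eth2 stanza copied from this
report.

```python
import re

# eth2 stanza copied from the ifconfig output in this report (trimmed).
IFCONFIG_ETH2 = """\
eth2      Link encap:Ethernet  HWaddr 00:30:6E:49:98:7D
          inet addr:10.20.90.121  Bcast:10.20.90.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:22 errors:0 dropped:0 overruns:0 carrier:0
"""


def parse_ifconfig(text):
    """Extract the flag words and packet counters from one ifconfig stanza."""
    flags = re.search(r"^[ \t]+([A-Z ]+?)[ \t]+MTU:", text, re.M)
    rx = re.search(r"RX packets:(\d+)", text)
    tx = re.search(r"TX packets:(\d+)", text)
    return {
        "flags": flags.group(1).split() if flags else [],
        "rx_packets": int(rx.group(1)) if rx else 0,
        "tx_packets": int(tx.group(1)) if tx else 0,
    }


def transmits_but_never_receives(stats):
    """True when the interface claims to be up and sends, yet has received nothing."""
    up = "UP" in stats["flags"] and "RUNNING" in stats["flags"]
    return up and stats["tx_packets"] > 0 and stats["rx_packets"] == 0


stats = parse_ifconfig(IFCONFIG_ETH2)
print(stats["flags"], stats["rx_packets"], stats["tx_packets"])
print("suspect:", transmits_but_never_receives(stats))
```

A zero RX counter alone does not prove a driver fault (an idle or unplugged
link would look the same), so this is a hint to investigate, not a diagnosis.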



Comment 3 Glen A. Foster 2003-07-17 19:31:38 UTC
======== UPDATED INFO on 09-Jul-2003 ========
Updated on: Wednesday 07/09 2003 at 17:17 190
Abe revision: 20.02
-------------------------------------------------------
UPDATE:

The symptom is one can't ping other hosts on the same subnet.

The tg3 driver *thinks* it properly initialized the card.
The tg3 driver is v1.4c (RHAS 2.1 Q2), which seems to be equivalent to 1.5 (Feb 18, 2003).

As shown in previous update, "route" has proper entries.

The same type of card in a 66MHz slot (shared with a 53c1010)
initializes just fine. The failing card shares its PCI bus
with a 53c875 (33MHz only) SCSI controller.

This seems to be yet another timing problem with tg3 PHY initialization.
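
To compare cards by bus speed without reading the log by eye, the tg3 probe
lines can be parsed. This is a minimal sketch: the regular expression is an
assumption based only on the three probe lines pasted in this report (re-joined
where the paste wrapped them), and the names PROBE_RE and parse_probe are
hypothetical.

```python
import re

# tg3 probe lines from the log in this report, with wrapped lines re-joined.
PROBE_LINES = [
    "eth0: Tigon3 [partno(A6794-60001) rev 0105 PHY(5701)] (PCI:66MHz:64-bit) "
    "10/100/1000BaseT Ethernet 00:30:6e:28:17:6b",
    "eth1: Tigon3 [partno(A6847-60001) rev 0105 PHY(serdes)] (PCI:66MHz:64-bit) "
    "10/100/1000BaseT Ethernet 00:04:76:df:be:0d",
    "eth2: Tigon3 [partno(A6847-60101) rev 0105 PHY(serdes)] (PCI:33MHz:64-bit) "
    "10/100/1000BaseT Ethernet 00:30:6e:49:98:7d",
]

# Pattern inferred from the lines above; other tg3 versions may print differently.
PROBE_RE = re.compile(
    r"^(eth\d+): Tigon3 \[partno\(([^)]+)\) rev (\w+) PHY\(([^)]+)\)\] "
    r"\(PCI:(\d+)MHz:(\d+)-bit\)"
)


def parse_probe(line):
    """Return interface, part number, PHY type and bus speed, or None if no match."""
    m = PROBE_RE.match(line)
    if m is None:
        return None
    return {
        "iface": m.group(1),
        "partno": m.group(2),
        "phy": m.group(4),
        "bus_mhz": int(m.group(5)),
    }


cards = [c for c in (parse_probe(line) for line in PROBE_LINES) if c]
slow_fibre = [c["iface"] for c in cards if c["phy"] == "serdes" and c["bus_mhz"] == 33]
print("fibre cards on a 33MHz bus:", slow_fibre)
```

On the pasted log this singles out eth2, the interface that fails to ping,
while eth1 (same PHY type, 66MHz bus) is the one that works.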

Next steps:
o backport v1.5->v1.6 tg3 changes and see if that works better. 

o start inserting MMIO reads to flush pending MMIO writes. Last time
  I checked, the driver had about 53 instances of "write(); udelay()"
  and I was only allowed to fix three of them because those are the
  only ones I could demonstrate caused problems at the time.
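
Why "write(); udelay()" is fragile can be illustrated with a toy model of
posted writes. This is not tg3 code and not a model of the rx5670 chipset: the
ToyBus class, the 30us posting latency and the 40us settle requirement are all
invented for illustration. The only assumption carried over from real PCI is
that a write may be queued on its way to the device, while a read from the same
device does not complete until earlier writes have been delivered.

```python
class ToyBus:
    """Toy model: writes reach the device after a posting latency; reads flush them."""

    def __init__(self, post_latency_us):
        self.now_us = 0                      # CPU-side clock
        self.post_latency_us = post_latency_us
        self.pending = []                    # (arrival_time_us, reg) still in flight
        self.arrived_at = {}                 # reg -> time the device saw the write

    def _deliver_due(self):
        # Hand over every write whose arrival time has passed.
        for arrival, reg in self.pending:
            if arrival <= self.now_us:
                self.arrived_at[reg] = arrival
        self.pending = [(t, r) for t, r in self.pending if t > self.now_us]

    def write(self, reg):
        # Posted write: the CPU continues immediately, the device sees it later.
        self.pending.append((self.now_us + self.post_latency_us, reg))

    def read(self):
        # The read stalls the CPU until all earlier writes have been delivered.
        if self.pending:
            self.now_us = max(self.now_us, max(t for t, _ in self.pending))
        self._deliver_due()

    def udelay(self, us):
        self.now_us += us
        self._deliver_due()

    def device_settle_us(self, reg):
        """Time the device has had since it actually saw the write to reg."""
        if reg not in self.arrived_at:
            return 0
        return self.now_us - self.arrived_at[reg]


REQUIRED_SETTLE_US = 40  # invented requirement for this sketch

# Pattern 1: write(); udelay()
bus = ToyBus(post_latency_us=30)
bus.write("RESET")
bus.udelay(REQUIRED_SETTLE_US)
settle_without_flush = bus.device_settle_us("RESET")

# Pattern 2: write(); read(); udelay()
bus = ToyBus(post_latency_us=30)
bus.write("RESET")
bus.read()
bus.udelay(REQUIRED_SETTLE_US)
settle_with_flush = bus.device_settle_us("RESET")

print(settle_without_flush, settle_with_flush)
```

In this model the unflushed pattern leaves the device only part of the intended
delay, because the delay starts counting before the write has arrived. Whether
that is what happens on the failing 33MHz bus is exactly what the second next
step above is meant to find out.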


Comment 4 Glen A. Foster 2003-07-17 19:33:50 UTC
======== UPDATED INFO 15-Jul-2003 ========
Updated on: Tuesday 07/15 2003 at 14:36 196
Abe revision: 20.02
-------------------------------------------------------
UPDATE:

I removed the tg3 driver with rmmod after verifying that "ping 10.20.90.69" fails.
ifconfig reports "eth2" (running at 33MHz) is using 10.20.90.121.

"insmod /root/bcm5700.o" came up w/o errors:
Broadcom Gigabit Ethernet Driver bcm5700 with Broadcom NIC Extension (NICE) ver. 6.2.11 (05/16/03)
divert: allocating divert_blk for eth0
PCI: Found IRQ 56 for device 21:04.0
eth0: Broadcom BCM5701 found at mem 90000000, IRQ 56, node addr 00306e28176b
eth0: Broadcom BCM5701 Integrated Copper transceiver found
eth0: Scatter-gather ON, 64-bit DMA ON, Tx Checksum ON, Rx Checksum ON, 802.1Q VLAN ON, NAPI ON
divert: allocating divert_blk for eth1
PCI: Found IRQ 68 for device 80:01.0
eth1: Broadcom BCM5701 found at mem c0050000, IRQ 68, node addr 000476dfbe0d
eth1: Agilent HDMP-1636 SerDes transceiver found
eth1: Scatter-gather ON, 64-bit DMA ON, Tx Checksum ON, Rx Checksum ON, 802.1Q VLAN ON, NAPI ON
divert: allocating divert_blk for eth2
PCI: Found IRQ 92 for device e0:02.0
eth2: Broadcom BCM5701 found at mem f0010000, IRQ 92, node addr 00306e49987d
eth2: Agilent HDMP-1636 SerDes transceiver found
eth2: Scatter-gather ON, 64-bit DMA ON, Tx Checksum ON, Rx Checksum ON, 802.1Q VLAN ON, NAPI ON
bcm5700: eth1 NIC Link is UP, 1000 Mbps full duplex
bcm5700: eth2 NIC Link is UP, 1000 Mbps full duplex
bcm5700: eth0 NIC Link is UP, 100 Mbps full duplex
[root@ev01 root]# 

And then "ping" still fails with bcm5700.
Like before, after a reboot, tg3 can ping 10.20.90.69.

I mentioned this to my counterpart at Broadcom.
Michael Chan <mchan> wrote back:
| Since we have the rx5670 in our lab, I'm going to ask our lab guys to
| reproduce this problem and then debug it. This is going to take at least a
| few days. I will keep you posted.


Comment 5 Jason Baron 2003-07-17 20:22:42 UTC
Did all these tests pass on QU1? On e.25?
Knowing that would help narrow the focus.

Comment 6 Glen A. Foster 2003-07-17 20:27:14 UTC
Our internal records show this defect appears on the original "stock" AS2.1, so
that would be the 2.4.18-e.12 kernel.

Comment 7 Larry Troan 2003-07-22 16:30:56 UTC
ISSUE TRACKER 26123 OPENED AS SEV 1 -- QU3 BLOCKER

Comment 9 Larry Troan 2003-09-03 14:44:17 UTC
FROM ISSUE TRACKER
Event posted 08-25-2003 07:47pm by charline.polifka with duration of 0.00      
 HP Could not reproduce; NOT QU3 Blocker.

Status set to: Waiting on Client (Long Term)

Comment 10 Larry Troan 2003-09-03 14:46:18 UTC
Closing as WORKSFORME since HP can't recreate the problem. They can reopen the
Bug later if it shows up again.