Bug 233030

Summary: Broadcom Corporation NetXtreme BCM5721 (tg3 driver) goes AWOL under load after upgrade to 64-bit 2.6.9-42.0.10.ELsmp
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Reporter: David Tonhofer <bughunt>
Assignee: Andy Gospodarek <agospoda>
QA Contact: Martin Jenner <mjenner>
CC: glshank, jbaron, peterm
Doc Type: Bug Fix
Last Closed: 2008-04-08 14:49:11 UTC
Attachments:
lspci -vvv output of RX300S2 (flags: none)

Description David Tonhofer 2007-03-19 23:05:04 UTC
Description of problem:

Since upgrading systems from kernel "2.6.9-42.0.3.ELsmp 64bit" to kernel
"2.6.9-42.0.10.ELsmp 64bit" the networking card goes AWOL under (somewhat heavy?
buffer-filling?) load.

Hardware involved: 

Fujitsu-Siemens RX300 S2 with (two) Ethernet connectors which are given as 
"Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)"
by "lspci".

What happens: 

I log in remotely over SSH, then go into a directory that has a _large_ number
of directories/files, i.e., several thousand.

I do an 'ls -lR'. The first few hundred entries scroll past, after which the
connection is broken. The machine becomes inaccessible.

After 10-20 minutes (or more, I'm currently waiting for machine 2 to get back on
the net), the machine is accessible again. No reboot happened - a look at the
kernel log (in which, incidentally, all the SYN packets are being logged) shows
that from the moment the machine became inaccessible, no incoming packets were
detected any longer. At the moment the machine became accessible again, the log
shows that the tg3 driver got some kick from a watchdog:

Mar 18 21:21:19 de_dusty kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 18 21:21:19 de_dusty kernel: tg3: eth0: transmit timed out, resetting
Mar 18 21:21:19 de_dusty kernel: tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
Mar 18 21:21:19 de_dusty kernel: tg3: eth0: Link is down.
Mar 18 21:21:20 de_dusty kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Mar 18 21:21:20 de_dusty kernel: tg3: eth0: Flow control is off for TX and off for RX.
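
To find every occurrence of this pattern in a saved kernel log, a small scanner along the following lines may help. It is only a sketch: it assumes the classic syslog prefix seen in the excerpt above ("Mon DD HH:MM:SS host kernel:") and that each log entry sits on a single line. The names find_watchdog_events, WATCHDOG_RE and STOP_BLOCK_RE are made up for this illustration.

```python
import re

# Watchdog line as it appears in the excerpt above (syslog prefix assumed).
WATCHDOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"NETDEV WATCHDOG:\s+(?P<iface>\w+):\s+transmit timed out"
)
# Detail line that tg3 prints when a hardware block fails to stop during reset.
STOP_BLOCK_RE = re.compile(
    r"tg3_stop_block timed out, ofs=(?P<ofs>[0-9a-f]+)\s+enable_bit=(?P<bit>\d+)"
)


def find_watchdog_events(lines):
    """Group each watchdog timeout with the tg3_stop_block lines that follow it."""
    events = []
    for line in lines:
        hit = WATCHDOG_RE.match(line)
        if hit:
            events.append({
                "stamp": hit.group("stamp"),
                "host": hit.group("host"),
                "iface": hit.group("iface"),
                "stop_blocks": [],
            })
            continue
        detail = STOP_BLOCK_RE.search(line)
        if detail and events:
            # Attach the register offset and enable bit to the latest event.
            events[-1]["stop_blocks"].append(
                (detail.group("ofs"), int(detail.group("bit")))
            )
    return events


if __name__ == "__main__":
    sample = [
        "Mar 18 21:21:19 de_dusty kernel: NETDEV WATCHDOG: eth0: transmit timed out",
        "Mar 18 21:21:19 de_dusty kernel: tg3: eth0: transmit timed out, resetting",
        "Mar 18 21:21:19 de_dusty kernel: tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2",
    ]
    for event in find_watchdog_events(sample):
        print(event)
```

Feeding it the lines of the kernel log file would give one record per outage, which makes it easier to compare how often the timeout occurs under different kernels.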

Version-Release number of selected component:

   2.6.9-42.0.10.ELsmp for x86_64 x86_64 x86_64

How reproducible:

   Easily. This happened on two separate machines that were upgraded 24 hours
   ago to 2.6.9-42.0.10.ELsmp from 2.6.9-42.0.3.ELsmp.

   Listing the directory as described above reproduced the behaviour nicely.

   Will try to reboot under an old kernel once the machine is back.

Please advise on how to extract more info.
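
One low-cost piece of extra information would be whether the interface packet counters stop advancing during an outage. The sketch below is not a recommendation from this thread. It assumes the standard /proc/net/dev layout (two header lines, then eight receive columns followed by eight transmit columns per interface), and the helper names parse_proc_net_dev, read_counters and rx_stalled are invented for the example.

```python
def parse_proc_net_dev(text):
    """Return {interface: (rx_packets, tx_packets)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if len(fields) < 16:
            continue
        # Receive columns come first: bytes, packets, ...; transmit starts at index 8.
        counters[name.strip()] = (int(fields[1]), int(fields[9]))
    return counters


def read_counters(path="/proc/net/dev"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as handle:
        return parse_proc_net_dev(handle.read())


def rx_stalled(before, after, iface):
    """True if the receive packet counter did not advance between two samples."""
    return after[iface][0] == before[iface][0]
```

Calling read_counters() from a console session (not over the affected network link) a few seconds apart, and passing both results to rx_stalled with "eth0", would show whether the card has stopped receiving while transmit requests are still queued.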

Comment 1 David Tonhofer 2007-03-19 23:11:18 UTC
Second machine is back, same problem (how do I shorten the NETDEV WATCHDOG timeout?)

cs_havana kernel: NETDEV WATCHDOG: eth0: transmit timed out
cs_havana kernel: tg3: eth0: transmit timed out, resetting
cs_havana kernel: tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
cs_havana kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
cs_havana kernel: tg3: eth0: Link is down.
cs_havana kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
cs_havana kernel: tg3: eth0: Flow control is off for TX and off for RX.

Will now try to list the directory under the old kernel.

Comment 2 David Tonhofer 2007-03-19 23:19:21 UTC
Listing works under 64-bit 2.6.9-42.0.8.ELsmp. Guess I will stay with that
version for a bit ;-)



Comment 3 David Tonhofer 2007-03-19 23:41:33 UTC
Unfortunately the case is not 100% clear-cut. I did find an old entry which
happened under the old kernel (inaccessibility ~2 minutes), exactly the same as
above really.



Comment 5 Andy Gospodarek 2007-04-18 21:14:50 UTC
David, Can you attach the full output from lspci -vvv on your system?  Thanks.

Comment 6 David Tonhofer 2007-04-18 22:46:52 UTC
Created attachment 152965 [details]
lspci -vvv output of RX300S2

As requested.

However, note that I cannot reproduce the problem any longer, at least not at
the moment (kernel is "2.6.9-42.0.10.ELsmp #1 SMP Fri Feb 16 17:13:42 EST 2007
x86_64 x86_64 x86_64 GNU/Linux")

Could it be that network conditions trigger this behaviour?

I will keep the machine running under 2.6.9-42.0.10.ELsmp and keep an eye on
it.

Best regards,

-- David

Comment 7 David Tonhofer 2007-04-19 22:09:18 UTC
Ok, got one event after about 20 "ls -lR" retries. The CPU is currently heavily
loaded with so-called "grid computing", I do not know whether that would make
any difference. Will now let the machine run and see whether an event occurs
just by itself.


Comment 8 Andy Gospodarek 2007-05-11 21:13:25 UTC
tg3_get_invariants has some specific configurations for several different
chip/bridge combinations and probably needs some tweaking for this system.  

These are the bridge chips in question:

00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 0c) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size 10
        Bus: primary=00, secondary=04, subordinate=04, sec-latency=0

00:05.0 PCI bridge: Intel Corporation E7520 PCI Express Port B1 (rev 0c) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size 10
        Bus: primary=00, secondary=05, subordinate=05, sec-latency=0

since they seem to be the bridges that connect the ethernet devices.

Can you confirm/deny that you have only ever seen these issues with eth0?

I won't suggest specifically trying 4.5 (which uses driver 3.64) or my test
kernels (which use driver 3.73)

http://people.redhat.com/agospoda/#rhel4

since neither seems to contain any additional changes in the tg3 driver that
would affect this specific system, so I doubt they will be instant fixes.

If you can try one of my test kernels on one of your systems and manage to
reliably recreate the issue that will be helpful too.  That way we'll know for
sure this is fixed when I add a patch to account for this specific hardware.

Comment 9 glshank 2007-07-09 19:26:50 UTC
(In reply to comment #8)

I have the same or similar issue with the BCM5700 and I am running 4.5. I'm not
sure I can try an experimental kernel since this is a production host. If you
need more info, I can probably provide it.

Greg


Comment 10 glshank 2007-07-11 18:33:58 UTC
class: NETWORK
bus: PCI
detached: 0
device: eth1
driver: tg3
desc: "Broadcom Corporation NetXtreme BCM5700 Gigabit Ethernet"
network.hwaddr: 00:0B:DB:E6:C6:C6
vendorId: 14e4
deviceId: 1644
subVendorId: 1028
subDeviceId: 0109
pciType: 1
pcidom:    0
pcibus:  a
pcidev:  2
pcifn:  0
class: NETWORK
bus: PCI
detached: 0
device: eth0
driver: tg3
desc: "Broadcom Corporation NetXtreme BCM5700 Gigabit Ethernet"
network.hwaddr: 00:0B:DB:E6:C6:C5
vendorId: 14e4
deviceId: 1644
subVendorId: 1028
subDeviceId: 0109
pciType: 1
pcidom:    0
pcibus:  a
pcidev:  1
pcifn:  0

0a:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5700 Gigabit Ethernet (rev 14)
        Subsystem: Dell Broadcom BCM5700 1000Base-T
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (16000ns min), Cache Line Size 10
        Interrupt: pin A routed to IRQ 225
        Region 0: Memory at eff10000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [40] PCI-X non-bridge device.
                Command: DPERE- ERO- RBC=0 OST=0
                Status: Bus=255 Dev=31 Func=1 64bit+ 133MHz+ SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: 1c7c706b84a912ac  Data: ae76


Comment 11 glshank 2007-07-23 18:11:47 UTC
I tried Broadcom's driver which is 3.71b and Dell's driver which is 3.75d and
they both have the same problem. Any suggestions?

Comment 12 Andy Gospodarek 2007-07-23 18:27:16 UTC
Well, my test kernels now have patches to 3.77.

http://people.redhat.com/agospoda/#rhel4

The latest upstream is 3.79 and the changes to get there are quite small, but
will hopefully appear in my test kernels soon.

Comment 13 glshank 2007-07-27 14:10:26 UTC
What is the largesmp kernel?

# up2date kernel-largesmp

Fetching Obsoletes list for channel: rhel-i386-as-4...

Fetching rpm headers...

Name                                    Version        Rel     
----------------------------------------------------------


The following packages you requested were not found:
kernel-largesmp


Comment 14 Andy Gospodarek 2007-07-27 14:19:38 UTC
(In reply to comment #13)

The test packages are not available on RHN; they are available on my people page:

http://people.redhat.com/agospoda/#rhel4

Comment 15 glshank 2007-07-27 14:30:18 UTC
I understand that but what is it? Is it for systems with > 4 GB but less than 12
GB, or what?


Comment 16 glshank 2007-08-16 21:57:20 UTC
No answer?

Comment 17 Andy Gospodarek 2007-08-17 02:02:46 UTC
Sorry 'bout that.  

On RHEL4 largesmp kernels are for systems with a large number of processors.  On
x86_64 it increases the number of CPUs supported from 8 to 64, on ppc64 it
increases the number of CPUs supported from 64 to 128, and on ia64 it increases
the number of supported CPUs from 64 to 512.
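
Put as a lookup, the limits quoted above could be sketched like this. The table holds only the CPU counts stated in this comment; the function name kernel_flavor and the label "standard" for the non-largesmp kernel are illustrative and not package names.

```python
# RHEL4 CPU limits as stated above: (standard kernel, largesmp kernel).
CPU_LIMITS = {
    "x86_64": (8, 64),
    "ppc64": (64, 128),
    "ia64": (64, 512),
}


def kernel_flavor(arch, cpus):
    """Pick the smallest kernel variant whose CPU limit covers the machine."""
    standard_max, largesmp_max = CPU_LIMITS[arch]
    if cpus <= standard_max:
        return "standard"
    if cpus <= largesmp_max:
        return "largesmp"
    raise ValueError(f"{cpus} CPUs exceeds the largesmp limit on {arch}")
```

So the choice depends on processor count, not on the amount of installed RAM: an x86_64 system with 8 or fewer CPUs has no need for largesmp.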

Hope that helps.

Comment 18 Andy Gospodarek 2008-04-08 14:49:11 UTC
This bug has seen no activity in quite a while, so I can only presume it is no
longer a problem.  If there is still an issue that needs to be resolved, please
re-open this bug and I will be happy to help resolve it.  Thank you.

Comment 19 David Tonhofer 2008-04-08 16:48:32 UTC
Hi, original poster here. I could not reliably reproduce the bug. The machine
(no hardware modifications since the report except additional RAM) is now
running RH4 with 2.6.9-67.0.1.ELsmp, and the hardware is stable.

Comment 20 Andy Gospodarek 2008-04-10 16:19:53 UTC
Thanks for the feedback, David.  

Don't hesitate to open another bug or reopen this one if the problem appears
again on your current or any later kernel.