Bug 233030
Summary: | Broadcom Corporation NetXtreme BCM5721 (tg3 driver) goes AWOL under load after upgrade to 64-bit 2.6.9-42.0.10.ELsmp |
---|---|
Product: | Red Hat Enterprise Linux 4 |
Component: | kernel |
Version: | 4.0 |
Hardware: | All |
OS: | Linux |
Status: | CLOSED NOTABUG |
Severity: | medium |
Priority: | medium |
Reporter: | David Tonhofer <bughunt> |
Assignee: | Andy Gospodarek <agospoda> |
QA Contact: | Martin Jenner <mjenner> |
CC: | glshank, jbaron, peterm |
Doc Type: | Bug Fix |
Last Closed: | 2008-04-08 14:49:11 UTC |

Description
David Tonhofer
2007-03-19 23:05:04 UTC
Second machine is back, same problem (how do I shorten the NETDEV WATCHDOG timeout?)

    cs_havana kernel: NETDEV WATCHDOG: eth0: transmit timed out
    cs_havana kernel: tg3: eth0: transmit timed out, resetting
    cs_havana kernel: tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
    cs_havana kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
    cs_havana kernel: tg3: eth0: Link is down.
    cs_havana kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
    cs_havana kernel: tg3: eth0: Flow control is off for TX and off for RX.

Will now try to list the directory under the old kernel.

Listing works under 64-bit 2.6.9-42.0.8.ELsmp. Guess I will stay with that version for a bit ;-)

Unfortunately the case is not 100% clear-cut. I did find an old entry which happened under the old kernel (inaccessibility ~2 minutes), exactly the same as above really.

David, can you attach the full output from lspci -vvv on your system? Thanks.

Created attachment 152965
lspci -vvv output of RX300S2
As requested.
However, note that I cannot reproduce the problem any longer, at least not at
the moment (kernel is "2.6.9-42.0.10.ELsmp #1 SMP Fri Feb 16 17:13:42 EST 2007
x86_64 x86_64 x86_64 GNU/Linux").
Could network conditions be the trigger of this behaviour?
I will keep the machine running under 2.6.9-42.0.10.ELsmp and keep an eye on
it.
Best regards,
-- David
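
To check whether the watchdog events quoted in the description only ever involve one interface, the kernel log can be scanned for the "NETDEV WATCHDOG" message. The following is a minimal sketch, assuming the log is available as plain text lines in the format shown above; the function name `count_watchdog_timeouts` and the sample list are illustrative and not part of any tool mentioned in this report.

```python
import re
from collections import Counter

# Matches kernel messages of the form quoted in the description, e.g.
#   cs_havana kernel: NETDEV WATCHDOG: eth0: transmit timed out
WATCHDOG_RE = re.compile(r"NETDEV WATCHDOG: (?P<iface>\w+): transmit timed out")


def count_watchdog_timeouts(lines):
    """Count NETDEV WATCHDOG transmit timeouts per interface name."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            counts[match.group("iface")] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the description; a real run would read
    # the system log file instead (its location is an assumption).
    sample = [
        "cs_havana kernel: NETDEV WATCHDOG: eth0: transmit timed out",
        "cs_havana kernel: tg3: eth0: transmit timed out, resetting",
        "cs_havana kernel: tg3: eth0: Link is down.",
    ]
    print(dict(count_watchdog_timeouts(sample)))
```

Only the watchdog line itself is counted; the follow-up "tg3: eth0: transmit timed out, resetting" message belongs to the same event and is deliberately ignored so that each event is counted once.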
Ok, got one event after about 20 "ls -lR" retries. The CPU is currently heavily loaded with so-called "grid computing"; I do not know whether that would make any difference. Will now let the machine run and see whether an event occurs just by itself.

tg3_get_invariants has some specific configurations for several different chip/bridge combinations and probably needs some tweaking for this system.

These are the bridge chips in question:

00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 0c) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size 10
        Bus: primary=00, secondary=04, subordinate=04, sec-latency=0

00:05.0 PCI bridge: Intel Corporation E7520 PCI Express Port B1 (rev 0c) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size 10
        Bus: primary=00, secondary=05, subordinate=05, sec-latency=0

since they seem to be the bridges that connect the ethernet devices.

Can you confirm/deny that you have only ever seen these issues with eth0?

I won't suggest specifically trying 4.5 (which uses driver 3.64) or my test kernels (which use driver 3.73)

http://people.redhat.com/agospoda/#rhel4

since neither seems to contain any additional changes in the tg3 driver that would affect this specific system, so I doubt they will be instant fixes.

If you can try one of my test kernels on one of your systems and manage to reliably recreate the issue that will be helpful too. That way we'll know for sure this is fixed when I add a patch to account for this specific hardware.

(In reply to comment #8)
I have the same or similar issue with the BCM5700 and I am running 4.5.
I'm not sure I can try an experimental kernel since this is a production host. If you need more info, I can probably provide it.

Greg

> tg3_get_invariants has some specific configurations for several different
> chip/bridge combinations and probably needs some tweaking for this system.
>
> These are the bridge chips in question:
>
> 00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 0c)
> (prog-if 00 [Normal decode])
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
> Stepping- SERR+ FastB2B-
>         Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
> <MAbort- >SERR- <PERR-
>         Latency: 0, Cache Line Size 10
>         Bus: primary=00, secondary=04, subordinate=04, sec-latency=0
>
> 00:05.0 PCI bridge: Intel Corporation E7520 PCI Express Port B1 (rev 0c)
> (prog-if 00 [Normal decode])
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
> Stepping- SERR+ FastB2B-
>         Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
> <MAbort- >SERR- <PERR-
>         Latency: 0, Cache Line Size 10
>         Bus: primary=00, secondary=05, subordinate=05, sec-latency=0
>
> since they seem to be the bridges that connect the ethernet devices.
>
> Can you confirm/deny that you have only ever seen these issues with eth0?
>
> I won't suggest specifically trying 4.5 (which use driver 3.64) or my test
> kernels (which use driver 3.73)
>
> http://people.redhat.com/agospoda/#rhel4
>
> since neither seem to contain any additional changes in the tg3 driver that
> would affect this specific system, so I doubt they will be instant fixes.
>
> If you can try one of my test kernels on one of your systems and manage to
> reliably recreate the issue that will be helpful too. That way we'll know for
> sure this is fixed when I add a patch to account for this specific hardware.
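
The request to recreate the issue reliably, together with the earlier observation that one event appeared after about 20 "ls -lR" retries, suggests a simple repeat-and-time loop. The following is a rough sketch under stated assumptions: the helper name `repeat_command`, the default attempt count, and the two-minute timeout (chosen to match the ~2 minute inaccessibility mentioned in the description) are all illustrative, and the directory being listed is not named anywhere in this report.

```python
import subprocess
import time


def repeat_command(cmd, attempts=20, timeout_s=120.0):
    """Run cmd repeatedly and record how each attempt ended.

    Returns a list of (attempt_number, elapsed_seconds, status) tuples,
    where status is "ok", "exit N" for a non-zero exit code, or
    "timeout" if the command did not finish within timeout_s.
    """
    results = []
    for attempt in range(1, attempts + 1):
        start = time.monotonic()
        try:
            proc = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
            status = "ok" if proc.returncode == 0 else "exit %d" % proc.returncode
        except subprocess.TimeoutExpired:
            # The listing hung longer than the allowed time, which is
            # what a stalled network interface would look like from here.
            status = "timeout"
        results.append((attempt, time.monotonic() - start, status))
    return results
```

A call such as `repeat_command(["ls", "-lR", "/path/to/listed/directory"])` (the path is hypothetical) returns one tuple per attempt; slow or timed-out attempts can then be compared against the timestamps of the tg3 messages in the kernel log.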
class: NETWORK
bus: PCI
detached: 0
device: eth1
driver: tg3
desc: "Broadcom Corporation NetXtreme BCM5700 Gigabit Ethernet"
network.hwaddr: 00:0B:DB:E6:C6:C6
vendorId: 14e4
deviceId: 1644
subVendorId: 1028
subDeviceId: 0109
pciType: 1
pcidom: 0
pcibus: a
pcidev: 2
pcifn: 0

class: NETWORK
bus: PCI
detached: 0
device: eth0
driver: tg3
desc: "Broadcom Corporation NetXtreme BCM5700 Gigabit Ethernet"
network.hwaddr: 00:0B:DB:E6:C6:C5
vendorId: 14e4
deviceId: 1644
subVendorId: 1028
subDeviceId: 0109
pciType: 1
pcidom: 0
pcibus: a
pcidev: 1
pcifn: 0

0a:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5700 Gigabit Ethernet (rev 14)
        Subsystem: Dell Broadcom BCM5700 1000Base-T
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (16000ns min), Cache Line Size 10
        Interrupt: pin A routed to IRQ 225
        Region 0: Memory at eff10000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [40] PCI-X non-bridge device.
                Command: DPERE- ERO- RBC=0 OST=0
                Status: Bus=255 Dev=31 Func=1 64bit+ 133MHz+ SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: 1c7c706b84a912ac Data: ae76

I tried Broadcom's driver which is 3.71b and Dell's driver which is 3.75d and they both have the same problem. Any suggestions?

Well, my test kernels now have patches to 3.77.

http://people.redhat.com/agospoda/#rhel4

The latest upstream is 3.79 and the changes to get there are quite small, but will hopefully appear in my test kernels soon.

What is the largesmp kernel?

# up2date kernel-largesmp

Fetching Obsoletes list for channel: rhel-i386-as-4...

Fetching rpm headers...
Name Version Rel
----------------------------------------------------------

The following packages you requested were not found:
kernel-largesmp

(In reply to comment #13)
> What is the largesmp kernel?
>
> # up2date kernel-largesmp
>
> Fetching Obsoletes list for channel: rhel-i386-as-4...
>
> Fetching rpm headers...
>
> Name Version Rel
> ----------------------------------------------------------
>
>
> The following packages you requested were not found:
> kernel-largesmp
>

The test packages are not available on RHN; they are available on my people page:

http://people.redhat.com/agospoda/#rhel4

I understand that but what is it? Is it for systems with > 4 GB but less than 12 GB, or what?

No answer?

Sorry 'bout that. On RHEL4 largesmp kernels are for systems with a large number of processors. On x86_64 it increases the number of CPUs supported from 8 to 64, on ppc64 it increases the number of CPUs supported from 64 to 128, and on ia64 it increases the number of supported CPUs from 64 to 512. Hope that helps.

This bug has seen no activity in quite a while, so I can only presume it is no longer a problem. If there is still an issue that needs to be resolved, please re-open this bug and I will be happy to help resolve it. Thank you.

Hi, Original Poster here. I could not reliably reproduce the bug. The machine (no hardware modifications since the report except additional RAM) is now running RH4 with 2.6.9-67.0.1.ELsmp, and the hardware is stable.

Thanks for the feedback, David. Don't hesitate to open another bug or reopen this one if the problem appears again on your current or any later kernel.