509759 – tg3 (BCM5906M Fast Ethernet) TX queue locks up

Summary: tg3 (BCM5906M Fast Ethernet) TX queue locks up

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	11
Hardware:	All
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	513462

Reported:	2009-07-05 20:41 UTC by Andreas Thienemann
Modified:	2010-06-28 13:29 UTC
CC List:	8 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2010-06-28 13:29:37 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
pcap file from the local machine with the tg3 card experiencing the hang (49.92 KB, application/octet-stream) 2009-07-06 21:19 UTC, Andreas Thienemann
pcap file from the remote machine (101.87 KB, application/octet-stream) 2009-07-06 21:19 UTC, Andreas Thienemann
tg3: Check and report MTU violations (1.43 KB, patch) 2009-07-08 01:32 UTC, Matt Carlson

Description Andreas Thienemann 2009-07-05 20:41:46 UTC

Description of problem:
When transmitting a file (e.g. with scp), the connection stalls after a short while (about 1 MB in the example below) and the machine cannot send any data until the tg3 module is unloaded and loaded again.


Version-Release number of selected component (if applicable):
2.6.29.5-191.fc11.i686.PAE

How reproducible:
Always

Steps to Reproduce:
1. Verify network is working:
  [andreas@pfy F-11]$ ping 192.168.2.1
  PING 192.168.2.1 (192.168.2.1) 56(84) bytes of data.
  64 bytes from 192.168.2.1: icmp_seq=1 ttl=64 time=0.961 ms
  ^C
  --- 192.168.2.1 ping statistics ---
  1 packets transmitted, 1 received, 0% packet loss, time 454ms
  rtt min/avg/max/mdev = 0.961/0.961/0.961/0.000 ms
2. Transmit data:
  [andreas@pfy F-11]$ scp wine-1.1.25-1.fc11.1.src.rpm root@xenbuilder64:
  wine-1.1.25-1.fc11.1.src.rpm            6% 1056KB 150.7KB/s - stalled -
  ^CKilled by signal 2.
3. Verify network is down:
  [andreas@pfy F-11]$ ping 192.168.2.1
  PING 192.168.2.1 (192.168.2.1) 56(84) bytes of data.
  ^C
  --- 192.168.2.1 ping statistics ---
  2 packets transmitted, 0 received, 100% packet loss, time 1677ms


Actual results:
Network down

Expected results:
File transmitted, network still up.

Additional info:
02:00.0 Ethernet controller: Broadcom Corporation NetLink BCM5906M Fast Ethernet PCI Express (rev 02)

dmesg shows nothing unusual:

tg3.c:v3.97 (December 10, 2008)
tg3 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
tg3 0000:02:00.0: setting latency timer to 64
tg3 0000:02:00.0: PME# disabled
eth0: Tigon3 [partno(BCM95906) rev c002] (PCI Express) MAC address 00:23:8b:18:a5:9c
eth0: attached PHY is 5906 (10/100Base-TX Ethernet) (WireSpeed[0])
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[0]
eth0: dma_rwctrl[76180000] dma_mask[64-bit]
tg3 0000:02:00.0: PME# disabled
tg3 0000:02:00.0: irq 28 for MSI/MSI-X
ADDRCONF(NETDEV_UP): eth0: link is not ready
tg3: eth0: Link is up at 100 Mbps, full duplex.
tg3: eth0: Flow control is on for TX and on for RX.
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
eth0: no IPv6 routers present

Comment 1 Andy Gospodarek 2009-07-06 20:16:47 UTC

Do you know for sure this is a transmit problem and not a receive issue?  If you can capture the network traffic when the scp hangs we can help determine if that is the case.

Does disabling TSO (ethtool -K eth0 tso off) change this?

Comment 2 Andy Gospodarek 2009-07-06 20:23:37 UTC

Feel free to try any of the newer kernels located here as well if you like:

http://koji.fedoraproject.org/koji/packageinfo?packageID=8

Comment 3 Andreas Thienemann 2009-07-06 21:17:25 UTC

I think it is a transmit problem because netstat shows the send-queue filling up:

Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
[...]
tcp        0 101552 192.168.2.131:46798         193.7.176.205:22            ESTABLISHED
[...]

I have attached two pcap dump files: local.pcap, captured on the local machine with the tg3 interface, and remote.pcap, captured on the remote system, which is unaffected.

Sadly, turning off TCP segmentation offload (TSO) doesn't seem to change anything.
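
For anyone wanting to check for the same symptom, the netstat observation above can be automated roughly as follows. This is only a sketch: the function name `stalled_connections` and the 64 KiB threshold are assumptions made for illustration, and a single large Send-Q snapshot is a hint rather than proof, so the check is best repeated a few seconds apart to see whether the queue ever drains.

```python
def stalled_connections(netstat_output, threshold=65536):
    """Return (local, remote, send_q) for TCP rows whose Send-Q exceeds threshold.

    Expects text in the column layout printed by `netstat -tn`:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    The threshold is an arbitrary illustrative value, not a kernel constant.
    """
    stalled = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, "[...]" placeholders and non-TCP rows.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        try:
            send_q = int(fields[2])
        except ValueError:
            continue
        if send_q > threshold:
            stalled.append((fields[3], fields[4], send_q))
    return stalled


# Sample taken from the netstat output quoted in this comment.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
[...]
tcp        0 101552 192.168.2.131:46798         193.7.176.205:22            ESTABLISHED
[...]
"""

for local, remote, send_q in stalled_connections(SAMPLE):
    print(f"{local} -> {remote}: {send_q} bytes queued for sending")
```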

Comment 4 Andreas Thienemann 2009-07-06 21:19:02 UTC

Created attachment 350684
pcap file from the local machine with the tg3 card experiencing the hang

Comment 5 Andreas Thienemann 2009-07-06 21:19:35 UTC

Created attachment 350685
pcap file from the remote machine

Comment 6 Andy Gospodarek 2009-07-06 21:21:32 UTC

Thanks, I'll check 'em out.

Comment 7 Andreas Thienemann 2009-07-06 21:53:43 UTC

Newer kernel (2.6.31-0.42.rc2.fc12.i686.PAE) doesn't seem to do any good either.

Comment 8 Andy Gospodarek 2009-07-07 01:24:49 UTC

Thanks for adding those details.  I'll get the folks at Broadcom involved as well since this sounds like something they will want to know about.

Comment 9 Matt Carlson 2009-07-07 01:41:38 UTC

Yes.  This seems to be a growing problem.  Andreas, to get you up and running, try turning off scatter gather through ethtool.  `ethtool -K eth0 sg off`.  I'm pretty sure you'll be able to pass traffic after that.

Can you tell me if you know of a kernel where this device worked without the above workaround though?
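
If the workaround has to be reapplied after every boot or module reload, it can be scripted. The sketch below only builds the command lines and prints them; the helper name `offload_commands` is hypothetical. The `ethtool -K <iface> <feature> on|off` form is the one used in this report. Actually running the commands needs root, for example via `subprocess.run(cmd, check=True)`.

```python
def offload_commands(iface, features=("sg",), state="off"):
    """Build one `ethtool -K` argument list per offload feature.

    Nothing is executed here; the caller decides whether to run the
    commands (root privileges are required to change offload settings).
    """
    if state not in ("on", "off"):
        raise ValueError("state must be 'on' or 'off'")
    return [["ethtool", "-K", iface, feature, state] for feature in features]


# Print the workaround from this comment plus the TSO switch from comment 1.
for cmd in offload_commands("eth0", ("sg", "tso")):
    print(" ".join(cmd))
```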

Comment 10 Andreas Thienemann 2009-07-07 08:50:37 UTC

Matt, thanks for the workaround. Works nicely. If it's any help to you: I found that the card only locks up when transferring data "directly" through the eth device. If I copy the same data through an OpenVPN tunnel, which does fragmentation on its own, the data is transferred without problems.

I haven't yet found a kernel where the problem doesn't manifest itself. It does not seem to be limited to Fedora/RH kernels, though: Debian-based distros have the same problem. In their case the kernel is 2.6.28.10.

Comment 11 Andy Gospodarek 2009-07-07 13:08:20 UTC

(In reply to comment #9)
> Yes.  This seems to be a growing problem.  Andreas, to get you up and running,
> try turning off scatter gather through ethtool.  `ethtool -K eth0 sg off`.  I'm
> pretty sure you'll be able to pass traffic after that.
> 
> Can you tell me if you know of a kernel where this device worked without the
> above workaround though?  

Thanks for the help on this, Matt.  Have you been able to reproduce it?

Comment 12 Matt Carlson 2009-07-07 16:43:30 UTC

Andreas, thanks for the VPN tip.  I'll look into that.

Andy, no I haven't been able to repro it.  It only seems to afflict certain platforms which I can't seem to get my hands on.

Comment 13 Andreas Thienemann 2009-07-07 23:32:45 UTC

Matt, I have a hunch that it could be fragmentation related. Take a look at the local pcap file: you can see packets being sent out with a much larger size than the MTU would suggest. The VPN packets conform to the MTU.

About the platform: I'm hitting this issue on a Lenovo IdeaPad S10e, which should not be that uncommon a platform.
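
The oversized packets mentioned above can be listed without opening the capture in a GUI. The following is a rough sketch under several assumptions: the file is in the classic pcap format with microsecond timestamps (not pcapng), the link type is plain Ethernet without VLAN tags, and the MTU is the default 1500. The function name `oversized_frames` is made up for this example. Keep in mind that a capture taken on the sending host can show frames larger than the MTU whenever segmentation offload is in effect, so a hit here is a hint rather than proof of a problem on the wire.

```python
import struct

ETH_HEADER_LEN = 14  # Ethernet II header, assuming no VLAN tag


def oversized_frames(pcap_bytes, mtu=1500):
    """Return [(frame_index, original_length)] for frames above mtu + header."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written on a big-endian machine
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")

    result = []
    offset = 24  # size of the pcap global header
    index = 0
    while offset + 16 <= len(pcap_bytes):
        # Per-record header: ts_sec, ts_usec, captured length, original length.
        _sec, _usec, incl_len, orig_len = struct.unpack_from(
            endian + "IIII", pcap_bytes, offset
        )
        offset += 16 + incl_len
        if orig_len > mtu + ETH_HEADER_LEN:
            result.append((index, orig_len))
        index += 1
    return result


# Usage against the attached capture (file name as given in comment 3):
#     with open("local.pcap", "rb") as fh:
#         for index, length in oversized_frames(fh.read()):
#             print(f"frame {index}: {length} bytes")
```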

Comment 14 Matt Carlson 2009-07-08 01:32:53 UTC

Created attachment 350883
tg3: Check and report MTU violations

This patch detects MTU and MSS transmit violations.

Andreas, can you apply this patch against a recent kernel tree and see if it tells you anything interesting?  (The patch was generated against the latest net-next tree.)  If it does, there is a problem somewhere upstream of the driver.  If not, then it would point to a DMA corruption problem.

Some other things to try if we suspect a DMA problem:

1) If you can, turn off clkreq in the BIOS.
2) Start the kernel with ASPM disabled.

Comment 15 Andreas Thienemann 2009-07-10 23:51:37 UTC

Matt, thanks for the patch. I'll try it today.

About your suggestions: I cannot change the clkreq setting in the BIOS, as the Lenovo IdeaPad's BIOS is completely locked down.
I'll try the ASPM setting later today, however.

In the meantime I've seen the following kernel warning, but I do not know whether it is related:

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xb0/0x105() (Not tainted)
Hardware name: 40684WG         
NETDEV WATCHDOG: eth0 (tg3): transmit timed out
Modules linked in: fuse rfcomm sco bridge stp llc bnep l2cap sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 cpufreq_ondemand acpi_cpufreq dm_multipath uinput snd_hda_codec_realtek arc4 snd_hda_intel ecb snd_hda_codec iTCO_wdt uvcvideo snd_hwdep pcspkr joydev i2c_i801 iTCO_vendor_support usb_storage iwlagn videodev snd_pcm v4l1_compat iwlcore btusb snd_timer bluetooth lib80211 snd mac80211 tg3 soundcore cfg80211 snd_page_alloc ata_generic pata_acpi xts gf128mul aes_i586 aes_generic dm_crypt i915 drm i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan]
Pid: 2772, comm: firefox Not tainted 2.6.29.5-191.fc11.i686.PAE #1
Call Trace:
 [<c0435266>] warn_slowpath+0x7c/0xa4
 [<c0425c59>] ? kmap_atomic_prot+0x200/0x206
 [<c0716589>] ? _spin_lock+0xd/0x10
 [<c04ba1d4>] ? mnt_drop_write+0x61/0xf6
 [<c04b6e9b>] ? touch_atime+0xba/0xd5
 [<c04ae8d1>] ? pipe_read+0x2ca/0x2d5
 [<c044c9dc>] ? clocksource_read+0xc/0xf
 [<c042149f>] ? default_spin_lock_flags+0x8/0xd
 [<c07168e5>] ? _spin_lock_irqsave+0x30/0x37
 [<c06a0401>] dev_watchdog+0xb0/0x105
 [<c06a0351>] ? dev_watchdog+0x0/0x105
 [<c043da24>] ? run_timer_softirq+0x110/0x1c0
 [<c0449bb3>] ? ktime_get_ts+0x4f/0x53
 [<c062fc6b>] ? rh_timer_func+0x0/0xf
 [<c06a0351>] ? dev_watchdog+0x0/0x105
 [<c043da64>] run_timer_softirq+0x150/0x1c0
 [<c06a0351>] ? dev_watchdog+0x0/0x105
 [<c0439f39>] __do_softirq+0x99/0x139
 [<c043a02b>] do_softirq+0x52/0x7e
 [<c043a196>] irq_exit+0x49/0x77
 [<c0419de7>] smp_apic_timer_interrupt+0x6e/0x7c
 [<c0409c61>] apic_timer_interrupt+0x2d/0x34
---[ end trace 491e35de562cbfe7 ]---
tg3: eth0: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[0000000b] MAC_RX_STATUS[00000000]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth0: Link is down.
tg3: eth0: Link is up at 100 Mbps, full duplex.
tg3: eth0: Flow control is on for TX and on for RX.

Comment 16 Matt Carlson 2009-07-13 18:13:25 UTC

The MAC_TX_STATUS value is showing that the link is up, that the device is sending xoff flow control messages and the rx side is currently xoffed too.  This can be a normal condition, or it can be showing the hardware is wedged somehow.  The DMA status registers are not reporting any problems though.

The tg3 driver does not appear in the call trace.  The tx_timeout routine just happened to fire at the time the problem was captured.  That isn't to say the two aren't somehow related.  I just don't have enough visibility to draw further conclusions.

If the problem is reproducible, try turning off flow control and see if it helps.
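
To make the MAC_TX_STATUS reading above easier to repeat for other dumps, here is a small decoding sketch. The bit names and values are an assumption based on the TX_STATUS_* definitions as I recall them from the driver's tg3.h; they should be verified against the header of the driver version actually in use. The helper name `decode_tx_status` is made up for this example.

```python
# Assumed bit layout of MAC_TX_STATUS (to be checked against tg3.h).
TX_STATUS_BITS = {
    0x01: "TX_STATUS_XOFFED",
    0x02: "TX_STATUS_SENT_XOFF",
    0x04: "TX_STATUS_SENT_XON",
    0x08: "TX_STATUS_LINK_UP",
    0x10: "TX_STATUS_ODI_UNDERRUN",
    0x20: "TX_STATUS_ODI_OVERRUN",
}


def decode_tx_status(value):
    """Return the names of all known bits set in value, lowest bit first.

    Bits outside the assumed table are reported as one hex string so
    that nothing is silently dropped.
    """
    names = [name for bit, name in sorted(TX_STATUS_BITS.items()) if value & bit]
    unknown = value & ~sum(TX_STATUS_BITS)  # sum of the keys = mask of known bits
    if unknown:
        names.append(f"unknown bits 0x{unknown:x}")
    return names


# Value from the debug line in comment 15: MAC_TX_STATUS[0000000b]
print(decode_tx_status(0x0000000B))
```

Under these assumptions, 0x0000000b decodes to the link-up bit plus the two XOFF-related bits, which matches the interpretation given in this comment.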

Comment 17 Adam Pribyl 2009-08-03 06:21:26 UTC

I also have an IBM machine here (VIA C7, CN896, VT8251) with a Broadcom NetLink BCM5906M PCI Express. Booting any Linux kernel newer than 2.6.27.8 causes the system to go completely crazy (it hangs and starts writing garbage to the console). With 2.6.27.8 I can see

+------ PCI-Express Device Error ------+
Error Severity          : Uncorrected (Fatal)
PCIE Bus Error type     : Transaction Layer
Flow Control Protocol   : First
Receiver ID             : 8000
VendorID=1106h, DeviceID=287ch, Bus=80h, Device=00h, Function=00h
pcieport-driver 0000:80:00.0: broadcast error_detected message
tg3 0000:81:00.0: device has no AER-aware driver
pcieport-driver 0000:80:00.0: Root Port link has been reset
pcieport-driver 0000:80:00.0: broadcast mmio_enabled message
pcieport-driver 0000:80:00.0: broadcast resume message
pcieport-driver 0000:80:00.0: AER driver successfully recovered

but the system is at least running.

Comment 18 George Billios 2009-08-12 15:35:56 UTC

Hello, 

Maybe the following error, which I get with a different driver (skge), has a common cause?

Aug 12 18:18:59 CPC464 kernel: ------------[ cut here ]------------
Aug 12 18:18:59 CPC464 kernel: WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xb0/0x105()
Aug 12 18:18:59 CPC464 kernel: Hardware name: System Product Name
Aug 12 18:18:59 CPC464 kernel: NETDEV WATCHDOG: eth0 (skge): transmit timed out
Aug 12 18:18:59 CPC464 kernel: Modules linked in: nfs lockd nfs_acl auth_rpcgss sunrpc fuse rfcomm sco bridge stp llc bnep l2cap autofs4 coretemp it87 hwmon_vid ipv6 dm_multipath uinput snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_event snd_seq_midi_emul snd_seq snd_emu10k1 nvidia(P) snd_rawmidi snd_ac97_codec ac97_bus snd_pcm snd_seq_device snd_timer snd_page_alloc snd_util_mem btusb ppdev snd_hwdep firewire_ohci usb_storage snd i2c_nforce2 bluetooth parport_pc firewire_core asus_atk0110 skge emu10k1_gp serio_raw i2c_core parport gameport soundcore pata_amd hwmon crc_itu_t ata_generic pata_acpi sata_nv [last unloaded: microcode]
Aug 12 18:18:59 CPC464 kernel: Pid: 9, comm: events/0 Tainted: P           2.6.30.4 #3
Aug 12 18:18:59 CPC464 kernel: Call Trace:
Aug 12 18:18:59 CPC464 kernel: [<c042a0ec>] warn_slowpath_common+0x6a/0x81
Aug 12 18:18:59 CPC464 kernel: [<c067581a>] ? dev_watchdog+0xb0/0x105
Aug 12 18:18:59 CPC464 kernel: [<c042a141>] warn_slowpath_fmt+0x29/0x2c
Aug 12 18:18:59 CPC464 kernel: [<c067581a>] dev_watchdog+0xb0/0x105
Aug 12 18:18:59 CPC464 kernel: [<c04325a9>] ? __mod_timer+0xa6/0xb0
Aug 12 18:18:59 CPC464 kernel: [<c04320f0>] ? internal_add_timer+0x93/0x97
Aug 12 18:18:59 CPC464 kernel: [<c067576a>] ? dev_watchdog+0x0/0x105
Aug 12 18:18:59 CPC464 kernel: [<c067576a>] ? dev_watchdog+0x0/0x105
Aug 12 18:18:59 CPC464 kernel: [<c04322d5>] run_timer_softirq+0x141/0x1a4
Aug 12 18:18:59 CPC464 kernel: [<c067576a>] ? dev_watchdog+0x0/0x105
Aug 12 18:18:59 CPC464 kernel: [<c042e6ce>] __do_softirq+0x9a/0x14b
Aug 12 18:18:59 CPC464 kernel: [<c042e7af>] do_softirq+0x30/0x48
Aug 12 18:18:59 CPC464 kernel: [<c042e8ba>] irq_exit+0x3a/0x68
Aug 12 18:18:59 CPC464 kernel: [<c041132c>] smp_apic_timer_interrupt+0x6d/0x7b
Aug 12 18:18:59 CPC464 kernel: [<c0403407>] apic_timer_interrupt+0x2f/0x34
Aug 12 18:18:59 CPC464 kernel: [<c043b41e>] ? finish_wait+0x4a/0x4e
Aug 12 18:18:59 CPC464 kernel: [<c0437f50>] worker_thread+0xa2/0x1c0
Aug 12 18:18:59 CPC464 kernel: [<c05aff22>] ? flush_to_ldisc+0x0/0x160
Aug 12 18:18:59 CPC464 kernel: [<c043b2f1>] ? autoremove_wake_function+0x0/0x34
Aug 12 18:18:59 CPC464 kernel: [<c0437eae>] ? worker_thread+0x0/0x1c0
Aug 12 18:18:59 CPC464 kernel: [<c043aff7>] kthread+0x4b/0x6f
Aug 12 18:18:59 CPC464 kernel: [<c043afac>] ? kthread+0x0/0x6f
Aug 12 18:18:59 CPC464 kernel: [<c040355b>] kernel_thread_helper+0x7/0x10
Aug 12 18:18:59 CPC464 kernel: ---[ end trace bedb9a5728846903 ]---
Aug 12 18:19:07 CPC464 init: tty4 main process (1237) killed by TERM signal
Aug 12 18:19:07 CPC464 init: tty5 main process (1238) killed by TERM signal
Aug 12 18:19:07 CPC464 init: tty2 main process (1239) killed by TERM signal
Aug 12 18:19:07 CPC464 init: tty3 main process (1241) killed by TERM signal
Aug 12 18:19:07 CPC464 init: tty6 main process (1242) killed by TERM signal
Aug 12 18:19:07 CPC464 NetworkManager: <WARN>  nm_signal_handler(): Caught signal 15, shutting down normally.
Aug 12 18:19:07 CPC464 NetworkManager: <info>  (eth0): now unmanaged
Aug 12 18:19:07 CPC464 NetworkManager: <info>  (eth0): device state change: 8 -> 1 (reason 36)
Aug 12 18:19:07 CPC464 NetworkManager: <info>  (eth0): deactivating device (reason: 36).
Aug 12 18:19:08 CPC464 NetworkManager: <info>  (eth0): canceled DHCP transaction, dhcp client pid 1318
Aug 12 18:19:08 CPC464 NetworkManager: <WARN>  check_one_route(): (eth0) error -34 returned from rtnl_route_del(): Sucess#012
Aug 12 18:19:08 CPC464 NetworkManager: <info>  (eth0): cleaning up...
Aug 12 18:19:08 CPC464 NetworkManager: <info>  (eth0): taking down device.
Aug 12 18:19:08 CPC464 avahi-daemon[1044]: Withdrawing address record for 192.168.2.4 on eth0.
Aug 12 18:19:08 CPC464 avahi-daemon[1044]: Leaving mDNS multicast group on interface eth0.IPv4 with address 192.168.2.4.
Aug 12 18:19:08 CPC464 NetworkManager: <info>  exiting (success)
Aug 12 18:19:08 CPC464 kernel: skge eth0: disabling interface
Aug 12 18:19:08 CPC464 avahi-daemon[1044]: Interface eth0.IPv4 no longer relevant for mDNS.
Aug 12 18:19:08 CPC464 avahi-daemon[1044]: Withdrawing address record for fe80::222:b0ff:fee7:4771 on eth0.
Aug 12 18:19:09 CPC464 nmbd[996]: [2009/08/12 18:19:09,  0] nmbd/nmbd.c:reload_interfaces(288)
Aug 12 18:19:09 CPC464 nmbd[996]:   reload_interfaces: No subnets to listen to. Waiting..
Aug 12 18:19:09 CPC464 smartd[1232]: smartd received signal 15: Terminated
Aug 12 18:19:09 CPC464 smartd[1232]: smartd is exiting (exit status 0)
Aug 12 18:19:09 CPC464 avahi-daemon[1044]: Got SIGTERM, quitting.

After this I can only shut down the PC by pressing the power button, since everything is stalled and no network interface is up.

Comment 19 George Billios 2009-08-16 18:11:00 UTC

Andreas, I don't know what MTU you are using, but in my case I get this error with an MTU of 9000 bytes. For the last few days I have been testing with 8k and 7k and haven't been able to reproduce it.

Can you also try it?

Comment 20 George Billios 2009-08-29 21:45:06 UTC

In the end, MTU was not the problem. I traced the issue to IRQ sharing between the Ethernet card and the graphics card.

The issue was solved by booting with the noapic parameter and moving the PCI Ethernet card to a different slot.

Comment 21 Chuck Ebbert 2009-11-02 22:33:30 UTC

Is this the same problem as bug 527209 in F-12?

Comment 22 Andreas Thienemann 2009-11-02 22:45:00 UTC

(In reply to comment #21)
> Is this the same problem as bug 527209 in F-12?  

Looks pretty much related.

Can try this out on one notebook here but will take a bit of time.

Comment 23 Matt Carlson 2009-11-05 02:29:10 UTC

Chuck, does the fix for bug 527209 also fix this problem?

Comment 24 Bug Zapper 2010-04-27 15:30:18 UTC

This message is a reminder that Fedora 11 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 11.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 11 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 25 Bug Zapper 2010-06-28 13:29:37 UTC

Fedora 11 changed to end-of-life (EOL) status on 2010-06-25. Fedora 11 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
