Description of problem:
When transmitting a file (e.g. scp) the connection stalls after some bytes and the machine is unable to send any data until the tg3 module is removed and loaded again.

Version-Release number of selected component (if applicable):
2.6.29.5-191.fc11.i686.PAE

How reproducible:
Always

Steps to Reproduce:
1. Verify network is working:
[andreas@pfy F-11]$ ping 192.168.2.1
PING 192.168.2.1 (192.168.2.1) 56(84) bytes of data.
64 bytes from 192.168.2.1: icmp_seq=1 ttl=64 time=0.961 ms
^C
--- 192.168.2.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 454ms
rtt min/avg/max/mdev = 0.961/0.961/0.961/0.000 ms

2. Transmit data:
[andreas@pfy F-11]$ scp wine-1.1.25-1.fc11.1.src.rpm root@xenbuilder64:
wine-1.1.25-1.fc11.1.src.rpm 6% 1056KB 150.7KB/s - stalled -
^CKilled by signal 2.

3. Verify network is down:
[andreas@pfy F-11]$ ping 192.168.2.1
PING 192.168.2.1 (192.168.2.1) 56(84) bytes of data.
^C
--- 192.168.2.1 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1677ms

Actual results:
Network down

Expected results:
File transmitted, network still up.

Additional info:
02:00.0 Ethernet controller: Broadcom Corporation NetLink BCM5906M Fast Ethernet PCI Express (rev 02)

dmesg is basically empty:
tg3.c:v3.97 (December 10, 2008)
tg3 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
tg3 0000:02:00.0: setting latency timer to 64
tg3 0000:02:00.0: PME# disabled
eth0: Tigon3 [partno(BCM95906) rev c002] (PCI Express) MAC address 00:23:8b:18:a5:9c
eth0: attached PHY is 5906 (10/100Base-TX Ethernet) (WireSpeed[0])
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[0]
eth0: dma_rwctrl[76180000] dma_mask[64-bit]
tg3 0000:02:00.0: PME# disabled
tg3 0000:02:00.0: irq 28 for MSI/MSI-X
ADDRCONF(NETDEV_UP): eth0: link is not ready
tg3: eth0: Link is up at 100 Mbps, full duplex.
tg3: eth0: Flow control is on for TX and on for RX.
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
eth0: no IPv6 routers present
Do you know for sure this is a transmit problem and not a receive issue? If you can capture the network traffic when the scp hangs we can help determine if that is the case. Does disabling TSO (ethtool -K eth0 tso off) change this?
Feel free to try any of the newer kernels located here as well if you like: http://koji.fedoraproject.org/koji/packageinfo?packageID=8
I think it is a transmit problem because netstat shows the send queue filling up:

Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
[...]
tcp 0 101552 192.168.2.131:46798 193.7.176.205:22 ESTABLISHED
[...]

I have attached two pcap dump files: local.pcap, captured on the local machine with the tg3 interface, and remote.pcap, captured on the remote system, which is unaffected. Sadly, turning TCP segmentation offload (TSO) off doesn't seem to change anything.
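The Send-Q check above can be automated. The following is only a rough sketch: the function name `stalled_sockets` and the 64 KiB threshold are hypothetical choices, and the second data row in the sample is invented to show a healthy socket next to the one reported in this comment.

```python
def stalled_sockets(netstat_output, min_send_q=65536):
    """Return (local, remote, send_q) for TCP sockets with a large Send-Q.

    min_send_q is an arbitrary illustrative threshold, not a kernel constant.
    """
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows look like: tcp Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if not fields[2].isdigit():
            continue
        send_q = int(fields[2])
        if send_q >= min_send_q:
            hits.append((fields[3], fields[4], send_q))
    return hits


# First data row is from this report; the second row is made up.
sample = """Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 101552 192.168.2.131:46798 193.7.176.205:22 ESTABLISHED
tcp 0 0 192.168.2.131:50000 192.168.2.1:80 ESTABLISHED"""

print(stalled_sockets(sample))
```

A Send-Q that stays high across several samples suggests data is queued locally but not leaving the host, which is what points at the transmit side here.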
Created attachment 350684 [details] pcap file from the local machine with the tg3 card experiencing the hang
Created attachment 350685 [details] pcap file from the remote machine
Thanks, I'll check 'em out.
Newer kernel (2.6.31-0.42.rc2.fc12.i686.PAE) doesn't seem to do any good either.
Thanks for adding those details. I'll get the folks at Broadcom involved as well since this sounds like something they will want to know about.
Yes. This seems to be a growing problem. Andreas, to get you up and running, try turning off scatter gather through ethtool. `ethtool -K eth0 sg off`. I'm pretty sure you'll be able to pass traffic after that. Can you tell me if you know of a kernel where this device worked without the above workaround though?
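For anyone scripting the workaround, here is a minimal sketch that builds the two `ethtool -K` invocations mentioned so far in this bug (`tso off` from comment 1 and `sg off` from this comment). The helper names `offload_workaround_cmds` and `apply` are hypothetical, and the script defaults to a dry run because changing offload settings needs root.

```python
import subprocess


def offload_workaround_cmds(iface="eth0"):
    """Build the ethtool commands discussed in this bug for one interface."""
    return [
        ["ethtool", "-K", iface, "tso", "off"],  # suggested in comment 1
        ["ethtool", "-K", iface, "sg", "off"],   # scatter-gather workaround
    ]


def apply(cmds, dry_run=True):
    """Print the commands, or run them when dry_run is False (needs root)."""
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


apply(offload_workaround_cmds())
```

The settings do not survive a reboot or a driver reload, so they would have to be reapplied after `tg3` is removed and loaded again.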
Matt, thanks for the workaround. Works nicely. If it's any help to you: I found that the card only locks up when transferring data "directly" through the eth device. If I copy the same data through an OpenVPN tunnel, which does its own fragmentation, the data is transferred without problems. I haven't yet found a kernel where the problem doesn't manifest itself. It does not seem to be limited to Fedora/RH kernels, though: Debian-based distros have the same problem. In their case the kernel is 2.6.28.10.
(In reply to comment #9)
> Yes. This seems to be a growing problem. Andreas, to get you up and running,
> try turning off scatter gather through ethtool. `ethtool -K eth0 sg off`. I'm
> pretty sure you'll be able to pass traffic after that.
>
> Can you tell me if you know of a kernel where this device worked without the
> above workaround though?

Thanks for the help on this, Matt. Have you been able to reproduce it?
Andreas, thanks for the VPN tip. I'll look into that. Andy, no I haven't been able to repro it. It only seems to afflict certain platforms which I can't seem to get my hands on.
Matt, I have a hunch that it could be fragmentation related. Take a look at the local pcap file: you can see packets being sent out with a much larger size than the MTU would suggest. The VPN packets conform to the MTU. About the platform: I'm hitting this issue on a Lenovo IdeaPad S10e, which should not be that uncommon a platform.
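The comparison against the MTU can be sketched as follows. This assumes a 1500-byte MTU (the thread does not state Andreas's MTU) and plain Ethernet framing without VLAN tags; the frame lengths are hypothetical stand-ins for values that would be read from local.pcap.

```python
ETH_HEADER_LEN = 14  # destination MAC + source MAC + EtherType, no VLAN tag


def oversized_frames(frame_lengths, mtu=1500):
    """Return (index, length) for captured frames longer than MTU + L2 header."""
    limit = mtu + ETH_HEADER_LEN
    return [(i, n) for i, n in enumerate(frame_lengths) if n > limit]


# Hypothetical captured frame lengths in bytes, not taken from the attachment.
lengths = [66, 1514, 2962, 4410, 1514]

print(oversized_frames(lengths))
```

One caveat: a capture taken on the sending host can show frames larger than the MTU whenever segmentation is deferred (TSO or GSO), because the capture point sits before the segmentation step. Oversized frames in the local capture are therefore a hint rather than proof; the remote capture shows what actually went over the wire.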
Created attachment 350883 [details] tg3: Check and report MTU violations

This patch detects MTU and MSS transmit violations. Andreas, can you apply this patch against a recent kernel tree and see if it tells you anything interesting? (The patch was generated against the latest net-next tree.) If it does, there is a problem somewhere upstream of the driver. If not, then it would point to a DMA corruption problem.

Some other things to try (if we suspect a DMA problem):
1) If you can, turn off CLKREQ in the BIOS.
2) Start the kernel with ASPM disabled.
Matt, thanks for the patch. I'll try it today. About your suggestions: I cannot change the CLKREQ setting in the BIOS, as the Lenovo IdeaPad's BIOS is locked down completely. I'll try the ASPM setting later today, however. In the meantime I've seen the following kernel warning but do not know if it is related:

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xb0/0x105() (Not tainted)
Hardware name: 40684WG
NETDEV WATCHDOG: eth0 (tg3): transmit timed out
Modules linked in: fuse rfcomm sco bridge stp llc bnep l2cap sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 cpufreq_ondemand acpi_cpufreq dm_multipath uinput snd_hda_codec_realtek arc4 snd_hda_intel ecb snd_hda_codec iTCO_wdt uvcvideo snd_hwdep pcspkr joydev i2c_i801 iTCO_vendor_support usb_storage iwlagn videodev snd_pcm v4l1_compat iwlcore btusb snd_timer bluetooth lib80211 snd mac80211 tg3 soundcore cfg80211 snd_page_alloc ata_generic pata_acpi xts gf128mul aes_i586 aes_generic dm_crypt i915 drm i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan]
Pid: 2772, comm: firefox Not tainted 2.6.29.5-191.fc11.i686.PAE #1
Call Trace:
[<c0435266>] warn_slowpath+0x7c/0xa4
[<c0425c59>] ? kmap_atomic_prot+0x200/0x206
[<c0716589>] ? _spin_lock+0xd/0x10
[<c04ba1d4>] ? mnt_drop_write+0x61/0xf6
[<c04b6e9b>] ? touch_atime+0xba/0xd5
[<c04ae8d1>] ? pipe_read+0x2ca/0x2d5
[<c044c9dc>] ? clocksource_read+0xc/0xf
[<c042149f>] ? default_spin_lock_flags+0x8/0xd
[<c07168e5>] ? _spin_lock_irqsave+0x30/0x37
[<c06a0401>] dev_watchdog+0xb0/0x105
[<c06a0351>] ? dev_watchdog+0x0/0x105
[<c043da24>] ? run_timer_softirq+0x110/0x1c0
[<c0449bb3>] ? ktime_get_ts+0x4f/0x53
[<c062fc6b>] ? rh_timer_func+0x0/0xf
[<c06a0351>] ? dev_watchdog+0x0/0x105
[<c043da64>] run_timer_softirq+0x150/0x1c0
[<c06a0351>] ? dev_watchdog+0x0/0x105
[<c0439f39>] __do_softirq+0x99/0x139
[<c043a02b>] do_softirq+0x52/0x7e
[<c043a196>] irq_exit+0x49/0x77
[<c0419de7>] smp_apic_timer_interrupt+0x6e/0x7c
[<c0409c61>] apic_timer_interrupt+0x2d/0x34
---[ end trace 491e35de562cbfe7 ]---
tg3: eth0: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[0000000b] MAC_RX_STATUS[00000000]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth0: Link is down.
tg3: eth0: Link is up at 100 Mbps, full duplex.
tg3: eth0: Flow control is on for TX and on for RX.
The MAC_TX_STATUS value is showing that the link is up, that the device is sending xoff flow control messages and the rx side is currently xoffed too. This can be a normal condition, or it can be showing the hardware is wedged somehow. The DMA status registers are not reporting any problems though. The tg3 driver does not look like it is a part of the call trace. The tx_timeout routine just happens to fire at the same time the problem was captured. That isn't to say the two aren't somehow related. I just don't have any visibility upon which to draw further insight. If the problem is reproducible, try turning off flow control and see if it helps.
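To make the reading of `MAC_TX_STATUS[0000000b]` easier to follow, here is a small decoding sketch. The bit names and values are an assumption based on the `TX_STATUS_*` definitions as they appear in the driver's `tg3.h`; please verify them against the kernel tree you are running before relying on them.

```python
# Assumed bit layout of MAC_TX_STATUS (TX_STATUS_* in tg3.h); verify locally.
TX_STATUS_BITS = {
    0x01: "XOFFED",
    0x02: "SENT_XOFF",
    0x04: "SENT_XON",
    0x08: "LINK_UP",
    0x10: "ODI_UNDERRUN",
    0x20: "ODI_OVERRUN",
}


def decode_mac_tx_status(value):
    """Return the names of the status bits set in value, lowest bit first."""
    return [name for bit, name in sorted(TX_STATUS_BITS.items()) if value & bit]


# Value taken from the debug output in the previous comment.
print(decode_mac_tx_status(0x0000000B))
```

Under that assumption, 0xb decodes to XOFFED, SENT_XOFF and LINK_UP, which matches the description in this comment.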
I also have an IBM machine here (Via C7, CN896, VT8251) with a Broadcom NetLink BCM5906M PCI Express. Booting any Linux kernel newer than 2.6.27.8 causes the system to go completely crazy (it hangs and starts to write garbage to the console). With 2.6.27.8 I can see

+------ PCI-Express Device Error ------+
Error Severity : Uncorrected (Fatal)
PCIE Bus Error type : Transaction Layer
Flow Control Protocol : First
Receiver ID : 8000
VendorID=1106h, DeviceID=287ch, Bus=80h, Device=00h, Function=00h
pcieport-driver 0000:80:00.0: broadcast error_detected message
tg3 0000:81:00.0: device has no AER-aware driver
pcieport-driver 0000:80:00.0: Root Port link has been reset
pcieport-driver 0000:80:00.0: broadcast mmio_enabled message
pcieport-driver 0000:80:00.0: broadcast resume message
pcieport-driver 0000:80:00.0: AER driver successfully recovered

but the system is at least running.
Hello, maybe the following error, which I get with a different driver, has a common cause?

Aug 12 18:18:59 CPC464 kernel: ------------[ cut here ]------------
Aug 12 18:18:59 CPC464 kernel: WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xb0/0x105()
Aug 12 18:18:59 CPC464 kernel: Hardware name: System Product Name
Aug 12 18:18:59 CPC464 kernel: NETDEV WATCHDOG: eth0 (skge): transmit timed out
Aug 12 18:18:59 CPC464 kernel: Modules linked in: nfs lockd nfs_acl auth_rpcgss sunrpc fuse rfcomm sco bridge stp llc bnep l2cap autofs4 coretemp it87 hwmon_vid ipv6 dm_multipath uinput snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_event snd_seq_midi_emul snd_seq snd_emu10k1 nvidia(P) snd_rawmidi snd_ac97_codec ac97_bus snd_pcm snd_seq_device snd_timer snd_page_alloc snd_util_mem btusb ppdev snd_hwdep firewire_ohci usb_storage snd i2c_nforce2 bluetooth parport_pc firewire_core asus_atk0110 skge emu10k1_gp serio_raw i2c_core parport gameport soundcore pata_amd hwmon crc_itu_t ata_generic pata_acpi sata_nv [last unloaded: microcode]
Aug 12 18:18:59 CPC464 kernel: Pid: 9, comm: events/0 Tainted: P 2.6.30.4 #3
Aug 12 18:18:59 CPC464 kernel: Call Trace:
Aug 12 18:18:59 CPC464 kernel: [<c042a0ec>] warn_slowpath_common+0x6a/0x81
Aug 12 18:18:59 CPC464 kernel: [<c067581a>] ? dev_watchdog+0xb0/0x105
Aug 12 18:18:59 CPC464 kernel: [<c042a141>] warn_slowpath_fmt+0x29/0x2c
Aug 12 18:18:59 CPC464 kernel: [<c067581a>] dev_watchdog+0xb0/0x105
Aug 12 18:18:59 CPC464 kernel: [<c04325a9>] ? __mod_timer+0xa6/0xb0
Aug 12 18:18:59 CPC464 kernel: [<c04320f0>] ? internal_add_timer+0x93/0x97
Aug 12 18:18:59 CPC464 kernel: [<c067576a>] ? dev_watchdog+0x0/0x105
Aug 12 18:18:59 CPC464 kernel: [<c067576a>] ? dev_watchdog+0x0/0x105
Aug 12 18:18:59 CPC464 kernel: [<c04322d5>] run_timer_softirq+0x141/0x1a4
Aug 12 18:18:59 CPC464 kernel: [<c067576a>] ? dev_watchdog+0x0/0x105
Aug 12 18:18:59 CPC464 kernel: [<c042e6ce>] __do_softirq+0x9a/0x14b
Aug 12 18:18:59 CPC464 kernel: [<c042e7af>] do_softirq+0x30/0x48
Aug 12 18:18:59 CPC464 kernel: [<c042e8ba>] irq_exit+0x3a/0x68
Aug 12 18:18:59 CPC464 kernel: [<c041132c>] smp_apic_timer_interrupt+0x6d/0x7b
Aug 12 18:18:59 CPC464 kernel: [<c0403407>] apic_timer_interrupt+0x2f/0x34
Aug 12 18:18:59 CPC464 kernel: [<c043b41e>] ? finish_wait+0x4a/0x4e
Aug 12 18:18:59 CPC464 kernel: [<c0437f50>] worker_thread+0xa2/0x1c0
Aug 12 18:18:59 CPC464 kernel: [<c05aff22>] ? flush_to_ldisc+0x0/0x160
Aug 12 18:18:59 CPC464 kernel: [<c043b2f1>] ? autoremove_wake_function+0x0/0x34
Aug 12 18:18:59 CPC464 kernel: [<c0437eae>] ? worker_thread+0x0/0x1c0
Aug 12 18:18:59 CPC464 kernel: [<c043aff7>] kthread+0x4b/0x6f
Aug 12 18:18:59 CPC464 kernel: [<c043afac>] ? kthread+0x0/0x6f
Aug 12 18:18:59 CPC464 kernel: [<c040355b>] kernel_thread_helper+0x7/0x10
Aug 12 18:18:59 CPC464 kernel: ---[ end trace bedb9a5728846903 ]---
Aug 12 18:19:07 CPC464 init: tty4 main process (1237) killed by TERM signal
Aug 12 18:19:07 CPC464 init: tty5 main process (1238) killed by TERM signal
Aug 12 18:19:07 CPC464 init: tty2 main process (1239) killed by TERM signal
Aug 12 18:19:07 CPC464 init: tty3 main process (1241) killed by TERM signal
Aug 12 18:19:07 CPC464 init: tty6 main process (1242) killed by TERM signal
Aug 12 18:19:07 CPC464 NetworkManager: <WARN> nm_signal_handler(): Caught signal 15, shutting down normally.
Aug 12 18:19:07 CPC464 NetworkManager: <info> (eth0): now unmanaged
Aug 12 18:19:07 CPC464 NetworkManager: <info> (eth0): device state change: 8 -> 1 (reason 36)
Aug 12 18:19:07 CPC464 NetworkManager: <info> (eth0): deactivating device (reason: 36).
Aug 12 18:19:08 CPC464 NetworkManager: <info> (eth0): canceled DHCP transaction, dhcp client pid 1318
Aug 12 18:19:08 CPC464 NetworkManager: <WARN> check_one_route(): (eth0) error -34 returned from rtnl_route_del(): Sucess#012
Aug 12 18:19:08 CPC464 NetworkManager: <info> (eth0): cleaning up...
Aug 12 18:19:08 CPC464 NetworkManager: <info> (eth0): taking down device.
Aug 12 18:19:08 CPC464 avahi-daemon[1044]: Withdrawing address record for 192.168.2.4 on eth0.
Aug 12 18:19:08 CPC464 avahi-daemon[1044]: Leaving mDNS multicast group on interface eth0.IPv4 with address 192.168.2.4.
Aug 12 18:19:08 CPC464 NetworkManager: <info> exiting (success)
Aug 12 18:19:08 CPC464 kernel: skge eth0: disabling interface
Aug 12 18:19:08 CPC464 avahi-daemon[1044]: Interface eth0.IPv4 no longer relevant for mDNS.
Aug 12 18:19:08 CPC464 avahi-daemon[1044]: Withdrawing address record for fe80::222:b0ff:fee7:4771 on eth0.
Aug 12 18:19:09 CPC464 nmbd[996]: [2009/08/12 18:19:09, 0] nmbd/nmbd.c:reload_interfaces(288)
Aug 12 18:19:09 CPC464 nmbd[996]: reload_interfaces: No subnets to listen to. Waiting..
Aug 12 18:19:09 CPC464 smartd[1232]: smartd received signal 15: Terminated
Aug 12 18:19:09 CPC464 smartd[1232]: smartd is exiting (exit status 0)
Aug 12 18:19:09 CPC464 avahi-daemon[1044]: Got SIGTERM, quitting.

After this I can only shut down the PC by pressing the power button, since everything is stalled and there is no network interface up.
Andreas, I don't know what MTU you are using, but in my case I get this error with an MTU of 9000 bytes. For the last few days I have been testing with 8k and 7k and haven't been able to reproduce it. Can you try that as well?
The MTU was not the problem after all. I pinpointed the problem to IRQ sharing between the Ethernet card and the graphics card. The issue was solved by booting with the noapic parameter and moving the PCI Ethernet card to a different slot.
Is this the same problem as bug 527209 in F-12?
(In reply to comment #21)
> Is this the same problem as bug 527209 in F-12?

Looks pretty much related. Can try this out on one notebook here but will take a bit of time.
Chuck, does the fix for bug 527209 also fix this problem?
This message is a reminder that Fedora 11 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 11. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '11'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 11's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 11 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 11 changed to end-of-life (EOL) status on 2010-06-25. Fedora 11 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.