NetworkManager-0.7.996-4.git20091002.fc12.x86_64
Tried this build from koji: http://koji.fedoraproject.org/koji/buildinfo?buildID=134947
On my laptop, I unplugged the eth0 cable and then replugged it, and saw this in /var/log/messages:
  Oct 5 07:02:11 localhost NetworkManager: <info> (eth0): carrier now OFF (device state 3)
  Oct 5 07:02:11 localhost NetworkManager: <info> (eth0): device state change: 3 -> 2 (reason 40)
  Oct 5 07:02:11 localhost NetworkManager: <info> (eth0): deactivating device (reason: 40).
  Oct 5 07:02:11 localhost NetworkManager: <info> Policy set 'Auto FedoraGL' (wlan0) as default for routing and DNS.
  Oct 5 07:02:13 localhost NetworkManager: <info> (eth0): carrier now ON (device state 2)
  Oct 5 07:02:13 localhost kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
  Oct 5 07:02:13 localhost kernel: tg3: eth0: Flow control is on for TX and on for RX.
  Oct 5 07:02:13 localhost NetworkManager: <info> (eth0): device state change: 2 -> 3 (reason 40)
Nothing further happened. The nm-applet retained its display of the wlan0 signal bars -- no sign of trying DHCP or any other feedback. I reverted to the previous build, NetworkManager-0.7.996-3.git20090928.fc12.x86_64, and everything returned to normal.
Is it possible this has something to do with the kernel's tg3 driver? I keep triggering the original problem (laptop appears to drop off the network) by trying to do a large file transfer via Nautilus gvfs/sftp over the wired connection of that machine. When I do the same transfer via wifi, no problems.
Hmm, have you hit 'disconnect' in the menu at any point for the wired connection?
OK, here's some further data: * I can reproduce the problem every time with Nautilus sftp. It doesn't seem to occur with e.g. scp from a terminal. * When I disconnect using NetworkManager, and reconnect "System eth0," NM appears not to get a DHCP address after that. * If I 'rmmod tg3' and then 'modprobe tg3' and reconnect, NM behaves normally, getting a DHCP address and happily chugging along. I'm happy to reproduce this and produce some logs if you tell me how to do it in the most useful manner.
Not sure if this is relevant, but: 09:00.0 Ethernet controller [0200]: Broadcom Corporation NetLink BCM5906M Fast Ethernet PCI Express [14e4:1713] (rev 02)
I've since found out that other large traffic transfers are causing this problem too, such as rsync from an Internet mirror to my local hard disk. Most, but not all, large transfers seem to cause the problem. I can't figure for the life of me how this can be NM at fault. The only solution when the problem occurs appears to be 'modprobe -r tg3' followed by 'modprobe tg3', and once I do that, NetworkManager sees the NIC, requests and receives DHCP info, and operates normally. I'm going to change this to component kernel, mark it F12Target, and ask the kernel team to weigh in. I apologize if that's not kosher or if this comes off as hubris... Just interested in seeing the bug fixed but I don't want it to override other priorities.
Comment #1 is not accurate -- reverting to previous NM build has not fixed the problem, so this bug was misfiled originally. Apologies to Dan and others.
(In reply to comment #7)
> Oct 5 07:02:13 localhost NetworkManager: <info> (eth0): device state change:
> 2 -> 3 (reason 40)
Can someone please explain what states 2 and 3 and reason 40 mean? Also, does this happen using old-style config files with NetworkManager disabled?
That means NM is transitioning the device from state 2 (unavailable) to state 3 (disconnected but ready to be used). Ethernet devices will be in state 2 (unavailable) when there is no carrier, and will transition to state 3 (disconnected) when the carrier flips to ON, as seen in the logs above.
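A minimal sketch of how the state-change lines quoted in this bug could be decoded when reading /var/log/messages. Only states 2 and 3 are mapped, because those are the only ones explained in this thread; the reason code is passed through as a number since nobody here has defined it. The function name and the output format are hypothetical, chosen just for illustration.

```python
import re

# Only the two states explained in this thread are mapped;
# any other code is printed as a bare number.
DEVICE_STATES = {2: "unavailable", 3: "disconnected"}

# Matches e.g. "(eth0): device state change: 2 -> 3 (reason 40)"
STATE_CHANGE = re.compile(
    r"\((?P<iface>\w+)\): device state change: "
    r"(?P<old>\d+) -> (?P<new>\d+) \(reason (?P<reason>\d+)\)"
)


def describe_state_change(line):
    """Return a readable summary of an NM state-change log line, or None."""
    match = STATE_CHANGE.search(line)
    if match is None:
        return None
    old = int(match.group("old"))
    new = int(match.group("new"))
    old_name = DEVICE_STATES.get(old, "state %d" % old)
    new_name = DEVICE_STATES.get(new, "state %d" % new)
    return "%s: %s -> %s (reason %s)" % (
        match.group("iface"), old_name, new_name, match.group("reason"))
```

Fed the last log line from the original report, this yields "eth0: unavailable -> disconnected (reason 40)", which matches the explanation above: the carrier came back and the device became ready, but nothing activated it afterwards.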
I have the same problem with old-style config files and NetworkManager disabled. (I set them up using system-config-network, disabled NetworkManager using chkconfig, and rebooted.) I haven't been able to duplicate this problem on my workstation which has a different tg3-supported interface -- Broadcom Corporation NetLink BCM5784M Gigabit Ethernet PCIe [14e4:1698] (rev 10).
Found someone on the forum suffering from what seems like the same problem: http://forums.fedoraforum.org/showthread.php?t=232485 I asked him to add a note to this bug. -- Fedora Bugzappers volunteer triage team https://fedoraproject.org/wiki/BugZappers
I have the same problem on my Inspiron 1420 and on other desktop computers with a Broadcom Ethernet chip. Every time I scp a large file, such as kernel-xxx.rpm, to another machine, eth0 (created by the tg3 driver) stops working, and only a reboot brings tg3 back. In other words, removing the tg3 module and reloading it with modprobe does not solve the problem for me. This has nothing to do with NM, because the problem still occurs even after I removed NM. I have also reviewed the changelog for kernel 2.6.31, which mentions changes to the network stack buffers and DMA handling. Maybe the problem comes from an incompatibility between tg3 and the kernel upgrade.
Scratch build running here, can someone test this? http://koji.fedoraproject.org/koji/taskinfo?taskID=1770538
Chuck, I can't install that because of the (matching or greater) kernel-firmware requirement. (The --force option didn't work.) Do you have a package available for kernel-firmware?
click on the 'noarch' build link to get the kernel-firmware package, Paul. (in extreme cases you can actually --nodeps that requirement, you don't strictly _need_ the firmware package, though not having it will make a few drivers stop working).
Chuck, I have tested your tg3 updated kernel on my Inspiron 1420 with the BCM5906 chip, but the large file problem is still there. The build does not solve the problem.
The problem is not fixed with the new build, and in fact is worse -- even without engaging in a large file transfer, the interface rarely stays up long enough for me to open a terminal on that system and ping another local box (or be pinged). There's nothing in the dmesg or /var/log/messages to indicate what's happening, and the NetworkManager applet makes it look like things are working normally (as does 'ip addr show eth0'), but eth0 acts like it's down. Again, I can do 'modprobe -r tg3' and then 'modprobe tg3' to resuscitate it, but with this new kernel, it only stays up for a second or so before dying. Even the small amount of traffic that passes PulseAudio remote device discovery seems to be enough to kill the interface.
Adding tg3 maintainer from Broadcom for comment. Matt?
This sounds like a problem I have a partial fix for. Can you tell me if 'ethtool -K ethx sq off' fixes the problem?
Um, that's 'ethtool -K ethx sg off'.
I'm promoting this for consideration as a release blocker, as it could potentially make it very tricky to research the issue to discover a workaround or fix, on a system which only had an affected adapter.
Using 'ethtool -K eth0 sg off' and then running the same copy seems to make everything work fine. If I switch it, 'ethtool -K eth0 sg on' and then retry, I can again reproduce the lock up, and have to 'modprobe -r tg3' and 'modprobe tg3' to restore function.
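For anyone who wants to script the workaround while testing, here is a minimal sketch that wraps the same 'ethtool -K <iface> sg on|off' command used above. The helper names are hypothetical; the only thing taken from this bug is the ethtool invocation itself. Actually running it needs root and the ethtool binary.

```python
import subprocess


def sg_command(iface, enabled):
    """Build the ethtool command that switches scatter-gather on or off."""
    return ["ethtool", "-K", iface, "sg", "on" if enabled else "off"]


def set_scatter_gather(iface, enabled):
    """Run the command; raises CalledProcessError if ethtool fails."""
    subprocess.run(sg_command(iface, enabled), check=True)
```

Calling set_scatter_gather("eth0", False) before a large copy corresponds to the working case described above, and set_scatter_gather("eth0", True) to the case that reproduces the lock up. The setting is not persistent, so it would have to be reapplied after the module is reloaded or the machine is rebooted.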
Discussed at blocker meeting today; we're dropping it to F12Target on the basis that the workaround is available and tested to work. I will document the workaround on the common bugs page. Chuck Ebbert says if a patch is forthcoming from Broadcom quickly enough he can look at merging it for the final F12 kernel.
Chuck, feel free to SMS me (internal wiki has info) if you need a build tested pronto. Perhaps this problem is down to just one or a few specific model NICs and I just happened to get lucky. OTOH I'm wondering how people will find the workaround if their network dies unexpectedly. I stand ready to help!
Created attachment 366887 [details] tg3: Assign flags to fixes in start_xmit_dma_bug This patch adds a flag for each bug workaround in tg3_start_xmit_dma_bug(). This is prep work for the following patch.
Created attachment 366888 [details] tg3: Fix 5906 transmit hangs The 5906 has trouble with fragments that are less than 8 bytes in size. This patch works around the problem by pivoting the 5906's transmit routine to tg3_start_xmit_dma_bug() and introducing a new SHORT_DMA_BUG flag that enables code to detect and react to the problematic condition.
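To make the condition in the patch description concrete, here is a rough Python model of the decision, not the driver code itself. It assumes the check is applied per fragment and that a fragment of exactly 8 bytes also counts; the attachment text only says "less than 8 bytes", so the inclusive boundary is an assumption and is exposed as a parameter. All names below are hypothetical.

```python
def has_short_fragment(fragment_lengths, limit=8):
    """True if any fragment is small enough to upset the chip.

    The inclusive comparison (<= limit) is an assumption; the patch
    description only says "less than 8 bytes".
    """
    return any(0 < length <= limit for length in fragment_lengths)


def choose_transmit_path(fragment_lengths, short_dma_bug):
    """Pick the path a packet would take in this simplified model."""
    if short_dma_bug and has_short_fragment(fragment_lengths):
        return "workaround"
    return "normal"
```

In this model a packet with fragments of 1448 and 6 bytes takes the workaround path only when the chip is flagged as affected, which is why the flag has to be set correctly for the 5906 at probe time.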
Can we generate a test kernel with the above two patches applied and see if it fixes everyone's problems? For the record, this bug doesn't seem to show up on every platform. I have yet to understand why, but ASPM seems to irritate the condition.
chuck, can you do a build to test this?
This looks wrong to me:
    @@ -12684,7 +12693,8 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
         if (!(tp->tg3_flags3 & TG3_FLG3_5755_PLUS)) {
             tp->tg3_flags3 |= TG3_FLG3_4G_DMA_BNDRY_BUG;
             tp->tg3_flags3 |= TG3_FLG3_40BIT_DMA_LIMIT_BUG;
    -    }
    +    } else if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906)
    +        tp->tg3_flags3 |= TG3_FLG3_SHORT_DMA_BUG;
         if (!(tp->tg3_flags2 & TG3_FLG2_5705_PLUS) ||
             (tp->tg3_flags2 & TG3_FLG2_5780_CLASS) ||
The 5906 is not included in the set of chips that have the 5755_PLUS flag set, so the "else" will never be taken. Shouldn't it be:
    if (!(tp->tg3_flags3 & TG3_FLG3_5755_PLUS)) {
        if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906)
            tp->tg3_flags3 |= TG3_FLG3_SHORT_DMA_BUG;
        else {
            tp->tg3_flags3 |= TG3_FLG3_4G_DMA_BNDRY_BUG;
            tp->tg3_flags3 |= TG3_FLG3_40BIT_DMA_LIMIT_BUG;
        }
    }
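The difference between the two versions can be checked with a small Python model of the branch structure. This is only a sketch of the control flow quoted in this comment: the two booleans stand in for the TG3_FLG3_5755_PLUS test and the ASIC_REV_5906 comparison, and the function names are hypothetical.

```python
def flags_as_posted(is_5755_plus, is_5906):
    """Branch structure of the patch as originally attached."""
    flags = set()
    if not is_5755_plus:
        flags.add("4G_DMA_BNDRY_BUG")
        flags.add("40BIT_DMA_LIMIT_BUG")
    elif is_5906:
        # Only reachable for chips that are 5755_PLUS *and* 5906.
        flags.add("SHORT_DMA_BUG")
    return flags


def flags_as_proposed(is_5755_plus, is_5906):
    """Branch structure of the correction suggested in this comment."""
    flags = set()
    if not is_5755_plus:
        if is_5906:
            flags.add("SHORT_DMA_BUG")
        else:
            flags.add("4G_DMA_BNDRY_BUG")
            flags.add("40BIT_DMA_LIMIT_BUG")
    return flags
```

Since the 5906 is not a 5755_PLUS chip, the relevant input is (is_5755_plus=False, is_5906=True). The posted version then returns the two older DMA flags and never sets SHORT_DMA_BUG, while the proposed version returns only SHORT_DMA_BUG.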
Build running here, with my change added to the Broadcom patches. Still awaiting confirmation from Broadcom that the change is correct, but I'm pretty sure it is...
http://koji.fedoraproject.org/koji/taskinfo?taskID=1782682
I just tried the koji builds in comment #30, and they appear to work fine (without the need for the ethtool workaround described in comment #19). I can now scp from a command line or via Nautilus/gvfs mount without difficulty or certain death. :-)
Created attachment 367165 [details] patch 1/7
Created attachment 367166 [details] patch 2/7
Created attachment 367167 [details] patch 3/7
Created attachment 367168 [details] patch 4/7
Created attachment 367169 [details] patch 5/7
Created attachment 367170 [details] patch 6/7
Created attachment 367171 [details] patch 7/7
I've attached all seven patches that went into the test kernel. The seventh fixes the problem I found with the patch "tg3: Fix 5906 transmit hangs".
Chuck, you are correct. I missed that in my haste to get the patch to you ASAP. Sorry. I'll be sending (the corrected version of) these patches upstream very soon.
great job, guys, thanks. It'd be great to have this in the final release, we may well need to tag a new kernel anyway to deal with the Radeon hang issue I'm currently investigating with Dave and Jerome. please co-ordinate with them before tag requesting anything, though, I'm not sure if they'd want the 97 / 104 / 105 / 106 / 107 changes going in without some cleanup...
Bill Nottingham voiced a question in #fedora-kernel this morning about the tg3 fixes: <notting> is there a reason that's not on the blocker list? I'm not against that obviously. Does it make sense to move this back to F12Blocker just to ensure that it's taken care of in the kernel that gets tagged for f12-final? I'm happy to be a guinea pig for testing if needed. The special 107.tg3 kernel that Chuck built works fine, but I verified that the -112 kernel (which he confirms didn't have his commits) is back to the broken behavior.
my bad, we did have a rationale for dropping this from the blocker list but it wasn't relayed to the bug report for some reason. the bug doesn't need to be on the blocker list for us to accept a tag for it, though. Chuck stuck the fixes into kernel -116; please work with the kernel team to file a tag request for that kernel or later (I'd quite like to have the fix from 117 too).
this should be closed, we wound up with 127 in the final release. assuming the problem didn't somehow magically come back between 117 and 127. please re-open if it did.
117 fixed the problem, and it's stayed resolved through 127. Thanks.
Hello, I would like to add a comment to this bug, as I have a problem that seems closely related; I may eventually file a new bug that references this one. I tested all combinations of cables and duplex mode settings and I'm again getting a similar problem. My system is an XPS M1330 with an up-to-date F12 x86_64 and kernel 2.6.31.12-174.2.3.fc12.x86_64.
To summarize: after starting a large transfer, I soon get this in /var/log/messages:
  Feb 9 15:40:14 tekkaman kernel: tg3: eth0: Link is down.
  Feb 9 15:40:14 tekkaman NetworkManager: <info> (eth0): carrier now OFF (device state 8, deferring action for 4 seconds)
  Feb 9 15:40:16 tekkaman kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
  Feb 9 15:40:16 tekkaman kernel: tg3: eth0: Flow control is off for TX and off for RX.
  Feb 9 15:40:16 tekkaman NetworkManager: <info> (eth0): carrier now ON (device state 8)
The behavior now differs from the original report, where the connection dropped completely: instead, the transfer rate suddenly drops from around 9.7MB/s to 4-5MB/s.
My lspci output:
  09:00.0 Ethernet controller: Broadcom Corporation NetLink BCM5906M Fast Ethernet PCI Express (rev 02)
With the temporary workaround (see comments 19 and 21) that was in place before the supposed fix, I still get the off/on messages for the interface, but at least I can sustain 10.1MB/s over the whole 600MB transfer.
My initial (default) setup, with the card set to auto sense and autonegotiation coming up at 100mbit/s:
  [root@tekkaman ~]# ethtool -k eth0
  Offload parameters for eth0:
  Cannot get device flags: Operation not supported
  rx-checksumming: on
  tx-checksumming: on
  scatter-gather: on
  tcp-segmentation-offload: off
  udp-fragmentation-offload: off
  generic-segmentation-offload: on
  generic-receive-offload: off
  large-receive-offload: off
Shouldn't this have been fixed already, or am I supposed to get sg=off by default now?
Workaround:
  [root@tekkaman ~]# ethtool -K eth0 sg off
Verify the changed settings:
  [root@tekkaman ~]# ethtool -k eth0
  Offload parameters for eth0:
  Cannot get device flags: Operation not supported
  rx-checksumming: on
  tx-checksumming: on
  scatter-gather: off
  tcp-segmentation-offload: off
  udp-fragmentation-offload: off
  generic-segmentation-offload: on
  generic-receive-offload: off
  large-receive-offload: off
Transfer test:
  [gcecchi@tekkaman ~]$ time scp Downloads/SS830.2009_0817.103-x64.iso root@mysrv:/root
  root@mysrv's password:
  SS830.2009_0817.103-x64.iso 100% 603MB 10.6MB/s 00:57
  real 0m59.383s
  user 0m17.708s
  sys 0m3.957s
After the first 10-20 MBytes transferred I get these in /var/log/messages:
  Feb 9 15:40:14 tekkaman kernel: tg3: eth0: Link is down.
  Feb 9 15:40:14 tekkaman NetworkManager: <info> (eth0): carrier now OFF (device state 8, deferring action for 4 seconds)
  Feb 9 15:40:16 tekkaman kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
  Feb 9 15:40:16 tekkaman kernel: tg3: eth0: Flow control is off for TX and off for RX.
  Feb 9 15:40:16 tekkaman NetworkManager: <info> (eth0): carrier now ON (device state 8)
Has anyone with this model hit this problem again? Is it possibly a regression after the 127 kernel? I wanted to share this before re-opening the bugzilla. Thanks again for comments and input. Gianluca
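Checking the 'ethtool -k' output by eye is easy to get wrong when comparing several machines, so here is a small sketch that turns the output quoted above into a dictionary. It assumes the "name: on" / "name: off" line format shown in this comment, and also tolerates a trailing marker such as "[fixed]" after the value; the function name is hypothetical.

```python
def parse_offloads(text):
    """Map each offload setting in `ethtool -k` output to True/False.

    Lines whose value does not start with "on" or "off" (headers,
    error messages) are ignored.
    """
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        words = value.split()
        if sep and words and words[0] in ("on", "off"):
            settings[name.strip()] = (words[0] == "on")
    return settings
```

With the output above, parse_offloads(output)["scatter-gather"] is False once the workaround has been applied, and True for the default setup.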
When I upgraded the Fedora 12 kernel to 2.6.31.12-174.2.3 on my Inspiron 1420, I hit the same problem you described in the last comment. But when I downloaded the source rpm for 2.6.32 from the Fedora development repository for F13 and rebuilt it, the tg3 problem went away completely. You could give that a try.
The bug seems to be back in kernel 2.6.32.9. See Bug 571638.
I think an upstream kernel change allowed scatter/gather fragments to be sized less than or equal to 8 bytes. This exposed a 5906 chip bug. Commit 92c6b8d16a36df3f28b2537bed2a56491fb08f11 fixes the problem. This commit was integrated into the 3.103 version of the tg3 driver. I'm pretty sure this fix was integrated into RedHat's 2.6.31 kernel. I checked the 2.6.32.9 kernel and it does not yet contain the fix.
Reopening, based on Comment 49. Why isn't the tg3 fix upstream?
It is. Like comment 49 says, it is in commit 92c6b8d16a36df3f28b2537bed2a56491fb08f11. :) Mike Pagano backported it into stable.
(In reply to comment #51)
> It is. Like comment 49 says, it is in commit
> 92c6b8d16a36df3f28b2537bed2a56491fb08f11. :)
> Mike Pagano backported it into stable.
Sorry for the confusion - I was referring to 2.6.32.x as "upstream". Just to make sure there is no more confusion: will the fix be in 2.6.32.10?
Sorry for the delay guys. I went back to make sure it was queued up and for some reason it wasn't. Mike Pagano has resubmitted the patch and we are just waiting for the patch to be accepted. I was hoping it would be accepted quickly and I could give the thumbs-up here, but I guess we'll have to wait a little longer.
The patch just got accepted.
The fix for the completely different bug with the same symptoms as the original report went into 2.6.32.11.
Hey guys, sorry to dig this up after 4 years, but I'm still suffering from lost connections on my BCM5906M card with the tg3 driver. About 20+ hours after bootup, ifconfig still shows the IP, but a ping to the gateway hangs. I cannot fix it by simply restarting the network interface; I have to unload the tg3 module and reload it, after which the card is online again. I cannot provide much info now, since the problem is hard to reproduce. I tried sending UDP packets to put the card under full load, but the problem does not occur then; the card stays online. I find my problem quite similar to this one, because I am running a bittorrent client too: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/404708
The dmesg there looks the same as mine, except that I cannot solve it by turning off gso with 'ethtool -K enp4s0 gso off':
  Riaqn-Laptop ~ # ethtool -k enp4s0 | grep -i gso
  tx-gso-robust: off [fixed]
  Riaqn-Laptop ~ # ethtool -K enp4s0 gso on
  Riaqn-Laptop ~ # ethtool -k enp4s0 | grep -i gso
  tx-gso-robust: off [fixed]
It looks like the gso option is fixed to off and cannot be changed. BTW, I had FreeBSD running on this machine a few weeks ago, and it lost the network connection too. So I was wondering whether this is a hardware bug or a design mistake. Please ask if more specific info is needed.