Bug 527209

Summary: Large file transfers are killing BCM5906M tg3 Ethernet card
Product: [Fedora] Fedora
Reporter: Paul W. Frields <stickster>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: high
Version: 12
CC: agospoda, andriusb, apevec, awilliam, benlu, conradsand.fb, dcbw, dougsland, freedom2lee, gansalmon, gianluca.cecchi, itamar, jfeeney, jlaska, kernel-maint, mcarlson, mclasen, notting, stephane.raimbault, vbian, zesen.qian
Keywords: Reopened
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-08-04 11:26:40 UTC
Bug Blocks: 473302
Attachments (flags: none on all):
* tg3: Assign flags to fixes in start_xmit_dma_bug
* tg3: Fix 5906 transmit hangs
* patch 1/7
* patch 2/7
* patch 3/7
* patch 4/7
* patch 5/7
* patch 6/7
* patch 7/7

Description Paul W. Frields 2009-10-05 11:18:56 UTC
NetworkManager-0.7.996-4.git20091002.fc12.x86_64

Tried this build from koji: http://koji.fedoraproject.org/koji/buildinfo?buildID=134947

On my laptop, I unplugged the eth0 cable and then replugged it, and saw this in /var/log/messages:

Oct  5 07:02:11 localhost NetworkManager: <info>  (eth0): carrier now OFF (device state 3)
Oct  5 07:02:11 localhost NetworkManager: <info>  (eth0): device state change: 3 -> 2 (reason 40)
Oct  5 07:02:11 localhost NetworkManager: <info>  (eth0): deactivating device (reason: 40).
Oct  5 07:02:11 localhost NetworkManager: <info>  Policy set 'Auto FedoraGL' (wlan0) as default for routing and DNS.
Oct  5 07:02:13 localhost NetworkManager: <info>  (eth0): carrier now ON (device state 2)
Oct  5 07:02:13 localhost kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Oct  5 07:02:13 localhost kernel: tg3: eth0: Flow control is on for TX and on for RX.
Oct  5 07:02:13 localhost NetworkManager: <info>  (eth0): device state change: 2 -> 3 (reason 40)

Nothing further happened.  The nm-applet retained its display of the wlan0 signal bars -- no sign of trying DHCP or any other feedback.

I reverted to the previous build, NetworkManager-0.7.996-3.git20090928.fc12.x86_64, and everything returned to normal.

Comment 1 Paul W. Frields 2009-10-05 11:43:20 UTC
Is it possible this has something to do with the kernel's tg3 driver?  I keep triggering the original problem (laptop appears to drop off the network) by trying to do a large file transfer via Nautilus gvfs/sftp over the wired connection of that machine.  When I do the same transfer via wifi, no problems.

Comment 2 Dan Williams 2009-10-05 15:54:47 UTC
Hmm, have you hit 'disconnect' in the menu at any point for the wired connection?

Comment 3 Paul W. Frields 2009-10-05 19:52:08 UTC
OK, here's some further data:

* I can reproduce the problem every time with Nautilus sftp.  It doesn't seem to occur with e.g. scp from a terminal.

* When I disconnect using NetworkManager, and reconnect "System eth0," NM appears not to get a DHCP address after that.

* If I 'rmmod tg3' and then 'modprobe tg3' and reconnect, NM behaves normally, getting a DHCP address and happily chugging along.

I'm happy to reproduce this and produce some logs if you tell me how to do it in the most useful manner.

Comment 4 Paul W. Frields 2009-10-05 20:40:18 UTC
Not sure if this is relevant, but:

09:00.0 Ethernet controller [0200]: Broadcom Corporation NetLink BCM5906M Fast Ethernet PCI Express [14e4:1713] (rev 02)

Comment 5 Paul W. Frields 2009-10-07 14:21:18 UTC
I've since found out that other large traffic transfers are causing this problem too, such as rsync from an Internet mirror to my local hard disk.  Most, but not all, large transfers seem to cause the problem.

I can't figure for the life of me how this can be NM at fault.  The only solution when the problem occurs appears to be 'modprobe -r tg3' followed by 'modprobe tg3', and once I do that, NetworkManager sees the NIC, requests and receives DHCP info, and operates normally.

I'm going to change this to component kernel, mark it F12Target, and ask the kernel team to weigh in.  I apologize if that's not kosher or if this comes off as hubris... Just interested in seeing the bug fixed but I don't want it to override other priorities.

Comment 6 Paul W. Frields 2009-10-07 14:23:32 UTC
Comment #1 is not accurate -- reverting to previous NM build has not fixed the problem, so this bug was misfiled originally.  Apologies to Dan and others.

Comment 7 Chuck Ebbert 2009-10-07 21:57:34 UTC
(In reply to comment #0)
> Oct  5 07:02:13 localhost NetworkManager: <info>  (eth0): device state change:
> 2 -> 3 (reason 40)

Can someone please explain what state 2 and 3, and reason 40 mean?

Also, does this happen using old-style config files with NetworkManager disabled?

Comment 8 Dan Williams 2009-10-08 14:52:58 UTC
That means NM is transitioning the device from state 2 (unavailable) to state 3 (disconnected but ready to be used).  Ethernet devices will be in state 2 (unavailable) when there is no carrier, and will transition to state 3 (disconnected) when the carrier flips to ON, as seen in the logs above.
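For anyone scanning /var/log/messages for these transitions, here is a minimal C sketch that pulls the old state, new state, and reason code out of a line like the ones quoted above. The function names are made up for illustration, and only states 2 and 3 are named because those are the only ones explained in this comment; reason 40 is reported as a bare number.

```c
#include <stdio.h>
#include <string.h>

/* Names for the two states explained above; anything else is left unnamed. */
static const char *nm_state_name(int state)
{
    switch (state) {
    case 2:  return "unavailable";
    case 3:  return "disconnected";
    default: return "not named in this sketch";
    }
}

/*
 * Extract "old -> new (reason N)" from a NetworkManager log line.
 * Returns 1 if the line is a state-change message, 0 otherwise.
 */
static int parse_state_change(const char *line, int *from, int *to, int *reason)
{
    const char *p = strstr(line, "device state change:");

    if (p == NULL)
        return 0;
    return sscanf(p, "device state change: %d -> %d (reason %d)",
                  from, to, reason) == 3;
}
```

Feeding it the "2 -> 3 (reason 40)" line from the description gives from=2 (unavailable), to=3 (disconnected), reason=40.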

Comment 9 Paul W. Frields 2009-10-17 18:11:15 UTC
I have the same problem with old-style config files and NetworkManager disabled.  (I set them up using system-config-network, disabled NetworkManager using chkconfig, and rebooted.)

I haven't been able to duplicate this problem on my workstation which has a different tg3-supported interface -- Broadcom Corporation NetLink BCM5784M Gigabit Ethernet PCIe [14e4:1698] (rev 10).

Comment 10 Adam Williamson 2009-10-24 16:16:38 UTC
Found someone on the forum suffering from what seems like the same problem:

http://forums.fedoraforum.org/showthread.php?t=232485

I asked him to add a note to this bug.

-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 11 Li Qi 2009-10-26 02:19:02 UTC
I have the same problem with my Inspiron 1420 and with other desktop computers that have a Broadcom Ethernet chip. Every time I scp a large file, such as kernel-xxx.rpm, to another machine, eth0 (created by the tg3 driver) stops working, and only a reboot brings tg3 back. In other words, removing the tg3 module and reloading it with modprobe does not solve the problem. It has nothing to do with NM: even after I removed NM, the problem still occurs.

I have also reviewed the changelog for kernel 2.6.31; it mentions changes to the network stack buffers and DMA handling. Maybe the problem comes from an incompatibility between tg3 and those kernel changes.

Comment 12 Chuck Ebbert 2009-10-26 20:37:10 UTC
Scratch build running here, can someone test this?

 http://koji.fedoraproject.org/koji/taskinfo?taskID=1770538

Comment 13 Paul W. Frields 2009-10-27 00:56:50 UTC
Chuck,

I can't install that because of the (matching or greater) kernel-firmware requirement. (The --force option didn't work.) Do you have a package available for kernel-firmware?

Comment 14 Adam Williamson 2009-10-27 01:14:39 UTC
click on the 'noarch' build link to get the kernel-firmware package, Paul.

(in extreme cases you can actually --nodeps that requirement, you don't strictly _need_ the firmware package, though not having it will make a few drivers stop working).

Comment 15 Li Qi 2009-10-27 01:59:39 UTC
Chuck,

I tested your updated tg3 kernel on my Inspiron 1420 with the BCM5906 chip, but the large file problem is still there. The build does not solve it.

Comment 16 Paul W. Frields 2009-10-29 12:04:29 UTC
The problem is not fixed with the new build, and in fact is worse -- even without engaging in a large file transfer, the interface rarely stays up long enough for me to open a terminal on that system and ping another local box (or be pinged). There's nothing in the dmesg or /var/log/messages to indicate what's happening, and the NetworkManager applet makes it look like things are working normally (as does 'ip addr show eth0'), but eth0 acts like it's down.  Again, I can do 'modprobe -r tg3' and then 'modprobe tg3' to resuscitate it, but with this new kernel, it only stays up for a second or so before dying.  Even the small amount of traffic that passes PulseAudio remote device discovery seems to be enough to kill the interface.

Comment 17 Andrius Benokraitis 2009-10-29 13:43:28 UTC
Adding tg3 maintainer from Broadcom for comment. Matt?

Comment 18 Matt Carlson 2009-10-29 16:37:30 UTC
This sounds like a problem I have a partial fix for.  Can you tell me if
'ethtool -K ethx sq off' fixes the problem?

Comment 19 Matt Carlson 2009-10-29 16:38:32 UTC
Um, that's 'ethtool -K ethx sg off'.

Comment 20 Adam Williamson 2009-10-29 17:09:33 UTC
I'm promoting this for consideration as a release blocker, as it could potentially make it very tricky to research the issue to discover a workaround or fix, on a system which only had an affected adapter.

Comment 21 Paul W. Frields 2009-10-29 20:05:31 UTC
Using 'ethtool -K eth0 sg off' and then running the same copy seems to make everything work fine.

If I switch it, 'ethtool -K eth0 sg on' and then retry, I can again reproduce the lock up, and have to 'modprobe -r tg3' and 'modprobe tg3' to restore function.

Comment 22 Adam Williamson 2009-10-30 18:30:21 UTC
Discussed at blocker meeting today; we're dropping it to F12Target on the basis that the workaround is available and tested to work. I will document the workaround on the common bugs page. Chuck Ebbert says if a patch is forthcoming from Broadcom quickly enough he can look at merging it for the final F12 kernel.

Comment 23 Paul W. Frields 2009-10-30 19:32:21 UTC
Chuck, feel free to SMS me (internal wiki has info) if you need a build tested pronto.  Perhaps this problem is down to just one or a few specific model NICs and I just happened to get lucky.  OTOH I'm wondering how people will find the workaround if their network dies unexpectedly.  I stand ready to help!

Comment 24 Matt Carlson 2009-10-31 00:11:39 UTC
Created attachment 366887 [details]
tg3: Assign flags to fixes in start_xmit_dma_bug

This patch adds a flag for each bug workaround in
tg3_start_xmit_dma_bug().  This is prep work for the following patch.

Comment 25 Matt Carlson 2009-10-31 00:12:25 UTC
Created attachment 366888 [details]
tg3: Fix 5906 transmit hangs

The 5906 has trouble with fragments that are less than 8 bytes in size.
This patch works around the problem by pivoting the 5906's transmit
routine to tg3_start_xmit_dma_bug() and introducing a new SHORT_DMA_BUG
flag that enables code to detect and react to the problematic condition.
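As a rough illustration of the condition being detected (not the driver code itself), the check amounts to scanning a packet's scatter/gather fragments for any that are shorter than 8 bytes. The struct and function names below are hypothetical stand-ins, and the threshold is taken from the description above.

```c
#include <stddef.h>

/* Threshold from the patch description: fragments under 8 bytes are trouble. */
#define SHORT_FRAG_LIMIT 8

/* Hypothetical stand-in for one scatter/gather fragment of a packet. */
struct frag {
    size_t len;
};

/* Returns 1 if any fragment is short enough to need the workaround path. */
static int has_short_fragment(const struct frag *frags, size_t nfrags)
{
    for (size_t i = 0; i < nfrags; i++) {
        if (frags[i].len < SHORT_FRAG_LIMIT)
            return 1;
    }
    return 0;
}
```

A packet split into 1448 + 5 bytes would trip the check, while 1448 + 64 + 8 would not. This also suggests why turning scatter/gather off hides the problem: with sg off the driver is handed one linear buffer, so there are no small fragments to trip over.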

Comment 26 Matt Carlson 2009-10-31 00:14:34 UTC
Can we generate a test kernel with the above two patches applied and see if it fixes everyone's problems?

For the record, this bug doesn't seem to show up on every platform.  I have yet to understand why, but ASPM seems to irritate the condition.

Comment 27 Adam Williamson 2009-10-31 05:12:18 UTC
chuck, can you do a build to test this?

Comment 28 Chuck Ebbert 2009-11-01 02:07:56 UTC
This looks wrong to me:

@@ -12684,7 +12693,8 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
        if (!(tp->tg3_flags3 & TG3_FLG3_5755_PLUS)) {
                tp->tg3_flags3 |= TG3_FLG3_4G_DMA_BNDRY_BUG;
                tp->tg3_flags3 |= TG3_FLG3_40BIT_DMA_LIMIT_BUG;
-       }
+       } else if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906)
+               tp->tg3_flags3 |= TG3_FLG3_SHORT_DMA_BUG; 
  
        if (!(tp->tg3_flags2 & TG3_FLG2_5705_PLUS) ||
             (tp->tg3_flags2 & TG3_FLG2_5780_CLASS) ||

The 5906 is not included in the set of chips that have the 5755_PLUS flag set, so the "else" will never be taken.

Shouldn't it be:

        if (!(tp->tg3_flags3 & TG3_FLG3_5755_PLUS)) {
                if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906)
                        tp->tg3_flags3 |= TG3_FLG3_SHORT_DMA_BUG;
                else {
                        tp->tg3_flags3 |= TG3_FLG3_4G_DMA_BNDRY_BUG;
                        tp->tg3_flags3 |= TG3_FLG3_40BIT_DMA_LIMIT_BUG;
                }
        }
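To make the difference concrete, here is a small standalone model of the two versions. The flag names mirror the patch, but the bit values and the is_5906 parameter are invented for illustration; this is a sketch of the branch logic only, not kernel code.

```c
/* Hypothetical bit values; only the names come from the patch. */
#define FLG_5755_PLUS            0x1u
#define FLG_4G_DMA_BNDRY_BUG     0x2u
#define FLG_40BIT_DMA_LIMIT_BUG  0x4u
#define FLG_SHORT_DMA_BUG        0x8u

/* Branch structure as posted: the 5906 check sits on the else arm. */
static unsigned int flags_as_posted(unsigned int flags, int is_5906)
{
    if (!(flags & FLG_5755_PLUS)) {
        flags |= FLG_4G_DMA_BNDRY_BUG;
        flags |= FLG_40BIT_DMA_LIMIT_BUG;
    } else if (is_5906) {
        flags |= FLG_SHORT_DMA_BUG;
    }
    return flags;
}

/* Branch structure as proposed: the 5906 check moves inside the if. */
static unsigned int flags_as_proposed(unsigned int flags, int is_5906)
{
    if (!(flags & FLG_5755_PLUS)) {
        if (is_5906) {
            flags |= FLG_SHORT_DMA_BUG;
        } else {
            flags |= FLG_4G_DMA_BNDRY_BUG;
            flags |= FLG_40BIT_DMA_LIMIT_BUG;
        }
    }
    return flags;
}
```

For a 5906, which does not have 5755_PLUS set, the posted version never sets SHORT_DMA_BUG and sets the two other bug flags instead, while the proposed version sets only SHORT_DMA_BUG.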

Comment 29 Chuck Ebbert 2009-11-02 15:36:08 UTC
Build running here, with my change added to the Broadcom patches. Still awaiting confirmation from Broadcom that the change is correct, but I'm pretty sure it is...

Comment 31 Paul W. Frields 2009-11-02 16:46:12 UTC
I just tried the koji builds in comment #30, and they appear to work fine (without the need for the ethtool workaround described in comment #19).  I can now scp from a command line or via Nautilus/gvfs mount without difficulty or certain death. :-)

Comment 32 Chuck Ebbert 2009-11-02 16:59:49 UTC
Created attachment 367165 [details]
patch 1/7

Comment 33 Chuck Ebbert 2009-11-02 17:00:22 UTC
Created attachment 367166 [details]
patch 2/7

Comment 34 Chuck Ebbert 2009-11-02 17:00:53 UTC
Created attachment 367167 [details]
patch 3/7

Comment 35 Chuck Ebbert 2009-11-02 17:01:22 UTC
Created attachment 367168 [details]
patch 4/7

Comment 36 Chuck Ebbert 2009-11-02 17:01:52 UTC
Created attachment 367169 [details]
patch 5/7

Comment 37 Chuck Ebbert 2009-11-02 17:02:22 UTC
Created attachment 367170 [details]
patch 6/7

Comment 38 Chuck Ebbert 2009-11-02 17:02:57 UTC
Created attachment 367171 [details]
patch 7/7

Comment 39 Chuck Ebbert 2009-11-02 17:06:18 UTC
I've attached all seven patches that went into the test kernel. The seventh fixes the problem I found with the patch "tg3: Fix 5906 transmit hangs".

Comment 40 Matt Carlson 2009-11-02 17:28:04 UTC
Chuck, you are correct.  I missed that in my haste to get the patch to you ASAP.  Sorry.

I'll be sending (the corrected version of) these patches upstream very soon.

Comment 41 Adam Williamson 2009-11-02 17:52:24 UTC
great job, guys, thanks. It'd be great to have this in the final release, we may well need to tag a new kernel anyway to deal with the Radeon hang issue I'm currently investigating with Dave and Jerome. please co-ordinate with them before tag requesting anything, though, I'm not sure if they'd want the 97 / 104 / 105 / 106 / 107 changes going in without some cleanup...

Comment 42 Paul W. Frields 2009-11-04 16:48:33 UTC
Bill Nottingham voiced a question in #fedora-kernel this morning about the tg3 fixes:

<notting> is there a reason that's not on the blocker list?

I'm not against that obviously.  Does it make sense to move this back to F12Blocker just to ensure that it's taken care of in the kernel that gets tagged for f12-final?

I'm happy to be a guinea pig for testing if needed.  The special 107.tg3 kernel that Chuck built works fine, but I verified that the -112 kernel (which he confirms didn't have his commits) is back to the broken behavior.

Comment 43 Adam Williamson 2009-11-04 17:16:42 UTC
my bad, we did have a rationale for dropping this from the blocker list but it wasn't relayed to the bug report for some reason. the bug doesn't need to be on the blocker list for us to accept a tag for it, though. Chuck stuck the fixes into kernel -116; please work with the kernel team to file a tag request for that kernel or later (I'd quite like to have the fix from 117 too).

Comment 44 Adam Williamson 2009-11-17 07:35:13 UTC
this should be closed, we wound up with 127 in the final release. assuming the problem didn't somehow magically come back between 117 and 127. please re-open if it did.

Comment 45 Paul W. Frields 2009-11-17 13:04:38 UTC
117 fixed the problem, and it's stayed resolved through 127. Thanks.

Comment 46 Gianluca Cecchi 2010-02-15 11:04:27 UTC
Hello, I would like to add a comment to this bug, as I have a problem that seems closely related; I may open a new bug referring to this one.
I tested all combinations of cables and duplex mode settings and I'm again getting a similar problem.
My system is an XPS M1330 with up-to-date F12 x86_64 and kernel 2.6.31.12-174.2.3.fc12.x86_64.
To summarize: after starting a large transfer, I soon get this in /var/log/messages:

Feb  9 15:40:14 tekkaman kernel: tg3: eth0: Link is down.
Feb  9 15:40:14 tekkaman NetworkManager: <info>  (eth0): carrier now OFF (device state 8, deferring action for 4 seconds)
Feb  9 15:40:16 tekkaman kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Feb  9 15:40:16 tekkaman kernel: tg3: eth0: Flow control is off for TX and off for RX.
Feb  9 15:40:16 tekkaman NetworkManager: <info>  (eth0): carrier now ON (device state 8)

The behavior is now different from the original report, where the connection dropped entirely: instead, the transfer rate suddenly drops from around 9.7MB/s to 4-5MB/s.

My lspci output:
09:00.0 Ethernet controller: Broadcom Corporation NetLink BCM5906M Fast Ethernet PCI Express (rev 02)

With the temporary workaround that was used before the supposed fix (see comments 19 and 21), I still get the link off/on messages for the interface, but at least I can sustain 10.1MB/s over the whole 600MB transfer.

My initial (default) setup, with the card set to auto-sense and autonegotiation settling at 100 Mbit/s:
[root@tekkaman ~]# ethtool -k eth0
Offload parameters for eth0:
Cannot get device flags: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off

Shouldn't this have been fixed already? Or do I now have to set sg=off by default?
Workaround:
[root@tekkaman ~]# ethtool -K eth0 sg off

Verify the changed settings:
[root@tekkaman ~]# ethtool -k eth0
Offload parameters for eth0:
Cannot get device flags: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: off
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off

Transfer test:
[gcecchi@tekkaman ~]$ time scp Downloads/SS830.2009_0817.103-x64.iso root@mysrv:/root
root@mysrv's password:
SS830.2009_0817.103-x64.iso                   100%  603MB  10.6MB/s   00:57

real	0m59.383s
user	0m17.708s
sys	0m3.957s

After the first 10-20 MB transferred I get these in /var/log/messages:
Feb  9 15:40:14 tekkaman kernel: tg3: eth0: Link is down.
Feb  9 15:40:14 tekkaman NetworkManager: <info>  (eth0): carrier now OFF (device state 8, deferring action for 4 seconds)
Feb  9 15:40:16 tekkaman kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Feb  9 15:40:16 tekkaman kernel: tg3: eth0: Flow control is off for TX and off for RX.
Feb  9 15:40:16 tekkaman NetworkManager: <info>  (eth0): carrier now ON (device state 8)

Has anyone with this model hit this problem again? Possibly a regression after the 127 kernel?
Before re-opening the bug I wanted to share this.

Thanks again for comments and input.
Gianluca

Comment 47 Li Qi 2010-02-15 17:12:32 UTC
When I upgraded the Fedora 12 kernel to 2.6.31.12-174.2.3 on my Inspiron 1420, I hit the same problem you describe in the last comment. But when I downloaded the 2.6.32 source rpm from the Fedora development repository for F13 and rebuilt it, the tg3 problem was solved completely. You could give that a try.

Comment 48 C Sand 2010-03-11 00:57:29 UTC
The bug seems to be back in kernel 2.6.32.9
See Bug 571638

Comment 49 Matt Carlson 2010-03-11 22:28:47 UTC
I think an upstream kernel change allowed scatter/gather fragments to be sized
less than or equal to 8 bytes.  This exposed a 5906 chip bug.  Commit
92c6b8d16a36df3f28b2537bed2a56491fb08f11 fixes the problem.  This commit was
integrated into the 3.103 version of the tg3 driver.  I'm pretty sure this fix
was integrated into RedHat's 2.6.31 kernel.

I checked the 2.6.32.9 kernel and it does not yet contain the fix.

Comment 50 C Sand 2010-03-11 23:57:14 UTC
Reopening, based on Comment 49.

Why isn't the tg3 fix upstream ?

Comment 51 Matt Carlson 2010-03-12 00:10:57 UTC
It is.  Like comment 49 says, it is in commit 92c6b8d16a36df3f28b2537bed2a56491fb08f11. :)  Mike Pagano backported it into stable.

Comment 52 C Sand 2010-03-12 00:46:42 UTC
(In reply to comment #51)
> It is.  Like comment 49 says, it is in commit
> 92c6b8d16a36df3f28b2537bed2a56491fb08f11. :)
> Mike Pagano backported it into stable.    

Sorry for the confusion - I was referring to 2.6.32.x as "upstream".

Just to make sure there is no more confusion -- will the fix be in 2.6.32.10?

Comment 53 Matt Carlson 2010-03-16 20:05:40 UTC
Sorry for the delay guys.  I went back to make sure it was queued up and for some reason it wasn't.  Mike Pagano has resubmitted the patch and we are just waiting  for the patch to be accepted.  I was hoping it would be accepted quickly and I could give the thumbs-up here, but I guess we'll have to wait a little longer.

Comment 54 Matt Carlson 2010-03-18 23:02:47 UTC
The patch just got accepted.

Comment 55 Chuck Ebbert 2010-08-04 11:26:40 UTC
A fix for the completely different bug with the same symptoms as the original report went into 2.6.32.11.

Comment 56 Zesen Qian 2014-10-23 15:41:17 UTC
Hey guys, sorry to dig this up after 4 years, but I'm still suffering from lost connections on my BCM5906M card with the tg3 driver. About 20+ hours after bootup, ifconfig still shows the IP, but a ping to the gateway hangs. I cannot fix it by simply restarting the network interface; I have to unload the tg3 module and reload it, after which the card is online again.
I cannot provide much info now, since the problem is hard to reproduce. I tried sending UDP packets to put the card under full load, but the problem doesn't occur and the card stays online. I find my problem quite similar to this one, because I am running a BitTorrent client too: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/404708

The dmesg there looks the same as mine, except that I cannot solve it by turning off gso with 'ethtool -K enp4s0 gso off':

Riaqn-Laptop ~ # ethtool -k enp4s0 | grep -i gso
tx-gso-robust: off [fixed]
Riaqn-Laptop ~ # ethtool -K enp4s0 gso on
Riaqn-Laptop ~ # ethtool -k enp4s0 | grep -i gso
tx-gso-robust: off [fixed]

It looks like the gso option is fixed to off and cannot be changed.
BTW, I had FreeBSD running on this machine a few weeks ago, and it lost the network connection too. So I wonder whether this is a hardware bug or a design mistake?

Please ask if more specific info is needed.