I've run into a situation where the sky2 driver kind-of-locks-up.
The machine in question is a MacBook Pro 4,1; "lspci -vvnn" output for the NIC:
0c:00.0 Ethernet controller : Marvell Technology Group Ltd. Marvell Yukon 88E8058 PCI-E Gigabit Ethernet Controller [11ab:436a] (rev 13)
Subsystem: Marvell Technology Group Ltd. Device [11ab:00ba]
Physical Slot: 5
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 256 bytes
Interrupt: pin A routed to IRQ 29
Region 0: Memory at d7200000 (64-bit, non-prefetchable) [size=16K]
Region 2: I/O ports at 5000 [size=256]
Expansion ROM at dba00000 [disabled] [size=128K]
Capabilities:  Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities:  Vital Product Data
Product Name: Marvell Yukon 88E8058 Gigabit Ethernet Controller
[PN] Part number: Yukon 88E8058
[EC] Engineering changes: Rev. 1.3
[MN] Manufacture ID: 4d 61 72 76 65 6c 6c
[SN] Serial number: AbCdEfGE8127B
[CP] Extended capability: 01 10 cc 03
[RV] Reserved: checksum good, 9 byte(s) reserved
[RW] Read-write area: 121 byte(s) free
Capabilities: [5c] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee0300c Data: 41a1
Capabilities: [e0] Express (v1) Legacy Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 2048 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <256ns, L1 unlimited
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM L0s L1 Enabled; RCB 128 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 1f, GenCap- CGenEn- ChkCap- ChkEn-
Kernel driver in use: sky2
Kernel modules: sky2
Messages in /var/log/messages:
(no messages for over half an hour)
May 14 10:34:15 nike kernel: sky2 eth0: rx length error: status 0x5ea0100 length 3018
May 14 10:34:18 nike kernel: eth0: hw csum failure.
May 14 10:34:18 nike kernel: Pid: 0, comm: swapper Not tainted 184.108.40.206-115.fc12.x86_64 #1
May 14 10:34:18 nike kernel: Call Trace:
May 14 10:34:18 nike kernel: <IRQ> [<ffffffff813b457f>] netdev_rx_csum_fault+0x3b/0x3f
May 14 10:34:18 nike kernel: [<ffffffff813ae4ba>] __skb_checksum_complete_head+0x51/0x65
May 14 10:34:18 nike kernel: [<ffffffff813ae4df>] __skb_checksum_complete+0x11/0x13
May 14 10:34:18 nike kernel: [<ffffffff81417f4d>] nf_ip_checksum+0xdd/0xe3
May 14 10:34:18 nike kernel: [<ffffffff813d7af3>] tcp_error+0x105/0x1a2
May 14 10:34:18 nike kernel: [<ffffffff813f5536>] ? tcp_rcv_established+0x555/0x6a6
May 14 10:34:18 nike kernel: [<ffffffff813d4b69>] nf_conntrack_in+0x17a/0x86e
May 14 10:34:18 nike kernel: [<ffffffff81047209>] ? enqueue_entity+0x25b/0x267
May 14 10:34:18 nike kernel: [<ffffffff8104810a>] ? enqueue_task_fair+0x2a/0x6d
May 14 10:34:18 nike kernel: [<ffffffff814185bb>] ipv4_conntrack_in+0x21/0x23
May 14 10:34:18 nike kernel: [<ffffffff813d1c22>] nf_iterate+0x46/0x89
May 14 10:34:18 nike kernel: [<ffffffff813e0fd6>] ? ip_rcv_finish+0x0/0x3ba
May 14 10:34:18 nike kernel: [<ffffffff813d1ccf>] nf_hook_slow+0x6a/0xcb
May 14 10:34:18 nike kernel: [<ffffffff813e0fd6>] ? ip_rcv_finish+0x0/0x3ba
May 14 10:34:18 nike kernel: [<ffffffff813e160c>] ip_rcv+0x27c/0x2c9
May 14 10:34:18 nike kernel: [<ffffffff813b3ce7>] netif_receive_skb+0x3fc/0x421
May 14 10:34:18 nike kernel: [<ffffffff813b3e6e>] napi_skb_finish+0x29/0x3d
May 14 10:34:18 nike kernel: [<ffffffff813b42c9>] napi_gro_receive+0x2f/0x34
May 14 10:34:18 nike kernel: [<ffffffffa014b002>] sky2_poll+0x813/0xa88 [sky2]
May 14 10:34:18 nike kernel: [<ffffffff81017e73>] ? native_sched_clock+0x2d/0x5f
May 14 10:34:18 nike kernel: [<ffffffff81017d45>] ? sched_clock+0x9/0xd
May 14 10:34:18 nike kernel: [<ffffffff813b43ff>] net_rx_action+0xaf/0x1c9
May 14 10:34:18 nike kernel: [<ffffffff8105d998>] __do_softirq+0xe5/0x1a9
May 14 10:34:18 nike kernel: [<ffffffff810acfbd>] ? handle_IRQ_event+0x60/0x121
May 14 10:34:18 nike kernel: [<ffffffff81012e6c>] call_softirq+0x1c/0x30
May 14 10:34:18 nike kernel: [<ffffffff810143ea>] do_softirq+0x46/0x86
May 14 10:34:18 nike kernel: [<ffffffff8105d7d6>] irq_exit+0x3b/0x7d
May 14 10:34:18 nike kernel: [<ffffffff8145ad3d>] do_IRQ+0xa5/0xbc
May 14 10:34:18 nike kernel: [<ffffffff81012693>] ret_from_intr+0x0/0x11
May 14 10:34:18 nike kernel: <EOI> [<ffffffff81295665>] ? acpi_idle_enter_bm+0x27e/0x2b2
May 14 10:34:18 nike kernel: [<ffffffff8129565e>] ? acpi_idle_enter_bm+0x277/0x2b2
May 14 10:34:18 nike kernel: [<ffffffff8138949e>] ? cpuidle_idle_call+0x99/0xf3
May 14 10:34:18 nike kernel: [<ffffffff81010cdd>] ? cpu_idle+0xaa/0xe4
May 14 10:34:18 nike kernel: [<ffffffff8144ea71>] ? start_secondary+0x1f2/0x233
This was the first occurrence; it was immediately followed by further hw csum failures (with differing stack traces).
In total there were 208 (logged) failures over the next 10 minutes. Furthermore, the network was mostly unusable (imagine a network running at dial-up modem speeds with 90% packet loss), but it was *not* _quite_ dead: occasionally stuff did seem to get through.
Unplugging and replugging the ethernet cable didn't help; however, unloading and reloading the sky2 module (and reconfiguring networking) fixed the problem [ip link down/up not tried].
This is the first time I'm reporting this, but I've now seen it on the order of 3 times (i.e. very, very rarely) over the last 3-4 months on various fc12 kernels from koji.
It might be worth noting that (at least this time) this seems to have happened (pretty much?) immediately after I returned to my laptop after not using it overnight (but it wasn't suspended or anything like that).
Also occurs on my Gigabyte 965P-S3 Core 2 Duo, on FC13 220.127.116.11-124.fc13.i686
First starts out as:
eth0: hw csum failure.
[<c076ebb6>] ? printk+0xf/0x11
[<c0706b22>] ? tcp_error+0x0/0x16a
[<c0679135>] ? usb_submit_urb+0x23f/0x2a0
[<f378f79c>] ? ftdi_submit_read_urb+0x3e/0x74 [ftdi_sio]
[<f378faf3>] ? ftdi_read_bulk_callback+0x321/0x332 [ftdi_sio]
[<c0679206>] ? usb_free_urb+0x11/0x13
[<c070e440>] ? ip_rcv_finish+0x0/0x2bd
[<c070e440>] ? ip_rcv_finish+0x0/0x2bd
[<c070e440>] ? ip_rcv_finish+0x0/0x2bd
[<f3722dc6>] sky2_poll+0x7c6/0x9d8 [sky2]
[<c04091f3>] ? mwait_idle+0x5c/0x67
then quickly degrades until we get other bad things happening:
BUG: soft lockup - CPU#1 stuck for 61s!
*** Bug 569779 has been marked as a duplicate of this bug. ***
Just happened again with
$ uname -a
Linux nike 18.104.22.168-112.fc13.x86_64 #1 SMP Thu May 27 02:28:31 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
fixed with "ip link set eth0 down; ip link set eth0 up; dhclient -x; dhclient eth0"
Jul 10 19:41:26 nike kernel: sky2 eth0: rx length error: status 0x5ea0100 length 3018
Jul 10 19:41:52 nike kernel: eth0: hw csum failure.
...1547 additional identical hw csum failure messages across 34 hours (I was away from the machine and when I came back I found the network was unusable and fixed it with the above).
Jul 12 05:47:35 nike kernel: eth0: hw csum failure.
Still happening with
$ uname -a
Linux nike 22.214.171.124-35.rc1.fc13.x86_64 #1 SMP Sat Aug 7 16:32:24 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
It actually appears to be happening more often of late (I think 2.6.32 was better than 2.6.33 or 2.6.34).
Once it starts happening your network is dead and you get error messages (with stack dump) about every 10 seconds (for example 700 'eth0: hw csum failure' messages over the course of an hour).
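The message counts quoted in these comments can be turned into an average interval between failures. The following is only a rough sketch: the helper name and the `year` parameter are made up for illustration (syslog timestamps carry no year), and the total of 1549 assumes the two quoted Jul 10 / Jul 12 lines are counted in addition to the 1547 elided ones.

```python
from datetime import datetime

def mean_interval_seconds(first_stamp, last_stamp, total_messages, year=2010):
    """Average gap between messages, given the first and last syslog timestamps.

    `year` is an assumption: classic syslog timestamps do not include one.
    """
    fmt = "%Y %b %d %H:%M:%S"
    first = datetime.strptime(f"{year} {first_stamp}", fmt)
    last = datetime.strptime(f"{year} {last_stamp}", fmt)
    span = (last - first).total_seconds()
    # N messages are separated by N - 1 gaps.
    return span / (total_messages - 1)

# Jul 10-12 burst: first line, 1547 elided lines, last line (assumed 1549 total).
print(mean_interval_seconds("Jul 10 19:41:52", "Jul 12 05:47:35", 1549))

# "700 messages over the course of an hour", expressed as seconds per message.
print(3600 / 700)
```

This only averages over the whole window; it says nothing about whether the failures arrive evenly or in bursts.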
It looks like the error recovery of the sky2 driver in this case (nike kernel: sky2 0000:0c:00.0: eth0: rx length error: status 0x5d60100 length 2982) could use some love.
*** Bug 632501 has been marked as a duplicate of this bug. ***
I see you reported this to netdev but there was no fix:
Yeah, I think it's as 'simple' as adding some sort of driver reset triggered by the 'rx len error', but that's much simpler for someone that actually understands the driver code.
It's obviously a recoverable error since 'ip link set eth0 down && ip link set eth0 up' makes the problem go away (till the next time it happens).
I'm not 100% sure yet, but I don't think turning off all acceleration via 'ethtool -K eth0 ...' prevents the problem from happening (although it may decrease the probability of it triggering - hard to say, it's very sporadic...)
My last occurrences (I'm reasonably sure the Sep 8 one was with all accel off):
cat /var/log/messages-* /var/log/messages | egrep 'rx len'
Aug 16 21:26:25 eth0: rx length error: status 0x5e50100 length 3013
Aug 17 11:16:18 eth0: rx length error: status 0x5ea0100 length 3018
Aug 18 02:14:27 eth0: rx length error: status 0x5ea0100 length 3018
Aug 24 21:34:37 eth0: rx length error: status 0x5ea0100 length 3018
Aug 24 22:35:00 eth0: rx length error: status 0x5e50100 length 3013
Aug 25 09:42:19 eth0: rx length error: status 0x5ea0100 length 3018
Sep 8 03:57:04 eth0: rx length error: status 0x5e50100 length 3013
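For anyone comparing these lines, here is a small sketch that pulls the status word and the reported length out of each "rx length error" message. It rests on an assumption that should be checked against the driver source: that bits 16..30 of the status word hold the frame length seen by the MAC (the GMR_FS_LEN mask in sky2.h) and that bit 8 means "receive OK". The function and constant names below are illustrative, not taken from the driver.

```python
import re

# Assumption: bits 16..30 of the status word are the MAC's frame length
# and bit 8 is "receive OK" (as I read sky2.h; not verified here).
GMR_FS_LEN_MASK = 0x7FFF << 16
GMR_FS_RX_OK = 1 << 8

RX_LEN_RE = re.compile(r"rx length error: status (0x[0-9a-fA-F]+) length (\d+)")

def decode_rx_length_error(line):
    """Return (mac_length, reported_length, rx_ok) for a matching line, else None."""
    match = RX_LEN_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    reported_length = int(match.group(2))
    mac_length = (status & GMR_FS_LEN_MASK) >> 16
    return mac_length, reported_length, bool(status & GMR_FS_RX_OK)

samples = [
    "Aug 16 21:26:25 eth0: rx length error: status 0x5e50100 length 3013",
    "Aug 17 11:16:18 eth0: rx length error: status 0x5ea0100 length 3018",
]
for sample in samples:
    print(decode_rx_length_error(sample))
```

Under that assumption, status 0x5ea0100 decodes to a 1514-byte frame flagged as received OK, while the length handed to the driver is 3018, so the two values disagree, which would be consistent with the driver reporting a length error.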
I don't fully follow the corruption comments on the netdev thread and their true implications (and why they don't want to fix it?).
This is a very irritating problem. It was fixed in Fedora 11 and returned in Fedora 12, as I reported in bug 514693. Since then, we cannot download a big file. Our ethernet connection works fine as long as no big download is involved.
I haven't seen this error in a long time:
# cat /var/log/messages-* /var/log/messages | egrep 'rx len'
Oct 12 16:10:22 nike kernel: sky2 0000:0c:00.0: eth0: rx length error: status 0x3c0300 length 1548
I wonder if this means it got fixed somehow in recent kernels, or if I've just been lucky?
# cat /var/log/messages-* /var/log/messages | egrep 'Linux version'
Oct 19 11:22:22 nike kernel: Linux version 126.96.36.199-61.fc13.x86_64 (firstname.lastname@example.org) (gcc version 4.4.4 20100630 (Red Hat 4.4.4-10) (GCC) ) #1 SMP Tue Oct 19 04:06:30 UTC 2010
Nov 7 19:44:08 nike kernel: [ 0.000000] Linux version 188.8.131.52-48.fc14.x86_64 (email@example.com) (gcc version 4.5.1 20100924 (Red Hat 4.5.1-4) (GCC) ) #1 SMP Fri Oct 22 15:36:08 UTC 2010
So this suggests that just perhaps 184.108.40.206-61.fc13.x86_64 is good (or much better...)?
# last | egrep boot | tac
reboot system boot 220.127.116.11-59.fc13 Fri Oct 1 11:12 - 11:19 (18+00:07)
reboot system boot 18.104.22.168-61.fc13 Tue Oct 19 11:22 - 19:41 (19+09:19)
reboot system boot 22.214.171.124-48.fc14 Sun Nov 7 19:44 - 19:56 (00:12)
Based on the kernel changelog, it seems doubtful that anything was fixed in -61 compared to -59, though...:
* Mon Oct 18 2010 Kyle McMartin <firstname.lastname@example.org> 126.96.36.199-61
- Add Ricoh e822 support. (rhbz#596475) Thanks to sgruszka@ for
sending the patches in.
* Mon Oct 18 2010 Kyle McMartin <email@example.com> 188.8.131.52-60
- Quirk to disable DMAR with Ricoh card reader/firewire. (rhbz#605888)
* Mon Oct 18 2010 Kyle McMartin <firstname.lastname@example.org>
- Two networking fixes (skge, r8169) from sgruska. (rhbz#447489,629158)
* Thu Oct 14 2010 Neil Horman <email@example.com>
- Fix rcu warning in twsock_net (bz 642905)
* Wed Oct 06 2010 Neil Horman <firstname.lastname@example.org>
- Fix WARN_ON when you try to create an exiting bond in bond_masters
* Thu Sep 30 2010 Chuck Ebbert <email@example.com>
- CVE-2010-3432: sctp-do-not-reset-the-packet-during-sctp_packet_config.patch
* Thu Sep 30 2010 Ben Skeggs <firstname.lastname@example.org> 184.108.40.206-59
I think this means I've just been very lucky...
You're extremely lucky :) The only thing that has changed is that they just hid the error. Try searching for "hw csum failure"! It's still out there. Also try downloading the Fedora DVD and enjoy. I reproduced it after about 290M.
$ uname -a
Linux nike 220.127.116.11-48.fc14.x86_64 #1 SMP Fri Oct 22 15:36:08 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
$ wget -O /dev/null http://mirrors.kernel.org/fedora/releases/14/Fedora/x86_64/iso/Fedora-14-x86_64-DVD.iso
--2010-11-08 02:31:16-- http://mirrors.kernel.org/fedora/releases/14/Fedora/x86_64/iso/Fedora-14-x86_64-DVD.iso
Resolving mirrors.kernel.org... 18.104.22.168
Connecting to mirrors.kernel.org|22.214.171.124|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3520802816 (3.3G) [application/x-iso9660-image]
Saving to: `/dev/null'
100%[====================================>] 3,520,802,816 22.1M/s in 2m 28s
2010-11-08 02:33:44 (22.6 MB/s) - `/dev/null' saved [3520802816/3520802816]
And I downloaded twice, so after 6.6 GB transferred, still no errors...
The bug is opened against Fedora 13 and not 14. I'm using Fedora 13 with the latest kernel:
$ uname -a
Linux bb229 126.96.36.199-61.fc13.x86_64 #1 SMP Tue Oct 19 04:06:30 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
If the bug is solved in Fedora 14, then that would be awesome news. I think I have to schedule the upgrade earlier.
I upgraded barely 2 days ago, so it is still _way_ too early to say whether it's truly stable.
Note that I saw no issues from Oct 19th till Nov 6th on the F13 kernel (188.8.131.52-61.fc13.x86_64), and I'm sure I did way more downloading than a mere couple of GB during that time (including downloading all of F14 during a yum upgrade).
Basically all I'm saying is that of late the probability of this triggering has gone drastically down for me. I have no idea what the actual cause of this apparent goodness is. Weather?
It looks like our bugs must be slightly different in some way...
[oh, wait, I just realized, I think I've turned off all hw optimizations on the nic (via 'ethtool -K' before link up), maybe that's the fix, I'll retry the downloads tomorrow with tso and the like turned on]
I copied (downloaded via scp -c arcfour) around 100GB of data with full acceleration turned on and saw no issues with the F14 kernel. But, then, my experience has always been that it has a tendency to break during periods of idleness, and not load...
I'll leave acceleration turned on and we'll see if it breaks by itself.
Maciej, thank you very much for your detailed information. I've upgraded my system to Fedora 14 as well:
# uname -a
Linux bb229 184.108.40.206-48.fc14.x86_64 #1 SMP Fri Oct 22 15:36:08 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
and it seems to work fine. I've downloaded the Fedora DVD ISO over the LAN using scp, wget and a browser with no problems. OK, Firefox freezes every time the download reaches 2GB, but I don't think that bug can be attributed to the ethernet controller. It looks like a Firefox bug. Chrome downloaded the file without issues and much faster.
My problem was observed only when downloading a big file. I have never faced the problem with yum downloads, for example, only when trying to download a big file over the Internet. I just hope the bug won't re-appear as it has in the past. Fingers crossed :)
Famous last words:
Nov 10 02:13:38 bb229 kernel: [45909.915965] eth0: hw csum failure.
Nov 10 02:13:38 bb229 kernel: [45909.915974] Pid: 0, comm: swapper Not tainted 2
Nov 10 02:13:38 bb229 kernel: [45909.915977] Call Trace:
Nov 10 02:13:38 bb229 kernel: [45909.915980] <IRQ> [<ffffffff813be889>] netdev
Nov 10 02:13:38 bb229 kernel: [45909.915997] [<ffffffff813b9057>] __skb_checksu
Nov 10 02:13:38 bb229 kernel: [45909.916002] [<ffffffff813b907c>] __skb_checksu
Nov 10 02:13:38 bb229 kernel: [45909.916008] [<ffffffff814281aa>] nf_ip_checksu
Nov 10 02:13:38 bb229 kernel: [45909.916015] [<ffffffff813e7c73>] udp_error+0x1
Nov 10 02:13:38 bb229 kernel: [45909.916022] [<ffffffff81047a42>] ? select_idle
Nov 10 02:13:38 bb229 kernel: [45909.916028] [<ffffffff813e314f>] nf_conntrack_
Nov 10 02:13:38 bb229 kernel: [45909.916036] [<ffffffff81105c71>] ? __raw_local
Nov 10 02:13:38 bb229 kernel: [45909.916042] [<ffffffff81108682>] ? kmem_cache_
That problem was caused by the following crontab entry:
0 2 10 11 * /home/panos/bin/runwget 'ftp://ftp.ntua.gr/pub/linux/fedora/linux/releases/14/Fedora/i386/iso/Fedora-14-i386-DVD.iso'
> dir -h /opt/tmp/Fedora-14-i386-DVD.iso
-rw-r--r-- 1 panos users 391M 2010-11-10 02:13 /opt/tmp/Fedora-14-i386-DVD.iso
Only 391MB could be downloaded, and the problem appeared after 13 minutes. Indeed, the problem has become less frequent, but it is definitely still there.
Wee... hit the bug again in the current F14 kernel.
This time I had all acceleration enabled.
Will try running with all rx acceleration disabled, but tx acceleration turned on, i.e.:
sudo ip link set eth0 down
sudo ethtool -K eth0 rx off tx on sg on tso on gso on gro off lro off rxhash on
sudo ip link set eth0 up
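If you want to flip between offload combinations repeatedly while testing, the three commands above can be generated from a small table. This is just a sketch: the helper name `offload_commands` and the dictionary are made up for illustration, and it only builds and prints the command lines rather than running them.

```python
import shlex

def offload_commands(dev, features):
    """Build the down / ethtool -K / up sequence as argument lists."""
    ethtool = ["ethtool", "-K", dev]
    for name, enabled in features.items():
        ethtool += [name, "on" if enabled else "off"]
    return [
        ["ip", "link", "set", dev, "down"],
        ethtool,
        ["ip", "link", "set", dev, "up"],
    ]

# Same settings as the commands above: rx-side offloads off, tx-side on.
RX_OFF_TX_ON = {
    "rx": False, "tx": True, "sg": True, "tso": True,
    "gso": True, "gro": False, "lro": False, "rxhash": True,
}

for argv in offload_commands("eth0", RX_OFF_TX_ON):
    print("sudo " + shlex.join(argv))
```

Feature names are passed straight through to ethtool, so a name the NIC or ethtool version does not support will still make the real command fail.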
I am beginning to think that this doesn't happen without acceleration (or is much much rarer).
Especially fishy is that this appears to happen with a length > ethernet MTU (ie. rx length error: status 0x... length ~3000). It's almost like an lro issue, except I do not believe lro to be available (I can't turn it on with ethtool)?
[252362.248169] sky2 0000:0c:00.0: eth0: rx length error: status 0x5b50100 length 2917
[252365.261951] eth0: hw csum failure.
[252365.261964] Pid: 0, comm: swapper Not tainted 220.127.116.11-48.fc14.x86_64 #1
[252365.261970] Call Trace:
[252365.261975] <IRQ> [<ffffffff813be889>] netdev_rx_csum_fault+0x3b/0x40
[252365.262000] [<ffffffff813b9057>] __skb_checksum_complete_head+0x51/0x65
[252365.262009] [<ffffffff813b907c>] __skb_checksum_complete+0x11/0x13
[252365.262019] [<ffffffff814281aa>] nf_ip_checksum+0xce/0xd4
[252365.262029] [<ffffffff813e7c73>] udp_error+0x142/0x1a0
[252365.262041] [<ffffffff81105c71>] ? __raw_local_irq_save+0x1d/0x23
[252365.262050] [<ffffffff81108682>] ? kmem_cache_free+0x7a/0xb9
[252365.262060] [<ffffffff813e314f>] nf_conntrack_in+0x14d/0x8b4
[252365.262071] [<ffffffff81047a42>] ? select_idle_sibling+0x3a/0xee
[252365.262080] [<ffffffff8142884d>] ipv4_conntrack_in+0x21/0x23
[252365.262088] [<ffffffff813e0529>] nf_iterate+0x46/0x89
[252365.262098] [<ffffffff813efe71>] ? ip_rcv_finish+0x0/0x34a
[252365.262106] [<ffffffff813e05d6>] nf_hook_slow+0x6a/0xd0
[252365.262114] [<ffffffff813efe71>] ? ip_rcv_finish+0x0/0x34a
[252365.262123] [<ffffffff813efe71>] ? ip_rcv_finish+0x0/0x34a
[252365.262132] [<ffffffff813f042b>] NF_HOOK.clone.8+0x46/0x58
[252365.262142] [<ffffffff813b6965>] ? __netdev_alloc_skb+0x34/0x51
[252365.262150] [<ffffffff813f07c5>] ip_rcv+0x21e/0x24d
[252365.262159] [<ffffffff813bda4c>] __netif_receive_skb+0x3ed/0x412
[252365.262168] [<ffffffff813be15d>] netif_receive_skb+0x57/0x5e
[252365.262177] [<ffffffff813be676>] napi_skb_finish+0x29/0x41
[252365.262186] [<ffffffff813be6bd>] napi_gro_receive+0x2f/0x34
[252365.262223] [<ffffffffa02011e2>] sky2_poll+0xa3b/0xc8c [sky2]
[252365.262234] [<ffffffff8101057c>] ? native_sched_clock+0x35/0x37
[252365.262243] [<ffffffff81010587>] ? sched_clock+0x9/0xd
[252365.262253] [<ffffffff8106b129>] ? sched_clock_local+0x12/0x75
[252365.262264] [<ffffffff8146912f>] ? _raw_spin_unlock_irqrestore+0x17/0x19
[252365.262273] [<ffffffff81108682>] ? kmem_cache_free+0x7a/0xb9
[252365.262283] [<ffffffff813bf420>] net_rx_action+0xac/0x1bb
[252365.262291] [<ffffffff813bcd1f>] ? __napi_schedule+0x50/0x57
[252365.262302] [<ffffffff81053839>] __do_softirq+0xdd/0x199
[252365.262317] [<ffffffffa01fd627>] ? sky2_intr+0x35/0x3c [sky2]
[252365.262327] [<ffffffff8100abdc>] call_softirq+0x1c/0x30
[252365.262335] [<ffffffff8100c338>] do_softirq+0x46/0x82
[252365.262344] [<ffffffff81053999>] irq_exit+0x3b/0x7d
[252365.262352] [<ffffffff8146f075>] do_IRQ+0x9d/0xb4
[252365.262361] [<ffffffff81469593>] ret_from_intr+0x0/0x11
[252365.262366] <EOI> [<ffffffff8128f39c>] ? raw_local_irq_enable+0x10/0x12
[252365.262385] [<ffffffff8106b3a0>] ? sched_clock_idle_wakeup_event+0x17/0x1b
[252365.262394] [<ffffffff81290208>] acpi_idle_enter_bm+0x228/0x260
[252365.262405] [<ffffffff813939e1>] cpuidle_idle_call+0x8b/0xe9
[252365.262416] [<ffffffff81008325>] cpu_idle+0xaa/0xcc
[252365.262425] [<ffffffff81450e46>] rest_init+0x8a/0x8c
[252365.262436] [<ffffffff81ba1c49>] start_kernel+0x40b/0x416
[252365.262446] [<ffffffff81ba12c6>] x86_64_start_reservations+0xb1/0xb5
[252365.262455] [<ffffffff81ba13c2>] x86_64_start_kernel+0xf8/0x107
[252385.388560] eth0: hw csum failure.
[252385.388573] Pid: 0, comm: swapper Not tainted 18.104.22.168-48.fc14.x86_64 #1
[252395.271567] eth0: hw csum failure.
[252395.271575] Pid: 0, comm: swapper Not tainted 22.214.171.124-48.fc14.x86_64 #1
[252415.422920] eth0: hw csum failure.
[252415.422933] Pid: 3414, comm: chrome Not tainted 126.96.36.199-48.fc14.x86_64 #1
[252425.279133] eth0: hw csum failure.
[252425.279146] Pid: 3388, comm: chrome Not tainted 188.8.131.52-48.fc14.x86_64 #1
Perhaps this needs filing upstream?
$ uname -a
Linux nike 184.108.40.206-83.fc14.x86_64 #1 SMP Mon Feb 7 07:06:44 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
Mar 5 16:20:02 nike kernel: [263628.531087] sky2 0000:0c:00.0: eth0: rx length error: status 0x5860100 length 2822
$ ethtool -k eth0
Offload parameters for eth0:
I think this means that turning off rx accel makes the bug _much_ less likely to occur, but can still happen.
It is not at all clear to me how length > MTU (1500) happens with all rx offload turned off.
I've bugged upstream (via email to netdev) and was told it's not a bug in the driver, but rather in the bios/firmware or something. Even though clearly it can be fixed in the driver by handling the error with an appropriate reset, they're not willing to listen - they consider it not their problem.
I see the same on F15:
# uname -a
Linux redwood.eagercon.com 220.127.116.11-1.fc15.x86_64 #1 SMP Fri Nov 11 21:36:28 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
[ 242.545001] p35p1: hw csum failure.
[ 242.545006] Pid: 0, comm: kworker/0:0 Tainted: P 18.104.22.168-1.fc15.x86_64 #1
[ 242.545009] Call Trace:
[ 242.545011] <IRQ> [<ffffffff813ec311>] netdev_rx_csum_fault+0x38/0x3c
[ 242.545021] [<ffffffff813e625f>] __skb_checksum_complete_head+0x51/0x65
[ 242.545026] [<ffffffff813e6284>] __skb_checksum_complete+0x11/0x13
[ 242.545037] [<ffffffffa0f3fb14>] br_multicast_rcv+0x885/0xd52 [bridge]
[ 242.545042] [<ffffffff81117195>] ? virt_to_head_page+0xe/0x31
[ 242.545052] [<ffffffffa0f3d552>] ? br_nf_pre_routing+0x2f/0x3df [bridge]
[ 242.545060] [<ffffffffa0f37827>] ? br_multicast_flood+0x11d/0x12c [bridge]
[ 242.545066] [<ffffffff8140fc32>] ? nf_iterate+0x48/0x7d
[ 242.545074] [<ffffffffa0f385e0>] ? NF_HOOK.constprop.0+0x58/0x58 [bridge]
[ 242.545082] [<ffffffffa0f385e0>] ? NF_HOOK.constprop.0+0x58/0x58 [bridge]
[ 242.545087] [<ffffffff8140fcd9>] ? nf_hook_slow+0x72/0x115
[ 242.545096] [<ffffffffa0f38672>] br_handle_frame_finish+0x92/0x20f [bridge]
[ 242.545104] [<ffffffffa0f385e0>] ? NF_HOOK.constprop.0+0x58/0x58 [bridge]
[ 242.545113] [<ffffffffa0f385d9>] NF_HOOK.constprop.0+0x51/0x58 [bridge]
[ 242.545118] [<ffffffff8149c9ac>] ? _raw_spin_unlock_irqrestore+0x17/0x19
[ 242.545126] [<ffffffffa0f38995>] br_handle_frame+0x1a6/0x1c1 [bridge]
[ 242.545135] [<ffffffffa0f387ef>] ? br_handle_frame_finish+0x20f/0x20f [bridge]
[ 242.545139] [<ffffffff813ea31e>] __netif_receive_skb+0x2c5/0x417
[ 242.545146] [<ffffffff813ed5ba>] netif_receive_skb+0x6c/0x73
[ 242.545152] [<ffffffff813ed650>] napi_skb_finish+0x27/0x3f
[ 242.545156] [<ffffffff813edaa7>] napi_gro_receive+0x2f/0x34
[ 242.545165] [<ffffffffa0022293>] sky2_poll+0x7db/0x9f7 [sky2]
[ 242.545170] [<ffffffff813edbd3>] net_rx_action+0xa9/0x1b8
[ 242.545175] [<ffffffff8105d683>] __do_softirq+0xc9/0x1b5
[ 242.545179] [<ffffffff81014b35>] ? paravirt_read_tsc+0x9/0xd
[ 242.545183] [<ffffffff81014fec>] ? sched_clock+0x9/0xd
[ 242.545188] [<ffffffff814a536c>] call_softirq+0x1c/0x30
[ 242.545192] [<ffffffff81010b45>] do_softirq+0x46/0x81
[ 242.545197] [<ffffffff8105d94b>] irq_exit+0x57/0xb1
[ 242.545201] [<ffffffff814a5c4e>] do_IRQ+0x8e/0xa5
[ 242.545205] [<ffffffff8149cd6e>] common_interrupt+0x6e/0x6e
[ 242.545208] <EOI> [<ffffffff81015d21>] ? mwait_idle+0x87/0xb4
[ 242.545216] [<ffffffff81015d14>] ? mwait_idle+0x7a/0xb4
[ 242.545220] [<ffffffff8100e2ed>] cpu_idle+0xae/0xe8
[ 242.545225] [<ffffffff8148baa3>] start_secondary+0x23f/0x241
Not related to MTU > 1500:
# ifconfig p35p1
p35p1 Link encap:Ethernet HWaddr 00:1D:60:44:09:8C
inet6 addr: fe80::21d:60ff:fe44:98c/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:94054 errors:0 dropped:0 overruns:0 frame:0
TX packets:20449 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:138109575 (131.7 MiB) TX bytes:1919020 (1.8 MiB)
I don't want to disappoint you, but the same failure occurs in Fedora 16 on the latest kernel, kernel-3.1.2-1.fc16.x86_64, as well. Nothing has changed. Can you please move the bug to F16?
Nov 25 16:06:55 bb229 kernel: [20038.066696] eth0: hw csum failure.
Nov 25 16:06:55 bb229 kernel: [20038.066706] Pid: 8864, comm: firefox Tainted: G C 3.1.2-1.fc16.x86_64 #1
Nov 25 16:06:55 bb229 kernel: [20038.066711] Call Trace:
Nov 25 16:06:55 bb229 kernel: [20038.066714] <IRQ> [<ffffffff813d7fde>] netdev_rx_csum_fault+0x38/0x3c
Nov 25 16:06:55 bb229 kernel: [20038.066734] [<ffffffff813d1f2b>] __skb_checksum_complete_head+0x51/0x65
Nov 25 16:06:55 bb229 kernel: [20038.066741] [<ffffffff813d1f50>] __skb_checksum_complete+0x11/0x13
Nov 25 16:06:55 bb229 kernel: [20038.066748] [<ffffffff8143d654>] nf_ip_checksum+0xcd/0xd3
Nov 25 16:06:55 bb229 kernel: [20038.066771] [<ffffffffa0468360>] udp_error+0x137/0x195 [nf_conntrack]
Nov 25 16:06:55 bb229 kernel: [20038.066780] [<ffffffff8110cb93>] ? dma_pool_alloc+0x22f/0x244
Nov 25 16:06:55 bb229 kernel: [20038.066797] [<ffffffffa04639ff>] nf_conntrack_in+0x174/0x7dc [nf_conntrack]
Nov 25 16:06:55 bb229 kernel: [20038.066807] [<ffffffff813539f2>] ? uhci_alloc_td+0x1f/0x4d
Nov 25 16:06:55 bb229 kernel: [20038.066813] [<ffffffff81353de9>] ? uhci_submit_common+0x2a7/0x341
Nov 25 16:06:55 bb229 kernel: [20038.066827] [<ffffffffa0488569>] ipv4_conntrack_in+0x21/0x23 [nf_conntrack_ipv4]
Nov 25 16:06:55 bb229 kernel: [20038.066837] [<ffffffff813fb903>] nf_iterate+0x48/0x7d
Nov 25 16:06:55 bb229 kernel: [20038.066844] [<ffffffff81403794>] ? inet_del_protocol+0x35/0x35
Nov 25 16:06:55 bb229 kernel: [20038.066850] [<ffffffff81403794>] ? inet_del_protocol+0x35/0x35
Nov 25 16:06:55 bb229 kernel: [20038.066857] [<ffffffff813fb9aa>] nf_hook_slow+0x72/0x114
Nov 25 16:06:55 bb229 kernel: [20038.066862] [<ffffffff81403794>] ? inet_del_protocol+0x35/0x35
Nov 25 16:06:55 bb229 kernel: [20038.066869] [<ffffffff81403794>] ? inet_del_protocol+0x35/0x35
Nov 25 16:06:55 bb229 kernel: [20038.066874] [<ffffffff81403aec>] NF_HOOK.constprop.3+0x46/0x58
Nov 25 16:06:55 bb229 kernel: [20038.066878] [<ffffffff81404089>] ip_rcv+0x239/0x268
Nov 25 16:06:55 bb229 kernel: [20038.066883] [<ffffffff813d60f2>] __netif_receive_skb+0x3cd/0x418
Nov 25 16:06:55 bb229 kernel: [20038.066887] [<ffffffff813d928a>] netif_receive_skb+0x6c/0x73
Nov 25 16:06:55 bb229 kernel: [20038.066891] [<ffffffff813d9320>] napi_skb_finish+0x27/0x3f
Nov 25 16:06:55 bb229 kernel: [20038.066894] [<ffffffff813d9778>] napi_gro_receive+0x2f/0x34
Nov 25 16:06:55 bb229 kernel: [20038.066906] [<ffffffffa00f3293>] sky2_poll+0x7db/0x9f7 [sky2]
Nov 25 16:06:55 bb229 kernel: [20038.066912] [<ffffffff81085d6f>] ? arch_local_irq_save+0x15/0x1b
Nov 25 16:06:55 bb229 kernel: [20038.066916] [<ffffffff813d98a4>] net_rx_action+0xa9/0x1b8
Nov 25 16:06:55 bb229 kernel: [20038.066922] [<ffffffff8105d67b>] __do_softirq+0xc9/0x1b5
Nov 25 16:06:55 bb229 kernel: [20038.066926] [<ffffffff81014b35>] ? paravirt_read_tsc+0x9/0xd
Nov 25 16:06:55 bb229 kernel: [20038.066930] [<ffffffff81014fec>] ? sched_clock+0x9/0xd
Nov 25 16:06:55 bb229 kernel: [20038.066936] [<ffffffff814bfb6c>] call_softirq+0x1c/0x30
Nov 25 16:06:55 bb229 kernel: [20038.066940] [<ffffffff81010b45>] do_softirq+0x46/0x81
Nov 25 16:06:55 bb229 kernel: [20038.066944] [<ffffffff8105d943>] irq_exit+0x57/0xb1
Nov 25 16:06:55 bb229 kernel: [20038.066948] [<ffffffff814c044e>] do_IRQ+0x8e/0xa5
Nov 25 16:06:55 bb229 kernel: [20038.066953] [<ffffffff814b756e>] common_interrupt+0x6e/0x6e
Nov 25 16:06:57 bb229 kernel: [20038.066956] <EOI>
Yeah, I also see it still occasionally on F15. The driver's buggy... and just doesn't have good error recovery.
If you put the physical NIC (eth0, renamed to peth0) in a bond [or bridge] (which you then call eth0), you can monitor /var/log/messages [via script] and automatically ip link down/up the physical NIC to recover without losing network state.
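A minimal sketch of such a monitoring script, under a few assumptions: the physical interface is called peth0 as described above, the two message patterns are the ones seen in this bug, and the 30 second cooldown is an arbitrary value picked so that a burst of failures triggers only one reset. The function names are made up for illustration.

```python
import re
import subprocess
import time

# Messages seen in this bug that indicate the NIC has gone bad.
TRIGGER = re.compile(r"rx length error|hw csum failure")

def reset_commands(phys_dev):
    """Commands that bounce the physical interface."""
    return [
        ["ip", "link", "set", phys_dev, "down"],
        ["ip", "link", "set", phys_dev, "up"],
    ]

def watch(lines, phys_dev="peth0", cooldown=30.0,
          run=subprocess.check_call, clock=time.monotonic):
    """Bounce phys_dev when a trigger line appears, at most once per cooldown.

    Returns the number of resets performed.
    """
    last_reset = None
    resets = 0
    for line in lines:
        if not TRIGGER.search(line):
            continue
        now = clock()
        if last_reset is not None and now - last_reset < cooldown:
            continue  # still inside the cooldown window, ignore the burst
        for argv in reset_commands(phys_dev):
            run(argv)
        last_reset = now
        resets += 1
    return resets
```

One way to use it would be to pipe `tail -F /var/log/messages` into a script that calls `watch(sys.stdin)` as root. The `run` and `clock` parameters exist only so the logic can be exercised without touching a real interface.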
It seems that it is fixed in kernel-3.2.2-1.fc16.x86_64 (or earlier), or at least I haven't reproduced the issue yet. Downloads of big files have worked fine so far.
(In reply to comment #23)
> It seems that it is fixed in kernel-3.2.2-1.fc16.x86_64 (or earlier), or at
> least I haven't reproduced the issue yet. Downloads of big files have worked
> fine so far.
Thanks for letting us know. If you see it again, please reopen.
Any possibility that this would be backported to Fedora 15?
The issue re-appeared on kernel-3.3.0-4.fc16.x86_64. Now it has an "improved" behaviour. Not only is your network disconnected, accompanied by kernel stack traces from the sky2 module, but your computer also hangs. You have to press the "reset" button to recover your system. Reloading the sky2 kernel module is no longer possible.
Can we please re-open the issue and increase its priority?
Reopening per comment 26.
# Mass update to all open bugs.
Kernel 3.6.2-1.fc16 has just been pushed to updates.
This update is a significant rebase from the previous version.
Please retest with this kernel, and let us know if your problem has been fixed.
In the event that you have upgraded to a newer release and the bug you reported
is still present, please change the version field to the newest release you have
encountered the issue with. Before doing so, please ensure you are testing the
latest kernel update in that release and attach any new and relevant information
you may have gathered.
If you are not the original bug reporter and you still experience this bug,
please file a new report, as it is possible that you may be seeing a different issue.
(Please don't clone this bug, a fresh bug referencing this bug in the comment is sufficient).
I no longer have the laptop in question.
Created attachment 632552 [details]
Part of /var/log/messages
You might not have the hardware, but there are still people out there who do. We have suffered from this bug for years, and closing it with a "NOTABUG" resolution does not help at all.
Please find attached the part of the logs where the problem occurred. It was reproduced on F17 with the following kernel package:
Please also do us a favour and re-open the bug. Transfer it to Fedora 17 or Fedora 18.
Well, I understand this issue is not going to be fixed because the hardware is old and not popular. Can we at least request that the old behaviour be restored, where the issue only affected the network connection and did not cause a kernel panic that drops to a text console and loses all your work?
The "list_del corruption" messages in the logs from comment 30 do not give enough information to find out where the problem is. Please install the kernel-debug variant; it should print some additional warnings, which should better indicate where the problem is.
Created attachment 666733 [details]
Issue reproduced with debug kernel:
Dec 20 15:21:00 bb229 kernel: [ 374.788380] sky2 0000:04:00.0: eth0: rx error, status 0x7ffc0001 length 1300
Dec 20 15:21:24 bb229 kernel: [ 399.529927] sky2 0000:04:00.0: eth0: rx error, status 0x7ffc0001 length 220
Dec 20 15:21:28 bb229 kernel: [ 402.719641] sky2 0000:04:00.0: eth0: rx error, status 0x7ffc0001 length 724
issue reproduced with normal kernel:
Dec 20 17:03:20 bb229 kernel: [ 2573.521475] sky2 0000:04:00.0: eth0: rx error, status 0xe3c4e3c4 length 0
Dec 20 17:04:01 bb229 kernel: [ 2614.600284] eth0: hw csum failure
Dec 20 17:04:01 bb229 kernel: [ 2614.600294] Pid: 0, comm: swapper/0 Tainted: G C O 3.6.10-2.fc17.x86_64 #1
It is worth noting that the debug kernel didn't cause a single issue, whereas the normal kernel produced a kernel panic in a text console. How can the debug kernel work fine while the normal one has the issue?
Please ignore the many restart attempts; I couldn't get rid of the debug kernel. The keyboard didn't respond in the grub menu, and I couldn't remove a running kernel with yum. I had to edit grub.cfg to load the normal kernel first.
Rsyslog might not log all kernel messages. Please attach the dmesg output from the debug kernel when the issue happens (i.e. when 'dmesg | grep "rx error"' shows entries). Thanks.
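For anyone collecting the same data from a saved dmesg or /var/log/messages file, here is a minimal Python sketch that pulls the sky2 "rx error" entries out of a list of log lines. It assumes the line format seen in the excerpts in this report; the function name sky2_rx_errors is made up for illustration and is not part of any tool.

```python
import re

# Matches lines such as:
#   sky2 0000:04:00.0: eth0: rx error, status 0x7ffc0001 length 1300
# (format assumed from the log excerpts in this report)
RX_ERROR = re.compile(
    r"sky2 (?P<pci>[0-9a-f:.]+): (?P<iface>\w+): rx error, "
    r"status (?P<status>0x[0-9a-f]+) length (?P<length>\d+)"
)


def sky2_rx_errors(lines):
    """Return (pci address, interface, status, length) for each rx error line."""
    found = []
    for line in lines:
        m = RX_ERROR.search(line)
        if m:
            found.append(
                (m["pci"], m["iface"], int(m["status"], 16), int(m["length"]))
            )
    return found


sample = [
    "Dec 20 15:21:00 bb229 kernel: [ 374.788380] sky2 0000:04:00.0: eth0: rx error, status 0x7ffc0001 length 1300",
    "Dec 20 17:03:20 bb229 kernel: [ 2573.521475] sky2 0000:04:00.0: eth0: rx error, status 0xe3c4e3c4 length 0",
    "Dec 20 17:04:01 bb229 kernel: [ 2614.600284] eth0: hw csum failure",
]

for pci, iface, status, length in sky2_rx_errors(sample):
    print(f"{pci} {iface} status={status:#010x} length={length}")
```

To run it against a real log, pass the lines of the file instead of the sample list, e.g. sky2_rx_errors(open("/var/log/messages")).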
I also checked dmesg, and it contained nothing that wasn't already in /var/log/messages; it displayed only the lines printed in the messages log. Does the kernel need any other configuration to make it more chatty?
Still, I would like to see it.
Created attachment 667242 [details]
Part of /var/log/messages with debug kernel
As you wish!
Created attachment 667243 [details]
Dmesg with debug kernel
I think the crash does not happen on the debug kernel because the vbox modules are not loaded there. Please remove or blacklist the vbox modules on the standard kernel and check whether that prevents the crashes.
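Before comparing the two kernels, it may help to confirm whether any vbox modules are actually loaded. A minimal Python sketch, assuming the usual /proc/modules layout where the module name is the first field on each line; the function name and the sample text are made up for illustration.

```python
def loaded_vbox_modules(proc_modules_text):
    """Return the names of loaded modules whose name starts with 'vbox'.

    Expects the contents of /proc/modules, one module per line,
    with the module name as the first whitespace-separated field.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("vbox"):
            names.append(fields[0])
    return names


# Hypothetical excerpt, for illustration only (not taken from the attached logs).
sample = """\
vboxnetadp 25616 0 - Live 0xffffffffa0a5d000 (O)
vboxdrv 252189 2 vboxnetadp, Live 0xffffffffa09f2000 (O)
sky2 58468 0 - Live 0xffffffffa0123000
"""

print(loaded_vbox_modules(sample))
```

On a running system, pass open("/proc/modules").read() instead of the sample text; an empty list means no vbox module is loaded.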
Created attachment 674018 [details]
Trial patch - use 32 bit DMA only on the sky2 device. A similar patch helped with random hw problems on the skge driver; perhaps the troubles here are also caused by 64 bit DMA. Please test, I launched a kernel build with the patch here:
I apologise for the delay. I was confused about how exactly I should test this. I am doing the following at the moment:
root@bb229: ~ # yum install http://kojipkgs.fedoraproject.org//work/tasks/5145/4845145/kernel-3.6.11-4.sky2.fc17.x86_64.rpm
If I need to update more packages, please let me know.
Created attachment 675588 [details]
Part of Jan 9, 2013 /var/log/messages
Unfortunately, the issue occurred on the suggested kernel as well.
At least those "list_del corruption" related crashes are gone, so that must be a bug in the vbox modules.
The hw csum errors could possibly be related to some problems on the PCIe bus. Do any of the (mutually exclusive) kernel boot options below prevent them?
It would also be good to try:
You can also try the noapic kernel boot option.
This is nothing that could be easily fixed.
If the upstream maintainer is not able to fix the problem, we are even less able to do so - closing as won't fix.
Thank you for taking the time to look into this matter. I agree with you, but my comment 31 is still valid. May I ask if it is possible to restore the old kernel behaviour, where the issue didn't cause a kernel panic and we were able to recover with a simple "rmmod sky2; modprobe sky2"?
As I wrote before, I think the kernel panic is caused by the vbox modules. We do not see the panic when the modules are not loaded, i.e. in kernel-debug. Or am I wrong, and the panic also happens without vbox? Note: vbox modules are external and well-known crap, we do not support them.
It was also reproduced on the custom-built kernel kernel-3.6.11-4.sky2.fc17.x86_64.rpm, where the vbox modules were not loaded.
I am aware of vbox situation :)
Oh, I missed that, we have this in sky2 kernel logs:
Jan 9 16:27:02 bb229 kernel: [ 308.371941] sky2 0000:04:00.0: eth0: rx error, status 0xe3c4e3c4 length 0
Jan 9 16:27:28 bb229 kernel: [ 333.901176] thunderbird: Corrupted page table at address 7f01b9710b7c
Jan 9 16:27:28 bb229 kernel: [ 333.901234] PGD 9be76067 PUD 9e630067 PMD a884d067 PTE d71c433c690438cb
I think we have memory corruption caused by sky2 DMA as described here:
It looks like on some kernel builds, or with some modules loaded, the corruption can hit vital data, while on others (like kernel-debug) we corrupt unallocated memory.
I cannot help here. My only advice is to stop using sky2 on your machine.
Created attachment 678275 [details]
Part of Jan 14, 2013 /var/log/messages
For completeness, I have also included the /var/log/messages with the suggested kernel options (pcie_aspm=off, pcie_aspm=force, pci=nocsr,noacpi,nomsi and noapic). None of them worked.
Thanks, Stanislaw, for your efforts! I second your advice. I should upgrade my PC soon.
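When testing boot options like these one at a time, it is easy to lose track of which ones a given boot actually used. A minimal Python sketch that splits a kernel command line (as found in /proc/cmdline) into options; the function name and the sample command line are made up for illustration, and quoted values containing spaces are not handled.

```python
def boot_options(cmdline):
    """Split a kernel command line into {name: value}.

    Options of the form name=value map to their value;
    bare flags such as 'noapic' map to None.
    """
    opts = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        opts[name] = value if sep else None
    return opts


# Hypothetical command line, for illustration only.
sample = "BOOT_IMAGE=/vmlinuz-3.6.11-4.sky2.fc17.x86_64 ro quiet pcie_aspm=off noapic"

opts = boot_options(sample)
print("pcie_aspm =", opts.get("pcie_aspm"))
print("noapic set:", "noapic" in opts)
```

On a running system, pass open("/proc/cmdline").read() instead of the sample string, and note the result next to each log you attach.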
This seems similar: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1576816
I see this bug on Debian kernel 3.16.0-4-amd64 as well. Seems to be kernel or driver related. This is workstation class hardware I am running on...