I've run into a situation where the sky2 driver kind-of-locks-up. Kernel: 2.6.32.12-115.fc12.x86_64 Machine in question is a Macbook Pro 4,1 with "lspci -vvnn" of the nic: 0c:00.0 Ethernet controller [0200]: Marvell Technology Group Ltd. Marvell Yukon 88E8058 PCI-E Gigabit Ethernet Controller [11ab:436a] (rev 13) Subsystem: Marvell Technology Group Ltd. Device [11ab:00ba] Physical Slot: 5 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 256 bytes Interrupt: pin A routed to IRQ 29 Region 0: Memory at d7200000 (64-bit, non-prefetchable) [size=16K] Region 2: I/O ports at 5000 [size=256] Expansion ROM at dba00000 [disabled] [size=128K] Capabilities: [48] Power Management version 3 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME- Capabilities: [50] Vital Product Data Product Name: Marvell Yukon 88E8058 Gigabit Ethernet Controller Read-only fields: [PN] Part number: Yukon 88E8058 [EC] Engineering changes: Rev. 
1.3 [MN] Manufacture ID: 4d 61 72 76 65 6c 6c [SN] Serial number: AbCdEfGE8127B [CP] Extended capability: 01 10 cc 03 [RV] Reserved: checksum good, 9 byte(s) reserved Read/write fields: [RW] Read-write area: 121 byte(s) free End Capabilities: [5c] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 00000000fee0300c Data: 41a1 Capabilities: [e0] Express (v1) Legacy Endpoint, MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 2048 bytes DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <256ns, L1 unlimited ClockPM+ Surprise- LLActRep- BwNot- LnkCtl: ASPM L0s L1 Enabled; RCB 128 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- Capabilities: [100 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 1f, GenCap- CGenEn- ChkCap- ChkEn- Kernel driver in use: sky2 Kernel modules: sky2 Messages in /var/log/messages: (no messages for over half an hour) May 14 10:34:15 nike kernel: sky2 eth0: rx length error: status 0x5ea0100 length 3018 May 14 10:34:18 nike kernel: eth0: hw csum failure. 
May 14 10:34:18 nike kernel: Pid: 0, comm: swapper Not tainted 2.6.32.12-115.fc12.x86_64 #1 May 14 10:34:18 nike kernel: Call Trace: May 14 10:34:18 nike kernel: <IRQ> [<ffffffff813b457f>] netdev_rx_csum_fault+0x3b/0x3f May 14 10:34:18 nike kernel: [<ffffffff813ae4ba>] __skb_checksum_complete_head+0x51/0x65 May 14 10:34:18 nike kernel: [<ffffffff813ae4df>] __skb_checksum_complete+0x11/0x13 May 14 10:34:18 nike kernel: [<ffffffff81417f4d>] nf_ip_checksum+0xdd/0xe3 May 14 10:34:18 nike kernel: [<ffffffff813d7af3>] tcp_error+0x105/0x1a2 May 14 10:34:18 nike kernel: [<ffffffff813f5536>] ? tcp_rcv_established+0x555/0x6a6 May 14 10:34:18 nike kernel: [<ffffffff813d4b69>] nf_conntrack_in+0x17a/0x86e May 14 10:34:18 nike kernel: [<ffffffff81047209>] ? enqueue_entity+0x25b/0x267 May 14 10:34:18 nike kernel: [<ffffffff8104810a>] ? enqueue_task_fair+0x2a/0x6d May 14 10:34:18 nike kernel: [<ffffffff814185bb>] ipv4_conntrack_in+0x21/0x23 May 14 10:34:18 nike kernel: [<ffffffff813d1c22>] nf_iterate+0x46/0x89 May 14 10:34:18 nike kernel: [<ffffffff813e0fd6>] ? ip_rcv_finish+0x0/0x3ba May 14 10:34:18 nike kernel: [<ffffffff813d1ccf>] nf_hook_slow+0x6a/0xcb May 14 10:34:18 nike kernel: [<ffffffff813e0fd6>] ? ip_rcv_finish+0x0/0x3ba May 14 10:34:18 nike kernel: [<ffffffff813e160c>] ip_rcv+0x27c/0x2c9 May 14 10:34:18 nike kernel: [<ffffffff813b3ce7>] netif_receive_skb+0x3fc/0x421 May 14 10:34:18 nike kernel: [<ffffffff813b3e6e>] napi_skb_finish+0x29/0x3d May 14 10:34:18 nike kernel: [<ffffffff813b42c9>] napi_gro_receive+0x2f/0x34 May 14 10:34:18 nike kernel: [<ffffffffa014b002>] sky2_poll+0x813/0xa88 [sky2] May 14 10:34:18 nike kernel: [<ffffffff81017e73>] ? native_sched_clock+0x2d/0x5f May 14 10:34:18 nike kernel: [<ffffffff81017d45>] ? sched_clock+0x9/0xd May 14 10:34:18 nike kernel: [<ffffffff813b43ff>] net_rx_action+0xaf/0x1c9 May 14 10:34:18 nike kernel: [<ffffffff8105d998>] __do_softirq+0xe5/0x1a9 May 14 10:34:18 nike kernel: [<ffffffff810acfbd>] ? 
handle_IRQ_event+0x60/0x121 May 14 10:34:18 nike kernel: [<ffffffff81012e6c>] call_softirq+0x1c/0x30 May 14 10:34:18 nike kernel: [<ffffffff810143ea>] do_softirq+0x46/0x86 May 14 10:34:18 nike kernel: [<ffffffff8105d7d6>] irq_exit+0x3b/0x7d May 14 10:34:18 nike kernel: [<ffffffff8145ad3d>] do_IRQ+0xa5/0xbc May 14 10:34:18 nike kernel: [<ffffffff81012693>] ret_from_intr+0x0/0x11 May 14 10:34:18 nike kernel: <EOI> [<ffffffff81295665>] ? acpi_idle_enter_bm+0x27e/0x2b2 May 14 10:34:18 nike kernel: [<ffffffff8129565e>] ? acpi_idle_enter_bm+0x277/0x2b2 May 14 10:34:18 nike kernel: [<ffffffff8138949e>] ? cpuidle_idle_call+0x99/0xf3 May 14 10:34:18 nike kernel: [<ffffffff81010cdd>] ? cpu_idle+0xaa/0xe4 May 14 10:34:18 nike kernel: [<ffffffff8144ea71>] ? start_secondary+0x1f2/0x233 This was the first occurrence, it is immediately followed by further hw csum failures (with differing stack traces). In total 208 (logged) failures over the course of the next 10 minutes. Furthermore the network is mostly unusable (imagine network running at dial-up modem speeds with 90% packet loss) - but it is *not* _quite_ dead: occasionally stuff does seem to get through. Unplugging and replugging the ethernet cable didn't help, however unloading and reloading the sky2 module (and reconfiguring networking) fixes the problem [ip link down/up not tried]. This is the first time I'm reporting this, but I've now seen this somewhere on the order of 3 times (ie. very very rarely) over the last 3-4 months on various fc12 kernels from koji. It might be worth noting that (at least this time) this seems to have happened (pretty much?) immediately after reaching my laptop after not using it during the night (but it wasn't suspended or anything like that).
Also occurs on my Gigabyte 965P-S3 Core 2 Duo, on FC13 2.6.33.5-124.fc13.i686 First starts out as: eth0: hw csum failure. with trace: Call Trace: [<c076ebb6>] ? printk+0xf/0x11 [<c06ea443>] netdev_rx_csum_fault+0x29/0x30 [<c06e559a>] __skb_checksum_complete_head+0x42/0x57 [<c06e55ba>] __skb_checksum_complete+0xb/0xd [<c073dbe5>] nf_ip_checksum+0xcf/0xdb [<c0706bfa>] tcp_error+0xd8/0x16a [<c0706b22>] ? tcp_error+0x0/0x16a [<c07044af>] nf_conntrack_in+0xf4/0x688 [<c0679135>] ? usb_submit_urb+0x23f/0x2a0 [<f378f79c>] ? ftdi_submit_read_urb+0x3e/0x74 [ftdi_sio] [<f378faf3>] ? ftdi_read_bulk_callback+0x321/0x332 [ftdi_sio] [<c0679206>] ? usb_free_urb+0x11/0x13 [<c073e13c>] ipv4_conntrack_in+0x19/0x1e [<c0701f2b>] nf_iterate+0x2f/0x62 [<c070e440>] ? ip_rcv_finish+0x0/0x2bd [<c070203f>] nf_hook_slow+0x3b/0x91 [<c070e440>] ? ip_rcv_finish+0x0/0x2bd [<c070e8fe>] ip_rcv+0x201/0x234 [<c070e440>] ? ip_rcv_finish+0x0/0x2bd [<c06e9d69>] netif_receive_skb+0x3ae/0x3c9 [<c06e9e91>] napi_skb_finish+0x1e/0x34 [<c06ea207>] napi_gro_receive+0x20/0x24 [<f3722dc6>] sky2_poll+0x7c6/0x9d8 [sky2] [<c06ea301>] net_rx_action+0x92/0x188 [<c043c1d1>] __do_softirq+0xac/0x152 [<c043c2a8>] do_softirq+0x31/0x3c [<c043c3bc>] irq_exit+0x29/0x5c [<c040459d>] do_IRQ+0x86/0x9a [<c0403830>] common_interrupt+0x30/0x38 [<c04091f3>] ? mwait_idle+0x5c/0x67 [<c04024b8>] cpu_idle+0x91/0xad [<c075e39e>] rest_init+0x62/0x64 [<c09b78f1>] start_kernel+0x346/0x34b [<c09b7099>] i386_start_kernel+0x99/0xa0 then quickly degrades until we get other bad things happening: BUG: soft lockup - CPU#1 stuck for 61s!
*** Bug 569779 has been marked as a duplicate of this bug. ***
Just happened again with

$ uname -a
Linux nike 2.6.33.5-112.fc13.x86_64 #1 SMP Thu May 27 02:28:31 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

Fixed with "ip link set eth0 down; ip link set eth0 up; dhclient -x; dhclient eth0".

Jul 10 19:41:26 nike kernel: sky2 eth0: rx length error: status 0x5ea0100 length 3018
Jul 10 19:41:52 nike kernel: eth0: hw csum failure.
...1547 additional identical hw csum failure messages across 34 hours (I was away from the machine; when I came back I found the network unusable and fixed it with the above).
Jul 12 05:47:35 nike kernel: eth0: hw csum failure.
Still happening with

$ uname -a
Linux nike 2.6.34.3-35.rc1.fc13.x86_64 #1 SMP Sat Aug 7 16:32:24 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

It actually appears to be happening more often of late (I think 2.6.32 was better than 2.6.33 or 2.6.34). Once it starts happening your network is dead and you get error messages (with stack dump) about every 10 seconds (for example 700 'eth0: hw csum failure' messages over the course of an hour).

It looks like the error recovery of the sky2 driver in this case (nike kernel: sky2 0000:0c:00.0: eth0: rx length error: status 0x5d60100 length 2982) could use some love.
*** Bug 632501 has been marked as a duplicate of this bug. ***
I see you reported this to netdev but there was no fix: http://marc.info/?t=128157418200003&r=1&w=4
Yeah, I think it's as 'simple' as adding some sort of driver reset triggered by the 'rx len error', but that's much simpler for someone who actually understands the driver code. It's obviously a recoverable error, since 'ip link set eth0 down && ip link set eth0 up' makes the problem go away (until the next time it happens).

I'm not 100% sure yet, but I don't think turning off all acceleration via 'ethtool -K eth0 ...' prevents the problem from happening (although it may decrease the probability of it triggering; hard to say, it's very sporadic...).

My last occurrences (I'm reasonably sure the Sep 8 one was with all accel off):

cat /var/log/messages-* /var/log/messages | egrep 'rx len'
Aug 16 21:26:25 eth0: rx length error: status 0x5e50100 length 3013
Aug 17 11:16:18 eth0: rx length error: status 0x5ea0100 length 3018
Aug 18 02:14:27 eth0: rx length error: status 0x5ea0100 length 3018
Aug 24 21:34:37 eth0: rx length error: status 0x5ea0100 length 3018
Aug 24 22:35:00 eth0: rx length error: status 0x5e50100 length 3013
Aug 25 09:42:19 eth0: rx length error: status 0x5ea0100 length 3018
Sep 8 03:57:04 eth0: rx length error: status 0x5e50100 length 3013

I don't fully follow the corruption comments on the netdev thread and their true implications (and why they don't want to fix it?).
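For anyone wanting to compare these log entries mechanically, here is a minimal Python sketch that pulls the status word and length out of each 'rx length error' line. It rests on an assumption that is not established anywhere in this report: that bits 16-30 of the status word carry the frame length as seen by the MAC and that bit 8 is the "receive OK" flag, as I understand the sky2 status layout. The function and field names are made up for illustration.

```python
import re

# Matches the driver message as it appears in the logs quoted in this report.
RX_LEN_RE = re.compile(
    r"rx length error: status (0x[0-9a-fA-F]+) length (\d+)"
)

# ASSUMPTION: status-word layout as understood from the sky2 driver source.
# Verify against sky2.h before relying on these constants.
LEN_SHIFT = 16
LEN_MASK = 0x7FFF
RX_OK_BIT = 1 << 8


def parse_rx_length_error(line):
    """Return a dict for an 'rx length error' log line, or None if no match."""
    m = RX_LEN_RE.search(line)
    if m is None:
        return None
    status = int(m.group(1), 16)
    return {
        "status": status,
        # Length embedded in the status word (assumed layout, see above).
        "status_len": (status >> LEN_SHIFT) & LEN_MASK,
        "rx_ok": bool(status & RX_OK_BIT),
        # Length printed by the driver at the end of the message.
        "reported_len": int(m.group(2)),
    }


def summarize(lines):
    """Parse every matching line and drop the ones that do not match."""
    return [r for r in map(parse_rx_length_error, lines) if r is not None]
```

Under that assumed layout, status 0x5ea0100 would decode to a 1514-byte frame flagged as OK, while the logged length is 3018; whether that interpretation is right is for someone who knows the driver to confirm.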
This is a very irritating problem. It was fixed in Fedora 11 and returned in Fedora 12, as I reported in bug 514693. Since then, we cannot download a big file. Our Ethernet connection works fine as long as no big download is involved.
I haven't seen this error in a long time:

# cat /var/log/messages-* /var/log/messages | egrep 'rx len'
Oct 12 16:10:22 nike kernel: sky2 0000:0c:00.0: eth0: rx length error: status 0x3c0300 length 1548

I wonder if this means it got fixed somehow in recent kernels, or if I've just been lucky?

# cat /var/log/messages-* /var/log/messages | egrep 'Linux version'
Oct 19 11:22:22 nike kernel: Linux version 2.6.34.7-61.fc13.x86_64 (mockbuild.fedoraproject.org) (gcc version 4.4.4 20100630 (Red Hat 4.4.4-10) (GCC) ) #1 SMP Tue Oct 19 04:06:30 UTC 2010
Nov 7 19:44:08 nike kernel: [ 0.000000] Linux version 2.6.35.6-48.fc14.x86_64 (mockbuild.fedoraproject.org) (gcc version 4.5.1 20100924 (Red Hat 4.5.1-4) (GCC) ) #1 SMP Fri Oct 22 15:36:08 UTC 2010

So this suggests that just perhaps 2.6.34.7-61.fc13.x86_64 is good (or much better...)?

# last | egrep boot | tac
reboot system boot 2.6.34.7-59.fc13 Fri Oct 1 11:12 - 11:19 (18+00:07)
reboot system boot 2.6.34.7-61.fc13 Tue Oct 19 11:22 - 19:41 (19+09:19)
reboot system boot 2.6.35.6-48.fc14 Sun Nov 7 19:44 - 19:56 (00:12)

Based on the kernel changelog, it seems doubtful that -61 was fixed compared to -59, though...:

* Mon Oct 18 2010 Kyle McMartin <kyle> 2.6.34.7-61
- Add Ricoh e822 support. (rhbz#596475) Thanks to sgruszka@ for sending the patches in.
* Mon Oct 18 2010 Kyle McMartin <kyle> 2.6.34.7-60
- Quirk to disable DMAR with Ricoh card reader/firewire. (rhbz#605888)
* Mon Oct 18 2010 Kyle McMartin <kyle>
- Two networking fixes (skge, r8169) from sgruska. (rhbz#447489,629158)
* Thu Oct 14 2010 Neil Horman <nhorman>
- Fix rcu warning in twsock_net (bz 642905)
* Wed Oct 06 2010 Neil Horman <nhorman>
- Fix WARN_ON when you try to create an exiting bond in bond_masters
* Thu Sep 30 2010 Chuck Ebbert <cebbert>
- CVE-2010-3432: sctp-do-not-reset-the-packet-during-sctp_packet_config.patch
* Thu Sep 30 2010 Ben Skeggs <bskeggs> 2.6.34.7-59

I think this means I've just been very lucky...
You're extremely lucky :) The only thing that has changed is that they just hid the error. Try searching the logs for "hw csum failure" instead! It's still out there. Also try downloading the Fedora DVD and enjoy. It reproduced for me after about 290 MB.
$ uname -a
Linux nike 2.6.35.6-48.fc14.x86_64 #1 SMP Fri Oct 22 15:36:08 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

$ wget -O /dev/null http://mirrors.kernel.org/fedora/releases/14/Fedora/x86_64/iso/Fedora-14-x86_64-DVD.iso
--2010-11-08 02:31:16-- http://mirrors.kernel.org/fedora/releases/14/Fedora/x86_64/iso/Fedora-14-x86_64-DVD.iso
Resolving mirrors.kernel.org... 204.152.191.39
Connecting to mirrors.kernel.org|204.152.191.39|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3520802816 (3.3G) [application/x-iso9660-image]
Saving to: `/dev/null'

100%[====================================>] 3,520,802,816 22.1M/s in 2m 28s

2010-11-08 02:33:44 (22.6 MB/s) - `/dev/null' saved [3520802816/3520802816]

And I downloaded it twice, so after 6.6 GB transferred, still no errors...
The bug is opened against Fedora 13 and not 14. I'm using Fedora 13 with the latest kernel:

$ uname -a
Linux bb229 2.6.34.7-61.fc13.x86_64 #1 SMP Tue Oct 19 04:06:30 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

If the bug is solved in Fedora 14, that would be awesome news. I think I'll have to schedule the upgrade earlier.
I upgraded barely 2 days ago, so it is still _way_ too early to say whether it's truly stable. Note that I saw no issues from Oct 19th till Nov 6th on the F13 kernel (2.6.34.7-61.fc13.x86_64), and I'm sure I did way more downloading than a mere couple GB's during that time (including downloading all of F14 during a yum upgrade). Basically all I'm saying is that of late the probability of this triggering has gone drastically down for me. I have no idea what the actual cause of this apparent goodness is. Weather? It looks like our bugs must be slightly different in some way... [oh, wait, I just realized, I think I've turned off all hw optimizations on the nic (via 'ethtool -K' before link up), maybe that's the fix, I'll retry the downloads tomorrow with tso and the like turned on]
I copied (downloaded via scp -c arcfour) around 100GB of data with full acceleration turned on and saw no issues with the F14 kernel. But, then, my experience has always been that it has a tendency to break during periods of idleness, and not load... I'll leave acceleration turned on and we'll see if it breaks by itself.
Maciej, thank you very much for your detailed information. I've upgraded my system to Fedora 14 as well:

# uname -a
Linux bb229 2.6.35.6-48.fc14.x86_64 #1 SMP Fri Oct 22 15:36:08 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

and it seems to work fine. I've downloaded the Fedora DVD ISO over the LAN using scp, wget and a browser with no problems. OK, Firefox freezes every time the download reaches 2 GB, but I don't think that can be attributed to the Ethernet controller; it looks like a Firefox bug. Chrome downloaded the file without issues and much faster.

My problem was only ever observed while downloading a big file over the Internet; I have never hit it on yum downloads, for example. I just hope the bug won't re-appear as it has in the past. Fingers crossed :)
Famous last words:

Nov 10 02:13:38 bb229 kernel: [45909.915965] eth0: hw csum failure.
Nov 10 02:13:38 bb229 kernel: [45909.915974] Pid: 0, comm: swapper Not tainted 2.6.35.6-48.fc14.x86_64 #1
Nov 10 02:13:38 bb229 kernel: [45909.915977] Call Trace:
Nov 10 02:13:38 bb229 kernel: [45909.915980] <IRQ> [<ffffffff813be889>] netdev_rx_csum_fault+0x3b/0x40
Nov 10 02:13:38 bb229 kernel: [45909.915997] [<ffffffff813b9057>] __skb_checksum_complete_head+0x51/0x65
Nov 10 02:13:38 bb229 kernel: [45909.916002] [<ffffffff813b907c>] __skb_checksum_complete+0x11/0x13
Nov 10 02:13:38 bb229 kernel: [45909.916008] [<ffffffff814281aa>] nf_ip_checksum+0xce/0xd4
Nov 10 02:13:38 bb229 kernel: [45909.916015] [<ffffffff813e7c73>] udp_error+0x142/0x1a0
Nov 10 02:13:38 bb229 kernel: [45909.916022] [<ffffffff81047a42>] ? select_idle_sibling+0x3a/0xee
Nov 10 02:13:38 bb229 kernel: [45909.916028] [<ffffffff813e314f>] nf_conntrack_in+0x14d/0x8b4
Nov 10 02:13:38 bb229 kernel: [45909.916036] [<ffffffff81105c71>] ? __raw_local_irq_save+0x1d/0x23
Nov 10 02:13:38 bb229 kernel: [45909.916042] [<ffffffff81108682>] ? kmem_cache_free+0x7a/0xb9
...

That problem was caused by the following crontab entry:

0 2 10 11 * /home/panos/bin/runwget 'ftp://ftp.ntua.gr/pub/linux/fedora/linux/releases/14/Fedora/i386/iso/Fedora-14-i386-DVD.iso'

> dir -h /opt/tmp/Fedora-14-i386-DVD.iso
-rw-r--r-- 1 panos users 391M 2010-11-10 02:13 /opt/tmp/Fedora-14-i386-DVD.iso

Only 391 MB had been downloaded when the problem appeared, 13 minutes in. The problem has indeed become less frequent, but it is definitely still there.
Wee... hit the bug again in the current F14 kernel. This time I had all acceleration enabled. Will try running with all rx acceleration disabled, but tx acceleration turned on, ie:

sudo ip link set eth0 down
sudo ethtool -K eth0 rx off tx on sg on tso on gso on gro off lro off rxhash on
sudo ip link set eth0 up

I am beginning to think that this doesn't happen without acceleration (or is much much rarer). Especially fishy is that this appears to happen with a length > ethernet MTU (ie. rx length error: status 0x... length ~3000). It's almost like an lro issue, except I do not believe lro to be available (I can't turn it on with ethtool)?

[252362.248169] sky2 0000:0c:00.0: eth0: rx length error: status 0x5b50100 length 2917
[252365.261951] eth0: hw csum failure.
[252365.261964] Pid: 0, comm: swapper Not tainted 2.6.35.6-48.fc14.x86_64 #1
[252365.261970] Call Trace:
[252365.261975] <IRQ> [<ffffffff813be889>] netdev_rx_csum_fault+0x3b/0x40
[252365.262000] [<ffffffff813b9057>] __skb_checksum_complete_head+0x51/0x65
[252365.262009] [<ffffffff813b907c>] __skb_checksum_complete+0x11/0x13
[252365.262019] [<ffffffff814281aa>] nf_ip_checksum+0xce/0xd4
[252365.262029] [<ffffffff813e7c73>] udp_error+0x142/0x1a0
[252365.262041] [<ffffffff81105c71>] ? __raw_local_irq_save+0x1d/0x23
[252365.262050] [<ffffffff81108682>] ? kmem_cache_free+0x7a/0xb9
[252365.262060] [<ffffffff813e314f>] nf_conntrack_in+0x14d/0x8b4
[252365.262071] [<ffffffff81047a42>] ? select_idle_sibling+0x3a/0xee
[252365.262080] [<ffffffff8142884d>] ipv4_conntrack_in+0x21/0x23
[252365.262088] [<ffffffff813e0529>] nf_iterate+0x46/0x89
[252365.262098] [<ffffffff813efe71>] ? ip_rcv_finish+0x0/0x34a
[252365.262106] [<ffffffff813e05d6>] nf_hook_slow+0x6a/0xd0
[252365.262114] [<ffffffff813efe71>] ? ip_rcv_finish+0x0/0x34a
[252365.262123] [<ffffffff813efe71>] ? ip_rcv_finish+0x0/0x34a
[252365.262132] [<ffffffff813f042b>] NF_HOOK.clone.8+0x46/0x58
[252365.262142] [<ffffffff813b6965>] ? __netdev_alloc_skb+0x34/0x51
[252365.262150] [<ffffffff813f07c5>] ip_rcv+0x21e/0x24d
[252365.262159] [<ffffffff813bda4c>] __netif_receive_skb+0x3ed/0x412
[252365.262168] [<ffffffff813be15d>] netif_receive_skb+0x57/0x5e
[252365.262177] [<ffffffff813be676>] napi_skb_finish+0x29/0x41
[252365.262186] [<ffffffff813be6bd>] napi_gro_receive+0x2f/0x34
[252365.262223] [<ffffffffa02011e2>] sky2_poll+0xa3b/0xc8c [sky2]
[252365.262234] [<ffffffff8101057c>] ? native_sched_clock+0x35/0x37
[252365.262243] [<ffffffff81010587>] ? sched_clock+0x9/0xd
[252365.262253] [<ffffffff8106b129>] ? sched_clock_local+0x12/0x75
[252365.262264] [<ffffffff8146912f>] ? _raw_spin_unlock_irqrestore+0x17/0x19
[252365.262273] [<ffffffff81108682>] ? kmem_cache_free+0x7a/0xb9
[252365.262283] [<ffffffff813bf420>] net_rx_action+0xac/0x1bb
[252365.262291] [<ffffffff813bcd1f>] ? __napi_schedule+0x50/0x57
[252365.262302] [<ffffffff81053839>] __do_softirq+0xdd/0x199
[252365.262317] [<ffffffffa01fd627>] ? sky2_intr+0x35/0x3c [sky2]
[252365.262327] [<ffffffff8100abdc>] call_softirq+0x1c/0x30
[252365.262335] [<ffffffff8100c338>] do_softirq+0x46/0x82
[252365.262344] [<ffffffff81053999>] irq_exit+0x3b/0x7d
[252365.262352] [<ffffffff8146f075>] do_IRQ+0x9d/0xb4
[252365.262361] [<ffffffff81469593>] ret_from_intr+0x0/0x11
[252365.262366] <EOI> [<ffffffff8128f39c>] ? raw_local_irq_enable+0x10/0x12
[252365.262385] [<ffffffff8106b3a0>] ? sched_clock_idle_wakeup_event+0x17/0x1b
[252365.262394] [<ffffffff81290208>] acpi_idle_enter_bm+0x228/0x260
[252365.262405] [<ffffffff813939e1>] cpuidle_idle_call+0x8b/0xe9
[252365.262416] [<ffffffff81008325>] cpu_idle+0xaa/0xcc
[252365.262425] [<ffffffff81450e46>] rest_init+0x8a/0x8c
[252365.262436] [<ffffffff81ba1c49>] start_kernel+0x40b/0x416
[252365.262446] [<ffffffff81ba12c6>] x86_64_start_reservations+0xb1/0xb5
[252365.262455] [<ffffffff81ba13c2>] x86_64_start_kernel+0xf8/0x107
[252385.388560] eth0: hw csum failure.
[252385.388573] Pid: 0, comm: swapper Not tainted 2.6.35.6-48.fc14.x86_64 #1
[252395.271567] eth0: hw csum failure.
[252395.271575] Pid: 0, comm: swapper Not tainted 2.6.35.6-48.fc14.x86_64 #1
[252415.422920] eth0: hw csum failure.
[252415.422933] Pid: 3414, comm: chrome Not tainted 2.6.35.6-48.fc14.x86_64 #1
[252425.279133] eth0: hw csum failure.
[252425.279146] Pid: 3388, comm: chrome Not tainted 2.6.35.6-48.fc14.x86_64 #1
Perhaps this needs filing upstream?
$ uname -a
Linux nike 2.6.35.11-83.fc14.x86_64 #1 SMP Mon Feb 7 07:06:44 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

Mar 5 16:20:02 nike kernel: [263628.531087] sky2 0000:0c:00.0: eth0: rx length error: status 0x5860100 length 2822

$ ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: off
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off
ntuple-filters: off
receive-hashing: on

I think this means that turning off rx accel makes the bug _much_ less likely to occur, but it can still happen. It is not at all clear to me how length > MTU (1500) happens with all rx offload turned off.

I've bugged upstream (via email to netdev) and was told it's not a bug in the driver, but rather in the BIOS/firmware or something. Even though it clearly can be fixed in the driver by handling the error with an appropriate reset, they're not willing to listen; they consider it not their problem.
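When correlating occurrences with offload settings, it can help to record the 'ethtool -k' state alongside each log entry. A minimal Python sketch that turns that output into a dictionary follows. The function names and the choice of which settings count as "rx offloads" are illustrative assumptions, not something this report defines.

```python
# Settings treated as receive-side offloads in this sketch (an assumption).
RX_OFFLOADS = (
    "rx-checksumming",
    "generic-receive-offload",
    "large-receive-offload",
)


def parse_offloads(text):
    """Parse 'name: on|off' lines from 'ethtool -k' output into booleans."""
    offloads = {}
    for raw in text.splitlines():
        name, _, value = raw.strip().partition(":")
        words = value.split()
        # Skip the header line and anything that is not a plain on/off value.
        # Only the first word is used, so trailing notes are ignored.
        if not words or words[0] not in ("on", "off"):
            continue
        offloads[name.strip()] = words[0] == "on"
    return offloads


def rx_offloads_enabled(offloads):
    """Return the names of receive-side offloads that are still enabled."""
    return [name for name in RX_OFFLOADS if offloads.get(name)]
```

For the output quoted above, `rx_offloads_enabled` returns an empty list, which matches the statement that the error occurred with rx offload turned off.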
I see the same on F15: # uname -a Linux redwood.eagercon.com 2.6.41.1-1.fc15.x86_64 #1 SMP Fri Nov 11 21:36:28 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux [ 242.545001] p35p1: hw csum failure. [ 242.545006] Pid: 0, comm: kworker/0:0 Tainted: P 2.6.41.1-1.fc15.x86_64 #1 [ 242.545009] Call Trace: [ 242.545011] <IRQ> [<ffffffff813ec311>] netdev_rx_csum_fault+0x38/0x3c [ 242.545021] [<ffffffff813e625f>] __skb_checksum_complete_head+0x51/0x65 [ 242.545026] [<ffffffff813e6284>] __skb_checksum_complete+0x11/0x13 [ 242.545037] [<ffffffffa0f3fb14>] br_multicast_rcv+0x885/0xd52 [bridge] [ 242.545042] [<ffffffff81117195>] ? virt_to_head_page+0xe/0x31 [ 242.545052] [<ffffffffa0f3d552>] ? br_nf_pre_routing+0x2f/0x3df [bridge] [ 242.545060] [<ffffffffa0f37827>] ? br_multicast_flood+0x11d/0x12c [bridge] [ 242.545066] [<ffffffff8140fc32>] ? nf_iterate+0x48/0x7d [ 242.545074] [<ffffffffa0f385e0>] ? NF_HOOK.constprop.0+0x58/0x58 [bridge] [ 242.545082] [<ffffffffa0f385e0>] ? NF_HOOK.constprop.0+0x58/0x58 [bridge] [ 242.545087] [<ffffffff8140fcd9>] ? nf_hook_slow+0x72/0x115 [ 242.545096] [<ffffffffa0f38672>] br_handle_frame_finish+0x92/0x20f [bridge] [ 242.545104] [<ffffffffa0f385e0>] ? NF_HOOK.constprop.0+0x58/0x58 [bridge] [ 242.545113] [<ffffffffa0f385d9>] NF_HOOK.constprop.0+0x51/0x58 [bridge] [ 242.545118] [<ffffffff8149c9ac>] ? _raw_spin_unlock_irqrestore+0x17/0x19 [ 242.545126] [<ffffffffa0f38995>] br_handle_frame+0x1a6/0x1c1 [bridge] [ 242.545135] [<ffffffffa0f387ef>] ? br_handle_frame_finish+0x20f/0x20f [bridge] [ 242.545139] [<ffffffff813ea31e>] __netif_receive_skb+0x2c5/0x417 [ 242.545146] [<ffffffff813ed5ba>] netif_receive_skb+0x6c/0x73 [ 242.545152] [<ffffffff813ed650>] napi_skb_finish+0x27/0x3f [ 242.545156] [<ffffffff813edaa7>] napi_gro_receive+0x2f/0x34 [ 242.545165] [<ffffffffa0022293>] sky2_poll+0x7db/0x9f7 [sky2] [ 242.545170] [<ffffffff813edbd3>] net_rx_action+0xa9/0x1b8 [ 242.545175] [<ffffffff8105d683>] __do_softirq+0xc9/0x1b5 [ 242.545179] [<ffffffff81014b35>] ? 
paravirt_read_tsc+0x9/0xd [ 242.545183] [<ffffffff81014fec>] ? sched_clock+0x9/0xd [ 242.545188] [<ffffffff814a536c>] call_softirq+0x1c/0x30 [ 242.545192] [<ffffffff81010b45>] do_softirq+0x46/0x81 [ 242.545197] [<ffffffff8105d94b>] irq_exit+0x57/0xb1 [ 242.545201] [<ffffffff814a5c4e>] do_IRQ+0x8e/0xa5 [ 242.545205] [<ffffffff8149cd6e>] common_interrupt+0x6e/0x6e [ 242.545208] <EOI> [<ffffffff81015d21>] ? mwait_idle+0x87/0xb4 [ 242.545216] [<ffffffff81015d14>] ? mwait_idle+0x7a/0xb4 [ 242.545220] [<ffffffff8100e2ed>] cpu_idle+0xae/0xe8 [ 242.545225] [<ffffffff8148baa3>] start_secondary+0x23f/0x241 Not related to MTU > 1500: # ifconfig p35p1 p35p1 Link encap:Ethernet HWaddr 00:1D:60:44:09:8C inet6 addr: fe80::21d:60ff:fe44:98c/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:94054 errors:0 dropped:0 overruns:0 frame:0 TX packets:20449 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:138109575 (131.7 MiB) TX bytes:1919020 (1.8 MiB) Interrupt:19
I don't want to disappoint you, but the same failures occurs in Fedora 16 on the latest kernel kernel-3.1.2-1.fc16.x86_64 as well. Nothing has changed. Can you please move the bug against F16? Nov 25 16:06:55 bb229 kernel: [20038.066696] eth0: hw csum failure. Nov 25 16:06:55 bb229 kernel: [20038.066706] Pid: 8864, comm: firefox Tainted: G C 3.1.2-1.fc16.x86_64 #1 Nov 25 16:06:55 bb229 kernel: [20038.066711] Call Trace: Nov 25 16:06:55 bb229 kernel: [20038.066714] <IRQ> [<ffffffff813d7fde>] netdev_rx_csum_fault+0x38/0x3c Nov 25 16:06:55 bb229 kernel: [20038.066734] [<ffffffff813d1f2b>] __skb_checksum_complete_head+0x51/0x65 Nov 25 16:06:55 bb229 kernel: [20038.066741] [<ffffffff813d1f50>] __skb_checksum_complete+0x11/0x13 Nov 25 16:06:55 bb229 kernel: [20038.066748] [<ffffffff8143d654>] nf_ip_checksum+0xcd/0xd3 Nov 25 16:06:55 bb229 kernel: [20038.066771] [<ffffffffa0468360>] udp_error+0x137/0x195 [nf_conntrack] Nov 25 16:06:55 bb229 kernel: [20038.066780] [<ffffffff8110cb93>] ? dma_pool_alloc+0x22f/0x244 Nov 25 16:06:55 bb229 kernel: [20038.066797] [<ffffffffa04639ff>] nf_conntrack_in+0x174/0x7dc [nf_conntrack] Nov 25 16:06:55 bb229 kernel: [20038.066807] [<ffffffff813539f2>] ? uhci_alloc_td+0x1f/0x4d Nov 25 16:06:55 bb229 kernel: [20038.066813] [<ffffffff81353de9>] ? uhci_submit_common+0x2a7/0x341 Nov 25 16:06:55 bb229 kernel: [20038.066827] [<ffffffffa0488569>] ipv4_conntrack_in+0x21/0x23 [nf_conntrack_ipv4] Nov 25 16:06:55 bb229 kernel: [20038.066837] [<ffffffff813fb903>] nf_iterate+0x48/0x7d Nov 25 16:06:55 bb229 kernel: [20038.066844] [<ffffffff81403794>] ? inet_del_protocol+0x35/0x35 Nov 25 16:06:55 bb229 kernel: [20038.066850] [<ffffffff81403794>] ? inet_del_protocol+0x35/0x35 Nov 25 16:06:55 bb229 kernel: [20038.066857] [<ffffffff813fb9aa>] nf_hook_slow+0x72/0x114 Nov 25 16:06:55 bb229 kernel: [20038.066862] [<ffffffff81403794>] ? inet_del_protocol+0x35/0x35 Nov 25 16:06:55 bb229 kernel: [20038.066869] [<ffffffff81403794>] ? 
inet_del_protocol+0x35/0x35 Nov 25 16:06:55 bb229 kernel: [20038.066874] [<ffffffff81403aec>] NF_HOOK.constprop.3+0x46/0x58 Nov 25 16:06:55 bb229 kernel: [20038.066878] [<ffffffff81404089>] ip_rcv+0x239/0x268 Nov 25 16:06:55 bb229 kernel: [20038.066883] [<ffffffff813d60f2>] __netif_receive_skb+0x3cd/0x418 Nov 25 16:06:55 bb229 kernel: [20038.066887] [<ffffffff813d928a>] netif_receive_skb+0x6c/0x73 Nov 25 16:06:55 bb229 kernel: [20038.066891] [<ffffffff813d9320>] napi_skb_finish+0x27/0x3f Nov 25 16:06:55 bb229 kernel: [20038.066894] [<ffffffff813d9778>] napi_gro_receive+0x2f/0x34 Nov 25 16:06:55 bb229 kernel: [20038.066906] [<ffffffffa00f3293>] sky2_poll+0x7db/0x9f7 [sky2] Nov 25 16:06:55 bb229 kernel: [20038.066912] [<ffffffff81085d6f>] ? arch_local_irq_save+0x15/0x1b Nov 25 16:06:55 bb229 kernel: [20038.066916] [<ffffffff813d98a4>] net_rx_action+0xa9/0x1b8 Nov 25 16:06:55 bb229 kernel: [20038.066922] [<ffffffff8105d67b>] __do_softirq+0xc9/0x1b5 Nov 25 16:06:55 bb229 kernel: [20038.066926] [<ffffffff81014b35>] ? paravirt_read_tsc+0x9/0xd Nov 25 16:06:55 bb229 kernel: [20038.066930] [<ffffffff81014fec>] ? sched_clock+0x9/0xd Nov 25 16:06:55 bb229 kernel: [20038.066936] [<ffffffff814bfb6c>] call_softirq+0x1c/0x30 Nov 25 16:06:55 bb229 kernel: [20038.066940] [<ffffffff81010b45>] do_softirq+0x46/0x81 Nov 25 16:06:55 bb229 kernel: [20038.066944] [<ffffffff8105d943>] irq_exit+0x57/0xb1 Nov 25 16:06:55 bb229 kernel: [20038.066948] [<ffffffff814c044e>] do_IRQ+0x8e/0xa5 Nov 25 16:06:55 bb229 kernel: [20038.066953] [<ffffffff814b756e>] common_interrupt+0x6e/0x6e Nov 25 16:06:57 bb229 kernel: [20038.066956] <EOI>
Yeah, I also see it still occasionally on F15. The driver's buggy... and just doesn't have good error recovery. If you put the physical nic (eth0, renamed to peth0) in a bond [or bridge] (which you then call eth0), you can monitor /var/log/messages [via script] and automatically ip link down/up the physical nic to recover without losing network state.
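The monitoring workaround described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the physical interface is assumed to be called peth0 as in the comment, the trigger strings are the two messages quoted throughout this report, and the function names and the 60-second cooldown are made up. The cooldown exists because the csum failure message repeats every few seconds once the problem starts, and one link bounce per burst should be enough.

```python
import subprocess
import time

# Log fragments that indicate the NIC has entered the bad state.
TRIGGERS = ("rx length error", "hw csum failure")


def needs_reset(line):
    """True if a log line contains one of the known failure messages."""
    return any(t in line for t in TRIGGERS)


def recovery_commands(iface):
    """The same down/up sequence used manually elsewhere in this report."""
    return [
        ["ip", "link", "set", iface, "down"],
        ["ip", "link", "set", iface, "up"],
    ]


def watch(lines, iface="peth0", run=subprocess.check_call,
          cooldown=60.0, clock=time.monotonic):
    """Bounce iface when a trigger appears; return the number of resets.

    `lines` is any iterable of log lines. `run` and `clock` are injectable
    so the logic can be exercised without root or real waiting.
    """
    last_reset = None
    resets = 0
    for line in lines:
        if not needs_reset(line):
            continue
        now = clock()
        if last_reset is not None and now - last_reset < cooldown:
            continue  # still inside the same burst of messages
        for cmd in recovery_commands(iface):
            run(cmd)
        last_reset = now
        resets += 1
    return resets
```

One way to use it would be to pipe `tail -F /var/log/messages` into a small script that calls `watch(sys.stdin)` as root. Because the bond or bridge keeps the address, bouncing only the physical nic should not require re-running dhclient.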
It seems to be fixed in kernel-3.2.2-1.fc16.x86_64 (or earlier), or at least I haven't been able to reproduce the issue yet. Downloads of big files have worked fine so far.
(In reply to comment #23)
> It seems that it is fixed in kernel-3.2.2-1.fc16.x86_64 (or earlier) or at
> least I haven't reproduced it yet that issue. Download of big files have worked
> fine so far.

Thanks for letting us know. If you see it again, please reopen.
Any possibility that this would be backported to Fedora 15?
The issue re-appeared on kernel-3.3.0-4.fc16.x86_64. Now it has an "improved" behaviour: not only is your network disconnected, accompanied by kernel stack traces from the sky2 module, but your computer also hangs. You have to press the "reset" button to recover your system. Reloading the sky2 kernel module is not possible anymore. Can we please re-open the issue and increase its priority?
Reopening per comment 26.
# Mass update to all open bugs. Kernel 3.6.2-1.fc16 has just been pushed to updates. This update is a significant rebase from the previous version. Please retest with this kernel, and let us know if your problem has been fixed. In the event that you have upgraded to a newer release and the bug you reported is still present, please change the version field to the newest release you have encountered the issue with. Before doing so, please ensure you are testing the latest kernel update in that release and attach any new and relevant information you may have gathered. If you are not the original bug reporter and you still experience this bug, please file a new report, as it is possible that you may be seeing a different problem. (Please don't clone this bug, a fresh bug referencing this bug in the comment is sufficient).
I no longer have the laptop in question.
Created attachment 632552 [details] Part of /var/log/messages You might not have the hardware any more, but there are still people out there who do. We have suffered from this bug for years, and closing it with a "NOTABUG" resolution does not help at all. Please find attached the part of the logs where the problem occurred. It was reproduced on F17 with the following kernel package: kernel-3.6.2-4.fc17.x86_64 Please also do us a favour and re-open the bug, and transfer it to Fedora 17 or Fedora 18.
Well, I understand this issue is not going to be fixed, since it affects old and unpopular hardware. Can we at least request that the old behaviour be restored, where the issue only affected the network connection and did not cause a kernel panic that drops to a text console and loses all your work?
The "list_del corruption" messages in the logs from comment 30 do not give enough information to find out where the problem is. Please install the kernel-debug variant; it should print additional warnings that better indicate where the problem is.
Created attachment 666733 [details] Full /var/log/messages
Issue reproduced with debug kernel:
Dec 20 15:21:00 bb229 kernel: [ 374.788380] sky2 0000:04:00.0: eth0: rx error, status 0x7ffc0001 length 1300
Dec 20 15:21:24 bb229 kernel: [ 399.529927] sky2 0000:04:00.0: eth0: rx error, status 0x7ffc0001 length 220
Dec 20 15:21:28 bb229 kernel: [ 402.719641] sky2 0000:04:00.0: eth0: rx error, status 0x7ffc0001 length 724
Issue reproduced with normal kernel:
Dec 20 17:03:20 bb229 kernel: [ 2573.521475] sky2 0000:04:00.0: eth0: rx error, status 0xe3c4e3c4 length 0
Dec 20 17:04:01 bb229 kernel: [ 2614.600284] eth0: hw csum failure
Dec 20 17:04:01 bb229 kernel: [ 2614.600294] Pid: 0, comm: swapper/0 Tainted: G C O 3.6.10-2.fc17.x86_64 #1
It is worth noting that the debug kernel didn't cause a single issue, whereas the normal kernel produced a kernel panic on a text console. How can the debug kernel work fine while the normal one has the issue? Please ignore the many restart attempts; I couldn't get rid of the debug kernel, because the keyboard didn't respond in the GRUB menu and I couldn't remove a running kernel with yum. I had to edit grub.cfg to load the normal kernel first.
Rsyslog might not log all kernel messages. Please attach the dmesg output from the debug kernel when the issue happens (i.e. when 'dmesg | grep "rx error"' shows entries). Thanks.
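To compare the "rx error" entries between kernels, the matching dmesg lines can be tallied by status word. The following is only a Python sketch, assuming the two message formats quoted in this report; parse_rx_error and summarize are hypothetical helper names, not part of any existing tool.

```python
import re
from collections import Counter

# Matches both formats seen in this report:
#   "rx length error: status 0x5ea0100 length 3018"
#   "rx error, status 0x7ffc0001 length 1300"
RX_ERROR_RE = re.compile(
    r"rx (?:length )?error[:,] status (0x[0-9a-fA-F]+) length (\d+)"
)


def parse_rx_error(line):
    """Return (status, length) as integers, or None if the line does not match."""
    m = RX_ERROR_RE.search(line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2))


def summarize(lines):
    """Count rx error entries per status word."""
    counts = Counter()
    for line in lines:
        parsed = parse_rx_error(line)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts
```

summarize() accepts any iterable of lines, so it can be fed an open file holding saved dmesg output or sys.stdin when dmesg is piped into the script.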
I also checked dmesg and there was nothing there that wasn't included in /var/log/messages. dmesg displayed only the lines already printed in the messages log. Is any other configuration required to make the kernel more chatty?
I would still like to see it.
Created attachment 667242 [details] Part of /var/log/messages with debug kernel As you wish!
Created attachment 667243 [details] Dmesg with debug kernel
I think the crash does not happen on kernel-debug because the vbox modules are not loaded. Please remove or blacklist the vbox modules on the standard kernel and check whether that prevents the crashes.
Created attachment 674018 [details] sky2_32bit_dma.patch Trial patch - use 32-bit DMA only on the sky2 device. A similar patch helped with random hw problems in the skge driver; perhaps the troubles here are also caused by 64-bit DMA. Please test; I launched a kernel build with the patch here: http://koji.fedoraproject.org/koji/taskinfo?taskID=4845143
Hello Stanislaw, I apologise for the delay. I was unsure how exactly I should test this. I am doing the following at the moment:
root@bb229:[201] ~ # yum install http://kojipkgs.fedoraproject.org//work/tasks/5145/4845145/kernel-3.6.11-4.sky2.fc17.x86_64.rpm
If I need to update more packages, please let me know.
Created attachment 675588 [details] Part of Jan 9, 2013 /var/log/messages Unfortunately, the issue occurred on the suggested kernel as well.
At least those "list_del corruption" related crashes are gone; that must be a bug in the vbox modules. The hw csum errors may be related to problems on the PCIe bus. Do any of the (mutually exclusive) kernel boot options below prevent them? pcie_aspm=off pcie_aspm=force It would also be good to try: pci=nocsr,noacpi,nomsi
You can also try noapic kernel boot option.
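When testing boot options like these one at a time, it helps to confirm that the running kernel was actually booted with the option under test. A small Python sketch that reads /proc/cmdline; parse_cmdline and read_cmdline are hypothetical helper names, and the parsing is a plain whitespace split that ignores quoted values.

```python
def parse_cmdline(text):
    """Split a kernel command line into {option: value or None}."""
    opts = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        # Options without "=" (for example noapic) are stored with value None.
        opts[key] = value if sep else None
    return opts


def read_cmdline(path="/proc/cmdline"):
    """Parse the command line the running kernel was booted with."""
    with open(path) as f:
        return parse_cmdline(f.read())
```

For example, read_cmdline().get("pcie_aspm") returns "off" or "force" when one of those options is active and None when neither was passed, and "noapic" in read_cmdline() tells whether that option was set.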
This is not something that can be easily fixed: http://marc.info/?t=128157418200003&r=1&w=4 If the upstream maintainer is not able to fix the problem, we are even less able to do so. Closing as won't fix.
Hello Stanislaw, Thank you for taking the time to look into this matter. I agree with you, but my comment 31 is still valid. May I ask whether it is possible to restore the old kernel behaviour, where the issue didn't cause a kernel panic and we were able to recover with a simple "rmmod sky2; modprobe sky2"?
As I wrote before, I think the kernel panic is caused by the vbox modules. We do not see the panic when the modules are not loaded, i.e. in kernel-debug. Or am I wrong, and the panic also happens without vbox? Note: the vbox modules are external and well-known crap; we do not support them.
It was also reproduced on the custom-built kernel kernel-3.6.11-4.sky2.fc17.x86_64.rpm, where the vbox modules were not loaded. I am aware of the vbox situation :)
Oh, I missed that. We have this in the sky2 kernel logs:
Jan 9 16:27:02 bb229 kernel: [ 308.371941] sky2 0000:04:00.0: eth0: rx error, status 0xe3c4e3c4 length 0
Jan 9 16:27:28 bb229 kernel: [ 333.901176] thunderbird: Corrupted page table at address 7f01b9710b7c
Jan 9 16:27:28 bb229 kernel: [ 333.901234] PGD 9be76067 PUD 9e630067 PMD a884d067 PTE d71c433c690438cb
I think we have memory corruption caused by sky2 DMA, as described here: http://marc.info/?l=linux-netdev&m=128207384327193&w=4 It looks like on some kernel builds, or with some modules loaded, the corruption can hit vital data, while on others (like kernel-debug) we corrupt unallocated memory. I cannot help here. My only advice is to stop using sky2 on your machine.
Created attachment 678275 [details] Part of Jan 14, 2013 /var/log/messages For completeness, I have also included the /var/log/messages with the suggested kernel options (pcie_aspm=off, pcie_aspm=force, pci=nocsr,noacpi,nomsi and noapic). None of them worked. Thanks, Stanislaw, for your efforts! I second your advice. I should upgrade my PC soon.
This seems similar: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1576816 I see this bug on Debian kernel 3.16.0-4-amd64 as well. Seems to be kernel or driver related. This is workstation class hardware I am running on...