Bug 592398 - sky2 rx hw csum failure - need to reload sky2 module to recover
Status: CLOSED WONTFIX
Product: Fedora
Classification: Fedora
Component: kernel
Version: 17
Hardware: All Linux
Priority: low
Severity: medium
Assigned To: Stanislaw Gruszka
QA Contact: Fedora Extras Quality Assurance
Keywords: Reopened
Duplicates: 569779 632501
 
Reported: 2010-05-14 15:12 EDT by Maciej Żenczykowski
Modified: 2013-01-14 11:00 EST

Doc Type: Bug Fix
Last Closed: 2013-01-14 07:19:50 EST

Attachments
Part of /var/log/messages (16.93 KB, text/plain)
2012-10-24 02:11 EDT, Panos Kavalagios
Full /var/log/messages (9.86 MB, text/plain)
2012-12-20 10:25 EST, Panos Kavalagios
Part of /var/log/messages with debug kernel (79.10 KB, text/plain)
2012-12-21 08:50 EST, Panos Kavalagios
Dmesg with debug kernel (54.58 KB, text/plain)
2012-12-21 08:51 EST, Panos Kavalagios
sky2_32bit_dma.patch (968 bytes, text/plain)
2013-01-07 08:40 EST, Stanislaw Gruszka
Part of Jan 9, 2013 /var/log/messages (678.39 KB, text/plain)
2013-01-09 09:37 EST, Panos Kavalagios
Part of Jan 14, 2013 /var/log/messages (967.49 KB, text/plain)
2013-01-14 11:00 EST, Panos Kavalagios

Description Maciej Żenczykowski 2010-05-14 15:12:59 EDT
I've run into a situation where the sky2 driver kind-of-locks-up.

Kernel: 2.6.32.12-115.fc12.x86_64

Machine in question is a Macbook Pro 4,1 with "lspci -vvnn" of the nic:

0c:00.0 Ethernet controller [0200]: Marvell Technology Group Ltd. Marvell Yukon 88E8058 PCI-E Gigabit Ethernet Controller [11ab:436a] (rev 13)
        Subsystem: Marvell Technology Group Ltd. Device [11ab:00ba]
        Physical Slot: 5
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 256 bytes
        Interrupt: pin A routed to IRQ 29
        Region 0: Memory at d7200000 (64-bit, non-prefetchable) [size=16K]
        Region 2: I/O ports at 5000 [size=256]
        Expansion ROM at dba00000 [disabled] [size=128K]
        Capabilities: [48] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
                Product Name: Marvell Yukon 88E8058 Gigabit Ethernet Controller
                Read-only fields:
                        [PN] Part number: Yukon 88E8058
                        [EC] Engineering changes: Rev. 1.3
                        [MN] Manufacture ID: 4d 61 72 76 65 6c 6c
                        [SN] Serial number: AbCdEfGE8127B
                        [CP] Extended capability: 01 10 cc 03
                        [RV] Reserved: checksum good, 9 byte(s) reserved
                Read/write fields:
                        [RW] Read-write area: 121 byte(s) free
                End
        Capabilities: [5c] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee0300c  Data: 41a1
        Capabilities: [e0] Express (v1) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 2048 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <256ns, L1 unlimited
                        ClockPM+ Surprise- LLActRep- BwNot-
                LnkCtl: ASPM L0s L1 Enabled; RCB 128 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 1f, GenCap- CGenEn- ChkCap- ChkEn-
        Kernel driver in use: sky2
        Kernel modules: sky2


Messages in /var/log/messages:

(no messages for over half an hour)
May 14 10:34:15 nike kernel: sky2 eth0: rx length error: status 0x5ea0100 length 3018
May 14 10:34:18 nike kernel: eth0: hw csum failure.
May 14 10:34:18 nike kernel: Pid: 0, comm: swapper Not tainted 2.6.32.12-115.fc12.x86_64 #1
May 14 10:34:18 nike kernel: Call Trace:
May 14 10:34:18 nike kernel: <IRQ>  [<ffffffff813b457f>] netdev_rx_csum_fault+0x3b/0x3f
May 14 10:34:18 nike kernel: [<ffffffff813ae4ba>] __skb_checksum_complete_head+0x51/0x65
May 14 10:34:18 nike kernel: [<ffffffff813ae4df>] __skb_checksum_complete+0x11/0x13
May 14 10:34:18 nike kernel: [<ffffffff81417f4d>] nf_ip_checksum+0xdd/0xe3
May 14 10:34:18 nike kernel: [<ffffffff813d7af3>] tcp_error+0x105/0x1a2
May 14 10:34:18 nike kernel: [<ffffffff813f5536>] ? tcp_rcv_established+0x555/0x6a6
May 14 10:34:18 nike kernel: [<ffffffff813d4b69>] nf_conntrack_in+0x17a/0x86e
May 14 10:34:18 nike kernel: [<ffffffff81047209>] ? enqueue_entity+0x25b/0x267
May 14 10:34:18 nike kernel: [<ffffffff8104810a>] ? enqueue_task_fair+0x2a/0x6d
May 14 10:34:18 nike kernel: [<ffffffff814185bb>] ipv4_conntrack_in+0x21/0x23
May 14 10:34:18 nike kernel: [<ffffffff813d1c22>] nf_iterate+0x46/0x89
May 14 10:34:18 nike kernel: [<ffffffff813e0fd6>] ? ip_rcv_finish+0x0/0x3ba
May 14 10:34:18 nike kernel: [<ffffffff813d1ccf>] nf_hook_slow+0x6a/0xcb
May 14 10:34:18 nike kernel: [<ffffffff813e0fd6>] ? ip_rcv_finish+0x0/0x3ba
May 14 10:34:18 nike kernel: [<ffffffff813e160c>] ip_rcv+0x27c/0x2c9
May 14 10:34:18 nike kernel: [<ffffffff813b3ce7>] netif_receive_skb+0x3fc/0x421
May 14 10:34:18 nike kernel: [<ffffffff813b3e6e>] napi_skb_finish+0x29/0x3d
May 14 10:34:18 nike kernel: [<ffffffff813b42c9>] napi_gro_receive+0x2f/0x34
May 14 10:34:18 nike kernel: [<ffffffffa014b002>] sky2_poll+0x813/0xa88 [sky2]
May 14 10:34:18 nike kernel: [<ffffffff81017e73>] ? native_sched_clock+0x2d/0x5f
May 14 10:34:18 nike kernel: [<ffffffff81017d45>] ? sched_clock+0x9/0xd
May 14 10:34:18 nike kernel: [<ffffffff813b43ff>] net_rx_action+0xaf/0x1c9
May 14 10:34:18 nike kernel: [<ffffffff8105d998>] __do_softirq+0xe5/0x1a9
May 14 10:34:18 nike kernel: [<ffffffff810acfbd>] ? handle_IRQ_event+0x60/0x121
May 14 10:34:18 nike kernel: [<ffffffff81012e6c>] call_softirq+0x1c/0x30
May 14 10:34:18 nike kernel: [<ffffffff810143ea>] do_softirq+0x46/0x86
May 14 10:34:18 nike kernel: [<ffffffff8105d7d6>] irq_exit+0x3b/0x7d
May 14 10:34:18 nike kernel: [<ffffffff8145ad3d>] do_IRQ+0xa5/0xbc
May 14 10:34:18 nike kernel: [<ffffffff81012693>] ret_from_intr+0x0/0x11
May 14 10:34:18 nike kernel: <EOI>  [<ffffffff81295665>] ? acpi_idle_enter_bm+0x27e/0x2b2
May 14 10:34:18 nike kernel: [<ffffffff8129565e>] ? acpi_idle_enter_bm+0x277/0x2b2
May 14 10:34:18 nike kernel: [<ffffffff8138949e>] ? cpuidle_idle_call+0x99/0xf3
May 14 10:34:18 nike kernel: [<ffffffff81010cdd>] ? cpu_idle+0xaa/0xe4
May 14 10:34:18 nike kernel: [<ffffffff8144ea71>] ? start_secondary+0x1f2/0x233

This was the first occurrence; it was immediately followed by further hw csum failures (with differing stack traces).

In total 208 (logged) failures over the course of the next 10 minutes.  Furthermore the network is mostly unusable (imagine network running at dial-up modem speeds with 90% packet loss) - but it is *not* _quite_ dead: occasionally stuff does seem to get through.

Unplugging and replugging the ethernet cable didn't help, however unloading and reloading the sky2 module (and reconfiguring networking) fixes the problem [ip link down/up not tried].

This is the first time I'm reporting this, but I've now seen this somewhere on the order of 3 times (ie. very very rarely) over the last 3-4 months on various fc12 kernels from koji.

It might be worth noting that (at least this time) this seems to have happened (pretty much?) immediately after I came back to my laptop, having not used it during the night (but it wasn't suspended or anything like that).
Comment 1 Richard Dale 2010-07-08 02:45:25 EDT
Also occurs on my Gigabyte 965P-S3 Core 2 Duo, on FC13 2.6.33.5-124.fc13.i686

First starts out as:
eth0: hw csum failure.

with trace:
Call Trace:
 [<c076ebb6>] ? printk+0xf/0x11
 [<c06ea443>] netdev_rx_csum_fault+0x29/0x30
 [<c06e559a>] __skb_checksum_complete_head+0x42/0x57
 [<c06e55ba>] __skb_checksum_complete+0xb/0xd
 [<c073dbe5>] nf_ip_checksum+0xcf/0xdb
 [<c0706bfa>] tcp_error+0xd8/0x16a
 [<c0706b22>] ? tcp_error+0x0/0x16a
 [<c07044af>] nf_conntrack_in+0xf4/0x688
 [<c0679135>] ? usb_submit_urb+0x23f/0x2a0
 [<f378f79c>] ? ftdi_submit_read_urb+0x3e/0x74 [ftdi_sio]
 [<f378faf3>] ? ftdi_read_bulk_callback+0x321/0x332 [ftdi_sio]
 [<c0679206>] ? usb_free_urb+0x11/0x13
 [<c073e13c>] ipv4_conntrack_in+0x19/0x1e
 [<c0701f2b>] nf_iterate+0x2f/0x62
 [<c070e440>] ? ip_rcv_finish+0x0/0x2bd
 [<c070203f>] nf_hook_slow+0x3b/0x91
 [<c070e440>] ? ip_rcv_finish+0x0/0x2bd
 [<c070e8fe>] ip_rcv+0x201/0x234
 [<c070e440>] ? ip_rcv_finish+0x0/0x2bd
 [<c06e9d69>] netif_receive_skb+0x3ae/0x3c9
 [<c06e9e91>] napi_skb_finish+0x1e/0x34
 [<c06ea207>] napi_gro_receive+0x20/0x24
 [<f3722dc6>] sky2_poll+0x7c6/0x9d8 [sky2]
 [<c06ea301>] net_rx_action+0x92/0x188
 [<c043c1d1>] __do_softirq+0xac/0x152
 [<c043c2a8>] do_softirq+0x31/0x3c
 [<c043c3bc>] irq_exit+0x29/0x5c
 [<c040459d>] do_IRQ+0x86/0x9a
 [<c0403830>] common_interrupt+0x30/0x38
 [<c04091f3>] ? mwait_idle+0x5c/0x67
 [<c04024b8>] cpu_idle+0x91/0xad
 [<c075e39e>] rest_init+0x62/0x64
 [<c09b78f1>] start_kernel+0x346/0x34b
 [<c09b7099>] i386_start_kernel+0x99/0xa0

then quickly degrades until we get other bad things happening:
BUG: soft lockup - CPU#1 stuck for 61s!
Comment 2 Chuck Ebbert 2010-07-08 06:59:45 EDT
*** Bug 569779 has been marked as a duplicate of this bug. ***
Comment 3 Maciej Żenczykowski 2010-07-12 00:49:02 EDT
Just happened again with

$ uname -a
Linux nike 2.6.33.5-112.fc13.x86_64 #1 SMP Thu May 27 02:28:31 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

fixed with "ip link set eth0 down; ip link set eth0 up; dhclient -x; dhclient eth0"


Jul 10 19:41:26 nike kernel: sky2 eth0: rx length error: status 0x5ea0100 length 3018
Jul 10 19:41:52 nike kernel: eth0: hw csum failure.
...1547 additional identical hw csum failure messages across 34 hours (I was away from the machine and when I came back I found the network was unusable and fixed it with the above).
Jul 12 05:47:35 nike kernel: eth0: hw csum failure.
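To see how the failures are spread over a long outage like this one, the log can be bucketed by hour. The following is only a sketch: it assumes the classic syslog line format shown above ("Mon DD HH:MM:SS host kernel: ..."), and the function name is made up for illustration.

```sh
#!/bin/sh
# Sketch: count "hw csum failure" messages per hour from syslog-style lines.
# ASSUMPTION: lines start with "Mon DD HH:MM:SS", as in the /var/log/messages excerpts above.

# csum_failures_per_hour: read log lines on stdin, print "<Mon> <DD> <HH>:00 <count>"
csum_failures_per_hour() {
    awk '/hw csum failure/ {
        hour = substr($3, 1, 2)                  # HH from HH:MM:SS
        key = $1 " " $2 " " hour ":00"
        if (!(key in count)) order[++n] = key    # remember first-seen order
        count[key]++
    }
    END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }'
}
```

It could be run as `cat /var/log/messages-* /var/log/messages | csum_failures_per_hour`.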
Comment 4 Maciej Żenczykowski 2010-08-11 20:29:36 EDT
Still happening with

$ uname -a
Linux nike 2.6.34.3-35.rc1.fc13.x86_64 #1 SMP Sat Aug 7 16:32:24 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

It actually appears to be happening more often of late (I think 2.6.32 was better than 2.6.33 or 2.6.34).

Once it starts happening your network is dead and you get error messages (with stack dump) about every 10 seconds (for example 700 'eth0: hw csum failure' messages over the course of an hour).

It looks like the error recovery of the sky2 driver in this case (nike kernel: sky2 0000:0c:00.0: eth0: rx length error: status 0x5d60100 length 2982) could use some love.
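The two numbers in an "rx length error" line can be compared by hand. As a sketch, assuming (an unverified reading of the sky2 receive path, not something stated in this report) that the upper 16 bits of the status word carry the frame length seen by the MAC, and that the message is printed when that value disagrees with the length reported for the DMA transfer, the helper below extracts both values from a log line. The function names are hypothetical.

```sh
#!/bin/sh
# Sketch: split a sky2 "rx length error" line into its two length values.
# ASSUMPTION: bits 31..16 of the status word hold the MAC-reported frame length.

# mac_len STATUS -> frame length taken from the upper 16 bits of STATUS
mac_len() {
    echo $(( ($1 >> 16) & 0xffff ))
}

# decode_rx_len_error LINE -> "mac=<n> dma=<n>", or non-zero exit if LINE does not match
decode_rx_len_error() {
    # Pull "STATUS LENGTH" out of the line; empty output means no match.
    fields=$(printf '%s\n' "$1" |
        sed -n 's/.*rx length error: status \(0x[0-9a-fA-F]\{1,\}\) length \([0-9]\{1,\}\).*/\1 \2/p')
    [ -n "$fields" ] || return 1
    set -- $fields
    echo "mac=$(mac_len "$1") dma=$2"
}
```

For the line quoted above (status 0x5d60100, length 2982) this prints `mac=1494 dma=2982`: under that assumption the MAC saw an ordinary-sized frame while the reported transfer length was roughly twice as large.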
Comment 5 Chuck Ebbert 2010-09-10 17:03:13 EDT
*** Bug 632501 has been marked as a duplicate of this bug. ***
Comment 6 Chuck Ebbert 2010-09-14 11:00:15 EDT
I see you reported this to netdev but there was no fix:

  http://marc.info/?t=128157418200003&r=1&w=4
Comment 7 Maciej Żenczykowski 2010-09-14 14:28:41 EDT
Yeah, I think it's as 'simple' as adding some sort of driver reset triggered by the 'rx len error', but that's much simpler for someone that actually understands the driver code.

It's obviously a recoverable error since 'ip link set eth0 down && ip link set eth0 up' makes the problem go away (till the next time it happens).

I'm not 100% sure yet, but I don't think turning off all acceleration via 'ethtool -K eth0 ...' prevents the problem from happening (although it may decrease the probability of it triggering - hard to say, it's very sporadic...)

My last occurrences (I'm reasonably sure the Sep 8 one was with all accel off):

cat /var/log/messages-* /var/log/messages | egrep 'rx len'
Aug 16 21:26:25 eth0: rx length error: status 0x5e50100 length 3013
Aug 17 11:16:18 eth0: rx length error: status 0x5ea0100 length 3018
Aug 18 02:14:27 eth0: rx length error: status 0x5ea0100 length 3018
Aug 24 21:34:37 eth0: rx length error: status 0x5ea0100 length 3018
Aug 24 22:35:00 eth0: rx length error: status 0x5e50100 length 3013
Aug 25 09:42:19 eth0: rx length error: status 0x5ea0100 length 3018
Sep  8 03:57:04 eth0: rx length error: status 0x5e50100 length 3013

I don't fully follow the corruption comments on the netdev thread and their true implications (and why they don't want to fix it?).
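Until the driver recovers by itself, the manual workaround can be automated from user space. The following is only a sketch: it reads log lines on standard input, and whenever it sees an "rx length error" message it bounces the named interface with the same `ip link` commands quoted above. The function names and the idea of feeding it from `tail -F` are assumptions for illustration. It needs root to change the link state and does not renew a DHCP lease.

```sh
#!/bin/sh
# Sketch of a user-space watchdog for the sky2 "rx length error" condition.
# ASSUMPTION: the interface name appears in the log line as "<ifname>: rx length error:".

# iface_from_line LINE -> interface name, or non-zero exit if LINE is not an rx length error
iface_from_line() {
    name=$(printf '%s\n' "$1" |
        sed -n 's/.*[[:space:]]\([[:alnum:]]\{1,\}\): rx length error:.*/\1/p')
    [ -n "$name" ] || return 1
    echo "$name"
}

# bounce_link IFACE: same recovery as "ip link set IFACE down && ip link set IFACE up"
bounce_link() {
    ip link set "$1" down && ip link set "$1" up
}

# watch_log: read log lines from stdin and bounce the link on each rx length error
watch_log() {
    while IFS= read -r line; do
        if ifname=$(iface_from_line "$line"); then
            bounce_link "$ifname"
        fi
    done
}
```

It could be run as root with something like `tail -F /var/log/messages | watch_log`. Reacting to the "rx length error" line rather than to the later checksum failures is a design choice: in the logs here it is the message that precedes each outage.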
Comment 8 Panos Kavalagios 2010-11-08 02:58:12 EST
This is a very irritating problem. It was fixed in Fedora 11 and returned in Fedora 12, as I reported in bug 514693. Since then, we have been unable to download large files. Our Ethernet connection works fine as long as no large download is involved.
Comment 9 Maciej Żenczykowski 2010-11-08 04:58:31 EST
I haven't seen this error in a long time:

# cat /var/log/messages-* /var/log/messages | egrep 'rx len'
Oct 12 16:10:22 nike kernel: sky2 0000:0c:00.0: eth0: rx length error: status 0x3c0300 length 1548

I wonder if this means it got fixed somehow in recent kernels, or if I've just been lucky?

# cat /var/log/messages-* /var/log/messages | egrep 'Linux version'
Oct 19 11:22:22 nike kernel: Linux version 2.6.34.7-61.fc13.x86_64 (mockbuild@x86-02.phx2.fedoraproject.org) (gcc version 4.4.4 20100630 (Red Hat 4.4.4-10) (GCC) ) #1 SMP Tue Oct 19 04:06:30 UTC 2010
Nov  7 19:44:08 nike kernel: [    0.000000] Linux version 2.6.35.6-48.fc14.x86_64 (mockbuild@x86-02.phx2.fedoraproject.org) (gcc version 4.5.1 20100924 (Red Hat 4.5.1-4) (GCC) ) #1 SMP Fri Oct 22 15:36:08 UTC 2010

So this suggests that just perhaps 2.6.34.7-61.fc13.x86_64 is good (or much better...)?

# last | egrep boot | tac
reboot   system boot  2.6.34.7-59.fc13 Fri Oct  1 11:12 - 11:19 (18+00:07)  
reboot   system boot  2.6.34.7-61.fc13 Tue Oct 19 11:22 - 19:41 (19+09:19)  
reboot   system boot  2.6.35.6-48.fc14 Sun Nov  7 19:44 - 19:56  (00:12)    

Based on the kernel changelog, though, it seems doubtful that anything relevant was fixed in -61 compared to -59:

* Mon Oct 18 2010 Kyle McMartin <kyle@redhat.com> 2.6.34.7-61
- Add Ricoh e822 support. (rhbz#596475) Thanks to sgruszka@ for
  sending the patches in.

* Mon Oct 18 2010 Kyle McMartin <kyle@redhat.com> 2.6.34.7-60
- Quirk to disable DMAR with Ricoh card reader/firewire. (rhbz#605888)

* Mon Oct 18 2010 Kyle McMartin <kyle@redhat.com>
- Two networking fixes (skge, r8169) from sgruska. (rhbz#447489,629158)

* Thu Oct 14 2010 Neil Horman <nhorman@redhat.com>
- Fix rcu warning in twsock_net (bz 642905)

* Wed Oct 06 2010 Neil Horman <nhorman@redhat.com>
- Fix WARN_ON when you try to create an exiting bond in bond_masters

* Thu Sep 30 2010 Chuck Ebbert <cebbert@redhat.com>
- CVE-2010-3432: sctp-do-not-reset-the-packet-during-sctp_packet_config.patch

* Thu Sep 30 2010 Ben Skeggs <bskeggs@redhat.com> 2.6.34.7-59



I think this means I've just been very lucky...
Comment 10 Panos Kavalagios 2010-11-08 05:16:08 EST
You're extremely lucky :) The only thing that has changed is that they hid the error. Try grepping for "hw csum failure"! It's still out there. Also try downloading the Fedora DVD and enjoy. It reproduced for me after about 290 MB.
Comment 11 Maciej Żenczykowski 2010-11-08 05:36:26 EST
$ uname -a
Linux nike 2.6.35.6-48.fc14.x86_64 #1 SMP Fri Oct 22 15:36:08 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

$ wget -O /dev/null http://mirrors.kernel.org/fedora/releases/14/Fedora/x86_64/iso/Fedora-14-x86_64-DVD.iso
--2010-11-08 02:31:16--  http://mirrors.kernel.org/fedora/releases/14/Fedora/x86_64/iso/Fedora-14-x86_64-DVD.iso
Resolving mirrors.kernel.org... 204.152.191.39
Connecting to mirrors.kernel.org|204.152.191.39|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3520802816 (3.3G) [application/x-iso9660-image]
Saving to: `/dev/null'

100%[====================================>] 3,520,802,816 22.1M/s   in 2m 28s  

2010-11-08 02:33:44 (22.6 MB/s) - `/dev/null' saved [3520802816/3520802816]

And I downloaded twice, so after 6.6 GB transferred, still no errors...
Comment 12 Panos Kavalagios 2010-11-08 06:19:50 EST
The bug is opened against Fedora 13 and not 14. I'm using Fedora 13 with the latest kernel:

$ uname -a
Linux bb229 2.6.34.7-61.fc13.x86_64 #1 SMP Tue Oct 19 04:06:30 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

If the bug is solved in Fedora 14, then that would be awesome news. I think I have to schedule the upgrade earlier.
Comment 13 Maciej Żenczykowski 2010-11-08 06:46:26 EST
I upgraded barely 2 days ago, so it is still _way_ too early to say whether it's truly stable.

Note that I saw no issues from Oct 19th till Nov 6th on the F13 kernel (2.6.34.7-61.fc13.x86_64), and I'm sure I did way more downloading than a mere couple GB's during that time (including downloading all of F14 during a yum upgrade).

Basically all I'm saying is that of late the probability of this triggering has gone drastically down for me.  I have no idea what the actual cause of this apparent goodness is.  Weather?

It looks like our bugs must be slightly different in some way...

[oh, wait, I just realized, I think I've turned off all hw optimizations on the nic (via 'ethtool -K' before link up), maybe that's the fix, I'll retry the downloads tomorrow with tso and the like turned on]
Comment 14 Maciej Żenczykowski 2010-11-09 01:54:11 EST
I copied (downloaded via scp -c arcfour) around 100GB of data with full acceleration turned on and saw no issues with the F14 kernel.  But, then, my experience has always been that it has a tendency to break during periods of idleness, and not load...

I'll leave acceleration turned on and we'll see if it breaks by itself.
Comment 15 Panos Kavalagios 2010-11-09 06:46:14 EST
Maciej, thank you very much for your detailed information. I've upgraded my system to Fedora 14 as well:

# uname -a
Linux bb229 2.6.35.6-48.fc14.x86_64 #1 SMP Fri Oct 22 15:36:08 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

and it seems to work fine. I've downloaded the Fedora DVD ISO over the LAN using scp, wget and a browser with no problems. OK, Firefox freezes every time the download reaches 2 GB, but I don't think that can be attributed to the Ethernet controller; it looks like a Firefox bug. Chrome downloaded the file without issues and much faster.

My problem was only ever observed while downloading a big file over the Internet. I have never hit it during yum downloads, for example. I just hope the bug won't reappear as it has in the past. Fingers crossed :)
Comment 16 Panos Kavalagios 2010-11-10 02:16:12 EST
Famous last words:

Nov 10 02:13:38 bb229 kernel: [45909.915965] eth0: hw csum failure.
Nov 10 02:13:38 bb229 kernel: [45909.915974] Pid: 0, comm: swapper Not tainted 2.6.35.6-48.fc14.x86_64 #1
Nov 10 02:13:38 bb229 kernel: [45909.915977] Call Trace:
Nov 10 02:13:38 bb229 kernel: [45909.915980]  <IRQ>  [<ffffffff813be889>] netdev_rx_csum_fault+0x3b/0x40
Nov 10 02:13:38 bb229 kernel: [45909.915997]  [<ffffffff813b9057>] __skb_checksum_complete_head+0x51/0x65
Nov 10 02:13:38 bb229 kernel: [45909.916002]  [<ffffffff813b907c>] __skb_checksum_complete+0x11/0x13
Nov 10 02:13:38 bb229 kernel: [45909.916008]  [<ffffffff814281aa>] nf_ip_checksum+0xce/0xd4
Nov 10 02:13:38 bb229 kernel: [45909.916015]  [<ffffffff813e7c73>] udp_error+0x142/0x1a0
Nov 10 02:13:38 bb229 kernel: [45909.916022]  [<ffffffff81047a42>] ? select_idle_sibling+0x3a/0xee
Nov 10 02:13:38 bb229 kernel: [45909.916028]  [<ffffffff813e314f>] nf_conntrack_in+0x14d/0x8b4
Nov 10 02:13:38 bb229 kernel: [45909.916036]  [<ffffffff81105c71>] ? __raw_local_irq_save+0x1d/0x23
Nov 10 02:13:38 bb229 kernel: [45909.916042]  [<ffffffff81108682>] ? kmem_cache_free+0x7a/0xb9
...

That problem was caused by the following crontab entry:

0 2 10 11 *       /home/panos/bin/runwget 'ftp://ftp.ntua.gr/pub/linux/fedora/linux/releases/14/Fedora/i386/iso/Fedora-14-i386-DVD.iso'

> dir -h /opt/tmp/Fedora-14-i386-DVD.iso 
-rw-r--r-- 1 panos users 391M 2010-11-10 02:13 /opt/tmp/Fedora-14-i386-DVD.iso

Only 391 MB had been downloaded when the problem appeared, 13 minutes in. The problem has become less frequent, but it is definitely still there.
Comment 17 Maciej Żenczykowski 2010-11-11 03:32:08 EST
Wee... hit the bug again in the current F14 kernel.
This time I had all acceleration enabled.

Will try running with all rx acceleration disabled, but tx acceleration turned on, ie:
  sudo ip link set eth0 down
  sudo ethtool -K eth0 rx off tx on sg on  tso on  gso on  gro off lro off  rxhash on
  sudo ip link set eth0 up

I am beginning to think that this doesn't happen without acceleration (or is much much rarer).
Especially fishy is that this appears to happen with a length > ethernet MTU (ie. rx length error: status 0x... length ~3000).  It's almost like an lro issue, except I do not believe lro to be available (I can't turn it on with ethtool)?


[252362.248169] sky2 0000:0c:00.0: eth0: rx length error: status 0x5b50100 length 2917
[252365.261951] eth0: hw csum failure.
[252365.261964] Pid: 0, comm: swapper Not tainted 2.6.35.6-48.fc14.x86_64 #1
[252365.261970] Call Trace:
[252365.261975]  <IRQ>  [<ffffffff813be889>] netdev_rx_csum_fault+0x3b/0x40
[252365.262000]  [<ffffffff813b9057>] __skb_checksum_complete_head+0x51/0x65
[252365.262009]  [<ffffffff813b907c>] __skb_checksum_complete+0x11/0x13
[252365.262019]  [<ffffffff814281aa>] nf_ip_checksum+0xce/0xd4
[252365.262029]  [<ffffffff813e7c73>] udp_error+0x142/0x1a0
[252365.262041]  [<ffffffff81105c71>] ? __raw_local_irq_save+0x1d/0x23
[252365.262050]  [<ffffffff81108682>] ? kmem_cache_free+0x7a/0xb9
[252365.262060]  [<ffffffff813e314f>] nf_conntrack_in+0x14d/0x8b4
[252365.262071]  [<ffffffff81047a42>] ? select_idle_sibling+0x3a/0xee
[252365.262080]  [<ffffffff8142884d>] ipv4_conntrack_in+0x21/0x23
[252365.262088]  [<ffffffff813e0529>] nf_iterate+0x46/0x89
[252365.262098]  [<ffffffff813efe71>] ? ip_rcv_finish+0x0/0x34a
[252365.262106]  [<ffffffff813e05d6>] nf_hook_slow+0x6a/0xd0
[252365.262114]  [<ffffffff813efe71>] ? ip_rcv_finish+0x0/0x34a
[252365.262123]  [<ffffffff813efe71>] ? ip_rcv_finish+0x0/0x34a
[252365.262132]  [<ffffffff813f042b>] NF_HOOK.clone.8+0x46/0x58
[252365.262142]  [<ffffffff813b6965>] ? __netdev_alloc_skb+0x34/0x51
[252365.262150]  [<ffffffff813f07c5>] ip_rcv+0x21e/0x24d
[252365.262159]  [<ffffffff813bda4c>] __netif_receive_skb+0x3ed/0x412
[252365.262168]  [<ffffffff813be15d>] netif_receive_skb+0x57/0x5e
[252365.262177]  [<ffffffff813be676>] napi_skb_finish+0x29/0x41
[252365.262186]  [<ffffffff813be6bd>] napi_gro_receive+0x2f/0x34
[252365.262223]  [<ffffffffa02011e2>] sky2_poll+0xa3b/0xc8c [sky2]
[252365.262234]  [<ffffffff8101057c>] ? native_sched_clock+0x35/0x37
[252365.262243]  [<ffffffff81010587>] ? sched_clock+0x9/0xd
[252365.262253]  [<ffffffff8106b129>] ? sched_clock_local+0x12/0x75
[252365.262264]  [<ffffffff8146912f>] ? _raw_spin_unlock_irqrestore+0x17/0x19
[252365.262273]  [<ffffffff81108682>] ? kmem_cache_free+0x7a/0xb9
[252365.262283]  [<ffffffff813bf420>] net_rx_action+0xac/0x1bb
[252365.262291]  [<ffffffff813bcd1f>] ? __napi_schedule+0x50/0x57
[252365.262302]  [<ffffffff81053839>] __do_softirq+0xdd/0x199
[252365.262317]  [<ffffffffa01fd627>] ? sky2_intr+0x35/0x3c [sky2]
[252365.262327]  [<ffffffff8100abdc>] call_softirq+0x1c/0x30
[252365.262335]  [<ffffffff8100c338>] do_softirq+0x46/0x82
[252365.262344]  [<ffffffff81053999>] irq_exit+0x3b/0x7d
[252365.262352]  [<ffffffff8146f075>] do_IRQ+0x9d/0xb4
[252365.262361]  [<ffffffff81469593>] ret_from_intr+0x0/0x11
[252365.262366]  <EOI>  [<ffffffff8128f39c>] ? raw_local_irq_enable+0x10/0x12
[252365.262385]  [<ffffffff8106b3a0>] ? sched_clock_idle_wakeup_event+0x17/0x1b
[252365.262394]  [<ffffffff81290208>] acpi_idle_enter_bm+0x228/0x260
[252365.262405]  [<ffffffff813939e1>] cpuidle_idle_call+0x8b/0xe9
[252365.262416]  [<ffffffff81008325>] cpu_idle+0xaa/0xcc
[252365.262425]  [<ffffffff81450e46>] rest_init+0x8a/0x8c
[252365.262436]  [<ffffffff81ba1c49>] start_kernel+0x40b/0x416
[252365.262446]  [<ffffffff81ba12c6>] x86_64_start_reservations+0xb1/0xb5
[252365.262455]  [<ffffffff81ba13c2>] x86_64_start_kernel+0xf8/0x107

[252385.388560] eth0: hw csum failure.
[252385.388573] Pid: 0, comm: swapper Not tainted 2.6.35.6-48.fc14.x86_64 #1

[252395.271567] eth0: hw csum failure.
[252395.271575] Pid: 0, comm: swapper Not tainted 2.6.35.6-48.fc14.x86_64 #1

[252415.422920] eth0: hw csum failure.
[252415.422933] Pid: 3414, comm: chrome Not tainted 2.6.35.6-48.fc14.x86_64 #1

[252425.279133] eth0: hw csum failure.
[252425.279146] Pid: 3388, comm: chrome Not tainted 2.6.35.6-48.fc14.x86_64 #1
Comment 18 Jon Masters 2010-11-18 03:10:44 EST
Perhaps this needs filing upstream?
Comment 19 Maciej Żenczykowski 2011-03-05 19:34:37 EST
$ uname -a
Linux nike 2.6.35.11-83.fc14.x86_64 #1 SMP Mon Feb 7 07:06:44 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

Mar  5 16:20:02 nike kernel: [263628.531087] sky2 0000:0c:00.0: eth0: rx length error: status 0x5860100 length 2822

$ ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: off
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off
ntuple-filters: off
receive-hashing: on

I think this means that turning off rx accel makes the bug _much_ less likely to occur, but can still happen.

It is not at all clear to me how length > MTU (1500) can happen with all rx offload turned off.

I've bugged upstream (via email to netdev) and was told it's not a bug in the driver, but rather in the bios/firmware or something.  Even though clearly it can be fixed in the driver by handling the error with an appropriate reset, they're not willing to listen - they consider it not their problem.
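When comparing occurrences across different offload settings, it helps to record which receive offloads were actually enabled at the time. A minimal sketch, assuming the `ethtool -k` output format pasted in this comment (one "name: on|off" pair per line); the function name and the choice of which three features count as receive offloads are assumptions for illustration:

```sh
#!/bin/sh
# Sketch: list the receive-side offloads that are still "on" in `ethtool -k` output.
# ASSUMPTION: output lines look like "rx-checksumming: off", as in the paste above.

# rx_offloads_on: read `ethtool -k IFACE` output on stdin, print enabled rx offload names
rx_offloads_on() {
    awk -F': *' '
        $1 ~ /^(rx-checksumming|generic-receive-offload|large-receive-offload)$/ && $2 ~ /^on/ { print $1 }
    '
}
```

With the settings listed in this comment, `ethtool -k eth0 | rx_offloads_on` prints nothing, which matches the statement that rx acceleration was off when the error occurred. receive-hashing is deliberately left out of the list; add it to the pattern if it should count.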
Comment 20 Michael Eager 2011-12-02 01:55:23 EST
I see the same on F15: 

# uname -a
Linux redwood.eagercon.com 2.6.41.1-1.fc15.x86_64 #1 SMP Fri Nov 11 21:36:28 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

[  242.545001] p35p1: hw csum failure.
[  242.545006] Pid: 0, comm: kworker/0:0 Tainted: P            2.6.41.1-1.fc15.x86_64 #1
[  242.545009] Call Trace:
[  242.545011]  <IRQ>  [<ffffffff813ec311>] netdev_rx_csum_fault+0x38/0x3c
[  242.545021]  [<ffffffff813e625f>] __skb_checksum_complete_head+0x51/0x65
[  242.545026]  [<ffffffff813e6284>] __skb_checksum_complete+0x11/0x13
[  242.545037]  [<ffffffffa0f3fb14>] br_multicast_rcv+0x885/0xd52 [bridge]
[  242.545042]  [<ffffffff81117195>] ? virt_to_head_page+0xe/0x31
[  242.545052]  [<ffffffffa0f3d552>] ? br_nf_pre_routing+0x2f/0x3df [bridge]
[  242.545060]  [<ffffffffa0f37827>] ? br_multicast_flood+0x11d/0x12c [bridge]
[  242.545066]  [<ffffffff8140fc32>] ? nf_iterate+0x48/0x7d
[  242.545074]  [<ffffffffa0f385e0>] ? NF_HOOK.constprop.0+0x58/0x58 [bridge]
[  242.545082]  [<ffffffffa0f385e0>] ? NF_HOOK.constprop.0+0x58/0x58 [bridge]
[  242.545087]  [<ffffffff8140fcd9>] ? nf_hook_slow+0x72/0x115
[  242.545096]  [<ffffffffa0f38672>] br_handle_frame_finish+0x92/0x20f [bridge]
[  242.545104]  [<ffffffffa0f385e0>] ? NF_HOOK.constprop.0+0x58/0x58 [bridge]
[  242.545113]  [<ffffffffa0f385d9>] NF_HOOK.constprop.0+0x51/0x58 [bridge]
[  242.545118]  [<ffffffff8149c9ac>] ? _raw_spin_unlock_irqrestore+0x17/0x19
[  242.545126]  [<ffffffffa0f38995>] br_handle_frame+0x1a6/0x1c1 [bridge]
[  242.545135]  [<ffffffffa0f387ef>] ? br_handle_frame_finish+0x20f/0x20f [bridge]
[  242.545139]  [<ffffffff813ea31e>] __netif_receive_skb+0x2c5/0x417
[  242.545146]  [<ffffffff813ed5ba>] netif_receive_skb+0x6c/0x73
[  242.545152]  [<ffffffff813ed650>] napi_skb_finish+0x27/0x3f
[  242.545156]  [<ffffffff813edaa7>] napi_gro_receive+0x2f/0x34
[  242.545165]  [<ffffffffa0022293>] sky2_poll+0x7db/0x9f7 [sky2]
[  242.545170]  [<ffffffff813edbd3>] net_rx_action+0xa9/0x1b8
[  242.545175]  [<ffffffff8105d683>] __do_softirq+0xc9/0x1b5
[  242.545179]  [<ffffffff81014b35>] ? paravirt_read_tsc+0x9/0xd
[  242.545183]  [<ffffffff81014fec>] ? sched_clock+0x9/0xd
[  242.545188]  [<ffffffff814a536c>] call_softirq+0x1c/0x30
[  242.545192]  [<ffffffff81010b45>] do_softirq+0x46/0x81
[  242.545197]  [<ffffffff8105d94b>] irq_exit+0x57/0xb1
[  242.545201]  [<ffffffff814a5c4e>] do_IRQ+0x8e/0xa5
[  242.545205]  [<ffffffff8149cd6e>] common_interrupt+0x6e/0x6e
[  242.545208]  <EOI>  [<ffffffff81015d21>] ? mwait_idle+0x87/0xb4
[  242.545216]  [<ffffffff81015d14>] ? mwait_idle+0x7a/0xb4
[  242.545220]  [<ffffffff8100e2ed>] cpu_idle+0xae/0xe8
[  242.545225]  [<ffffffff8148baa3>] start_secondary+0x23f/0x241

Not related to MTU > 1500:

# ifconfig p35p1
p35p1     Link encap:Ethernet  HWaddr 00:1D:60:44:09:8C  
          inet6 addr: fe80::21d:60ff:fe44:98c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:94054 errors:0 dropped:0 overruns:0 frame:0
          TX packets:20449 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:138109575 (131.7 MiB)  TX bytes:1919020 (1.8 MiB)
          Interrupt:19
Comment 21 Panos Kavalagios 2011-12-02 02:43:44 EST
I don't want to disappoint you, but the same failure occurs in Fedora 16 on the latest kernel, kernel-3.1.2-1.fc16.x86_64, as well. Nothing has changed. Can you please move the bug to F16?

Nov 25 16:06:55 bb229 kernel: [20038.066696] eth0: hw csum failure.
Nov 25 16:06:55 bb229 kernel: [20038.066706] Pid: 8864, comm: firefox Tainted: G         C  3.1.2-1.fc16.x86_64 #1
Nov 25 16:06:55 bb229 kernel: [20038.066711] Call Trace:
Nov 25 16:06:55 bb229 kernel: [20038.066714]  <IRQ>  [<ffffffff813d7fde>] netdev_rx_csum_fault+0x38/0x3c
Nov 25 16:06:55 bb229 kernel: [20038.066734]  [<ffffffff813d1f2b>] __skb_checksum_complete_head+0x51/0x65
Nov 25 16:06:55 bb229 kernel: [20038.066741]  [<ffffffff813d1f50>] __skb_checksum_complete+0x11/0x13
Nov 25 16:06:55 bb229 kernel: [20038.066748]  [<ffffffff8143d654>] nf_ip_checksum+0xcd/0xd3
Nov 25 16:06:55 bb229 kernel: [20038.066771]  [<ffffffffa0468360>] udp_error+0x137/0x195 [nf_conntrack]
Nov 25 16:06:55 bb229 kernel: [20038.066780]  [<ffffffff8110cb93>] ? dma_pool_alloc+0x22f/0x244
Nov 25 16:06:55 bb229 kernel: [20038.066797]  [<ffffffffa04639ff>] nf_conntrack_in+0x174/0x7dc [nf_conntrack]
Nov 25 16:06:55 bb229 kernel: [20038.066807]  [<ffffffff813539f2>] ? uhci_alloc_td+0x1f/0x4d
Nov 25 16:06:55 bb229 kernel: [20038.066813]  [<ffffffff81353de9>] ? uhci_submit_common+0x2a7/0x341
Nov 25 16:06:55 bb229 kernel: [20038.066827]  [<ffffffffa0488569>] ipv4_conntrack_in+0x21/0x23 [nf_conntrack_ipv4]
Nov 25 16:06:55 bb229 kernel: [20038.066837]  [<ffffffff813fb903>] nf_iterate+0x48/0x7d
Nov 25 16:06:55 bb229 kernel: [20038.066844]  [<ffffffff81403794>] ? inet_del_protocol+0x35/0x35
Nov 25 16:06:55 bb229 kernel: [20038.066850]  [<ffffffff81403794>] ? inet_del_protocol+0x35/0x35
Nov 25 16:06:55 bb229 kernel: [20038.066857]  [<ffffffff813fb9aa>] nf_hook_slow+0x72/0x114
Nov 25 16:06:55 bb229 kernel: [20038.066862]  [<ffffffff81403794>] ? inet_del_protocol+0x35/0x35
Nov 25 16:06:55 bb229 kernel: [20038.066869]  [<ffffffff81403794>] ? inet_del_protocol+0x35/0x35
Nov 25 16:06:55 bb229 kernel: [20038.066874]  [<ffffffff81403aec>] NF_HOOK.constprop.3+0x46/0x58
Nov 25 16:06:55 bb229 kernel: [20038.066878]  [<ffffffff81404089>] ip_rcv+0x239/0x268
Nov 25 16:06:55 bb229 kernel: [20038.066883]  [<ffffffff813d60f2>] __netif_receive_skb+0x3cd/0x418
Nov 25 16:06:55 bb229 kernel: [20038.066887]  [<ffffffff813d928a>] netif_receive_skb+0x6c/0x73
Nov 25 16:06:55 bb229 kernel: [20038.066891]  [<ffffffff813d9320>] napi_skb_finish+0x27/0x3f
Nov 25 16:06:55 bb229 kernel: [20038.066894]  [<ffffffff813d9778>] napi_gro_receive+0x2f/0x34
Nov 25 16:06:55 bb229 kernel: [20038.066906]  [<ffffffffa00f3293>] sky2_poll+0x7db/0x9f7 [sky2]
Nov 25 16:06:55 bb229 kernel: [20038.066912]  [<ffffffff81085d6f>] ? arch_local_irq_save+0x15/0x1b
Nov 25 16:06:55 bb229 kernel: [20038.066916]  [<ffffffff813d98a4>] net_rx_action+0xa9/0x1b8
Nov 25 16:06:55 bb229 kernel: [20038.066922]  [<ffffffff8105d67b>] __do_softirq+0xc9/0x1b5
Nov 25 16:06:55 bb229 kernel: [20038.066926]  [<ffffffff81014b35>] ? paravirt_read_tsc+0x9/0xd
Nov 25 16:06:55 bb229 kernel: [20038.066930]  [<ffffffff81014fec>] ? sched_clock+0x9/0xd
Nov 25 16:06:55 bb229 kernel: [20038.066936]  [<ffffffff814bfb6c>] call_softirq+0x1c/0x30
Nov 25 16:06:55 bb229 kernel: [20038.066940]  [<ffffffff81010b45>] do_softirq+0x46/0x81
Nov 25 16:06:55 bb229 kernel: [20038.066944]  [<ffffffff8105d943>] irq_exit+0x57/0xb1
Nov 25 16:06:55 bb229 kernel: [20038.066948]  [<ffffffff814c044e>] do_IRQ+0x8e/0xa5
Nov 25 16:06:55 bb229 kernel: [20038.066953]  [<ffffffff814b756e>] common_interrupt+0x6e/0x6e
Nov 25 16:06:57 bb229 kernel: [20038.066956]  <EOI>
Comment 22 Maciej Żenczykowski 2011-12-02 04:15:21 EST
Yeah, I also see it still occasionally on F15.  The driver's buggy... and just doesn't have good error recovery.

If you put the physical NIC (eth0, renamed to peth0) in a bond or bridge (which you then call eth0), you can monitor /var/log/messages with a script and automatically ip link down/up the physical NIC to recover without losing network state.
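
A minimal sketch of such a monitoring script, assuming the physical NIC is named peth0 as described above and that the failure shows up as a "hw csum failure" line like the ones quoted in this report. The function names (`is_failure`, `bounce`, `watch`) are hypothetical, and the `ip link set` calls need root:

```python
import re
import subprocess

# Assumption: the physical NIC was renamed to peth0, as in the comment above.
PHYS_NIC = "peth0"
# Matches the failure lines seen in this report, e.g. "eth0: hw csum failure".
FAILURE_RE = re.compile(r"\bhw csum failure\b")


def is_failure(line):
    """Return True if a syslog line reports a hardware checksum failure."""
    return FAILURE_RE.search(line) is not None


def bounce(nic, run=subprocess.run):
    """Take the NIC down and bring it up again with iproute2 (needs root)."""
    run(["ip", "link", "set", "dev", nic, "down"], check=True)
    run(["ip", "link", "set", "dev", nic, "up"], check=True)


def watch(lines, nic=PHYS_NIC, run=subprocess.run):
    """Bounce the NIC once per failure line; return the number of bounces."""
    count = 0
    for line in lines:
        if is_failure(line):
            bounce(nic, run)
            count += 1
    return count
```

One way to use it is to pipe `tail -F /var/log/messages` into a script that calls `watch(sys.stdin)`. This sketch bounces the link on every matching line, so a real script would probably want a cooldown between bounces.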
Comment 23 Panos Kavalagios 2012-02-02 13:13:16 EST
It seems to be fixed in kernel-3.2.2-1.fc16.x86_64 (or earlier), or at least I haven't reproduced the issue yet. Downloads of big files have worked fine so far.
Comment 24 Josh Boyer 2012-02-28 13:17:50 EST
(In reply to comment #23)
> It seems that it is fixed in kernel-3.2.2-1.fc16.x86_64 (or earlier) or at
> least I haven't reproduced it yet that issue. Download of big files have worked
> fine so far.

Thanks for letting us know.  If you see it again, please reopen.
Comment 25 Michael Eager 2012-02-28 13:32:29 EST
Any possibility that this would be backported to Fedora 15?
Comment 26 Panos Kavalagios 2012-03-28 08:47:31 EDT
The issue has reappeared on kernel-3.3.0-4.fc16.x86_64, now with an "improved" behaviour. Not only is the network disconnected, accompanied by kernel stack traces from the sky2 module, but the computer also hangs. You have to press the "reset" button to recover the system. Reloading the sky2 kernel module is no longer possible.

Can we please re-open the issue and increase its priority?
Comment 27 Stanislaw Gruszka 2012-07-31 03:51:51 EDT
Reopening per comment 26.
Comment 28 Dave Jones 2012-10-23 11:39:33 EDT
# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates.
This update is a significant rebase from the previous version.

Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported
is still present, please change the version field to the newest release you have
encountered the issue with.  Before doing so, please ensure you are testing the
latest kernel update in that release and attach any new and relevant information
you may have gathered.

If you are not the original bug reporter and you still experience this bug,
please file a new report, as it is possible that you may be seeing a
different problem. 
(Please don't clone this bug, a fresh bug referencing this bug in the comment is sufficient).
Comment 29 Maciej Żenczykowski 2012-10-23 19:10:07 EDT
I no longer have the laptop in question.
Comment 30 Panos Kavalagios 2012-10-24 02:11:10 EDT
Created attachment 632552 [details]
Part of /var/log/messages

You might not have the hardware, but there are still people out there who do. We have suffered from this bug for years, and closing it with a "NOTABUG" resolution does not help at all.

Please find attached the part of the logs where the problem occurred. It was reproduced on F17 with the following kernel package:

kernel-3.6.2-4.fc17.x86_64

Please also do us a favour and re-open the bug. Transfer it to Fedora 17 or Fedora 18.
Comment 31 Panos Kavalagios 2012-12-20 07:09:39 EST
Well, I understand this issue is not going to be fixed because it affects old and unpopular hardware. Can we at least request that the old behaviour be restored, where the issue only affected the network connection and did not cause a kernel panic that drops to a text console and loses all your work?
Comment 32 Stanislaw Gruszka 2012-12-20 07:51:55 EST
The "list_del corruption" messages in the logs from comment 30 do not give enough information to find out where the problem is. Please install the kernel-debug variant; it should print additional warnings that better indicate where the problem is.
Comment 33 Panos Kavalagios 2012-12-20 10:25:58 EST
Created attachment 666733 [details]
Full /var/log/messages

Issue reproduced with debug kernel:

Dec 20 15:21:00 bb229 kernel: [  374.788380] sky2 0000:04:00.0: eth0: rx error, status 0x7ffc0001 length 1300
Dec 20 15:21:24 bb229 kernel: [  399.529927] sky2 0000:04:00.0: eth0: rx error, status 0x7ffc0001 length 220
Dec 20 15:21:28 bb229 kernel: [  402.719641] sky2 0000:04:00.0: eth0: rx error, status 0x7ffc0001 length 724

Issue reproduced with normal kernel:

Dec 20 17:03:20 bb229 kernel: [ 2573.521475] sky2 0000:04:00.0: eth0: rx error, status 0xe3c4e3c4 length 0
Dec 20 17:04:01 bb229 kernel: [ 2614.600284] eth0: hw csum failure
Dec 20 17:04:01 bb229 kernel: [ 2614.600294] Pid: 0, comm: swapper/0 Tainted: G         C O 3.6.10-2.fc17.x86_64 #1

It is worth noting that the debug kernel didn't cause a single issue, whereas the normal kernel produced a kernel panic in a text console. How can the debug kernel work fine while the normal one has the issue?

Please ignore the many restart attempts: I couldn't get rid of the debug kernel. The keyboard didn't respond in the grub menu and I couldn't remove a running kernel with yum, so I had to edit grub.cfg to load the normal kernel first.
Comment 34 Stanislaw Gruszka 2012-12-20 10:52:12 EST
Rsyslog might not log all kernel messages. Please attach the dmesg output from the debug kernel when the issue happens (i.e. when 'dmesg | grep "rx error"' shows entries). Thanks.
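
To compare the rx errors between kernels, the status word and length can be pulled out of each line. The following is a rough sketch, assuming the message format shown in comment 33; `parse_rx_errors` is a hypothetical helper, not part of any existing tool:

```python
import re

# Format taken from the log lines in this report, e.g.
# "sky2 0000:04:00.0: eth0: rx error, status 0x7ffc0001 length 1300"
RX_ERROR_RE = re.compile(
    r"sky2 (?P<pci>[0-9a-f:.]+): (?P<dev>\w+): rx error, "
    r"status (?P<status>0x[0-9a-f]+) length (?P<length>\d+)"
)


def parse_rx_errors(text):
    """Return a list of (pci, dev, status, length) for each rx error line."""
    out = []
    for m in RX_ERROR_RE.finditer(text):
        out.append(
            (m["pci"], m["dev"], int(m["status"], 16), int(m["length"]))
        )
    return out
```

The input can be the text of a saved dmesg dump or of /var/log/messages; lines that do not match the pattern are simply skipped.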
Comment 35 Panos Kavalagios 2012-12-20 11:10:44 EST
I also checked dmesg and there was nothing in it that wasn't included in /var/log/messages. Dmesg displayed only the lines printed in the messages log. Does it require any other configuration to make the kernel more chatty?
Comment 36 Stanislaw Gruszka 2012-12-20 11:27:20 EST
I would still like to see it.
Comment 37 Panos Kavalagios 2012-12-21 08:50:20 EST
Created attachment 667242 [details]
Part of /var/log/messages with debug kernel

As you wish!
Comment 38 Panos Kavalagios 2012-12-21 08:51:40 EST
Created attachment 667243 [details]
Dmesg with debug kernel
Comment 39 Stanislaw Gruszka 2013-01-07 08:07:55 EST
I think the crash does not happen on the debug kernel because the vbox modules are not loaded. Please remove or blacklist the vbox modules on the standard kernel and check whether that prevents the crashes.
Comment 40 Stanislaw Gruszka 2013-01-07 08:40:00 EST
Created attachment 674018 [details]
sky2_32bit_dma.patch

Trial patch: use 32-bit DMA only on the sky2 device. A similar patch helped with random hw problems on the skge driver; perhaps the troubles here are also caused by 64-bit DMA. Please test. I launched a kernel build with the patch here:
http://koji.fedoraproject.org/koji/taskinfo?taskID=4845143
Comment 41 Panos Kavalagios 2013-01-09 09:17:45 EST
Hello Stanislaw,

I apologise for the delay. I was confused about how exactly to test this. This is what I am doing at the moment:

root@bb229:[201] ~ # yum install http://kojipkgs.fedoraproject.org//work/tasks/5145/4845145/kernel-3.6.11-4.sky2.fc17.x86_64.rpm

If I need to update more packages, please let me know.
Comment 42 Panos Kavalagios 2013-01-09 09:37:45 EST
Created attachment 675588 [details]
Part of Jan 9, 2013 /var/log/messages

Unfortunately, the issue occurred on the suggested kernel as well.
Comment 43 Stanislaw Gruszka 2013-01-10 03:47:09 EST
At least those "list_del corruption" related crashes are gone; that must be a bug in the vbox modules.

Hw csum errors could be related to problems on the PCIe bus. Does either of the (mutually exclusive) kernel boot options below prevent them?

pcie_aspm=off
pcie_aspm=force 

It would also be good to try:

pci=nocsr,noacpi,nomsi
Comment 44 Stanislaw Gruszka 2013-01-13 15:45:05 EST
You can also try the noapic kernel boot option.
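
Since pcie_aspm=off and pcie_aspm=force are mutually exclusive, each option needs its own boot, and it helps to record which ones were actually active for a given test run. A small sketch, assuming the command line is read from /proc/cmdline; the helper names are hypothetical and quoted values containing spaces are not handled:

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def active_options(cmdline, wanted=("pcie_aspm", "pci", "noapic")):
    """Return only the boot options from this thread that are present."""
    params = parse_cmdline(cmdline)
    return {name: params[name] for name in wanted if name in params}
```

On the test machine this could be called as `active_options(open("/proc/cmdline").read())` and the result noted next to the log excerpt for that boot.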
Comment 45 Stanislaw Gruszka 2013-01-14 07:19:50 EST
This is nothing that could be easily fixed:
http://marc.info/?t=128157418200003&r=1&w=4

If the upstream maintainer is not able to fix the problem, we are even less able to do so. Closing as won't fix.
Comment 46 Panos Kavalagios 2013-01-14 08:17:34 EST
Hello Stanislaw,

Thank you for taking the time to look into this matter. I agree with you, but my comment 31 is still valid. May I ask whether it is possible to restore the old kernel behaviour, where the issue didn't cause a kernel panic and we were able to recover with a simple "rmmod sky2; modprobe sky2"?
Comment 47 Stanislaw Gruszka 2013-01-14 08:29:45 EST
As I wrote before, I think the kernel panic is caused by the vbox modules. We do not see the panic when the modules are not loaded, i.e. in kernel-debug. Or am I wrong, and the panic also happens without vbox? Note: the vbox modules are external and well-known crap; we do not support them.
Comment 48 Panos Kavalagios 2013-01-14 08:34:06 EST
It was also reproduced on the custom-built kernel kernel-3.6.11-4.sky2.fc17.x86_64.rpm, where the vbox modules were not loaded.

I am aware of vbox situation :)
Comment 49 Stanislaw Gruszka 2013-01-14 09:07:16 EST
Oh, I missed that. We have this in the sky2 kernel logs:

Jan  9 16:27:02 bb229 kernel: [  308.371941] sky2 0000:04:00.0: eth0: rx error, status 0xe3c4e3c4 length 0
Jan  9 16:27:28 bb229 kernel: [  333.901176] thunderbird: Corrupted page table at address 7f01b9710b7c
Jan  9 16:27:28 bb229 kernel: [  333.901234] PGD 9be76067 PUD 9e630067 PMD a884d067 PTE d71c433c690438cb

I think we have memory corruption caused by sky2 DMA, as described here:
http://marc.info/?l=linux-netdev&m=128207384327193&w=4

It looks like on some kernel builds, or with some modules loaded, the corruption can hit vital data, while on others (like kernel-debug) it hits unallocated memory.

I cannot help here. My only advice is to stop using sky2 on your machine.
Comment 50 Panos Kavalagios 2013-01-14 11:00:07 EST
Created attachment 678275 [details]
Part of Jan 14, 2013 /var/log/messages

For completeness, I have also included /var/log/messages with the suggested kernel options (pcie_aspm=off, pcie_aspm=force, pci=nocsr,noacpi,nomsi and noapic). None of them worked.

Thanks, Stanislaw, for your efforts! I second your advice. I should upgrade my PC soon.
