Description of problem:
Network access fails, and the system ultimately locks up or the kernel abends (not sure which, since the monitor went blank and wouldn't come back alive).

Version-Release number of selected component (if applicable):
2.6.17-1.2174_FC5smp

How reproducible:
Unknown; this has happened only once since I started using the new kernel.

Steps to Reproduce:
1. Nothing to do; it happened when the system was idle.

Actual results:
kernel panic or lockup

Expected results:
stable system

Additional info:
There have been issues with the sky2 driver in the past, if I recall correctly (or with the driver that sky2 replaced).

Here's a cut from the logs... the error repeats itself:

Aug 11 00:17:49 secure kernel: NETDEV WATCHDOG: eth0: transmit timed out
Aug 11 00:17:49 secure kernel: sky2 eth0: tx timeout
Aug 11 00:17:49 secure kernel: sky2 hardware hung? flushing
Aug 11 00:17:49 secure kernel: BUG: warning at include/net/dst.h:154/dst_release() (Not tainted)
Aug 11 00:17:49 secure kernel: <c05bb4a0> __kfree_skb+0x47/0xf6 <f88e7b7c> sky2_tx_complete+0xac/0x105 [sky2]
Aug 11 00:17:49 secure kernel: <c0618177> _spin_lock_irqsave+0x9/0xd <f88e9806> sky2_poll+0x70f/0x895 [sky2]
Aug 11 00:17:49 secure kernel: <c05bb26a> skb_clone+0x3d/0x1cb <c05c11d9> net_rx_action+0xa7/0x189
Aug 11 00:17:49 secure kernel: <c05c0c00> net_tx_action+0xbb/0xd8 <c0429dd9> __do_softirq+0x58/0xc2
Aug 11 00:17:49 secure kernel: <c04064e9> do_softirq+0x46/0x51
Aug 11 00:17:49 secure kernel: =======================
Aug 11 00:17:49 secure kernel: <c0406498> do_IRQ+0x79/0x84 <c040487e> common_interrupt+0x1a/0x20
Aug 11 00:17:49 secure kernel: <c0615fb4> schedule+0xb00/0xb69 <c0618177> _spin_lock_irqsave+0x9/0xd
Aug 11 00:17:49 secure kernel: <c04364f1> prepare_to_wait+0x20/0xb4 <c0425d03> do_syslog+0xe4/0x30d
Aug 11 00:17:49 secure kernel: <c043638c> autoremove_wake_function+0x0/0x35 <c04a156a> kmsg_read+0x0/0x36
Aug 11 00:17:49 secure kernel: <c046bb12> vfs_read+0xa6/0x14e <c046bf76> sys_read+0x41/0x67
Aug 11 00:17:49 secure kernel: <c0403dd5> sysenter_past_esp+0x56/0x79

The last error logged before lockup:

Aug 11 00:18:23 secure kernel: BUG: warning at include/net/dst.h:154/dst_release() (Not tainted)
Aug 11 00:18:23 secure kernel: <c05bb4a0> __kfree_skb+0x47/0xf6 <c05e4505> tcp_recvmsg+0x555/0x74f
Aug 11 00:18:23 secure kernel: <c05b7cc9> sock_common_recvmsg+0x3e/0x54 <c05b58c0> do_sock_read+0xba/0xc2
Aug 11 00:18:23 secure kernel: <c05b5e82> sock_aio_read+0x5e/0x6a <c046b1db> do_sync_read+0xc3/0xfd
Aug 11 00:18:23 secure kernel: <c043638c> autoremove_wake_function+0x0/0x35 <c046bb26> vfs_read+0xba/0x14e
Aug 11 00:18:23 secure kernel: <c046bf76> sys_read+0x41/0x67 <c0403e3f> syscall_call+0x7/0xb
Aug 11 00:18:23 secure kernel: BUG: warning at include/net/dst.h:154/dst_release() (Not tainted)
Okay, it happened again, and this time I've got a vmcore to work from... What I don't get is the time difference between when the dump occurred and when the log entries occurred (almost a day's difference?). I also don't get how the dump doesn't mention the sky2 driver at all. Perhaps something more core is being corrupted which is taking the rest down, or the sky2 logs I'm seeing are the result of the kdump kernel kicking in and taking over after the oops? I've reconfigured kdump to reboot after dumping the core so I'll know for certain the next time this happens.

Also, I think this might be related to using the Azureus bittorrent client on a Windows computer that goes through Linux. It uses this upnp daemon: http://upnp.sourceforge.net/ Note that I haven't had trouble with this daemon previously when I used a different bittorrent client.

Crash output:

KERNEL: /usr/lib/debug/lib/modules/2.6.17-1.2174_FC5smp/vmlinux
DUMPFILE: vmcore
CPUS: 2
DATE: Thu Aug 17 00:47:00 2006
UPTIME: 5 days, 08:50:35
LOAD AVERAGE: 0.00, 0.00, 0.01
TASKS: 149
RELEASE: 2.6.17-1.2174_FC5smp
VERSION: #1 SMP Tue Aug 8 16:00:39 EDT 2006
MACHINE: i686 (3412 Mhz)
MEMORY: 2 GB
PANIC: "Oops: 0000 [#1]" (check log for details)
PID: 1989
COMMAND: "klogd"
TASK: f7e3ebf0 [THREAD_INFO: c6228000]
CPU: 0
STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 1989 TASK: f7e3ebf0 CPU: 0 COMMAND: "klogd"
#0 [c078edb4] crash_kexec at c0444961
#1 [c078edfc] die at c040547a
#2 [c078ee3c] do_page_fault at c061949c
#3 [c078ee9c] error_code (via page_fault) at c04049d5
EAX: 00000023 EBX: ffffffff ECX: c06d7410 EDX: c50efa78 EBP: 80000080
DS: 007b ESI: 00000023 ES: 007b EDI: f6da3000
CS: 0060 EIP: c04ea83b ERR: ffffffff EFLAGS: 00210086
#4 [c078eed0] _raw_spin_lock at c04ea83b
#5 [c078eef8] cache_flusharray at c0466e93
#6 [c078ef18] kfree at c0466f67
#7 [c078ef2c] kfree_skbmem at c05bb3fb
--- <soft IRQ> ---
#0 [c6228ef4] do_softirq at c04064a3
#1 [c6228f00] do_IRQ at c0406493
#2 [c6228f18] common_interrupt at c0404879
EAX: 00000000 EBX: c6228f72 ECX: c06d58b0 EDX: 000061e5 EBP: 00000055
DS: 007b ESI: 00000fff ES: 007b EDI: 00000000
CS: 0060 EIP: c0425d43 ERR: ffffffb5 EFLAGS: 00200246
#3 [c6228f4c] do_syslog at c0425d43
#4 [c6228f9c] sys_read at c046bf71
#5 [c6228fb8] sysenter_entry at c0403dce
EAX: 00000003 EBX: 00000000 ECX: 00bac6c0 EDX: 00000fff
DS: 007b ESI: 00bac6c0 ES: 007b EDI: 00baa9ff
SS: 007b ESP: bfb8b898 EBP: bfb8b8c8
CS: 0073 EIP: 003ca410 ERR: 00000003 EFLAGS: 00200246

crash> whatis _raw_spin_lock
void _raw_spin_lock(spinlock_t *);

crash> bt -f
...
#4 [c078eed0] _raw_spin_lock at c04ea83b
[RA: c0466e98 SP: c078eed4 FP: c078eef8 SIZE: 40]
c078eed4: 0000000b f6ac17c8 0000000a 00200002
c078eee4: 00200002 f59e1c14 ffffffff c549c838
c078eef4: f6da3000 c0466e98
#5 [c078eef8] cache_flusharray at c0466e93
[RA: c0466f6c SP: c078eefc FP: c078ef18 SIZE: 32]
c078eefc: d2180980 c50efa78 00000023 c50efa78
c078ef0c: c549c838 f6da3000 00200286 c0466f6c
...

crash> rd c0477e98 8
c0477e98: 60003d07 12750000 00001bb8 404de800 .=.`..u.......M@
c0477ea8: c085fffb 00db840f 868b0000 000000d4 ................

Log:

Aug 18 02:50:41 secure kernel: NETDEV WATCHDOG: eth0: transmit timed out
Aug 18 02:50:41 secure kernel: sky2 eth0: tx timeout
Aug 18 02:50:41 secure kernel: sky2 hardware hung? flushing
Aug 18 02:50:41 secure kernel: BUG: warning at include/net/dst.h:154/dst_release() (Not tainted)
Aug 18 02:50:41 secure kernel: <c05bb4a0> __kfree_skb+0x47/0xf6 <f88e7b7c> sky2_tx_complete+0xac/0x105 [sky2]
Aug 18 02:50:41 secure kernel: <c0420031> migration_call+0x257/0x3a5 <f88e9806> sky2_poll+0x70f/0x895 [sky2]
Aug 18 02:50:41 secure kernel: <c05bb26a> skb_clone+0x3d/0x1cb <c05c11d9> net_rx_action+0xa7/0x189
Aug 18 02:50:41 secure kernel: <c05c0c00> net_tx_action+0xbb/0xd8 <c0429dd9> __do_softirq+0x58/0xc2
Aug 18 02:50:41 secure kernel: <c04064e9> do_softirq+0x46/0x51
Aug 18 02:50:41 secure kernel: =======================
Aug 18 02:50:41 secure kernel: <c0406498> do_IRQ+0x79/0x84 <c040487e> common_interrupt+0x1a/0x20
Aug 18 02:50:41 secure kernel: <c0425d43> do_syslog+0x124/0x30d <c043638c> autoremove_wake_function+0x0/0x35
Aug 18 02:50:41 secure kernel: <c04a156a> kmsg_read+0x0/0x36 <c046bb12> vfs_read+0xa6/0x14e
Aug 18 02:50:41 secure kernel: <c046bf76> sys_read+0x41/0x67 <c0403dd5> sysenter_past_esp+0x56/0x79
Aug 18 02:50:41 secure kernel: BUG: warning at include/net/dst.h:154/dst_release() (Not tainted)
Aug 18 02:50:41 secure kernel: <c05bb4a0> __kfree_skb+0x47/0xf6 <f88e7b7c> sky2_tx_complete+0xac/0x105 [sky2]
Aug 18 02:50:41 secure kernel: <c0420031> migration_call+0x257/0x3a5 <f88e9806> sky2_poll+0x70f/0x895 [sky2]
Aug 18 02:50:41 secure kernel: <c05bb26a> skb_clone+0x3d/0x1cb <c05c11d9> net_rx_action+0xa7/0x189
Aug 18 02:50:41 secure kernel: <c05c0c00> net_tx_action+0xbb/0xd8 <c0429dd9> __do_softirq+0x58/0xc2
Aug 18 02:50:41 secure kernel: <c04064e9> do_softirq+0x46/0x51
Aug 18 02:50:41 secure kernel: =======================
Aug 18 02:50:41 secure kernel: <c0406498> do_IRQ+0x79/0x84 <c040487e> common_interrupt+0x1a/0x20
Aug 18 02:50:41 secure kernel: <c0425d43> do_syslog+0x124/0x30d <c043638c> autoremove_wake_function+0x0/0x35
Aug 18 02:50:41 secure kernel: <c04a156a> kmsg_read+0x0/0x36 <c046bb12> vfs_read+0xa6/0x14e
Aug 18 02:50:41 secure kernel: <c046bf76> sys_read+0x41/0x67 <c0403dd5> sysenter_past_esp+0x56/0x79
Aug 18 02:50:41 secure kernel: Attempt to release alive inet socket f62b6300
Aug 18 02:50:41 secure kernel: BUG: warning at include/net/dst.h:154/dst_release() (Not tainted)
Aug 18 02:50:41 secure kernel: <c05bb4a0> __kfree_skb+0x47/0xf6 <f88e7b7c> sky2_tx_complete+0xac/0x105 [sky2]
Aug 18 02:50:41 secure kernel: <c0420031> migration_call+0x257/0x3a5 <f88e9806> sky2_poll+0x70f/0x895 [sky2]
Aug 18 02:50:41 secure kernel: <c05bb26a> skb_clone+0x3d/0x1cb <c05c11d9> net_rx_action+0xa7/0x189
Aug 18 02:50:41 secure kernel: <c05c0c00> net_tx_action+0xbb/0xd8 <c0429dd9> __do_softirq+0x58/0xc2
Aug 18 02:50:41 secure kernel: <c04064e9> do_softirq+0x46/0x51
Aug 18 02:50:41 secure kernel: =======================
...
Aug 18 02:50:41 secure kernel: =======================
Aug 18 02:50:41 secure kernel: <c0406498> do_IRQ+0x79/0x84 <c040487e> common_interrupt+0x1a/0x20
Aug 18 02:50:41 secure kernel: <c0425d43> do_syslog+0x124/0x30d <c043638c> autoremove_wake_function+0x0/0x35
Aug 18 02:50:41 secure kernel: <c04a156a> kmsg_read+0x0/0x36 <c046bb12> vfs_read+0xa6/0x14e
Aug 18 02:50:41 secure kernel: <c046bf76> sys_read+0x41/0x67 <c0403dd5> sysenter_past_esp+0x56/0x79
Another crash:

KERNEL: /usr/lib/debug/lib/modules/2.6.17-1.2174_FC5smp/vmlinux
DUMPFILE: vmcore
CPUS: 2
DATE: Mon Aug 21 18:30:16 2006
UPTIME: 00:25:44
LOAD AVERAGE: 0.00, 0.00, 0.00
TASKS: 138
RELEASE: 2.6.17-1.2174_FC5smp
VERSION: #1 SMP Tue Aug 8 16:00:39 EDT 2006
MACHINE: i686 (3412 Mhz)
MEMORY: 2 GB
PANIC: "Oops: 0000 [#1]" (check log for details)
PID: 0
COMMAND: "swapper"
TASK: c06d1c80 (1 of 2) [THREAD_INFO: c0747000]
CPU: 0
STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 0 TASK: c06d1c80 CPU: 0 COMMAND: "swapper"
#0 [c078ec34] crash_kexec at c0444961
#1 [c078ec7c] die at c040547a
#2 [c078ecbc] do_page_fault at c061949c
#3 [c078ed1c] error_code (via page_fault) at c04049d5
EAX: 00200200 EBX: f6a5338c ECX: f6e619b8 EDX: 00000000 EBP: f6e61800
DS: 007b ESI: f6a53300 ES: 007b EDI: c6361000
CS: 0060 EIP: f8a650bb ERR: ffffffff EFLAGS: 00210296
#4 [c078ed50] ip_nat_cleanup_conntrack at f8a650bb
#5 [c078ed68] destroy_conntrack at f8af7239
#6 [c078ed80] packet_rcv_spkt at c06150b0
#7 [c078eda8] dev_hard_start_xmit at c05bf943
#8 [c078edcc] __qdisc_run at c05cdf88
#9 [c078edec] dev_queue_xmit at c05c13d7
--- <soft IRQ> ---
#0 [c0747f7c] do_softirq at c04064a3
#1 [c0747f88] apic_timer_interrupt at c040490a
#2 [c0747fc8] cpu_idle at c04030c5

crash> whatis ip_nat_cleanup_conntrack
whatis: gdb request failed: whatis ip_nat_cleanup_conntrack

crash> bt -f
#3 [c078ed1c] error_code (via page_fault) at c04049d5
EAX: 00200200 EBX: f6a5338c ECX: f6e619b8 EDX: 00000000 EBP: f6e61800
DS: 007b ESI: f6a53300 ES: 007b EDI: c6361000
CS: 0060 EIP: f8a650bb ERR: ffffffff EFLAGS: 00210296
[RA: f8a650bb SP: c078ed20 FP: c078ed50 SIZE: 52]
c078ed20: f6a5338c f6e619b8 00000000 f6a53300
c078ed30: c6361000 f6e61800 00200200 0000007b
c078ed40: 0000007b ffffffff f8a650bb 00000060
c078ed50: 00210296
#4 [c078ed50] ip_nat_cleanup_conntrack at f8a650bb
[RA: f8af723b SP: c078ed54 FP: c078ed68 SIZE: 24]
c078ed54: f7e62910 00000020 c078edc8 f5e93a80
c078ed64: f6a53300 f8af723b
#5 [c078ed68] destroy_conntrack at f8af7239
[RA: c06150b3 SP: c078ed6c FP: c078ed80 SIZE: 24]
c078ed6c: 00000000 c04c28e4 00000020 f5e93a80
c078ed7c: f77ca580 c06150b3

crash> dis f8a650bb
0xf8a650bb <ip_nat_cleanup_conntrack+39>: mov (%eax),%eax
Server locked up again. kdump failed to dump the kernel, so I was unable to see any kernel errors (the screen saver had blanked the screen). This was the only error before it locked up about 10 minutes later:

Sep 14 05:17:01 secure kernel: NETDEV WATCHDOG: eth0: transmit timed out
Sep 14 05:17:01 secure kernel: sky2 eth0: tx timeout
Sep 14 05:17:01 secure kernel: sky2 status report lost?
Sep 14 05:17:51 secure kernel: NETDEV WATCHDOG: eth0: transmit timed out
Sep 14 05:17:51 secure kernel: sky2 eth0: tx timeout
Sep 14 05:17:51 secure kernel: sky2 hardware hung? flushing

Note that this motherboard has two sky2 interfaces, so this problem might be related to that. It's an ASUS P5WD2E Premium. Here's lspci's output:

00:00.0 Host bridge: Intel Corporation Memory Controller Hub (rev c0)
00:01.0 PCI bridge: Intel Corporation PCI Express Graphics Port (rev c0)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.3 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 4 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) Serial ATA Storage Controllers cc=IDE (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
01:00.0 Mass storage controller: <pci_lookup_name: buffer too small> (rev 13)
01:01.0 VGA compatible controller: nVidia Corporation NV18 [GeForce4 MX 4000 AGP 8x] (rev c1)
01:03.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
02:00.0 SATA controller: Marvell Technology Group Ltd. Unknown device 6141 (rev 01)
03:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 20)
04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 20)
Same problem here. The system just locks up from time to time without any specific reason.
~~ 8< ~~
Sep 18 18:29:48 host01 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Sep 18 18:29:48 host01 kernel: sky2 eth0: tx timeout
Sep 18 18:29:48 host01 kernel: sky2 hardware hung? flushing
Sep 18 18:43:08 host01 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Sep 18 18:43:08 host01 kernel: sky2 eth0: tx timeout
Sep 18 18:43:08 host01 kernel: sky2 status report lost?
[keeps repeating till reboot]
~~ >8 ~~
Other symptoms include my eth0 going "dead"... eth1 continues to work (both are sky2).... ifdowning the interfaces, rmmodding the drivers, and then bringing them back up solves this problem for a few minutes before it happens again... the only REAL solution when this happens is to reboot. Sometimes it works, sometimes it doesn't.
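The temporary workaround described in the comment above (ifdown, rmmod, modprobe, ifup) can be sketched as a small script. This is only an illustration, assuming the module is named sky2 and the interfaces are eth0 and eth1 as in this report; the function names are made up for the example, and by default it only prints the commands rather than running them.

```python
import subprocess


def reload_commands(module="sky2", interfaces=("eth0", "eth1")):
    """Build the ifdown / rmmod / modprobe / ifup sequence as argv lists.

    The module and interface names are assumptions taken from this report.
    """
    cmds = [["ifdown", iface] for iface in interfaces]
    cmds.append(["rmmod", module])
    cmds.append(["modprobe", module])
    cmds += [["ifup", iface] for iface in interfaces]
    return cmds


def reload_driver(dry_run=True, **kwargs):
    """Print each command; execute it (root required) only if dry_run is False."""
    for argv in reload_commands(**kwargs):
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

Calling reload_driver() with the defaults lists the four steps in order without touching the system; as noted above, this only helps for a few minutes at best.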
on my box eth0 goes "dead" but i never get anything in dmesg or /var/log/messages talking about the network card, only samba complaining that such and such host no longer exists (because the box no longer communicates on the lan). running newest kernel release on 64bit fc5. i never ran into this until just the other day, lucky before now i suppose. if i issue an rmmod, modprobe, ifdown, ifup then it is back until it happens again. the machine isnt mission critical, so it is more of an annoyance for now. i have a multiport intel nic i am going to put in to use until i can go back to the onboard marvell
A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem. This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem please check you only have one version of device-mapper & lvm2 installed. See bug 207474 for further details. If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem. Thank you.
installing the update now, will get back when i have tested it. downside is i bought an intel pcie gige, but oh well. it will be nice having a pair of gige in this box.
This did not fix anything, at least for me. the rmmod && modprobe method did bring it back to functional. back to the intel.
since this is in the NEEDINFO state, is there any info you want from me? in my case nothing appears in dmesg to indicate failure. the only other thing of note is that i am hooking into a 10/100 switch (not a gig) and that it only happens to me when i am making moderate use of the available bandwidth. so far when maxing it out (9-11 MB/s) i have not hit the bug, but i might be getting lucky. nearly idle traffic levels also haven't triggered it thus far for me.
I'll point the upstream maintainer at the bug, maybe he has some ideas.
There is a much later version of sky2 (1.9) available in the Fedora-netdev kernels: http://people.redhat.com/linville/kernels/fedora-netdev/ Please give those a try and post the results here...thanks!
ok, i will try them out :) it will have to be after the weekend though, way too much happening. i will start testing either sunday night or monday night
Upstream came up with a patch that may solve this bug. I've added it to CVS for the next FC5 update, which should come out some time next week hopefully.
ok so tell me which you would prefer i do: upgrade to fc6, wait on the official fc5 update, or try out the netdev kernels?
FWIW, you can try the netdev kernels w/o upgrading to fc6. And when Dave pushes-out the updated official fc5 kernel you will still get that as well.
Oh i know, i was asking which direction you guys would prefer i try first really. i will try the netdev route first.
I was about to post about how after upgrading to 2.6.18-1.2200 the problem went away, but it hit me 5 times today... The only thing I was doing differently was tunneling a Windows Remote Desktop session through ssh. In any event, I've installed and am running the netdev version now. The only thing I'd like to request is the inclusion of kernel-debuginfo and kernel-kdump rpms for the netdev version of the kernel. It would help with reporting of any new bugs found. Damn... the problem just happened again with this version of the kernel. :(
ok the netdev version so far is working like a champ for me, i haven't managed to make it trigger the problem since i installed kernel-2.6.18-1.2200.2.10.fc5.netdev.13.1 around 48 hrs ago. i don't know if this means it is fixed, as the gentleman from zestysoft.com is still having issues, but thus far it is working great. i am going to keep running with this version until a new official release hits, then try that out
should be fixed in 2.6.18-1.2239.fc5 now in updates.
Not fixed... kernel panic some time last night. Saw something like sky2_dequeue on the kernel panic screen. Can't give you any more details since kdump no longer works: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=215268
Running strings on sky2.ko from the 2.6.18-1.2239.fc5 kernel shows version=1.5, not the v1.9 that Mr. Linville mentioned?
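For anyone who wants to repeat that check without the strings tool, here is a rough sketch that scans raw module bytes for a version= entry. It assumes the version is stored as a plain NUL-terminated "version=..." string, which is what the strings output above suggests; the function name and the sample bytes in the usage note are made up for illustration.

```python
import re

# Match "version=<printable chars>" but not "srcversion=..." or similar,
# by requiring that the character before "version" is not a letter or underscore.
VERSION_RE = re.compile(rb"(?<![A-Za-z_])version=([\x21-\x7e]+)")


def module_versions(blob: bytes):
    """Return every version value found in the given raw module bytes."""
    return [m.decode("ascii") for m in VERSION_RE.findall(blob)]
```

To use it on a real module, read the file in binary mode, e.g. module_versions(open(path_to_sky2_ko, "rb").read()), where path_to_sky2_ko is wherever the module lives on your system.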
I experience this bug on FC6 both with the stock kernel and with the latest kernel-2.6.18-1.2849.fc6, even though the changelog says it was "fixed". So don't expect to get rid of this by upgrading to FC6. And the sky2 version is 1.5 too. The only difference is that it doesn't hang for me, and rmmod+modprobe helps, but that's probably due to the difference in hardware. On fc5 and 2.6.17 the error messages were exactly the same, and rmmod+modprobe helped too.
NETDEV WATCHDOG: eth0: transmit timed out
sky2 eth0: tx timeout
sky2 eth0: transmit ring 163 .. 123 report=163 done=163
sky2 hardware hung? flushing
NETDEV WATCHDOG: eth0: transmit timed out
sky2 eth0: tx timeout
sky2 eth0: transmit ring 123 .. 82 report=163 done=163
sky2 status report lost?
NETDEV WATCHDOG: eth0: transmit timed out
sky2 eth0: tx timeout
sky2 eth0: transmit ring 162 .. 121 report=163 done=163
sky2 status report lost?
(continues)
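To compare the "transmit ring" lines across timeouts, a small parser like the one below may help. It is a sketch only: the thread does not say what the two ring numbers mean, so the keys ring_first and ring_second are neutral placeholder names chosen for this example, not the driver's own terms.

```python
import re

RING_RE = re.compile(
    r"sky2 (\S+): transmit ring (\d+) \.\. (\d+) report=(\d+) done=(\d+)"
)


def parse_ring(line):
    """Extract the numbers from a 'transmit ring' log line, or return None."""
    m = RING_RE.search(line)
    if m is None:
        return None
    first, second, report, done = (int(x) for x in m.groups()[1:])
    return {
        "iface": m.group(1),
        "ring_first": first,    # placeholder name, meaning not given in thread
        "ring_second": second,  # placeholder name, meaning not given in thread
        "report": report,
        "done": done,
    }
```

Run over the three lines quoted above, it makes it easy to see that report and done stay at 163 while the ring numbers change between timeouts.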
The version in FC6.netdev.1 is 1.10 -- this is later than what is in vanilla FC6. Does this help? http://people.redhat.com/linville/kernels/fedora-netdev/
Problem appears to be the same as reported in: http://bugzilla.kernel.org/show_bug.cgi?id=6839 There is a patch available for this problem in newer kernels. Note that there is still no official patch for RHEL 4.4 available.
Looks like that patch made it into 2.6.18.3, which is queued for the next update. Work in progress RPMs of the next update kernel are at http://people.redhat.com/davej/kernels/Fedora/
kernel-2.6.18-1.2849.2.2.fc6.netdev.2.x86_64 with sky2 version 1.10 behaves MUCH worse: not only do those problems happen more often, but my computer locks up now. It has already locked up three times since I started using that kernel 10 hours ago. I'm going back to the regular kernel-2.6.18-1.2849.fc6.x86_64. With 2.6.17, the problem happened once a day. With 2.6.18, the problem happened once a week. With the netdev kernel, it happens as often as with 2.6.17, and the computer seems to hang right after the first batch of messages. That didn't happen before.
Nov 22 08:52:19 washi kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 22 08:52:19 washi kernel: sky2 eth0: tx timeout
Nov 22 08:52:19 washi kernel: sky2 status report lost?
Nov 22 11:16:35 washi syslogd 1.4.1: restart.
> Work in progress RPMs of the next update kernel are at
Any chance of getting an x86_64 version or an srpm? PS: the netdev kernel just locked up once again. I think the problem happens much more often now than with the old 2.6.17 kernel.
oops. x86-64 kernels there now too.
I need -devel packages too.. (can't really let it stay on my system for a long time without fglrx driver).
done
Great, thanks. I installed it. If no problems occur in a week or so, I'll post here.
Something really strange just happened. I was working while having an ssh session open to my home computer (the one with the problems). At some point, the connection was lost. I couldn't reconnect. I thought "so this kernel doesn't solve the problem either". But, within an hour, I could ssh again! That never happened before. Before, my computer would stay on for over 20 hours while the problem persisted. So either this was a very rare case where the network driver "fixed itself", or the new kernel still has the same problem but somehow manages to recover from it. From messages:
Nov 28 15:12:59 washi kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 28 15:12:59 washi kernel: sky2 eth0: tx timeout
Nov 28 15:12:59 washi kernel: sky2 hardware hung? flushing
Nov 28 15:13:14 washi kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 28 15:13:14 washi kernel: sky2 eth0: tx timeout
Nov 28 15:13:14 washi kernel: sky2 status report lost?
Nov 28 15:13:24 washi kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 28 15:13:24 washi kernel: sky2 eth0: tx timeout
Nov 28 15:13:24 washi kernel: sky2 status report lost?
...
Nov 28 15:52:09 washi kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 28 15:52:09 washi kernel: sky2 eth0: tx timeout
Nov 28 15:52:09 washi kernel: sky2 status report lost?
Nov 28 15:52:14 washi kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 28 15:52:14 washi kernel: sky2 eth0: tx timeout
Nov 28 15:52:14 washi kernel: sky2 status report lost?
Nov 28 15:52:15 washi kernel: sky2 eth0: rx error, status 0x7ffc0001 length 168
Right now everything works great. Any ideas?
The "sky2 eth0: rx error, status 0x7ffc0001 length 168" message looks like something different to me. It may or may not be related to the hang, and probably only indicates a problem with a single frame. Did you ever try the fedora-netdev kernels?
> It may or may not be related to the hang, and probably only indicates a problem with a single frame.
Yes, but after that message eth0 started to work properly again. Maybe it was a problem with a single frame, but its handling code somehow fixed the source of another problem? Yes, I tried the netdev kernel, read comments 27 and 28. I won't try it again unless something has changed in that area..
> Yes, but after that message eth0 started to work properly again. Maybe it
> was a problem with single frame, but its handling code somehow fixed the
> source of another problem?
The 'handling code' is just a printk, followed by bumping an error count and exiting. It does requeue the buffer for later packet receives, but that isn't any different from when it receives a good frame. If you can reproduce the hang, could you try sending a frame to the hung adapter from another box on the same LAN segment, using ping or arping? Preferably this would be from a box that already has a valid entry in its ARP cache (so that it sends unicast frames). Does this "unhang" the adapter? Can you characterize the traffic the adapter is seeing when it hangs?
> If you can reproduce the hang, could you try sending a frame to the hung adapter from another box on the same LAN segment, using ping or arping?
I will be able to do this later, maybe in a week or so. My other computer is crying out for a replacement for its broken hdd and can barely work a few hours without hanging :-/ I can tell you that if I turn the switch off and then on again, the primary computer (with sky2 in the hanging state) locks up instantly. At least that used to be true when I used fc5 with 2.6.17; it has probably been fixed, but I don't really want to check it again ;). The only way of resetting the switch is ifdown eth0; (reset); ifup eth0. As for traffic: some ssh sessions, bittorrent, and sometimes cron'ed rsync jobs. Usually it happens when I'm away from the computer, at night or while I'm at work. However, I've seen this during regular browsing too. Torrent is probably to blame, since it's seeding most of the time. Other people have also reported that nfs traffic can result in this or a very similar problem (#188283, #197592).
About ping: the adsl modem is connected to the same switch, and it forwards incoming ssh connections to this box. A connection attempt doesn't unhang the adapter. Also, while I was writing the previous message, I lost the connection to my box again. I'm curious whether the problem will go away by itself this time or whether I'll have to wait till I come home.
It is working again.
Nov 28 16:56:59 washi kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 28 16:56:59 washi kernel: sky2 eth0: tx timeout
Nov 28 16:56:59 washi kernel: sky2 hardware hung? flushing
Nov 28 17:00:54 washi kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 28 17:00:54 washi kernel: sky2 eth0: tx timeout
Nov 28 17:00:54 washi kernel: sky2 status report lost?
...
Nov 28 17:54:13 washi kernel: NETDEV WATCHDOG: eth0: transmit timed out
Nov 28 17:54:13 washi kernel: sky2 eth0: tx timeout
Nov 28 17:54:13 washi kernel: sky2 status report lost?
Nov 28 17:54:16 washi kernel: sky2 eth0: rx error, status 0x7ffc0001 length 512
I'm staying with this kernel. Not perfect, but at least now I believe for sure that it can work around this problem.
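To put a number on how long each of these episodes lasted, the gap between the first watchdog timeout and the final rx error line can be computed from the syslog timestamps. This is a rough sketch: syslog lines carry no year, so the year is an assumed parameter (2006 here, matching the dates of the crash dumps earlier in this report), and the function names are invented for the example.

```python
from datetime import datetime


def syslog_time(line, year=2006):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line (year is assumed)."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def hang_seconds(lines, year=2006):
    """Seconds between the first NETDEV WATCHDOG line and the first rx error line."""
    first = next(l for l in lines if "NETDEV WATCHDOG" in l)
    last = next(l for l in lines if "rx error" in l)
    return (syslog_time(last, year) - syslog_time(first, year)).total_seconds()


episode = [
    "Nov 28 16:56:59 washi kernel: NETDEV WATCHDOG: eth0: transmit timed out",
    "Nov 28 17:54:13 washi kernel: sky2 status report lost?",
    "Nov 28 17:54:16 washi kernel: sky2 eth0: rx error, status 0x7ffc0001 length 512",
]
print(hang_seconds(episode))  # 3437.0 seconds, a little over 57 minutes
```

The same calculation on the earlier episode (15:12:59 to 15:52:15) gives about 39 minutes, so both outages in this comment thread ended with an rx error line after well under an hour.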
A few interesting things to note. The status on that rx error indicates that we've had an rx fifo overrun. Also, the reported length of the frame doesn't match the length in the status field (0x7ffc). Not sure why that would be. I'm trying to get hold of the hardware docs now. Until then, setting to needinfo until the requested tests are done.
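The mismatch mentioned above can be made explicit by splitting the status word. The sketch below assumes, based on the comment, that the length sits in the upper 16 bits of the 32-bit status (that is where 0x7ffc comes from) and treats the lower 16 bits as flag bits; the thread does not spell out which bit means what, so the flags are returned undecoded. The function name is made up for this example.

```python
def decode_rx_status(status):
    """Split a 32-bit rx status word into (length_field, flag_bits).

    Assumption: length in the upper 16 bits, flags in the lower 16 bits.
    """
    length_field = (status >> 16) & 0xFFFF
    flag_bits = status & 0xFFFF
    return length_field, flag_bits


length_field, flag_bits = decode_rx_status(0x7FFC0001)
print(hex(length_field), length_field, hex(flag_bits))  # 0x7ffc 32764 0x1
```

A length field of 32764 bytes is clearly at odds with the 168 and 512 byte lengths printed in the log lines, which is the discrepancy the comment points out.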
Created attachment 142518: call trace from /var/log/messages. Today I saw a different pattern. First, usually there is one "hardware hung? flushing" message and then numerous "status report lost?" messages; this time, however, they were alternating a bit. Second, after the "rx error, status 0x7ffc0001" message the problem didn't fix itself. Third, my computer finally locked up. Fortunately, the call trace was written to messages and somehow survived the reboot. This is the complete log.
For the record, I just got the following on our x86_64 webserver running 2.6.18-1.2257.fc5:
NETDEV WATCHDOG: eth0: transmit timed out
sky2 eth0: tx timeout
sky2 eth0: transmit ring 459 .. 418 report=459 done=459
sky2 hardware hung? flushing
We've been bitten by this bug a few times already; this was the fifth time in a few months. Build #2257's announcement message claimed to "fix lockup with sky2 driver" but unfortunately the fix doesn't seem to have helped :(
Still waiting on hw docs. Larry can you help me out? :)
The same problem of a hanging sky2 driver is observed for three distributions:
1. Suse => http://bugzilla.kernel.org/show_bug.cgi?id=6839
2. RHEL => https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=216799
3. Fedora => this bugthread
The patch for Suse works well for Suse and RHEL. Not for Fedora? Suspect... Was the patch really integrated in the kernel that was used for the testing in comment #27?
The patch from bug 216799 corresponds to upstream commit 470ea7eba4aaa517533f9b02ac9a104e77264548 and that is _not_ in the netdev kernel mentioned in comment 27. However, that patch seems to be in 2.6.19-1.2869.fc6 (the current latest fc6 update kernel). Could you try that kernel and post the results here? Thanks!
If it helps, I am getting a similar problem on a later FC6 kernel, 2.6.19-1.2895.fc6. I notice no bug report is open on FC6. Not sure if one needs to be opened or not?
When I tried 2.6.19-1.2859.fc6, I experienced even worse problem. Right now I returned to 2.6.18 because it ignores one of my hard drives, however I tried to use 2.6.19 for a little while. It has sky2 driver version 1.10. I experienced crash with it - much like with netdev kernel which also had 1.10 driver (see comment 27). Exact backtrace: Jan 28 18:43:22 washi kernel: NETDEV WATCHDOG: eth0: transmit timed out Jan 28 18:43:22 washi kernel: sky2 eth0: tx timeout Jan 28 18:43:22 washi kernel: sky2 status report lost? Jan 28 18:43:31 washi kernel: BUG: soft lockup detected on CPU#0! Jan 28 18:43:31 washi kernel: Jan 28 18:43:31 washi kernel: Call Trace: Jan 28 18:43:31 washi kernel: [<ffffffff8026999a>] show_trace+0x34/0x47 Jan 28 18:43:31 washi kernel: [<ffffffff802699bf>] dump_stack+0x12/0x17 Jan 28 18:43:31 washi kernel: [<ffffffff802b6c92>] softlockup_tick+0xdb/0xf6 Jan 28 18:43:31 washi kernel: [<ffffffff80293bdb>] update_process_times+0x42/0x68 Jan 28 18:43:31 washi kernel: [<ffffffff802749d9>] smp_local_timer_interrupt+0x34/0x55 Jan 28 18:43:31 washi kernel: [<ffffffff8027508d>] smp_apic_timer_interrupt+0x51/0x69 Jan 28 18:43:31 washi kernel: [<ffffffff8025ccf6>] apic_timer_interrupt+0x66/0x70 Jan 28 18:43:31 washi kernel: [<ffffffff80207808>] _raw_spin_lock+0x78/0xe5 Jan 28 18:43:31 washi kernel: [<ffffffff881a21ad>] :sky2:sky2_tx_timeout+0xf7/0x194 Jan 28 18:43:31 washi kernel: [<ffffffff8023e9c0>] dev_watchdog+0x7b/0xc2 Jan 28 18:43:31 washi kernel: [<ffffffff8029360e>] run_timer_softirq+0x13b/0x1b5 Jan 28 18:43:31 washi kernel: [<ffffffff80211ee5>] __do_softirq+0x55/0xc4 Jan 28 18:43:31 washi kernel: [<ffffffff8025d24c>] call_softirq+0x1c/0x30 Jan 28 18:43:31 washi kernel: [<ffffffff8026aa2f>] do_softirq+0x2c/0x97 Jan 28 18:43:31 washi kernel: [<ffffffff80275092>] smp_apic_timer_interrupt+0x56/0x69 Jan 28 18:43:31 washi kernel: [<ffffffff8025ccf6>] apic_timer_interrupt+0x66/0x70 Jan 28 18:43:31 washi kernel: [<ffffffff802690f2>] 
mwait_idle_with_hints+0x44/0x45 Jan 28 18:43:31 washi kernel: [<ffffffff8025554d>] mwait_idle+0xc/0x20 Jan 28 18:43:31 washi kernel: [<ffffffff802476bd>] cpu_idle+0x8b/0xae Jan 28 18:43:31 washi kernel: [<ffffffff806b0733>] start_kernel+0x22a/0x22f Jan 28 18:43:31 washi kernel: [<ffffffff806b015a>] _sinittext+0x15a/0x15e Jan 28 18:43:31 washi kernel: Jan 28 18:43:49 washi kernel: BUG: spinlock lockup on CPU#0, swapper/0, ffff81007d1eb280 (Tainted: P M ) Jan 28 18:43:49 washi kernel: Jan 28 18:43:49 washi kernel: Call Trace: Jan 28 18:43:49 washi kernel: [<ffffffff8026999a>] show_trace+0x34/0x47 Jan 28 18:43:49 washi kernel: [<ffffffff802699bf>] dump_stack+0x12/0x17 Jan 28 18:43:49 washi kernel: [<ffffffff80207854>] _raw_spin_lock+0xc4/0xe5 Jan 28 18:43:49 washi kernel: [<ffffffff881a21ad>] :sky2:sky2_tx_timeout+0xf7/0x194 Jan 28 18:43:49 washi kernel: [<ffffffff8023e9c0>] dev_watchdog+0x7b/0xc2 Jan 28 18:43:49 washi kernel: [<ffffffff8029360e>] run_timer_softirq+0x13b/0x1b5 Jan 28 18:43:49 washi kernel: [<ffffffff80211ee5>] __do_softirq+0x55/0xc4 Jan 28 18:43:49 washi kernel: [<ffffffff8025d24c>] call_softirq+0x1c/0x30 Jan 28 18:43:49 washi kernel: [<ffffffff8026aa2f>] do_softirq+0x2c/0x97 Jan 28 18:43:49 washi kernel: [<ffffffff80275092>] smp_apic_timer_interrupt+0x56/0x69 Jan 28 18:43:49 washi kernel: [<ffffffff8025ccf6>] apic_timer_interrupt+0x66/0x70 Jan 28 18:43:49 washi kernel: [<ffffffff802690f2>] mwait_idle_with_hints+0x44/0x45 Jan 28 18:43:49 washi kernel: [<ffffffff8025554d>] mwait_idle+0xc/0x20 Jan 28 18:43:49 washi kernel: [<ffffffff802476bd>] cpu_idle+0x8b/0xae Jan 28 18:43:49 washi kernel: [<ffffffff806b0733>] start_kernel+0x22a/0x22f Jan 28 18:43:49 washi kernel: [<ffffffff806b015a>] _sinittext+0x15a/0x15e Jan 28 18:43:49 washi kernel: Now for the interesting part. I googled for this bug again and found very interesting recent discussion on LKML. 
The link is http://www.gossamer-threads.com/lists/linux/kernel/725399 Two workarounds are suggested there: the "idle=poll" and "pci=nomsi" kernel options. I'll try each of them to see if either helps. Unfortunately, I'll have to test with 2.6.18, since I can't switch to 2.6.19 for now (due to bug 224680), but maybe that doesn't matter.
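When testing boot options like these, it helps to confirm that the running kernel actually received them before drawing conclusions. The following is only a minimal sketch: `has_boot_option` is a hypothetical helper name invented for this illustration, and the only system interface it relies on is `/proc/cmdline`, which holds the command line the kernel was booted with.

```shell
#!/bin/sh
# Hypothetical helper (not part of any distribution tool): succeeds if the
# option in $1 appears as a whole word in the command line string in $2.
has_boot_option() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Read the command line of the running kernel; fall back to an empty
# string if /proc/cmdline is not readable.
cmdline=$(cat /proc/cmdline 2>/dev/null || true)

# Report on the two options discussed in this bug.
for opt in idle=poll pci=nomsi; do
    if has_boot_option "$opt" "$cmdline"; then
        echo "$opt: present on the kernel command line"
    else
        echo "$opt: not set"
    fi
done
```

The helper matches whole words only, so an option such as "pci=nomsi" is not confused with a longer option that merely starts with the same text.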
Just to clarify: you mention problems with kernel 2.6.19-1.2859.fc6, while linville mentioned a potential fix in kernel 2.6.19-1.2869.fc6. Was that a typo or not? I'll try the same options here on my test setup. Please post results when you have them. Thanks!
Oops, sorry, I meant kernel-2.6.19-1.2895.fc6.x86_64 (current in fc6 updates).
$ rpm -q kernel
kernel-2.6.18-1.2856.fc6.x86_64
kernel-2.6.19-1.2895.fc6.x86_64
I just wrote 59 instead of 95. And I never tried 2869, neither with 2.6.18 nor with 2.6.19.
OK, let me know how your idle=poll and pci=nomsi tests go.
idle=poll doesn't work. The kernel recognized it and wrote "using poll in idle threads" on boot, but I got the same call trace after a while. This time the machine was still working after those messages, but "rmmod sky2" resulted in an instant hang. I'm trying pci=nomsi now.
I guess pci=nomsi helped; at least I haven't experienced this bug for a while. However, I don't think I will be of any help in the future, since I just moved to the 2.6.20-1.rt2.0111 kernel announced in https://www.redhat.com/archives/fedora-devel-list/2007-February/msg00211.html and I'm pretty happy with it; this RT stuff works great. I'm not sure whether I'll experience any sky2 problems with it, but it wouldn't be wise to write about them in this topic. Anyway, if I ever see this again, I'll just go back to using an external PCI network card as in the good ol' times. Problems with these Marvell chips/drivers are horrible.
I just noticed that we register a second handler on the same IRQ number in sky2, but the MSI handler doesn't flag the interrupt with IRQF_SHARED. Can you try adding the shared flag to the MSI request, just to make sure that we're not seeing some sort of odd behavior there?
Fedora apologizes that these issues have not been resolved yet. We're sorry it's taken so long for your bug to be properly triaged and acted on. We appreciate the time you took to report this issue and want to make sure no important bugs slip through the cracks. If you're currently running a version of Fedora Core between 1 and 6, please note that Fedora no longer maintains these releases. We strongly encourage you to upgrade to a current Fedora release. In order to refocus our efforts as a project we are flagging all of the open bugs for releases which are no longer maintained and closing them. http://fedoraproject.org/wiki/LifeCycle/EOL If this bug is still open against Fedora Core 1 through 6, thirty days from now, it will be closed 'WONTFIX'. If you can reproduce this bug in the latest Fedora version, please change to the respective version. If you are unable to do this, please add a comment to this bug requesting the change. Thanks for your help, and we apologize again that we haven't handled these issues to this point. The process we are following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp We will be following the process here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this doesn't happen again. And if you'd like to join the bug triage team to help make things better, check out http://fedoraproject.org/wiki/BugZappers
This bug was fixed either in late FC6 or in F7 and is not present in F8, so it can be closed properly.