Description of problem:
During /kernel/filesystems/nfs/connectathon testing of the RHEL4 Z-stream kernel 2.6.9-89.0.27.EL the following issue was seen regarding sky2:

<-SNIP->
07/12/10 23:20:06 JobID:166216 Test:/kernel/filesystems/nfs/connectathon Response:1
07/12/10 23:20:07 testID:3488295 start:
lockd: can't encode arguments: 5
lockd: can't encode arguments: 5
lockd: can't encode arguments: 5
<-SNIP->
lockd: couldn't shutdown host module!
lockd: couldn't shutdown host module!
[-- MARK -- Mon Jul 12 23:50:00 2010]
sky2 0000:03:00.0: pci hw error (0x2010)
[-- MARK -- Mon Jul 12 23:55:00 2010]
sky2 eth1: tx timeout
net/core/netpoll.c:296: spin_trylock(net/core/dev.c:000001022d5b9898) already locked by net/sched/sch_generic.c/189
net/core/netpoll.c:296: spin_trylock(net/core/dev.c:000001022d5b9898) already locked by net/sched/sch_generic.c/189
net/core/netpoll.c:296: spin_trylock(net/core/dev.c:000001022d5b9898) already locked by net/sched/sch_generic.c/189
net/core/netpoll.c:296: spin_trylock(net/core/dev.c:000001022d5b9898) already locked by net/sched/sch_generic.c/189
net/core/netpoll.c:296: spin_trylock(net/core/dev.c:000001022d5b9898) already locked by net/sched/sch_generic.c/189
sky2 eth1: tx timeout
sky2 eth1: tx timeout
sky2 eth1: tx timeout
sky2 eth1: tx timeout
<-SNIP->
[-- MARK -- Tue Jul 13 00:05:00 2010]
sky2 eth1: tx timeout
sky2 eth1: tx timeout
sky2 eth1: tx timeout
nfs: server sol10-nfs not responding, still trying
Badness in local_bh_enable at kernel/softirq.c:141

Call Trace:
<ffffffff80140d76>{local_bh_enable+89}
<ffffffff8030ebde>{netpoll_poll_dev+520}
<ffffffff8030e9b4>{netpoll_send_skb+761}
<ffffffffa0288169>{:netconsole:write_msg+361}
<ffffffff8013adeb>{__call_console_drivers+68}
<ffffffff8013b132>{release_console_sem+495}
<ffffffff8013b5b2>{vprintk+873}
<ffffffff8013b6d5>{printk+141}
<ffffffffa023ce1b>{:sunrpc:rpc_sleep_on+180}
<ffffffffa023b16e>{:sunrpc:xprt_transmit+1130}
<ffffffffa0238da4>{:sunrpc:xprt_adjust_timeout+468}
<ffffffffa0237f2a>{:sunrpc:call_timeout+196}
<ffffffffa023e1ec>{:sunrpc:__rpc_execute+187}
<ffffffff80365eb0>{thread_return+0}
<ffffffffa023e6d2>{:sunrpc:__rpc_schedule+170}
<ffffffffa023f5b7>{:sunrpc:rpciod+824}
<ffffffff80136f00>{autoremove_wake_function+0}
<ffffffff80136f00>{autoremove_wake_function+0}
<ffffffff8011589c>{syscall_trace_leave+53}
<ffffffff8011167f>{child_rip+8}
<ffffffffa0305036>{:nfs:nfs4_compare_super+0}
<ffffffffa023f27f>{:sunrpc:rpciod+0}
<ffffffff80111677>{child_rip+0}

Badness in local_bh_enable at kernel/softirq.c:141
<-SNIP->

Version-Release number of selected component (if applicable):
------ x86_64 ------
x86_64 - kernel 2.6.9-89.0.27.EL
/kernel/filesystems/nfs/connectathon Recipe-426404
Test - /kernel/filesystems/nfs/connectathon/EXTERNALWATCHDOG - FAIL
http://rhts.redhat.com/testlogs/2010/07/166216/426404/3488295/console.txt
[System Name: phenom-01.lab.bos.redhat.com]
Test watchdogged during /kernel/filesystems/nfs/connectathon.

<--SNIP-->
07/12/10 23:20:06 JobID:166216 Test:/kernel/filesystems/nfs/connectathon Response:1
<--SNIP-->
sky2 eth1: tx timeout
sky2 eth1: tx timeout
[-- MARK -- Tue Jul 13 00:05:00 2010]
<--SNIP-->
[-- MARK -- Mon Jul 12 23:50:00 2010]
sky2 0000:03:00.0: pci hw error (0x2010)
[-- MARK -- Mon Jul 12 23:55:00 2010]
sky2 eth1: tx timeout
net/core/netpoll.c:296: spin_trylock(net/core/dev.c:000001022d5b9898) already locked by net/sched/sch_generic.c/189
net/core/netpoll.c:296: spin_trylock(net/core/dev.c:000001022d5b9898) already locked by net/sched/sch_generic.c/189
net/core/netpoll.c:296: spin_trylock(net/core/dev.c:000001022d5b9898) already locked by net/sched/sch_generic.c/189
net/core/netpoll.c:296: spin_trylock(net/core/dev.c:000001022d5b9898) already locked by net/sched/sch_generic.c/189
net/core/netpoll.c:296: spin_trylock(net/core/dev.c:000001022d5b9898) already locked by net/sched/sch_generic.c/189
sky2 eth1: tx timeout
sky2 eth1: tx timeout
sky2 eth1: tx timeout
sky2 eth1: tx timeout

How reproducible:
Testing on a host with the sky2 driver: phenom-01.lab.bos.redhat.com

Run the KernelTier1 testing with the 2.6.9-89.0.27.EL kernel or the kernel listed below in the "Additional info" section,
or
run /kernel/networking/ndnc and /kernel/filesystems/nfs/connectathon/ with the 2.6.9-89.0.27.EL kernel or the kernel listed below in the "Additional info" section.

Actual results:
Testing fails as stated above.

Expected results:
This test should pass successfully.

Additional info:
You can run the testing with either of the following kernels:
Zstream kernel 2.6.9-89.0.27.EL
Ystream kernel 2.6.9-89.29.EL

Both of these kernels have the same single patch added. It seems this patch triggers the issue, as testing with both previous kernel versions does _not_ produce this issue. I tested as follows:

Job#166321 Recipe-426763
Test Zstream 2.6.9-89.0.27.EL phenom-01
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=166321
http://rhts.redhat.com/testlogs/2010/07/166321/426763/3490854/console.txt
This kernel includes the patch. Issue was reproduced.

Job#166465 Recipe-427102
Test Zstream kernel-2.6.9-89.0.26.EL phenom-01
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=166465
http://rhts.redhat.com/testlogs/2010/07/166465/427102/3492860/console.txt
This kernel does _not_ have the patch. There were no issues. The testing results look good.

Job#166463 Recipe-427100
Test kernel-2.6.9-89.29.EL phenom-01
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=166463
This kernel includes the patch. Issue was reproduced.

Job#166462 Recipe-427099
Test kernel-2.6.9-89.28.EL phenom-01
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=166462
http://rhts.redhat.com/testlogs/2010/07/166462/427099/3492848/console.txt
This kernel does _not_ have the patch. There were no issues. The testing results look good.
More additional info:
Searched bz for sky2 issues:

KNOWN_ISSUE seen /kernel/filesystems/nfs/connectathon/EXTERNALWATCHDOG - FAIL
Bug 216801 - sky2 transmitter lockup (Marvell network interface freeze
https://bugzilla.redhat.com/show_bug.cgi?id=216801

This referenced http://bugzilla.kernel.org/show_bug.cgi?id=6839, which contained a suggested workaround. Unfortunately, the suggested workaround failed and the issue was reproduced again.

Job#166325 Recipe-426767
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=166325
http://rhts.redhat.com/testlogs/2010/07/166325/426767/3490869/console.txt

Thank you.
-pbunyan
All,

Just to be clear... This issue occurs only with NetDump enabled. /kernel/networking/ndnc enables NetDump. Testing with /kernel/filesystems/nfs/connectathon/ alone does not reproduce the issue.

Best,
-pbunyan
All,

One of the test jobs for this issue PANIC'd the system: Running the /kernel/networking/ndnc and /kernel/filesystems/nfs/connectathon/ with the 2.6.9-89.29.EL kernel the phenom-01.lab.bos.redhat.com PANIC'd.

Job#166463 Recipe-427100
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=166463
x86_64 - kernel 2.6.9-89.29.EL
/kernel/filesystems/nfs/connectathon Recipe-427100
Test - /kernel/filesystems/nfs/connectathon/ - PANIC
http://rhts.redhat.com/testlogs/2010/07/166463/427100/3492852/console.txt

<-SNIP->
07/14/10 10:34:57 JobID:166463 Test:/kernel/filesystems/nfs/connectathon Response:1
07/14/10 10:34:58 testID:3492852 start:
[-- MARK -- Wed Jul 14 10:35:00 2010]
lockd: can't encode arguments: 5
<-SNIP->
[-- MARK -- Wed Jul 14 11:00:00 2010]
sky2 eth1: tx timeout
net/core/netpoll.c:300: spin_trylock(net/core/dev.c:000001022d311898) already locked by net/sched/sch_generic.c/189
net/core/netpoll.c:300: spin_trylock(net/core/dev.c:000001022d311898) already locked by net/sched/sch_generic.c/189
<-SNIP->
[-- MARK -- Wed Jul 14 12:30:00 2010]
sky2 eth1: tx timeout
sky2 eth1: tx timeout
[-- MARK -- Wed Jul 14 12:35:00 2010]
Badness in dst_release at include/net/dst.h:158

Call Trace:<IRQ>
<ffffffff802fba08>{__kfree_skb+98}
<ffffffffa00f330e>{:sky2:sky2_tx_complete+374}
<ffffffffa00f4f00>{:sky2:sky2_poll+2522}
<ffffffff80302753>{net_rx_action+305}
<ffffffff80140c90>{__do_softirq+76}
<ffffffff80140d17>{do_softirq+49}
<ffffffff80113f1f>{do_IRQ+664}
<ffffffff8011110f>{ret_from_intr+0}
<EOI>
<ffffffff802b6770>{ide_outsw+0}
<ffffffff8010e807>{default_idle+0}
<ffffffff8010e827>{default_idle+32}
<ffffffff8010e897>{cpu_idle+26}
<ffffffff80575704>{start_kernel+637}
<ffffffff805751b7>{_sinittext+439}

Badness in dst_release at include/net/dst.h:158

Call Trace:<IRQ>
<ffffffff802fba08>{__kfree_skb+98}
<ffffffffa00f330e>{:sky2:sky2_tx_complete+374}
<ffffffffa00f4f00>{:sky2:sky2_poll+2522}
<ffffffff80302753>{net_rx_action+305}
<ffffffff80140c90>{__do_softirq+76}
<ffffffff80140d17>{do_softirq+49}
<ffffffff80113f1f>{do_IRQ+664}
<ffffffff8011110f>{ret_from_intr+0}
<EOI>
<ffffffff802b6770>{ide_outsw+0}
<ffffffff8010e807>{default_idle+0}
<ffffffff8010e827>{default_idle+32}
<ffffffff8010e897>{cpu_idle+26}
<ffffffff80575704>{start_kernel+637}
<ffffffff805751b7>{_sinittext+439}

Unable to handle kernel paging request at 0000000102e93a7c RIP:

Best,
-pbunyan
This last backtrace shows an issue with dst_release. Specifically, it indicates that the skbs being freed had an underflowed refcount on the dst entry attached to them, which suggests that someone is calling dst_release more often than dst_hold in some path. My guess would be that sky2_tx_timeout is racing with sky2_tx_done in interrupt context on another cpu and we're double freeing skbs on the tx_cons list in sky2. That's just a symptom of the larger problem though. While that can be fixed, it still doesn't explain why sky2 is getting tx timeouts in the first place.
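To illustrate the refcount reasoning above, here is a minimal Python sketch, not kernel code. The DstEntry class and free_skb helper are hypothetical stand-ins, and the warning check is only modeled loosely on what the "Badness in dst_release" message implies: a release on an entry whose count is already below 1. It assumes the double-free theory from this comment, which has not been confirmed.

```python
class DstEntry:
    """Toy stand-in for a refcounted dst entry (hypothetical model)."""

    def __init__(self):
        self.refcnt = 0
        self.warnings = []

    def hold(self):
        # Taking a reference, as dst_hold would.
        self.refcnt += 1

    def release(self):
        # Warn if the count is already below 1 before decrementing,
        # which is the condition the "Badness" message points at.
        if self.refcnt < 1:
            self.warnings.append("Badness in dst_release")
        self.refcnt -= 1


def free_skb(dst):
    """Freeing a buffer drops the one reference it held on its dst entry."""
    dst.release()


dst = DstEntry()
dst.hold()      # skb is attached to the dst entry: one reference
free_skb(dst)   # tx completion path frees the skb
free_skb(dst)   # a racing timeout path frees the same skb a second time

print(dst.refcnt, dst.warnings)
```

The second free is what drives the count negative and triggers the warning; a single hold paired with a single release would stay silent.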
A scratch build that included the patch here: https://bugzilla.kernel.org/show_bug.cgi?id=6839#c55 has successfully passed cthon testing.
Looks like that was a backport of this:

commit 6771290102c4703dae56bc3e121deb63530e206c
Author: Stephen Hemminger <shemminger>
Date: Mon Dec 4 15:53:45 2006 -0800

[PATCH] sky2: beter ram buffer partitioning
Correction: it looks like this is the patch we are looking for:

commit 470ea7eba4aaa517533f9b02ac9a104e77264548
Author: Stephen Hemminger <shemminger>
Date: Fri Oct 20 17:06:11 2006 -0700

[PATCH] sky2: 88E803X transmit lockup

and the patch in comment #7 came after it.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
I built a test kernel with the sky2 patch built in (comment 9). Paul still noticed the issue with the sky2 driver. The following is a small part of the error message. What attracts my attention here is the first message, "pci hw error". Does it mean that there was some kind of hardware error which froze transmission on the card, and the rest of the errors just followed?

***************************************************************************
sky2 0000:03:00.0: pci hw error (0x2010)
sky2 eth1: tx timeout
net/core/netpoll.c:300: spin_trylock(net/core/dev.c:000001022d01d898) already locked by net/sched/sch_generic.c/189
net/core/netpoll.c:300: spin_trylock(net/core/dev.c:000001022d01d898) already locked by net/sched/sch_generic.c/189
net/core/netpoll.c:300: spin_trylock(net/core/dev.c:000001022d01d898) already locked by net/sched/sch_generic.c/189
net/core/netpoll.c:300: spin_trylock(net/core/dev.c:000001022d01d898) already locked by net/sched/sch_generic.c/189
net/core/netpoll.c:300: spin_trylock(net/core/dev.c:000001022d01d898) already locked by net/sched/sch_generic.c/189
sky2 eth1: tx timeout
[-- MARK -- Wed Jul 21 12:10:00 2010]
sky2 eth1: tx timeout
sky2 eth1: tx timeout
nfs: server rhel5-nfs not responding, still trying
sky2 eth1: tx timeout
nfs: server rhel5-nfs not responding, still trying
sky2 eth1: tx timeout
******************************************************************************
[Testing update from PaulB]

All,

Just to follow up and add to the puzzle... JeffB and I tested a bit more.

Testing with the latest Z-Stream and Y-Stream kernels on the phenom-01 system with /kernel/networking/ndnc and /kernel/filesystems/nfs/connectathon fails intermittently.

However, testing with the latest Z-Stream and Y-Stream kernels on the amd-mako-02 system with /kernel/networking/ndnc and /kernel/filesystems/nfs/connectathon, we ran into a further glitch. The issue with amd-mako-02 during testing was as follows:

[a] The system would successfully install the RHEL4u8 base kernel.
[b] However, after rebooting, the system would not get an ip. Thus the system could not check in to continue the test. The system would hang and the test would eventually watchdog.
[c] We hooked up a crash cart to the system to follow the issue. With the system in this state, following the install of RHEL4u8 and the reboot, we logged in to the system and ran:

$ service network stop
$ modprobe -r sky2
$ modprobe sky2
$ service network start
$ ifconfig -a

The system was then able to get an ip. This is similar to comment#18 in https://bugzilla.redhat.com/show_bug.cgi?id=216799

Note: simply stopping and restarting the network service was not successful. Success was only achieved after removing and reinserting the sky2 module.

Further: the amd-mako-02 system is connected to the network via a Cisco Catalyst 2960G 10/100/1000 switch. We modified this and connected amd-mako-02 to a NetGear FS608 10/100 switch, and the NetGear to the Cisco. I then reran the same testing. The system would successfully install the RHEL4u8 base kernel and _now_, following reboot, successfully get an ip. However, I did experience the same intermittent failure as before when running the latest Z-Stream and Y-Stream kernels with /kernel/networking/ndnc and /kernel/filesystems/nfs/connectathon.
Committed in 89.30.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/
I think I might know what's going on here. This caught my eye in the summary:

sky2 0000:03:00.0: pci hw error (0x2010)

rhts isn't being responsive, so I can't confirm that it precedes the other errors as well, but if it does, I think this doesn't have anything to do with netdump, but more specifically with netconsole. The hardware error path in sky2 is full of printks that are rate limited. So if we get a hardware error at the wrong time, we'll send a whole bunch of printks down the tx path of the sky2 driver while we're in the middle of trying to recover from a major error (for the sake of recording it, the above is a PCI MASTER ABORT).

I'll attach a patch shortly.
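As a cross-check of the "PCI MASTER ABORT" reading, the value 0x2010 can be decoded bit by bit. This sketch assumes the number printed after "pci hw error" is the 16-bit PCI status register; the thread does not state that explicitly. The bit names follow the standard PCI status layout as listed in the Linux pci_regs.h header, and decode_pci_status is a hypothetical helper name.

```python
# Standard PCI status register bits (names as in Linux's pci_regs.h).
PCI_STATUS_BITS = {
    0x0008: "INTERRUPT",
    0x0010: "CAP_LIST",
    0x0020: "66MHZ",
    0x0080: "FAST_BACK",
    0x0100: "PARITY",
    0x0800: "SIG_TARGET_ABORT",
    0x1000: "REC_TARGET_ABORT",
    0x2000: "REC_MASTER_ABORT",
    0x4000: "SIG_SYSTEM_ERROR",
    0x8000: "DETECTED_PARITY",
}


def decode_pci_status(value):
    """Return the names of all status bits set in value, lowest bit first."""
    return [name for bit, name in sorted(PCI_STATUS_BITS.items()) if value & bit]


# Assumption: 0x2010 from the log is the PCI status word.
flags = decode_pci_status(0x2010)
print(flags)
```

Under that assumption, the only error bit set is the received master abort; the other bit just advertises that the device has a capability list.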
Created attachment 472682 [details]
patch to prevent printks on netconsole in sky2

This isn't necessarily a permanent fix, but I think it should solve the problem. It will disable the net device's queue, causing us to drop frames in dev_queue_xmit, thereby preventing us from accessing the tx path of the sky2 driver while trying to recover from a hw error. Please give this patch a test and see if it remedies the problem. Thanks!
Well, it seems as though we have a marginal improvement. We no longer see the lockdep warnings that arise from trying to use netconsole while we're recovering from a tx hw error. Now we need to figure out why the sky2 driver can't reset itself properly. I'll dig into that now.
Committed in 97.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
When running with the following Ethernet device:
Ethernet controller: Marvell Technology Group Ltd.
88E8052 PCI-E ASF Gigabit Ethernet Controller (rev 20)
Vendor ID:11ab
Device ID:4360

Netconsole must be disabled to avoid this issue.
Please note:
When the "sky2 eth1: tx timeout" issue occurs, the system is unresponsive. I am unable to log in to the system tty terminals or serial console. Therefore, I am unable to try rmmod/insmod or ifdown/ifup. I must reboot the system to recover from this issue.

Updating "Technical Note"...

-pbunyan
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -4,4 +4,5 @@
 Vendor ID:11ab
 Device ID:4360

-Netconsole must be disabled to avoid this issue.
+Disabling netconsole does _not_ resolve this issue.
+System must be rebooted to recover from this issue.
(In reply to comment #29)
> Please note:
> When the "sky2 eth1: tx timeout" issue occurs, the system unresponsive. I am
> unable to login to the system tty terminals or serial console. Therefore, I am
> unable to try to rmmod/insmod or ifdown/ifup. I must reboot the system to
> recover from this issue.
>
> Updating "Technical Note"...
>
> -pbunyan

What seems odd about the tx timeout issues is that we don't see any "NETDEV WATCHDOG" messages in the log before we see the tx timeout messages from the driver. Those messages come out as KERN_INFO (6) messages, so it may be that they simply aren't reaching the console. If /proc/sys/kernel/printk is "7 4 1 7" then we should see them.

I also wonder if we should try to backport this fix:

commit 819067916d785cac0369b8d6e187b4a83fd17785
Author: Stephen Hemminger <shemminger>
Date: Thu Feb 15 16:40:33 2007 -0800

sky2: transmit timeout

The transmit timeout code could hang, and it would not clear out problems if the hardware was stuck. Change the code to effectively do a device down/up similar to the suspend/resume code.
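On the /proc/sys/kernel/printk point in this comment, the rule can be sketched as follows: the first field of that file is the console log level, and a message is printed on the console only if its level is numerically lower than that field. The reaches_console helper is a hypothetical name, and the "6 4 1 7" setting is a made-up contrast case, not something taken from the test systems.

```python
KERN_INFO = 6  # level given in the comment for the NETDEV WATCHDOG messages


def reaches_console(msg_level, printk_setting):
    """Check a message level against a /proc/sys/kernel/printk style string.

    Only the first field (the current console log level) matters here;
    messages are shown when their level is strictly below it.
    """
    console_loglevel = int(printk_setting.split()[0])
    return msg_level < console_loglevel


shown = reaches_console(KERN_INFO, "7 4 1 7")   # setting quoted in the comment
hidden = reaches_console(KERN_INFO, "6 4 1 7")  # hypothetical lower setting
print(shown, hidden)
```

So with "7 4 1 7" a KERN_INFO message should appear, which is why its absence from the console log is notable.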
(In reply to comment #33)
> I also wonder if we should try to backport this fix:
>
> commit 819067916d785cac0369b8d6e187b4a83fd17785
> Author: Stephen Hemminger <shemminger>
> Date: Thu Feb 15 16:40:33 2007 -0800
>
> sky2: transmit timeout
>
> The transmit timeout code could hang, and it would not clear out
> problems if the hardware was stuck. Change the code to effectively do
> a device down/up similar to the suspend/resume code.

Neil Horman backported that patch and we tried it. It did not help.
(In reply to comment #34)
> (In reply to comment #33)
> > I also wonder if we should try to backport this fix:
> >
> > commit 819067916d785cac0369b8d6e187b4a83fd17785
> > Author: Stephen Hemminger <shemminger>
> > Date: Thu Feb 15 16:40:33 2007 -0800
> >
> > sky2: transmit timeout
> >
> > The transmit timeout code could hang, and it would not clear out
> > problems if the hardware was stuck. Change the code to effectively do
> > a device down/up similar to the suspend/resume code.
>
> Neil horman backported that patch and we tried it. It did not help.

I knew he tried something, but no longer had the emails with that information. Thanks for the update, Vivek!
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,8 +1,18 @@
-When running with the following Ethernet device:
+[Silas: the following issue needs a CCFR description: what was the nature of the issue? Could a Subject Matter Expert provide sentences about each of the below to be edited and incorporated into the RHEL 4.9 Tech Notes?
+
+Cause:
+
+Consequence:
+
+Fix:
+
+Result:
+
+Original: When running with the following Ethernet device:
 Ethernet controller: Marvell Technology Group Ltd.
 88E8052 PCI-E ASF Gigabit Ethernet Controller (rev 20)
 Vendor ID:11ab
 Device ID:4360

 Disabling netconsole does _not_ resolve this issue.
-System must be rebooted to recover from this issue.
+System must be rebooted to recover from this issue].
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,18 +1,12 @@
-[Silas: the following issue needs a CCFR description: what was the nature of the issue? Could a Subject Matter Expert provide sentences about each of the below to be edited and incorporated into the RHEL 4.9 Tech Notes?
+When using the sky2 driver for Marvell Ethernet adapters with the following device:

-Cause:
+ Ethernet controller: Marvell Technology Group Ltd.
+ 88E8052 PCI-E ASF Gigabit Ethernet Controller (rev 20)
+ Vendor ID:11ab
+ Device ID:4360

-Consequence:
+packet transmission may time out with the following message being written to netconsole:

-Fix:
+ sky2 eth1: tx timeout

-Result:
+This is a known issue: after receiving these messages, the system must be rebooted in order to fix the packet transmission issues. (BZ#614559)
-
-Original: When running with the following Ethernet device:
- Ethernet controller: Marvell Technology Group Ltd.
- 88E8052 PCI-E ASF Gigabit Ethernet Controller (rev 20)
- Vendor ID:11ab
- Device ID:4360
-
-Disabling netconsole does _not_ resolve this issue.
-System must be rebooted to recover from this issue].
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html