Bug 228733
Description
Greg Bailey
2007-02-14 18:22:54 UTC
Created attachment 148077 [details]
Console output of panic on pv-hab-number120
Created attachment 148078 [details]
Console output of panic on pv-hab-number100
This just happened again on a server called "it7000cp". I will attach the console output as a separate attachment. /var/log/messages had the following right before the panic:

Mar 1 09:06:25 it7000cp kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 1 09:06:25 it7000cp kernel: sky2 eth0: tx timeout
Mar 1 09:06:25 it7000cp kernel: sky2 hardware hung? flushing
Mar 1 09:11:05 it7000cp kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 1 09:11:05 it7000cp kernel: sky2 eth0: tx timeout
Mar 1 09:11:05 it7000cp kernel: sky2 status report lost?
Mar 1 09:11:45 it7000cp kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 1 09:11:45 it7000cp kernel: sky2 eth0: tx timeout
Mar 1 09:11:45 it7000cp kernel: sky2 hardware hung? flushing
Mar 1 09:16:01 it7000cp su(pam_unix)[26196]: session opened for user ssadmin by (uid=0)
Mar 1 09:16:02 it7000cp su(pam_unix)[26196]: session closed for user ssadmin
Mar 1 09:20:25 it7000cp kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 1 09:20:25 it7000cp kernel: sky2 eth0: tx timeout
Mar 1 09:20:25 it7000cp kernel: sky2 status report lost?
Mar 1 09:21:20 it7000cp kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 1 09:21:20 it7000cp kernel: sky2 eth0: tx timeout
Mar 1 09:21:20 it7000cp kernel: sky2 hardware hung? flushing
Mar 1 09:31:01 it7000cp su(pam_unix)[28173]: session opened for user ssadmin by (uid=0)
Mar 1 09:31:02 it7000cp su(pam_unix)[28173]: session closed for user ssadmin
Mar 1 09:46:01 it7000cp su(pam_unix)[29903]: session opened for user ssadmin by (uid=0)
Mar 1 09:46:02 it7000cp su(pam_unix)[29903]: session closed for user ssadmin
Mar 1 09:50:50 it7000cp kernel: sky2 eth0: rx error, status 0x7ffc0001 length 96
Mar 1 09:50:50 it7000cp kernel: KERNEL: assertion (!atomic_read(&sk->sk_wmem_alloc)) failed at net/unix/af_unix.c (333)
Mar 1 09:50:50 it7000cp kernel: KERNEL: assertion (sk_unhashed(sk)) failed at net/unix/af_unix.c (334)
Mar 1 09:50:50 it7000cp kernel: KERNEL: assertion (!sk->sk_socket) failed at net/unix/af_unix.c (335)
Mar 1 09:50:50 it7000cp kernel: Attempt to release alive unix socket: d0fd0100
Mar 1 09:50:51 it7000cp kernel: KERNEL: assertion (!atomic_read(&sk->sk_wmem_alloc)) failed at net/unix/af_unix.c (333)
Mar 1 09:50:51 it7000cp kernel: KERNEL: assertion (sk_unhashed(sk)) failed at net/unix/af_unix.c (334)
Mar 1 09:50:51 it7000cp kernel: KERNEL: assertion (!sk->sk_socket) failed at net/unix/af_unix.c (335)
Mar 1 09:50:51 it7000cp kernel: Attempt to release alive unix socket: d0fd0100
Mar 1 09:50:51 it7000cp kernel: KERNEL: assertion (!atomic_read(&sk->sk_wmem_alloc)) failed at net/unix/af_unix.c (333)
Mar 1 09:50:51 it7000cp kernel: KERNEL: assertion (sk_unhashed(sk)) failed at net/unix/af_unix.c (334)
Mar 1 09:50:51 it7000cp kernel: KERNEL: assertion (!sk->sk_socket) failed at net/unix/af_unix.c (335)
Mar 1 09:50:51 it7000cp kernel: Attempt to release alive unix socket: d0fd0100
Mar 1 09:50:51 it7000cp kernel: KERNEL: assertion (!atomic_read(&sk->sk_wmem_alloc)) failed at net/unix/af_unix.c (333)
Mar 1 09:50:51 it7000cp kernel: KERNEL: assertion (sk_unhashed(sk)) failed at net/unix/af_unix.c (334)
Mar 1 09:50:51 it7000cp kernel: KERNEL: assertion (!sk->sk_socket) failed at net/unix/af_unix.c (335)
Mar 1 09:50:51 it7000cp kernel: Attempt to release alive unix socket: d0fd0100
Mar 1 09:50:52 it7000cp kernel: KERNEL: assertion (!atomic_read(&sk->sk_wmem_alloc)) failed at net/unix/af_unix.c (333)
Mar 1 09:50:52 it7000cp kernel: KERNEL: assertion (sk_unhashed(sk)) failed at net/unix/af_unix.c (334)
Mar 1 09:50:52 it7000cp kernel: KERNEL: assertion (!sk->sk_socket) failed at net/unix/af_unix.c (335)
Mar 1 09:50:52 it7000cp kernel: Attempt to release alive unix socket: d0fd0100
Mar 1 09:50:52 it7000cp kernel: KERNEL: assertion (!atomic_read(&sk->sk_wmem_alloc)) failed at net/unix/af_unix.c (333)
Mar 1 09:50:52 it7000cp kernel: KERNEL: assertion (sk_unhashed(sk)) failed at net/unix/af_unix.c (334)
Mar 1 09:50:52 it7000cp kernel: KERNEL: assertion (!sk->sk_socket) failed at net/unix/af_unix.c (335)
Mar 1 09:50:52 it7000cp kernel: Attempt to release alive unix socket: d0fd0100
Mar 1 09:50:52 it7000cp kernel: KERNEL: assertion (!atomic_read(&sk->sk_wmem_alloc)) failed at net/unix/af_unix.c (333)
Mar 1 09:50:52 it7000cp kernel: KERNEL: assertion (sk_unhashed(sk)) failed at net/unix/af_unix.c (334)
Mar 1 09:50:52 it7000cp kernel: KERNEL: assertion (!sk->sk_socket) failed at net/unix/af_unix.c (335)
Mar 1 09:50:52 it7000cp kernel: Attempt to release alive unix socket: d0fd0100
Mar 1 09:50:54 it7000cp kernel: Warning: kfree_skb passed an skb still on a list (from c027bf26).
Mar 1 09:50:54 it7000cp kernel: ------------[ cut here ]------------
Mar 1 09:50:54 it7000cp kernel: kernel BUG at net/core/skbuff.c:293!
Mar 1 09:50:54 it7000cp kernel: invalid operand: 0000 [#1]

Created attachment 149028 [details]
Console output of panic on it7000cp
Can you please try booting your systems with pci=nomsi on the kernel command line and see if the problem still occurs? A bunch of sky2 fixes just went into 2.6.19.6.

I tried adding this parameter on one of the servers, and things appear OK with the Marvell NIC. The problem is, I have >100 servers with this kernel build, and so far these panics have only happened on 1 server at a time, and never on the same server twice (so far). They're also (fortunately or unfortunately) pretty rare, and I don't have a way to trigger the panic except to wait for the phone call... :-(

Ok, so we're not going to get valid results out of pci=nomsi in your environment, then. Can you roll out a test kernel to your systems? I think this commit:
819067916d785cac0369b8d6e187b4a83fd17785
from Linus' tree likely fixes the problem you're seeing. I'll build a RHEL4 kernel for you to test with.

fwiw, I'm not sure it's worth taking the spot patch for this. I'm currently working on just taking that patch, but if you have the time, it would probably be worth your while to compile the latest RHEL4 kernel, and just substitute the latest upstream sky2 driver for testing purposes. Is it possible for you to give that a try, or do you need to wait for me to get this backported?

Which latest RHEL4 kernel should I use? The 2.6.9-42.0.10.EL one for U4 or the 2.6.9-48.EL (or jbaron's 2.6.9-49.EL) for U5? Which sky2 version should I attempt to merge--the 1.13 in Linus' tree? Any special considerations shoehorning the latest 2.6.20 sky2 driver into 2.6.9? I'll have a go at it based on your version information...

I would suggest just using the latest RHEL4 kernel available from RHN. There's not much point in using anything else, since that's what any fix will be applied to anyway. The only reason not to use the latest RHEL4 kernel is if your environment has a need to not move forward in kernel versions. If so, follow your internal guidelines and use whatever is mandated (the upstream sky2 driver should largely be a drop-in replacement in any RHEL4 kernel). As for which sky2 driver, just grab the latest from Linus' git tree, since the requisite changes referenced above are all in there. Or just let me know that it's more hassle than it's worth for you, and I'll let you know when I have a kernel built here :)

OK, I grabbed sky2.c and sky2.h from 2.6.21-rc2 as that seemed to have the latest versions. I added the "#include sky2_compat.h" line to sky2.c (I assume that's required). I get build failures when attempting to compile sky2.ko. I've attached the build failure output.

Created attachment 149410 [details]
sky2 1.13 build failure output
After looking at it a little more closely, and discussing it with some others around here, the consensus is that a backport of the specific patch that I think is required for this fix would be preferable to a complete update, given the fact that RHEL4 is getting on in years. I've uploaded a test kernel for you to:
http://people.redhat.com/nhorman
Please give it a try and let me know the results. Thanks!

I have upgraded to kernel-smp-2.6.9-49.EL.bz228733 on a few servers and have not encountered any regressions so far... Can you attach the patch file you used for this? Is it the same as the above referenced commit from Linus' tree or did you have to modify it? (Or, I'll pick it from the kernel.src.rpm if that's available somewhere...)

Neil, can you comment on whether you think bugzilla #216799 might be related to this issue? I've also encountered the same symptoms as that one and need to know if I should pull in another patch, or if you think the patch you supplied also addresses #216799. They both seem to talk about transmit timeout stuff. Thanks!

Created attachment 149663 [details]
test patch in kernel kernel-smp-2.6.9-49.EL.bz228733.i686.rpm

Here's the patch. It's the same patch I referenced previously, plus another patch that hit the same code, which I included to make the application easier. It basically schedules interface restarts to occur in process context to avoid the tx hang that was occurring. Please let me know when you are confident that this has fixed your problem. Thanks!

No panics thus far... Neil, do you have the .src.rpm for this kernel? I no longer have access to the 2.6.9-49.EL source from which it was derived as it looks like jbaron is up to 2.6.9-51.EL now... are the old ones archived away anywhere?

I don't have the srpm anymore (expunged by the build system here), but it's all tagged in CVS so I can rebuild it quickly and post it for you. I assume that since you are asking for the srpm, you are reasonably confident that this has fixed your problem?
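For readers unfamiliar with the approach the test patch is described as taking (restarting the interface from process context instead of directly from the timeout path), here is a minimal userspace sketch of the idea. It is not the code from attachment 149663: `mock_port`, `mock_tx_timeout` and `mock_run_deferred_work` are hypothetical names invented for illustration, and the "restart" is only a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-port driver state; not the real sky2 structures. */
struct mock_port {
    bool restart_pending;  /* set from the simulated timeout path */
    int  restart_count;    /* number of full restarts that actually ran */
};

/* Simulated timeout handler: it may not do slow work, so it only records the request. */
static void mock_tx_timeout(struct mock_port *port)
{
    assert(port != NULL);
    port->restart_pending = true;
}

/* Simulated process-context worker: the only place the slow restart runs. */
static void mock_run_deferred_work(struct mock_port *port)
{
    assert(port != NULL);
    if (!port->restart_pending)
        return;
    port->restart_pending = false;
    port->restart_count++;  /* stands in for "disabling interface" / "enabling interface" */
}
```

Calling `mock_tx_timeout` several times before the worker runs still produces a single restart, which is the property that makes deferring the work attractive: repeated timeouts collapse into one down/up cycle performed where sleeping is allowed.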
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Ok, srpm uploaded to http://people.redhat.com/nhorman
The patch attached above is available in the srpm as linux-kernel-test.patch. Let me know if you are comfortable saying this patch fixes your problems, and I'll propose it for 4.6 inclusion.

I received the following over and over in dmesg output and lost network connectivity:

NETDEV WATCHDOG: eth0: transmit timed out
sky2 eth0: tx timeout
sky2 eth0: transmit ring 10 .. 482 report=11 done=11
sky2 status report lost?

This sounds like the same problem (or similar, at least) as bugzilla #216799 (see comment 15). Does your bz228733 kernel include that patch also? I was able to restart the interface with an ifdown, rmmod, modprobe, ifup sequence. I will attach the output of "dmesg" as an attachment.

Created attachment 150978 [details]
Output of dmesg with tx timeout messages

This dmesg output is from a 2.6.9-49.EL.bz228733 kernel.

Yes, the errors you describe seem to relate to bz 216799, and no, my kernel does not contain that patch. If you'd like to incorporate it into the provided src rpm, feel free. It should apply fairly cleanly. Please let me know when you are comfortable with this fix. Thanks!

I have built a kernel "2.6.9-51.1.INTL" which includes the patch referenced above in comment #16, and the patch from bugzilla #216799. This kernel has been installed on a few dozen servers. Although I have yet to see a kernel panic or hung network interface (which statistically would have probably happened by now), I'm still seeing various combinations of the following in dmesg output:

icmp v4 hw csum failure
udp v4 hw csum failure.
hw tcp v4 csum failed

The interface appears to recover from these timeouts (I see the messages saying that it is disabled and then enabled):

Apr 20 05:49:08 stun1 kernel: printk: 2 messages suppressed.
Apr 20 05:49:38 stun1 kernel: printk: 1 messages suppressed.
Apr 20 05:50:28 stun1 last message repeated 3 times
Apr 20 05:51:29 stun1 last message repeated 3 times
Apr 20 08:22:52 stun1 login(pam_unix)[3772]: bad username [ ]
Apr 20 08:22:52 stun1 login[3772]: FAILED LOGIN 1 FROM (null) FOR , Authentication failure
Apr 20 08:22:52 stun1 login(pam_unix)[3772]: bad username []
Apr 20 08:22:52 stun1 login[3772]: FAILED LOGIN 2 FROM (null) FOR , Authentication failure
Apr 20 08:22:52 stun1 login(pam_unix)[3772]: bad username []
Apr 20 08:22:52 stun1 login[3772]: FAILED LOGIN 3 FROM (null) FOR , Authentication failure
Apr 20 08:22:53 stun1 login(pam_unix)[3772]: bad username []
Apr 20 08:22:53 stun1 login[3772]: FAILED LOGIN SESSION FROM (null) FOR , Authentication failure
Apr 20 08:28:44 stun1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Apr 20 08:28:44 stun1 kernel: sky2 eth0: tx timeout
Apr 20 08:28:44 stun1 kernel: sky2 eth0: disabling interface
Apr 20 08:28:44 stun1 kernel: sky2 eth0: enabling interface
Apr 20 08:28:46 stun1 kernel: sky2 eth0: Link is up at 100 Mbps, full duplex, flow control none
Apr 20 08:29:00 stun1 kernel: printk: 12 messages suppressed.

The failed login messages seem suspicious in that there's missing information; I'm not sure if the timeout happened during an attempted login from the Internet. Do I need:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=874183072de73a36a958585e3186639fd2634701

It certainly seems like you do. Are you comfortable building that into the current kernel you are working with? If not let me know, and I can build it for you.
Also, just FYI, one of the other engineers here mentioned to me that a wholesale upgrade of sky2 in 4.6 might be prudent as he has a number of sky2 fixes outstanding, so 4.6 might see all these patches and more integrated as well. Let me know if the above git commit fixes the remainder of your problems. Thanks!

I can probably pull that patch in and make a new kernel; the problem is the turnaround time in getting the new kernel tested before it goes on our production hardware. Regarding the wholesale upgrade of sky2, that would be my preference as well (see comments #12 and #13 above), but I'd need help getting the latest sky2 driver (just updated yesterday!) to build in the 2.6.9 tree. How much would the latest sky2.h and sky2.c from Linus' tree need to be modified to build in the RHEL4 kernel?

FWIW, wholesale update of sky2 is considered for 5.1, not 4.6. YMMV, but the backport of current sky2 sources to rhel4 might be a bit painful (but certainly not impossible).

My bad, I thought they were slated for 4.6. Well, that being what it is, the wholesale backport was a bit hairy IIRC for 4.6; the sky2.h driver relied on some kernel infrastructure that didn't exist in 2.6.9 which we would also need to backport. Anywho, let's move forward with the cherry-picked patch for now. Let me know how the testing goes, while I take another crack at the wholesale backport. If I get it together, I'll post it here for you to try.

The combination of the #216799 and #228733 patches is going well. No kernel panics to report so far. Please propose these for Update 6. I've yet to include the patch referenced in Comment #25, as it is of lower priority. I'd still be interested in testing a current version of the sky2 driver backported for 2.6.9, however, if such a thing materialized...

Unfortunately, I've not attempted a wholesale backport yet, sorry. It's been shoved down my todo list. As the git patch you referenced seems to be working for you, however, I'll coordinate with our other networking engineers and make sure this gets in for you. Thanks!

If the patches for bug 216799 and the ones in comment #25 and comment #16 resolve your issues then we can look at adding those rather than doing a full backport of the latest driver for the next update.

Ok, I managed to backport the latest sky2 driver to RHEL4 (minus some infrastructure that doesn't fit in RHEL4 at this point). The src rpm is on my people page:
http://people.redhat.com/nhorman
Please build it and try it out. Thanks!

ping, any update here?

We've been running on a kernel with the patch from Comment #16 and bug 216799 for quite a while without any panics. It would be great to see at least those fixes in Update 6. I have not yet built a kernel with the latest backported sky2 driver from Comment #33 because I'm not enough of a kernel hacker to put the right debug statements in it to figure out why it's crashing (see comments on bug 216799), but I'm a very willing guinea pig for supplied kernels! :-)

I'd rather not take just that fix if we get the whole driver in. Besides, it's rather late at this point for anything to get into 4.6. Let's try to get this together for 4.7. As for the debugging, can you just put some printks in sky2_status_intr to print out the value of le->link? If you don't feel comfortable with that, let me know and I'll get to it as soon as I can. Also, it might be worth a shot to disable MSI interrupts from the kernel command line, if your hardware supports them.

What's the kernel command line option to disable MSI interrupts? I couldn't find anything relevant in /usr/share/doc/kernel-doc-2.6.9/Documentation/kernel-parameters.txt

should be: nomsi

I've disabled msi with the "pci=nomsi" kernel command line argument and still get a panic when the interface is brought up. I've added printk statements but they don't appear to show up on the console. Can you attach a modified sky2.c file I can use with proper debugging statements in it? Also, doesn't the "EIP is at netif_receive_skb" in the stack trace (I'll attach the most recent crash) mean that the invalid access occurred in net/core/dev.c?

Created attachment 160831 [details]
Kernel Panic from 2.6.9-55.3.EL.bz228733.2smp

Yeah, I hadn't looked at the EIP much, since what I think is happening is that sky2 is passing a NULL pointer to netif_receive_skb because le->link is out of bounds. I'm attaching a patch that should help you tell. If you have multiple NICs on board, it may spew a number of messages and slow your system down a bit, so be warned.

Created attachment 160839 [details]
patch to debug sky2 oops
With the patch from Comment #42 I get a few NETIF_RECEIVE_SKB debugs while rc scripts are executed, and then the panic when I "ifup eth0":

... system rc scripts ...
NETIF_RECEIVE_SKB: SKB = f68b7280
NETIF_RECEIVE_SKB: DEV = c03462c0
NETIF_RECEIVE_SKB: SKB = f6dfc180
NETIF_RECEIVE_SKB: DEV = c03462c0
NETIF_RECEIVE_SKB: DEV = c03462c0
NETIF_RECEIVE_SKB: SKB = f704b480
NETIF_RECEIVE_SKB: DEV = c03462c0
NETIF_RECEIVE_SKB: SKB = f6dfc180
NETIF_RECEIVE_SKB: DEV = c03462c0
... system rc scripts ...
[root@geb-test0 ~]# ifup eth0
ip_tables: (C) 2000-2002 Netfilter core team
SKY2 DEBUG: le->link = 0
SKY2 DEBUG: le->link = 0
NETIF_RECEIVE_SKB: SKB = f5e16c80
NETIF_RECEIVE_SKB: DEV = 00000000
Unable to handle kernel NULL pointer dereference at virtual address 0000017c
 printing eip:
c02822aa
*pde = 3729e001
Oops: 0000 [#1]
SMP
Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc dm_mirror dm_mod button battery ac ftdi_sio usbserial uhci_hcd ehci_hcd hw_random e1000 sky2 ext3 jbd ata_piix libata sd_mod scsi_mod
CPU: 0
EIP: 0060:[<c02822aa>] Not tainted VLI
EFLAGS: 00010282 (2.6.9-55.3.EL.bz228733.3smp)
EIP is at netif_receive_skb+0x3d/0x310
eax: f5e16c80 ebx: f7297c00 ecx: c03d1f58 edx: 00000000
esi: f5e16c80 edi: 0000003c ebp: f7297e40 esp: c03d1f64
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, threadinfo=c03d1000 task=c0324a80)
Stack: f5e16c80 00000001 f5e16c80 f7297c00 f5e16c80 0000003c f7297e40 f88abb3d
       00000064 003c0300 f6e98008 00000002 00000000 00000040 f7f0c680 00000000
       00000000 40000000 f7297c00 00000040 f7f0c680 f88ac2b3 c03d1fd4 00000000
Call Trace:
 [<f88abb3d>] sky2_status_intr+0x227/0x46a [sky2]
 [<f88ac2b3>] sky2_poll+0x5c/0xbf [sky2]
 [<c0282728>] net_rx_action+0xae/0x160
 [<c0126a14>] __do_softirq+0x4c/0xb1
 [<c010819f>] do_softirq+0x4f/0x56
 =======================
 [<c0107ab4>] do_IRQ+0x1a2/0x1ae
 [<c02d6dcc>] common_interrupt+0x18/0x20
 [<c01040e8>] mwait_idle+0x33/0x42
 [<c01040a0>] cpu_idle+0x26/0x3b
 [<c0397786>] start_kernel+0x199/0x19d
Code: 00 50 68 d6 a1 30 c0 e8 67 06 ea ff 8b 44 24 10 ff 70 18 68 f6 a1 30 c0 e8 56 06 ea ff 8b 44 24 18 89 44 24 10 8b 50 18 83 c4 10 <83> ba 7c 01 00 00 00 74 6f 31 c0 f6 42 58 20 74 14 0f b7 82 ae
 <0>Kernel panic - not syncing: Fatal exception in interrupt

Created attachment 160849 [details]
patch to fix null dev pointer
Well, that definitely shows the problem, although I'm not sure how it's
occurring. le->link is valid, but the sky2_hw struct's dev array seems to have
been nulled out (or never initialized), even though it certainly seems it
should have been set. Anywho, I think the attached patch should fix it. Please
replace the debug patch you were just using with this one and see if the
problem clears up.
Thanks!
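To make the failure mode discussed above concrete (a valid le->link index, but a NULL entry in the per-port device array), here is a small userspace sketch of a defensive lookup. It is an illustration only, not the contents of attachment 160849: `mock_hw`, `mock_netdev`, `mock_port_dev` and `MOCK_MAX_PORTS` are hypothetical names, and the two-port limit is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_PORTS 2  /* assumption: at most two ports per chip */

struct mock_netdev { const char *name; };

/* Hypothetical analogue of a per-chip struct with one device pointer per port. */
struct mock_hw {
    struct mock_netdev *dev[MOCK_MAX_PORTS];
};

/* Return the device for a status-entry link index, or NULL when the index is
 * out of range or the slot was never populated. The caller is expected to skip
 * the entry instead of handing a packet with no device to the stack. */
static struct mock_netdev *mock_port_dev(const struct mock_hw *hw, unsigned int link)
{
    assert(hw != NULL);
    if (link >= MOCK_MAX_PORTS)
        return NULL;
    return hw->dev[link];
}
```

A guard like this separates the two hypotheses raised in the thread: an out-of-range link index and an in-range index whose slot is NULL both yield NULL here, so a debug print of the index next to the result tells them apart.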
Same output from "2.6.9-55.3.EL.bz228733.4smp" (patch from comment #44):

ip_tables: (C) 2000-2002 Netfilter core team
SKY2 DEBUG: le->link = 0
SKY2 DEBUG: le->link = 0
NETIF_RECEIVE_SKB: SKB = f7db2c80
NETIF_RECEIVE_SKB: DEV = 00000000
Unable to handle kernel NULL pointer dereference at virtual address 0000017c
 printing eip:
c02822aa
*pde = 372b7001
Oops: 0000 [#1]
SMP
Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc dm_mirror dm_mod button battery ac ftdi_sio usbserial uhci_hcd ehci_hcd hw_random e1000 sky2 ext3 jbd ata_piix libata sd_mod scsi_mod
CPU: 0
EIP: 0060:[<c02822aa>] Not tainted VLI
EFLAGS: 00010286 (2.6.9-55.3.EL.bz228733.4smp)
EIP is at netif_receive_skb+0x3d/0x310

Sorry, but I can't buy the same error on this build; I don't think you managed to compile the patch in. If dev was still null in this case, it should have oopsed back in the sky2_poll routine. Please add the following line:

printk(KERN_CRIT "SKY2_DEBUG: dev = %p\n",dev0);

at the top of the sky2_poll routine, right after the variable declaration, rebuild and try again. Thanks!

OK, I added the following line to sky2_poll:

printk(KERN_CRIT "SKY2_DEBUG: sky2_poll: dev = %p\n",dev0);

And get:

ifup eth0
ip_tables: (C) 2000-2002 Netfilter core team
SKY2_DEBUG: sky2_poll: dev = f731f800
SKY2_DEBUG: sky2_poll: dev = f731f800
SKY2 DEBUG: le->link = 0
SKY2 DEBUG: le->link = 0
NETIF_RECEIVE_SKB: SKB = f5c52080
NETIF_RECEIVE_SKB: DEV = 00000000
Unable to handle kernel NULL pointer dereference at virtual address 0000017c
 printing eip:
c02822aa
*pde = 3700a001
Oops: 0000 [#1]
SMP
Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ftdi_sio usbserial dm_mirror dm_mod button battery ac uhci_hcd ehci_hcd hw_random e1000 sky2(U) ext3 jbd ata_piix libata sd_mod scsi_mod
CPU: 0
EIP: 0060:[<c02822aa>] Not tainted VLI
EFLAGS: 00010286 (2.6.9-55.3.EL.bz228733.4smp)
EIP is at netif_receive_skb+0x3d/0x310

Ok, clearly something is going wrong here. In sky2_poll the dev pointer looks perfectly valid, but by the time we call sky2_status_intr it's gotten corrupted to NULL. And in my tree here, I can't see how that happens (nor does it happen in my testing). Can you please attach the sky2.c file from your tree so that I can compare the two? Thanks

Created attachment 160924 [details]
sky2.c version related to comment 47

Requested version of sky2.c that relates to comments #47, #48.

Not sure why this bug is still NEEDINFO; did it not update correctly or is there any other information or testing I can provide? Thanks!

You needed to either set the state back to assigned or click the "I am providing the requested info" checkbox. Well, I see the problem. I'm not sure how it happened, but your version of the file has some significant (and critical) differences between what my last patch changed in the base version of the file, and what you uploaded. Most notably, your version of the file never assigns skb->dev, which my version does. I'm going to build & verify a binary kernel here for you, and post it to my people page. What arches do you need? Is x86 sufficient, or do you need others as well?

I just need i686 smp. In theory I could take a standard Red Hat kernel source tree and replace sky2.h and sky2.c, right? I'm not sure how my sky2.c would be different; I basically appended your patch to the linux-kernel-test.patch file...

You could just replace the code that way, but it's error prone, since your patched sky2.c file is wrong. I don't know how you got your file off track either, but somewhere between your base file and my patch you added something extra. Perhaps you had something erroneous in your linux-kernel-test.patch file previously.
Anywho, I'm building now, and will have binaries posted for you on Monday.

Ok, I've posted an i686 smp kernel here:
http://people.redhat.com/nhorman/rpms/kernel-smp-2.6.9-55.3.EL.bz228733.i686.rpm
I've been testing it on my sky2 card for a few hours here this morning. It has survived dhcp/scp/ping flooding for the past two hours here, and should be good to go. Please give it a try and let me know your results. Thanks!

The kernel referenced in Comment #55 boots fine and can access the network, etc. I'm investigating the missing skb->dev reference and see the following in the "linux-kernel-test.patch" file that's part of your kernel-2.6.9-55.3.EL.bz228733.2.src.rpm file that used to be available from your people page:

@@ -1955,17 +2068,20 @@
   dev = hw->dev[le->link];
   sky2 = netdev_priv(dev);
-  length = le->length;
-  status = le->status;
+  length = le16_to_cpu(le->length);
+  status = le32_to_cpu(le->status);
   switch (le->opcode & ~HW_OWNER) {
   case OP_RXSTAT:
-    skb = sky2_receive(sky2, length, status);
-    if (!skb)
-      break;
+    skb = sky2_receive(dev, length, status);
+    if (unlikely(!skb)) {
+      sky2->net_stats.rx_dropped++;
+      goto force_update;
+    }
-    skb->dev = dev;
     skb->protocol = eth_type_trans(skb, dev);
+    sky2->net_stats.rx_packets++;
+    sky2->net_stats.rx_bytes += skb->len;
     dev->last_rx = jiffies;
 #ifdef SKY2_VLAN_TAG_USED

Can you attach the linux-kernel-test.patch file I should be using? It appears the one in the earlier .src.rpm might not be right. Thanks!

Created attachment 161265 [details]
correct patch for sky2 from cvs
Sure, here it is. Not sure how it got changed in the srpm. It's exactly the
same patch, just without the - in front of the skb->dev =... line (and the
corresponding line number changes that go with it). Very odd; it was correct
in our CVS tree here, so I have no idea how that would have changed. Anywho,
given that the kernel I built worked for you, I'm thinking that this is ready
for me to post for inclusion here, unless you would like to rebuild with this
patch and do some more testing. What's your preference?
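The stray "-" described above removed the assignment of skb->dev, which is why the oops showed DEV = 00000000 even though the device pointer printed in sky2_poll was valid. The following userspace sketch models that handoff under stated assumptions: `mock_skb`, `mock_netdev`, `mock_stack_receive` and `mock_driver_rx` are hypothetical names, and the mock returns an error where the real kernel dereferenced the NULL pointer.

```c
#include <assert.h>
#include <stddef.h>

struct mock_netdev { unsigned long rx_packets; };

struct mock_skb {
    struct mock_netdev *dev;  /* receiving device; NULL until the driver sets it */
    unsigned int len;
};

/* Stand-in for the stack's receive entry point. The real function uses
 * skb->dev without checking it; this mock reports the problem instead. */
static int mock_stack_receive(struct mock_skb *skb)
{
    assert(skb != NULL);
    if (skb->dev == NULL)
        return -1;            /* where the kernel in this thread oopsed */
    skb->dev->rx_packets++;
    return 0;
}

/* Driver-side handoff; set_dev models whether the "skb->dev = dev" line
 * survived in the patched source. */
static int mock_driver_rx(struct mock_netdev *dev, struct mock_skb *skb, int set_dev)
{
    if (set_dev)
        skb->dev = dev;
    return mock_stack_receive(skb);
}
```

With `set_dev` non-zero the packet is counted against the device; with it zero the driver's own `dev` pointer is still perfectly valid, yet the packet arrives at the stack with no device attached, matching the symptom in the debug output.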
I built a kernel on 8/14 with the supplied patch and have been running it on a half dozen servers or so since then. Just this morning I hit some sort of a timeout issue and found this in /var/log/messages on one of the servers:

Aug 21 10:36:01 el4-node1 kernel: sky2 eth0: tx timeout
Aug 21 10:36:01 el4-node1 kernel: sky2 eth0: disabling interface
Aug 21 10:36:01 el4-node1 kernel: sky2 eth0: enabling interface
Aug 21 10:36:01 el4-node1 kernel: sky2 eth0: ram buffer 48K
Aug 21 10:36:04 el4-node1 kernel: sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx

Then I do "service network stop", "rmmod sky2", "service network start" to restore network connectivity:

Aug 21 14:30:38 el4-node1 network: Setting network parameters: succeeded
Aug 21 14:30:38 el4-node1 network: Bringing up loopback interface: succeeded
Aug 21 14:30:38 el4-node1 kernel: ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 169
Aug 21 14:30:38 el4-node1 kernel: sky2 0000:02:00.0: v1.14 addr 0xdeefc000 irq 169 Yukon-EC (0xb6) rev 2
Aug 21 14:30:38 el4-node1 kernel: sky2 eth0: addr 00:0e:0c:6a:c9:54
Aug 21 14:30:38 el4-node1 kernel: sky2 eth0: enabling interface
Aug 21 14:30:38 el4-node1 kernel: sky2 eth0: ram buffer 48K
...
Aug 21 14:30:40 el4-node1 kernel: sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx
Aug 21 14:30:42 el4-node1 network: Bringing up interface eth0: succeeded

The output of "dmesg" shows:

NETDEV WATCHDOG: eth0: transmit timed out
sky2 eth0: tx timeout
sky2 eth0: transmit ring 412 .. 371 report=413 done=413
sky2 eth0: disabling interface
sky2 eth0: enabling interface
sky2 eth0: ram buffer 48K
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx
sky2 eth0: disabling interface
divert: freeing divert_blk for eth0
ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 169
PCI: Setting latency timer of device 0000:02:00.0 to 64
sky2 0000:02:00.0: v1.14 addr 0xdeefc000 irq 169 Yukon-EC (0xb6) rev 2
divert: allocating divert_blk for eth0
sky2 eth0: addr 00:0e:0c:6a:c9:54
sky2 eth0: enabling interface
sky2 eth0: ram buffer 48K
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx

When I was getting the "timeout" messages, I was unable to ping any other network devices. Is this related to either Comment #5, Comment #15, or Comment #22?

Don't know for sure, although msi might be a possibility. I'd try with pci=nomsi just to see. I'll look upstream and see if something more recent has gone in.

I've rebooted with "pci=nomsi" on the command line, and still see in /proc/interrupts:

217: 710348 0 0 0 PCI-MSI eth0

Is this expected?

I'm trying to understand the "tx timeout" messages, and how to reproduce them. In my test environment, I have 2 servers, each of which has a sky2 Marvell NIC connected to a switch as "eth0".

On server "A", I type "nc serverB 3409 < /dev/zero"
On server "B", I type "nc -p 3409 > /dev/null"

I see lots of traffic from A->B, as expected. If I shut down eth0 on server "B", wait a while, and then re-enable eth0 on server "B", I see the following in "dmesg" output on server B:

sky2 eth0: disabling interface
sky2 eth0: enabling interface
sky2 eth0: ram buffer 48K

As expected... The problem is that server B is now unable to ping or access the local network anymore. "mii-tool" shows a link present. "ethtool eth0" shows that a link is NOT present. tcpdump of eth0 shows no activity (meanwhile server A is still spewing out lots of zeroes...)
If I perform "service network restart" on server B, it doesn't help anything. If I unload the sky2 module, then things clear up and I'm back on the network again. I'm curious about this testcase because the symptom seems to match the earlier "tx timeout" messages; the driver tried to re-enable itself after a timeout, but it's still not able to see any traffic. Any ideas? Is there some other way to trigger a "tx timeout"? Seems like the restart that's supposed to happen misses something. Dang! That would be because RHEL4 is too old to support pci=nomsi. Sorry about that, should have checked that first. I'll add to the patch to see if I can disable msi interrupts manually for sky2 specifically. scratch that, we seem to be in luck, there is already a sky2 module parameter called disable_msi. Just add this line to /etc/modprobe.conf options ethX disable_msi=1 where ethX is the alias name for the interface driven by your sky2 module OK, I added "options eth0 disable_msi=1" to /etc/modprobe.conf on both server "A" and server "B" as described in Comment #60. I still see the same behavior as described in Comment #60, except that /proc/interrupts now shows: 169: 340 114654 615 71 IO-APIC-level uhci_hcd, eth0 The loss of connectivity as described in that comment still apply and the "service network restart" does NOT restore connectivity--I have to unload and reload the sky2 module. Hmm, I wonder if this is what your seeing? http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c59697e06058fc2361da8cefcfa3de85ac107582 Looks like sky2 had some tx timer problem workaround that got lost upstream during a driver rebase, and was then readded when the problem recurred. That should apply with just an offset to the current sky2 build. Think you can apply it on top of what we have, or shall I build you a new kernel? OK, I included that patch. 
I get the same results as described in Comment #60, except that now I get the additional line in the dmesg output:
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx
And "ethtool eth0" actually shows "Link detected: yes", unlike before. However, these changes notwithstanding, I still am not able to access the network until I reload the "sky2" module. :-(

And you still get the tx_timeout messages?

My goal is to somehow trigger the scenario that generates the "tx timeout" messages, but I've been unsuccessful in doing so. I can't forcibly get the "tx timeout" message to be displayed. Comment #60 was an attempt to do that, and I thought it suspicious that the symptoms exhibited when "tx timeout" is displayed and the result of my exercise of bouncing the network interface while lots of traffic is being received seemed to be pretty similar.

Ok, that changes things. Sounds like we may need to reset the hardware on open. I'll see if I can enhance the patch to do that, incorporating the patch from comment 64 on the way.

Thanks! Just for point of reference, I tried the test described in Comment #60 on a vanilla 2.6.23-rc4 kernel, and I get the same failure condition. How does this type of issue get reported upstream? (Do I report a bug on kernel.org or does Red Hat generally do that type of thing?)

Also, for what it's worth, the vendor sk98lin 10.20.3.3 driver does not appear to have this issue...

Oh, that is good to know. You can report it here:
http://bugzilla.kernel.org/
If the problem is upstream as well, then perhaps it would be prudent to move forward with this bz, and roll it into RHEL4.6 or 4.7 and pursue the problem upstream and backport when it's fixed. What are your thoughts?

The problem happens intermittently with the vanilla 2.6.23-rc4 kernel, whereas it happens consistently with the RHEL4 kernel (sky2 1.14). I've opened:
http://bugzilla.kernel.org/show_bug.cgi?id=8962
to track the upstream issue.
Re: your question in Comment #70, how big or involved would the patch you reference in Comment #68 be? I'd be interested in trying it...

Well, I don't honestly know. My initial inclination was that there would be some relatively straightforward way to reset the chip in the sky2_probe routine that we could borrow and put in the open routine, making for an easy patch. Looking at it though, it may be rather more complicated than that. I'll get up with the upstream maintainer and see what his thoughts on the matter are, since he's much more familiar with sky2 than I am.

By the way, it hasn't escaped my notice from the upstream bz that you are testing this on CentOS, not on RHEL per se. While it doesn't particularly bother me one way or the other, and this clearly isn't a distribution-specific problem, I'd be curious to know if you've contacted them for support on this issue?

I hadn't contacted them because my original problem with the sky2 driver was on a RHEL 4 system and so I opened a support ticket with Red Hat. After 3 months the response from support was that I should view bugzilla #198808 for more information about this problem. I don't have permissions to view that bug (and complained as such, to no avail), so I opted to write my own bugzilla entry instead, and here it is...

Out of curiosity, in trying to determine what you had to modify in the upstream sky2 driver to retrofit it to an RHEL4 kernel, is it mostly removing wake-on-lan stuff? (And which upstream kernel did you use to pull sky2 1.14 from--a specific kernel rev. or a GIT commit?)

Just as an FYI, Stephen Hemminger posted a sky2 update to netdev today -- it might be worth trying his patches though I don't see anything in his descriptions that will really help the issues discussed here.

I pulled from Linus's tree, I don't remember the exact tag/version, but I'll look it up for you if you like.

Andy, I'll go over Stephen's update in the AM. Thanks!
Since I'm unable to reproduce, were you able to get the requested debug info that Stephen asked for?

Yes, comment posted to that bug report on 9/7.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

We've managed to get some extra sky2 hardware available and are working on setting up reproducers now.

Created attachment 302652 [details]
new sky2 backport
I've not tested it yet (no sky2 hardware in hand at the moment), but I've done
this backport of the latest sky2 driver that a co-worker has been using on a
2.6.25 kernel, and he has been unable to reproduce any lockups or crashes with
it. If you could give it a spin, I'd appreciate it. Thanks!
Updating PM score.

Since RHEL 4.8 External Beta has begun, and this bugzilla remains unresolved, it has been rejected as it is not proposed as exception or blocker.

closing, no activity from reporter for over a year

Unfortunately I found myself in the same situation as the reporter of #216799 (sky2 transmitter lockup), and moved on to other, less problematic hardware (that didn't have Marvell interfaces). Before doing so, however, I made use of the patches for this bugzilla and #216799 and was able to make satisfactory use of the hardware. The prior patches posted to this bug helped tremendously; unfortunately by the time the last request for testing occurred (4/16/2008) I was no longer using this hardware...
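
For anyone retracing the MSI check from this thread (eth0 showing up as PCI-MSI before disable_msi=1 and as IO-APIC-level afterwards), the /proc/interrupts lookup can be scripted. The following is only a rough sketch: the function name irq_types is made up for illustration, and it assumes the 2.6-era line layout quoted in this bug (IRQ number, per-CPU counters, a single-word interrupt type, then a comma-separated device list). Newer kernels print extra columns, so it would likely need adjusting there.

```python
def irq_types(interrupts_text, device):
    """Return {irq_number: interrupt_type} for rows that list `device`.

    Assumes the layout quoted in this thread, e.g.
        217: 710348 0 0 0 PCI-MSI eth0
    (hypothetical helper; not part of any driver or tool).
    """
    result = {}
    for line in interrupts_text.splitlines():
        irq, sep, rest = line.partition(":")
        if not sep or not irq.strip().isdigit():
            continue  # skip the CPU header row and named rows such as NMI
        fields = rest.split()
        # The leading all-digit fields are the per-CPU interrupt counters.
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1
        if i >= len(fields):
            continue  # no type/device columns on this row
        kind = fields[i]
        # Shared IRQs list several devices separated by commas.
        devices = [d.strip() for d in " ".join(fields[i + 1:]).split(",")]
        if device in devices:
            result[int(irq.strip())] = kind
    return result
```

Called as irq_types(open("/proc/interrupts").read(), "eth0") on the affected machine, it reports which interrupt type eth0 is bound to, which is a quick way to confirm whether disable_msi=1 actually took effect after a reboot or module reload.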