Description of problem:
I'm trying out the RHEL beta on my "home server" (router/proxy/mail/NAS), which is basically a desktop PC built on an Asus K8N4-E Deluxe motherboard, based on NForce 4-4x. I'm using the internal gigabit ethernet interface (forcedeth driver) for high-speed internal communications (the server delivers NFS through it). The problem is that the server hangs under network load.

Version-Release number of selected component (if applicable):
kernel-2.6.18-36.el5.x86_64
I also tried kernel-2.6.18-36.el5.jwltest.41.x86_64, with the same results.

How reproducible:
Sometimes

Steps to Reproduce:
1. Start experimenting with iperf. Eventually the bug will appear. So-called "bidirectional" transfer tests, together with MTU tweaking between tests, seem to trigger it fastest. However, even if the router is left to itself, regular (http, mail, torrent) traffic can eventually trigger it.

Actual results:
Kernel panic, system hangs. It is instant and leaves no traces in /var/log/messages.

Expected results:

Additional info:
The system seems to have some problems with ACPI: Linux can't find any devices on the USB bus or the SATA hard drives, though it detects the USB controller and both SATA controllers (from nvidia and silicon image) just fine. Therefore, I'm using the following kernel options: acpi=off nolapic. APIC is turned off in the BIOS; when it is turned on while the acpi=off parameter is passed, something really messy happens. The nolapic parameter doesn't really change anything; these network problems happen whether it's used or not.

Contents of /proc/interrupts:

           CPU0
  0:  312995458          XT-PIC  timer
  1:          8          XT-PIC  i8042
  2:          0          XT-PIC  cascade
  3:          0          XT-PIC  ohci_hcd:usb1
  5:     145359          XT-PIC  sata_nv
  7:  173362769          XT-PIC  eth0
  8:          0          XT-PIC  rtc
 11:          0          XT-PIC  ehci_hcd:usb2, sata_nv
 12:  114783321          XT-PIC  eth1
NMI:          0
LOC:          0
ERR:          0
MIS:          0

After googling for problems similar to this, I came to the conclusion that it could be an interrupt-related problem. I got advice to use the "options forcedeth max_interrupt_work=16" option. I tried it, and it greatly reduced the probability of the kernel panic: the system no longer seems to hang while just routing, but experiments with iperf (major network load) can still hang it. Therefore, it's not a solution.

As for the real kernel trace, well... Since it's not in the logs, I can't capture it nicely. The best I could manage was taking a photo of the screen with my cellphone.
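Since the suspicion here is interrupt-related, it may help to compare two /proc/interrupts snapshots and see how fast each IRQ fires under load. The following is only a minimal sketch, not a tool from this report: the function names parse_interrupts and interrupt_deltas are made up for illustration, and the parser assumes the 2.6-era layout shown in the report (a CPU header line, then "label: counts controller devices").

```python
"""Parse /proc/interrupts text and compare two snapshots (illustrative sketch)."""

# Sample copied from the report (single CPU, XT-PIC mode).
SAMPLE = """\
           CPU0
  0:  312995458          XT-PIC  timer
  1:          8          XT-PIC  i8042
  2:          0          XT-PIC  cascade
  3:          0          XT-PIC  ohci_hcd:usb1
  5:     145359          XT-PIC  sata_nv
  7:  173362769          XT-PIC  eth0
  8:          0          XT-PIC  rtc
 11:          0          XT-PIC  ehci_hcd:usb2, sata_nv
 12:  114783321          XT-PIC  eth1
NMI:          0
LOC:          0
ERR:          0
MIS:          0
"""


def parse_interrupts(text):
    """Return {irq_label: {"count", "controller", "devices"}} for one snapshot."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())  # header line lists CPU0, CPU1, ...
    table = {}
    for line in lines[1:]:
        # Split on the first colon only, so "ohci_hcd:usb1" stays intact.
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Leading numeric fields are per-CPU counts (at most one per CPU).
        while fields and len(counts) < ncpu and fields[0].isdigit():
            counts.append(int(fields.pop(0)))
        table[label.strip()] = {
            "count": sum(counts),
            "controller": fields[0] if fields else "",
            "devices": " ".join(fields[1:]),
        }
    return table


def interrupt_deltas(before, after):
    """Return how many times each IRQ fired between two parsed snapshots."""
    return {
        irq: after[irq]["count"] - before[irq]["count"]
        for irq in after
        if irq in before
    }
```

On a live system one would read /proc/interrupts twice, a second or so apart, pass both texts through parse_interrupts, and look at the delta for the eth0 line (IRQ 7 in this report) while iperf is running.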
Created attachment 188451
Trace one, full view
Created attachment 188461
Upper part of first trace
Created attachment 188471
Middle part of first trace
Created attachment 188481
Lower part of first trace
Created attachment 188491
Trace two - this one with max_interrupt_work=16
I got another report that looks quite similar to this one. It showed me that the kernel is dying in skb_over_panic(). Did you happen to see any lines in /var/log/messages that begin with this?

    skput:over:...

The call stack should look like this:

    skb_over_panic
    skb_put
    nv_rx_process[_optimized]
    nv_nic_irq
    nv_do_nic_poll

There is a patch upstream that brings back the use of the optimized data path for do_nic_poll, since it was left out of the original work. This might be interesting to try, but I'm not sure it will matter too much.
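For anyone following along who hasn't met skb_over_panic() before: skb_put() extends the data area of a socket buffer, and the kernel panics if the new tail would run past the end of the buffer. Below is a toy user-space model of that check, written from my understanding of skb_put() in kernels of this era. It is not the kernel code: ToySkb and SkbOverPanic are hypothetical names, and plain offsets stand in for the real pointers.

```python
class SkbOverPanic(Exception):
    """Stands in for the kernel's skb_over_panic(), which halts via BUG()."""


class ToySkb:
    """Toy model of a socket buffer's linear data area, using offsets."""

    def __init__(self, size):
        self.tail = 0    # offset where the next bytes would be appended
        self.end = size  # offset one past the last usable byte
        self.len = 0     # bytes of data currently accounted for

    def put(self, length):
        """Reserve `length` bytes at the tail, like skb_put(); return old tail."""
        old_tail = self.tail
        self.tail += length
        self.len += length
        # The bounds check: data must never extend past the buffer's end.
        if self.tail > self.end:
            raise SkbOverPanic(
                f"tail {self.tail} > end {self.end} (put {length})"
            )
        return old_tail
```

In this model, a driver that asks for more bytes than the buffer was allocated with (for example, a received length larger than the buffer size) hits the exception, which corresponds to the panic in the call stack above.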
Created attachment 229201
forcedeth-optimized-irq-routine.patch

Upstream patch that would be interesting to try.

commit fcc5f2665c81e087fb95143325ed769a41128d50
Author: Ayaz Abdulla <aabdulla>
Date:   Fri Mar 23 05:49:37 2007 -0500

    forcedeth: fix nic poll

    The nic poll routine was missing the call to the optimized irq routine.
    This patch adds the missing call for the optimized path.
I don't get anything in /var/log/messages; nothing is left there after a crash, and when I connect a monitor to this system I can't see the lines before the ones I posted. However, I'll rebuild the kernel with this patch and try it out soon. It looks interesting, yes. Btw, in the last month the system only hung once or twice, with max_interrupt_work=16 and simple routing tasks. Unfortunately, to achieve that I had to move the NFS serving task away from it. To test this patch, I'll put the full load back on this system.
Great! Thank you for trying this patch. I'm not sure it will help, but it seems interesting.
I'm getting some feedback that this patch is helping for RHEL4 -- I'll build some new test kernels and post a link to them here.
Hmmmmmm it seems I've added this patch already. Can you try a kernel from here: http://people.redhat.com/dzickus/el5/53.el5/ It should resolve your issue. Thanks!
I feel confident this is a duplicate of bug 245191. Please reopen this if the kernel from comment #12 does not resolve your problem. Thanks!

*** This bug has been marked as a duplicate of 245191 ***
I tried kernel-2.6.18-51.el5.jwltest.43.x86_64 from http://people.redhat.com/linville/kernels/rhel5/, which includes this patch, and it doesn't crash anymore. I got a "spurious 8259A interrupt: IRQ7." message once in dmesg under load (IRQ7 is my eth0 interrupt), and I get TONS of "eth0: too many iterations (6) in nv_nic_irq." messages, but at least everything seems to work. I only did synthetic testing with iperf; I'll reopen one of these bugs in case of any problems under real load, but I guess it's safe to close them now.