Bug 280151 - forcedeth driver causes kernel panic in nv_tx_done call
Status: CLOSED DUPLICATE of bug 245191
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: x86_64 Linux
Priority: medium Severity: medium
Assigned To: Andy Gospodarek
QA Contact: Martin Jenner
Reported: 2007-09-06 04:57 EDT by Vladimir Mosgalin
Modified: 2014-06-29 18:59 EDT
Doc Type: Bug Fix
Last Closed: 2007-10-18 19:10:51 EDT

Attachments
Trace one, full view (662.18 KB, image/jpeg)
2007-09-06 04:57 EDT, Vladimir Mosgalin
Upper part of first trace (819.42 KB, image/jpeg)
2007-09-06 04:59 EDT, Vladimir Mosgalin
Middle part of first trace (589.18 KB, image/jpeg)
2007-09-06 05:00 EDT, Vladimir Mosgalin
Lower part of first trace (438.55 KB, image/jpeg)
2007-09-06 05:01 EDT, Vladimir Mosgalin
Trace two - this one with max_interrupt_work=16 (560.36 KB, image/jpeg)
2007-09-06 05:02 EDT, Vladimir Mosgalin
forcedeth-optimized-irq-routine.patch (1015 bytes, patch)
2007-10-16 16:35 EDT, Andy Gospodarek

Description Vladimir Mosgalin 2007-09-06 04:57:42 EDT
Description of problem:
I'm trying out the RHEL beta on my "home server" (router/proxy/mail/NAS), which
is basically a desktop PC built on an Asus K8N4-E Deluxe MB, based on NForce
4-4x. I'm using the internal gigabit ethernet interface (forcedeth driver) for
high-speed internal communications (the server exports nfs over it). The
problem is that the server hangs under network load.

Version-Release number of selected component (if applicable):
kernel-2.6.18-36.el5.x86_64
I also tried kernel-2.6.18-36.el5.jwltest.41.x86_64 with same results

How reproducible:
Sometimes

Steps to Reproduce:
1. Start experimenting with iperf. Eventually the bug will appear. So-called
"bidirectional" transfer tests, together with MTU tweaking between tests, seem
to trigger it fastest. Even if the router is left alone, though, regular
(http, mail, torrent) traffic will eventually trigger it.

Actual results:
Kernel panic, system hangs. It is instant and leaves no traces in
/var/log/messages.

Expected results:


Additional info:
The system seems to have some problems with ACPI: Linux can't find any devices
on the USB bus or the SATA hard drives, though it detects the USB controller
and both SATA controllers (from nvidia and silicon image) just fine.

Therefore, I'm using the following kernel options: acpi=off nolapic. APIC is
turned off in the BIOS; when it's turned on but the acpi=off parameter is
passed, something really messy happens. The nolapic parameter doesn't really
change anything; these network problems happen whether it's used or not.

Contents of /proc/interrupts:
           CPU0       
  0:  312995458          XT-PIC  timer
  1:          8          XT-PIC  i8042
  2:          0          XT-PIC  cascade
  3:          0          XT-PIC  ohci_hcd:usb1
  5:     145359          XT-PIC  sata_nv
  7:  173362769          XT-PIC  eth0
  8:          0          XT-PIC  rtc
 11:          0          XT-PIC  ehci_hcd:usb2, sata_nv
 12:  114783321          XT-PIC  eth1
NMI:          0 
LOC:          0 
ERR:          0
MIS:          0
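
A listing like the one above is easier to compare across test runs once it is
parsed. The following is a minimal Python sketch, not something from the
original report: parse_interrupts is a made-up helper name, and it assumes the
older layout shown here (per-CPU counts, then one controller column such as
XT-PIC, then device names).

```python
def parse_interrupts(text):
    """Map IRQ number -> (total count across CPUs, device names)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():            # skip NMI/LOC/ERR/MIS summary rows
            continue
        fields = rest.split()
        total = sum(int(f) for f in fields[:ncpus])
        # fields[ncpus] is the controller type (e.g. XT-PIC); the rest are names
        names = " ".join(fields[ncpus + 1:])
        result[int(label)] = (total, names)
    return result


sample = """\
           CPU0
  0:  312995458          XT-PIC  timer
  7:  173362769          XT-PIC  eth0
 11:          0          XT-PIC  ehci_hcd:usb2, sata_nv
NMI:          0
"""
irqs = parse_interrupts(sample)
print(irqs[7])
```

Reading /proc/interrupts twice a few seconds apart and subtracting the totals
would give a rough interrupt rate for eth0 under load.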

After googling for similar problems, I came to the conclusion that it could be
an interrupt-related problem. I was advised to use the "options forcedeth
max_interrupt_work=16" option. I tried it, and it greatly reduced the
probability of the kernel panic: the system no longer seems to hang while just
routing, but experiments with iperf (heavy network load) can still hang it.
Therefore, it's not a solution.
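
The option name suggests a cap on how much work the interrupt handler does per
invocation before it backs off. As a toy model only (this is Python, not the
forcedeth source, and handle_irq, events, and the return values are invented
for illustration), such a bounded loop might look like this:

```python
from collections import deque


def handle_irq(events, max_interrupt_work):
    """Drain `events` during one simulated interrupt.

    Returns (handled, deferred): how many events were processed, and whether
    the handler gave up and left the remainder for a later poll.
    """
    iterations = 0
    handled = 0
    while events:
        events.popleft()                   # "service" one pending event
        handled += 1
        iterations += 1
        if iterations > max_interrupt_work:
            # Hypothetical back-off: stop here and let a timer finish the job.
            print(f"too many iterations ({iterations}) in handle_irq")
            return handled, True
    return handled, False


burst = deque(range(10))
print(handle_irq(burst, 5))                # gives up partway through the burst
print(handle_irq(deque(range(10)), 16))    # drains the whole burst
```

In this model a higher limit lets a single interrupt drain a longer burst,
which would be consistent with hangs becoming rarer but not disappearing.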

As for the actual kernel trace, well... Since it's not in the logs, I can't
capture it nicely. The best I could manage was taking a photo of the screen
with my cellphone.
Comment 1 Vladimir Mosgalin 2007-09-06 04:57:42 EDT
Created attachment 188451 [details]
Trace one, full view
Comment 2 Vladimir Mosgalin 2007-09-06 04:59:50 EDT
Created attachment 188461 [details]
Upper part of first trace
Comment 3 Vladimir Mosgalin 2007-09-06 05:00:44 EDT
Created attachment 188471 [details]
Middle part of first trace
Comment 4 Vladimir Mosgalin 2007-09-06 05:01:30 EDT
Created attachment 188481 [details]
Lower part of first trace
Comment 5 Vladimir Mosgalin 2007-09-06 05:02:53 EDT
Created attachment 188491 [details]
Trace two - this one with max_interrupt_work=16
Comment 6 Andy Gospodarek 2007-10-16 16:33:53 EDT
I got another report that looks quite similar to this.  That showed me that it's
dying in skb_over_panic().  Did you happen to see any lines that began with this:

skput:over:...

in /var/log/messages?

The call stack should look like this:

skb_over_panic
skb_put
nv_rx_process[_optimized]
nv_nic_irq
nv_do_nic_poll

There is a patch upstream that brings back the use of the optimized data path
for do_nic_poll since it was left out of the original work.  This might be
interesting to try, but I'm not sure it will matter too much.
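
For context on the call stack above: in the kernel, skb_put() grows the used
part of a socket buffer and calls skb_over_panic() when the new tail would pass
the end of the allocated space. The Python sketch below is only a toy model of
that bounds check. The class and exception names are invented, and it says
nothing about why the driver would ask for too many bytes.

```python
class SkbOverPanic(Exception):
    """Stands in for the kernel's skb_over_panic()."""


class ToySkb:
    def __init__(self, size):
        self.size = size      # bytes allocated for packet data
        self.tail = 0         # bytes currently in use

    def put(self, length):
        """Reserve `length` more bytes; return the offset where they start."""
        if self.tail + length > self.size:
            raise SkbOverPanic(
                f"put:{length} tail:{self.tail} size:{self.size}")
        start = self.tail
        self.tail += length
        return start


skb = ToySkb(size=1536)
skb.put(1500)                 # fits inside the buffer
try:
    skb.put(9000)             # a length larger than the remaining space
except SkbOverPanic as exc:
    print("would panic:", exc)
```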
Comment 7 Andy Gospodarek 2007-10-16 16:35:14 EDT
Created attachment 229201 [details]
forcedeth-optimized-irq-routine.patch

Upstream patch that would be interesting to try.  

commit fcc5f2665c81e087fb95143325ed769a41128d50
Author: Ayaz Abdulla <aabdulla@nvidia.com>
Date:	Fri Mar 23 05:49:37 2007 -0500

    forcedeth: fix nic poll

    The nic poll routine was missing the call to the optimized irq routine.
    This patch adds the missing call for the optimized path.
Comment 8 Vladimir Mosgalin 2007-10-16 16:53:56 EDT
I don't get anything in /var/log/messages; nothing is left there after a crash,
and when I connect a monitor to this system I can't see the lines before the
ones I posted.

However, I'll rebuild the kernel with this patch and try it out soon. It looks
interesting, yes.

Btw, in the last month the system only hung once or twice, with
max_interrupt_work=16 and simple routing tasks. Unfortunately, to achieve that
I had to move the nfs serving task away from it. To test this patch, I'll put
the full load back on this system.
Comment 9 Andy Gospodarek 2007-10-16 17:19:06 EDT
Great!  Thank you for trying this patch.  I'm not sure it will help, but it
seems interesting.
Comment 11 Andy Gospodarek 2007-10-18 16:14:04 EDT
I'm getting some feedback that this patch is helping for RHEL4 -- I'll build
some new test kernels and post a link to them here.
Comment 12 Andy Gospodarek 2007-10-18 16:30:41 EDT
Hmmmmmm it seems I've added this patch already.  Can you try a kernel from here:

http://people.redhat.com/dzickus/el5/53.el5/

It should resolve your issue.  Thanks!
Comment 13 Andy Gospodarek 2007-10-18 19:10:51 EDT
I feel confident this is a duplicate of bug 245191 

Please reopen this if the kernel from comment #12 does not resolve your problem.

Thanks!



*** This bug has been marked as a duplicate of 245191 ***
Comment 14 Vladimir Mosgalin 2007-10-20 09:21:01 EDT
I tried kernel-2.6.18-51.el5.jwltest.43.x86_64 from
http://people.redhat.com/linville/kernels/rhel5/, which includes this patch,
and it doesn't crash anymore. I got a "spurious 8259A interrupt: IRQ7." message
once in dmesg under load (IRQ7 is my eth0 interrupt) and I get TONS of "eth0:
too many iterations (6) in nv_nic_irq." messages, but at least everything seems
to work. I only did synthetic testing with iperf; I'll reopen one of these bugs
in case of any problems under real load, but I guess it's safe to close them
now.
