Bug 330841 - skge driver: interrupt stuck on SK-9821
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: 4Suite
Version: 5.0
Hardware: i686 Linux
Priority: low
Severity: low
Assigned To: Andy Gospodarek
QA Contact: Red Hat Kernel QE team
Reported: 2007-10-13 13:51 EDT by Kay Diederichs
Modified: 2014-06-29 18:59 EDT (History)
Doc Type: Bug Fix
Last Closed: 2011-08-10 10:19:34 EDT

Attachments
skge-screaming-interrupt.patch (1.58 KB, patch)
2009-10-28 13:55 EDT, Andy Gospodarek
skge-possible-interrupt-fix.patch (2.06 KB, patch)
2011-08-09 16:48 EDT, Andy Gospodarek

Description Kay Diederichs 2007-10-13 13:51:46 EDT
Description of problem:
The skge driver produces an extremely high number of interrupts (about 180,000
per second) for my SK-9821 adapters, without any network traffic. This starts
about 20 minutes after loading the driver and keeps the CPU busy at 20% load.
The network is usable nevertheless, but slow.

Version-Release number of selected component (if applicable):
All RHEL5 2.6.18 kernels. I also tried a 2.6.22.9 kernel; it did not show
these symptoms.

How reproducible:
Always. All 3 available machines show the same symptoms.


Steps to Reproduce:
1. Boot.
2. Wait about 20 minutes, watching /proc/interrupts.
3. See the skge interrupt count start to rise sharply.
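
Watching /proc/interrupts by eye can be replaced with a small script that samples the file twice and reports a per-second rate. The following is a minimal Python sketch, not part of the original report; the function names `irq_total` and `sample_rate` are hypothetical, and it assumes the driver appears under the name `skge` in the last columns of /proc/interrupts.

```python
import time


def irq_total(text, name):
    """Sum the per-CPU counts of every /proc/interrupts row owned by `name`."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    total = 0
    for line in lines[1:]:
        _irq, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts, devices = fields[:ncpus], fields[ncpus:]
        # Shared IRQs list several owners separated by commas.
        if any(tok.strip(",") == name for tok in devices):
            total += sum(int(c) for c in counts if c.isdigit())
    return total


def sample_rate(name, interval=5.0, path="/proc/interrupts"):
    """Return interrupts per second for `name`, measured over `interval` seconds."""
    with open(path) as f:
        before = irq_total(f.read(), name)
    time.sleep(interval)
    with open(path) as f:
        after = irq_total(f.read(), name)
    return (after - before) / interval

# Example (on the affected machine): print(sample_rate("skge"))
```

On an idle link the reported rate should stay near zero; a sustained value in the hundreds of thousands would match the symptom described here.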
  
Actual results:
interrupts for skge are generated at a rate of about 180,000 per second, without
any network traffic

Expected results:
interrupts should depend on network traffic

Additional info:
The sk98lin driver (available with RHEL4, or when I compile the RHEL5 kernel
after enabling that driver as a module) does not show this problem.
Comment 3 Andy Gospodarek 2007-10-19 09:41:26 EDT
Sorry, man.  sk98lin isn't enabled because it has been replaced by skge.  Sounds
like we need to fix up skge.  Based solely on the description from this customer,
it sounds like this patch might help:

commit 4ebabfcb1d6af5191ef5c8305717ccbc24979f6c
Author: Stephen Hemminger <shemminger@linux-foundation.org>
Date:   Fri Mar 16 14:01:27 2007 -0700

    skge: mask irqs when device down

    Wheen a port on the skge driver is not used, it should
    mask off interrupts from theat port.

    Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
    Signed-off-by: Jeff Garzik <jeff@garzik.org>

I need to do some new test kernels today, so maybe I'll add this one.
Comment 4 Andy Gospodarek 2007-10-19 09:42:26 EDT
Sorry the 'man' reference was to a previous private comment -- please excuse the
informality. :)
Comment 5 Andy Gospodarek 2007-10-19 09:43:35 EDT
(In reply to comment #0)
> Description of problem:
> The skge driver produces an extremely high number of interrupts (about 180,000
> per second) for my SK-9821 adapters, without any network traffic.

Was this while the interface was down, or just with the link up and no traffic?
Comment 6 Kay Diederichs 2007-10-19 10:16:03 EDT
No problem with the 'man' - in Germany Kay is a male name.

To answer your question: this was with the link up and (almost) no traffic. So
the description "skge: mask irqs when device down" does not seem to apply here.
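
The distinction asked about here (interface down versus link up with no traffic) can be checked from sysfs. The sketch below is an illustration only, assuming the kernel exposes `operstate` and the `statistics/rx_packets` and `statistics/tx_packets` counters under /sys/class/net; the helper names `link_snapshot` and `classify` and the interface name `eth0` are hypothetical.

```python
from pathlib import Path


def link_snapshot(iface, root="/sys/class/net"):
    """Read link state and packet counters for one interface from sysfs."""
    base = Path(root) / iface

    def read(rel):
        return (base / rel).read_text().strip()

    return {
        "operstate": read("operstate"),
        "rx_packets": int(read("statistics/rx_packets")),
        "tx_packets": int(read("statistics/tx_packets")),
    }


def classify(before, after):
    """Describe the interface given two snapshots taken some time apart."""
    # Some drivers report "unknown" instead of "up"; that case is treated
    # as "not up" here, which is a simplification.
    if after["operstate"] != "up":
        return "interface not up"
    moved = (after["rx_packets"] - before["rx_packets"]) + (
        after["tx_packets"] - before["tx_packets"]
    )
    return "link up, no traffic" if moved == 0 else f"link up, {moved} packets"

# Example: take link_snapshot("eth0") twice, a few seconds apart, and pass
# both results to classify().
```

If the second snapshot still reports the link up with few or no packets moved while the interrupt count keeps climbing, the interrupts are not explained by traffic.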
Comment 7 Andy Gospodarek 2007-10-19 12:00:31 EDT
(In reply to comment #6)
> No problem with the 'man' - in Germany Kay is a male name.

Good to know. :-)

> To answer your question: this was with the link up and (almost) no traffic. So
> the description "skge: mask irqs when device down" does not seem to apply here.

Thanks for the feedback, I'll sift through the git logs and see if anything else
looks like a good candidate to resolve this.  If your card is working fine in
2.6.22.9 then the fix must be around somewhere.

Comment 8 Andy Gospodarek 2009-10-28 13:26:28 EDT
Kay, this is an extremely old bug, but I would like to fix it.

Can you still reproduce the problem with the skge device?
Comment 9 Andy Gospodarek 2009-10-28 13:42:44 EDT
I was just looking at some possible patches and this may have been fixed in 2.6.18-128.el5.

Have you tried the 5.3 kernel and did it fix the issue?
Comment 10 Andy Gospodarek 2009-10-28 13:55:27 EDT
Created attachment 366474 [details]
skge-screaming-interrupt.patch

Please disregard my last comment.  This was not resolved in 2.6.18-128.el5.  This patch may fix it, however.  I will add it to my test kernels and post a link here when the builds have completed.  Until then, feel free to build this patch against the RHEL5 skge driver and test it.
Comment 11 Kay Diederichs 2009-10-29 08:29:34 EDT
The bug still exists on 2.6.18-164.el5
Comment 12 Andy Gospodarek 2009-10-29 11:10:44 EDT
(In reply to comment #11)
> The bug still exists on 2.6.18-164.el5  

I would expect that.  I will build a test kernel sometime in the next week, but feel free to build and test the patch in comment #10 against 2.6.18-164 to see if it helps.
Comment 13 Andy Gospodarek 2009-11-17 22:38:29 EST
My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel5

Please test them and report back your results.
Comment 14 Kay Diederichs 2009-11-18 08:26:43 EST
I downloaded the kernel and am running it now:
dikay@loop3:-people/dikay% uname -a
Linux loop3 2.6.18-174.el5.gtest.77 #1 SMP Tue Nov 17 15:42:39 EST 2009 i686 i686 i386 GNU/Linux

However, I still see the number of skge interrupts rising at millions per second. The output below is after one hour of uptime, on an otherwise idle machine:
dikay@loop3:-people/dikay% cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:    9977531          0          0          0    IO-APIC-edge  timer
  1:        174       1446         73       6209    IO-APIC-edge  i8042
  6:         31          0         13          0    IO-APIC-edge  floppy
  7:          2          0          0          0    IO-APIC-edge  parport0
  8:    1399005          0          0          0    IO-APIC-edge  rtc
  9:          1          0          0          0   IO-APIC-level  acpi
 14:       5615      38641          0      30909    IO-APIC-edge  ide0
 15:         26      11128       5234      77779    IO-APIC-edge  ide1
169:        627       8322        134      69166   IO-APIC-level  uhci_hcd:usb1
177:          0          0          0          0   IO-APIC-level  uhci_hcd:usb2
185:         15          0          0          0   IO-APIC-level  aic7xxx
193:      21731          0 1479663036          0   IO-APIC-level  skge
201:       2823          0          0          0   IO-APIC-level  Intel 82801BA-ICH2
209:       5896      53787     793335          0   IO-APIC-level  nvidia
NMI:          0          0          0          0
LOC:    9976477    9976475    9976474    9976473
ERR:          0
MIS:          0
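
As a rough cross-check of the output above, the skge row (IRQ 193) can be turned into an average rate. This Python sketch is an illustration only; it assumes the stated one hour of uptime is exactly 3600 seconds, and the variable names are hypothetical.

```python
# Per-CPU counts for IRQ 193 (skge), copied from the output above.
skge_counts = [21731, 0, 1479663036, 0]

# Assumption: "one hour of uptime" is taken literally as 3600 seconds.
uptime_seconds = 3600

total = sum(skge_counts)
avg_rate = total / uptime_seconds

# Index of the CPU that serviced the most skge interrupts.
busiest_cpu = max(range(len(skge_counts)), key=skge_counts.__getitem__)
share = skge_counts[busiest_cpu] / total

print(f"total skge interrupts: {total}")
print(f"average rate: {avg_rate:,.0f} per second")
print(f"CPU{busiest_cpu} handled {share:.3%} of them")
```

Under that assumption the average works out to roughly 411,000 interrupts per second, nearly all of them serviced by CPU2. If the interrupts only begin some time after boot, as in the original description, the rate while they are active would be higher than this average.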
Comment 15 Andy Gospodarek 2011-08-09 14:47:37 EDT
This is a really old bug, but I'd like to clear it out.  If you are still interested and willing to test, the patch I am about to attach should resolve this.
Comment 16 Andy Gospodarek 2011-08-09 16:48:18 EDT
Created attachment 517486 [details]
skge-possible-interrupt-fix.patch

This patch seems to have the best chance to resolve the problem reported.  This is a combination of the following upstream patches:

29365c900963d4986b74a0dadea46872bf283d76
78bc218663e3bd6cbbaf6a363d2f88f17541adfb

Based on the reporter's feedback that 2.6.22.9 worked fine, these two patches together seem like the best option without performing a full backport.

The reporter tested 29365c900963d4986b74a0dadea46872bf283d76 previously, but it did not resolve the problem.  I think it should be included if we update skge at all, so I kept it and added 78bc218663e3bd6cbbaf6a363d2f88f17541adfb as I think it could be the real problem.
Comment 17 Andy Gospodarek 2011-08-10 10:19:34 EDT
I cannot reproduce this on the skge system we have locally.  My system is an older nForce motherboard and CPU and doesn't actually have more than 1 core, so that might be a factor.

I hesitate to do this, but I'm going to close this as insufficient data.  Please reopen if you find that the patch in comment #16 resolves your issue, or if you need anything else.  If you do that soon, I can try to get it added to the next RHEL update.
