Bug 182617 - irqbalance makes SMP system unstable

Product: Fedora
Component: kernel
Version: rawhide
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: medium
Reporter: Alexandre Oliva <oliva>
Assignee: Neil Horman <nhorman>
CC: rhbz
Doc Type: Bug Fix
Last Closed: 2007-08-08 18:44:17 UTC
Bug Depends On: 182618
Attachments:
  - /proc/interrupts snapshots
  - patch to build the ping crash module
  - All I saw in the console when disks started failing because of irqbalance being on
  - SysRq-T after the network stopped working
  - Oopses I got after reloading skge, after a network failure

Description Alexandre Oliva 2006-02-23 18:05:17 UTC
Description of problem:
Evidence is mounting that it is irqbalance that is causing me headaches, leading
to numerous different kinds of failures (bug 181347, bug 181920, bug 181310). 
The box would display any of the symptoms of these bugs within hours of booting
up.  Ever since I ran 'service irqbalance stop', the box has been rock solid.  For
once I didn't find it frozen when I got up this, erhm, morning :-), which is a
good sign, and it's been under heavy load since then, without any casualties so
far.

I realize this is unlikely to be a bug in irqbalance per se, but I'll clone this
as a kernel bug momentarily, and block the irqbalance bug on the kernel clone.

Version-Release number of selected component (if applicable):
kernel-2.6.15-1.1975_FC5.x86_64
irqbalance-1.12-1.24

How reproducible:
Leaving any of the several boxes with a similar configuration on overnight has
never failed to trigger it.

Steps to Reproduce:
1. Boot the system up.
2. Leave it up overnight.
Actual results:
You'll find that networking died, or that the SATA subsystem is dead, or that
the mouse is jerky, or God knows what else.

Expected results:
No such undesirable surprises.

Additional info:
Hardware is Athlon64X2 3800+, Asus A8V Deluxe, A4Tech USB mouse, 2 SATA disks
connected to the sata_promise controller built into the MoBo.

$ cat /proc/interrupts
           CPU0       CPU1
  0:      87445    9375771    IO-APIC-edge  timer
  1:      24239          0    IO-APIC-edge  i8042
  7:          0          0    IO-APIC-edge  parport0
  8:          0          0    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 15:     112601        552    IO-APIC-edge  ide1
 16:          0          0   IO-APIC-level  libata
 17:      25254    2148074   IO-APIC-level  libata
 18:    3274119      36277   IO-APIC-level  skge
 19:          0          0   IO-APIC-level  VIA8237
 20:          7    1660312   IO-APIC-level  ohci1394
 21:         48      67448   IO-APIC-level  ehci_hcd:usb1, uhci_hcd:usb2,
uhci_hcd:usb3, uhci_hcd:usb4, uhci_hcd:usb5
NMI:       2744       4102
LOC:    9463891    9463534
ERR:          0
MIS:          0

I still haven't determined what happens if I never run irqbalance after boot up;
so far all I've tested is irqbalance running for some time, and then stopped, so
that every IRQ is assigned to a single CPU, and that appears to make the system
stable.
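
(For reference, IRQ affinity can also be inspected and pinned by hand through
/proc/irq; a minimal sketch, using IRQ 18 (skge) from the table above as an
example:)

# show the CPU mask each IRQ is currently allowed to run on
for irq in 17 18 21; do echo -n "IRQ $irq: "; cat /proc/irq/$irq/smp_affinity; done

# pin the skge interrupt to CPU0 only (the mask is a hex bitmap of CPUs)
echo 1 > /proc/irq/18/smp_affinity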

Comment 1 Neil Horman 2006-02-23 18:29:57 UTC
Hmmm, I'd certainly agree with you that this likely isn't a bug in irqbalance
specifically.  It is certainly possible, however, that migration of irqs is
causing a problem, although ideally, once irqbalance distributes irqs, it tries
rather hard not to move them around again.  I'd be interested to see a copy of
/proc/interrupts with irqbalance running on your system for a few hours, and
/proc/interrupts without irqbalance running for a few hours, to compare and see
if irqbalance is actually migrating any interrupts more than it should.

Also, since I've not heard of this happening on many systems (taking all your
referenced bugzillas in aggregate), I wonder if you have a system-specific
problem (perhaps a quirk with the VIA chipset in your box, or an ACPI error of
some sort).  Can you try booting with acpi=off (I think that's the right syntax)
and see if you get the same effects? Thanks!
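
(For the record, a sketch of how the acpi=off test would be set up on an
FC5-era box; the exact kernel line in grub.conf will differ:)

# append acpi=off to the kernel line in /boot/grub/grub.conf, e.g.
#   kernel /vmlinuz-2.6.15-1.1975_FC5 ro root=/dev/VolGroup00/LogVol00 acpi=off
# then reboot and confirm it took effect
cat /proc/cmdline
dmesg | grep -i acpi | head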

Comment 2 Alexandre Oliva 2006-02-23 20:32:16 UTC
It's 4 different boxes experiencing the problem, so it's unlikely to be
something specific to this one I have at home (the other 3 are at the uni). 
I've also found reports of skge problems on the net, so there is something to it.

As for acpi, I didn't think acpi=off was supported at all on x86_64, but I can
try that on my next reboot.  I'll also try to get you /proc/interrupts with
irqbalance running, although the sort of workloads the box experiences vary
widely depending on the time of day.  I'm also thinking of trying a 32-bit OS on
it just to determine whether the problem is 64-bit specific.

Anyhow, the more I think about it, the more it makes sense: the box would often
freeze when I switched from one major activity to another, e.g., it wouldn't
crash half-way through a big build, but it would often crash in the beginning or
at the end, and generally logging a CRC error.  I'd often have network problems
right after connecting to the box over vnc from another box, or right after
disconnecting.  Putting this all together made me wonder if CPU affinity could
solve the problem, and so I got to irqbalance.

Comment 3 Neil Horman 2006-02-23 21:12:16 UTC
Again, I don't disagree that migrating irqs may have a problem, but it's not
going to be the irqbalance daemon that's causing it.  There may be a problem with
migrating irqs between cpus which is causing panics/deadlocks/etc., but that's
going to be a kernel problem.  I can certainly help you fix that, but I'm going
to need more to go on.  If you can provide some of the panic backtraces (I
checked the other bugs you reported and there doesn't seem to be any
panic/backtrace info in any of them), that would be helpful.  It would also be
helpful (for the purposes of my debugging any potential problem in irqbalance)
to see those /proc/interrupts before/after snapshots.

Comment 4 Alexandre Oliva 2006-02-24 01:48:26 UTC
Created attachment 125153 [details]
/proc/interrupts snapshots

Here are some /proc/interrupts dumps.  As soon as I got the gdm login prompt, I
switched to VT1 and, as root, dumped /proc/interrupts to a file, then scripted
an automated "sleep for one minute, then append the date and the contents of
/proc/interrupts" loop writing to the same file (roughly the sketch below), and
left it running for a few minutes.  Clearly, interrupts are dancing back and
forth between the two processors...
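
(Reconstructed from memory, so treat it as a sketch rather than the exact
script:)

# append a timestamped /proc/interrupts snapshot every minute (log path illustrative)
while true; do
  sleep 60
  { date; cat /proc/interrupts; } >> /root/interrupts.log
done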

I tried disabling acpi (you meant acpi, not apic, right?), and that had the
unfortunate side effect of disabling Cool&Quiet, so cpuspeed wouldn't work and
I figured I didn't want to leave the machine running like that for very long.

As for panics, I don't ever get any, which is why the bug reports do not
contain them :-)  This is what makes this bug particularly tricky to debug, I
guess.  irqbalance was a shot in the dark, and I'm happy it hit something.  In
case you suspect cpuspeed, that's not it: before I updated the BIOS to enable
Cool&Quiet with a dual-core processor, I was already getting the very same kind
of problems.

Comment 5 Neil Horman 2006-02-24 12:16:06 UTC
Ok, those are the /proc/interrupts files with irqbalance turned on, I assume.
What about with irqbalance off?  The fact that you are getting any given
interrupt on multiple cpus means either that irqs are being migrated at the same
time that a steady stream of interrupts is arriving, or that irqbalance has
decided that a given interrupt occurs at a low enough frequency that it can be
masked to a subset of, or all of, the cpus in the system.  Getting
/proc/interrupts with irqbalance off will help me compare that.  In fact, if you
could provide a sysreport as well, so I can check on the state of the rest of
your system without having to ask you for things bit by bit, that would be very
helpful.

Regarding the lack of panics, have you tried to establish a serial console, or
to capture a vmcore via netdump or diskdump?  If not, that's a road we should
explore.  How about sysrqs?  Is the system responsive to a SysRq key sequence
after an error occurs?  If so, gathering a sysrq-t and a sysrq-m would be
helpful.
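
(A sketch of the SysRq side, in case it helps; the netdump setup would
additionally need a netdump-server host configured, which I'm leaving out here:)

# make sure the magic SysRq key is enabled
echo 1 > /proc/sys/kernel/sysrq

# task-state and memory dumps, equivalent to Alt+SysRq+T / Alt+SysRq+M at the
# console; output goes to the kernel log (dmesg, /var/log/messages)
echo t > /proc/sysrq-trigger
echo m > /proc/sysrq-trigger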



Comment 6 Alexandre Oliva 2006-02-24 17:17:17 UTC
Without irqbalance having ever run since boot up:

$ cat /proc/interrupts
           CPU0       CPU1
  0:   14167756          0    IO-APIC-edge  timer
  1:      42507          0    IO-APIC-edge  i8042
  7:          0          0    IO-APIC-edge  parport0
  8:          0          0    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 15:     169347          0    IO-APIC-edge  ide1
 16:          0          0   IO-APIC-level  libata
 17:    1862640          0   IO-APIC-level  libata
 18:    7518595          0   IO-APIC-level  skge
 19:          6          0   IO-APIC-level  ohci1394
 20:     143902          0   IO-APIC-level  uhci_hcd:usb1, uhci_hcd:usb2,
uhci_hcd:usb3, uhci_hcd:usb4, ehci_hcd:usb5
 21:          0          0   IO-APIC-level  VIA8237
NMI:       5951       5226
LOC:   14168372   14168692
ERR:          0
MIS:          0

No serial console here, and not really necessary, since the console is still
usable.  In the case of the disk subsystem failure it's trickier, because I have
to have everything I need already in memory, and I don't get a permanent record
unless I set up some external disk to collect a copy of /var/log/messages; but
for networking or mouse failures the system is still usable and perfectly
recoverable as long as I'm physically in front of it, which is not that uncommon
given that this is my primary desktop (which is also what makes this box not a
very good choice of a system on which to run random testing configurations ;-)

I haven't set up netdump or diskdump, mainly because I don't know how to do
that, and considering that I can't tell in advance whether it's the disk or the
network that is going to fail, and they fail just about as often as each other,
it's hard to decide which one to set up.

The disk failure is more serious, since I generally can't bring the system back
up without a reset after it hits.  Even in this case, however, the system keeps
running (for some arguable definition of running :-), to the point that I can
often switch to VT1 (as long as I don't need to page code in to accomplish that;
most often I don't) and see SysRq output.

I've already looked for interesting stuff when the network failed, and found
nothing: no held locks were shown by SysRq-D.  For disk failures, there are
generally lots of held locks, all of them related to ext3 inodes.  Memory is
not a problem in either case.

I'm collecting the sysreport and will attach it as soon as it is done.  It's
taking forever to collect the list of packages (everything in today's rawhide).

Comment 9 Neil Horman 2006-03-27 15:27:45 UTC
Ok, so I've been going over this, and as far as I can see, unless we can capture
an oops when this happens (or get a sysrq-t when the system deadlocks), we're
not going to make much progress.  I suggest the following plan:

1) Since FC5 has been released, make sure the problem still occurs on the latest
kernel.  If the problem is gone, it will be easier to track down what we fixed
than to fix the problem all over again.

2) You have a lot of modules loaded; let's play guess-and-check.  I'd start by
removing non-essential modules (your sound modules are probably a good place to
start, as sound cards can generate a good deal of interrupts).  For those modules
that you can't remove, do whatever you can to isolate and minimize the number of
interrupts the device generates.  The skge driver springs to mind here: either
maximize the interrupt coalescence factor on the card (see the ethtool sketch
after this list), or filter the segment that the crashing system is on so that
it only receives essential traffic (and minimize the received traffic if you
can; I noted that this system seems to be a pretty busy named server, so if you
can, move DNS resolution to a backup DNS server).  The idea here is to isolate
which interrupt's (potential) migration is triggering your deadlock.

3) Is this system ping-responsive during the deadlock?  If so, I can provide you
with a special module that triggers an oops on the reception of a malformed ping
packet.  We can use that to capture a core dump.

4) Let's monitor the system more closely.  In your attached sysreport, do you
have a timestamp you can reference when a hang occurred that we can correlate to
a point in the sar log?  It would be good to know what happened on the system
leading up to the hang.
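
(For item 2, a sketch of what raising the interrupt coalescence on the skge NIC
could look like; the interface name and values are illustrative, and whether the
driver honors every parameter is an assumption:)

# show current interrupt coalescing settings
ethtool -c eth0

# batch more work per interrupt by raising the coalescing delays
ethtool -C eth0 rx-usecs 250 tx-usecs 250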

Comment 10 Alexandre Oliva 2006-08-01 03:39:18 UTC
1) The problem still occurs in the latest rawhide kernel (2.6.17-1.2642.fc6)

2) I can't really do much in terms of removing the loaded modules et al.  The
system is my primary desktop, and I won't have another to play with for a while
yet.  Add to that the fact that the bug doesn't hit very often (once a day or
so) and you see that I can't do much, really.  I'm trying to get a serial cable
to get stack traces, but even that is proving to be very difficult :-(

It is not an active name server at all, BTW.  It only runs named locally,
serving itself, forwarding requests to an internal Red Hat name server or to my
main home DNS server.

3) The exact symptoms of the failure vary.  When it is the network card that
dies, it's no longer responsive to pings.  When it's something else, it is.

4) Sorry, I dropped the ball here.  I don't have anything that reliably (or even
unreliably) triggers the problem; it's not high load, low load, building stuff,
browsing the web, watching movies, nothing particular.  It just hits all of a
sudden, and then the box exposes one of the various problems.  I've even tried
disabling cpu frequency switching to see if it helped any, but the problem still
hit.

Comment 11 Neil Horman 2006-08-01 10:36:44 UTC
So that still leaves us where we were before.  Unless we can get a stack trace
or core dump, there isn't much at all I can do here.  I would focus on getting
the serial cable attached so we can get that stack trace or vmcore.  Also, I'm
attaching my ping crash module code.  You can build that for your system and
start auto-loading it, for the event that you get a lockup but your system is
still ping-responsive.

Comment 12 Neil Horman 2006-08-01 10:38:49 UTC
Created attachment 133386 [details]
patch to build the ping crash module

Here's the module code I mentioned.  Fair warning: it makes it possible to crash
your system through the reception of ICMP echo frames with properly formatted
pad data, so don't use it if you're not comfortable with that security risk.
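
(Once the module is loaded, the crash would be triggered from another box with
something along these lines; the actual pad pattern it keys off is defined in
the attached code, so the value below is only a placeholder:)

# send one echo request whose payload is filled with the trigger pattern
ping -c 1 -p deadbeef <address-of-the-hung-box>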

Comment 13 Alexandre Oliva 2006-08-05 16:06:58 UTC
I've finally got a serial cable.  I re-enabled irqbalance and quickly got two
disk failures, both of which started with nothing but a command time out :-( 
I'll attach the console log in a moment.

Comment 14 Alexandre Oliva 2006-08-05 16:18:23 UTC
Created attachment 133691 [details]
All I saw in the console when disks started failing because of irqbalance being on

Nothing informative, I'm afraid...  Disks became inaccessible without anything
useful sent to the serial console, and then I reset the box as it became
unusable.

Comment 15 Neil Horman 2006-08-07 11:48:05 UTC
Did you have sysrqs enabled?  Were you able to dump a sysrq-t?  I'm afraid
what's on here doesn't really provide anything to go on, except to say that it
appears your disks have started to operate poorly.  Given that this is what
we currently have to go on, I would suggest the following (see the sketch after
this list for items 1 and 2):

1) make sure that sysrqs are enabled in sysctl.conf, and capture a sysrq-t if
you can when this happens again.

2) enable smart (it should support sata controllers as of fc5, I think).  If the
drive itself is actually having a problem, that may help detect it early.

3) check with the sata card manufacturer to see if there is a firmware update
available for the controller.  Perhaps check whether anything was repaired
relating to interrupt migration or movement (which may explain why enabling
irqbalance triggers this crash).

4) Try the latest FC6 kernel.  There were a few problems fixed in the libata
code's understanding of drive return codes, I think.  Some may be applicable here.

5) If possible, archive the data on the raid array, switch the drive controller
to legacy (pata) mode, and rebuild the array.  Perhaps there is a heretofore
undiscovered bug in the libata code or the sata driver you are using.
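
(For items 1 and 2, a sketch assuming stock FC5 paths; older smartctl may need
"-d ata" for drives sitting behind libata:)

# 1) enable the magic SysRq key persistently
echo "kernel.sysrq = 1" >> /etc/sysctl.conf
sysctl -p

# 2) turn on SMART monitoring and query a drive by hand
chkconfig smartd on
service smartd start
smartctl -a /dev/sda        # add "-d ata" if the drive isn't recognized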


Comment 16 Alexandre Oliva 2006-08-08 19:45:35 UTC
1) sysrq-t will do, now that I have serial console, doh!  I forgot about it.

2) smart is active and reports no problems

3) I've got the latest non-beta BIOS for the motherboard, and the controller is
built into the motherboard.

4) I'm always running the latest FC devel kernel on this box, unless (i) it
hasn't finished installing yet, or (ii) it breaks badly.  With irqbalance off,
that is.

5) I'm using software raid and the controller is already in regular, non-raid
mode.  Is this what you meant by legacy (pata) mode?  If not, please clue me in ;-)

Thanks,

Comment 17 Neil Horman 2006-08-08 19:54:34 UTC
1) waiting on sysrq-t

2) good to know, although something seems awry between smart not delivering
errors and those log messages that you sent in.

3) have you looked at the beta bios errata list to see if anything relates to
your problem?

4) Ok, what were you running on the last crash?

5) By pata I mean parallel ATA, a.k.a. IDE mode.  Many SATA controllers have a
BIOS option by which they will identify themselves to the BIOS, and to the OS,
as an IDE controller.  Changing this mode is not normally recommended, as it
means you will need to change your hardware config in your OS and rebuild your
raid array, but if there is a driver problem, this lets you use the IDE driver
to get to your drives, which may alleviate the problem.

Comment 18 Alexandre Oliva 2006-08-15 08:13:51 UTC
Created attachment 134195 [details]
SysRq-T after the network stopped working

With today's kernel, I've been unable to duplicate the disk failures so far,
but I got a mouse failure and a network failure, both fixed by reloading the
corresponding modules ([eu]hci_hcd and skge, respectively), although I'm seeing
some slab corruption errors after reloading skge.
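
(Reloading amounted to roughly this; the mouse needed the USB host controller
modules, the network the skge module:)

# network: reload the skge driver
modprobe -r skge && modprobe skge

# mouse: reload the USB host controller drivers
modprobe -r ehci_hcd uhci_hcd && modprobe uhci_hcd && modprobe ehci_hcd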

I don't see anything useful in the state dump, do you?

Comment 19 Alexandre Oliva 2006-08-15 08:16:26 UTC
Created attachment 134196 [details]
Oopses I got after reloading skge, after a network failure

These are the oopses I got over the several minutes after I reloaded skge.
Kernel is 2.6.17-1.2564.fc6.x86_64.

Comment 20 Alexandre Oliva 2006-08-15 08:29:47 UTC
2) why/how would smart deliver errors if the entire disk subsystem stopped
working?  (actually, that might not be entirely true; I didn't try plain IDE HDs
during a failure scenario)

3) I have the beta BIOS handy, but I can't see any change list for it.  Maybe as
soon as the new box I purchased arrives I'll give it a try.

4) Err, I don't remember what that was any more; sorry that I didn't mention it
:-(

5) I don't see any BIOS options to switch the SATA controllers to plain PATA
mode :-(

Comment 21 Neil Horman 2006-08-15 12:48:54 UTC
2) The short answer is that smart won't deliver errors if the entire disk
subsystem just stops flat out.  But if the disk were starting to die, it would
hopefully report that to smartd before such a catastrophic failure.

3) Where did you get the BIOS from?  I can hunt for a change list if you like.

Don't worry about 4 and 5.  It just would have been helpful in an analysis if
you remembered, and not all sata controllers let you do what I suggested.  It
just would have been a good test if you were able.

As for the oops, I'm guessing that the slab corruption is just a result of an
isolated skge bug.  I expect that it's not as good at cleaning up after itself
as it thinks on module unload/reload, and the result is some leaked/reused
buffers.

It's good news, though, that you can't reproduce your previous failure.  When you
say that your mouse and your network driver failed, can you elaborate?  Clearly
your system didn't hang when these failures occurred, as it did before.  What
did you observe that made you reload those modules?

Comment 22 Alexandre Oliva 2006-08-15 17:23:35 UTC
2) the disks are perfectly fine, it's the entire disk subsystem (or perhaps the
sata subsystem, or the Promise controller only, although I've seen such failures
affect disks on the VIA SATA controller as well on similar boxes that have more
disks) that becomes inoperative.

3)
http://support.asus.com/download/download_item.aspx?model=A8V%20Deluxe&type=Latest&SLanguage=en-us#

I'm running 1017; 1018.001 is the latest beta.

It's not clear that I can't reproduce the previous failure; it sometimes took
2-3 days to get one such failure.  As for the network and mouse problems,
they're described in detail in this and in other bug reports such as bug 182618,
bug 181347, bug 181920, bug 181310.  All of them are triggered by having
irqbalance enabled, as stated in the beginning of this bug report.  Do you need
any other info as to the symptoms?

Comment 23 Neil Horman 2006-08-15 17:57:17 UTC
You may be in luck.  I was trolling about for others who may have had your same
set of problems, and I ran across this:
http://lkml.org/lkml/2006/5/16/89

Apparently, someone else at least has had your sata problems with your
motherboard/chipset.  It appears to be fixed by the patch referenced in this bug:
http://bugzilla.kernel.org/show_bug.cgi?id=5533
I'd rebuild your kernel with that patch to see how you fare (or, alternatively,
check to be sure that it made it into 2.6.18 and just get that kernel from
kernel.org).
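
(A sketch of the first option, applying the patch to a local kernel tree and
rebuilding; the paths and patch filename are illustrative:)

# assuming the patch from the kernel.org bug has been saved as sata-fix.patch
cd /usr/src/linux-2.6.17
patch -p1 --dry-run < ~/sata-fix.patch   # sanity check before applying
patch -p1 < ~/sata-fix.patch
make oldconfig && make -j2 bzImage modules
make modules_install install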


Comment 24 Alexandre Oliva 2006-08-15 18:31:55 UTC
The URLs mentioned in comment 23 appear to refer to a significantly different
problem.  At least, the symptoms don't match what I'm observing at all.  It's
not a boot-time problem; the problem only shows up at random after hours
(although sometimes just minutes) of regular use.

It's true that there's a chance that the patch you mention will fix the SATA
problems I've got, but the other problems still remain.  Maybe they are
independent, after all?

Comment 25 Neil Horman 2006-08-15 18:45:45 UTC
It would appear so.  Besides, we don't really have anything else to go on here.
Please confirm that the referenced patch is in the latest 2.6.18 kernel, and try
it out.