Bug 225399

Summary: "No irq handler for vector" error, sluggish system
Product: Fedora
Reporter: Joe Orton <jorton>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED WONTFIX
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 6
CC: adrian, bugzilla, cdk, d.rye, ebiederm, elver.loho, erik_horn, hugh, i.norton, jarod, k.georgiou, konradr, lars, mishu, mjw, rhbugs, shubnub, triage, wmealing, wouter, wtogami
Hardware: All   
OS: Linux   
Whiteboard: bzcl34nup
Doc Type: Bug Fix
Last Closed: 2008-05-06 19:07:46 UTC
Attachments:
  /proc/interrupts
  dmesg output
  Excerpt from the log
  irq table
  Complete log from April 5 test of the 2943 kernel
  log for 2944-kernel and "No irq handler for vector"-errors

Description Joe Orton 2007-01-30 13:05:41 UTC
Description of problem:
My workstation just got this kernel error and immediately became very
sluggish (it freezes for a second or so, then unfreezes). It's done this twice
now, both times after a large volume of network traffic, while running firefox
from a remote box using the local display.

Version-Release number of selected component (if applicable):
2.6.19-1.2895.fc6

How reproducible:
unclear

Comment 2 Joe Orton 2007-01-30 13:08:56 UTC
Created attachment 146914 [details]
/proc/interrupts

Comment 3 Joe Orton 2007-01-30 15:18:06 UTC
The system was still sluggish after reboot; the problem seemed to be I/O load,
since a raid0 array had started syncing.

Comment 4 Oleksiy Kohany 2007-02-08 18:54:09 UTC
I'm having a similar problem with kernel-2.6.19-1.2895.fc6 on a dual quad-core Xeon.
The system freezes completely during boot-up. The last message from the kernel:
kernel: do_IRQ: 0.161 No irq handler for vector

This happens consistently.
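For anyone collecting these reports from saved logs, here is a small sketch that pulls the messages out and tallies them. It assumes, from the 2.6-era x86_64 do_IRQ() printk, that the pair before the text is CPU.vector in decimal (so "0.161" would be CPU 0, vector 161); the "(irq N)" suffix only appears in later kernels, such as the one in comment 73. The helper names are hypothetical, not part of any existing tool.

```python
import re
from collections import Counter
from typing import List, NamedTuple, Optional

# Matches "do_IRQ: 0.161 No irq handler for vector" and the later
# "do_IRQ: 6.233 No irq handler for vector (irq -1)" form.
# Assumption: the first number is the CPU, the second the vector.
DO_IRQ_RE = re.compile(
    r"do_IRQ: (?P<cpu>\d+)\.(?P<vector>\d+) No irq handler for vector"
    r"(?: \(irq (?P<irq>-?\d+)\))?"
)


class MissingVector(NamedTuple):
    cpu: int
    vector: int
    irq: Optional[int]  # only present in the newer message format


def parse_missing_vectors(log_text: str) -> List[MissingVector]:
    """Return every "No irq handler for vector" event found in a log."""
    events = []
    for match in DO_IRQ_RE.finditer(log_text):
        irq = match.group("irq")
        events.append(
            MissingVector(
                cpu=int(match.group("cpu")),
                vector=int(match.group("vector")),
                irq=int(irq) if irq is not None else None,
            )
        )
    return events


def count_by_cpu_vector(events: List[MissingVector]) -> Counter:
    """Tally events per (cpu, vector) pair to spot a recurring vector."""
    return Counter((e.cpu, e.vector) for e in events)


if __name__ == "__main__":
    sample = """\
kernel: do_IRQ: 0.161 No irq handler for vector
redex kernel: do_IRQ: 0.65 No irq handler for vector
do_IRQ: 6.233 No irq handler for vector (irq -1)
"""
    for event in parse_missing_vectors(sample):
        print(event, hex(event.vector))
```

Counting per (CPU, vector) makes it easy to see whether every report involves the same vector, as the repeated "0.161" in this thread suggests for the PowerEdge machines.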

Comment 5 Konrad Rzeszutek 2007-02-08 18:58:40 UTC
How about a dmesg log?

Comment 6 Jarod Wilson 2007-02-08 20:33:26 UTC
I've had some do_IRQ spew a few times now since bumping to 2.6.19-1.2895.fc6 as
well... Sometimes it completely obliterates the network, requiring a reboot;
sometimes it doesn't.

The following shows up in dmesg perpetually, not only when the do_IRQ
spew appears, but here's what was there just after the last time:

----8<----
NETDEV WATCHDOG: eth2: transmit timed out
tg3: eth2: transmit timed out, resetting
tg3: eth2: Link is down.
bonding: bond0: link status definitely down for interface eth2, disabling it
device eth1 entered promiscuous mode
audit(1170965944.004:43): dev=eth1 prom=256 old_prom=0 auid=4294967295
tg3: eth2: Link is up at 1000 Mbps, full duplex.
tg3: eth2: Flow control is on for TX and on for RX.
bonding: bond0: link status definitely up for interface eth2.
device eth1 left promiscuous mode
audit(1170965954.006:44): dev=eth1 prom=0 old_prom=256 auid=4294967295
NETDEV WATCHDOG: eth2: transmit timed out
tg3: eth2: transmit timed out, resetting
tg3: eth2: Link is down.
bonding: bond0: link status definitely down for interface eth2, disabling it
device eth1 entered promiscuous mode
audit(1170966073.090:45): dev=eth1 prom=256 old_prom=0 auid=4294967295
tg3: eth2: Link is up at 1000 Mbps, full duplex.
tg3: eth2: Flow control is on for TX and on for RX.
bonding: bond0: link status definitely up for interface eth2.
device eth1 left promiscuous mode
audit(1170966083.091:46): dev=eth1 prom=0 old_prom=256 auid=4294967295
----8<----

It's interesting to note that it's only eth2 popping up there. I just double-checked,
and that's the onboard PCIe tg3 in this system, while eth0 and eth1 are PCI-X
tg3 cards... Are the other folks seeing this problem possibly using PCIe tg3
NICs as well?
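A quick way to check whether only the PCIe port is affected is to tally the watchdog lines per interface. This is a minimal sketch, assuming the log lines look exactly like the excerpt above; tx_timeouts_per_interface is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as "NETDEV WATCHDOG: eth2: transmit timed out".
# The driver's own "tg3: eth2: transmit timed out, resetting" line does not
# match, so each timeout is counted once.
WATCHDOG_RE = re.compile(r"NETDEV WATCHDOG: (?P<iface>[^:\s]+): transmit timed out")


def tx_timeouts_per_interface(dmesg_text: str) -> Counter:
    """Count netdev watchdog transmit timeouts per interface name."""
    return Counter(m.group("iface") for m in WATCHDOG_RE.finditer(dmesg_text))


if __name__ == "__main__":
    excerpt = """\
NETDEV WATCHDOG: eth2: transmit timed out
tg3: eth2: transmit timed out, resetting
bonding: bond0: link status definitely down for interface eth2, disabling it
NETDEV WATCHDOG: eth2: transmit timed out
tg3: eth2: transmit timed out, resetting
"""
    print(tx_timeouts_per_interface(excerpt))  # Counter({'eth2': 2})
```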

Comment 7 Chuck Ebbert 2007-02-08 23:20:28 UTC
There is a patch for this in -mm:

http://marc2.theaimsgroup.com/?l=linux-mm-commits&m=117046481708594&q=raw



Comment 8 Joe Orton 2007-02-09 08:41:26 UTC
Created attachment 147746 [details]
dmesg output

Sorry, forgot to attach this before.

Comment 9 Joe Orton 2007-02-09 08:57:08 UTC
Like Jarod: PCIe tg3 here too, and it's always network traffic (over that
interface) which triggers the problem, and sometimes the network interface
completely dies because of it.

04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5751 Gigabit Ethernet PCI Express (rev 01)


Comment 10 Jarod Wilson 2007-02-09 14:10:02 UTC
I'll have a test kernel based on 2.6.19-1.2895.fc6 together shortly...

Comment 11 Jarod Wilson 2007-02-12 14:35:33 UTC
I've been running 3 days without the spew recurring on a patched-up kernel, but a
good chunk of that was the weekend, so I'm withholding judgement until I have a
chance to hammer the box with some network I/O later today. So far, so good...

Comment 12 Jarod Wilson 2007-02-12 22:19:43 UTC
Still looking good after a good amount of hammering on the system. Joe, want a
copy of the kernel rpm?

Comment 13 Joe Orton 2007-02-13 10:15:52 UTC
Yes please.

Comment 15 Chuck Ebbert 2007-02-15 20:25:04 UTC
*** Bug 224643 has been marked as a duplicate of this bug. ***

Comment 16 Jeff Genender 2007-02-16 03:41:50 UTC
Excellent fix... please indicate which Fedora kernel release includes the fix when it's out ;-)

Comment 17 Mark Wielaard 2007-02-17 21:20:17 UTC
I had similar issues on 2.6.19-1.2911.fc6 ("No irq handler for vector" kernel
messages, after that a sluggish system and/or lockup of the whole machine).
Disabling the irqbalance service made it go away.

Comment 18 Wouter de Jong 2007-02-20 16:03:39 UTC
Same problem here, with all our (new) Dell PowerEdge 1950s (SATA/SAS, (dual)
dual-core Xeon). Perfect for testing this, since they're pretty useless in
production right now.

Could anyone who has a rebuilt kernel that fixes this send it to me? :))
It would save me from building one... (x86_64)

Any chance we'll see a new update soon that contains the fix? :)

Comment 19 Jarod Wilson 2007-02-20 19:26:13 UTC
A test kernel carrying this patch is available here:

http://people.redhat.com/jwilson/test_kernels/bz225399/

Completely unofficial, may kill kittens, etc., but works for me. :)

I haven't looked to see what the upstream status of this patch is; last I saw, it
was in -mm, so I don't know when it'll make it into an official update. Chuck,
any ideas on that front?

Comment 20 Chuck Ebbert 2007-02-20 19:29:21 UTC
A 2.6.20 kernel with the patch is building for FC5 and FC6 now.


Comment 21 Chuck Ebbert 2007-02-23 14:42:58 UTC
This should be fixed in the latest kernel, 2.6.19-1.2911.6.3, available at:

http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/6/

Please test.


Comment 22 Ian Norton 2007-02-27 11:05:45 UTC
I've downloaded and installed kernel-2.6.19-1.2911.6.3.fc6.x86_64.rpm and I
still have the same problem on my Dell 2950.  Boots and then hangs within 60
seconds.

Disabling irqbalance allows it to boot normally as detailed above.

Thanks for the test kernel, but it looks like it's not fixed :(

Comment 23 Jeff Genender 2007-03-05 04:09:20 UTC
I just tried 2.6.19-1.2911.6.4.fc6 and this problem still exists. I can also
confirm that disabling irqbalance does prevent the problem from occurring.
But I believe it is beneficial to have this feature enabled on quad-core
processors, so I will be sticking with 2.6.18-1.2869.fc6 until this problem
has been fixed. I also believe the severity level should be raised to high... I
think this is an important bug.

Comment 24 Ian Norton 2007-03-05 15:41:34 UTC
(In reply to comment #23)
> I just tried 2.6.19-1.2911.6.4.fc6 and this problem still exists.  I can confirm
> also that disabling irqbalance indeed does prevent the problem from occurring. 
> But I believe on quad core processor, it is beneficial to have this feature
> enabled, therefore I will be sticking to 2.6.18-1.2869.fc6 until this problem
> has been fixed.

Same here with 2.6.19-1.2911.6.4.fc6.  Dropping back would be a great option if
you're not also being hit by http://bugzilla.kernel.org/show_bug.cgi?id=7727 in
2.6.18-1.2798.fc6.

Is there any possibility of getting this escalated please?

At the moment I either have a system that won't boot or one that crashes
randomly.  Alternatively, can anyone comment on the relative merits/woes of
running without irqbalance?

Thanks

Comment 25 Chuck Ebbert 2007-03-05 16:07:46 UTC
People tested the patch and it worked for them. Can anyone still
confirm that what was released works? Maybe we are looking at some
kind of chipset-specific problem or a bug in irqbalance itself?


Comment 26 Jeff Genender 2007-03-05 18:54:02 UTC
Two people in this thread tested the patch and it clearly does not work. I can
confirm that 2.6.18-1.2869.x64.fc6 does work with irqbalance running, with no
problems there. All releases after that show the "do_IRQ: 0.161 No irq
handler for vector" error when irqbalance is run. This results in a frozen system
requiring a power-off. I disabled irqbalance and booted fine; upon running
irqbalance from a terminal, the console showed the aforementioned vector error
and the system seized. irqbalance does appear to have an impact on this, but as
I stated, it works fine on older kernels.

As for the chipset, I am running a brand spanking new Dell 2960 with a quad core
X5355 2.66GHz Intel Xeon.  I can get the motherboard specs from Dell if you need
them.

I also would be interested in knowing what would be the impact of not running
irqbalance.  Any input would be appreciated.

Comment 27 Jarod Wilson 2007-03-05 22:45:41 UTC
The folks still having problems... Do the problems exist with both 2911.6.4.fc6
as well as the bz-tagged test kernel I threw out there? Just wondering if
there's some minor difference there.

Since the machine impacted by this issue is my primary workstation, I've still
not switched to 2911.6.4.fc6 from my test kernel... I'll try to do so tomorrow
morning to see if things are any different.

Comment 28 Charles Kizer 2007-03-06 15:56:57 UTC
Exactly the same kernel, configuration, and symptoms as Jeff (Comment #26) 
running a brand new Dell 2950 with two 3GHz dual core Xeon 5100 series 
(Woodcrest) processors.  This machine will run x86_64, so it will remain in 
test until stable. :)


Comment 29 Adrian Reber 2007-03-06 21:01:12 UTC
I have seen it today with 2.6.19-1.2911.6.4.fc6 on a 4-way dual Opteron (Dell
6950). We switched back to 2.6.18-1.2869.fc6 which seems to help.

Comment 30 Jarod Wilson 2007-03-06 21:35:49 UTC
Now running 2.6.19-1.2911.6.5.fc6 on my box. Hurrying up and waiting to see if
it reproduces or not... :)

Comment 31 Charles Kizer 2007-03-06 23:30:51 UTC
Just installed 2.6.19-1.2911.6.5.fc6.  Still get the "do_IRQ: 0.161 No irq
handler for vector" when irqbalance is run.  No change from previous 2911.6.4 
kernel.  :(


Comment 32 Szombathelyi György 2007-03-07 14:00:40 UTC
Just for the record, the 2911.6.5.fc6 kernel still has this issue for me on a Dell
PowerEdge 1950 (with one dual-core Xeon). Stopping irqbalance (or not starting it
at all) makes the problem go away.

Comment 33 Jarod Wilson 2007-03-09 14:04:13 UTC
For whatever it's worth, I'm sitting on 3 days of uptime w/kernel 1.2911.6.5.fc6,
so it looks like it fixes things for at least some people.

Comment 35 Pete Toscano 2007-03-09 14:10:51 UTC
Just adding a "me too" here.  I'm trying to get recent kernels running on a Dell
PowerEdge 1950 with two dual core 3GHz Xeons x86_64.  2.6.18-1.2869 seems to be
running well (I think).  All 2.6.19 kernels up to and including
2.6.19-1.2911.6.5 lock hard on boot with "do_IRQ: 0.161 No irq handler for vector".

Comment 36 Chuck Ebbert 2007-03-09 15:01:04 UTC
I just put a test kernel at:

http://people.redhat.com/cebbert/kernels/

There is a different bug fix in this kernel and some feedback on whether it
works would really help.



Comment 37 Ian Norton 2007-03-09 15:39:59 UTC
Hi Chuck,

Same problem with your test kernel on my Dell 2950.

I've had to switch to using 2.6.19 on my live service with irqbalance disabled
because of the reliability problems we're seeing with 2.6.18 (see my earlier
comments).  So far I'm not seeing huge performance issues, but I've only made
the change this morning.

Comment 38 Jarod Wilson 2007-03-09 20:30:06 UTC
Okay, we've got a PE1950 in-house that reproduces the problem. Trying various
things on it now...

Comment 39 Jarod Wilson 2007-03-09 21:06:34 UTC
Rawhide kernel 2.6.20-1.2981.fc7 boots up just fine.

Comment 40 Chuck Ebbert 2007-03-10 23:45:34 UTC
There is a long list of patches that need backporting to fix this, and it
appears that only a small number of systems are affected (the fixes we have work
for most.) Given that we have a workaround, disabling irqbalance, this fix will
have to wait.

In the meantime, affected people who want some kind of irq balancing will have
to do it manually using /proc/irq/*/smp_affinity. Googling for

 proc irq smp affinity

finds plenty of help. I recommend putting the timer interrupt (0) on a CPU by
itself if possible.

Comment 41 Charles Kizer 2007-03-11 17:31:19 UTC
Thanks for the work guys.  Is this an Intel 5000 series chipset issue or 
something more Dell specific?  This is to satisfy my curiosity only, don't 
answer if it's going to slow down the fix.

Comment 42 Pete Toscano 2007-03-11 19:27:55 UTC
(In reply to comment #36)
> I just put a test kernel at:
> 
> http://people.redhat.com/cebbert/kernels/
> 
> There is a different bug fix in this kernel and some feedback on whether it
> works would really help.

This works just fine for me.  Well, at least I can boot (unlike all the other
2.6.19 kernels).  With only 13 minutes of uptime, I can't say if there are other
beasties lurking beneath the surface, but the "No irq handler for vector"
beastie appears slain with 2.6.20-1.2924.fc6.

Now, the question is, do I stick with the default of leaving irqbalance running
or turn it off.  I'm a big fan of sticking to the defaults when it comes to
stuff I know nothing about, but I get the impression that irqbalancing isn't all
that anyway.

Thanks.


Comment 43 Jarod Wilson 2007-03-12 03:39:10 UTC
(In reply to comment #41)
> Thanks for the work guys.  Is this an Intel 5000 series chipset issue or 
> something more Dell specific?  This is to satisfy my curiosity only, don't 
> answer if it's going to slow down the fix.

Not sure specifically what the issue is, but all the systems that appear to have issues are Dell PowerEdge 
systems with Xeon 5000-series processors. My workstation is a Dell Precision 490, which also has Xeon 
5000-series procs, and works fine with 2.6.19-1.2911.6.5.fc6.

Comment 44 Jeff Genender 2007-03-12 05:52:59 UTC
(In reply to comment #40)
> There is a long list of patches that need backporting to fix this, and it
> appears that only a small number of systems are affected (the fixes we have work
> for most.) Given that we have a workaround, disabling irqbalance, this fix will
> have to wait.

Wait for how long? FC7? Next kernel? I am curious to know because these Dell
enterprise-class servers seem to be widely used with Red Hat, and since they
are enterprise-class, I would think this would be a concern as well as a priority.

More importantly, I am interested in whether this will be released, and in which
kernel, since it has been a real PITA to update the kernel remotely, cross
fingers, reboot, find a seizure, and then make a long haul to the data center to
reset the kernel to boot 2.6.18. Ok... this is my personal issue, but a PITA
nonetheless.

If you could let us know when this fix may be seen, it would allow some of us to
decide whether to go to another distro at this juncture.

As a positive note...thanks for your attention to this detail thus far.

Comment 45 Jarod Wilson 2007-03-12 13:17:49 UTC
(In reply to comment #44)
> (In reply to comment #40)
> > There is a long list of patches that need backporting to fix this, and it
> > appears that only a small number of systems are affected (the fixes we have work
> > for most.) Given that we have a workaround, disabling irqbalance, this fix will
> > have to wait.
> 
> Wait for how long?  FC7?  Next kernel?

I'll let Chuck answer that, he knows better than I do. Note that the current
in-development F7 kernel does work just fine on a PE1950 in-house where I was
able to reproduce the IRQ lockups with 2.6.19-1.2911.6.5.fc6, so worst-case is
that it'll work in F7, or you can use 2.6.19-1.2911.6.5.fc6 with irqbalance shut
off. Or use a 2.6.18 kernel. Or use the 2.6.20 FC6 updates-testing kernel.

> I am curious to know because these Dell
> Enterprise class servers seem to be a big RedHat user and I would think that due
> to this being Enterprise class, it would be a concern as well as a priority.

Are we talking Fedora or Red Hat *Enterprise* Linux here?... ;)

Since only a few systems are impacted and there's an easy work-around, this
isn't as high priority as other things. The necessary fixes are in upstream
kernels now, but the back-porting effort to 2.6.19 is non-trivial, so it's a far
better use of time and resources to simply inherit this fix when we have a
2.6.20 or 2.6.21 kernel released for FC6, which ought to happen at some juncture.

> More importantly, I am interested in if this will be released and at which
> kernel since it has been a real PITA to update the kernel remotely, cross
> fingers, reboot, find a seizure, which means a long haul to the data center to
> reset the kernel to boot at 2.6.18.  Ok...this is my personal issue, but a PITA
> none the less.

Sounds like you could use a DRAC (or some sort of serial console and power
control)... Barring that, set a boot param of panic=60 or some such thing, and
use grub's boot-once feature. See step 14 here for details:
http://togami.com/~warren/guides/remoteraidcrazies/


Comment 46 Chuck Ebbert 2007-03-12 13:35:08 UTC
(In reply to comment #45)
> Since only a few systems are impacted and there's an easy work-around, this
> isn't as high priority as other things. The necessary fixes are in upstream
> kernels now, but the back-porting effort to 2.6.19 is non-trivial, so its a far
> better use of time and resources to simply inherit this fix when we have a
> 2.6.20 or 2.6.21 kernel released for FC6, which ought to happen at some juncture.
> 

Unfortunately the fixes aren't in 2.6.20 -- they're only in 2.6.21-rc2 and rc3.



Comment 47 Chuck Ebbert 2007-03-12 19:23:21 UTC
*** Bug 231871 has been marked as a duplicate of this bug. ***

Comment 48 J. David Rye 2007-03-14 14:01:45 UTC
(In reply to comment #36)
> I just put a test kernel at:
> 
> http://people.redhat.com/cebbert/kernels/
> 
> There is a different bug fix in this kernel and some feedback on whether it
> works would really help.
> 
> 
I have an HP DL3600-G5 with a Xeon 5110 CPU.
Installing kernel-2.6.20-1.2925.fc6.x86_64.rpm from the above link changes
the problem.
With 2.6.19, the machine stops dead within a few seconds of irqbalance starting.
With the new 2.6.20 kernel, you still get the "No irq handler for vector"
message on the console after a couple, but the console keeps running.
The snag is that the network is dead. Running
  service network restart
brings the network back to life.

Am I right in thinking that this is the same problem as was being discussed here

http://lkml.org/lkml/2007/2/2/275

Any idea when a full fix will be available as a Fedora rpm?
   


Comment 49 Eric W. Biederman 2007-03-19 04:51:11 UTC
If people are having problems and want to be certain that this issue is fixed
please test 2.6.21-rc4.  Giving positive or negative test reports on that
configuration would be very much appreciated.

I am reasonably certain I have fixed all problems that I understand so if you do
have problems there I need to know about it so I can work with you to figure out
what is going on.

Getting "no irq for vector" occasionally is expected from the partial fix that
was easy to backport. I would generally expect that to be completely without
side effects when it does occur. In a few instances it could drop an irq the
driver was expecting and confuse it. Reinitializing the driver should be enough
to bring it back in that case.

I made a small attempt to reproduce this on a Dell PowerEdge 2950 with no luck,
so I clearly don't have the right configuration. So I suspect there is some
selection bias among the reporters of this problem.


Eric


Comment 50 Chuck Ebbert 2007-03-19 13:13:17 UTC
Eric, were you running irqbalance? The 2950s seem to crash when that starts.

Comment 51 Chuck Ebbert 2007-03-30 20:51:53 UTC
A test kernel that may resolve this issue is available at:

  http://people.redhat.com/cebbert


Comment 52 Chuck Ebbert 2007-04-02 14:48:47 UTC
Apparently nobody is testing the bug fixes.

Should I close this bug with CANTFIX since there's no way to fix it
without testers?


Comment 53 Ian Norton 2007-04-02 23:28:59 UTC
> Apparently nobody is testing the bug fixes.
>
> Should I close this bug with CANTFIX since there's no way to fix it
> without testers?

I have a test system I can check this on but not until Wednesday, sorry. 
Damn this inconveniently timed holiday of mine.... ;)

Comment 54 Chuck Ebbert 2007-04-05 16:50:09 UTC
Please test kernel 2943, it is at http://people.redhat.com/cebbert
and is also going into fedora-updates-testing.

Comment 55 Lars E. Pettersson 2007-04-05 17:49:03 UTC
Created attachment 151790 [details]
Excerpt from the log

Comment 56 Lars E. Pettersson 2007-04-05 17:50:56 UTC
Created attachment 151791 [details]
irq table

Comment 57 Lars E. Pettersson 2007-04-05 17:54:03 UTC
Well, it did not completely hang the system, as was the case before, but it still
has problems.

I installed the 2943 kernel, and started irqbalance. About 15 minutes later, the
network stopped working.

[lars@tux ~]$ uname -a
Linux tux.home.rpz 2.6.20-1.2943.fc6 #1 SMP Wed Apr 4 15:24:50 EDT 2007 x86_64
x86_64 x86_64 GNU/Linux

Comment 58 Chuck Ebbert 2007-04-05 17:55:19 UTC
(In reply to comment #55)
> Created an attachment (id=151790) [edit]
> Excerpt from the log
> 

Was this with kernel 2943?
Can you post the complete log from this bootup?

If it was 2943, can you try the kernel option "pci=msi,mmconf"?
Also please try just "pci=msi" and "pci=mmconf" separately.
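When juggling several pci= variants it's easy to lose track of which one a given boot actually used. A small sketch that reads the options back out of a kernel command line string (for example the contents of /proc/cmdline); it assumes simple whitespace-separated tokens without quoted values, and parse_cmdline and pci_options are hypothetical names.

```python
from typing import Dict, List, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Map each token of a kernel command line to its value.

    Bare flags such as "quiet" map to None. Quoted values containing
    spaces are not handled, and a repeated key keeps its last value.
    """
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def pci_options(cmdline: str) -> List[str]:
    """Collect the comma-separated options from every pci= token, in order."""
    options: List[str] = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "pci" and sep:
            options.extend(opt for opt in value.split(",") if opt)
    return options


if __name__ == "__main__":
    # Example string; on a test box you would read /proc/cmdline instead.
    line = "ro root=LABEL=/ rhgb quiet pci=msi,mmconf"
    print("pci options:", pci_options(line))        # ['msi', 'mmconf']
    print("root:", parse_cmdline(line)["root"])     # LABEL=/
```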

Comment 59 Eric W. Biederman 2007-04-05 18:06:26 UTC
Lars E. Pettersson (lars), have you ever seen the message
"no irq for vector"?

From the log you posted I just see "irq 23 and nobody cared"
Irq 23 is an ioapic irq for your nic, and is not an MSI irq so we don't
need to worry about msi issues.

I don't see how "no irq for vector" could turn into a screaming irq so
this looks like a different issue.

Without the complete log that Chuck asked for I can't be certain of course.
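To check Eric's point about which IRQs are IO-APIC and which are MSI, the type column of /proc/interrupts is enough. A rough sketch, assuming the 2.6.20-era layout (a header row of CPU columns, then "N: counts... type devices"); parse_interrupts and interrupt_kind are hypothetical helpers, newer kernels add columns this does not model, and the sample below is made up rather than taken from the attached irq table.

```python
from typing import List, NamedTuple


class IrqLine(NamedTuple):
    irq: int
    counts: List[int]  # one interrupt count per CPU column
    chip: str          # e.g. "IO-APIC-level", "IO-APIC-edge", "PCI-MSI"
    devices: str       # rest of the line, e.g. "eth0"


def parse_interrupts(text: str) -> List[IrqLine]:
    """Parse the numbered rows of a 2.6.20-era /proc/interrupts dump."""
    lines = text.splitlines()
    if not lines:
        return []
    ncpus = len(lines[0].split())  # header row: "CPU0 CPU1 ..."
    rows: List[IrqLine] = []
    for line in lines[1:]:
        fields = line.split()
        if not fields or not fields[0].rstrip(":").isdigit():
            continue  # skip NMI/LOC/ERR/MIS summary rows
        counts = [int(c) for c in fields[1 : 1 + ncpus]]
        chip = fields[1 + ncpus] if len(fields) > 1 + ncpus else ""
        devices = " ".join(fields[2 + ncpus :])
        rows.append(IrqLine(int(fields[0].rstrip(":")), counts, chip, devices))
    return rows


def interrupt_kind(chip: str) -> str:
    """Rough classification by interrupt type name."""
    if "MSI" in chip:
        return "msi"
    if chip.startswith("IO-APIC"):
        return "ioapic"
    return "other"


if __name__ == "__main__":
    sample = """\
           CPU0       CPU1
  0:    1234567          0    IO-APIC-edge  timer
 23:      12345         67   IO-APIC-level  eth0
 58:       1000          0         PCI-MSI  eth2
NMI:          0          0
LOC:    1234567    1234567
ERR:          0
"""
    for row in parse_interrupts(sample):
        print(row.irq, interrupt_kind(row.chip), row.devices)
```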



Comment 60 Chuck Ebbert 2007-04-05 19:18:53 UTC
(In reply to comment #59)
> 
> I don't see how "no irq for vector" could turn into a screaming irq so
> this looks like a different issue.
> 
> Without the complete log that Chuck asked for I can't be certain of course.

Eric, I changed the defaults in this kernel so MSI and MMCONFIG are
disabled by default. Could some hardware not work right in that case?

Also, the Intel patch for flushing MSI registers was applied:

http://cvs.fedora.redhat.com/viewcvs/*checkout*/rpms/kernel/FC-6/linux-2.6-20.5y_msix_flush_writes.patch

Looks like it is the first patch, not the updated one but it should only affect
MSI anyway.

And the upstream patch converting apic destinations to 8-bit went in:

[PATCH] x86-64: update IO-APIC dest field to 8-bit for xAPIC
http://cvs.fedora.redhat.com/viewcvs/*checkout*/rpms/kernel/FC-6/linux-2.6-20_x86_64_xapic_8_bit_dest.patch

Otherwise this is straight 2.6.20.5 code...
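Some context for the last patch: in an IO-APIC redirection table entry the destination lives in bits 63:56, and the original IO-APIC layout only uses bits 59:56 for a physical APIC ID, while xAPIC allows all 8 bits. The toy sketch below shows how a 4-bit field silently truncates an APIC ID of 16 or more. It is an illustration under that assumption with hypothetical helper names, not the kernel's code, and nothing in this thread establishes that such truncation caused the reported hangs.

```python
DEST_SHIFT = 56  # destination field occupies bits 63:56 of the 64-bit entry


def set_physical_dest(entry: int, apic_id: int, width: int) -> int:
    """Return `entry` with its physical destination set to `apic_id`.

    Only the low `width` bits of the field are written: 4 models the
    original IO-APIC layout (bits 59:56), 8 models xAPIC (bits 63:56).
    Higher bits of `apic_id` are dropped, as a narrow bitfield would do.
    """
    field_mask = ((1 << width) - 1) << DEST_SHIFT
    return (entry & ~field_mask) | ((apic_id << DEST_SHIFT) & field_mask)


def physical_dest(entry: int) -> int:
    """Read back the full 8-bit destination field."""
    return (entry >> DEST_SHIFT) & 0xFF


if __name__ == "__main__":
    vector = 0xA1  # bits 7:0 of the entry hold the vector (161 here)
    for width in (4, 8):
        entry = set_physical_dest(vector, apic_id=0x12, width=width)
        print(f"{width}-bit field: dest=0x{physical_dest(entry):02x}, "
              f"vector=0x{entry & 0xFF:02x}")
```

With the 4-bit field, APIC ID 0x12 comes back as 0x02, so the interrupt would be steered to a different CPU than intended.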


Comment 61 Jarod Wilson 2007-04-05 22:33:33 UTC
I'd test the kernel and report results from the pe1950 I've got in the lab, but
at the moment, I can't get it to boot much of any kernel but an oldish rawhide
one... :\

Comment 62 Lars E. Pettersson 2007-04-09 15:09:11 UTC
Sorry for the delay in answering.

To Chuck Ebbert. OK, I'll try with "pci=msi,mmconf", and also with "pci=msi" and
"pci=mmconf" separately. I'll attach the complete log from April 5th.

To Eric W. Biederman. Yes, I have seen the "no irq for vector" with earlier
kernels and irqbalance running, but not now with the 2943 kernel; this time I
have only seen the "nobody cared" message.

I should perhaps also mention that without irqbalance running, the 2943 kernel
has worked without any problems for me.


Comment 63 Lars E. Pettersson 2007-04-09 15:11:25 UTC
Created attachment 151993 [details]
Complete log from April 5 test of the 2943 kernel

Comment 64 D. Hugh Redelmeier 2007-04-13 02:13:55 UTC
In the last couple of days, I've gotten a couple of these error messages.  Here
is the latest: kernel: do_IRQ: 1.77 No irq handler for vector

My machine is an HP Pavilion a1250n (CPU: AMD Athlon 64 X2 3800+, Chipset ATI
Radeon Xpress 200)
http://h10025.www1.hp.com/ewfrf/wc/genericDocument?cc=us&docname=c00485646&lc=en

The kernel: kernel-2.6.20-1.2933.fc6 for x86_64

I had never seen these before installing this kernel.  I was running
2.6.19-1.2895.fc6 from Jan 30 to Apr 3 and did not see it.

Both times that I've seen it, I was using mplayer to play a TV program
recorded by MythTV.  The program was being served via HTTP by another machine.

The machine did not crash and the machine did not seem sluggish.  However, after
the message, mplayer could no longer play recorded programs.  It would
constantly stutter and repeat one tiny section (perhaps a half of a second). 
When I try XMMS, it too just repeats a similar small bit.

I will look into http://people.redhat.com/cebbert/kernels/

My dmesg contains a lot of lines like this, to the exclusion of anything
interesting:
APIC error on CPU0: 40(40)
This isn't new with my current kernel.
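The hex value in "APIC error on CPU0: 40(40)" is a local APIC Error Status Register value. A small decoder, assuming the bit layout documented in Intel's SDM (bits 0 and 1 exist only on older processors; check the vendor manual for AMD parts); decode_esr and decode_apic_error_line are hypothetical helpers, and only the first of the two values is decoded. Under that assumption 0x40 is bit 6, "Received illegal vector".

```python
import re
from typing import List, Optional, Tuple

# Local APIC Error Status Register bits, following Intel's SDM layout.
ESR_BITS = {
    0: "Send checksum error",
    1: "Receive checksum error",
    2: "Send accept error",
    3: "Receive accept error",
    4: "Redirectable IPI",
    5: "Send illegal vector",
    6: "Received illegal vector",
    7: "Illegal register address",
}

APIC_ERROR_RE = re.compile(
    r"APIC error on CPU(\d+): ([0-9a-fA-F]+)\(([0-9a-fA-F]+)\)"
)


def decode_esr(value: int) -> List[str]:
    """Names of the error bits set in an ESR value, lowest bit first."""
    return [name for bit, name in sorted(ESR_BITS.items()) if value & (1 << bit)]


def decode_apic_error_line(line: str) -> Optional[Tuple[int, List[str]]]:
    """Return (cpu, decoded bits of the first hex value) for one dmesg line."""
    m = APIC_ERROR_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), decode_esr(int(m.group(2), 16))


if __name__ == "__main__":
    print(decode_apic_error_line("APIC error on CPU0: 40(40)"))
    # (0, ['Received illegal vector'])
```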


Comment 65 Lars E. Pettersson 2007-04-14 11:55:34 UTC
Just to let you know, I have now tested the new 2944 kernel; it still has the same
problem with irqbalance running.

Will do some more tests with "pci=msi,mmconf", and also with "pci=msi" and
"pci=mmconf" separately. I have not yet seen any irq-problems with 2943,
irqbalance running, and "pci=msi,mmconf", but I have not had time to do any
lengthy tests, so I am not sure that that is the cure.



Comment 66 Lars E. Pettersson 2007-04-14 11:58:19 UTC
Created attachment 152611 [details]
log for 2944-kernel and "No irq handler for vector"-errors

Comment 67 Lars E. Pettersson 2007-04-14 18:29:50 UTC
Kernel 2944 with irqbalance running and kernel option "pci=msi,mmconf" crashed
(not completely, but went into an unusable state) with "No irq handler for
vector" after seven hours.

Comment 68 D. Hugh Redelmeier 2007-04-15 21:23:07 UTC
Since I posted #64, I've been running Linux version 2.6.20-1.2944.fc6 (x86_64)
from http://people.redhat.com/cebbert/kernels/

I again got:
redex kernel: do_IRQ: 0.65 No irq handler for vector
This is the first time since my last report.
This was again during mplayer playing back a recorded TV program.
The sound started looping again, and anything I play now loops (until reboot, I
assume).

Comment 69 Anssi Johansson 2007-08-02 16:13:28 UTC
I just got the following error messages on 2.6.20-1.2962.fc6 (x86_64, dual-core
Athlon 64, Asustek A8N32-SLI Deluxe, SATA hard disk, 4GB mem, ATI Radeon RV100):

jaguaari kernel: do_IRQ: 1.211 No irq handler for vector
jaguaari kernel: journal commit I/O error

As a result of this error message, the system was unable to write anything to
the filesystem and had to be forcibly rebooted. irqbalance (-0.55-2.fc6) was
running when this occurred; I'll try disabling it to see if that helps.

Comment 70 Anssi Johansson 2007-08-02 16:52:28 UTC
http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.20.2 :

"x86-64: survive having no irq mapping for a vector
    
 Occasionally the kernel has bugs that result in no irq being found for a
 given cpu vector.  If we acknowledge the irq the system has a good chance
 of continuing even though we dropped an irq message.  If we continue to
 simply print a message and not acknowledge the irq the system is likely to
 become non-responsive shortly there after."

Sounds like an interesting fix.
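One plausible way to see why acknowledging matters: the local APIC won't deliver a vector whose priority class (vector >> 4) is not above that of the highest vector still in service, and only an EOI clears the in-service bit. The toy model below is a deliberate simplification (no task priority register, no pending-queue ordering) with hypothetical names; it only shows that an unacknowledged stray vector 161 blocks a lower-class device vector such as 65, which is consistent with the "non-responsive" behavior the ChangeLog describes.

```python
from typing import Set


class ToyLocalApic:
    """A deliberately tiny local APIC model: only the in-service set.

    Ignores the task priority register, pending-request ordering, and
    everything else a real APIC does.
    """

    def __init__(self) -> None:
        self.in_service: Set[int] = set()

    def _priority_class(self) -> int:
        # Priority class of the highest in-service vector (0 if none).
        return max((v >> 4 for v in self.in_service), default=0)

    def deliver(self, vector: int) -> bool:
        """Accept `vector` only if its class beats everything in service."""
        if (vector >> 4) <= self._priority_class():
            return False
        self.in_service.add(vector)
        return True

    def eoi(self) -> None:
        """End-of-interrupt retires the highest-priority in-service vector."""
        if self.in_service:
            self.in_service.remove(max(self.in_service))


def handle(apic: ToyLocalApic, vector: int, handlers: Set[int], ack_unknown: bool) -> str:
    if not apic.deliver(vector):
        return "blocked"
    if vector in handlers:
        apic.eoi()
        return "handled"
    if ack_unknown:
        apic.eoi()  # newer behavior per the ChangeLog: ack and drop
        return "dropped, acked"
    return "dropped, not acked"  # older behavior: only print a message


if __name__ == "__main__":
    for ack in (False, True):
        apic = ToyLocalApic()
        stray = handle(apic, 161, handlers={65}, ack_unknown=ack)
        device = handle(apic, 65, handlers={65}, ack_unknown=ack)
        print(f"ack_unknown={ack}: stray vector -> {stray}, vector 65 -> {device}")
```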

Comment 71 Bug Zapper 2008-04-04 06:00:37 UTC
Fedora apologizes that these issues have not been resolved yet. We're
sorry it's taken so long for your bug to be properly triaged and acted
on. We appreciate the time you took to report this issue and want to
make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6,
please note that Fedora no longer maintains these releases. We strongly
encourage you to upgrade to a current Fedora release. In order to
refocus our efforts as a project we are flagging all of the open bugs
for releases which are no longer maintained and closing them.
http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6, thirty days
from now, it will be closed 'WONTFIX'. If you can reproduce this bug in
the latest Fedora version, please change to the respective version. If
you are unable to do this, please add a comment to this bug requesting
the change.

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we are following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

And if you'd like to join the bug triage team to help make things
better, check out http://fedoraproject.org/wiki/BugZappers

Comment 72 Bug Zapper 2008-05-06 19:07:44 UTC
This bug is open for a Fedora version that is no longer maintained and
will not be fixed by Fedora. Therefore we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 73 Elver Loho 2010-03-05 12:48:28 UTC
Re-opening since it's affecting our 64bit 16-core FC12 system.

This is the error message I saw in our log today:

do_IRQ: 6.233 No irq handler for vector (irq -1)

And the same bug has been cropping up in the latest versions of Ubuntu as well:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/480997

So I'm wondering whatever happened to porting that patch over. Did anyone get around to it?

And no, I don't have a reliable way of reproducing the bug. It seems to have happened while I was untarring about three million tiny files onto a hardware RAID10 array while at the same time a software RAID10 array over SSD drives was doing re-checking. Basically very high I/O load.

Comment 74 Elver Loho 2010-03-07 12:53:55 UTC
I forgot to mention. This is the kernel version:

[root@sahtel ~]# uname -a
Linux sahtel 2.6.31.12-174.2.22.fc12.x86_64 #1 SMP Fri Feb 19 18:55:03 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

Comment 75 Wade Mealing 2012-06-18 03:04:05 UTC
Heads up Elver, I don't think this is re-opened; it is still reporting as closed to me. I know this is 2 years late, on a comment that was already two years later... but if it's still affecting you, please open another bug.

Previous work around/fixes were:

0) Update the firmware/BIOS on the motherboard.
1) Disable the irqbalance daemon, as there seemed to be some kind of race condition.
2) Boot the kernel with pci=nomsi,noaer due to some odd conditions on some motherboards.

This may assist anyone else following along that has time to try these suggestions.  
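For workaround 1, a quick way to confirm irqbalance really is not running is to scan /proc/<pid>/stat, which also works on kernels that predate /proc/<pid>/comm. A minimal sketch; find_processes is a hypothetical helper, and proc_root exists only so it can be pointed at a test directory.

```python
from pathlib import Path
from typing import List


def find_processes(name: str, proc_root: Path = Path("/proc")) -> List[int]:
    """PIDs whose command name in /proc/<pid>/stat equals `name`.

    The stat file starts with "<pid> (<comm>) <state> ...". The command
    name is taken between the first "(" and the last ")" because it may
    itself contain spaces or parentheses.
    """
    pids: List[int] = []
    for entry in proc_root.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            stat = (entry / "stat").read_text()
        except OSError:
            continue  # process exited mid-scan or file unreadable
        comm = stat[stat.find("(") + 1 : stat.rfind(")")]
        if comm == name:
            pids.append(int(entry.name))
    return sorted(pids)


if __name__ == "__main__":
    if Path("/proc").is_dir():
        running = find_processes("irqbalance")
        print(f"irqbalance PIDs: {running}" if running else "irqbalance not running")
```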

In either case, if you are suffering this on a current release of Fedora, please open a new bugzilla.