Description of problem:
My workstation just got this kernel error and immediately became very sluggish (it freezes for a second or so, then unfreezes). It has done this twice now, both times after a large volume of network traffic while running Firefox from a remote box using the local display.

Version-Release number of selected component (if applicable):
2.6.19-1.2895.fc6

How reproducible:
Unclear
Created attachment 146914 [details] /proc/interrupts
The system was still sluggish after reboot, but that seemed to be down to I/O load, since a raid0 array had started syncing.
I'm having a similar problem with kernel-2.6.19-1.2895.fc6 on a dual quad-core Xeon. The system freezes completely during boot. The last message from the kernel:

kernel: do_IRQ: 0.161 No irq handler for vector

This happens consistently.
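For anyone collecting logs for this bug, here is a rough sketch (not part of any fix) that tallies how often each "No irq handler for vector" message appears in a saved dmesg or /var/log/messages dump. As far as I can tell from the kernel's format string, the two numbers are the CPU and the vector, so "0.161" would be CPU 0, vector 161; treat that reading as an assumption. The function name count_unhandled_vectors is made up for this example.

```python
import re
from collections import Counter

# Matches "do_IRQ: 0.161 No irq handler for vector", with or without
# a trailing "(irq -1)" as seen in later kernels.
PATTERN = re.compile(r"do_IRQ: (\d+)\.(\d+) No irq handler for vector")


def count_unhandled_vectors(log_text):
    """Return a Counter keyed by (cpu, vector) for every matching log line.

    Assumption: the first number is the CPU and the second is the vector.
    """
    hits = Counter()
    for match in PATTERN.finditer(log_text):
        cpu = int(match.group(1))
        vector = int(match.group(2))
        hits[(cpu, vector)] += 1
    return hits


def format_report(hits):
    """Render the counter as one human-readable line per (cpu, vector)."""
    return [
        f"cpu {cpu} vector {vector}: {count} time(s)"
        for (cpu, vector), count in sorted(hits.items())
    ]
```

Feed it the text of a log file, e.g. format_report(count_unhandled_vectors(open("dmesg.txt").read())), where dmesg.txt is whatever file you saved the log to. If the same vector keeps showing up on the same CPU, that is probably worth mentioning in your comment.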
How about a dmesg log?
I've had some do_IRQ spew a few times now since bumping to 2.6.19-1.2895.fc6 as well... Sometimes it completely obliterates the network, requiring a reboot, sometimes it doesn't. The following shows up in dmesg perpetually, not only when the do_IRQ spew appears, but here's what was there just after the last time:

----8<----
NETDEV WATCHDOG: eth2: transmit timed out
tg3: eth2: transmit timed out, resetting
tg3: eth2: Link is down.
bonding: bond0: link status definitely down for interface eth2, disabling it
device eth1 entered promiscuous mode
audit(1170965944.004:43): dev=eth1 prom=256 old_prom=0 auid=4294967295
tg3: eth2: Link is up at 1000 Mbps, full duplex.
tg3: eth2: Flow control is on for TX and on for RX.
bonding: bond0: link status definitely up for interface eth2.
device eth1 left promiscuous mode
audit(1170965954.006:44): dev=eth1 prom=0 old_prom=256 auid=4294967295
NETDEV WATCHDOG: eth2: transmit timed out
tg3: eth2: transmit timed out, resetting
tg3: eth2: Link is down.
bonding: bond0: link status definitely down for interface eth2, disabling it
device eth1 entered promiscuous mode
audit(1170966073.090:45): dev=eth1 prom=256 old_prom=0 auid=4294967295
tg3: eth2: Link is up at 1000 Mbps, full duplex.
tg3: eth2: Flow control is on for TX and on for RX.
bonding: bond0: link status definitely up for interface eth2.
device eth1 left promiscuous mode
audit(1170966083.091:46): dev=eth1 prom=0 old_prom=256 auid=4294967295
----8<----

Interesting to note that it's only eth2 popping up there. Just double-checked, and that's the onboard PCIe tg3 in this system, while eth0 and eth1 are PCI-X tg3 cards... Are the other folks seeing this problem possibly using PCIe tg3 NICs as well?
There is a patch for this in -mm: http://marc2.theaimsgroup.com/?l=linux-mm-commits&m=117046481708594&q=raw
Created attachment 147746 [details] dmesg output Sorry, forgot to attach this before.
Like Jarod: PCIe tg3 here too, and it's always network traffic (over that interface) which triggers the problem, and sometimes the network interface completely dies because of it.

04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5751 Gigabit Ethernet PCI Express (rev 01)
I'll have a test kernel based on 2.6.19-1.2895.fc6 together shortly...
Been running 3 days on a patched-up kernel without the spew recurring, but a good chunk of that was the weekend, so I'm withholding judgement until I have a chance to hammer the box with some network I/O later today. So far, so good...
Still looking good after a good amount of hammering on the system. Joe, want a copy of the kernel rpm?
Yes please.
*** Bug 224643 has been marked as a duplicate of this bug. ***
Excellent fix...please indicate the Fedora release fix kernel when it occurs ;-)
Had similar issues on 2.6.19-1.2911.fc6 ("No irq handler for vector" kernel messages, followed by a sluggish system and/or lockup of the whole machine). Disabling the irqbalance service made it go away.
Same problem here, with all our (new) Dell PowerEdge 1950s (SATA/SAS, (dual) dual-core Xeon). Perfect for testing this, since they're pretty useless to put in production now. Could anyone who has a rebuilt kernel that fixes this send it to me? :)) Saves me from building... (x86_64) Any chance we'll see a new update soon that contains the fix? :)
A test kernel carrying this patch is available here: http://people.redhat.com/jwilson/test_kernels/bz225399/ Completely unofficial, may kill kittens, etc., but works for me. :) Haven't looked to see what the upstream status is on this patch, last I saw it was in -mm, so I don't know when it'll make it into an official update. Chuck, any ideas on that front?
A 2.6.20 kernel with the patch is building for FC5 and FC6 now.
This should be fixed in the latest kernel, 2.6.19-1.2911.6.3, available at: http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/6/ Please test.
I've downloaded and installed kernel-2.6.19-1.2911.6.3.fc6.x86_64.rpm and I still have the same problem on my Dell 2950. Boots and then hangs within 60 seconds. Disabling irqbalance allows it to boot normally as detailed above. Thanks for the test kernel, but it looks like it's not fixed :(
I just tried 2.6.19-1.2911.6.4.fc6 and this problem still exists. I can confirm also that disabling irqbalance indeed does prevent the problem from occurring. But I believe on quad core processor, it is beneficial to have this feature enabled, therefore I will be sticking to 2.6.18-1.2869.fc6 until this problem has been fixed. I also believe the severity level should be raised to high...I think this is an important bug.
(In reply to comment #23)
> I just tried 2.6.19-1.2911.6.4.fc6 and this problem still exists. I can confirm
> also that disabling irqbalance indeed does prevent the problem from occurring.
> But I believe on quad core processor, it is beneficial to have this feature
> enabled, therefore I will be sticking to 2.6.18-1.2869.fc6 until this problem
> has been fixed.

Same here with 2.6.19-1.2911.6.4.fc6. Dropping back would be a great option if you're not also being hit by http://bugzilla.kernel.org/show_bug.cgi?id=7727 in 2.6.18-1.2798.fc6.

Is there any possibility of getting this escalated please? At the moment I either have a system that won't boot or one that crashes randomly. Alternatively, can anyone comment on the relative merits/woes of running without irqbalance?

Thanks
People tested the patch and it worked for them. Can anyone still confirm that what was released works? Maybe we are looking at some kind of chipset-specific problem or a bug in irqbalance itself?
Two people in this thread tested the patch and it clearly does not work. I can confirm that 2.6.18-1.2869.x64.fc6 does work with irqbalance running and no problems there. All releases after that evince the "do_IRQ: 0.161 No irq handler for vector" when irqbalance is run. This results in a frozen system requiring a power-off. I have disabled irqbalance, booted fine, and upon running irqbalance from a terminal, the console offers the aforementioned vector error and the system seizes. irqbalance does appear to have an impact on this, but as I stated, it works fine on older kernels. As for the chipset, I am running a brand spanking new Dell 2960 with a quad core X5355 2.66GHz Intel Xeon. I can get the motherboard specs from Dell if you need them. I also would be interested in knowing what would be the impact of not running irqbalance. Any input would be appreciated.
The folks still having problems... Do the problems exist with both 2911.6.4.fc6 as well as the bz-tagged test kernel I threw out there? Just wondering if there's some minor difference there. Since this is my primary workstation impacted by this issue, I've still not switched to 2911.6.4.fc6 from my test kernel... Will try to do so tomorrow morning to see if things are any different.
Exactly the same kernel, configuration, and symptoms as Jeff (Comment #26) running a brand new Dell 2950 with two 3GHz dual core Xeon 5100 series (Woodcrest) processors. This machine will run x86_64, so it will remain in test until stable. :)
I have seen it today with 2.6.19-1.2911.6.4.fc6 on a 4-way dual Opteron (Dell 6950). We switched back to 2.6.18-1.2869.fc6 which seems to help.
Now running 2.6.19-1.2911.6.5.fc6 on my box. Hurrying up and waiting to see if it reproduces or not... :)
Just installed 2.6.19-1.2911.6.5.fc6. Still get the "do_IRQ: 0.161 No irq handler for vector" when irqbalance is run. No change from previous 2911.6.4 kernel. :(
Just for the record, the 2911.6.5.fc6 kernel still has this issue for me on a Dell PowerEdge 1950 (with one dual-core Xeon). Stopping irqbalance (or not starting it at all) makes the problem go away.
For whatever it's worth, I'm sitting on 3 days of uptime with kernel 1.2911.6.5.fc6, so it looks like it fixes things for at least some people.
Just adding a "me too" here. I'm trying to get recent kernels running on a Dell PowerEdge 1950 with two dual core 3GHz Xeons x86_64. 2.6.18-1.2869 seems to be running well (I think). All 2.6.19 kernels up to and including 2.6.19-1.2911.6.5 lock hard on boot with "do_IRQ: 0.161 No irq handler for vector".
I just put a test kernel at: http://people.redhat.com/cebbert/kernels/ There is a different bug fix in this kernel and some feedback on whether it works would really help.
Hi Chuck, Same problem with your test kernel on my Dell 2950. I've had to switch to using 2.6.19 on my live service with irqbalance disabled because of the reliability problems we're seeing with 2.6.18 (see my earlier comments). So far I'm not seeing huge performance issues, but I've only made the change this morning.
Okay, we've got a PE1950 in-house that reproduces the problem. Trying various things on it now...
Rawhide kernel 2.6.20-1.2981.fc7 boots up just fine.
There is a long list of patches that need backporting to fix this, and it appears that only a small number of systems are affected (the fixes we have work for most.) Given that we have a workaround, disabling irqbalance, this fix will have to wait. In the meantime, affected people who want some kind of irq balancing will have to do it manually using /proc/irq/*/smp_affinity. Googling for proc irq smp affinity finds plenty of help. I recommend putting the timer interrupt (0) on a CPU by itself if possible.
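For those going the manual route, a minimal sketch of what writing to /proc/irq/*/smp_affinity involves: the file takes a hexadecimal CPU bitmask in which bit N set means CPU N may service that IRQ. The helper names cpu_mask and set_irq_affinity are invented for this example, the sketch only handles CPUs 0 through 31 (larger systems use comma-separated 32-bit groups, which it does not attempt), and the actual write needs root.

```python
def cpu_mask(cpus):
    """Return the hex bitmask string for a collection of CPU numbers.

    Bit N set means CPU N is allowed to handle the interrupt.
    Limited to CPUs 0-31 to keep the sketch simple.
    """
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu <= 31:
            raise ValueError(f"CPU {cpu} is outside the 0-31 range handled here")
        mask |= 1 << cpu
    return format(mask, "x")


def set_irq_affinity(irq, cpus, proc_root="/proc"):
    """Write the mask for `cpus` to <proc_root>/irq/<irq>/smp_affinity.

    proc_root is a parameter only so the function can be tried against a
    scratch directory; on a real system leave it at /proc and run as root.
    """
    path = f"{proc_root}/irq/{irq}/smp_affinity"
    with open(path, "w") as affinity_file:
        affinity_file.write(cpu_mask(cpus))
    return path
```

On a four-CPU box, putting the timer interrupt on a CPU by itself as suggested above would look something like set_irq_affinity(0, [0]) followed by set_irq_affinity(n, [1, 2, 3]) for each other IRQ number n listed in /proc/interrupts. Some IRQs may refuse the write, so check the file afterwards rather than assuming it took.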
Thanks for the work guys. Is this an Intel 5000 series chipset issue or something more Dell specific? This is to satisfy my curiosity only, don't answer if it's going to slow down the fix.
(In reply to comment #36)
> I just put a test kernel at:
>
> http://people.redhat.com/cebbert/kernels/
>
> There is a different bug fix in this kernel and some feedback on whether it
> works would really help.

This works just fine for me. Well, at least I can boot (unlike all the other 2.6.19 kernels). With only 13 minutes of uptime, I can't say if there are other beasties lurking beneath the surface, but the "No irq handler for vector" beastie appears slain with 2.6.20-1.2924.fc6.

Now, the question is, do I stick with the default of leaving irqbalance running or turn it off? I'm a big fan of sticking to the defaults when it comes to stuff I know nothing about, but I get the impression that irqbalancing isn't all that anyway.

Thanks.
(In reply to comment #41) > Thanks for the work guys. Is this an Intel 5000 series chipset issue or > something more Dell specific? This is to satisfy my curiosity only, don't > answer if it's going to slow down the fix. Not sure specifically what the issue is, but all the systems that appear to have issues are Dell PowerEdge systems with Xeon 5000-series processors. My workstation is a Dell Precision 490, which also has Xeon 5000-series procs, and works fine with 2.6.19-1.2911.6.5.fc6.
(In reply to comment #40)
> There is a long list of patches that need backporting to fix this, and it
> appears that only a small number of systems are affected (the fixes we have work
> for most.) Given that we have a workaround, disabling irqbalance, this fix will
> have to wait.

Wait for how long? FC7? Next kernel? I am curious to know because these Dell enterprise-class servers seem to be widely used with Red Hat, and I would think that, this being enterprise-class hardware, it would be a concern as well as a priority.

More importantly, I would like to know whether this will be released and in which kernel, since it has been a real PITA to update the kernel remotely, cross fingers, reboot, find a seizure, and then make the long haul to the data center to reset the kernel to boot 2.6.18. OK... this is my personal issue, but a PITA nonetheless.

If you could let us know when this fix may be seen, it would allow some of us to decide whether to go to another distro at this juncture.

On a positive note... thanks for your attention to this detail thus far.
(In reply to comment #44)
> (In reply to comment #40)
> > There is a long list of patches that need backporting to fix this, and it
> > appears that only a small number of systems are affected (the fixes we have work
> > for most.) Given that we have a workaround, disabling irqbalance, this fix will
> > have to wait.
>
> Wait for how long? FC7? Next kernel?

I'll let Chuck answer that, he knows better than I do. Note that the current in-development F7 kernel does work just fine on a PE1950 in-house where I was able to reproduce the IRQ lockups with 2.6.19-1.2911.6.5.fc6, so worst-case is that it'll work in F7, or you can use 2.6.19-1.2911.6.5.fc6 with irqbalance shut off. Or use a 2.6.18 kernel. Or use the 2.6.20 FC6 updates-testing kernel.

> I am curious to know because these Dell
> Enterprise class servers seem to be a big RedHat user and I would think that due
> to this being Enterprise class, it would be a concern as well as a priority.

Are we talking Fedora or Red Hat *Enterprise* Linux here?... ;)

Since only a few systems are impacted and there's an easy work-around, this isn't as high priority as other things. The necessary fixes are in upstream kernels now, but the back-porting effort to 2.6.19 is non-trivial, so it's a far better use of time and resources to simply inherit this fix when we have a 2.6.20 or 2.6.21 kernel released for FC6, which ought to happen at some juncture.

> More importantly, I am interested in if this will be released and at which
> kernel since it has been a real PITA to update the kernel remotely, cross
> fingers, reboot, find a seizure, which means a long haul to the data center to
> reset the kernel to boot at 2.6.18. Ok...this is my personal issue, but a PITA
> none the less.

Sounds like you could use a DRAC (or some sort of serial console and power control)... Barring that, set a boot param of panic=60 or some such thing, and use grub's boot once feature.
See step 14 here for details: http://togami.com/~warren/guides/remoteraidcrazies/
(In reply to comment #45)
> Since only a few systems are impacted and there's an easy work-around, this
> isn't as high priority as other things. The necessary fixes are in upstream
> kernels now, but the back-porting effort to 2.6.19 is non-trivial, so its a far
> better use of time and resources to simply inherit this fix when we have a
> 2.6.20 or 2.6.21 kernel released for FC6, which ought to happen at some juncture.

Unfortunately the fixes aren't in 2.6.20 -- they're only in 2.6.21-rc2 and rc3.
*** Bug 231871 has been marked as a duplicate of this bug. ***
(In reply to comment #36)
> I just put a test kernel at:
>
> http://people.redhat.com/cebbert/kernels/
>
> There is a different bug fix in this kernel and some feedback on whether it
> works would really help.

I have an HP DL3600-G5 with a Xeon 5110 CPU.

Installing kernel-2.6.20-1.2925.fc6.x86_64.rpm from the above link changes the problem. With 2.6.19 the machine stops dead within a few seconds of irqbalance starting. With the new 2.6.20 kernel, you still get the "No irq handler for vector" message on the console after a couple but the console keeps running. The snag is that the network is dead. Running

service network restart

brings the network back to life.

Am I right in thinking that this is the same problem as was being discussed here: http://lkml.org/lkml/2007/2/2/275

Any idea when a full fix will be available as a Fedora rpm?
If people are having problems and want to be certain that this issue is fixed, please test 2.6.21-rc4. Positive or negative test reports on that configuration would be very much appreciated. I am reasonably certain I have fixed all the problems that I understand, so if you do have problems there I need to know about it so I can work with you to figure out what is going on.

Getting "no irq for vector" occasionally is expected with the partial fix that was easy to backport. I would generally expect that to be completely without side effects when it does occur. In a few instances it could drop an irq the driver was expecting and confuse it. Reinitializing the driver should be enough to bring it back in that case.

I made a small attempt to reproduce this on a Dell PowerEdge 2950 with no luck, so I clearly don't have the right configuration. I suspect there is some selection bias among the reporters of this problem.

Eric
Eric, were you running irqbalance? The 2950s seem to crash when that starts.
A test kernel that may resolve this issue is available at: http://people.redhat.com/cebbert
Apparently nobody is testing the bug fixes. Should I close this bug with CANTFIX since there's no way to fix it without testers?
> Apparently nobody is testing the bug fixes. > > Should I close this bug with CANTFIX since there's no way to fix it > without testers? I have a test system I can check this on but not until Wednesday, sorry. Damn this inconveniently timed holiday of mine.... ;)
Please test kernel 2943, it is at http://people.redhat.com/cebbert and is also going into fedora-updates-testing.
Created attachment 151790 [details] Excerpt from the log
Created attachment 151791 [details] irq table
Well, it did not completely hang the system, as was the case before, but it still has problems. I installed the 2943 kernel and started irqbalance. About 15 minutes later, the network stopped working.

[lars@tux ~]$ uname -a
Linux tux.home.rpz 2.6.20-1.2943.fc6 #1 SMP Wed Apr 4 15:24:50 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
(In reply to comment #55)
> Created an attachment (id=151790) [edit]
> Excerpt from the log

Was this with kernel 2943? Can you post the complete log from this bootup?

If it was 2943, can you try the kernel option "pci=msi,mmconf"? Also please try just "pci=msi" and "pci=mmconf" separately.
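When reporting results for the different "pci=" combinations, it helps to confirm which options the running kernel was actually booted with. The sketch below just pulls the pci= values out of the kernel command line; pci_options and read_cmdline are names made up for this example, and it says nothing about whether a given option is valid for your kernel.

```python
def pci_options(cmdline):
    """Return the list of values given via pci=... on a kernel command line.

    "ro pci=msi,mmconf quiet" gives ["msi", "mmconf"]. Repeated pci=
    parameters are accumulated in order of appearance.
    """
    options = []
    for param in cmdline.split():
        key, sep, value = param.partition("=")
        if key == "pci" and sep:
            options.extend(v for v in value.split(",") if v)
    return options


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (path is overridable for testing)."""
    with open(path) as cmdline_file:
        return cmdline_file.read().strip()
```

Running pci_options(read_cmdline()) after each boot and pasting the result alongside your log makes it unambiguous which of the three variants a given report refers to.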
Lars E. Pettersson (lars), have you ever seen the message "no irq for vector"? From the log you posted I just see "irq 23 and nobody cared".

Irq 23 is an ioapic irq for your nic, and is not an MSI irq, so we don't need to worry about msi issues.

I don't see how "no irq for vector" could turn into a screaming irq, so this looks like a different issue.

Without the complete log that Chuck asked for I can't be certain of course.
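To check for yourself whether a given irq is an IO-APIC or an MSI interrupt, the controller type is shown in /proc/interrupts after the per-CPU counts. Here is a rough parser, assuming the usual layout (a header row of CPU names, then "label: count... type device"); parse_interrupts and uses_msi are names invented for this sketch, and the MSI check is a simple substring match on the description, not anything authoritative.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: (per_cpu_counts, description)}.

    The header row gives the number of CPU columns. Rows such as ERR or MIS
    that carry fewer counts are handled by stopping at the first
    non-numeric field.
    """
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        for token in fields[:ncpus]:
            if not token.isdigit():
                break
            counts.append(int(token))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (counts, description)
    return table


def uses_msi(table, irq):
    """True if the description for `irq` mentions MSI (a heuristic only)."""
    return "MSI" in table[str(irq)][1]
```

For example, uses_msi(parse_interrupts(open("/proc/interrupts").read()), 23) should come back False on a system where irq 23 is routed through the IO-APIC. The per-CPU counts in the same table also show whether irqbalance has been moving an interrupt between CPUs.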
(In reply to comment #59)
>
> I don't see how "no irq for vector" could turn into a screaming irq so
> this looks like a different issue.
>
> Without the complete log that Chuck asked for I can't be certain of course.

Eric, I changed the defaults in this kernel so MSI and MMCONFIG are disabled by default. Could some hardware not work right in that case?

Also, the Intel patch for flushing MSI registers was applied:
http://cvs.fedora.redhat.com/viewcvs/*checkout*/rpms/kernel/FC-6/linux-2.6-20.5y_msix_flush_writes.patch
Looks like it is the first patch, not the updated one, but it should only affect MSI anyway.

And the upstream patch converting apic destinations to 8-bit went in:
[PATCH] x86-64: update IO-APIC dest field to 8-bit for xAPIC
http://cvs.fedora.redhat.com/viewcvs/*checkout*/rpms/kernel/FC-6/linux-2.6-20_x86_64_xapic_8_bit_dest.patch

Otherwise this is straight 2.6.20.5 code...
I'd test the kernel and report results from the pe1950 I've got in the lab, but at the moment, I can't get it to boot much of any kernel but an oldish rawhide one... :\
Sorry for the delay in answering.

To Chuck Ebbert: OK, I'll try with "pci=msi,mmconf", and also with "pci=msi" and "pci=mmconf" separately. I'll attach the complete log from April 5th.

To Eric W. Biederman: Yes, I have seen the "no irq for vector" with earlier kernels and irqbalance running, but not now with the 2943 kernel. This time I have only seen the "nobody cared" message.

I should perhaps also mention that without irqbalance running, the 2943 kernel has worked without any problems for me.
Created attachment 151993 [details] Complete log from April 5 test of the 2943 kernel
In the last couple of days, I've gotten a couple of these error messages. Here is the latest:

kernel: do_IRQ: 1.77 No irq handler for vector

My machine is an HP Pavilion a1250n (CPU: AMD Athlon 64 X2 3800+, chipset: ATI Radeon Xpress 200)
http://h10025.www1.hp.com/ewfrf/wc/genericDocument?cc=us&docname=c00485646&lc=en

The kernel: kernel-2.6.20-1.2933.fc6 for x86_64

I had never seen these before installing this kernel. I was running 2.6.19-1.2895.fc6 from Jan 30 to Apr 3 and did not see it.

Both times that I've seen it, I was using mplayer to play a TV program recorded by MythTV. The program was being served via HTTP by another machine.

The machine did not crash and did not seem sluggish. However, after the message, mplayer could no longer play recorded programs. It would constantly stutter and repeat one tiny section (perhaps half a second). When I try XMMS, it too just repeats a similar small bit.

I will look into http://people.redhat.com/cebbert/kernels/

My dmesg contains a lot of lines like this, to the exclusion of anything interesting:

APIC error on CPU0: 40(40)

This isn't new with my current kernel.
Just to let you know, I have now tested the new 2944-kernel, still the same problem with irqbalance running. Will do some more tests with "pci=msi,mmconf", and also with "pci=msi" and "pci=mmconf" separately. I have not yet seen any irq-problems with 2943, irqbalance running, and "pci=msi,mmconf", but I have not had time to do any lengthy tests, so I am not sure that that is the cure.
Created attachment 152611 [details] log for 2944-kernel and "No irq handler for vector"-errors
Kernel 2944 with irqbalance running and kernel option "pci=msi,mmconf" crashed (not completely, but it went into an unusable state) with "No irq handler for vector" after seven hours.
Since I posted #64, I've been running Linux version 2.6.20-1.2944.fc6 (x86_64) from http://people.redhat.com/cebbert/kernels/ I again got: redex kernel: do_IRQ: 0.65 No irq handler for vector This is the first time since my last report. This was again during mplayer playing back a recorded TV program. The sound started looping again, and anything I play now loops (until reboot, I assume).
I just got the following error messages on 2.6.20-1.2962.fc6 (x86_64, dual-core Athlon 64, Asustek A8N32-SLI Deluxe, SATA hard disk, 4GB mem, ATI Radeon RV100):

jaguaari kernel: do_IRQ: 1.211 No irq handler for vector
jaguaari kernel: journal commit I/O error

As a result, the system was unable to write anything to the filesystem and had to be forcibly rebooted. irqbalance (-0.55-2.fc6) was running when this occurred. I'll check whether disabling it helps.
http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.20.2 : "x86-64: survive having no irq mapping for a vector Occasionally the kernel has bugs that result in no irq being found for a given cpu vector. If we acknowledge the irq the system has a good chance of continuing even though we dropped an irq message. If we continue to simply print a message and not acknowledge the irq the system is likely to become non-responsive shortly there after." Sounds like an interesting fix..
Fedora apologizes that these issues have not been resolved yet. We're sorry it's taken so long for your bug to be properly triaged and acted on. We appreciate the time you took to report this issue and want to make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6, please note that Fedora no longer maintains these releases. We strongly encourage you to upgrade to a current Fedora release. In order to refocus our efforts as a project we are flagging all of the open bugs for releases which are no longer maintained and closing them.
http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6, thirty days from now, it will be closed 'WONTFIX'. If you can reproduce this bug in the latest Fedora version, please change to the respective version. If you are unable to do this, please add a comment to this bug requesting the change.

Thanks for your help, and we apologize again that we haven't handled these issues to this point.

The process we are following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
to ensure this doesn't happen again.

And if you'd like to join the bug triage team to help make things better, check out
http://fedoraproject.org/wiki/BugZappers
This bug is open for a Fedora version that is no longer maintained and will not be fixed by Fedora. Therefore we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.
Re-opening since it's affecting our 64-bit 16-core FC12 system. This is the error message I saw in our log today:

do_IRQ: 6.233 No irq handler for vector (irq -1)

And the same bug has been cropping up in the latest versions of Ubuntu as well:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/480997

So I'm wondering whatever happened to porting that patch over. Did anyone get around to it?

And no, I don't have a reliable way of reproducing the bug. It seems to have happened while I was untarring about three million tiny files onto a hardware RAID10 array while at the same time a software RAID10 array over SSD drives was doing re-checking. Basically very high I/O load.
I forgot to mention. This is the kernel version: [root@sahtel ~]# uname -a Linux sahtel 2.6.31.12-174.2.22.fc12.x86_64 #1 SMP Fri Feb 19 18:55:03 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
Heads up Elver, I don't think this is re-opened; it is still reporting as closed to me. I know this is 2 years late, on a comment that was already two years later... but if it's still affecting you, please open another bug.

Previous workarounds/fixes were:
0) Update the firmware/BIOS on the motherboard.
1) Disable the irqbalance daemon, as there seemed to be some kind of race condition.
2) Boot the kernel with pci=nomsi,noaer due to some odd conditions in some motherboards.

This may assist anyone else following along who has time to try these suggestions. In either case, if you are suffering this on a current release of Fedora, please open a new bugzilla.