+++ This bug was initially created as a clone of Bug #182617 +++

Description of problem:
Evidence is mounting that it is irqbalance that is causing me headaches, leading to numerous different kinds of failures (bug 181347, bug 181920, bug 181310). The box would display any of the symptoms of these bugs within hours of booting up. Ever since I ran `service irqbalance stop`, the box has been rock solid. I didn't find it frozen, as it would always be, when I got up this, erhm, morning :-), which is a good sign, and it's heavy on duty since then, without any casualties so far. This is unlikely to be a bug in irqbalance per se, but rather a kernel bug, so this kernel bug report blocks the irqbalance one.

Version-Release number of selected component (if applicable):
kernel-2.6.15-1.1975_FC5.x86_64
irqbalance-1.12-1.24

How reproducible:
Never failed me after leaving the several boxes with similar configuration on overnight

Steps to Reproduce:
1. Boot the system up
2. Leave it up overnight

Actual results:
You'll find that networking died, or that the SATA subsystem is dead, or that the mouse is jerky, or God knows what else.

Expected results:
No such undesirable surprises.

Additional info:
Hardware is Athlon64X2 3800+, Asus A8V Deluxe, A4Tech USB mouse, 2 SATA disks connected to the sata_promise controller built into the MoBo.
$ cat /proc/interrupts
           CPU0       CPU1
  0:      87445    9375771    IO-APIC-edge  timer
  1:      24239          0    IO-APIC-edge  i8042
  7:          0          0    IO-APIC-edge  parport0
  8:          0          0    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 15:     112601        552    IO-APIC-edge  ide1
 16:          0          0   IO-APIC-level  libata
 17:      25254    2148074   IO-APIC-level  libata
 18:    3274119      36277   IO-APIC-level  skge
 19:          0          0   IO-APIC-level  VIA8237
 20:          7    1660312   IO-APIC-level  ohci1394
 21:         48      67448   IO-APIC-level  ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, uhci_hcd:usb5
NMI:       2744       4102
LOC:    9463891    9463534
ERR:          0
MIS:          0

I still haven't determined what happens if I never run irqbalance after boot up; so far all I've tested is irqbalance running for some time, and then stopped, so that every IRQ is assigned to a single CPU, and that appears to make the system stable.
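For anyone who wants to check how their own IRQs ended up distributed after stopping irqbalance, here is a minimal Python sketch that parses /proc/interrupts-style text and reports which CPU handled most of each IRQ. The function names (`parse_interrupts`, `dominant_cpu`) and the `SAMPLE` constant are made up for illustration; the sample rows are copied from the output above. It assumes the first non-empty line is the CPU header row.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: (per_cpu_counts, description)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        counts = []
        # Rows such as ERR/MIS carry fewer counters than there are CPUs.
        for field in fields[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (counts, description)
    return table


def dominant_cpu(counts):
    """Return (cpu_index, share_of_total) for the busiest CPU, or None if the IRQ never fired."""
    total = sum(counts)
    if total == 0:
        return None
    cpu = max(range(len(counts)), key=counts.__getitem__)
    return cpu, counts[cpu] / total


# Sample rows taken from the report above (hypothetical constant name).
SAMPLE = """\
           CPU0       CPU1
  0:      87445    9375771    IO-APIC-edge  timer
 16:          0          0   IO-APIC-level  libata
 17:      25254    2148074   IO-APIC-level  libata
 18:    3274119      36277   IO-APIC-level  skge
ERR:          0
"""

for label, (counts, description) in parse_interrupts(SAMPLE).items():
    print(label, description or "-", dominant_cpu(counts))
```

On a live system the same functions can be fed the contents of /proc/interrupts read with `open("/proc/interrupts").read()`.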
*** Bug 181347 has been marked as a duplicate of this bug. ***
Same problem here, on 32-bit kernel. I built myself a stock kernel after having problems, my kernel is currently:

title Fedora Core (2.6.16.11)
        root (hd0,0)
        kernel /vmlinuz-2.6.16.11 ro root=LABEL=/ rhgb quiet report_lost_ticks=1 notsc clock=pmtmr console=ttyS0,115200n8 noapic
        initrd /initrd-2.6.16.11.img

I added 'noapic' today and disabled 'irqbalance'. We'll see how things go. If it's ok after a few days, I'll remove the 'noapic'. Usually fails during heavy network activity, or randomly while I'm away. I use an offboard 3c59x NIC, cause my onboard one died.
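When comparing test runs it helps to confirm which boot options were actually in effect. A small Python sketch, assuming nothing beyond whitespace-separated `name=value` tokens: `parse_cmdline` is a hypothetical helper, and the `CMDLINE` string is the kernel line from the comment above.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to True."""
    params = {}
    for token in cmdline.split():
        # Split on the first '=' only, so root=LABEL=/ keeps its full value.
        name, sep, value = token.partition("=")
        params[name] = value if sep else True
    return params


# Kernel arguments as quoted in the comment above (hypothetical constant name).
CMDLINE = ("ro root=LABEL=/ rhgb quiet report_lost_ticks=1 notsc "
           "clock=pmtmr console=ttyS0,115200n8 noapic")

params = parse_cmdline(CMDLINE)
print("noapic set:", "noapic" in params)
print("clock source:", params.get("clock"))
```

On a running system the active options can be read from /proc/cmdline and passed to the same function.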
I have the exact same problem... See http://lkml.org/lkml/2006/5/16/67 for more info! - vin
Same(?) problem, different results. Disabling irqbalance did not work for me.

Asus P5N32-SLI SE Deluxe motherboard, with Core 2 Duo processor, running Kernel 2.6.7.1-2187_FC5
notable drivers: sky2 sata_sil24 sata_nv

I am being bitten regularly by the sata problems described in bug 181310. After disabling irqbalance and running bittorrent for many hours, I got my first occurrence of the network problem described in bug 181347. That was with kernel 2.6.17.-1_2174_FC5

I have also experienced the jerky mouse movement, but that was with FC6T2, and only when my mouse was connected through a hub - a dell 2407wfp. The mouse would start smooth, but after a while become jerky. Motion would be smooth again if I plugged it directly into a usb port on the computer. I have since removed FC6T2, because I hadn't yet found all this other bug history. Using a non-beta OS was also important to me because all the hardware was (is) brand new.

I don't even know if I have a bad motherboard or not. My symptoms are almost exactly like what is described by Alexandre, so I'm assuming the motherboard is good, and the kernel is bad. But due to the inactivity on this and the other bz's, I wish it were the other way round!
Small correction -

- 2.6.7.1-2187_FC5
+ 2.6.17.1-2187_FC5

And I'm running the 64 bit kernels. Let me know if I can be of any assistance in testing fixes for these issues.
A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem.

This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug.

In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem please check you only have one version of device-mapper & lvm2 installed. See bug 207474 for further details.

If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem.

Thank you.
For me, this is the same problem I was having on this bug: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166437

The important info is that this problem surfaced after kernel-smp-2.6.13-1.1532_FC4 (>= 2.6.14). I tried disabling irqbalance and am running 2.6.17-1.2187_FC5 (x86_64). Locked up after 1 hour of heavy CPU load. MB is ABIT AV8 K8T800 Pro (Via); CPU Athlon64X2 4400+; 4GB mem.

Maybe I'm not seeing the same problem. All I get are lockups after anywhere from 1 hour to as long as 2 days. High load seems to aggravate the problem. I'll re-test with 2.6.18, when the 64 bit package is released (not seeing it in updates yet).
I found a way to get the machine to lock up on cue, so, able to do more rapid testing . . . My problem turned out to be an old ('95) Intel EE Pro 100 PCI NIC card. I pulled it and it's been running like a champ for almost a day. I stand by my assertion that 2.6.13 was stable even with this old NIC installed. I wasn't able to get any info from the NMI watchdog. IOMMU maybe? 2.6.18 (2200) running fine.
2.6.18 (2200) NOT running fine for me. I have reproduced the error twice. The message seems to have changed since the last kernel though. But it still is a timeout.

ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata5.00: revalidation failed (errno=-5)
ata5: failed to recover some devices, retrying in 5 secs
ata5.00: qc timeout (cmd 0xec)
ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata5.00: revalidation failed (errno=-5)
ata5: failed to recover some devices, retrying in 5 secs
ata5.00: qc timeout (cmd 0xec)
...

Both times were after I had closed a tvtime window (hardware is a bt848 based wintv card circa 1996). I think this may point to irq mismanagement, as another person commented in this collection of related bugs - bug seems to crop up after a change to the load on the system.

One big difference this time is that the timeouts did not repeat forever. The system seemed to recover after a few timeout errors. However my raid array was degraded in the process. sdd was dropped from the two drive raid-1 array.
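To compare how often each port times out across kernels, the libata messages can be tallied per port. This is only a sketch: `summarize_ata_errors`, `KEYWORDS`, and `SAMPLE_LOG` are hypothetical names, the keyword list covers just the four message types quoted in the comment above, and the sample log is copied from that comment.

```python
import re
from collections import Counter

# Matches "ata5: ..." and "ata5.00: ..." anywhere in a line, so a leading
# timestamp or log prefix does not prevent a match.
PORT_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: (.*)")

# Only the message types that appear in this report.
KEYWORDS = ("qc timeout", "failed to IDENTIFY", "revalidation failed", "failed to recover")


def summarize_ata_errors(log_text):
    """Count known libata error messages per port: {(port, keyword): count}."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PORT_RE.search(line)
        if not match:
            continue
        port, message = match.groups()
        for keyword in KEYWORDS:
            if keyword in message:
                counts[(port, keyword)] += 1
                break
    return counts


# Log lines as quoted in the comment above (hypothetical constant name).
SAMPLE_LOG = """\
ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata5.00: revalidation failed (errno=-5)
ata5: failed to recover some devices, retrying in 5 secs
ata5.00: qc timeout (cmd 0xec)
ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata5.00: revalidation failed (errno=-5)
ata5: failed to recover some devices, retrying in 5 secs
ata5.00: qc timeout (cmd 0xec)
"""

for (port, keyword), n in sorted(summarize_ata_errors(SAMPLE_LOG).items()):
    print(port, keyword, n)
```

Feeding it saved `dmesg` output from before and after a kernel change would show whether the timeouts stopped repeating or merely changed wording.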
If you added a comment above of the form "I disabled irqbalance and my problem still happened", then it's unlikely to be related to this bug, and you should open a separate one. I'm reassigning this to irqbalance in the hope that Neil has some ideas about what could be going wrong in Alexandre's case.
Confirming this is still a problem with FC6 (uname -r gives 2.6.18-1.2849.fc6). K8T800Pro, Athlon64 4400+. I get the problem where the network interface (a Marvell 88e8001 controller) stops responding until I unload and reload the module. Disabling the irqbalance service resolves the problem.
Alexandre and I have been down this road before. I am completely unable to reproduce this error here on any of my systems, and thus far, the only similarity I can find between the systems that report what appears to be the same problem is that they all contain a variant of the Asus A8 motherboard. Not really sure what to do with this. My reading has indicated that people with this motherboard have had more success by disabling on board video and using a separate video card.
I am using an ASUS A8V Deluxe board, so that part sort of jibes. I'm using a separate video card though (an Nvidia 6800GT); the A8V Deluxe doesn't have on-board video.