Description of problem:
Randomly (but often -- it's about once per day at this point), I'll stop getting any interrupts (the values in /proc/interrupts stop incrementing) from either my SATA controller or my UHCI controllers (or, occasionally, both at once). If I remove and reload the driver modules or use sysfs to unbind and then bind the devices, they immediately start working again.

Version-Release number of selected component (if applicable):
kernel-2.6.16-1.2096_FC5, but it's been happening off and on for months now.

How reproducible:
Random.

Steps to Reproduce:
Unknown.
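For anyone trying to catch the stall as it happens, here is a minimal sketch of the "counters stop incrementing" check: take two snapshots of /proc/interrupts and report the watched IRQs whose totals did not move. The function names (parse_interrupts, stalled_irqs) are made up for this illustration. An idle device also shows an unchanged count, so the check only means something while the device is under load.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq_label: (total_count, description)}."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        # Per-CPU counters come first; rows such as ERR may have fewer of them.
        for token in fields[:ncpu]:
            if not token.isdigit():
                break
            counts.append(int(token))
        description = " ".join(fields[len(counts):])
        totals[label.strip()] = (sum(counts), description)
    return totals


def stalled_irqs(before_text, after_text, watch):
    """Return the watched IRQ labels whose total count is identical in both snapshots."""
    before = parse_interrupts(before_text)
    after = parse_interrupts(after_text)
    return [
        irq for irq in watch
        if irq in before and irq in after and before[irq][0] == after[irq][0]
    ]


def read_proc_interrupts(path="/proc/interrupts"):
    """Read the live counter table (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

Typical use would be to call read_proc_interrupts() twice, a few seconds apart while the disk or USB device is busy, and pass both results to stalled_irqs() along with the IRQ numbers of interest (for example ["16"] for the SATA controller here).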
Created attachment 128428 [details] lspci and dmidecode output
Created attachment 128430 [details]
dmesg output

You'll probably notice I'm running a custom kernel. This kernel is identical to the stock 2.6.16-1.2096_FC5 kernel except it also includes the following patch (which didn't make any difference w.r.t. this bug):
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=75cf7456dd87335f574dcd53c4ae616a2ad71a11;hp=f01f4182597a3bb4b6fbf92e041faf7a1016f4b6
Created attachment 128564 [details]
updated dmesg, this time with apic=debug

I forgot to mention originally that using noapic seems to make this problem go away. I also noticed that disabling all my GL screensavers (but not DRI entirely) seems to make the problem go away as well. No clue what that means.
Still happens with kernel-2.6.16-1.2111_FC5, and not using GL seems to merely make the problem less likely to happen rather than not happen at all.
Still happens with kernel-2.6.16-1.2122_FC5. When I'm experiencing interrupt loss, crash's irq command says the interrupts are actually enabled. However, the following systemtap script fixes the problem instantly:

function unmask(irq:long) %{
    /* This is unmask_IO_APIC_irq in kernel-2.6.16-1.2122_FC5.x86_64 */
    void (*unmaskP)(int) = (void(*)(int)) 0xffffffff80119dc0;
    unmaskP(THIS->irq);
%}

probe begin {
    unmask(16); /* sata_via gets irq 16 */
    exit();
}

This leads me to believe that interrupts are getting masked at the IOAPIC level without the kernel's knowledge. (Or else the kernel is screwing up and masking them at the IOAPIC level without disabling them at the software level.)
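Note that the hard-coded address in that script is only valid for that exact kernel build. As a rough sketch (not part of the original workaround), the address could instead be looked up by name in the text of /proc/kallsyms or the matching System.map, both of which use "address type name" lines. The helper name find_symbol_address is made up for this example, and it assumes the symbol is actually listed in whichever table you feed it.

```python
def find_symbol_address(symbol_table_text, name):
    """Return the address of `name` from kallsyms/System.map style text, or None.

    Each line is expected to look like: "<hex address> <type> <symbol> [module]".
    """
    for line in symbol_table_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[2] == name:
            return int(parts[0], 16)
    return None


def read_symbol_table(path="/proc/kallsyms"):
    """Read the running kernel's symbol table (Linux only; may need root)."""
    with open(path) as handle:
        return handle.read()
```

The value returned for unmask_IO_APIC_irq would then replace the literal in the systemtap script after each kernel update. On newer kernels, /proc/kallsyms can show all-zero addresses to unprivileged users, so read it as root.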
A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem.

This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks' time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug.

In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem, please check that you only have one version of device-mapper & lvm2 installed. See bug 207474 for further details.

If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem.

Thank you.
Still happens with kernel-2.6.18-1.2200.fc5
Hey, there are new error messages now:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: (BMDMA stat 0x4)
ata1.00: tag 0 cmd 0xca Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1: soft resetting port
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1: soft resetting port
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1: soft resetting port
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1.00: disabled
ata1: EH complete
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 61863
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 218871695
Buffer I/O error on device dm-2, logical block 27358906
lost page write due to I/O error on dm-2
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 218895999
Buffer I/O error on device dm-2, logical block 27361944
lost page write due to I/O error on dm-2
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 189010383
EXT3-fs error (device dm-2): ext3_find_entry: reading directory #11815289 offset 0
Aborting journal on device dm-2.
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 12807
Buffer I/O error on device dm-2, logical block 1545
lost page write due to I/O error on dm-2
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 61903
Buffer I/O error on device dm-2, logical block 7682
lost page write due to I/O error on dm-2
journal commit I/O error
ext3_abort called.
EXT3-fs error (device dm-2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 447
Buffer I/O error on device dm-2, logical block 0
lost page write due to I/O error on dm-2
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 218932055
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 218932055
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 218932087
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 218932087

Which is irritating, actually -- with the older kernels I/O would block indefinitely, and I could manually unmask the IRQ with a grotesque SystemTap hack and everything would work fine. Now I have to catch it before it times out and things go horribly wrong.
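A side note on reading the repeated "return code = 0x00040000" lines: as far as I understand the SCSI midlayer of that era, the 32-bit result packs four bytes (status, message, host, driver, from lowest to highest), and the host byte values are the DID_* codes from include/scsi/scsi.h. The following is a minimal decoder sketch under that assumption. The function name and the partial table are written for this illustration, not copied from kernel source.

```python
# Host byte codes as defined in the kernel's include/scsi/scsi.h (partial list).
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}


def decode_scsi_result(result):
    """Split a 32-bit SCSI result into its four bytes and name the host byte."""
    host_byte = (result >> 16) & 0xFF
    return {
        "status_byte": result & 0xFF,
        "msg_byte": (result >> 8) & 0xFF,
        "host_byte": host_byte,
        "driver_byte": (result >> 24) & 0xFF,
        "host_name": HOST_BYTE_NAMES.get(host_byte, "unknown"),
    }
```

Under that reading, 0x00040000 has host byte 0x04 (DID_BAD_TARGET), which would fit the "ata1.00: disabled" line: once error handling gives up on the device, every later request is rejected rather than sent to the disk.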
Created attachment 141317 [details]
dmesg output

I'm getting something very similar on kernel 2.6.18-1.2239.fc5. This is using a Silicon Image, Inc. SiI 3112 (sata_sil) controller. Everything works fine for a while, then after copying ~24G to the drives (at least in this case), the timeout errors come up and I eventually get repeated "I/O error in filesystem" errors when XFS tries to write to the logical volume.
I am also seeing this trouble; I will gather logs and provide them upon the next occurrence. It seems to have cropped up since I went to a dual-core processor (literally upgraded the processor in place; all other hardware is unchanged). In this particular system I have five SATA controllers (three SIL3112 with the latest firmware, one Promise 378, and one VIA). All controllers have seen this problem once, and I lose the whole controller once it happens. (I have not taken steps to attempt to repair without a reboot; I have just kicked the box over.) Logs to follow.

- Shawn
Fedora apologizes that these issues have not been resolved yet. We're sorry it's taken so long for your bug to be properly triaged and acted on. We appreciate the time you took to report this issue and want to make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6, please note that Fedora no longer maintains these releases. We strongly encourage you to upgrade to a current Fedora release. In order to refocus our efforts as a project we are flagging all of the open bugs for releases which are no longer maintained and closing them.
http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6, thirty days from now, it will be closed 'WONTFIX'. If you can reproduce this bug in the latest Fedora version, please change to the respective version. If you are unable to do this, please add a comment to this bug requesting the change.

Thanks for your help, and we apologize again that we haven't handled these issues to this point.

The process we are following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
to ensure this doesn't happen again.

And if you'd like to join the bug triage team to help make things better, check out
http://fedoraproject.org/wiki/BugZappers
This bug is open for a Fedora version that is no longer maintained and will not be fixed by Fedora. Therefore we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version.

Thank you for reporting this bug, and we are sorry it could not be fixed.