Description of problem:
I've been having this problem ever since the move to 2.6.17. 2.6.17 cured my other kernel errors, oopses and bugs, but seems to have introduced this lost interrupt problem. The lost interrupts occur at random times. I do not leave this system on 24/7, but I do use it 3 - 7 hours every day. Depending on the kernel version, I will get one or more lost interrupts, sometimes locking up the system for up to 30 minutes before I can finally log into a virtual terminal and do a shutdown.

I did not see any related bugs in bugzilla. I also did not see anything in any of the kernel.org reports, so I don't know if this is a known problem with this hardware or not. Hence the reason for filing this bug report.

Version-Release number of selected component (if applicable):
2.6.17-1.2139 and 2145 seem to have it happen the least. They also typically only have one or two lost interrupts and do not hang. 2.6.17-1.2157 and 2174 are the worst. Every couple of days the system will hang. It is so bad that the SMART report for hda showed its first CRC error ever.
Additional info:
Here is my system info for the problem system:

Soyo Dragon Plus
AthlonXP 1700+
512MB DDR2100
using on-board lan and sound
Geforce3 ti200
WD 120GB primary [hda]
Maxtor 120GB on Promise Ultra [hde]
[Yes memory has been checked multiple times with no failures]

Here are some of the relevant sections of the error messages from the message log:

2174 kernel

Aug 24 09:43:36 HAL5000 kernel: hda: dma_intr: status=0x00 { }
Aug 24 09:43:36 HAL5000 kernel: ide: failed opcode was: unknown
Aug 24 09:43:36 HAL5000 kernel: hda: dma_intr: status=0x00 { }
Aug 24 09:43:36 HAL5000 kernel: ide: failed opcode was: unknown
Aug 24 09:43:36 HAL5000 kernel: hda: dma_intr: status=0x00 { }
Aug 24 09:43:36 HAL5000 kernel: ide: failed opcode was: unknown
Aug 24 09:43:36 HAL5000 kernel: hda: dma_intr: status=0x00 { }
Aug 24 09:43:36 HAL5000 kernel: ide: failed opcode was: unknown
Aug 24 09:43:36 HAL5000 kernel: hda: DMA disabled
Aug 24 09:43:36 HAL5000 kernel: ide0: reset: success
Aug 24 09:44:06 HAL5000 kernel: hda: lost interrupt
Aug 24 10:03:36 HAL5000 last message repeated 2 times
Aug 24 10:04:06 HAL5000 kernel: hda: lost interrupt
Aug 24 10:07:36 HAL5000 last message repeated 7 times
Aug 24 10:07:52 HAL5000 gpm[2232]: *** info [mice.c(1766)]:
Aug 24 10:07:52 HAL5000 gpm[2232]: imps2: Auto-detected intellimouse PS/2
Aug 24 10:08:06 HAL5000 kernel: hda: lost interrupt
Aug 24 10:15:37 HAL5000 last message repeated 15 times

2145 kernel {one time during a period of ~10 days}

Aug 15 09:59:57 HAL5000 kernel: hda: dma_intr: status=0x00 { }
Aug 15 09:59:57 HAL5000 kernel: ide: failed opcode was: unknown
Aug 15 09:59:58 HAL5000 avahi-daemon[2306]: Server startup complete. Host name is HAL5000.local. Local service cookie is 3232676125.
Aug 15 10:00:00 HAL5000 kernel: hda: dma_intr: status=0x00 { }
Aug 15 10:00:00 HAL5000 kernel: ide: failed opcode was: unknown
Aug 15 10:00:00 HAL5000 kernel: hda: dma_intr: status=0x00 { }
Aug 15 10:00:00 HAL5000 kernel: ide: failed opcode was: unknown
Aug 15 10:00:00 HAL5000 kernel: hda: dma_intr: status=0x20 { DeviceFault }
Aug 15 10:00:00 HAL5000 kernel: ide: failed opcode was: unknown
Aug 15 10:00:00 HAL5000 kernel: hda: DMA disabled
Aug 15 10:00:00 HAL5000 kernel: ide0: reset: success
Aug 15 10:00:30 HAL5000 kernel: hda: lost interrupt
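For comparing kernels, it can help to tally how often each message shows up, including the occurrences hidden behind "last message repeated N times". The following is only a rough sketch: the function name `tally` and the sample text are made up for illustration (the sample is a subset of the 2174 excerpt), and it assumes that a "repeated" line always refers to the message logged immediately before it.

```python
import re
from collections import Counter

# A subset of the 2174 kernel excerpt, used here purely as sample input.
SAMPLE = """\
Aug 24 09:43:36 HAL5000 kernel: hda: DMA disabled
Aug 24 09:43:36 HAL5000 kernel: ide0: reset: success
Aug 24 09:44:06 HAL5000 kernel: hda: lost interrupt
Aug 24 10:03:36 HAL5000 last message repeated 2 times
Aug 24 10:04:06 HAL5000 kernel: hda: lost interrupt
Aug 24 10:07:36 HAL5000 last message repeated 7 times
Aug 24 10:07:52 HAL5000 gpm[2232]: imps2: Auto-detected intellimouse PS/2
Aug 24 10:08:06 HAL5000 kernel: hda: lost interrupt
Aug 24 10:15:37 HAL5000 last message repeated 15 times
"""

# "Mon DD HH:MM:SS host rest-of-line"
LINE_RE = re.compile(r"^\w{3}\s+\d+ \d\d:\d\d:\d\d (\S+) (.*)$")
REPEAT_RE = re.compile(r"^last message repeated (\d+) times?$")


def tally(lines):
    """Count messages, expanding 'last message repeated N times' lines.

    Assumption: a 'repeated' line applies to the message directly before it.
    """
    counts = Counter()
    last = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        text = m.group(2)
        rep = REPEAT_RE.match(text)
        if rep:
            if last is not None:
                counts[last] += int(rep.group(1))
            continue
        # Drop the leading "kernel: " or "gpm[2232]: " style tag.
        msg = text.split(": ", 1)[1] if ": " in text else text
        counts[msg] += 1
        last = msg
    return counts


if __name__ == "__main__":
    for msg, n in tally(SAMPLE.splitlines()).most_common():
        print(n, msg)
```

On the sample above this counts 27 lost interrupts (1 + 2 + 1 + 7 + 1 + 15) in roughly half an hour, which gives a better idea of how bad the 2174 hang was than the three visible "lost interrupt" lines suggest.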
Here is the info from hdparm:

/dev/hda:

 Model=WDC WD1200JB-00CRA1, FwRev=17.07W17, SerialNo=WD-WCA8C3977857
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
 BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=234441648
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes: pio0 pio1 pio2 pio3 pio4
 DMA modes: mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified: ATA/ATAPI-1 ATA/ATAPI-2 ATA/ATAPI-3 ATA/ATAPI-4 ATA/ATAPI-5

 * signifies the current active mode
Changing to proper owner, kernel-maint.
Well, I've been running 2.6.17-1.2187_FC5 for 11 days now with absolutely no problems. I've run with and without the LIVNA Nvidia drivers with no problems. I'll give it another couple of weeks and give a final update. If 2.6.18 is released to updates before then, I'll run it a couple of weeks first and then update this bug report.
Well, I finally had a lost interrupt on the 14th day of running 2187. It resolved itself after ~30 seconds, and I was able to continue running without further problems. So, even though this kernel version is much better, the problem remains. It is at least usable: over the 13 days it ran without a hitch, I used the computer for a total of at least 40 hours.
A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem.

This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug.

In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem please check you only have one version of device-mapper & lvm2 installed. See bug 207474 for further details.

If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem.

Thank you.
Well, I've been running the new 2.6.18-1.2200 kernel for 5 days and today I got my first lost interrupt. Here are a couple of the messages:

Oct 28 09:44:02 HAL5000 kernel: hda: dma_intr: status=0x7f { DriveReady DeviceFault SeekComplete DataRequest CorrectedError Index Error }
Oct 28 09:44:02 HAL5000 kernel: hda: dma_intr: error=0x00 { }
Oct 28 09:44:02 HAL5000 kernel: ide: failed opcode was: unknown
Oct 28 09:44:02 HAL5000 kernel: hda: DMA disabled
Oct 28 09:44:02 HAL5000 kernel: ide0: reset: success
Oct 28 09:44:02 HAL5000 kernel: hda: lost interrupt

It happened while I was booting up, but other than a short delay in getting the login screen, it recovered fine. System worked fine the rest of the time I was on it for the day, about 3.5 hours.
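As a reading aid, the names in braces on the dma_intr lines are just the individual bits of the status byte spelled out. The sketch below reproduces that decoding. The flag names are copied from the messages in this report, the bit positions follow the standard ATA status register layout, and `decode_status` is a hypothetical helper written for illustration, not the kernel's own code. Treating bit 7 (Busy) as overriding the other bits is an assumption based on the ATA rule that the remaining bits are not valid while the drive is busy.

```python
# ATA status register bits, highest first, named as in the log messages.
STATUS_BITS = [
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(status):
    """Return the list of flag names set in an 8-bit ATA status value."""
    if status & 0x80:
        # Assumption: while Busy is set, the other bits carry no meaning.
        return ["Busy"]
    return [name for bit, name in STATUS_BITS if status & bit]


if __name__ == "__main__":
    for value in (0x00, 0x20, 0x7F):
        print("status=0x%02x { %s }" % (value, " ".join(decode_status(value))))
```

Seen this way, status=0x7f means every bit except Busy is set at once, which is not a plausible combination for a healthy transfer, and status=0x00 means the drive is not even reporting DriveReady.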
Alan, have you any idea what could be causing these?
The DeviceFault one looks like a disk problem; the 0x7F one similarly looks like the disk went bananas and eventually a reset brought it back to sanity. The question is what the trigger is: it could be a dodgy drive, but might be something else in theory. What do the smart utils show in the drive error log after such a reset?
Created attachment 141922
Output from smart.

Output from smartctl on hda
Just a few comments. I'm currently running the 2239 kernel. Just before I updated to it, I was getting the lost interrupts about every other day. I was also getting error messages from udev at the start of boot. I'm guessing the udev messages (which coincided with the days I got lost interrupts) were caused by the lost interrupts. As soon as I did the update to 2239, the system straightened out. I've run for 7 days now with no kernel messages indicating any problems. Of course, I've run up to 13 days with earlier kernels before I saw anything, so I will have to reserve judgement for a couple more weeks on whether this is finally fixed.

As for the smart data, the 15 reallocated sectors have been holding at that level for several months. I have another system with a WD as the boot device and it has many more, although it has never shown a problem. As a matter of fact, most of the sectors were reallocated under Windows on the other system. I have read many comments that WD drives have a tendency toward this. All I know is the two systems I run WD as boot devices have these, while the system that uses an IBM drive has none.

The UDMA CRC Error Count occurred with the 2157 and 2174 kernels. I haven't had any since then.

I was wondering if the HZ being changed in this new kernel would have any impact on the interrupt?
Error 11 occurred at disk power-on lifetime: 797 hours (33 days + 5 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 13 7b 5f ea  Error: UNC 8 sectors at LBA = 0x0a5f7b13 = 174029587

8 uncorrectable sectors caused the long delay/recovery and error log entry. It may just be freaking when you hit those sectors and a recovery/rewrite of them may sort the drive out entirely. Doesn't look like a Linux problem.
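For anyone wanting to check how the register dump maps to the quoted LBA, here is a small worked sketch. It assumes the usual 28-bit LBA layout of the ATA task file (SN = bits 0-7, CL = bits 8-15, CH = bits 16-23, low nibble of DH = bits 24-27) and that bit 6 of the error register is UNC. The helper name `lba28_from_registers` is made up for illustration; the register values and the LBAsects figure are the ones from this report.

```python
UNC = 0x40  # error register bit 6: uncorrectable data error

# Register names and values exactly as printed in the SMART error log entry.
NAMES = "ER ST SC SN CL CH DH".split()
VALUES = [int(x, 16) for x in "40 51 08 13 7b 5f ea".split()]
regs = dict(zip(NAMES, VALUES))


def lba28_from_registers(sn, cl, ch, dh):
    """Rebuild a 28-bit LBA from the four address registers.

    Only the low nibble of DH carries address bits; the high nibble
    holds mode and drive-select flags.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


lba = lba28_from_registers(regs["SN"], regs["CL"], regs["CH"], regs["DH"])

if __name__ == "__main__":
    print("UNC set:", bool(regs["ER"] & UNC))
    print("sector count:", regs["SC"])
    print("LBA = 0x%08x = %d" % (lba, lba))
    # LBAsects as reported by hdparm for this drive.
    print("within drive:", lba < 234441648)
```

The result agrees with the smartctl summary line: the UNC bit is set, the sector count is 8, and the address works out to 0x0a5f7b13 = 174029587, which lies inside the 234441648 sectors that hdparm reports for the drive.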
You might btw want to try the disk vendor's tools if you have suitable "other OS" products around. That may be sufficient to recover the drive fully (or see the Bad block howto on the smartmontools.sf.net site, but that is a more scary approach).
If the uncorrectable sectors were the cause of the "lost interrupt", wouldn't I be seeing more reallocated sectors in the past couple of months, coinciding with the lost interrupt and seek messages?

Also, as to my earlier comments today about the kernel looking good so far, it figures that I would jinx myself by saying that. Although I did not have a lost interrupt message, I did get a seek message today. Hope it is not a harbinger of future lost interrupts.

kernel: hda: dma_intr: status=0x10 { SeekComplete }
Sectors can only be reallocated if the block in question is rewritten with new data.
Thank you for the info. I thought that if there were bad sectors that had not been reallocated, then they would show as pending until reallocated. Since I don't have any Current Pending Sectors, and there haven't been any for several months, I guess I incorrectly assumed that the sectors were stable. Thanks, Lloyd
Just a final update to fully close this bug. I finally found the problem. As I suspected, it was NOT a problem with the hard drive. It turned out that the IDE cable was flaky. Early on in my troubleshooting I had checked that the cable connectors were fully seated, but never suspected that the cable/connectors could be going bad. It has been three weeks with NO dma/interrupt error messages, and no lockups.