Bug 203929 - lost interrupt on hda
Summary: lost interrupt on hda
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 5
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-08-24 15:53 UTC by Lloyd Matthews
Modified: 2007-11-30 22:11 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-11-24 20:49:29 UTC
Type: ---
Embargoed:


Attachments
Output from smart. (8.16 KB, text/plain)
2006-11-22 17:38 UTC, Lloyd Matthews

Description Lloyd Matthews 2006-08-24 15:53:58 UTC
Description of problem:

I've been having this problem ever since the move to 2.6.17.  2.6.17 cured my
other kernel errors, oops and bugs, but seems to have introduced this lost
interrupt problem.  The lost interrupts occur at random times.  I do not leave
this system on 24/7, but I do use it 3 - 7 hours every day.  Depending on which
kernel version, I will get one or more lost interrupts, sometimes locking up the
system for up to 30 minutes before I can finally log into a virtual terminal and
do a shutdown.

I did not see any related bugs in bugzilla.  I also did not see anything in any
of the kernel.org reports, so I don't know if this is a known problem with this
hardware or not.  Hence the reason for filing this bug report.

Version-Release number of selected component (if applicable):
2.6.17-1.2139 and 2145 seem to have it happen the least.  They also typically
only have one or two lost interrupts and do not hang.

2.6.17-1.2157 and 2174 are the worst.  Every couple of days the system will
hang.  It is so bad that the SMART report for hda showed its first CRC error ever.


Additional info:

Here is my system info for the problem system:

Soyo Dragon Plus
AthlonXP 1700+
512MB DDR2100
using on-board lan and sound
Geforce3 ti200
WD 120GB primary [hda]
Maxtor 120GB on Promise Ultra [hde]

[Yes memory has been checked multiple times with no failures]

Here are some of the relevant sections of the error messages from the message log:

2174 kernel

Aug 24 09:43:36 HAL5000 kernel: hda: dma_intr: status=0x00 { }
Aug 24 09:43:36 HAL5000 kernel: ide: failed opcode was: unknown
Aug 24 09:43:36 HAL5000 kernel: hda: dma_intr: status=0x00 { }
Aug 24 09:43:36 HAL5000 kernel: ide: failed opcode was: unknown
Aug 24 09:43:36 HAL5000 kernel: hda: dma_intr: status=0x00 { }
Aug 24 09:43:36 HAL5000 kernel: ide: failed opcode was: unknown
Aug 24 09:43:36 HAL5000 kernel: hda: dma_intr: status=0x00 { }
Aug 24 09:43:36 HAL5000 kernel: ide: failed opcode was: unknown
Aug 24 09:43:36 HAL5000 kernel: hda: DMA disabled
Aug 24 09:43:36 HAL5000 kernel: ide0: reset: success
Aug 24 09:44:06 HAL5000 kernel: hda: lost interrupt
Aug 24 10:03:36 HAL5000 last message repeated 2 times
Aug 24 10:04:06 HAL5000 kernel: hda: lost interrupt
Aug 24 10:07:36 HAL5000 last message repeated 7 times
Aug 24 10:07:52 HAL5000 gpm[2232]: *** info [mice.c(1766)]:
Aug 24 10:07:52 HAL5000 gpm[2232]: imps2: Auto-detected intellimouse PS/2
Aug 24 10:08:06 HAL5000 kernel: hda: lost interrupt
Aug 24 10:15:37 HAL5000 last message repeated 15 times



2145 kernel {one time during a period of ~10 days}

Aug 15 09:59:57 HAL5000 kernel: hda: dma_intr: status=0x00 { }
Aug 15 09:59:57 HAL5000 kernel: ide: failed opcode was: unknown
Aug 15 09:59:58 HAL5000 avahi-daemon[2306]: Server startup complete. Host name is HAL5000.local. Local service cookie is 3232676125.
Aug 15 10:00:00 HAL5000 kernel: hda: dma_intr: status=0x00 { }
Aug 15 10:00:00 HAL5000 kernel: ide: failed opcode was: unknown
Aug 15 10:00:00 HAL5000 kernel: hda: dma_intr: status=0x00 { }
Aug 15 10:00:00 HAL5000 kernel: ide: failed opcode was: unknown
Aug 15 10:00:00 HAL5000 kernel: hda: dma_intr: status=0x20 { DeviceFault }
Aug 15 10:00:00 HAL5000 kernel: ide: failed opcode was: unknown
Aug 15 10:00:00 HAL5000 kernel: hda: DMA disabled
Aug 15 10:00:00 HAL5000 kernel: ide0: reset: success
Aug 15 10:00:30 HAL5000 kernel: hda: lost interrupt
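
Because syslogd collapses identical consecutive messages into "last message repeated N times" lines, the number of lost interrupts in an excerpt like the 2174 one above is larger than the number of "lost interrupt" lines. The following is only a rough sketch of how the events could be tallied. The function name and the sample list are illustrative, and it assumes each repeat line refers to the message logged immediately before it.

```python
import re

# syslogd's repeat-compression line, e.g. "last message repeated 7 times"
REPEAT = re.compile(r"last message repeated (\d+) times?")

def count_lost_interrupts(lines, needle="lost interrupt"):
    """Count matching messages, including syslog 'repeated N times' lines."""
    total = 0
    previous_matched = False
    for line in lines:
        repeat = REPEAT.search(line)
        if repeat:
            # Only credit the repeats if the preceding message was a match.
            if previous_matched:
                total += int(repeat.group(1))
            continue
        previous_matched = needle in line
        if previous_matched:
            total += 1
    return total

# Tail of the 2174 kernel excerpt from the description.
sample = [
    "Aug 24 09:43:36 HAL5000 kernel: hda: DMA disabled",
    "Aug 24 09:43:36 HAL5000 kernel: ide0: reset: success",
    "Aug 24 09:44:06 HAL5000 kernel: hda: lost interrupt",
    "Aug 24 10:03:36 HAL5000 last message repeated 2 times",
    "Aug 24 10:04:06 HAL5000 kernel: hda: lost interrupt",
    "Aug 24 10:07:36 HAL5000 last message repeated 7 times",
    "Aug 24 10:07:52 HAL5000 gpm[2232]: imps2: Auto-detected intellimouse PS/2",
    "Aug 24 10:08:06 HAL5000 kernel: hda: lost interrupt",
    "Aug 24 10:15:37 HAL5000 last message repeated 15 times",
]

print(count_lost_interrupts(sample))
```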

Comment 1 Lloyd Matthews 2006-08-29 14:32:32 UTC
Here is the info from hdparm:

/dev/hda:

 Model=WDC WD1200JB-00CRA1, FwRev=17.07W17, SerialNo=WD-WCA8C3977857
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
 BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=234441648
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified:  ATA/ATAPI-1 ATA/ATAPI-2 ATA/ATAPI-3 ATA/ATAPI-4 ATA/ATAPI-5

 * signifies the current active mode


Comment 2 David Lawrence 2006-09-05 19:06:10 UTC
Changing to proper owner, kernel-maint.

Comment 3 Lloyd Matthews 2006-09-30 18:45:26 UTC
Well, I've been running 2.6.17-1.2187_FC5 for 11 days now with absolutely no
problems.  I've run with and without the LIVNA Nvidia drivers with no problems.
 I'll give it another couple of weeks and give a final update.  If 2.6.18 is
released to updates before then, I'll run it a couple of weeks first and then
update this bug report.

Comment 4 Lloyd Matthews 2006-10-06 15:50:53 UTC
Well, I finally had a lost interrupt on the 14th day of running 2187.  It
resolved itself after ~30 seconds, and I was able to continue running without
further problems.  So, even though this kernel version is much better, the
problem remains.  It is at least usable: over the 13 days it ran without a
hitch, I used the computer for at least 40 hours in total.

Comment 5 Dave Jones 2006-10-16 19:55:19 UTC
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 6 Lloyd Matthews 2006-10-28 18:17:45 UTC
Well, I've been running the new 2.6.18-1.2200 kernel for 5 days and today I got
my first lost interrupt.  Here are a couple of the messages:

Oct 28 09:44:02 HAL5000 kernel: hda: dma_intr: status=0x7f { DriveReady DeviceFault SeekComplete DataRequest CorrectedError Index Error }
Oct 28 09:44:02 HAL5000 kernel: hda: dma_intr: error=0x00 { }
Oct 28 09:44:02 HAL5000 kernel: ide: failed opcode was: unknown
Oct 28 09:44:02 HAL5000 kernel: hda: DMA disabled
Oct 28 09:44:02 HAL5000 kernel: ide0: reset: success
Oct 28 09:44:02 HAL5000 kernel: hda: lost interrupt

It happened while I was booting up, but other than a short delay in getting the
login screen, it recovered fine.  System worked fine the rest of the time I was
on it for the day, about 3.5 hours.

Comment 7 Dave Jones 2006-11-21 23:08:11 UTC
Alan, have you any idea what could be causing these?

Comment 8 Alan Cox 2006-11-21 23:22:40 UTC
DeviceFault one looks like a disk problem, the 0x7F one similarly looks like the
disk went bananas and eventually a reset brought it back to sanity. Question is
what is the trigger - could be a dodgy drive but might be something else in theory.

What do the smart utils show in the drive error log after such a reset ?
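
For readers following the status values quoted in this report (0x00, 0x10, 0x20, 0x7f), here is a minimal sketch of how the ATA status byte maps onto the flag names the IDE driver prints. The table uses the standard ATA status bit positions. The function name is illustrative and is not taken from the kernel source.

```python
# ATA status register bits, named the way the IDE driver prints them.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

def decode_status(status):
    """Return the names of the bits set in an ATA status byte."""
    return [name for mask, name in STATUS_BITS if status & mask]

# Status values that appear in the log excerpts of this report.
for value in (0x00, 0x10, 0x20, 0x7F):
    print(f"status=0x{value:02x}", decode_status(value))
```

Read this way, 0x7f is every bit except Busy set at once, which is not a combination a healthy drive should report and fits the description of the disk going "bananas".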


Comment 9 Lloyd Matthews 2006-11-22 17:38:02 UTC
Created attachment 141922
Output from smart.

Output from smartctl on hda

Comment 10 Lloyd Matthews 2006-11-22 17:57:02 UTC
Just a few comments.  I'm currently running the 2239 kernel.  Just before I
updated to it, I was getting the lost interrupts about every other day.  Also
getting error messages from udev at the start of boot.  I'm guessing the udev
messages (which coincided with the days I got lost interrupts) were caused by
the lost interrupts.  As soon as I did the update to 2239, the system
straightened out.  I've run for 7 days now with no kernel messages indicating
any problems.  Of course, I've run up to 13 days with earlier kernels before I
saw anything, so I will have to reserve judgement on whether this is finally
fixed for a couple more weeks.

As for the smart data, the 15 reallocated sectors have been holding at that
level for several months.  I have another system with a WD as the boot device
and it has many more, although it has never shown a problem.  As a matter of
fact, most of the sectors were reallocated under Windows on the other system.
I have read many comments that WD drives have a tendency toward this.  All I
know is the two systems I run WD as boot devices have these, while the system
that uses an IBM drive has none.

The UDMA CRC Error Count occurred with the 2157 and 2174 kernels.  I haven't had
any since then.

I was wondering if the HZ being changed in this new kernel would have any impact
on the interrupt?



Comment 11 Alan Cox 2006-11-22 18:01:56 UTC
Error 11 occurred at disk power-on lifetime: 797 hours (33 days + 5 hours)
  When the command that caused the error occurred, the device was doing SMART
Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 13 7b 5f ea  Error: UNC 8 sectors at LBA = 0x0a5f7b13 = 174029587

8 uncorrectable sectors caused the long delay/recovery and error log entry. It
may just be freaking when you hit those sectors and a recovery/rewrite of them
may sort the drive out entirely.  Doesn't look like a Linux problem.
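
As a worked check of the SMART log entry quoted above, the sketch below rebuilds the LBA from the register dump. It assumes 28-bit LBA addressing, where the low nibble of DH holds the top four address bits. The helper name is illustrative. Bit 0x40 of the error register is the UNC (uncorrectable data) flag.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the ATA taskfile registers into a 28-bit LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

# ER ST SC SN CL CH DH, copied from the error log entry.
er, st, sc, sn, cl, ch, dh = 0x40, 0x51, 0x08, 0x13, 0x7B, 0x5F, 0xEA

lba = lba28_from_registers(sn, cl, ch, dh)
print(hex(lba), lba)                 # 0xa5f7b13 174029587
print("UNC set:", bool(er & 0x40))   # UNC set: True
print("sector count:", sc)           # sector count: 8
```

The result matches the "LBA = 0x0a5f7b13 = 174029587" that smartctl printed, and it lies below the 234441648 LBA sectors hdparm reported for the drive.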



Comment 12 Alan Cox 2006-11-22 18:03:55 UTC
You might btw want to try the disk vendor's tools if you have suitable "other OS"
products around. That may be sufficient to recover the drive fully (or see the
Bad block howto on the smartmontools.sf.net site but that is a more scary approach)


Comment 13 Lloyd Matthews 2006-11-22 18:18:37 UTC
If the uncorrectable sectors were the cause of the "lost interrupt", wouldn't I
have seen more reallocated sectors over the past couple of months, coinciding
with the lost interrupt and seek messages?

Also, as to my earlier comments today about the kernel looking good so far, it
figures that I would jinx myself by saying that.  Although I did not have a lost
interrupt message I did get a seek message today.  Hope it is not a harbinger of
future lost interrupts.

kernel: hda: dma_intr: status=0x10 { SeekComplete }

Comment 14 Alan Cox 2006-11-22 20:33:06 UTC
Sectors can only be reallocated if the block in question is rewritten with new data.


Comment 15 Lloyd Matthews 2006-11-22 21:31:36 UTC
Thank you for the info.  I thought that if there were bad sectors that had not
been reallocated, then they would show as pending until reallocated.  Since I
don't have any Current Pending Sectors, and there haven't been any for several
months, I guess I incorrectly assumed that the sectors were stable.

Thanks,
Lloyd

Comment 16 Lloyd Matthews 2006-12-30 19:28:59 UTC
Just a final update to fully close this bug.  I finally found the problem.  As I
suspected, it was NOT a problem with the hard drive.  It turned out that the IDE
cable was flaky.  Early in my troubleshooting I had checked that the cable
connectors were fully seated, but never suspected that the cable or connectors
could be going bad.  It has been three weeks with NO DMA/interrupt error
messages and no lockups.

