Bug 140864

Summary: dma errors on ide hard drive
Product: [Fedora] Fedora
Reporter: Paul Tomblin <ptomblin>
Component: kernel
Assignee: Alan Cox <alan>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 4
CC: alan, ccb, davej, wtogami
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2006-01-19 07:23:27 UTC
Attachments:
Patch from 2.6.13.1 that breaks TP600 IDE DMA (flags: none)

Description Paul Tomblin 2004-11-25 21:28:56 UTC
Description of problem:
The second hard drive on my system doesn't work with the 2.6 kernel. 
It worked in my system for 2 years under RedHat 7.3 and RedHat 9 with
no incidents, and it works just fine if I boot the system with Knoppix
3.6 with a 2.4 kernel.  If I boot with Knoppix with the 2.6 kernel
option, I have the same problems with it as I have with Fedora Core 3.

Version-Release number of selected component (if applicable):
2.6

How reproducible:
Very.

Steps to Reproduce:
1. Boot with a 2.6 kernel, either Fedora Core 3 or Knoppix with the
boot option "knoppix26 2"
2. Mount a partition off this second drive
3. Do a "find /mnt/hdc4 -type f -print"
  
Actual results:
The whole system becomes unresponsive and freezes up.  Messages like
"hdc: dma_timer_expiry: dma status == 0x21" or
Nov 13 23:15:12 allhats kernel: hdc: dma_timer_expiry: dma status == 0x61
Nov 13 23:15:22 allhats kernel: hdc: DMA timeout error
Nov 13 23:15:22 allhats kernel: hdc: dma timeout error: status=0xd0 { Busy }
Nov 13 23:15:22 allhats kernel:
Nov 13 23:15:22 allhats kernel: ide: failed opcode was: unknown
Nov 13 23:15:22 allhats kernel: hdc: DMA disabled
Nov 13 23:15:22 allhats kernel: hdd: DMA disabled
Nov 13 23:15:22 allhats kernel: ide1: reset: success
Nov 13 23:15:32 allhats kernel: hdc: lost interrupt
Nov 13 23:16:12 allhats last message repeated 4 times
Nov 13 23:17:22 allhats last message repeated 7 times
Nov 13 23:18:12 allhats last message repeated 5 times
Nov 13 23:18:18 allhats kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Nov 13 23:18:18 allhats kernel:
Nov 13 23:18:18 allhats kernel: ide: failed opcode was: unknown
Nov 13 23:18:18 allhats kernel: hdc: drive not ready for command
Nov 13 23:18:18 allhats kernel: hdd: status timeout: status=0xd0 { Busy }
Nov 13 23:18:18 allhats kernel: hdd: status timeout: error=0xd0LastFailedSense 0
appear in /var/log/messages or on the console.
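
The hex values in these messages are bit fields, and decoding them makes the
log easier to read.  The following is only a sketch, not kernel code: it
assumes the standard ATA status register layout for "status=" and the usual
PCI bus-master IDE status register layout for "dma status ==".  The table and
function names (ATA_STATUS_BITS, BMIDE_STATUS_BITS, decode_ata_status,
decode_dma_status) and the labels for the DMA bits are made up for
illustration.

```python
# Assumed layout of the ATA status register; the labels follow the
# names the kernel prints in braces in the log above.
ATA_STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# Assumed layout of the bus-master IDE status register ("dma status").
# These labels are illustrative, not kernel identifiers.
BMIDE_STATUS_BITS = [
    (0x80, "SimplexOnly"),
    (0x40, "Drive1DmaCapable"),
    (0x20, "Drive0DmaCapable"),
    (0x04, "Interrupt"),
    (0x02, "Error"),
    (0x01, "Active"),
]


def _decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


def decode_ata_status(value):
    """Decode an ATA status byte such as the 0x58 or 0xd0 seen in the log."""
    # While Busy is set the remaining bits are not meaningful,
    # so report Busy on its own, as the log messages do.
    if value & 0x80:
        return ["Busy"]
    return _decode(value, ATA_STATUS_BITS)


def decode_dma_status(value):
    """Decode a bus-master IDE status byte such as 0x21 or 0x61."""
    return _decode(value, BMIDE_STATUS_BITS)


if __name__ == "__main__":
    for status in (0xD0, 0x58):
        print(f"status=0x{status:02x}", decode_ata_status(status))
    for dma in (0x21, 0x61):
        print(f"dma status == 0x{dma:02x}", decode_dma_status(dma))
```

Under those assumptions, both 0x21 and 0x61 have the "active" bit set and the
"interrupt" bit clear, which would fit a transfer that was started but never
completed, and the later "lost interrupt" messages.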

Expected results:
A list of files.  This command works fine if I boot with Knoppix with
a 2.4 kernel instead.

Additional info:
I've tried this drive in two different systems, both my main Linux
system booted with Fedora Core 3 and my Windows box booted with
Knoppix.  The two boxes have different processors, different chipsets,
and different peripherals.  I've run Western Digital Diagnostics, and
found no errors on the hard drive.  I've fscked both partitions on the
drive, both the Reiserfs one and the ext3 one, and found no errors. 
I've tried different hard drive cables.  I've tried it jumpered as
Cable Select or Master/Slave.

Comment 1 Paul Tomblin 2004-11-26 02:11:49 UTC
Oh, it just struck me that one odd thing about this drive is that it
only has primary partitions numbered 1 and 4.  It used to have 1,2,3,4
but 4 was in use and the others weren't, so I deleted 1,2, and 3 and
made one big one.  Like I said before, though, it works fine in the
2.4 kernel.  Oh, and one other factor: the original machine where it's
lived for 2 years and the one I tried it on both have AMD processors.
The first machine has two AthlonMP 1800+ and the second machine has an
AthlonXP 2400+.



Comment 2 Paul Tomblin 2004-11-29 12:02:39 UTC
I've replaced the drive with a different drive (also Western Digital,
but this one is 200GB instead of 180GB) and everything is working
fine.  If anybody wants to borrow the drive that's causing the
problems for a few weeks to run some tests, let me know, because
otherwise I'm going to reformat it and put it in my Tivo.

Comment 3 Dave Jones 2005-07-15 20:38:24 UTC
An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which
may contain a fix for your problem.   Please update to this new kernel, and
report whether or not it fixes your problem.

If you have updated to Fedora Core 4 since this bug was opened, and the problem
still occurs with the latest updates for that release, please change the version
field of this bug to 'fc4'.

Thank you.

Comment 4 ptomblin 2005-08-20 21:03:04 UTC
I just tried it with 2.6.12-1373_FC3smp and it still happens.  However, I'm going to have to withdraw 
my offer to ship the drive to somebody to examine, because I need it on another system so I'm going to 
have to repartition and mkfs it.

Comment 5 Alan Cox 2005-08-21 11:41:47 UTC
Thanks anyway. Would be interested to know if the new system shows the same
problem too.


Comment 6 Charlie Bennett 2005-10-20 21:12:22 UTC
Similar problems here.  Since the 2.6.13 FC4 kernels I can't boot the kernel on
my IBM TP600x w/Fujitsu MHT2080AH without suffering dma_timer_expiry: dma_status
== 21 and eventual timeouts (error 58) causing the drive to drop to pio after
half a dozen timeouts.

This problem does not exist for me with the 2.6.12-1.1447 kernels - unless I
warm boot after running a more recent kernel.  Both 1526 and 1532 have it. 
Perhaps something nasty is going on in setting up PIIX4 in recent kernels.

Comment 7 Charlie Bennett 2005-10-20 21:22:39 UTC
More data:  when the kernel finally comes up, APM appears to be messed up.
/proc/apm reads: 1.16ac 1.2 0x03 0xff 0xff 0xff -1% -1 ?
and none of the apm commands work.
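
For anyone reading the /proc/apm line above, here is a rough sketch of how it
can be split into fields.  It assumes the usual nine-field layout (driver
version, APM BIOS version, BIOS flags, AC line status, battery status, battery
flag, charge percentage, time remaining, time units), with 0xff, -1 and "?"
meaning "unknown".  The ApmInfo and parse_apm names are made up for
illustration.

```python
from typing import NamedTuple, Optional


class ApmInfo(NamedTuple):
    """One parsed /proc/apm line; None marks a field reported as unknown."""
    driver_version: str
    bios_version: str
    bios_flags: int
    ac_line_status: Optional[int]
    battery_status: Optional[int]
    battery_flag: Optional[int]
    percentage: Optional[int]
    time_remaining: Optional[int]
    time_units: Optional[str]


def _hex_or_unknown(raw: str) -> Optional[int]:
    """Parse a hex field, mapping 0xff (assumed 'unknown') to None."""
    value = int(raw, 16)
    return None if value == 0xFF else value


def _int_or_unknown(raw: str) -> Optional[int]:
    """Parse a decimal field, mapping -1 (assumed 'unknown') to None."""
    value = int(raw)
    return None if value == -1 else value


def parse_apm(line: str) -> ApmInfo:
    """Split a /proc/apm line into its nine assumed fields."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    return ApmInfo(
        driver_version=fields[0],
        bios_version=fields[1],
        bios_flags=int(fields[2], 16),
        ac_line_status=_hex_or_unknown(fields[3]),
        battery_status=_hex_or_unknown(fields[4]),
        battery_flag=_hex_or_unknown(fields[5]),
        percentage=_int_or_unknown(fields[6].rstrip("%")),
        time_remaining=_int_or_unknown(fields[7]),
        time_units=None if fields[8] == "?" else fields[8],
    )


if __name__ == "__main__":
    print(parse_apm("1.16ac 1.2 0x03 0xff 0xff 0xff -1% -1 ?"))
```

Read this way, every status field after the BIOS flags comes back as unknown,
which matches the report that none of the apm commands work.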

If I boot the system with ACPI I have no IDE worries.
If I could get ACPI to make my laptop sleep and wake up and do nice tricks this
wouldn't be a problem...  but the user-space side is still not ready for prime time.

So...  did something break in the code that talks to APM bios during system boot?


Comment 8 Alan Cox 2005-10-20 21:46:01 UTC
Ditto on my TP600. Can't duplicate it with my upstream kernel builds, but even
with ACPI I see the failure. Upstream IDE has not changed in any obvious way. It
could be the SATA code perhaps, but I don't see why it would touch PIIX4, as it's
too early for SATA to care about.



Comment 9 Charlie Bennett 2005-10-21 20:23:01 UTC
Further data point.  I just loaded FC4 and that kernel on a TP T20.
ACPI is off due to cutoff age.
No problem.  IDE DMA works great.

With the exception of the miniPCI card and video controller the T20 and 600x
have all the same Intel & TI chipsets.  

Comment 10 Charlie Bennett 2005-10-22 02:10:43 UTC
pulling the miniPCI delays the onset of symptoms.

I built 2.6.13.4 from kernel.org using the -i686smp Fedora kernel config file.
Turned on 1000hz, preemption and asked it to build the ALSA serial-u16550 :-).

That kernel fails as well.

Comment 11 Charlie Bennett 2005-10-23 03:32:23 UTC
Created attachment 120284 [details]
Patch from 2.6.13.1 that breaks TP600 IDE DMA

Comment 12 Charlie Bennett 2005-10-23 03:34:35 UTC
what I said.  This patch introduced in 2.6.13.1 appears to be the source of the
problem.  Determined empirically.  No idea what the actual problem is.

Comment 13 Alan Cox 2005-10-24 13:12:18 UTC
Nor me, but I tested 2.6.14rc and, when that also failed, passed your findings
and my confirmation on the TP600 I have on to Linus, who asked for two lspci
outputs and then sent a patch to fix it about five minutes later, so it should
be fixed in 2.6.14.

Thanks for doing all the hard work to find the problem change.




Comment 14 Paul Tomblin 2005-10-24 15:02:40 UTC
BTW: I still have the drive that was causing the original errors.  I was going
to put it in a TiVo, but I tore a tendon in my wrist and so I haven't been in
the mood for ripping apart systems for a few weeks.  If anybody wants me to ship
it to them or to run additional tests on it, please let me know.

Comment 15 Charlie Bennett 2005-10-24 17:42:06 UTC
my pleasure.  this one was 'itchy'
now off to figure out what NFS clients lock hard under load

Comment 16 Charlie Bennett 2005-10-24 17:42:38 UTC
my pleasure.  this one was 'itchy'
now off to figure out why NFS clients lock hard under load

Comment 17 Dave Jones 2006-01-16 22:29:50 UTC
This is a mass-update to all currently open Fedora Core 3 kernel bugs.

Fedora Core 3 support has transitioned to the Fedora Legacy project.
Due to the limited resources of this project, typically only
updates for new security issues are released.

As this bug isn't security related, it has been migrated to a
Fedora Core 4 bug.  Please upgrade to this newer release, and
test if this bug is still present there.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

Thank you.


Comment 18 Charlie Bennett 2006-01-17 15:08:22 UTC
I'm not the reporter but I've been following closely.
The TP600 issues were corrected in upstream 2.6.14 and have stayed fixed.

Comment 19 Dave Jones 2006-01-19 07:23:27 UTC
thanks for the update Charlie.