Bug 97356

Summary: ide chipsets produce "dma_timer_expiry" after 2 days of network stress and in some cases hang the system
Product: [Retired] Red Hat Linux
Component: kernel
Version: 9
Hardware: i686
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: medium
Reporter: Joshua Giles <joshua_giles>
Assignee: Arjan van de Ven <arjanv>
QA Contact: Brian Brock <bbrock>
CC: alan, barryn, elliot, k.georgiou, tao
Doc Type: Bug Fix
Last Closed: 2004-09-30 15:41:08 UTC
Attachments: Patch from Alan (flags: none)

Description Joshua Giles 2003-06-13 16:31:41 UTC
Description of problem:
IDE chipsets produce "dma_timer_expiry" after 2 days of network stress and 
in some cases hang the system.  The cmd680 and ICH5 chipsets have hung under 
high network stress.  These chipsets are used in the PE650 and PE400 SC.

dmesg snip:
"Jun  9 12:57:09 dell kernel: hdc: dma_timer_expiry: dma status == 0x20
Jun  9 12:57:09 dell kernel: hdc: timeout waiting for DMA
Jun  9 12:57:09 dell kernel: hdc: timeout waiting for DMA
Jun  9 12:57:13 dell kernel: hdc: status timeout: status=0xd0 { Busy }
Jun  9 12:57:13 dell kernel: hdc: drive not ready for command
Jun  9 12:57:14 dell kernel: ide1: reset: success
"

System 1 Configuration:
BIOS: A01
1 X WD 80GB
1 X MAXTOR 250GB



Version-Release number of selected component (if applicable):
kernel-2.4.20-18.9

How reproducible:
Always

Steps to Reproduce:
1. Run network stress such as IObasher (over NFS) and multiple sendmail clients 
hitting the server.
2. After approx. 1 day you will see "dma_timer_expiry" in dmesg.

Actual Results:  System hangs.  

Expected Results:  No errors and no hung system.

Additional info:

Sysrq is unresponsive.

Comment 1 Alan Cox 2003-06-13 16:40:57 UTC
A DMA timeout expiry itself should not be fatal. There are even cases where it may
happen when there are huge amounts of DMA traffic preventing the transfer, or when
the drive takes a long time to respond (e.g. having to write back a lot of cache
data, recalibrate, and handle a bad block recovery).

The hang is more problematic. Is the hang a series of 0xD0 {busy}, reset, then
still 0xD0, reset afterwards? If so, is the hang only occurring on WD drives? (I
ask this not because I suspect the WD drives are buggy; I'm just trying to match a
couple of other reports up.)
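
For anyone reading the status values in these logs, here is a minimal sketch (Python, not kernel code) that decodes an ATA status byte such as the 0xd0 above. The bit positions follow the standard ATA status register; the label strings and the function names `decode_ata_status` and `is_busy` are assumptions made for illustration.

```python
# Standard ATA status register bits, most significant first.
# The label strings are illustrative and not copied from the kernel source.
ATA_STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_ata_status(status):
    """Return the names of all bits set in an 8-bit ATA status value."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


def is_busy(status):
    """True when the Busy bit is set; the other bits are then not meaningful."""
    return bool(status & 0x80)


if __name__ == "__main__":
    # 0xd0 is the value reported in the log excerpt of this bug.
    print(hex(0xD0), decode_ata_status(0xD0), "busy:", is_busy(0xD0))
```

Because the remaining bits are not valid while Busy is set, a log line that shows only "{ Busy }" for 0xd0 is consistent with this decoding.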


Comment 2 Joshua Giles 2003-06-13 16:48:17 UTC
The kernel this issue was tested on is NOT 2.4.20-18.9.
The issue IS seen on 2.4.20-16.9smp.

Comment 3 Arjan van de Ven 2003-06-13 16:48:19 UTC
*** Bug 97358 has been marked as a duplicate of this bug. ***

Comment 4 Alan Cox 2003-06-13 16:55:28 UTC
Arjan, does -16 have the DMA recovery locking fixes that are in -ac? If they went
into -18 or are not in yet, that would explain why on SMP it can go from
timeout->hang rather than timeout->happy.


Comment 5 Robert Hentosh 2003-06-13 17:36:31 UTC
Alan, the problem was seen on Maxtor and Seagate drives. 

We have also seen an issue with IBM drives that we think is drive related.
I only mention this because the problem with the IBM drives happens under the
same stress and after the same period of time (around 2 days).  In the IBM case
it reports "unrecoverable errors" and does not get any "DMA timeouts".  We see
that as a separate issue.

Comment 6 Robert Hentosh 2003-06-13 17:41:15 UTC
The drivers/ide/... directory tree is the same between -16.9 and -18.9. I
checked with diff -urN.

Comment 7 Alan Cox 2003-06-13 17:43:27 UTC
The IBM one I can duplicate with a CMD680, but not with the -18 kernel, only the
one before we fixed the UDMA speed bug for you guys.  With wrong UDMA speed and
high load, the IBM drives I tried just powered off in disgust in the end 8)



Comment 8 Robert Hentosh 2003-06-13 17:52:30 UTC
Alan, is there a way to determine if we are seeing the timeout->hang state that
you are referring to in your previous post?  We are currently making a custom
kernel to toggle the TX line on the serial port when we get a DMA timeout, to
trigger an IDE bus analyzer so we can see what is happening on the IDE bus.

Comment 9 Alan Cox 2003-06-13 17:55:14 UTC
You get jammed in a spinlock. If that's the problem you are seeing, then a
2.4.21-rc7-ac1 tree would show the timeouts but recover. If it's us crashing a
drive in error recovery, you'll see continued reset/busy/reset/busy sequences. If
it just hangs dead, then it's probably locking.
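
The cases described above can be roughly told apart from a captured kernel log. The following is only a sketch in Python: the regular expressions are taken from the log excerpt in this report rather than from a complete list of kernel messages, and `summarize`, `classify` and the `loop_threshold` default are hypothetical names and values chosen for illustration. A hard hang cannot be detected this way, since the log simply stops.

```python
import re

# Patterns modelled on the messages quoted in this bug report.
TIMEOUT_RE = re.compile(r"\bhd[a-z]: (?:dma_timer_expiry|timeout waiting for DMA)")
BUSY_RE = re.compile(r"\bhd[a-z]: status timeout: status=0x[0-9a-f]+ \{ Busy \}")
RESET_RE = re.compile(r"\bide\d+: reset: ")


def summarize(lines):
    """Count DMA timeout, busy status timeout and reset messages."""
    counts = {"dma_timeouts": 0, "busy_timeouts": 0, "resets": 0}
    for line in lines:
        if TIMEOUT_RE.search(line):
            counts["dma_timeouts"] += 1
        elif BUSY_RE.search(line):
            counts["busy_timeouts"] += 1
        elif RESET_RE.search(line):
            counts["resets"] += 1
    return counts


def classify(lines, loop_threshold=3):
    """Label a log as a reset/busy loop, a plain timeout, or clean.

    loop_threshold is an arbitrary illustrative cut-off, not a kernel value.
    """
    counts = summarize(lines)
    if counts["resets"] >= loop_threshold and counts["busy_timeouts"] >= loop_threshold:
        return "reset/busy loop"
    if counts["dma_timeouts"] > 0:
        return "timeout seen"
    return "no DMA timeout"


if __name__ == "__main__":
    sample = [
        "Jun  9 12:57:09 dell kernel: hdc: dma_timer_expiry: dma status == 0x20",
        "Jun  9 12:57:09 dell kernel: hdc: timeout waiting for DMA",
        "Jun  9 12:57:13 dell kernel: hdc: status timeout: status=0xd0 { Busy }",
        "Jun  9 12:57:14 dell kernel: ide1: reset: success",
    ]
    print(summarize(sample), classify(sample))
```

A single timeout followed by one successful reset, as in the original excerpt, lands in the "timeout seen" bucket; repeated busy/reset pairs would point at the error-recovery case instead.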


Comment 10 Robert Hentosh 2003-06-13 20:37:16 UTC
Okay. We have more data from our test organizations.  My team has been able to
reproduce the DMA timeouts that caused DMA to be disabled on the drive.  We can
turn DMA back on and after more stress it will again turn off.  

I guess that is behaving as designed, but there is some question whether this
behavior should change.  That will only be determined when we really localize
what is causing the DMA timeouts and review the specs.  Right now we have seen
the "DMA timeouts" with different IDE chipsets and different drive vendors (IBM,
Hitachi, Fujitsu, Seagate and Maxtor) after ~2 days of network / disk
stress.  The drive is put into PIO mode and chugs along.

The original hang that happened after DMA timeouts has not been reproducible
after about a month of testing.  It was erroneously reported just now when the
tester saw the "DMA timeouts" and thought he had reproduced the original hang.  At
this time we are unable to reproduce the original hang.  The original hang didn't
respond to Sysrq.

We also have a report of Filesystem corruption with the DMA timeouts. At this
time I believe that was with the IBM drives and I am trying to confirm that
there was no corruption with known good drives.


Comment 11 Michael K. Johnson 2003-06-16 15:33:55 UTC
Created attachment 92431
Patch from Alan

Here's the test patch from Alan that might resolve the problem.

Comment 12 Joshua Giles 2003-06-20 15:19:35 UTC
OK, the kernel (up) at http://people.redhat.com/ltroan/fixes/RHL9-ide/i686/ 
has failed after ~1 day of stress.  DMA is disabled at this point and an 
additional message was seen "hda: error waiting for DMA".  Also, dmesg reports 
DMA being disabled on hda, but this was not seen on previous kernels.  
However, hdparm shows hda with DMA still on.

System Config:
2.26 GHz P4
4GB RAM 333 MHz
X06 BIOS
Hitachi 40GB <--- CPG Hard Drive
Seagate 80GB <--- CPG Hard Drive
Pro100S in Slot 4




Comment 14 Michael K. Johnson 2003-06-23 15:13:11 UTC
What exactly does "failed" mean here?  It's unclear to me from this
description whether the code is working as expected...

Comment 15 Joshua Giles 2003-06-23 17:24:08 UTC
"failed" = DMA is still being disabled (on hdb).  It was my understanding that 
the OS would no longer have to disable dma.  How is this code supposed to work 
under these conditions?

Comment 16 Larry Troan 2003-06-25 14:46:05 UTC
I OPENED ISSUE TRACKER 24599 TO TRACK THIS IN THE DELL GROUP.....

Detailed history should be kept in this Bugzilla

Comment 17 Joshua Giles 2003-06-25 21:51:23 UTC
OK, after looking at the code (and the results of our test), I think we will 
need some traces from an analyser and a better understanding of what sort of 
behavior the IDE spec expects of the OS. 

Comment 18 Joshua Giles 2003-06-26 16:06:41 UTC
It looks like the DMA error bit (1) is being set.  The timeout looks like a 
different sort of error.  A document on developer.intel.com talks 
specifically about the IDE part of the ICH chips, and it says that the error 
bit is set when the ICH gets a target abort or a master abort on the PCI bus.  
FYI, a master abort will happen when no target PCI device (north bridge) could 
complete the cycle as issued.  For example (and this is not actually 
the case), if the north bridge only accepts double-word writes and the IDE 
controller tried a byte write.  If the doc is correct, we may have a chipset 
bug (highly unlikely since it happens on CMD680 and ICH5) or the IDE driver is 
setting up DMA to read from or write to an invalid address.  We are investigating 
the chipset w/a PCI analyser.  Can someone dig into the ide (yuck) layer?
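
As a reading aid for the "dma status == 0x.." values, here is a minimal sketch (Python, not driver code) that decodes the bus master IDE status byte. It assumes the conventional bus master IDE status register layout (active, error and interrupt in bits 0 to 2, per-drive DMA capable flags in bits 5 and 6); the names `BMIDE_STATUS_BITS` and `decode_dma_status` are made up for this example, and bit 7 is left undecoded because its meaning varies between controllers.

```python
# Conventional bus master IDE status register bits (assumed layout).
BMIDE_STATUS_BITS = [
    (0x01, "Active"),           # bus master transfer still in progress
    (0x02, "Error"),            # controller hit an error on the PCI side
    (0x04, "Interrupt"),        # drive has raised its interrupt
    (0x20, "Drive0DMACapable"),
    (0x40, "Drive1DMACapable"),
]


def decode_dma_status(status):
    """Return the names of the known bits set in a bus master status byte."""
    return [name for mask, name in BMIDE_STATUS_BITS if status & mask]


def has_dma_error(status):
    """True when the error bit (bit 1, mask 0x02) is set."""
    return bool(status & 0x02)


if __name__ == "__main__":
    # 0x20 is the value from the original log excerpt in this report.
    print(hex(0x20), decode_dma_status(0x20), "error:", has_dma_error(0x20))
```

Under this layout the 0x20 in the original report has neither the error nor the interrupt bit set, so it is a different signature from a status with the error bit set.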

Comment 19 Michael K. Johnson 2003-06-26 19:13:33 UTC
Does this ever happen if you use hdparm to explicitly set a lower
transfer rate?
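
For reference, hdparm's -X option takes either a mode name or a number, where the man page gives PIO as 8 + n, multiword DMA as 32 + n and UltraDMA as 64 + n. The sketch below (Python) only builds the command line and does not run it; `xfer_mode_value` and `hdparm_command` are hypothetical helper names, and the choice of UDMA2 on /dev/hda is purely illustrative, not a setting suggested anywhere in this bug.

```python
# Numeric bases for hdparm -X as documented in the hdparm man page.
XFER_BASE = {"pio": 8, "mdma": 32, "udma": 64}


def xfer_mode_value(kind, level):
    """Return the numeric -X argument for a transfer mode kind and level."""
    if kind not in XFER_BASE:
        raise ValueError("unknown transfer mode kind: %r" % (kind,))
    if not 0 <= level <= 7:
        raise ValueError("mode level out of range: %r" % (level,))
    return XFER_BASE[kind] + level


def hdparm_command(device, kind, level):
    """Build (but do not execute) an hdparm command for the given mode.

    DMA is switched on with -d1 for DMA modes and off with -d0 for PIO.
    """
    dma_flag = "-d0" if kind == "pio" else "-d1"
    return ["hdparm", dma_flag, "-X%d" % xfer_mode_value(kind, level), device]


if __name__ == "__main__":
    # Illustrative only: UDMA2 on the first IDE disk.
    print(" ".join(hdparm_command("/dev/hda", "udma", 2)))
```

The hdparm man page warns that -X can be dangerous on a live system, so a command like this belongs on a test machine with backups.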

Comment 20 Larry Troan 2003-08-07 02:13:35 UTC
FROM ISSUE TRACKER...
 Event posted 08-06-2003 10:31am by jgiles with duration of 0.00        
I have yet to reproduce the problem with DMA being turned off with the patched
kernel.  However, we did reproduce a similar problem with DMA timeouts, and upon
further analysis the DMA completes in the following manner: Linux follows all
protocols w/respect to the spec and all the data is transferred.  The drive
thinks there is more data to come; it never generates an interrupt to complete
the DMA.  We've had a tough time reproducing this problem, thus I did not want to
confuse the matter by adding another variable into the picture until I could
reproduce it consistently.  Right now the data points to a drive vendor problem,
but it is still under investigation.  Still in our court.



Comment 21 Christoph Rudorff 2004-01-03 13:12:15 UTC
Hello! 
 
Got the same trouble on two different machines. Only one thing is 
common: the harddrive is an IBM IC35L080AVVA07-0 
 
Machine one is a cheap server replacement with Intel PIIX4 Ultra 100 
Chipset and kernel 2.4.22-0.18mdk. DMA timeout happened once after 
running its cron jobs at 4:22 am. 
 
Machine two is my desktop, SiS 5513 Ultra 133 chipset (running at 
udma4, lowered from udma5, but that did not help), kernel 2.4.21-0.13mdk. 
Timeout happens every ~2 days, but sometimes 3 times a day. Mostly 
while clicking on a link in the Konqueror web browser !?! Currently 6.3M 
cache, 416 files. Machine freezes for some seconds but I can always 
resume my work. No filesystem corruption has happened yet. 
 
While googling I found another report with the same hard drive here: 
 
http://www.linuxquestions.org/questions/showthread.php?threadid=61988 
 
 

Comment 22 Alan Cox 2004-01-16 23:39:59 UTC
There is an obscure case where the timeout can occur from a race
rather than drives having a sulk (e.g. error correcting). It's fixed in
2.4.24.


Comment 23 Barry K. Nathan 2004-01-17 02:47:54 UTC
Alan, wasn't it fixed in 2.4.25-pre4 not 2.4.24? Or were there two IDE
races fixed recently? (Just trying to make sure I understand what's
going on.)

Comment 24 Christoph Rudorff 2004-02-26 00:49:45 UTC
Hello ... again! 
 
It happened again. Kernel 2.4.22, VT82C686 chipset and IBM notebook 
HD (IC25N020ATDA04-0). Suddenly the music went off, Internet still 
running but no more login. I'm not sure if this 70% crash is related 
to the DMA issue. I opened the box for a check and found the battery 
on the mainboard had exploded some weeks before. (uptime ~70 days ... 
w/o DMA errors) 
 
And it happened again. On my desktop box, see my previous comment. 
But this time running with a 2.6.3 ! Text from log differs a bit. 
"Nothing" happened, no hung system. I did not reboot to write this 
note ... 
 
hda: dma_timer_expiry: dma status == 0x20 
hda: DMA timeout retry 
hda: timeout waiting for DMA 
hda: status timeout: status=0xd0 { Busy } 
 
hda: drive not ready for command 
ide0: reset: success 
 
So I diffed a 2.4.22 and 2.4.25. Mh, some work done around 
drivers/ide but ide-dma.c is untouched. Well, I will install and 
watch. 
 
 

Comment 25 Bugzilla owner 2004-09-30 15:41:08 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/