Bug 97356
| Field | Value |
|---|---|
| Summary | IDE chipsets produce "dma_timer_expiry" after 2 days of network stress and in some cases hang the system |
| Product | [Retired] Red Hat Linux |
| Component | kernel |
| Version | 9 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED WONTFIX |
| Severity | high |
| Priority | medium |
| Reporter | Joshua Giles <joshua_giles> |
| Assignee | Arjan van de Ven <arjanv> |
| QA Contact | Brian Brock <bbrock> |
| CC | alan, barryn, elliot, k.georgiou, tao |
| Doc Type | Bug Fix |
| Last Closed | 2004-09-30 15:41:08 UTC |
| Attachments | 92431: Patch from Alan |
Description
Joshua Giles, 2003-06-13 16:31:41 UTC
A DMA timeout expiry itself should not be fatal. There are even cases where it may happen legitimately: when huge amounts of DMA traffic prevent the transfer, or when the drive takes a long time to respond (e.g. having to write back a lot of cached data, recalibrate, or handle bad-block recovery). The hang is more problematic. Is the hang a series of 0xD0 {busy} reset, then still 0xD0 reset afterwards? If so, does the hang occur only on WD drives? (I ask not because I suspect the WD drives are buggy; I'm just trying to match up a couple of other reports.)

The kernel this issue was tested on is NOT 2.4.20-18.9. The issue IS seen on 2.4.20-16.9smp.

*** Bug 97358 has been marked as a duplicate of this bug. ***

Arjan, does -16 have the DMA recovery locking fixes that are in -ac? If they went into -18, or are not in yet, that would explain why on SMP it can go from timeout->hang rather than timeout->happy.

Alan, the problem was seen on Maxtor and Seagate drives. We have also seen an issue with IBM drives that we think is drive related. I only mention this because the problem with the IBM drives happens under the same stress and after the same period of time (around 2 days). In the IBM case it reports "unrecoverable errors" and does not get any "DMA timeouts"; we see that as a separate issue. The drivers/ide/... directory tree is the same between -16.9 and -18.9; I checked with a diff -urN.

The IBM one I can duplicate with a CMD680, but not with the -18 kernel, only the one before we fixed the UDMA speed bug for you guys. With the wrong UDMA speed and high load, the IBM drives I tried just powered off in disgust in the end 8)

Alan, is there a way to determine if we are seeing the timeout->hang state that you are referring to in your previous post? We are currently making a custom kernel to toggle the TX line on the serial port when we get a DMA timeout, to trigger an IDE bus analyzer so we can see what is happening on the IDE bus.

You get jammed in a spinlock.
If that's the problem you are seeing, then a 2.4.21-rc7-ac1 tree would show the timeouts but recover. If it's us crashing a drive in error recovery, you'll see continued reset/busy/reset/busy sequences. If it just hangs dead, then it's probably locking.

Okay, we have more data from our test organizations. My team has been able to reproduce the DMA timeouts that cause DMA to be disabled on the drive. We can turn DMA back on, and after more stress it will again turn off. I guess that is behaving as designed, but there is some question whether this behavior should change; that will only be determined once we really localize what is causing the DMA timeouts and review the specs. Right now, with different IDE chipsets and different drive vendors (IBM, Hitachi, Fujitsu, Seagate and Maxtor), we see the "DMA timeouts" after ~2 days of network/disk stress. The drive is put into PIO mode and chugs along. The original hang that happened after DMA timeouts has not been reproducible after about a month of testing; it was erroneously reported just now when the tester saw the "DMA timeouts" and thought he had reproduced the original hang. At this time we are unable to reproduce the original hang. The original hang didn't respond to SysRq. We also have a report of filesystem corruption with the DMA timeouts. At this time I believe that was with the IBM drives, and I am trying to confirm that there was no corruption with known good drives.

Created attachment 92431 [details]
Patch from Alan

Here's the test patch from Alan that might resolve the problem.
OK, the kernel (UP) at http://people.redhat.com/ltroan/fixes/RHL9-ide/i686/ has failed after ~1 day of stress. DMA is disabled at this point, and an additional message was seen: "hda: error waiting for DMA". Also, dmesg reports DMA being disabled on hda, which was not seen on previous kernels. However, hdparm shows hda with DMA still on. System config: 2.26 GHz P4, 4 GB RAM, 333 MHz, X06 BIOS, Hitachi 40 GB (CPG hard drive), Seagate 80 GB (CPG hard drive), Pro100S in slot 4.

What exactly does "failed" mean here? It's unclear to me from this description whether the code is working as expected...

"Failed" = DMA is still being disabled (on hdb). It was my understanding that the OS would no longer have to disable DMA. How is this code supposed to work under these conditions?

I opened Issue Tracker 24599 to track this in the Dell group.

Detailed history should be kept in this Bugzilla.

OK, after looking at the code (and the results of our test), I think some traces will be needed on an analyser, along with a better understanding of what sort of behavior the IDE spec expects of the OS. It looks like the DMA error bit (1) is being set; the timeout looks like a different sort of error. A document on developer.intel.com that talks specifically about the IDE part of the ICH chips says that the error bit is set when the ICH gets a target abort or a master abort on the PCI bus. FYI, a master abort will happen when no PCI target (north bridge) can complete the cycle as issued.
For example (and this is not actually the case), if the north bridge only accepted double-word writes and the IDE controller tried a byte write. If the doc is correct, we may have a chipset bug (highly unlikely, since it happens on both the CMD680 and the ICH5), or the IDE driver is setting up DMA to read from or write to an invalid address. We are investigating the chipset with a PCI analyser. Can someone dig into the IDE (yuck) layer?

Does this ever happen if you use hdparm to explicitly set a lower transfer rate?

FROM ISSUE TRACKER... Event posted 08-06-2003 10:31am by jgiles with duration of 0.00

I have yet to reproduce the problem with DMA being turned off with the patched kernel. However, we did reproduce a similar problem with DMA timeouts, and upon further analysis the DMA completes in the following manner: Linux follows all protocols with respect to the spec and all the data is transferred, but the drive thinks there is more data to come and never generates an interrupt to complete the DMA. We've had a tough time reproducing this problem, so I did not want to confuse the matter by adding another variable into the picture until I could reproduce it consistently. Right now the data points to a drive vendor problem, but it is still under investigation. Still in our court.

Hello! Got the same trouble on two different machines. Only one thing is common: the hard drive is an IBM IC35L080AVVA07-0. Machine one is a cheap server replacement with an Intel PIIX4 Ultra 100 chipset and kernel 2.4.22-0.18mdk; the DMA timeout happened once, after running its cron jobs at 4:22 am. Machine two is my desktop, SiS 5513 Ultra 133 chipset (running at udma4, lowered from udma5, but that did not help), kernel 2.4.21-0.13mdk. The timeout happens every ~2 days, but sometimes 3 times a day, mostly while clicking on a link in the Konqueror web browser!? (Currently 6.3M cache, 416 files.) The machine freezes for some seconds, but I can always resume my work. No filesystem corruption has happened yet. While googling I found another report with the same hard drive here: http://www.linuxquestions.org/questions/showthread.php?threadid=61988

There is an obscure case where the timeout can occur from a race rather than the drive having a sulk (e.g. error correcting). It's fixed in 2.4.24.

Alan, wasn't it fixed in 2.4.25-pre4, not 2.4.24? Or were there two IDE races fixed recently? (Just trying to make sure I understand what's going on.)

Hello... again! It happened again: kernel 2.4.22, VT82C686 chipset and an IBM notebook HD (IC25N020ATDA04-0). Suddenly the music went off; the Internet was still running, but no more logins. I'm not sure if this 70% crash is related to the DMA issue. I opened the box for a check and found that the battery on the mainboard had exploded some weeks before. (Uptime ~70 days... without DMA errors.)

And it happened again, on my desktop box (see my previous comment), but this time running a 2.6.3! The text from the log differs a bit. "Nothing" happened, no hung system; I did not even reboot to write this note...

```
hda: dma_timer_expiry: dma status == 0x20
hda: DMA timeout retry
hda: timeout waiting for DMA
hda: status timeout: status=0xd0 { Busy }
hda: drive not ready for command
ide0: reset: success
```

So I diffed 2.4.22 and 2.4.25. Hm, some work done around drivers/ide, but ide-dma.c is untouched. Well, I will install and watch.

Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at http://bugzilla.fedora.us/