From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.9) Gecko/20020513

Description of problem:
Turning on DMA on the Tyan 2518 crashes the box under cerberus. The following message is seen:

Kernel error detected from system logs:
>> Jun 4 19:46:50 localhost kernel: hda: timeout waiting for DMA
>> Jun 4 19:46:50 localhost kernel: ide_dmaproc: chipset supported ide_dma_time out func only: 14
>> Jun 4 19:46:50 localhost kernel: hda: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Tue Jun 4 19:46:52 EDT 2002: SYSLOG FAILED: on 1/0 after 13s 1 fail 0 succeed 1 count

This system burns in just fine under cerberus under RH 7.2.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Install 7.3
2. Turn on DMA via `hdparm -d 1 /dev/hda`
3. Do lots of disk I/O

Additional info:
Pretty much the same as: http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=64639
PS: This appears to happen with every server work chipset.
Uhh make that serverworks chipset.
The same issue occurs on the Tyan 2515.
Is this an OSB4, CSB5 or CSB6?
Uhm, in case that wasn't clear: those are ServerWorks chipset names. If you type lspci you'll get a list of your chips, and at least one of them should have OSB4, CSB5 or CSB6 in its name.
The 2518 I've got is an OSB4:
SvrWks OSB4: IDE controller on PCI bus 00 dev 79
SvrWks OSB4: chipset revision 0
On an OSB4 you *really* don't want to enable DMA. *really*. There's a flaw that, every once in a while, randomly corrupts data in the chipset ;(
Alan Cox explains the problem: ServerWorks data corruption with specific chipsets and UDMA. The solution is to enable MWDMA instead of UDMA.
-scott
==============
http://www.apachelabs.org/lkml/200109.mbox/%3cE15nnXL-0007aB-00@the-village.bc.nu%3e
http://www.apachelabs.org/lkml/200110.mbox/%3CE15s9bI-0000Z7-00@the-village.bc.nu%3E
http://www.apachelabs.org/lkml/200203.mbox/%3CE16lW4t-0000rc-00@the-village.bc.nu%3E
http://www.apachelabs.org/lkml/200204.mbox/%3CE16sayv-00033A-00@the-village.bc.nu%3E
http://www.apachelabs.org/lkml/200203.mbox/%3CE16nUx8-0000w4-00@the-village.bc.nu%3E
http://www.apachelabs.org/lkml/200202.mbox/%3CE16feI5-0008WC-00@the-village.bc.nu%3E
http://www.apachelabs.org/lkml/200204.mbox/%3c20020428142415.A10747@ucw.cz%3e
Oops, I forgot to mention that the problem only seems to happen with Seagate drives. -scott
Then why is the panic not triggered only if the DMA'ing device is a Seagate hard disk? The condition that triggers the panic (DMA engine still active when the DMA interrupt arrives) is a valid condition if, e.g., a device transfers less data than expected (as when a disk I/O error occurs). We can reproduce the problem reliably with a CD-ROM drive. It feels sort of weird that the system stalls in an environment where there is nothing but a CD-ROM drive on the IDE bus and a normal error condition (invalid block on the CD) occurs. It would be highly desirable to stall the machine only if such a valid condition can be excluded, or only if it occurs with a Seagate hard disk. Btw: in a comment for bug 66143 I advocated a patch for the 7.2 kernel that enables DMA on the ServerWorks CSB5 by default, as with 2.4.18-4. I withdraw my recommendation!
The driver code in drivers/ide/serverworks.c in both linux-2.4.18-3 and linux-2.4.18-4 tries to notice when the OSB4 bug occurs, and prints out messages (and I guess hangs). Here's the relevant code from linux-2.4.18-4 (with indentation removed). Sam, did you see this error message?
-scott
===============
printk(KERN_CRIT "Serverworks OSB4 in impossible state.\n");
printk(KERN_CRIT "Disable UDMA or if you are using Seagate then try switching disk types\n");
printk(KERN_CRIT "on this controller. Please report this event to osb4-bug.tm\n");
#if 0
/* Panic might sys_sync -> death by corrupt disk */
panic("OSB4: continuing might cause disk corruption.\n");
#else
printk(KERN_CRIT "OSB4: continuing might cause disk corruption.\n");
while(1)
cpu_relax();
#endif
No I did not see the above message. I'll try with a different drive.
Created attachment 60644: Proposed patch to fix the problem
I sent the above patch to LKML and Alan Cox, too. It still needs to be tested on machines that exhibit the "4-byte skew" bug to see whether it handles them correctly; I can't do that myself.
Oops - looking more closely at the original bug report, I assume my patch will not fix that one. It should fix the problem that scott was talking about, though (by narrowing the range of cases in which the kernel deliberately panics).
Created attachment 82724: Linux kernel 2.4.18 patch to use MWDMA mode 2 with ServerWorks OSB4
The attached patch file 2.4.18-svwks-osb4-mwdma.patch modifies the Linux kernel 2.4.18's ServerWorks IDE driver (drivers/ide/serverworks.c) to use MWDMA mode 2 when running on a computer with an OSB4 chipset. It has been tried at CERN on a computer with a Reliance CNB20LE motherboard and a Western Digital Caviar disk:

SvrWks OSB4: IDE controller on PCI bus 00 dev 79
SvrWks OSB4: chipset revision 0
SvrWks OSB4: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0x5440-0x5447, BIOS settings: hda:DMA, hdb:DMA
ide1: BM-DMA at 0x5448-0x544f, BIOS settings: hdc:DMA, hdd:DMA
hda: WDC WD200BB-00CLB0, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: 39102336 sectors (20020 MB) w/2048KiB Cache, CHS=2434/255/63, (U)DMA

On this computer PIO mode (the default in 2.4.18) gives about 4 MB/s, MWDMA2 about 14 MB/s, and UDMA 2 fails catastrophically (Linux hangs after warning that it is unsafe to proceed because of the danger of disk data corruption).
I have encountered the Serverworks OSB4 bug also, using a Dell PowerEdge 2650 with RH 7.3. The system was purchased with an IDE CD-ROM which always worked flawlessly. I recently tried to replace it with an IDE CD-RW drive, but as soon as I attempted to either read or write a CD the system hung with the Serverworks OSB4 error. I tried a different IDE CD-RW which did exactly the same thing. Both of the IDE CD-RW drives work just fine in my Dell Inspiron 7500 laptop. arjanv, if RedHat is working on a fix for this, please let me know ASAP when it becomes available. Thanks. For anyone else who might be reading this, can you confirm that this bug is tickled only by DMA which involves an IDE device? We are about to install this machine into a critical environment where there will be frequent DMA involving a PCI card. If that crashes the system we are totally screwed. Thanks for any help that anyone can provide!
This relates only to IDE DMA. My patch above should be fine, and I understand that Alan Cox is doing something similar in his kernel tree. I also heard that Andre Hedrick is working towards a real solution together with ServerWorks people. The really dangerous situation arises only with old OSB4 chipsets (not CSB5/6). Unfortunately the "workaround" deliberately hangs the machine in a large number of cases where this isn't necessary, including cases where
- the chipset is not an OSB4,
- the device has reported an error,
- the device is a read-only device (e.g. CD).
Please contact Alan Cox and/or Andre Hedrick for definitive information on this subject. A search for "OSB4" in the linux-kernel archives will also reveal a lot of information.
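The narrower check argued for in this comment can be sketched as a small user-space model. This is not the driver code and not the attached patch: the enum, struct, and function names below are hypothetical, and the "engine still active at interrupt time" flag is simply taken from the description earlier in this thread. It only illustrates halting when none of the three excluded cases applies.

```c
#include <stdbool.h>

/* Hypothetical tags for illustration; not the kernel's identifiers. */
enum chipset { CHIP_OSB4, CHIP_CSB5, CHIP_CSB6 };
enum media { MEDIA_DISK, MEDIA_CDROM };

/* Assumed snapshot of the state when the DMA interrupt arrives. */
struct dma_event {
    enum chipset chip;
    enum media media;
    bool engine_still_active;   /* DMA engine busy at interrupt time */
    bool device_reported_error; /* drive flagged an error, so a short transfer is expected */
};

/*
 * Return true only when halting looks justified under the policy
 * described in the thread: the suspicious state was seen, on an OSB4,
 * with a writable disk, and the device did not itself report an error.
 */
static bool should_halt(const struct dma_event *ev)
{
    if (!ev->engine_still_active)
        return false;           /* nothing suspicious happened */
    if (ev->chip != CHIP_OSB4)
        return false;           /* CSB5/CSB6: not the old OSB4 */
    if (ev->device_reported_error)
        return false;           /* short transfer is explained by the error */
    if (ev->media != MEDIA_DISK)
        return false;           /* read-only media such as a CD */
    return true;
}
```

With this model, a CD-ROM read error on a CSB5 or an OSB4 never halts the machine, while an unexplained active engine on an OSB4 hard disk still does.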
All the latest kernels released by RedHat still expose this bug, as does 2.4.20 "vanilla". Obviously, the fixes have a hard time getting through. Only on the latest 2.5.x kernels have I seen a different error condition. Has RedHat forgotten about this one? Do we really need to tell our customers to do without DMA (or to use mode 2, which we didn't test) on this chipset?
It is solid advice to your customers. The OSB4 is a "cdrom attachment device" more than something to connect disks to.
Well, it appears that the fix in newer 2.5.x (and 2.4.x-ac) kernels triggers a panic only if this condition occurs with a hard disk. This was part of my proposed patch, and it is what we'd like to see in the RedHat kernel too. It's a trivial fix and pretty obvious, because data corruption isn't an issue on a CD-ROM. Moreover, the panic should only be triggered on the old OSB4, not on the newer CSB5/CSB6 chipsets, which don't have the 4-byte shift problem. (That is what bothers us most: our systems all have the CSB5/6, but we have to avoid DMA because of a workaround that is only needed for OSB4.) You know, having to use the CD-ROM in PIO mode can also be a nuisance. Anyway, I'll try to get our testers to try mdma2. Alan and Andre know more about this, anyway. Thanks for responding.
For AS 2.1 I think we can either backport the "no udma for disk" change or you can add if(its-a-cdrom) dont-panic
Hi, I recently bought a mainboard with the ServerWorks OSB4 chipset, and I'm somewhat confused about what the real issue is and what the kernel (I'm looking at drivers/ide/pci/serverworks.c in 2.6.12) should do. I googled and found this bug; I hope it's a good place to report this.

This is my device:

0000:00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller

which should be able to do UDMA2, but Linux 2.6 doesn't enable it, only MWDMA2, which is what it's expected to do:

svwks_ratemask():

	if (dev->device == PCI_DEVICE_ID_SERVERWORKS_OSB4IDE) {
		u32 reg = 0;
		if (isa_dev)
			pci_read_config_dword(isa_dev, 0x64, &reg);
		/*
		 * Don't enable UDMA on disk devices for the moment
		 */
		if(drive->media == ide_disk)
			return 0;
		/* Check the OSB4 DMA33 enable bit */
		return ((reg & 0x00004000) == 0x00004000) ? 1 : 0;

svwks_tune_chipset():

	/* If we are about to put a disk into UDMA mode we screwed up.
	   Our code assumes we never _ever_ do this on an OSB4 */
	if(dev->device == PCI_DEVICE_ID_SERVERWORKS_OSB4 &&
	   drive->media == ide_disk && speed >= XFER_UDMA_0)
		BUG();

The disk is a Maxtor 6Y060L0 (60 GB, 7200 rpm, UDMA 133 capable), BTW.

The problem is, if I do a hdparm -Xudma2 or anything beyond mdma2, the kernel spits out errors which look like those from the start of this bug (perhaps they're not exactly the same), I/O stops and I have to restart the box.

So, I have some questions: is the OSB4 really buggy, or does it depend on the driver? I ask because the code says "Our code assumes we never _ever_ do this on an OSB4", and it's not clear whether there is a hardware bug or whether it's a limitation of the current code. And if the hardware is buggy, shouldn't the kernel warn about it, and not allow people to hang their boxes while trying to enable UDMA2 with hdparm?
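For readers following the quoted svwks_ratemask() excerpt, its decision can be restated as a stand-alone user-space function. This is only a sketch of the logic as quoted above: the function name and the OSB4_DMA33_ENABLE macro are made up for illustration, and reading the return value 1 as "UDMA may be offered" (0 meaning MWDMA at most) is my interpretation of the excerpt, not something the driver source is quoted as saying.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit tested in the quoted excerpt (config dword at offset 0x64). */
#define OSB4_DMA33_ENABLE 0x00004000u

/*
 * Model of the OSB4 branch of the quoted excerpt.
 * is_disk: true for a hard disk, false for e.g. a CD-ROM drive.
 * reg64:   value read from config register 0x64, or 0 if it was not read.
 * Returns 0 when UDMA is never offered, 1 when the DMA33 bit allows it.
 */
static int osb4_udma_allowed(bool is_disk, uint32_t reg64)
{
    if (is_disk)
        return 0;   /* disks are deliberately excluded from UDMA */
    return (reg64 & OSB4_DMA33_ENABLE) ? 1 : 0;
}
```

Seen this way, the exclusion of disks is a deliberate choice in the driver rather than a probing failure, which is why a Maxtor disk on this controller tops out at MWDMA2.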
It isn't a software limitation. As to the hanging-your-box issue, you can do that a thousand ways with hdparm. The manual page is pretty specific about knowing how to use it, and the functionality is root only.