Bug 66054 - Turning DMA on the Tyan 2518 crashes box
Summary: Turning DMA on the Tyan 2518 crashes box
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2002-06-05 00:35 UTC by Samuel Flory
Modified: 2007-04-18 16:42 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2003-06-08 16:40:14 UTC
Embargoed:


Attachments
Proposed patch to fix the problem (637 bytes, patch)
2002-06-12 10:47 UTC, Martin Wilck
Linux kernel 2.4.18 patch to use MWDMA mode 2 with ServerWorks OSB4 (1.44 KB, patch)
2002-10-30 17:41 UTC, Julian Blake

Description Samuel Flory 2002-06-05 00:35:58 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.9) Gecko/20020513

Description of problem:
  Turning on DMA on the Tyan 2518 crashes the box under Cerberus.  The following
message is seen:
Kernel error detected from system logs:
>> Jun  4 19:46:50 localhost kernel: hda: timeout waiting for DMA
>> Jun  4 19:46:50 localhost kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14
>> Jun  4 19:46:50 localhost kernel: hda: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Tue Jun  4 19:46:52 EDT 2002: SYSLOG FAILED: on 1/0 after 13s
1 fail 0 succeed 1 count

This system burns in just fine under Cerberus with RH 7.2.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Install 7.3.
2. Turn on DMA via `hdparm -d 1 /dev/hda`.
3. Do lots of disk I/O.

Additional info:

Pretty much the same as:
http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=64639

PS-This appears to happen with every server work chipset.

Comment 1 Samuel Flory 2002-06-05 00:37:32 UTC
Uhh make that serverworks chipset.

Comment 2 Samuel Flory 2002-06-05 00:46:16 UTC
  The same issue occurs on the Tyan 2515.

Comment 3 Arjan van de Ven 2002-06-05 08:11:45 UTC
Is this an OSB4, CSB5 or CSB6?

Comment 4 Arjan van de Ven 2002-06-05 08:12:40 UTC
Uhm, in case that wasn't clear: those are chipset names for ServerWorks. If you
type lspci you'll get a list of your chips, and at least one of them should have
a name with OSB4, CSB5 or CSB6 in it.

Comment 5 Samuel Flory 2002-06-06 19:43:45 UTC
  The 2518 I've got is an OSB4:
SvrWks OSB4: IDE controller on PCI bus 00 dev 79
SvrWks OSB4: chipset revision 0



Comment 6 Arjan van de Ven 2002-06-06 21:14:16 UTC
On an OSB4 you *really* don't want to enable DMA. *really*.
There's a flaw in the chipset that, every once in a while, randomly corrupts data ;(

Comment 8 Scott Weikart 2002-06-11 07:03:01 UTC
Oops, I forgot to mention that the problem only seems to happen with Seagate
drives.

-scott


Comment 9 Martin Wilck 2002-06-11 16:01:08 UTC
Then why isn't the panic triggered only when the DMA'ing device is a Seagate
hard disk?

The condition that triggers the panic (DMA engine still active when the DMA
interrupt arrives) is a valid condition if, e.g., a device transfers less data
than expected (as happens when a disk I/O error occurs). We can reproduce the
problem reliably with a CD-ROM drive. It feels rather odd that the system stalls
in an environment where there is nothing but a CD-ROM drive on the IDE bus and a
normal error condition (an invalid block on the CD) occurs. It would be highly
desirable to stall the machine only if such a valid condition can be excluded,
or only if it occurs with a Seagate hard disk.
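
To make the distinction concrete, here is a minimal user-space sketch of how
the "engine still active" condition could be told apart from an ordinary short
transfer. It assumes the standard bus-master IDE status register layout (bit 0
= active) and the ATA status register ERR bit (bit 0). The names
classify_dma_end, BM_STATUS_ACTIVE and ATA_STATUS_ERR are invented for this
illustration and are not taken from the kernel driver or from any patch in
this report.

```c
#include <stdint.h>

/* Assumed register layouts (standard bus-master IDE / ATA status bits). */
#define BM_STATUS_ACTIVE 0x01u /* bus-master status: DMA engine still running */
#define ATA_STATUS_ERR   0x01u /* ATA status: drive reported an error */

/* Hypothetical verdicts for the end of a DMA transfer. */
enum dma_end_verdict {
    DMA_END_OK,             /* engine idle, nothing unusual */
    DMA_END_SHORT_TRANSFER, /* engine active, but the drive reported an error */
    DMA_END_SUSPECT         /* engine active with no explanation from the drive */
};

/*
 * Meant to be called once the DMA interrupt has arrived.
 * bm_status:  value read from the bus-master status register
 * ata_status: value read from the drive's status register
 */
static inline enum dma_end_verdict classify_dma_end(uint8_t bm_status,
                                                    uint8_t ata_status)
{
    if (!(bm_status & BM_STATUS_ACTIVE))
        return DMA_END_OK;
    if (ata_status & ATA_STATUS_ERR)
        return DMA_END_SHORT_TRANSFER;
    return DMA_END_SUSPECT;
}
```

Only the last verdict would be a candidate for stalling the machine; a drive
that reported an error, such as a CD-ROM hitting an invalid block, falls into
the middle case.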

Btw: in a comment for bug 66143 I advocated a patch for the 7.2 kernel that
enables DMA on the Serverworks CSB5 by default, as with 2.4.18-4.
I withdraw my recommendation!


Comment 10 Scott Weikart 2002-06-12 00:21:40 UTC
The driver code in drivers/ide/serverworks.c in both linux-2.4.18-3 and
linux-2.4.18-4 tries to notice when the OSB4 bug occurs, and prints out
messages (and I guess hangs).

Here's the relevant code from linux-2.4.18-4 (with indentation removed).

Sam, did you see this error message?

-scott
===============
printk(KERN_CRIT "Serverworks OSB4 in impossible state.\n");
printk(KERN_CRIT "Disable UDMA or if you are using Seagate then try switching disk types\n");
printk(KERN_CRIT "on this controller. Please report this event to osb4-bug.tm\n");
#if 0
/* Panic might sys_sync -> death by corrupt disk */
panic("OSB4: continuing might cause disk corruption.\n");
#else
printk(KERN_CRIT "OSB4: continuing might cause disk corruption.\n");
while(1)
        cpu_relax();
#endif

Comment 11 Samuel Flory 2002-06-12 00:40:48 UTC
  No, I did not see the above message.  I'll try with a different drive.

Comment 12 Martin Wilck 2002-06-12 10:47:32 UTC
Created attachment 60644
Proposed patch to fix the problem

Comment 13 Martin Wilck 2002-06-12 10:49:47 UTC
I sent the above patch to LKML and Alan Cox, too.
It must be tested whether it handles those machines correctly that expose
the "4-byte skew" bug - I can't do that.


Comment 14 Martin Wilck 2002-06-12 10:59:33 UTC
Oops - looking more closely at the original bug report, I assume my patch will
not fix that one.

It should fix the problem that Scott was talking about, though
(by narrowing the range of cases in which the kernel deliberately panics).



Comment 15 Julian Blake 2002-10-30 17:41:08 UTC
Created attachment 82724
Linux kernel 2.4.18 patch to use MWDMA mode 2 with ServerWorks OSB4

Comment 16 Julian Blake 2002-10-30 18:00:54 UTC
The attached patch file 2.4.18-svwks-osb4-mwdma.patch modifies the Linux kernel
2.4.18's ServerWorks IDE driver (drivers/ide/serverworks.c) to use MWDMA mode 2
when running on a computer with an OSB4 chipset.  It has been tried at CERN on
a computer with a Reliance CNB20LE motherboard and a Western Digital Caviar disk:

SvrWks OSB4: IDE controller on PCI bus 00 dev 79
SvrWks OSB4: chipset revision 0
SvrWks OSB4: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0x5440-0x5447, BIOS settings: hda:DMA, hdb:DMA
    ide1: BM-DMA at 0x5448-0x544f, BIOS settings: hdc:DMA, hdd:DMA
hda: WDC WD200BB-00CLB0, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: 39102336 sectors (20020 MB) w/2048KiB Cache, CHS=2434/255/63, (U)DMA

On this computer PIO mode (the default in 2.4.18) gives about 4 MB/s, 
MWDMA2 about 14 MB/s, and UDMA 2 fails catastrophically (Linux hangs after
warning that it is unsafe to proceed because of the danger of disk data corruption).

Comment 17 Need Real Name 2002-11-06 23:00:15 UTC
I have encountered the Serverworks OSB4 bug also, using a Dell PowerEdge 2650
with RH 7.3. The system was purchased with an IDE CD-ROM which always worked
flawlessly. I recently tried to replace it with an IDE CD-RW drive,
but as soon as I attempted to either read or write a CD the system hung with
the Serverworks OSB4 error. I tried a different IDE CD-RW which did exactly
the same thing. Both of the IDE CD-RW drives work just fine in my Dell
Inspiron 7500 laptop.

arjanv, if RedHat is working on a fix for this, please let me know ASAP when
it becomes available. Thanks.

For anyone else who might be reading this, can you confirm that this bug
is tickled only by DMA which involves an IDE device? We are about to install
this machine into a critical environment where there will be frequent DMA
involving a PCI card. If that crashes the system we are totally screwed.

Thanks for any help that anyone can provide!

Comment 18 Martin Wilck 2002-11-07 09:40:44 UTC
This relates only to IDE DMA.

My patch above should be fine, and I understand that Alan Cox is doing something
similar in his kernel tree. I also heard that Andre Hedrick is working towards
a real solution together with ServerWorks people.

The really dangerous situation arises only with old OSB4 chipsets (not CSB5/6).
Unfortunately the "workaround" deliberately hangs the machine in a large number
of cases where this isn't necessary, including cases where

- the chipset is not an OSB4,
- the device has reported an error,
- the device is a read-only device (e.g. a CD-ROM drive).
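
The three exclusions listed above can be written down as a single predicate.
The following is only a sketch under stated assumptions: the struct, the enums
and the function must_halt are hypothetical names made up for this example,
and this is not the content of the attached patch or of the actual driver.

```c
#include <stdbool.h>

/* Hypothetical classification of the IDE controller. */
enum chipset { CHIPSET_OSB4, CHIPSET_CSB5, CHIPSET_CSB6 };

/* Hypothetical classification of the attached device. */
enum media { MEDIA_DISK, MEDIA_CDROM };

/* Hypothetical snapshot of the state when the DMA interrupt arrives. */
struct dma_event {
    enum chipset chipset;
    enum media media;
    bool device_reported_error; /* drive flagged an error for this request */
    bool engine_still_active;   /* DMA engine had not stopped at interrupt */
};

/* Returns true only when none of the listed exclusions applies. */
static inline bool must_halt(const struct dma_event *ev)
{
    if (!ev->engine_still_active)
        return false; /* nothing suspicious happened */
    if (ev->chipset != CHIPSET_OSB4)
        return false; /* the chipset is not an OSB4 */
    if (ev->device_reported_error)
        return false; /* a short transfer is expected after an error */
    if (ev->media != MEDIA_DISK)
        return false; /* read-only media cannot be corrupted */
    return true;
}
```

With a predicate of this shape, a CD-ROM read error on a CSB5 board would be
handled as an ordinary I/O error instead of stopping the machine.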

Please contact Alan Cox and/or Andre Hedrick for definitive information on this
subject.

A search for "OSB4" in the linux-kernel archives will also reveal a lot of
information.


Comment 19 Martin Wilck 2003-02-12 08:29:04 UTC
All the latest kernels released by RedHat still expose this bug, as does 2.4.20
"vanilla". Obviously, the fixes have a hard time getting through. Only on the
latest 2.5.x kernels have I seen a different error condition.

Has RedHat forgotten about this one?
Do we really need to tell our customers to do without DMA (or to use mode 2,
which we haven't tested) on this chipset?

Comment 20 Arjan van de Ven 2003-02-12 09:32:59 UTC
It is solid advice to your customers. The OSB4 is a "cdrom attachment device"
more than something to connect disks to.

Comment 21 Martin Wilck 2003-02-12 10:11:42 UTC
Well, it appears that the fix in newer 2.5.x (and 2.4.x-ac) kernels triggers a
panic only if this condition occurs with a hard disk. This was part of my
proposed patch, and it is what we'd like to see in the RedHat kernel too.

It's a trivial fix and pretty obvious because data corruption obviously isn't an
issue on a CD-ROM. 

Moreover, the panic should only be triggered on the old OSB4, not on the newer
CSB5/CSB6 chipsets, which don't have the 4-byte shift problem. (That is what
bothers us most: our systems all have the CSB5/6, but we have to avoid DMA
because of a workaround that is only needed for the OSB4.)

You know, having to use the CDROM in PIO mode can also be a nuisance. Anyway, 
I'll try to get our testers to try mdma2.

Alan and Andre know more about this, anyway. Thanks for responding.


Comment 22 Alan Cox 2003-11-05 19:29:30 UTC
For AS 2.1 I think we can either backport the "no udma for disk"
change or you can add

    if(its-a-cdrom) dont-panic



Comment 23 Diego Calleja 2005-04-18 12:13:22 UTC
Hi, I recently bought a mainboard with the ServerWorks OSB4 chipset, and I'm
somewhat confused about what the real issue is and what the kernel (I'm looking
at drivers/ide/pci/serverworks.c in 2.6.12) should do. I googled and found this
bug; I hope it's a good place to report this.

This is my device:
0000:00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller


which should be able to do UDMA2, but Linux 2.6 doesn't enable it, only MWDMA2,
which is what the driver code is written to do:

svwks_ratemask():
        if (dev->device == PCI_DEVICE_ID_SERVERWORKS_OSB4IDE) {
                u32 reg = 0;
                if (isa_dev)
                        pci_read_config_dword(isa_dev, 0x64, &reg);

                /*
                 *      Don't enable UDMA on disk devices for the moment
                 */
                if(drive->media == ide_disk)
                        return 0;
                /* Check the OSB4 DMA33 enable bit */
                return ((reg & 0x00004000) == 0x00004000) ? 1 : 0;

svwks_tune_chipset():

        /* If we are about to put a disk into UDMA mode we screwed up.
           Our code assumes we never _ever_ do this on an OSB4 */

        if(dev->device == PCI_DEVICE_ID_SERVERWORKS_OSB4 &&
                drive->media == ide_disk && speed >= XFER_UDMA_0)
                        BUG();




The disk is a Maxtor 6Y060L0 (60 GB, 7200 rpm, UDMA 133 capable), BTW. The
problem is, if I do a hdparm -Xudma2 or anything beyond mdma2, the kernel spits
out errors which look like those from the start of this bug (perhaps they're
not exactly the same), I/O stops and I have to restart the box.

So, I have some questions: is the OSB4 really buggy, or does it depend on the
driver? I ask this because the code says "Our code assumes we never _ever_ do
this on an OSB4", and it's not clear whether there is a hardware bug or whether
it's a limitation of the current code. And if the hardware is buggy, shouldn't
the kernel warn about it, and not allow people to hang their boxes while trying
to enable udma2 with hdparm?
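
For readers following the svwks_ratemask() excerpt above, its OSB4 branch can
be modelled in user space as below. This is a sketch, not the driver: the name
osb4_udma_allowed, the enum and the macro are hypothetical, and the register
value is passed in as a plain integer instead of being read from PCI config
space at offset 0x64 as the excerpt does.

```c
#include <stdint.h>

/* Bit 14 of the config dword read at 0x64, the "DMA33 enable bit" per the excerpt. */
#define OSB4_DMA33_ENABLE 0x00004000u

/* Hypothetical stand-in for drive->media in the excerpt. */
enum media_type { MEDIA_TYPE_DISK, MEDIA_TYPE_OTHER };

/*
 * Mirrors the quoted OSB4 branch: 0 means no UDMA is offered,
 * 1 means UDMA may be offered.
 */
static inline int osb4_udma_allowed(enum media_type media, uint32_t reg64)
{
    /* Disks never get UDMA on an OSB4, regardless of the enable bit. */
    if (media == MEDIA_TYPE_DISK)
        return 0;
    /* Other devices get UDMA only if the DMA33 enable bit is set. */
    return (reg64 & OSB4_DMA33_ENABLE) ? 1 : 0;
}
```

Seen this way, the quoted code refuses UDMA for every disk on an OSB4 before it
even looks at the enable bit, which matches the MWDMA2-only behaviour described
above.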

Comment 24 Alan Cox 2005-04-18 12:50:20 UTC
It isn't a software limitation.

As to the hanging your box issue, you can do that a thousand ways with hdparm.
The manual page is pretty specific about knowing how to use it and the
functionality is root only.


