Bug 428468 - IBM 8832 megaide in RAID mode causes data corruption
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: All Linux
Priority: high Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Alan Cox
QA Contact: Martin Jenner
Keywords: Regression
Duplicates: 427687
Reported: 2008-01-11 15:25 EST by Bryn M. Reeves
Modified: 2010-10-22 17:41 EDT
Doc Type: Bug Fix
Last Closed: 2008-05-14 07:34:50 EDT

Attachments
dmesg showing IDE errors on 2.6.18-53.1.4.el5xen (14.83 KB, text/plain)
2008-01-11 15:29 EST, Bryn M. Reeves
lspci output for 2.6.18-53.1.4 with oem_setup patch removed and DMA workarounds in place (16.36 KB, text/plain)
2008-01-12 09:15 EST, Ian McLeod
lspci output - RHEL5 GA rescue environment (16.36 KB, text/plain)
2008-01-13 16:32 EST, Ian McLeod
lspci output - RHEL5 U1 rescue environment (16.33 KB, text/plain)
2008-01-13 16:33 EST, Ian McLeod
lspci output - Fedora 8 rescue environment (16.34 KB, text/plain)
2008-01-13 16:33 EST, Ian McLeod

Description Bryn M. Reeves 2008-01-11 15:25:53 EST
Description of problem:
The 8832 blades ship with a serverworks/broadcom integrated RAID/IDE controller
that is supported by the dmraid software RAID tools.

With RAID support disabled the device is correctly configured by the firmware
and works as a standard IDE/ATA interface. Enabling the RAID support in the
system BIOS allows dmraid to detect & configure the arrays but the system begins
to throw IDE errors and bus resets shortly after boot, preventing a successful
installation.

This behavior was originally reported against RHEL5-GA RC builds as bug 222653
and a workaround was found that proved effective (manually forcing DMA use via
hdparm). In the course of that bug the following patch was proposed to resolve this:

--- drivers/ide/pci/serverworks.c~	2007-05-16 13:10:16.428324088 +0100
+++ drivers/ide/pci/serverworks.c	2007-05-16 13:10:16.428324088 +0100
@@ -158,6 +158,12 @@
 	pci_read_config_word(dev, 0x4A, &csb5_pio);
 	pci_read_config_byte(dev, 0x54, &ultra_enable);
 
+	/* If we are in RAID mode (eg AMI MegaIDE) then we can't it
+	   turns out trust the firmware configuration */
+
+	if ((dev->class >> 8) != PCI_CLASS_STORAGE_IDE)
+		goto oem_setup_failed;
+
 	/* Per Specified Design by OEM, and ASIC Architect */
 	if ((dev->device == PCI_DEVICE_ID_SERVERWORKS_CSB6IDE) ||
 	    (dev->device == PCI_DEVICE_ID_SERVERWORKS_CSB6IDE2)) {
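
For readers unfamiliar with the class test in the patch, here is a minimal sketch of what it evaluates to for this controller. It assumes dev->class holds the usual 24-bit value (base class, subclass, prog-if) and uses the class codes shown by lspci in comment 2 (0101 with prog-if 8a in IDE mode, 0104 with prog-if 8f in RAID mode). The helper name skips_firmware_setup is hypothetical and is not part of the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Class codes as defined in the kernel's include/linux/pci_ids.h. */
#define PCI_CLASS_STORAGE_IDE  0x0101
#define PCI_CLASS_STORAGE_RAID 0x0104

/*
 * Hypothetical helper mirroring the condition added by the patch.
 * pci_class is the 24-bit value (base class << 16 | subclass << 8 | prog-if).
 * Returns true when the patched code would jump to oem_setup_failed.
 */
static bool skips_firmware_setup(uint32_t pci_class)
{
    return (pci_class >> 8) != PCI_CLASS_STORAGE_IDE;
}
```

With the values from this blade, skips_firmware_setup(0x01018a) is false (RAID disabled, firmware setup trusted) and skips_firmware_setup(0x01048f) is true (RAID enabled, firmware setup skipped).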

Testing with the RHEL5.1 GA kernel, and subsequently with 5.0.z kernels shows
that the fix does not work and also may be preventing the workaround from being
effective.

Version-Release number of selected component (if applicable):

Summing up the testing to date:

version             works?    hdparm workaround?
2.6.18-8.el5        no        yes
2.6.18-8.1.8.el5    no        no
2.6.18-53.el5       no        no

How reproducible:
100%

Steps to Reproduce:
1. Configure a RAID device in the 8832 BIOS - In our case we selected the
automatic RAID1 option
2. Init and rebuild the array in the BIOS
3. Attempt to install RHEL5 beta2

Actual results:
With 2.6.18-8.el5, IDE errors & resets are seen, but forcing DMA mode
effectively works around the problem:

    hdparm -d 1 /dev/hda /dev/hdc

With later RHEL5 kernels, this no longer works.

Expected results:
No IDE errors, device works. This is currently a regression from the GA
situation as there is no longer an effective workaround.

Additional info:
Comment 1 Bryn M. Reeves 2008-01-11 15:29:21 EST
Created attachment 291417
dmesg showing IDE errors on 2.6.18-53.1.4.el5xen
Comment 2 Bryn M. Reeves 2008-01-11 15:30:18 EST
lspci output for the device in RAID/non-RAID mode copied over from bug 222653

RAID disabled:

root@elm3a195:~# lspci -s 0:f.1 -vxxxn
00:0f.1 0101: 1166:0213 (rev b0) (prog-if 8a)
       Subsystem: 1166:0212
       Flags: bus master, medium devsel, latency 64
       I/O ports at <ignored>
       I/O ports at <ignored>
       I/O ports at <ignored>
       I/O ports at <ignored>
       I/O ports at 0700 [size=16]
00: 66 11 13 02 55 01 00 02 b0 8a 01 01 08 40 80 00
10: f1 01 00 00 f5 03 00 00 71 01 00 00 75 03 00 00
20: 01 07 00 00 00 00 00 00 00 00 00 00 66 11 12 02
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
40: 20 20 99 99 20 20 ff ff 00 00 44 00 00 00 00 00
50: 00 00 00 00 03 00 55 00 0f 04 03 00 00 00 00 00
60: 00 00 00 00 01 08 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 6c 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

RAID enabled:

root@elm3a195:~# lspci -s 0:f.1 -vxxxn
00:0f.1 0104: 1166:0213 (rev b0) (prog-if 8f)
       Subsystem: 1014:0213
       Flags: bus master, medium devsel, latency 64, IRQ 11
       I/O ports at 0730 [size=8]
       I/O ports at 0724 [size=4]
       I/O ports at 0728 [size=8]
       I/O ports at 0720 [size=4]
       I/O ports at 0710 [size=16]
       Capabilities: [b0] Power Management version 2
00: 66 11 13 02 55 01 10 02 b0 8f 04 01 08 40 80 00
10: 31 07 00 00 25 07 00 00 29 07 00 00 21 07 00 00
20: 11 07 00 00 00 00 00 00 00 00 00 00 14 10 13 02
30: 00 00 00 00 b0 00 00 00 00 00 00 00 0b 02 00 00
40: 99 20 99 99 ff 20 ff ff 00 00 04 00 00 00 00 00
50: 00 00 00 00 01 00 05 00 0f 04 03 00 00 00 00 00
60: 00 00 00 00 01 08 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 6c 00 00 00 00 00 00 00 00
b0: 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
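
The class and prog-if values in the two dumps can be cross-checked against the raw bytes. The sketch below parses one hex row of the lspci dump and rebuilds the 24-bit class value from the standard PCI header offsets (prog-if at 0x09, subclass at 0x0a, base class at 0x0b). The function names are hypothetical, and the parser only handles the row format shown above.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse one hex row of an lspci dump, e.g. "00: 66 11 13 02 ...",
 * into buf. Returns the number of bytes parsed, or -1 if the
 * "offset:" prefix is missing.
 */
static int parse_dump_row(const char *row, uint8_t *buf, int max)
{
    char *end;
    int n = 0;

    (void)strtoul(row, &end, 16);      /* skip the row offset */
    if (*end != ':')
        return -1;
    row = end + 1;
    while (n < max) {
        unsigned long v = strtoul(row, &end, 16);
        if (end == row)                /* no more hex digits on this row */
            break;
        buf[n++] = (uint8_t)v;
        row = end;
    }
    return n;
}

/* Rebuild base class, subclass and prog-if from the first header row. */
static uint32_t class_from_header(const uint8_t *cfg)
{
    return ((uint32_t)cfg[0x0b] << 16) | ((uint32_t)cfg[0x0a] << 8) | cfg[0x09];
}
```

Feeding it the "00:" row of each dump gives 0x01018a with RAID disabled and 0x01048f with RAID enabled, which matches the 0101/8a and 0104/8f values lspci prints.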
Comment 3 Bryn M. Reeves 2008-01-11 15:39:06 EST
Initial testing suggested that forcing a specific DMA mode (udma4 via "hdparm
-d1 -X 68") was an effective workaround for the later kernels. However, although
the system remains usable for longer, the kernel eventually disables DMA and the
same errors are logged.
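
As a side note on where the 68 in that command comes from: ATA transfer-mode numbers use a base of 8 for PIO, 32 for multiword DMA and 64 for UDMA, with the mode number added on top. A minimal sketch follows (the enum and function names are made up for illustration):

```c
/* Base offsets of the ATA transfer-mode numbers accepted by hdparm -X. */
enum { XFER_BASE_PIO = 8, XFER_BASE_MWDMA = 32, XFER_BASE_UDMA = 64 };

/* Value to pass to hdparm -X for a given UDMA mode (0..6). */
static int hdparm_x_for_udma(int udma_mode)
{
    return XFER_BASE_UDMA + udma_mode;
}
```

So udma4 is 64 + 4 = 68, and udma5 would be 69.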

Ian tested removing the conditional from Alan's patch in bug 222653 (forcing the
tuning function to run through "oem_setup_failed") but this also does not stop 
these errors.  The drives come up with DMA disabled and start producing errors.

Ian's now building a kernel based on 2.6.18-8.1.8.el5 but with the original
patch backed out.
Comment 7 Alan Cox 2008-01-11 19:29:14 EST
Actually, if FC8 works, send an lspci -vvxxx of that as well, just in case it
provides any clues.
Comment 8 Issue Tracker 2008-01-12 08:46:21 EST
Further testing update:  

Recompiled 2.6.18-53.1.4 with the serverworks patch backed out (Patch
21532 which is from BZ 222653), then booted with our forced DMA via dmraid
workaround in place.  The system has remained stable without disk errors
for several hours and through two full recompiles of this kernel RPM on
the local RAIDed disk.

I will gather the lspci data and post.


This event sent from IssueTracker by imcleod 
 issue 145725
Comment 10 Ian McLeod 2008-01-12 09:15:19 EST
Created attachment 291464
lspci output for 2.6.18-53.1.4 with oem_setup patch removed and DMA workarounds in place
Comment 12 Ian McLeod 2008-01-13 16:32:36 EST
Created attachment 291516
lspci output - RHEL5 GA rescue environment
Comment 13 Ian McLeod 2008-01-13 16:33:09 EST
Created attachment 291517
lspci output - RHEL5 U1 rescue environment
Comment 14 Ian McLeod 2008-01-13 16:33:53 EST
Created attachment 291518
lspci output - Fedora 8 rescue environment
Comment 15 Ian McLeod 2008-01-13 16:35:21 EST
Above rescue environment lspci output is using unmodified kernels and no forced 
DMA settings via hdparm.
Comment 16 Alan Cox 2008-01-14 05:33:41 EST
Thanks. And an lspci with the hacked kernel plus hdparm force to go with it
finally. 

There are some setting differences; I'm analysing them now to see if they are
relevant (at least for the documented registers).

It would be useful to know if you can repeat this fault on other 8832 blades, and
if there are any firmware/chip rev diffs between the one Darrick reported fixed
and the one you have.
Comment 17 Alan Cox 2008-01-14 05:48:32 EST
Changes:


0x45/0x47 0x00 -> 0x20    MWDMA command width timing set for MWDMA 2 v not set

Not relevant to UDMA disk devices.

0x54-0x57

Fedora 8: 0x05000555   RHEL + hdparm 0x05000505, RHEL 5GA 0x05000505, RHEL5U1
not programmed - using PIO ??

0x56/7 -> UDMA timing register

RHEL5 GA UDMA5. UDMA 5 (Master, Master)
Fedora Adds UDMA 5 (Slave channel 0)

The RHEL5 U1 data makes no sense, it should have been loaded if UDMA is in use.
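
To make the register reading above easier to repeat, here is a rough sketch of a decoder. It is an assumption based on the analysis in this comment and on how the serverworks driver appears to program the chip, not a statement of the documented register layout. It assumes bit n of 0x54 enables UDMA for drive n (0 = channel 0 master, 1 = channel 0 slave, 2 = channel 1 master, 3 = channel 1 slave), and that 0x56 and 0x57 hold the UDMA mode for channels 0 and 1, with the master in the low nibble and the slave in the high nibble. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Assumed layout (see the note above; not taken from a datasheet):
 *   reg54 bit n   : UDMA enable for drive n
 *   reg56 / reg57 : UDMA mode for channel 0 / 1, master in the low
 *                   nibble, slave in the high nibble
 * Returns the UDMA mode number, or -1 if UDMA is not enabled for
 * that drive.
 */
static int udma_mode_for_drive(uint8_t reg54, uint8_t reg56,
                               uint8_t reg57, int drive)
{
    uint8_t timing = (drive & 2) ? reg57 : reg56;

    if (!(reg54 & (1u << drive)))
        return -1;
    return (timing >> (4 * (drive & 1))) & 0x0f;
}
```

Applied to the bytes at 0x54 and 0x56 in the comment 2 dumps (03 and 55 with RAID disabled, 01 and 05 with RAID enabled), this reads as UDMA 5 on both channel 0 drives in IDE mode and UDMA 5 on the channel 0 master only in RAID mode.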

Comment 18 Ed Pollard 2008-01-14 12:27:08 EST
*** Bug 427687 has been marked as a duplicate of this bug. ***
Comment 19 Darrick Wong 2008-01-14 14:52:41 EST
Hm... the 8832 that I tested on has BIOS firmware level BSE126AUS-1.12.  In any
case, IBM doesn't appear to support RHEL5 on that machine. (source:
http://www-03.ibm.com/servers/eserver/serverproven/compat/us/nos/redchate.html )
Comment 20 Ian McLeod 2008-01-15 16:34:49 EST
Regarding Comment 17:  

I have now triple checked the lspci outputs with a focus on the 0x54-0x57
region.    The results remain the same. 

With GA, and U1 minus the oem_setup patch, the values are 0x05000505.  

With unmodified U1, the values are 0x00000000.

Is there any additional data I can provide or test cases to try?
Comment 21 Alan Cox 2008-01-15 17:00:08 EST
I think I have the needed data for the moment. I'll be reviewing the serverworks
driver code tomorrow I hope. Don't have hardware but I'm sure we have some
serverworks in the building and any bug is probably findable by code inspection 
Comment 22 Alan Cox 2008-01-16 12:18:10 EST
Basically there is no way I can see that the U1 results can occur unless the
device initialisation is occurring for PIO mode, in which case the settings are
correct. Your trace of U1 (comment #1) shows a PIO startup, which correctly
explains the PIO setting selection, as the BIOS is indicating the device should
honour BIOS PIO/DMA enables.

Your trace however later shows a DMA timeout, and I'm not clear if this trace
includes you using hdparm to enable DMA?
Comment 23 Alan Cox 2008-05-14 07:34:50 EDT
Closing as this seems to have gone quiet
