Bug 428468
| Summary: | IBM 8832 megaide in RAID mode causes data corruption | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Bryn M. Reeves <bmr> |
| Component: | kernel | Assignee: | Alan Cox <alan> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Martin Jenner <mjenner> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 5.1 | CC: | djwong, epollard, imcleod, konradr, tao |
| Target Milestone: | rc | Keywords: | Regression |
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2008-05-14 11:34:50 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Attachments: | |||
Description
Bryn M. Reeves
2008-01-11 20:25:53 UTC
Created attachment 291417 [details]
dmesg showing IDE errors on 2.6.18-53.1.4.el5xen
lspci output for the device in RAID/non-RAID mode, copied over from bug 222653.

RAID disabled:

root@elm3a195:~# lspci -s 0:f.1 -vxxxn
00:0f.1 0101: 1166:0213 (rev b0) (prog-if 8a)
        Subsystem: 1166:0212
        Flags: bus master, medium devsel, latency 64
        I/O ports at <ignored>
        I/O ports at <ignored>
        I/O ports at <ignored>
        I/O ports at <ignored>
        I/O ports at 0700 [size=16]
00: 66 11 13 02 55 01 00 02 b0 8a 01 01 08 40 80 00
10: f1 01 00 00 f5 03 00 00 71 01 00 00 75 03 00 00
20: 01 07 00 00 00 00 00 00 00 00 00 00 66 11 12 02
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
40: 20 20 99 99 20 20 ff ff 00 00 44 00 00 00 00 00
50: 00 00 00 00 03 00 55 00 0f 04 03 00 00 00 00 00
60: 00 00 00 00 01 08 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 6c 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

RAID enabled:

root@elm3a195:~# lspci -s 0:f.1 -vxxxn
00:0f.1 0104: 1166:0213 (rev b0) (prog-if 8f)
        Subsystem: 1014:0213
        Flags: bus master, medium devsel, latency 64, IRQ 11
        I/O ports at 0730 [size=8]
        I/O ports at 0724 [size=4]
        I/O ports at 0728 [size=8]
        I/O ports at 0720 [size=4]
        I/O ports at 0710 [size=16]
        Capabilities: [b0] Power Management version 2
00: 66 11 13 02 55 01 10 02 b0 8f 04 01 08 40 80 00
10: 31 07 00 00 25 07 00 00 29 07 00 00 21 07 00 00
20: 11 07 00 00 00 00 00 00 00 00 00 00 14 10 13 02
30: 00 00 00 00 b0 00 00 00 00 00 00 00 0b 02 00 00
40: 99 20 99 99 ff 20 ff ff 00 00 04 00 00 00 00 00
50: 00 00 00 00 01 00 05 00 0f 04 03 00 00 00 00 00
60: 00 00 00 00 01 08 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 6c 00 00 00 00 00 00 00 00
b0: 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Initial testing suggested that forcing a specific DMA mode (udma4 via "hdparm -d1 -X 68") was an effective workaround for the later kernels. Although the system remains usable for longer, eventually the kernel disables DMA and the same errors are logged.

Ian tested removing the conditional from Alan's patch in bug 222653 (forcing the tuning function to run through "oem_setup_failed"), but this also does not stop these errors. The drives come up with DMA disabled and start producing errors.

Ian's now building a kernel based on 2.6.18-8.1.8.el5 but with the original patch backed out.

Actually, if FC8 works, send an lspci -vvxxx of that as well, just in case it provides any clues.

Further testing update:

Recompiled 2.6.18-53.1.4 with the serverworks patch backed out (Patch 21532, which is from BZ 222653), then booted with our forced DMA via dmraid workaround in place. The system has remained stable without disk errors for several hours and through two full recompiles of this kernel RPM on the local RAIDed disk.

I will gather the lspci data and post.

This event sent from IssueTracker by imcleod
issue 145725

Created attachment 291464 [details]
lspci output for 2.6.18-53.1.4 with oem_setup patch removed and DMA workarounds in place
Created attachment 291516 [details]
lspci output - RHEL5 GA rescue environment
Created attachment 291517 [details]
lspci output - RHEL5 U1 rescue environment
Created attachment 291518 [details]
lspci output - Fedora 8 rescue environment
Above rescue environment lspci output is using unmodified kernels and no forced DMA settings via hdparm.

Thanks. And an lspci with the hacked kernel plus hdparm force to go with it, finally.

There are some setting differences; I'm analysing them now to see if they are relevant (at least for the documented registers). It would be useful to know if you can repeat this fault on other 8832 blades, and if there are any firmware/chip rev diffs between the one Darrick reported fixed and the one you have.

Changes:

0x45/0x47 0x00 -> 0x20
MWDMA command width timing set for MWDMA 2 vs. not set
Not relevant to UDMA disk devices.

0x54-0x57
Fedora 8: 0x05000555
RHEL + hdparm: 0x05000505
RHEL 5 GA: 0x05000505
RHEL5 U1: not programmed - using PIO ??

0x56/7 -> UDMA timing register
RHEL5 GA UDMA5. UDMA 5 (Master, Master)
Fedora adds UDMA 5 (Slave channel 0)

The RHEL5 U1 data makes no sense; it should have been loaded if UDMA is in use.

*** Bug 427687 has been marked as a duplicate of this bug. ***

Hm... the 8832 that I tested on has BIOS firmware level BSE126AUS-1.12. In any case, IBM doesn't appear to support RHEL5 on that machine.
(source: http://www-03.ibm.com/servers/eserver/serverproven/compat/us/nos/redchate.html )

Regarding Comment 17: I have now triple checked the lspci outputs with a focus on the 0x54-0x57 region. The results remain the same. With GA, and U1 minus the oem_setup patch, the values are 0x05000505. With unmodified U1, the values are 0x00000000.

Is there any additional data I can provide or test cases to try?

I think I have the needed data for the moment. I'll be reviewing the serverworks driver code tomorrow, I hope. Don't have hardware, but I'm sure we have some serverworks in the building and any bug is probably findable by code inspection.

Basically there is no way I can see that the U1 results can occur unless the device initialisation is occurring for PIO mode. In which case the settings are correct.

Your trace of U1 (comment #1) shows a PIO startup, which correctly explains the PIO setting selection, as the BIOS is indicating the device should honour BIOS PIO/DMA enables. Your trace however later shows a DMA timeout, and I'm not clear if this trace includes you using hdparm to enable DMA?

Closing as this seems to have gone quiet.
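
For anyone repeating the register comparison on another blade, the diffing done by eye in the comments above can be sketched in Python. This is an illustration only, not something used in the original investigation: `parse_config_space`, `diff_config_space` and `read_bytes` are hypothetical helper names, and the sample data is just the 0x00 and 0x50 rows copied from the RAID disabled/enabled dumps in the description. The bytes at 0x54-0x57 are returned in dump order; the comments do not say which byte order their 0x05000505-style values use, so the sketch does not try to assemble them into one number.

```python
import re

# One hex row of `lspci -x` style output: an offset, a colon, then up to 16 bytes.
# Header lines such as "00:0f.1 0101: 1166:0213 ..." do not match because no
# whitespace follows the first colon.
ROW = re.compile(r"^\s*([0-9a-fA-F]{2,3}):((?:\s+[0-9a-fA-F]{2}){1,16})\s*$")


def parse_config_space(text):
    """Return {offset: byte value} for every hex row found in the text."""
    space = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # device header, Flags, I/O port lines and so on
        base = int(match.group(1), 16)
        for index, byte in enumerate(match.group(2).split()):
            space[base + index] = int(byte, 16)
    return space


def diff_config_space(first, second):
    """Return (offset, first_byte, second_byte) for each offset that differs."""
    offsets = sorted(set(first) | set(second))
    return [
        (offset, first.get(offset), second.get(offset))
        for offset in offsets
        if first.get(offset) != second.get(offset)
    ]


def read_bytes(space, start, length):
    """Return `length` bytes starting at `start`, in dump order.

    Raises KeyError if the dump does not cover the requested range.
    """
    return [space[start + index] for index in range(length)]


# Rows 0x00 and 0x50 copied from the two dumps in the description.
RAID_DISABLED = """\
00: 66 11 13 02 55 01 00 02 b0 8a 01 01 08 40 80 00
50: 00 00 00 00 03 00 55 00 0f 04 03 00 00 00 00 00
"""

RAID_ENABLED = """\
00: 66 11 13 02 55 01 10 02 b0 8f 04 01 08 40 80 00
50: 00 00 00 00 01 00 05 00 0f 04 03 00 00 00 00 00
"""

if __name__ == "__main__":
    disabled = parse_config_space(RAID_DISABLED)
    enabled = parse_config_space(RAID_ENABLED)
    # Both samples cover the same offsets, so neither side is ever None here.
    for offset, before, after in diff_config_space(disabled, enabled):
        print(f"0x{offset:02x}: {before:02x} -> {after:02x}")
    region = read_bytes(enabled, 0x54, 4)
    print("0x54-0x57:", " ".join(f"{byte:02x}" for byte in region))
```

On the sample rows, the differences at 0x09 and 0x0a are the programming interface and subclass bytes of the standard PCI header, which lines up with the header line changing from `0101` (prog-if 8a) to `0104` (prog-if 8f). The same functions could be pointed at the full text of any of the attached lspci outputs to compare the 0x54-0x57 region across GA, U1 and Fedora 8.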