Bug 200119 - sginfo of RAID drives leads to disk corruption
Summary: sginfo of RAID drives leads to disk corruption
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: sg3_utils
Version: 4.0
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Dan Horák
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-07-25 16:14 UTC by Michael J. Slifcak
Modified: 2008-08-15 09:10 UTC (History)
0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-08-15 09:10:27 UTC
Target Upstream Version:
Embargoed:



Description Michael J. Slifcak 2006-07-25 16:14:09 UTC
Description of problem:
Disk corruption occurs when attempting to read serial numbers of
physical drives in a RAID configuration.
This may be specific to the hardware
(Dell 2850 with PERC 4e/Di MegaRAID).

Version-Release number of selected component (if applicable):
RHEL4 Update 3

How reproducible:
Disk corruption correlates directly with running the 'sginfo' program.
The corruption is not immediately apparent, and it may depend on
system activity.

Steps to Reproduce:
1. Configure Dell 2850 BIOS to use RAID on channel A and channel B
2. Configure Dell MegaRAID BIOS for RAID-1,
        2x64kb stripes, WRBACK, ReadAdaptive, DirectIO
3. Install RHEL4 Update 3. A minimum install will do.
4. Run 'sginfo -l'. It will list /dev/sda, /dev/sg0, /dev/sg1.
5. Run 'sginfo -s /dev/sda' multiple times.
  
Actual results:
sginfo reports that the serial numbers are not accessible.
Repeat a number of times while the network and the disk are active.
After a number of reboots, you may notice that services cannot be
started because files are not found. Programs may also show:
   Segmentation fault
when invoked. /sbin/reboot was one such program.

Expected results:
Expected to see the serial numbers of the physical drives.
Expected no disk corruption.
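
Since the corruption is not immediately apparent, one way to detect it
earlier than a failing service is to record file checksums before the
test and re-check them afterwards. The following is a minimal sketch,
not part of the original report: the function names and the manifest
path are hypothetical, and it assumes md5sum from coreutils is installed.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original report): record and re-check
# MD5 checksums of files under a directory to spot silent corruption.

# snapshot DIR MANIFEST: write one checksum line per regular file under DIR.
# Keep MANIFEST outside DIR so it does not checksum itself while being written.
snapshot() {
    find "$1" -type f -exec md5sum {} \; > "$2"
}

# verify MANIFEST: re-read every file listed in MANIFEST. Returns non-zero
# if any checksum differs or a listed file can no longer be read.
verify() {
    md5sum -c "$1" > /dev/null 2>&1
}
```

For example, 'snapshot /sbin /root/sbin.md5' before running sginfo and
'verify /root/sbin.md5' afterwards. A check run immediately may be served
from the page cache rather than from disk, so it is probably most telling
after a reboot, which matches when the symptoms above were noticed.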


Additional info:

Comment 1 Michael J. Slifcak 2006-08-17 17:30:47 UTC
RHEL4 Update 4 (kernel-smp-2.6.9-42.EL) on Dell 1850 (PERC 4e/Si) and
Dell 2850 (PERC 4e/Di) shows no evidence of corruption.

Comment 2 Phil Knirsch 2006-08-22 13:36:40 UTC
Hm, so this seems to have been a kernel bug that was resolved in RHEL4
Update 4?

Read ya, Phil

Comment 3 Michael J. Slifcak 2006-08-22 15:16:57 UTC
I can apply the 2.6.9-42.EL linux-2.6.9-megaraid-update.patch to another kernel.
Which one would you deem worthy?

Comment 4 Phil Knirsch 2006-08-23 08:20:17 UTC
Could you try to apply that patch to your original kernel that caused the problems?

If that prevents the problems, I think we can positively say it was a bug in
the megaraid kernel driver, and that the sg tools just triggered a bug in the
old driver that could have been triggered in other ways, too.

Thanks,

Read ya, Phil

Comment 5 Michael J. Slifcak 2006-08-25 00:48:03 UTC
Applied 2.6.9-42.EL linux-2.6.9-megaraid-update.patch
to 2.6.9-34.0.2.EL source, rebuilt kernel, ran on freshly installed
Dell 2850, which has the PERC 4e/Di, and again on
Dell 1850, which has the PERC 4e/Si.
Confirmed that the disk corruption was no longer produced by
running 'sginfo -s /dev/sda' in a tight loop while transferring
gigabytes from an external source to the filesystem.
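
The tight loop described above could look roughly like the sketch below.
This is a reconstruction under assumptions, not the script that was
actually used: the function name, the iteration count, and the SGINFO
override variable are hypothetical, and the concurrent bulk transfer has
to be started separately. On an affected kernel this is expected to
corrupt the disk, so it should only be run on a scratch install.

```shell
#!/bin/sh
# Hypothetical reproduction helper; names and defaults are assumptions.
# SGINFO can be overridden (for example SGINFO=echo) for a dry run.
SGINFO=${SGINFO:-sginfo}

# sginfo_loop DEVICE COUNT: request the serial number COUNT times back to back.
# Prints the number of invocations that exited with a non-zero status.
sginfo_loop() {
    dev=$1
    count=$2
    i=0
    failures=0
    while [ "$i" -lt "$count" ]; do
        "$SGINFO" -s "$dev" > /dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}
```

For example, 'sginfo_loop /dev/sda 1000' while a large copy onto the
filesystem is in progress. The printed count only reflects exit statuses;
it says nothing about whether corruption occurred, which still has to be
checked separately.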


