Description of problem:
Disk corruption occurs when attempting to read the serial numbers of physical drives in a RAID configuration. This is possibly specific to the vendor (Dell 2850 with PERC 4e/Di MegaRAID).

Version-Release number of selected component (if applicable):
RHEL4 Update 3

How reproducible:
There is a direct correlation between running the 'sginfo' program and disk corruption. The corruption is not immediately apparent. There may be some dependency on system activity.

Steps to Reproduce:
1. Configure the Dell 2850 BIOS to use RAID on channel A and channel B.
2. Configure the Dell MegaRAID BIOS for RAID-1, 2x64kb stripes, WRBACK, ReadAdaptive, DirectIO.
3. Install RHEL4 Update 3. A minimal install will do.
4. Run 'sginfo -l'. It will list /dev/sda, /dev/sg0, /dev/sg1.
5. Run 'sginfo -s /dev/sda' multiple times.

Actual results:
sginfo reports that the serial numbers are not accessible. Repeat a number of times while the network and the disk are active. After a number of reboots, you may notice that services cannot be started because files are not found. Programs may also fail with "Segmentation fault" when invoked; /sbin/reboot was one such program.

Expected results:
Expected to see the serial numbers of the physical drives. Expected no disk corruption.

Additional info:
RHEL4 Update 4 (kernel-smp-2.6.9-42.EL) on a Dell 1850 (PERC 4e/Si) and a Dell 2850 (PERC 4e/Di) shows no evidence of corruption.
Hm, so this seems to have been a kernel bug that was resolved in RHEL4 Update 4? Read ya, Phil
I can apply the 2.6.9-42.EL linux-2.6.9-megaraid-update.patch to another kernel. Which one would you deem worthy?
Could you try applying that patch to the original kernel that caused the problems? If that prevents the problems, I think we can positively say it was a bug in the megaraid kernel driver, and that the sg-tools just triggered a bug in the old driver that could have been triggered in other ways, too. Thanks, Read ya, Phil
Applied 2.6.9-42.EL linux-2.6.9-megaraid-update.patch to 2.6.9-34.0.2.EL source, rebuilt kernel, ran on freshly installed Dell 2850, which has the PERC 4e/Di, and again on Dell 1850, which has the PERC 4e/Si. Confirmed that the disk corruption was no longer produced by running 'sginfo -s /dev/sda' in a tight loop while transferring gigabytes from an external source to the filesystem.
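For anyone wanting to repeat this check, the tight loop described above could be sketched roughly as follows. This is only an illustration, not the exact script used in the test: the function name stress_sginfo and the SGINFO_CMD override variable are hypothetical, and the concurrent bulk transfer to the filesystem has to be started separately.

```shell
#!/bin/sh
# Rough sketch of the 'sginfo -s' tight loop (hypothetical helper).
# SGINFO_CMD is an assumed override hook so the loop can be dry-run
# with a stub command instead of the real sginfo binary.

# stress_sginfo DEVICE COUNT
# Runs 'sginfo -s DEVICE' COUNT times and prints how many calls
# returned a non-zero exit status.
stress_sginfo() {
    device=$1
    count=$2
    cmd=${SGINFO_CMD:-sginfo}
    i=0
    failures=0
    while [ "$i" -lt "$count" ]; do
        # -s asks sginfo for the unit serial number of the device
        if ! "$cmd" -s "$device" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

On a test machine this would be invoked as something like `stress_sginfo /dev/sda 10000` while a large copy onto the same filesystem is running. Only run it on a freshly installed, disposable system, since the unpatched driver may corrupt the disk.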