Bug 179207

Summary: Unstable LSI MegaRaid SATA 300-8X
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: medium
Reporter: Konstantin Olchanski <olchansk>
Assignee: Tom Coughlan <coughlan>
QA Contact: Brian Brock <bbrock>
CC: jbaron, tweeksbugzilla
Doc Type: Bug Fix
Last Closed: 2012-06-20 16:03:11 UTC

Description Konstantin Olchanski 2006-01-28 05:16:41 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.10) Gecko/20050909 Fedora/1.7.10-1.3.2

Description of problem:
We bought two machines with LSI MegaRaid SATA 300-8X controllers: 16 disks and 2 cards per machine. The MegaRaid cards in both machines show similar instability, which is worth documenting here because the Linux kernel code (the megaraid_{mbox,mm} driver and/or the ext3 and md-raid5 drivers) also malfunctions. The fact that two identical machines show the same symptoms suggests that the fault is in the MegaRaid hardware or in the Linux drivers. We have since switched to Marvell MV88SX5081 based RocketRaid 1820A 8-port SATA cards with the proprietary hptmv.ko driver, and they work perfectly.

This is what I have:
1) SuperMicro H8DA8 dual-Opteron motherboard, 2 GB DDR400 memory, two Opteron 248 CPUs.
2) 2 LSI MegaRaid SATA cards on the PCI-X 133 bus (they malfunction the same way in other slots)
3) 16 WDC WD4000YR 400 GB SATA disks
4) each partitioned with a single full-disk partition (sda1, sdb1, etc.)
4a) single disk performance (dd /dev/zero into /dev/sda1) is 50-60 Mbytes/sec
5) raid5 array across all 16 disks, 256k chunk
6) ext3 filesystem (mke2fs -b 4096 -j /dev/md5 1464825600; mke2fs creates only a 1.5 TB filesystem unless I spell out all the numbers. Why?) A rough command-line sketch of this setup appears after this list.
7) [root@amanda ~]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md0              20158268   8600608  10533664  45% /
/dev/md5             5767362144 4830478824 815716892  86% /home1
none                   1028144         0   1028144   0% /dev/shm
8) run dd if=/dev/zero of=/home1/xxx bs=100k
9) in separate shells, run "vmstat 1", "iostat -x 1", and "top";
   observe that dd is writing data at 100-200 Mbytes/sec and that all disks are
   about equally busy (a monitoring sketch appears after this list).
10) after a while, the disk utilization reported by iostat shows all disks go idle
    except, say, sda (sometimes another disk, at random), which goes 100% busy. This continues for a few minutes, then:
    - all disks go about idle;
    - "dd" is reported at 100% system CPU usage,
    - strace of dd shows that it is making progress writing to disk,
      but the "write()" system calls take a very long time.
    - the output file grows at about 10-100 Kbytes/sec (should be 100-200 Mbytes/sec).
11) kill "dd", (it does normally), wait 5-10 minutes for cached data to flush to disk, restart "dd"
    and again everything works normally, disks are busy, data is flowing.
    until the problem repeats, maybe after 30 or so minutes.
12) same thing happens if I copy data from another machine using "rsync" ("rsync" goes to 100% system cpu usage) or NFS (one nfsd goes to 100% cpu usage).
13) if this problem does not happen, then overnight I always see at least one I/O error on one of the disks (any disk) and that disk drops out of the raid array.
14) if *that* does not happen, then overnight all 8 disks on one LSI card disappear (read: the firmware on the LSI card crashed) and the disks are inaccessible until the machine is power cycled (reboot and reset are not good enough).
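
For reference, the array and filesystem setup in items 5-6 roughly corresponds to the command sequence below. This is a sketch only: the report does not say whether mdadm or raidtools was used to build the array, and the device list /dev/sd[a-p]1 is an assumption based on "16 disks, one full-disk partition each"; only the mke2fs line is the reporter's exact invocation.

    # assumption: array built with mdadm; 256k chunk, 16 members, as described in item 5
    mdadm --create /dev/md5 --level=5 --chunk=256 --raid-devices=16 /dev/sd[a-p]1
    # reporter's exact mke2fs invocation (item 6)
    mke2fs -b 4096 -j /dev/md5 1464825600
    # mount point taken from the df output in item 7
    mount /dev/md5 /home1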
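
Likewise, the write test and monitoring in items 8-10 can be summarized as the sketch below, with each monitor run in its own shell. The dd command and the vmstat/iostat invocations are the reporter's; the /proc/mdstat and mdadm --detail checks are assumptions, added only as one way to confirm the dropped raid member described in item 13.

    # sequential write load (item 8)
    dd if=/dev/zero of=/home1/xxx bs=100k
    # in separate shells: system-wide and per-disk activity (items 9-10)
    vmstat 1
    iostat -x 1
    # assumption: after an overnight run, check whether a disk dropped out (item 13)
    cat /proc/mdstat
    mdadm --detail /dev/md5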

K.O.


Version-Release number of selected component (if applicable):
2.6.9-22.0.1.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. See the numbered steps in the description above.
  

Additional info:

Comment 1 Tom Weeks 2006-04-26 20:45:34 UTC
Have you tried this on a non-RAID (PATA/SATA/SCSI) drive?

Tweeks

Comment 2 Jiri Pallich 2012-06-20 16:03:11 UTC
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release you requested that we review has now reached End of Life.
Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to reconsider your feature request for an active release, please re-open the request via the appropriate support channels and provide additional supporting details about the importance of this issue.