From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5) Gecko/20031007 Firebird/0.7

Description of problem:
We have several dual-Xeon, hyperthreaded systems with 12 and 16 PATA drives that are set up with all drives partitioned and then included in four RAID5 arrays. Occasionally, drives in these systems fail. Sometimes (not all the time, probably about 75% of the time), when a drive fails and gets failed out of the array, data on the array becomes corrupt, as evidenced by numerous EXT3-fs file corruption errors. The data is lost; almost everything is blown away. In all cases we're using 3ware cards, but some are 3ware SATA cards (with PATA-to-SATA converters) and others are straight 3ware PATA cards. Drives are all Western Digital, but they vary in size and make. These arrays are under heavy read/write I/O (iostat reports drive utilization at 70-80% constantly).

Version-Release number of selected component (if applicable):
kernel-2.4.20-20.9

How reproducible:
Sometimes

Steps to Reproduce:
1. Build a RAID5 array.
2. Wait for a drive to fail with a Medium Sense error.
3. Witness corruption.

Actual Results:
Filesystem atop the affected RAID5 array corrupts.

Expected Results:
Filesystem atop the affected RAID5 array does not corrupt, as the array is redundant and can afford to lose one drive.

Additional info:
I have been in contact with both 3ware and Western Digital. We have gone very low-level on this. The 3ware card looks like it is passing back the errors from the drives correctly. Western Digital confirms that the drives are seeing the errors and reporting them, and they coincide with the log messages being sent to syslog from the 3ware driver. A drive typically gets failed out only after the read request (and it is always a read request that fails, usually with an ECC error from the drive) fails 5 times. Corruption is only evident AFTER the drive is removed.
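For reference, a minimal sketch of how one of these arrays is assembled with raidtools (the md toolchain on this kernel series). Device names, disk count, and chunk size below are illustrative placeholders, not our exact production layout:

# /etc/raidtab entry for one RAID5 array built from one partition per drive
# (only three member partitions shown; the real arrays span all drives)
raiddev /dev/md0
    raid-level              5
    nr-raid-disks           3
    nr-spare-disks          0
    persistent-superblock   1
    chunk-size              64
    device                  /dev/sda1
    raid-disk               0
    device                  /dev/sdb1
    raid-disk               1
    device                  /dev/sdc1
    raid-disk               2

# create the array, put ext3 on it, and mount it
mkraid /dev/md0
mke2fs -j /dev/md0
mount /dev/md0 /mnt/array0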
I now have systems, again with 3ware cards (7810s, JBOD mode, PATA cards), that produce this corruption, this time with Seagate drives.
This is definitely a problem with the raid5 code. I have a system with 16 drives. I partition each drive into 4 partitions, then create 4 RAID5 arrays with 1 partition from each drive. I then put an array under intensive I/O (both read and write) and do 'raidsetfaulty <array> <partition>'. Immediately I see filesystem corruption on the filesystem on that array. Here are a few of the messages:

Nov 4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov 4 08:17:29 r25 kernel: 09:04: rw=0, want=1949435972, limit=1751284800
Nov 4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov 4 08:17:29 r25 kernel: 09:04: rw=0, want=1773623376, limit=1751284800
Nov 4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov 4 08:17:29 r25 kernel: 09:04: rw=0, want=1847329996, limit=1751284800
Nov 4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov 4 08:17:29 r25 kernel: 09:04: rw=0, want=2080538872, limit=1751284800
Nov 4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov 4 08:17:29 r25 kernel: 09:04: rw=0, want=1963576516, limit=1751284800
Nov 4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov 4 08:17:29 r25 kernel: 09:04: rw=0, want=1813626640, limit=1751284800
Nov 4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov 4 08:17:29 r25 kernel: 09:04: rw=0, want=1945409144, limit=1751284800
Nov 4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov 4 08:17:29 r25 kernel: 09:04: rw=0, want=1781930152, limit=1751284800
Nov 4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov 4 08:17:29 r25 kernel: 09:04: rw=0, want=1795790536, limit=1751284800
Nov 4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov 4 08:17:29 r25 kernel: 09:04: rw=0, want=1872381408, limit=1751284800
Nov 4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov 4 08:17:29 r25 kernel: 09:04: rw=0, want=2020518828, limit=1751284800
Nov 4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov 4 08:17:29 r25 kernel: 09:04: rw=0, want=1796667100, limit=1751284800
Nov 4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov 4 08:17:29 r25 kernel: 09:04: rw=0, want=2006427588, limit=1751284800
Nov 4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov 4 08:17:29 r25 kernel: 09:04: rw=0, want=1949435972, limit=1751284800
Nov 4 08:17:29 r25 kernel: attempt to access beyond end of device
Nov 4 08:17:29 r25 kernel: 09:04: rw=0, want=1949435972, limit=1751284800
Also, with the direct raidsetfaulty method, this is ALWAYS reproducible.
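A sketch of the fault-injection sequence that triggers it, for anyone trying to reproduce. The dd loop and the mount point are only stand-ins for "intensive read/write I/O", not the exact workload we run, and the member partition named is a placeholder:

# generate sustained mixed read/write load on the filesystem on the array
while true; do
    dd if=/dev/zero of=/mnt/array0/bigfile bs=1M count=1024
    dd if=/mnt/array0/bigfile of=/dev/null bs=1M
done &

# while that is running, manually fail one member partition out of the array
raidsetfaulty /dev/md0 /dev/sdk1

# the "attempt to access beyond end of device" messages appear immediately
dmesg | tail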
The bug report in 109251 gives a much more detailed explanation of why I believe this is happening.

*** This bug has been marked as a duplicate of 109251 ***
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.