Bug 159590

Summary: corrupted data using software-raid (md)
Product: Fedora
Component: kernel
Version: 3
Hardware: i386
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: medium
Reporter: Arian Prins <hgaprins>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Brian Brock <bbrock>
CC: wtogami
Doc Type: Bug Fix
Last Closed: 2005-06-05 13:47:50 UTC

Description Arian Prins 2005-06-05 13:08:32 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; nl-NL; rv:1.7.5) Gecko/20041202 Firefox/1.0

Description of problem:
My system was an updated FC3 installation. It had three 180 GB (P-ATA) drives that I combined into a software RAID level 5 array (/dev/md0), created at install time with the default partitioning tools. On top of that I used LVM, and on top of that a few ext3 filesystems, including the root filesystem (booting from a separate hard disk that was not part of the RAID set).
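
For reference, a stack like the one described can be built roughly as follows. This is a minimal sketch; the device names, partition numbers and sizes are assumptions, not taken from the report:

  # RAID-5 array from the three data partitions (assumed names)
  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/hde2 /dev/hdg2 /dev/hdh2
  # LVM on top of the array
  pvcreate /dev/md0
  vgcreate vg0 /dev/md0
  lvcreate -L 100G -n data vg0
  # ext3 on top of the logical volume
  mkfs.ext3 /dev/vg0/data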

After a few months of use all the filesystems suddenly became completely corrupted. The system could no longer boot (it couldn't find init), and when I tried to mount the partitions from a rescue CD, a live CD, or a completely new install on a separate drive, I could not get /dev/md0 mounted.

I tried reinstalling everything, and now at install time, while the installer is formatting the partitions, it fails and reboots (after giving a message that something serious happened).

I have now reinstalled FC3 on the separate hard disk (not part of the three 180 GB drives) without creating the RAID array at all. When I create a RAID-5 set after the install using mdadm, the newly created /dev/md0 filesystem corrupts after a few hours of use; after unmounting, it won't remount. To rule out drive (or controller) failure, I fdisk'ed the individual drives and put an ext3 partition directly on each of them. I filled the three drives up with 1 GB files: no problem. Reading a few of them back (e.g. cat < 1gbfile > /dev/null) gives no problem either.
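
The per-drive test amounts to something like the following (a sketch; the device name and mount point are assumptions):

  # one ext3 partition straight on the disk, no md or LVM in between
  mkfs.ext3 /dev/hdg1
  mount /dev/hdg1 /mnt/test
  # fill the disk with 1 GB files, then read one back
  dd if=/dev/zero of=/mnt/test/1gbfile bs=1M count=1024
  cat < /mnt/test/1gbfile > /dev/null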

This means I have tried the following "chains":
direct partitions on the drives: no problems
combining the drives using RAID-5: corruption
combining the drives using RAID-5 and then LVM on top of that: corruption

This problem may be related to bug 152162, but I'm not sure.


Version-Release number of selected component (if applicable):
kernel-2.6.9-1.667 (but probably updated kernels as well)

How reproducible:
Always

Steps to Reproduce:
Scenario 1:
1. Start installation of FC3.
2. Create a RAID-5 set using three 180 GB disks (on each disk: partition 1: 256 MB swap; partition 2: remainder of the disk for software RAID).
3. Continue the installation process.
4. At formatting time, just before formatting finishes, the installer gives an error message indicating something serious went wrong and reboots.

Scenario 2:
1. Install FC3 on a 40 GB hard drive; leave the 180 GB disks empty (no partitions).
2. Once the system is running, create one partition on each 180 GB drive (type 0xfd).
3. Use mdadm to create a RAID-5 set from the 3 partitions (see the command sketch after these steps).
4. Mount /dev/md0 at a directory.
5. Start adding random data.
6. After a few GB, the data corrupts the filesystem (ls displays irregularities).
7. Unmount /dev/md0.
8. Reboot.
9. mount /dev/md0 gives an error message.
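
A minimal sketch of steps 2-5, assuming the three drives are /dev/hde, /dev/hdg and /dev/hdh (the device names are assumptions; only hdg and hdh appear in the SMART mails below):

  # after creating one 0xfd (Linux raid autodetect) partition per drive:
  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/hde1 /dev/hdg1 /dev/hdh1
  mkfs.ext3 /dev/md0
  mount /dev/md0 /mnt/raid
  # write random data until the corruption shows up
  dd if=/dev/urandom of=/mnt/raid/testfile bs=1M count=4096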

Actual Results:  see steps

Expected Results:  the installer should have finished formatting / no corruption 

Additional info:

I get emails from smartd with subject:
SMART error (CurrentPendingSector) detected on host: bio.lan
.........
Device: /dev/hdh, 11 Currently unreadable (pending) sectors

and in another mail:
Device: /dev/hdg, 2 Currently unreadable (pending) sectors
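
Pending-sector counts like these can be inspected directly with smartctl (a sketch; the device name follows the mail above):

  # dump full SMART data, including Current_Pending_Sector (attribute 197)
  smartctl -a /dev/hdh
  # or run an extended surface self-test
  smartctl -t long /dev/hdh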

Comment 1 Arian Prins 2005-06-05 13:47:50 UTC
After more investigation it seems that the hardware was faulty after all
(filling the hard disks up with data didn't give any problems, but I have now
dumped all data to /dev/null and did get errors). Apologies.
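
A whole-disk read test of that kind can be done with dd (a sketch; substitute the appropriate device name). Unlike the earlier write test, it reads every sector and so trips over the pending sectors smartd reported:

  # read the entire disk; a read error on a pending sector aborts with an I/O error
  dd if=/dev/hdg of=/dev/null bs=1M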