Red Hat Bugzilla – Bug 150421
--fail: fails more devices than specified. (IDE RAID5)
Last modified: 2007-11-30 17:11:01 EST
Description of problem:
mdadm /dev/md7 --fail /dev/hdd5
Resulted in /dev/hdd5 AND /dev/hdc5 marked as failed.
Situation: Have RAID5 (Software) installation on IDE.
Mobo Gigabyte 7VRX (has 4 IDE channels)
Initial setup was hda, hdc, hdd (on IDE 1 & 2)
Strategy: Move third raid device (hdd) to IDE 3
Added in hdg. -> mdadm /dev/md7 --add /dev/hdg1
mdstat confirmed additional md device
Attempted to fail out hdd5
Result: Both /dev/hdc5 and /dev/hdd5 marked (F)
Question: Is there any way to un-mark these devices? Or, more
importantly, to recover the "failed" devices?
Version-Release number of selected component (if applicable):
recent - Last up2date run 03/03/05
Steps to Reproduce:
Not game to attempt to reproduce such an error at this stage.
To move a drive from one controller to another doesn't require
removing/adding the drive from the array. You simply shut down the
machine, move the drive, then at startup it detects that you have
moved the drive and puts it back in the array from the new device
location. If you are wanting to change the physical disk that the
data resides on, then you have to do what you tried to do. However, a
word of caution: IDE drives nowadays are, unfortunately, not what I
would call high-reliability devices. Any time you move data from one
drive to another like this, you are taking a device offline and
forcing the array into degraded mode, at which point it is no longer
fault tolerant, and then telling it to rebuild onto a different drive.
The risk is that something will go wrong during that rebuild. For
IDE drives, I recommend that prior to doing something like this, you
always do something like dd if=/dev/hda of=/dev/null, and do that for
each drive in your current array, as a quick read test to make sure
there are no bad blocks hiding in rarely or never used parts of the
drive.
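The read test described above might look like the sketch below. The device list is only an example based on this report; substitute the actual members of your own array:

```shell
#!/bin/sh
# Quick sequential read test of each array member before degrading the array.
# DISKS is empty by default; set it to your real member drives, e.g.
#   DISKS="/dev/hda /dev/hdc /dev/hdd"
DISKS="${DISKS:-}"
for disk in $DISKS; do
    echo "read-testing $disk ..."
    # A full read pass: any bad sector surfaces as an I/O error here,
    # rather than in the middle of a rebuild.
    dd if="$disk" of=/dev/null bs=1M || echo "READ ERROR on $disk" >&2
done
```

Running this before failing a device out means a latent bad block is found while the array is still fully redundant.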
Now, to your specific case: when you added /dev/hdg1 it should have
just become a hot spare. Once you then failed /dev/hdd5, it should
have been marked as Failed, and reconstruction should have started on
/dev/hdg1. At that point, the raid subsystem would have to read every
single block on /dev/hda5 and /dev/hdc5 in order to reconstruct
/dev/hdg1, and if /dev/hdc5 had any bad sectors, then it would end up
failing as a result and taking the array offline. I'm guessing that's
what happened here. If you still can, check your logs for any error
messages indicating I/O errors to /dev/hdc5.
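A quick way to search for such errors might be the following sketch. The log path is the usual one on Red Hat systems of this era; adjust if your syslog is configured differently:

```shell
#!/bin/sh
# Scan the system log for IDE I/O errors on the suspect drive (hdc).
# The current boot's kernel ring buffer (dmesg) is another place to look.
LOG=/var/log/messages
if [ -r "$LOG" ]; then
    # Typical IDE failure lines mention the drive name plus words like
    # "error", a drive status dump, or a bus reset.
    grep -i 'hdc' "$LOG" | grep -iE 'error|status|reset'
fi
```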
If that's what happened, then your next option is to reboot into
rescue mode and use mdadm to manually assemble the raid5 array. To do
that, do something like:
mdadm -A /dev/md7 --force --run --update=summaries /dev/hda5 /dev/hdc5
I wouldn't try to add /dev/hdd5 back into the array, I would just try
to get it back into the degraded state it was in before. However, if
you know for certain that you didn't write to the array after failing
/dev/hdd5, then you could bring the array back up with all three
devices. The problem is, if the array was still active after you
removed /dev/hdd5, then any writes that would have gone to /dev/hdd5
would have been stored in parity blocks on /dev/hda5 and /dev/hdc5
instead, and if you bring /dev/hdd5 back into the array as a clean
device, we'll read from it instead of the parity blocks and get stale
data, possibly resulting in a corrupted filesystem. Instead, you have
to re-add /dev/hdd5 as a new disk and let it be rebuilt (although
since you have to rebuild a drive anyway, rebuilding /dev/hdg1 makes
more sense than rebuilding to /dev/hdd5 and having to start the move
process over again).
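Putting the suggestion above together, the whole recovery might look like the sketch below. The device names come from this report, and it is deliberately written as a function so each step can be reviewed before anything is run against a live array:

```shell
#!/bin/sh
# Sketch of the recovery sequence discussed above -- review before running.
recover_md7() {
    # 1. From rescue mode, force-assemble the array degraded, hdd5 left out:
    mdadm -A /dev/md7 --force --run --update=summaries /dev/hda5 /dev/hdc5

    # 2. Confirm it started degraded -- expect something like [3/2] here:
    cat /proc/mdstat

    # 3. Add the intended replacement; reconstruction onto hdg1 begins:
    mdadm /dev/md7 --add /dev/hdg1

    # 4. Re-check /proc/mdstat until the rebuild finishes; the array is
    #    not fault tolerant until it does.
    cat /proc/mdstat
}
```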
Hope that helps.
No activity in multiple months, closing.