Red Hat Bugzilla – Bug 150421
--fail: fails more devices than specified. (IDE RAID5)
Last modified: 2007-11-30 17:11:01 EST
Description of problem:
mdadm /dev/md7 --fail /dev/hdd5
Resulted in /dev/hdd5 AND /dev/hdc5 marked as failed.
Situation: Have RAID5 (Software) installation on IDE.
Mobo Gigabyte 7VRX (has 4 IDE channels)
Initial setup was hda, hdc, hdd (on IDE 1 & 2)
Strategy: Move third raid device (hdd) to IDE 3
Added in hdg. -> mdadm /dev/md7 --add /dev/hdg1
mdstat confirmed additional md device
Attempted to fail out hdd5
Result: Both /dev/hdc5 and /dev/hdd5 marked (F)
Question: Is there any way to un-mark these devices? Or, more
importantly, to recover the "failed" devices?
Version-Release number of selected component (if applicable):
recent - Last up2date run 03/03/05
Steps to Reproduce:
Not game to attempt to reproduce such an error at this stage.
To move a drive from one controller to another doesn't require
removing/adding the drive from the array. You simply shut down the
machine, move the drive, then at startup it detects that you have
moved the drive and puts it back in the array from the new device
location. If you are wanting to change the physical disk that the
data resides on, then you have to do what you tried to do. However, a
word of caution: IDE drives nowadays are, unfortunately, not what I
would call high-reliability devices. Any time you move data from one
drive to another like this, you are taking a device offline and
forcing the array into degraded mode, at which point it is no longer
fault tolerant, and then telling it to rebuild onto a different drive.
The risk is that something will go wrong during that rebuild. For
IDE drives, I recommend that prior to doing something like this, you
always do something like dd if=/dev/hda of=/dev/null, and do that for
each drive in your current array, as a quick read test to make sure
there are no bad blocks hiding in rarely or never used parts of the
drive.
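The read test described above might look like the sketch below. The device list is only an example based on this report; substitute the actual members of your own array:

```shell
#!/bin/sh
# Quick sequential read test of each array member before degrading the array.
# DISKS is empty by default; set it to your real member drives, e.g.
#   DISKS="/dev/hda /dev/hdc /dev/hdd"
DISKS="${DISKS:-}"
for disk in $DISKS; do
    echo "read-testing $disk ..."
    # A full read pass: any bad sector surfaces as an I/O error here,
    # rather than in the middle of a rebuild.
    dd if="$disk" of=/dev/null bs=1M || echo "READ ERROR on $disk" >&2
done
```

Running this before failing a device out means a latent bad block is found while the array is still fully redundant.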
Now, to your specific case: when you added /dev/hdg1 it should have
just become a hot spare. Once you then failed /dev/hdd5, it should
have been marked as Failed, and reconstruction should have started on
/dev/hdg1. At that point, the raid subsystem would have to read every
single block on /dev/hda5 and /dev/hdc5 in order to reconstruct
/dev/hdg1, and if /dev/hdc5 had any bad sectors, then it would end up
failing as a result and taking the array offline. I'm guessing that's
what happened here. If you still can, check your logs for any error
messages indicating I/O errors to /dev/hdc5.
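A quick way to search for such errors might be the following sketch. The log path is the usual one on Red Hat systems of this era; adjust if your syslog is configured differently:

```shell
#!/bin/sh
# Scan the system log for IDE I/O errors on the suspect drive (hdc).
# The current boot's kernel ring buffer (dmesg) is another place to look.
LOG=/var/log/messages
if [ -r "$LOG" ]; then
    # Typical IDE failure lines mention the drive name plus words like
    # "error", a drive status dump, or a bus reset.
    grep -i 'hdc' "$LOG" | grep -iE 'error|status|reset'
fi
```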
If that's what happened, then your next option is to reboot into
rescue mode and use mdadm to manually assemble the raid5 array. To do
that, do something like:
mdadm -A /dev/md7 --force --run --update=summaries /dev/hda5 /dev/hdc5
I wouldn't try to add /dev/hdd5 back into the array, I would just try
to get it back into the degraded state it was in before. However, if
you know for certain that you didn't write to the array after failing
/dev/hdd5, then you could bring the array back up with all three
devices. The problem is, if the array was still active after you
removed /dev/hdd5, then any writes that would have gone to /dev/hdd5
would have been stored in parity blocks on /dev/hda5 and /dev/hdc5
instead, and if you bring /dev/hdd5 back into the array as a clean
device, we'll read from it instead of the parity blocks and get stale
data, possibly resulting in a corrupted filesystem. Instead, you have
to re-add /dev/hdd5 as a new disk and let it be rebuilt (although
since you have to rebuild a drive anyway, rebuilding /dev/hdg1 makes
more sense than rebuilding to /dev/hdd5 and having to start the move
process over again).
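Putting the suggestion above together, the whole recovery might look like the sketch below. The device names come from this report, and it is deliberately written as a function so each step can be reviewed before anything is run against a live array:

```shell
#!/bin/sh
# Sketch of the recovery sequence discussed above -- review before running.
recover_md7() {
    # 1. From rescue mode, force-assemble the array degraded, hdd5 left out:
    mdadm -A /dev/md7 --force --run --update=summaries /dev/hda5 /dev/hdc5

    # 2. Confirm it started degraded -- expect something like [3/2] here:
    cat /proc/mdstat

    # 3. Add the intended replacement; reconstruction onto hdg1 begins:
    mdadm /dev/md7 --add /dev/hdg1

    # 4. Re-check /proc/mdstat until the rebuild finishes; the array is
    #    not fault tolerant until it does.
    cat /proc/mdstat
}
```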
Hope that helps.
No activity in multiple months, closing.