Bug 573895

Summary: mdadm -Es looks strange in rescue. Does not boot - sleeping forever
Product: Fedora
Component: mdadm
Version: 12
Hardware: All
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: low
Reporter: Bob Gustafson <bobgus>
Assignee: Doug Ledford <dledford>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: dledford, vedran
Last Closed: 2010-12-03 17:16:03 UTC

Attachments:
  /mnt/sysimage/boot/grub/grub.conf

Description Bob Gustafson 2010-03-16 02:58:30 UTC
Description of problem:

I had a successfully running Fedora 12 system - installed after a full wipe in November 2009.  I am using the BIOS ICH10R RAID and have just two 500GB disks.

One of my raid disks started losing sectors - 55, 56, 63. I bought a new disk, installed it, and did not have much success adding and syncing the new disk (working within the rescue functions of the Fedora 12 Install DVD).

When I electrically disconnected the failing disk, then the add and rebuild worked fine. It took about 100 minutes to get the raid pair so that it looked good to mdadm and cat /proc/mdstat

When I tried to boot with the F12 Install DVD in the machine, boot would not see the hard disks. When I removed the F12 Install DVD, the machine would boot - almost. I could see the light blue background, the Fedora logo was building and then the machine would stop with a screen half full of

 'WARNING: deprecated config file /etc/modprobe.conf...'

and a line:

Boot has failed, sleeping forever.

Rebooting with the F12 Install DVD in rescue mode - would find the raid disk this time, but the cat /proc/mdstat showed that the raid was re-syncing again. The first resync did not 'stick' (and subsequent boot attempts showed the same behavior).

When in control of the rescue DVD and the raid pair fully synced, I see:

cat /etc/mdadm.conf (copied over from system at /mnt/sysimage, the file date is in Nov 2009 indicating it is the original mdadm.conf created by anaconda at initial install)

# mdadm.conf written out by anaconda
MAILADDR root
ARRAY /dev/md0 UUID=8341ba35:d85c255b:b34402c6:4e9a2bfa
ARRAY /dev/md127 UUID=77f691c2:7f898349:fa423a3c:f888f72a

mdadm -Es   (executed on system and then copied over with scp)

ARRAY metadata=imsm UUID=8341ba35:d85c255b:b34402c6:4e9a2bfa
ARRAY /dev/md/Volume0 container=8341ba35:d85c255b:b34402c6:4e9a2bfa member=0
UUID=3a06158d:7d63baa9:9812c2d4:c3168d13

The formats seem to be different. Is the mdadm -Es output correct?


Version-Release number of selected component (if applicable):

mdadm - v3.0.3 - 22nd October 2009


How reproducible:

quite

Steps to Reproduce:
1. start with running Fedora 12 system using fakeRaid on two disks.
2. remove one of the disks (stop, remove, ..)
3. don't touch the bios ctrl-I settings.
4. replace removed disk with a blank one (can repeat this function by zeroing the md superblock I think - I did this too)
5. boot up using F12 Install DVD in rescue mode
6. Wait for added disk to resync (about 100 minutes - monitor using cat /proc/mdstat)
7. Observe mdadm -Es, xxx
8. remove DVD
9. reboot system
  
Actual results:

  does not boot, gives

  Boot failed, sleeping forever

Expected results:

  Normal boot into Fedora 12

Additional info:

cat /proc/mdstat

Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear] 
md0 : inactive sda[1](S) sdb[0](S)
      4514 blocks super external:imsm
       
md127 : active raid1 sda[1] sdb[0]
      487587840 blocks super external:/md_d-1/0 [2/2] [UU]
      
unused devices: <none>

Comment 1 Doug Ledford 2010-03-17 15:30:32 UTC
OK, this is going to be a bit convoluted because there are so many things to address in this one bug, so please bear with me.

First, let's start with the fact that the imsm support in mdadm is relatively recent.  So is the imsm support via mdadm in mkinitrd/dracut/install DVD rescue mode.  That being said, we aren't actually going to be able to "fix" this, at least not with respect to what you have.  In other words, your DVD is already burned; we can't change what's there.  We can only address things going forward.

Now, one thing that jumps out at me right away is that you booted into the install DVD just to add the new disk.  The whole point of a raid1 array is that the machine still functions like normal on one disk.  Booting into the OS itself instead of the rescue mode on the DVD is the wiser thing to do (although if you want to play it safe you can boot into single user mode) because the version of mdadm on the DVD is fixed while you might (actually would) have an updated mdadm on the OS itself that contains bug fixes.  So, the normal action when adding in the new disk would be to boot into the OS in degraded mode, and add the new disk into the existing array while the system is live.

But, regardless, the uuid of the array shouldn't have changed and it clearly did.  I'm guessing this is related to the relatively immature status of the imsm support in mdadm on the install DVD.  The mdadm -Es output from a more up to date mdadm on the live system is definitely the one I would trust.  I would remake all of your initrd images after updating your mdadm.conf file with the new array lines so that the new uuids are present in the initrd images.

So, that all being said, the normal process when this happens would be something like this:

1) disk gives bad sector errors
2) shut down machine, replace defective drive
3) boot machine up into live OS (possibly in single user mode)
4) add new disk into existing array using mdadm /dev/md? -a
5) wait for sync to finish
6) verify mdadm -Es uuid output matches array lines in /etc/mdadm.conf (shouldn't need to do this step, but a wise safety net none the less)

done.  Don't even need to reboot again if you know that the array uuids didn't change and there was no need to rebuild your initrd images.
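
Roughly, assuming /dev/md0 is the imsm container (as in your mdstat output above) and the replacement disk shows up as /dev/sdb, steps 3 through 6 would look something like:

  mdadm /dev/md0 --add /dev/sdb   # for imsm the new disk is added to the container
  cat /proc/mdstat                # watch the rebuild until it finishes
  mdadm -Es                       # compare the uuids against /etc/mdadm.conf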

The one thing not addressed here is the resync on restart.  This can be caused by a failure to properly bring the imsm arrays down, which is done by switching the fs to read-only, then killing the mdmon processes associated with the arrays, waiting for the mdmon processes to exit, and only then rebooting the machine.  I would not be surprised at all if the rescue mode of the DVD did not get this right; we've had to struggle with getting it right in the normal system shutdown sequence because it requires that the mdmon pid and sock files stay someplace that's read/write all the way up until the system reboots.  That means they can't be on a normal filesystem, they must be on a temporary filesystem (and we use /dev/md/ for them now, as the /dev filesystem is read/write all the time).
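
In shell terms, the sequence amounts to roughly the following; this is only an illustration of the ordering, not what any of our scripts literally run:

  mount -o remount,ro /               # plus any other filesystem living on the imsm volume
  kill $(pidof mdmon)                 # signal the mdmon processes for the arrays
  while pidof mdmon >/dev/null; do sleep 1; done   # wait for them to actually exit
  reboot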

So, the action items I get from this bug are:

1) make sure mdadm doesn't change uuids on device add to a degraded array in imsm containers
2) make sure rescue mode of installer properly shuts down mdmon processes by finding their files in /dev/md/ and doing the right thing

Comment 2 Bob Gustafson 2010-03-17 16:28:30 UTC
Thanks very much for your comments, particularly the distinctions between what is on the F12 Install DVD and what I (probably) need to have a running system.

I haven't lost any data (yet). I think I have 3 more or less identical disks - the good one of the original RAID pair, the new disk that has been synced, and the old failing drive which I believe has had failing sectors remapped to good sectors.

Since the imsm hardware/software is relatively immature, I'm thinking of going with software-only RAID. I don't know if there is a hit in performance, but having secure storage of 100's of GB of stuff weighs pretty heavily here.

Another pair of new 500GB drives is relatively cheap insurance. After things have settled down, I could wipe the two good imsm formatted drives and install them on another system as a software only RAID1 pair.

---------

As I recall, I originally mounted the new disk in the box and had 3 disks running when it originally booted up. Then I did some things with the live OS mdadm - probably 'failed' the flakey disk, 'removed' the flakey disk, and then 'added' the new disk. All this without physically removing or electrically disconnecting the flakey disk.

When I then did a test reboot, my woe started with the 'sleeping forever'.

I have a suspicion that the 3 disks confused things.

I do have anaconda.log, program.log, and storage.log files for a number of trials with the DVD after this point, if that would be useful for the autopsy.

-----
A few remaining questions:

Is the output from mdadm -Es on my system totally bogus?

mdadm -Es   (executed on system and then copied over with scp)

ARRAY metadata=imsm UUID=8341ba35:d85c255b:b34402c6:4e9a2bfa
ARRAY /dev/md/Volume0 container=8341ba35:d85c255b:b34402c6:4e9a2bfa member=0
UUID=3a06158d:7d63baa9:9812c2d4:c3168d13

Do those UUID strings correspond to the RAID container and one of the raid components?

Why is there an /dev/md127 and not an /dev/md1 ?

Comment 3 Doug Ledford 2010-03-17 17:03:17 UTC
(In reply to comment #2)
> I haven't lost any data (yet). I think I have 3 more or less identical disks -
> the good one of the original RAID pair, the new disk that has been synced, and
> the old failing drive which I believe has had failing sectors remapped to good
> sectors.

The latest mdadm packages include a script that helps make sure bad sectors are remapped on a regular, timely basis (once a week).  In that situation, while bad sectors are still a bad sign, they aren't nearly as catastrophic as they used to be.
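
The check itself boils down to the md "check" sync action, roughly the following per array (md127 is just an example; the packaged script takes care of the details):

  echo check > /sys/block/md127/md/sync_action   # read every sector; unreadable ones are
                                                 # rewritten from the mirror, which lets the
                                                 # drive remap them
  cat /proc/mdstat                               # progress of the check
  cat /sys/block/md127/md/mismatch_cnt           # mismatches found, if any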

> Since the imsm hardware/software is relatively immature, I'm thinking of going
> with software-only RAID. I don't know if there is a hit in performance, but
> having secure storage of 100's of GB of stuff weighs pretty heavily here.

IMSM raid *is* software raid.  The only difference between it and the pure software raid you are referring to is that they use a different superblock format, and the BIOS of the computer can understand and deal with the IMSM superblock format, so the BIOS knows how to access the raid device with its own built-in software raid stack.  Once you get beyond the superblock differences, they use the exact same raid stack code to drive the disks and have no performance difference.  The only thing that makes imsm immature is that we are working out all the kinks related to making sure that the BIOS is happy with how we handle the superblock.  With our own superblock format, we are the only ones that have to be happy with what we do.  With both IMSM and DDF, we have to do things in a way that the BIOS will accept.

> Another pair of new 500GB drives is relatively cheap insurance. After things
> have settled down, I could wipe the two good imsm formatted drives and install
> them on another system as a software only RAID1 pair.
> 
> ---------
> 
> As I recall, I originally mounted the new disk in the box and had 3 disks
> running when it originally booted up. Then I did some things with the live OS
> mdadm - probably 'failed' the flakey disk, 'removed' the flakey disk, and then
> 'added' the new disk. All this without physically removing or electrically
> disconnecting the flakey disk.

Actually, if you did it in a slightly different order (adding the new disk before removing the old one, which would temporarily convert the raid1 array to a 3 disk array), then that would probably explain why the UUID changed.

> When I then did a test reboot, my woe started with the 'sleeping forever'.
> 
> I have a suspicion that the 3 disks confused things.

Yep.

> I do have anaconda.log, program.log, and storage.log files for a number of
> trials with the DVD after this point, if that would be useful for the autopsy.

I just brought up an Intel IMSM devel box for the purpose of verifying things and making sure all is good.  I'll see if I can replicate your issue and if so determine what happened.  If I can't, I'll ping you again.

> -----
> A few remaining questions:
> 
> Is the output from mdadm -Es on my system totally bogus?
> 
> mdadm -Es   (executed on system and then copied over with scp)
> 
> ARRAY metadata=imsm UUID=8341ba35:d85c255b:b34402c6:4e9a2bfa
> ARRAY /dev/md/Volume0 container=8341ba35:d85c255b:b34402c6:4e9a2bfa member=0
> UUID=3a06158d:7d63baa9:9812c2d4:c3168d13
> 
> Do those UUID strings correspond to the RAID container and one of the raid
> components?

Yes.  The container has its own uuid, and then each volume in the container will have a uuid.

> Why is there an /dev/md127 and not an /dev/md1 ?    

MD device numbers are going away.  The distinction that /dev/md1 means something special is fragile (and always was, just like /dev/sdb is fragile and subject to change when you add a new disk).  So, in the future, md devices will be referred to primarily by name, not number.  This allows a much greater reliability of uniqueness and correctness when referring to devices.  However, for the time being, even named devices still correspond to a number in the internal kernel structs.  So, in order to save room for devices primarily referred to by number, when we allocate a device for a named device, we start at a high number and work our way down.  Hence, the first named device (/dev/md/imsm probably) will get md127.  The next named device (/dev/md/Volume0 where Volume0 is the name assigned to the particular member in the IMSM superblock) would get md126, etc.  That way if you mix and match some imsm arrays and an old /dev/md0 array from another machine that you want to copy files across from, then things will work without having to try and sort out /dev/md0 name conflicts.

Comment 4 Bob Gustafson 2010-03-17 18:10:53 UTC
Thanks much for your extended explanation. I hope it makes its way into the RAID documentation. Bugzilla is a document source that is consulted primarily 'after the problem'.

The 'heads-up' on what might happen naming-wise would be valuable to add to the documentation, perhaps with a time-line as to when it might and did happen.

RAID storage disks tend to stay in place with whatever superblocks were written at the time of initial installation. If the disks keep going on the order of their (limited) warranty (5yrs), this information will remain important long into the future.

In my case, I wouldn't be here now if it wasn't for a few flaky sectors.

-------

Thinking of the folks that will come along behind me with their own disk problems, from your comments, can I say that the proper way to replace a disk is: ??

1) Shut down system normally.
2) electrically disconnect failed/failing disk.
3) install new disk.
  a) Using same controller port/cable as failed disk?
  b) Doesn't matter which port/cable is used for new disk?
  c) Does new disk have to be completely wiped of any superblocks?
4) Boot system normally
5) monitor cat /proc/mdstat to determine when RAID volume is rebuilt.
6) After RAID volume is rebuilt,
   to test, shutdown system and then reboot normally.
7) monitor cat /proc/mdstat
  a) If it is rebuilding again, something is wrong.
  b) If cat /proc/mdstat shows raid normal, go have breakfast.

Comment 5 Bob Gustafson 2010-03-17 20:22:28 UTC
In your Comment #3, you describe the software development process necessary to work with both IMSM and the Fedora software raid components.

I read your paragraph slightly differently than perhaps you intended.

1) You don't have a human readable copy of the Intel IMSM firmware.

2) You have a specification from Intel that describes the inputs and outputs of their IMSM firmware. You don't know if this is correct or complete.

3) You don't know when the Intel IMSM firmware will change in the future on new motherboards or on updates (if possible) to older motherboards.

4) The superblocks written by the IMSM firmware are unique to that particular motherboard and thus those disks can only be used on that motherboard.

------

If some/any of these issues are true, it is a good reason for me to stay away from using the Intel IMSM firmware and just use the Fedora Total Software RAID.

As you say, "IMSM raid *is* software raid", there doesn't seem to be any reason for me to use IMSM raid, and several reasons not to use it.

Do I have the story wrong?

Comment 6 Doug Ledford 2010-03-17 20:39:59 UTC
(In reply to comment #4)
> Thanks much for your extended explanation. I hope it makes its way into the
> RAID documentation. Bugzilla is a document source that is consulted primarily
> 'after the problem'.

That's being worked on as well.

> -------
> 
> Thinking of the folks that will come along behind me with their own disk
> problems, from your comments, can I say that the proper way to replace a disk
> is: ??
> 
> 1) Shut down system normally.
> 2) electrically disconnect failed/failing disk.

Step 2 is optional.  If the disk hasn't failed entirely you can leave it in place.  However, whether or not the UUID of the array will change depends on whether it is a native software raid device or a BIOS raid device.  The BIOS raid devices don't know (to my knowledge) how to reconfigure a raid1 array from two to three disks and keep the same UUID.  I could be wrong on that though.  I do know for a fact that the linux native software RAID devices will allow you to temporarily increase a raid1 array from 2 to 3 disks, get things all synced up, then remove the failing disk only after the new disk is fully synced, all without ever changing UUID.
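
With example device names (say the array is /dev/md0 and the new disk is /dev/sdc), that temporary 2-to-3 disk step for a native array looks roughly like:

  mdadm /dev/md0 --add /dev/sdc            # new disk goes in as a spare
  mdadm --grow /dev/md0 --raid-devices=3   # promote it to a third active mirror; resync starts
  cat /proc/mdstat                         # wait for the resync to finish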

> 3) install new disk.
>   a) Using same controller port/cable as failed disk?
>   b) Doesn't matter which port/cable is used for new disk?

Doesn't matter which port/cable is used.

>   c) Does new disk have to be completely wiped of any superblocks?

It should be, yes.  If it's not, and it was previously a raid1 device, then the udev rules may attempt to start it before you can wipe it out and add it into the degraded array.  If that happens you just have to stop the md device that udev started when it found the disk, zero the superblock, then add the disk into the degraded array.
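
In that case the recovery is just (again, device names are examples only):

  mdadm --stop /dev/md126            # stop whatever udev auto-assembled from the stale superblock
  mdadm --zero-superblock /dev/sdc1  # wipe the old raid superblock off the new disk
  mdadm /dev/md0 --add /dev/sdc1     # then add it into the degraded array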

> 4) Boot system normally

Add new drive to existing array, then monitor mdstat for finish of rebuild.  Depending on whether or not the failing drive is in the array, you may have to grow the array to a 3 disk raid1 array when you add the disk, at which point it will resync to the new disk.

> 5) monitor cat /proc/mdstat to determine when RAID volume is rebuilt.
> 6) After RAID volume is rebuilt,
>    to test, shutdown system and then reboot normally.

If you added the new disk before removing the old one, then this is when you would remove the failing drive and perform another grow operation to shrink the array back down to just 2 disks (yes, it's an oxymoron to use grow to shrink an array ;-)
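
With the same example device names as above, that final step would be:

  mdadm /dev/md0 --fail /dev/sdb            # mark the failing disk faulty (if it isn't already)
  mdadm /dev/md0 --remove /dev/sdb          # pull it out of the array
  mdadm --grow /dev/md0 --raid-devices=2    # shrink the raid1 back down to two active mirrors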

> 7) monitor cat /proc/mdstat
>   a) If it is rebuilding again, something is wrong.
>   b) If cat /proc/mdstat shows raid normal, go have breakfast.    

Pretty much.

Comment 7 Doug Ledford 2010-03-17 20:47:06 UTC
(In reply to comment #5)
> In your Comment #3, you describe the software development process necessary to
> work with both IMSM and the Fedora software raid components.
> 
> I read your paragraph slightly differently than perhaps you intended.
> 
> 1) You don't have a human readable copy of the Intel IMSM firmware.

No, but we don't need one either really.  There is a spec to follow.  But, as with any two pieces of software following a spec, there are bugs.  The real issue is to be bug compatible.  And since bugs change with different BIOS versions, it's all just a balancing act of getting the right set of common actions that work across all the known BIOS variants we need to support.

> 4) The superblocks written by the IMSM firmware are unique to that particular
> motherboard and thus those disks can only be used on that motherboard.

They are unique to the Intel Matrix RAID software, but not to a single motherboard.  As long as the motherboard supports Matrix raid, then the devices should be movable.  However, as you might imagine, this would not allow you to move the array from an Intel based machine to an AMD based machine as the AMD BIOS won't support the Intel Matrix RAID.

> ------
> 
> If some/any of these issues are true, it is a good reason for me to stay away
> from using the Intel IMSM firmware and just use the Fedora Total Software RAID.
> 
> As you say "IMSM raid *is* software raid." there doesn't seem to be any reason
> for me to use IMSM raid, and several reasons not to use it.
> 
> Do I have the story wrong?    

The one major reason to use it is if you want the BIOS to be able to read your devices.  This mainly only comes into play when setting up a boot partition and configuring the boot loader.  IMSM devices look like a regular disk to the boot loader, while Fedora RAID devices look like a bunch of different disks and the boot loader has to know how to read them.  Most boot loaders can't do things like read a raid5 array.  The BIOS on the other hand makes that raid5 array look like a single disk and the boot loader "just works".  So, if you are happy creating a standalone /boot partition that is raid1 based, which is supported by the standard boot loaders, then Fedora's raid will do just fine.  If you want fancier boot loader support, like your boot kernel on a raid5 array, then you really need IMSM or DDF BIOS supported raid devices.

Comment 8 Bob Gustafson 2010-03-17 21:41:35 UTC
From your Comment #6

>   c) Does new disk have to be completely wiped of any superblocks?
> 
> It should be, yes.  If it's not, and it was previously a raid1 device, then the
> udev rules may attempt to start it before you can wipe it out and add it into
> the degraded array.  If that happens you just have to stop the md device that
> udev started when it found the disk, zero the superblock, then add the disk
> into the degraded array.

As an enhancement, is there a way to go into a 'maintenance mode' (purgatory) between power-on and raid-driver-in-control ? The idea is to have a user friendly environment where individual disks could be inspected, superblocks could be read/zeroed, documentation read, and raid building started.

This would avoid the need for quick fingers (and wits) to barge into a raid-driver-in-control process milliseconds after power-on, and perhaps risk disk data.

There are some folks who run a highly automated shop where disks are brought up from hot spare and robots hot-unplug the dead disk - these would have to be served too. But since the power does not go down, maybe these systems step over this issue.

Comment 9 Doug Ledford 2010-03-17 21:50:01 UTC
You can always boot into the rescue CD environment just long enough to run mdadm --zero-superblock <new drive> then reboot into the normal environment.

Comment 10 Bob Gustafson 2010-03-17 22:00:33 UTC
(In reply to comment #9)
> You can always boot into the rescue CD environment just long enough to run
> mdadm --zero-superblock <new drive> then reboot into the normal environment.    

Provided, of course, that the rescue CD is reasonably up-to-date relative to the state of the running/degraded system.

If re-spins of the Fedora Install CD were available, or even just a simple CD that tracked changes in the boot and disk software, then problems of stale software trying to repair upgraded systems could be avoided.

Comment 11 Doug Ledford 2010-03-17 22:11:51 UTC
You probably don't have anything to worry about when it's just a matter of running --zero-superblock.  That's been around in mdadm for a while and isn't likely to have any stale software issues ;-)  But even if you were concerned about that, another option is to mount the system disk from rescue mode, then chroot to /mnt/sysimage, then run mdadm from there.  That would get you a fully updated mdadm, and the entire system for that matter; the only thing that would still be from the rescue DVD would be the kernel itself.
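
Concretely, once rescue mode has mounted the installed system under /mnt/sysimage, it's just something like (the disk name is an example):

  chroot /mnt/sysimage
  mdadm --zero-superblock /dev/sdb
  exit

(If /dev isn't populated inside the chroot, bind-mount it first with mount --bind /dev /mnt/sysimage/dev before running chroot.)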

Comment 12 Bob Gustafson 2010-03-17 22:39:07 UTC
(In reply to comment #11)

> another option is to mount the system disk from rescue mode, then chroot to
> /mnt/sysimage, then run mdadm from there.  That would get you a fully updated
> mdadm and entire system for that matter   

OK, in my system at the present state, I have been using rescue mode and I have been able to chroot to /mnt/sysimage. However, I haven't quite gotten the incantations correct to wind up with a bootable RAID1 system. It looks good when in rescue (after the rebuild time), but does not boot once the DVD is removed.

>, the only thing that would still be
> from the rescue dvd would be the kernel itself.

My /boot partition has several kernels - from yum upgrades since Nov 2009.

Is there a series of commands that will get me a bootable system?

Comment 13 Bob Gustafson 2010-03-17 22:55:11 UTC
(In reply to comment #7)

> So, if you are happy
> creating a standalone /boot partition that is raid1 based, which is supported
> by the standard boot loaders, then Fedora's raid will do just fine.  If you
> want fancier boot loader support, like your boot kernel on a raid5 array, then
> you really need IMSM or DDF BIOS supported raid devices.    

I have Fedora Software raid on another system. As long as I remember to write grub on both component disks (I have raid1 /boot and swap), then, as you write, I should be able to boot no matter which disk goes bad on that system.

By not depending on the BIOS IMSM it would be very easy to implement the 'purgatory' scheme mentioned in Comment #8.

Comment 14 Bob Gustafson 2010-03-18 14:55:49 UTC
I forgot a step.

If the new disk is really 'blank', it needs a partition table before it can be used by any of the mdadm tools.

The partition table can be created by just copying it from the good disk.

Assuming that the good disk is /dev/sda and the new blank disk is /dev/sdb

  sfdisk -d /dev/sda | sfdisk /dev/sdb 

This hint was obtained from:
  http://www.linuxconfig.org/Linux_Software_Raid_1_Setup

A couple of addition references:

  http://ubuntuforums.org/archive/index.php/t-410136.html

  http://raid.wiki.kernel.org/index.php/RAID_Boot

Comment 15 Doug Ledford 2010-03-19 01:37:57 UTC
(In reply to comment #12)
> (In reply to comment #11)
> 
> > another option is to mount the system disk from rescue mode, then chroot to
> > /mnt/sysimage, then run mdadm from there.  That would get you a fully updated
> > mdadm and entire system for that matter   
> 
> OK, in my system at the present state, I have been using rescue mode and I have
> been able to chroot to /mnt/sysimage. However, I haven't quite gotten the
> incantations correct to wind up with a bootable RAID1 system. It looks good
> when in rescue (after the rebuild time), but does not boot once the DVD is
> removed.
> 
> >, the only thing that would still be
> > from the rescue dvd would be the kernel itself.
> 
> My /boot partition has several kernels - from yum upgrades since Nov 2009.
> 
> Is there a series of commands that will get me a bootable system?    

You probably need to remake your initrd images to compensate for the new UUID of the raid array.  Since something caused the raid array's uuid to change, the mdadm.conf in the initrd images will list a bad uuid.  Run mdadm -Esb >> /etc/mdadm.conf and then edit the mdadm.conf file to remove the old array lines and adjust the array names of the new array lines to be the same as the old ones.  For example, when you installed the system it probably had something like:

ARRAY /dev/md0 metadata=imsm uuid=...
ARRAY /dev/md1 uuid=...

the mdadm -Esb may make the same lines look like:

ARRAY /dev/md127 metadata=imsm uuid=...
ARRAY /dev/md/Volume0 container=... uuid=...

The point here being that mdadm doesn't necessarily assume /dev/md0 /dev/md1 /dev/md2 nomenclature when using mdadm -Esb, while anaconda does, so you have to put the names back to what anaconda named the arrays while preserving the new uuids.  Once you've removed the old ARRAY lines and named the new ARRAY lines appropriately, then you need to remake the initrd images:

mkinitrd -vf /boot/initrd-<version>.img <version>

Once that's done, you should have a bootable system.

Comment 16 Bob Gustafson 2010-03-19 02:38:23 UTC
With my current two disk raid1, unbootable system - if I <ctrl>I into the Intel Bios raid system at boot, and select the 3) Reset disks to Non-Raid, will I lose the data on those two disks?

Comment 17 Bob Gustafson 2010-03-19 14:38:30 UTC
Is the Intel specification on their ICH10R available? Or did you have to sign an NDA to get a copy?

Comment 18 Bob Gustafson 2010-03-19 14:53:50 UTC
(In reply to comment #15)
> (In reply to comment #12)
> > (In reply to comment #11)

> > Is there a series of commands that will get me a bootable system?    
> 

> mkinitrd -vf /boot/initrd-<version>.img <version>
> 
> Once that's done, you should have a bootable system.    

Looking in my /mnt/sysimage/boot directory

I don't have any initrd files. They are all initramfs.

Comment 19 Doug Ledford 2010-03-19 15:38:07 UTC
(In reply to comment #16)
> With my current two disk raid1, unbootable system - if I <ctrl>I into the Intel
> Bios raid system at boot, and select the 3) Reset disks to Non-Raid, will I
> lose the data on those two disks?    

I can't answer that and I would be leary of doing so.

(In reply to comment #18)

> > mkinitrd -vf /boot/initrd-<version>.img <version>
> > 
> > Once that's done, you should have a bootable system.    
> 
> Looking in my /mnt/sysimage/boot directory
> 
> I don't have any initrd files. They are all initramfs.    

Then you need to do the similar thing with dracut:

dracut -f /boot/initramfs-<version>.img <version>

Comment 20 Bob Gustafson 2010-03-19 16:25:53 UTC
(In reply to comment #19)
> (In reply to comment #16)
> > With my current two disk raid1, unbootable system - if I <ctrl>I into the Intel
> > Bios raid system at boot, and select the 3) Reset disks to Non-Raid, will I
> > lose the data on those two disks?    
> 
> I can't answer that and I would be leary of doing so.

I understand that. If I had a copy of Intel's spec on the ICH10R, maybe I could figure out the answer to my question.

Comment 21 Bob Gustafson 2010-03-19 18:33:45 UTC
(In reply to comment #20)
> (In reply to comment #19)
> > (In reply to comment #16)
> > > With my current two disk raid1, unbootable system - if I <ctrl>I into the Intel
> > > Bios raid system at boot, and select the 3) Reset disks to Non-Raid, will I
> > > lose the data on those two disks?    

I found some info on:

http://information-technology.converanet.com/CVN01/cachedhtml?hl=keywords&kw=s%3Anasa\.0170W&cacheid=ds2-va:p:1001t:3319885475928:988825c2d7843dad:4ab5be2c&scopeid=defLink

It reads in part:

Intel South Bridge HostRAID Configuration User's Guide

3. Resetting to Non-RAID

Warning: Be cautious when you reset a RAID volume HDD to a non-RAID HDD. Resetting a RAID HDD or resetting a RAID volume will reformat the disk drive. The internal RAID structure and contents will be deleted.

=========

So, I won't go in that direction..

Comment 22 Bob Gustafson 2010-03-20 16:00:49 UTC
I tried another experiment.

I disconnected the 2nd drive so there is only the good drive from the original raid1 pair. I removed the DVD and then booted.

My expectation was that the system would be degraded (and the ICH screen at boot did say that), but would boot.

Sadly it did not - I got the same 'Boot has failed, sleeping forever.'

Comment 23 Doug Ledford 2010-03-21 01:52:24 UTC
In addition to remaking the initrd image with the new mdadm.conf file (with the new ARRAY lines and new UUID values), you will also need to correct the /boot/grub/grub.conf entries, as with F12 there are UUIDs passed in on the kernel command line, and since your device UUID changed, that command line needs to be changed as well.
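
For example (the kernel version and uuid here are placeholders, not values taken from your system), a dracut-era kernel line that pins the md array by uuid would look something like:

  kernel /vmlinuz-2.6.31.<something>.fc12.x86_64 ro root=<your root device> rd_MD_UUID=<new volume uuid> ...

If your grub.conf turns out not to pass any rd_MD_UUID= or root=UUID= arguments, then only mdadm.conf and the initramfs images need fixing.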

Comment 24 Bob Gustafson 2010-03-22 17:34:28 UTC
(In reply to comment #22)
> I tried another experiment.
> 
> I disconnected the 2nd drive so there is only the good drive from the original
> raid1 pair. I removed the DVD and then booted.
> 
> My expectation was that the system would be degraded (and the ICH screen at
> boot did say that), but would boot.
> 
> Sadly it did not - have same Boot has failed, sleeping forever.    

When I plugged the 2nd drive back in and booted with the Fedora 12 Install DVD, the ICH10R screen at boot said it was rebuilding the RAID1 pair.

When I then kept going and booted up from the DVD, it said that it saw the Linux partitions and mounted the RAID1 pair under /mnt/sysimage. This did not happen with the 2nd disk disconnected.

Previously (Comment #22) even with the FC12 DVD in rescue mode, it did not see the single RAID1 drive.

Comment 25 Bob Gustafson 2010-03-22 17:37:51 UTC
(In reply to comment #23)
> In addition to remaking the initrd image with the new mdadm.conf file with the
> new ARRAY lines and new UUID values, you will also need to correct the
> /boot/grub/grub.conf entries as with F12 there are UUIDs passed in on the
> kernel command line and since your device UUID changed, that command line needs
> changed as well.    

There are no UUID values passed in on the kernel command line - see attached /mnt/sysimage/boot/grub/grub.conf

Comment 26 Bob Gustafson 2010-03-22 17:43:00 UTC
Created attachment 401833 [details]
/mnt/sysimage/boot/grub/grub.conf

The directory of /boot/grub shows that this file was modified on
2010-03-06 19:34

All of the other files in this directory have dates back in 2009 - probably from original install using anaconda.

Comment 27 Bob Gustafson 2010-03-22 21:41:20 UTC
OK, I have two new blank disks - identical to what is in the box now (Seagate 500GB).

What I would like to end up with is both of the new disks in a RAID1 configuration, NOT using the Intel ICH10R BIOS fake raid, but rather the plain vanilla Fedora software raid.

What are reasonable steps to get my data from the current ICH10R RAID1 disk pair to the new disks configured as software only RAID1?

Along these lines, I have a couple of questions:

1) If I install Fedora 12 (using software raid) while the BIOS is set for AHCI instead of RAID, afterwards, will it boot and run if the BIOS is set for RAID (but not creating any raid volumes using the BIOS)?

2) If I install Fedora 12 (using software raid) while the BIOS is set for RAID, but not create any bios raid volumes, can I then later switch the BIOS to AHCI and boot and run the software raid volumes?

---

The idea is to create a new system, but then also be able to copy data from my existing RAID1 which was created under the BIOS RAID1 regime.

Comment 28 Bob Gustafson 2010-03-22 21:46:49 UTC
Since I will have a relatively virgin system running software raid for awhile (I don't need to install my old data immediately), I can run tests you might suggest.

Comment 29 Doug Ledford 2010-07-22 15:08:55 UTC
Hi Bob, did you ever get this problem resolved?  I don't think it's an mdadm problem so much as the uuid changing, which has more to do with the internals of imsm arrays, so I'm not sure it's a valid mdadm bug regardless of whether you got it worked out.

Comment 30 Bob Gustafson 2010-07-22 23:32:35 UTC
I have two F13 systems running now - both with software RAID 1 - one on SCSI disks, the other on SATA.

It seems that the boot failure on FC12 was due to a problem with the installation of grub on the disk that was being booted from.

I am not using BIOS RAID - software RAID is a touch slower, but it has fewer moving parts.

With BIOS RAID, boot is done from the raid array (yes?).

With Software RAID, boot is done from one of the raided disks as if it was not in a RAID array (yes?). Because either disk can fail, grub needs to be written to the MBR of both disks.
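
As I understand it, writing grub to the second disk's MBR with the F12-era grub shell goes roughly like this (assuming the second disk is /dev/sdb and /boot is its first partition; adjust to the real layout): device temporarily maps the second disk to hd0, root points at its /boot partition, and setup writes grub to that disk's MBR.

  grub> device (hd0) /dev/sdb
  grub> root (hd0,0)
  grub> setup (hd0)
  grub> quit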

In a multi-disk system, the BIOS can specify which of the disks to boot from. If that happens to be the disk that failed, there can be boot problems.

The boot disk selection feature in the BIOS can have bugs. Switching SATA cables can be helpful in solving boot problems with failed RAID systems.

Comment 31 Bug Zapper 2010-11-03 19:34:09 UTC
This message is a reminder that Fedora 12 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 12.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '12'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 12's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 12 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 32 Bug Zapper 2010-12-03 17:16:03 UTC
Fedora 12 changed to end-of-life (EOL) status on 2010-12-02. Fedora 12 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.