Created attachment 1279596
Description of problem:
Unable to boot from a degraded MD raid1 (/dev/md0) when I unplug one of the block devices belonging to /dev/md0. I raised this issue in https://github.com/dracutdevs/dracut/issues/227 and Harald Hoyer has fixed it. I am wondering whether I could get this fix through an official channel such as "yum update dracut", so Harald Hoyer suggested that I file a bug here.
Version-Release number of selected component (if applicable):
CentOS Linux release 7.3.1611 (Core)
Steps to Reproduce:
1. My /boot partition is an md raid1 made of 3 partitions: /dev/sda3, /dev/sdb3 and /dev/sdd
3. Unplug /dev/sdd and power on.
Actual results:
The system drops to maintenance mode. Entering the password and running "mdadm -D /dev/md0" shows the array as inactive.
Expected results:
The system boots successfully from the degraded md raid1.
I uploaded the boot log, captured with rd.debug on the kernel command line.
/etc/fstab looks like:
UUID=21df6a4c-bce3-4fe4-8739-8e804f4893dd /boot ext2 defaults 1 2
kernel parameters: root=/dev/mapper/os-root rd.lvm.lv=os/root rd.lvm.lv=os/swap rd.md.uuid=5068eccc:e5b886ea8:be8fe496:61ee1fd8
Harald Hoyer gave a precise description and solution of this issue:
Although /dev/md0 is not needed to boot to the real root, you want the initramfs to assemble it (even degraded) via "rd.md.uuid=5068eccc:e5b886ea8:be8fe496:61ee1fd8".
"The correct solution for this problem would be an initqueue/finished hook script, which checks for the existence of an md raid with the specified uuid.
This would lead to a timeout situation, which would trigger the initqueue/timeout scripts to run and which would activate the degraded raids."
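The check described in the quote could look roughly like the sketch below. It is not the actual dracut patch. The function names are made up for illustration, and it assumes that udev exposes each assembled (active) array as /dev/disk/by-id/md-uuid-<uuid> and that the UUID on the command line is written in the same colon-separated form as in that link name.

```shell
#!/bin/sh
# Sketch only, not the actual dracut fix. Function names are hypothetical.

# Print every rd.md.uuid= value found in a kernel command line string.
md_uuids_from_cmdline() {
    for arg in $1; do
        case "$arg" in
            rd.md.uuid=*) printf '%s\n' "${arg#rd.md.uuid=}" ;;
        esac
    done
}

# Succeed only if every requested array already has its by-id link.
# $1: kernel command line
# $2: directory holding the md-uuid-* links (assumed: /dev/disk/by-id)
md_arrays_present() {
    dir="${2:-/dev/disk/by-id}"
    for uuid in $(md_uuids_from_cmdline "$1"); do
        [ -e "$dir/md-uuid-$uuid" ] || return 1
    done
    return 0
}
```

Used as an initqueue/finished hook, the script would end with something like `md_arrays_present "$(cat /proc/cmdline)"`. As long as that returns non-zero the initqueue keeps waiting, which is what eventually lets the timeout scripts run and start the degraded array.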
Same problem for me.
Looks like we are missing the mdadm-last-resort unit files in the initrd, but adding them does not fix anything since we are also missing the related udev rules.
Hmm, it looks like this will also need some changes in the mdadm package, probably nothing we want to do for 7.6 at this phase.
My 7.5 systems do not experience this failure, but 7.6 systems do.
If a drive fails while the system is running and mdadm fails it, there is no problem - at boot, the device is not expected to be part of the array.
The problem occurs when a component of an md device does not exist in the initramfs boot phase.
It would be really nice to have this fixed to have the expected resiliency.
I have systems I can test with.
We have the same experience as schanzle.
A clean 7.5 install, root FS on md raid1 across two block devices, no LVM, one device removed: the system starts after a timeout.
A clean 7.6 install, the same settings, one device removed: the system fails to start and ends at a dracut prompt.
Workaround using dracut from 7.5 base:
yum downgrade dracut-033-535.el7.x86_64.rpm dracut-config-rescue-033-535.el7.x86_64.rpm dracut-network-033-535.el7.x86_64.rpm kexec-tools-2.0.15-13.el7.x86_64.rpm
echo "exclude=dracut dracut-config-rescue dracut-network kexec-tools" >>/etc/yum.conf
7.5 base dracut-033-535.el7.x86_64.rpm is working
7.6 base dracut-033-554.el7.x86_64.rpm fails to start
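To see which of the two builds a machine has, a small helper along these lines could be used. It is only a sketch: the function name is made up, and it knows nothing beyond the two builds named in this comment.

```shell
#!/bin/sh
# Classify a dracut package string using only the two builds reported above.
# The function name is hypothetical.
dracut_build_status() {
    case "$1" in
        dracut-033-535.el7*) echo "reported working (7.5 base)" ;;
        dracut-033-554.el7*) echo "reported failing (7.6 base)" ;;
        *)                   echo "not covered by this report" ;;
    esac
}
```

On a live system it would be called as `dracut_build_status "$(rpm -q dracut)"`.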
I sincerely hope someone is working on getting this serious regression in 7.6 fixed in 7.7.
With the current situation, a RAID1 system is now *more likely* to fail to boot than a single-drive system, since the probability that at least one of two drives fails is greater than the probability that a single drive fails.
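To put a number on that, here is a rough calculation. It assumes the drives fail independently and with the same probability p, and the value 0.05 is made up purely for illustration. If a single missing drive is enough to stop the boot, the chance of not booting with n drives is 1 - (1 - p)^n.

```shell
#!/bin/sh
# P(at least one of n drives is missing) = 1 - (1 - p)^n,
# assuming independent drives with the same per-drive probability p.
# $1: p (hypothetical per-drive probability), $2: n (number of drives)
any_drive_missing_probability() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", 1 - (1 - p) ^ n }'
}

any_drive_missing_probability 0.05 1   # single drive: 0.0500
any_drive_missing_probability 0.05 2   # two-drive RAID1: 0.0975
```

With two drives the figure is 2p - p^2, which is larger than p for any p strictly between 0 and 1.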
This regression defeats some core reasons to set up a RAID1 system and requires significant skill to get the system back up and running (boot rescue media). [My users do not (and should not) have root access.] I don't have physical access to all systems and walking someone through the recovery process would be rather tedious.
At least the likelihood of a drive failing or being removed while the system is not running is relatively low for systems running 24/7.
Another "gotcha" scenario: a drive gets a bad block on a non-/boot md (the 'other' 98% of the drive) and that component fails out. I get md-monitor emails leading me to think "I need to replace that drive." But if the drive hasn't failed out of the relatively small /boot md, you visit the system with a new drive, shut down the system, replace the drive...and it won't boot back up. UGH! [fix: put the old drive back in, boot up, fail all components of the drive, try the replacement process again.]
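The "fail all components of the drive" step can be planned with something like the sketch below. It only prints the mdadm commands and does not run them. The function name is made up, and it assumes sdX-style device names and the usual /proc/mdstat layout ("md0 : active raid1 sdb3[1] sda3[0]").

```shell
#!/bin/sh
# Print (do not run) the mdadm commands that would fail and remove every
# component of one disk from all md arrays found in /proc/mdstat-style input.
# The function name is hypothetical. $1: disk name without /dev/, e.g. sdb
plan_disk_removal() {
    awk -v disk="$1" '
        # Array lines look like: "md0 : active raid1 sdb3[1] sda3[0]"
        $1 ~ /^md/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                comp = $i
                sub(/\[.*$/, "", comp)   # strip "[1]" and any "(F)" suffix
                if (comp ~ ("^" disk "[0-9]*$"))
                    printf "mdadm /dev/%s --fail /dev/%s --remove /dev/%s\n", $1, comp, comp
            }
        }
    '
}
```

Run it as `plan_disk_removal sdb < /proc/mdstat` and review the output before executing anything. Doing this before shutting down means the old drive is no longer expected in any array at the next boot.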
I do not mean to come across as a whining, ungrateful user. I realize this isn't simple to fix and I greatly appreciate the efforts. I just hope it is getting the attention it deserves, which is not clear from the comments above.
Should be fixed in 7.7.