Bug 1451660 - Unable to boot from a degraded MD raid1
Summary: Unable to boot from a degraded MD raid1
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: dracut
Version: 7.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Lukáš Nykrýn
QA Contact: Release Test Team
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: 1465901
 
Reported: 2017-05-17 09:02 UTC by kevin chuang
Modified: 2019-04-25 08:37 UTC
CC List: 12 users
Clone Of:
Last Closed: 2019-04-25 08:37:58 UTC


Attachments
booting log (176.40 KB, text/plain)
2017-05-17 09:02 UTC, kevin chuang


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Bugzilla 1640866 None None None 2019-07-08 14:23 UTC
Red Hat Knowledge Base (Solution) 3989911 None None None 2019-03-15 05:49 UTC

Internal Trackers: 1640866

Description kevin chuang 2017-05-17 09:02:39 UTC
Created attachment 1279596 [details]
booting log

Description of problem:
Unable to boot from a degraded MD raid1 (/dev/md0) after unplugging one of the block devices belonging to /dev/md0. I raised this issue in https://github.com/dracutdevs/dracut/issues/227 and Harald Hoyer has fixed it there. I am wondering whether I can get the fix through an official channel such as "yum update dracut", so Harald Hoyer suggested that I file a bug here.

Version-Release number of selected component (if applicable):
CentOS Linux release 7.3.1611 (Core)

How reproducible:

Steps to Reproduce:
1. My /boot partition is an MD raid1 array made of three devices: /dev/sda3, /dev/sdb3 and /dev/sdd.
2. Power off.
3. Unplug /dev/sdd and power on.

Actual results:
The system drops into maintenance mode. After entering the password and running "mdadm -D /dev/md0", the array is shown as inactive.

Expected results:
Successfully boot from a degraded md raid1.

Additional info:
I uploaded the boot log, captured with rd.debug added to the kernel parameters.
/etc/fstab looks like:
UUID=21df6a4c-bce3-4fe4-8739-8e804f4893dd /boot ext2 defaults 1 2
kernel parameters: root=/dev/mapper/os-root rd.lvm.lv=os/root rd.lvm.lv=os/swap rd.md.uuid=5068eccc:e5b886ea8:be8fe496:61ee1fd8

Harald Hoyer gave a precise description and solution of this issue:
Although /dev/md0 is not needed to boot to the real root, you want the initramfs to assemble it (even degraded) via "rd.md.uuid=5068eccc:e5b886ea8:be8fe496:61ee1fd8"

"The correct solution for this problem would be a initqueue/finished hook script, which checks for the existence of a md raid with the specified uuid.
This would lead to a timeout situation, which would trigger the initqueue/timeout scripts to run and which would activate the degraded raids."
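
To illustrate the mechanism, here is a minimal sketch of such a finished hook (the file name, path and the mdadm-based check are illustrative assumptions, not the actual upstream patch):

#!/bin/sh
# Hypothetical /lib/dracut/hooks/initqueue/finished/99-md-wait.sh
# dracut re-runs the initqueue until every "finished" hook exits 0, so this
# check holds the queue open until an array with the requested UUID has been
# assembled. If a member is missing, the check keeps failing, the queue times
# out, and the initqueue/timeout hooks get a chance to start the array degraded.
uuid="5068eccc:e5b886ea8:be8fe496:61ee1fd8"   # value passed via rd.md.uuid=
mdadm --detail --scan 2>/dev/null | grep -qi "UUID=${uuid}"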

Comment 4 Olivier LAHAYE 2018-02-14 18:13:33 UTC
Same problem for me.

Comment 9 Lukáš Nykrýn 2018-07-17 12:01:30 UTC
Looks like we are missing mdadm-last-resort unit files in initrd, but adding them does not fix anything since we are also missing the related udev rules.
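
For illustration only, the kind of addition this points at in dracut's 90mdraid module-setup.sh could look roughly like the sketch below (the unit and udev rule file names are assumptions taken from upstream mdadm/dracut, not the actual RHEL patch):

install() {
    # hypothetical sketch: copy the mdadm last-resort units into the initramfs;
    # they force-start an array that is still degraded after a grace period
    inst_multiple -o \
        "$systemdsystemunitdir"/mdadm-last-resort@.timer \
        "$systemdsystemunitdir"/mdadm-last-resort@.service
    # plus the mdadm udev rules that schedule the last-resort timer for arrays
    # that udev considers started "unsafe" (i.e. not yet complete)
    inst_rules 63-md-raid-arrays.rules 64-md-raid-assembly.rules
}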

Comment 10 Lukáš Nykrýn 2018-07-17 12:29:11 UTC
Hmm, it looks like this will also need some changes in the mdadm package; probably nothing we want to do for 7.6 at this phase.

Comment 12 schanzle 2018-12-14 22:40:45 UTC
My 7.5 systems do not experience this failure, but 7.6 systems do.

If a drive fails while the system is running and mdadm fails it, there is no problem - at boot, the device is not expected to be part of the array.

The problem occurs when a component of an md device does not exist in the initramfs boot phase.

It would be really nice to have this fixed so that RAID1 provides the expected resiliency.

I have systems I can test with.

Thanks!

Comment 13 Michal Žejdl 2019-03-05 08:48:17 UTC
We have the same experience as schanzle.

A clean 7.5 install, root FS on top of MD RAID1 across two block devices, no LVM, one device removed: the system starts after a timeout.
A clean 7.6 install, the same settings, one device removed: the boot fails and ends at a dracut prompt.


Workaround using dracut from 7.5 base:

yum downgrade dracut-033-535.el7.x86_64.rpm dracut-config-rescue-033-535.el7.x86_64.rpm dracut-network-033-535.el7.x86_64.rpm kexec-tools-2.0.15-13.el7.x86_64.rpm

dracut -f

echo "exclude=dracut dracut-config-rescue dracut-network kexec-tools" >>/etc/yum.conf


In short:

7.5 base dracut-033-535.el7.x86_64.rpm works
7.6 base dracut-033-554.el7.x86_64.rpm fails to boot

Comment 14 Chris Schanzle 2019-03-05 16:20:34 UTC
I sincerely hope someone is working on this serious regression introduced in 7.6 so that it can be fixed in 7.7.

With the current situation, a RAID1 system is now *more likely* to fail to boot than a single-drive system (since the probability that one of two drives fails is greater than the probability that a single drive fails).

This regression defeats some core reasons to set up a RAID1 system and requires significant skill to get the system back up and running (boot rescue media).  [My users do not (and should not) have root access.]  I don't have physical access to all systems and walking someone through the recovery process would be rather tedious.

At least the likelihood of a drive failing or being removed while the system is not running is relatively low for systems running 24/7.

Another "gotcha" scenario is when a drive gets a bad block on a non-/boot md (the 'other' 98% of the drive) and that component fails out, I get md-monitor emails leading me to think "I need to replace that drive."  But if the drive hasn't failed out of the relatively small /boot md, you visit the system with new drive, shut down the system, replace the drive...it won't boot back up.  UGH!  [fix: put old drive back in, boot up, fail all components of the drive, try replacement process again.]

I do not mean to come across as a whining, ungrateful user.  I realize this isn't simple to fix and I greatly appreciate the efforts.  I just hope it is getting the attention it deserves, which is not clear from the comments above.

Thank you!

Comment 15 Lukáš Nykrýn 2019-04-25 08:35:57 UTC
Should be fixed in 7.7.

