Bug 1451660 - Unable to boot from a degraded MD raid1
Summary: Unable to boot from a degraded MD raid1
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: dracut   
Version: 7.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Lukáš Nykrýn
QA Contact: Release Test Team
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: 1465901
 
Reported: 2017-05-17 09:02 UTC by kevin chuang
Modified: 2019-04-04 22:10 UTC

Fixed In Version: dracut-033-546.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
booting log (176.40 KB, text/plain)
2017-05-17 09:02 UTC, kevin chuang


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Bugzilla 1640866 None None None 2019-04-10 02:39 UTC
Red Hat Knowledge Base (Solution) 3989911 None None None 2019-03-15 05:49 UTC

Internal Trackers: 1640866

Description kevin chuang 2017-05-17 09:02:39 UTC
Created attachment 1279596
booting log

Description of problem:
Unable to boot from a degraded MD raid1 (/dev/md0) after unplugging one of the block devices that belong to /dev/md0. I raised this issue in https://github.com/dracutdevs/dracut/issues/227 and Harald Hoyer has fixed it there. I wonder whether I could get this fix in an official way, such as "yum update dracut". Harald Hoyer therefore suggested that I file a bug here.

Version-Release number of selected component (if applicable):
CentOS Linux release 7.3.1611 (Core)

How reproducible:

Steps to Reproduce:
1. My /boot partition is an md raid1 made of 3 devices: /dev/sda3, /dev/sdb3 and /dev/sdd
2. Power off
3. Unplug /dev/sdd and power on

Actual results:
The system drops into maintenance mode. After entering the password to get into maintenance mode, "mdadm -D /dev/md0" shows the array as inactive.
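
The inactive state can also be read from /proc/mdstat in the maintenance shell. A minimal sketch, assuming the usual "mdX : inactive ..." line layout of that file; the function name list_inactive_md and its optional file argument are made up for illustration:

```shell
# list_inactive_md [MDSTAT_FILE]
# Print the name of every md array that the given mdstat-format file
# reports as inactive. Reads /proc/mdstat when no file is given.
list_inactive_md() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }' "${1:-/proc/mdstat}"
}
```

Running list_inactive_md with no argument in the situation described here would be expected to print md0.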

Expected results:
Successfully boot from a degraded md raid1.

Additional info:
I uploaded the boot log, captured with rd.debug on the kernel command line.
/etc/fstab looks like:
UUID=21df6a4c-bce3-4fe4-8739-8e804f4893dd /boot ext2 defaults 1 2
kernel parameter: root=/dev/mapper/os-root rd.lvm.lv=os/root rd.lvm.lv=os/swap rd.md.uuid=5068eccc:e5b886ea8:be8fe496:61ee1fd8

Harald Hoyer gave a precise description and solution of this issue:
Although /dev/md0 is not needed to boot to the real root, you want the initramfs to assemble it (even degraded) via "rd.md.uuid=5068eccc:e5b886ea8:be8fe496:61ee1fd8"

"The correct solution for this problem would be a initqueue/finished hook script, which checks for the existence of a md raid with the specified uuid.
This would lead to a timeout situation, which would trigger the initqueue/timeout scripts to run and which would activate the degraded raids."
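
The check described in the quote can be sketched roughly as follows. This is not the actual dracut change. It assumes that udev creates a /dev/disk/by-id/md-uuid-<UUID> symlink once an array is assembled, and that an initqueue/finished script signals "keep waiting" by exiting non-zero. The function name md_array_present and its optional directory argument are hypothetical:

```shell
# md_array_present UUID [BY_ID_DIR]
# Succeed only if an md array with the given UUID is visible.
# Assumption: assembled arrays get a by-id symlink named md-uuid-<UUID>.
md_array_present() {
    uuid=$1
    by_id_dir=${2:-/dev/disk/by-id}
    [ -e "$by_id_dir/md-uuid-$uuid" ]
}

# In a finished hook, a failing check would keep the initqueue waiting
# until its timeout scripts run, for example:
#   md_array_present "$uuid" || exit 1
```

The point of the design is that the hook does not assemble anything itself. It only refuses to report "finished", so that the existing timeout path gets a chance to start the degraded array.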

Comment 4 Olivier LAHAYE 2018-02-14 18:13:33 UTC
Same problem for me.

Comment 9 Lukáš Nykrýn 2018-07-17 12:01:30 UTC
Looks like we are missing mdadm-last-resort unit files in initrd, but adding them does not fix anything since we are also missing the related udev rules.
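
One way to check this on a given system is to filter the initramfs file listing. A minimal sketch, assuming the listing comes from lsinitrd and that the relevant udev rules have "md" in their file name; the function name filter_md_boot_files is made up for illustration:

```shell
# filter_md_boot_files
# Read a file listing on stdin and print only the lines that mention
# mdadm-last-resort unit files or md-related udev rules.
filter_md_boot_files() {
    grep -E 'mdadm-last-resort@\.(timer|service)|rules\.d/[^/]*md[^/]*\.rules'
}

# Usage (as root): lsinitrd | filter_md_boot_files
```

No output would suggest that neither the unit files nor any md udev rules are in the image.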

Comment 10 Lukáš Nykrýn 2018-07-17 12:29:11 UTC
Hmm, it looks like this will also need some changes in the mdadm package, probably nothing we want to do for 7.6 in this phase.

Comment 12 schanzle 2018-12-14 22:40:45 UTC
My 7.5 systems do not experience this failure, but 7.6 systems do.

If a drive fails while the system is running and mdadm fails it, there is no problem - at boot, the device is not expected to be part of the array.

The problem occurs when a component of an md device does not exist in the initramfs boot phase.

It would be really nice to have this fixed to have the expected resiliency.

I have systems I can test with.

Thanks!

Comment 13 Michal Žejdl 2019-03-05 08:48:17 UTC
We have the same experience as schanzle.

A clean 7.5 install, root FS above mdraid 1 on two block devices, no lvm, one device removed - system starts after timeout.
A clean 7.6 install, the same settings, one device removed - system start fails and ends in a dracut prompt.


Workaround using dracut from 7.5 base:

yum downgrade dracut-033-535.el7.x86_64.rpm dracut-config-rescue-033-535.el7.x86_64.rpm dracut-network-033-535.el7.x86_64.rpm kexec-tools-2.0.15-13.el7.x86_64.rpm

dracut -f

echo "exclude=dracut dracut-config-rescue dracut-network kexec-tools" >>/etc/yum.conf


thus:

7.5 base dracut-033-535.el7.x86_64.rpm is working
7.6 base dracut-033-554.el7.x86_64.rpm fails to start
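
The last step of the workaround appends to /etc/yum.conf every time it is run. A minimal sketch of a repeatable variant, assuming the exclude line should appear exactly once; the function name add_yum_exclude and its optional file argument are hypothetical:

```shell
# add_yum_exclude [CONF_FILE]
# Append the exclude line from the workaround unless the exact same
# line is already present. Defaults to /etc/yum.conf.
add_yum_exclude() {
    conf=${1:-/etc/yum.conf}
    line='exclude=dracut dracut-config-rescue dracut-network kexec-tools'
    grep -qxF "$line" "$conf" || printf '%s\n' "$line" >> "$conf"
}
```

Calling add_yum_exclude twice leaves a single exclude line in the file.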

Comment 14 Chris Schanzle 2019-03-05 16:20:34 UTC
I sincerely hope someone is working on this serious regression in 7.6 so that it is fixed in 7.7.

With the current situation, a RAID1 system is now *more likely* to fail to boot than a single-drive system (since the probability that at least one of two drives fails is greater than the probability that a single drive fails).
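
To put a number on that, here is a small sketch. It assumes both drives fail independently with the same probability p, which is a simplification, and the function name p_any_of_two_fails is made up for illustration. The chance that at least one of two drives fails is 1 - (1 - p)^2:

```shell
# p_any_of_two_fails P
# Print the probability that at least one of two drives fails, given
# that each fails independently with probability P (an assumption).
p_any_of_two_fails() {
    awk -v p="$1" 'BEGIN { printf "%.4f\n", 1 - (1 - p) * (1 - p) }'
}
```

For p = 0.05, p_any_of_two_fails 0.05 prints 0.0975, nearly double the single-drive figure.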

This regression defeats some core reasons to set up a RAID1 system and requires significant skill to get the system back up and running (boot rescue media).  [My users do not (and should not) have root access.]  I don't have physical access to all systems and walking someone through the recovery process would be rather tedious.

At least the likelihood of a drive failing or being removed while the system is not running is relatively low for systems running 24/7.

Another "gotcha" scenario is when a drive gets a bad block on a non-/boot md (the 'other' 98% of the drive) and that component fails out. I get md-monitor emails leading me to think "I need to replace that drive."  But if the drive hasn't failed out of the relatively small /boot md, you visit the system with a new drive, shut down the system, replace the drive...and it won't boot back up.  UGH!  [fix: put old drive back in, boot up, fail all components of the drive, try replacement process again.]
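
The "fail all components of the drive" step can be sketched as a dry run that only prints the mdadm commands and does not execute them. It assumes the usual /proc/mdstat layout ("md0 : active raid1 sdb1[1] sda1[0]") and sdX-style device names; the function name print_fail_cmds is hypothetical:

```shell
# print_fail_cmds DISK [MDSTAT_FILE]
# For every md array that has a member on DISK (e.g. "sdb"), print the
# mdadm command that would fail and remove that member.
# Nothing is executed; review the output before running any of it.
print_fail_cmds() {
    awk -v d="$1" '$2 == ":" {
        for (i = 3; i <= NF; i++) {
            part = $i
            sub(/\[.*/, "", part)          # strip "[1]", "(S)" and similar
            if (part ~ ("^" d "[0-9]*$"))  # DISK itself or one of its partitions
                printf "mdadm /dev/%s --fail /dev/%s --remove /dev/%s\n", $1, part, part
        }
    }' "${2:-/proc/mdstat}"
}
```

Running print_fail_cmds sdb before shutting down lists one command per array that still holds a partition of that drive, including a small /boot array that has not failed on its own.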

I do not mean to come across as a whining, ungrateful user.  I realize this isn't simple to fix and I greatly appreciate the efforts.  I just hope it is getting the attention it deserves, which is not clear from the comments above.

Thank you!

