Bug 1451660 - Unable to boot from a degraded MD raid1
Summary: Unable to boot from a degraded MD raid1
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: dracut
Version: 7.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Lukáš Nykrýn
QA Contact: Release Test Team
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: 1465901
 
Reported: 2017-05-17 09:02 UTC by kevin chuang
Modified: 2019-04-25 08:37 UTC
CC List: 12 users
Clone Of:
Last Closed: 2019-04-25 08:37:58 UTC


Attachments
booting log (176.40 KB, text/plain)
2017-05-17 09:02 UTC, kevin chuang


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Bugzilla 1640866 None None None 2019-07-08 14:23 UTC
Red Hat Knowledge Base (Solution) 3989911 None None None 2019-03-15 05:49 UTC

Internal Trackers: 1640866

Description kevin chuang 2017-05-17 09:02:39 UTC
Created attachment 1279596 [details]
booting log

Description of problem:
Unable to boot from a degraded MD raid1 (/dev/md0) after unplugging one of the block devices belonging to /dev/md0. I raised this issue in https://github.com/dracutdevs/dracut/issues/227 and Harald Hoyer has fixed it there. I am wondering whether I can get the fix through an official channel such as "yum update dracut", so Harald Hoyer suggested that I file a bug here.

Version-Release number of selected component (if applicable):
CentOS Linux release 7.3.1611 (Core)

How reproducible:

Steps to Reproduce:
1. My /boot partition is an MD raid1 array made of three devices: /dev/sda3, /dev/sdb3 and /dev/sdd.
2. Power off.
3. Unplug /dev/sdd and power on.

Actual results:
The system drops into maintenance mode. After entering the password and running "mdadm -D /dev/md0", the array is shown as inactive.

Expected results:
Successfully boot from a degraded md raid1.

Additional info:
I uploaded the boot log, captured with rd.debug added to the kernel parameters.
/etc/fstab looks like:
UUID=21df6a4c-bce3-4fe4-8739-8e804f4893dd /boot ext2 defaults 1 2
kernel parameters: root=/dev/mapper/os-root rd.lvm.lv=os/root rd.lvm.lv=os/swap rd.md.uuid=5068eccc:e5b886ea8:be8fe496:61ee1fd8

Harald Hoyer gave a precise description and solution of this issue:
Although /dev/md0 is not needed to boot to the real root, you want the initramfs to assemble it (even degraded) via "rd.md.uuid=5068eccc:e5b886ea8:be8fe496:61ee1fd8"

"The correct solution for this problem would be a initqueue/finished hook script, which checks for the existence of a md raid with the specified uuid.
This would lead to a timeout situation, which would trigger the initqueue/timeout scripts to run and which would activate the degraded raids."
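
To illustrate the mechanism, here is a minimal sketch of such a finished hook (the file name, path and the mdadm-based check are illustrative assumptions, not the actual upstream patch):

#!/bin/sh
# Hypothetical /lib/dracut/hooks/initqueue/finished/99-md-wait.sh
# dracut re-runs the initqueue until every "finished" hook exits 0, so this
# check holds the queue open until an array with the requested UUID has been
# assembled. If a member is missing, the check keeps failing, the queue times
# out, and the initqueue/timeout hooks get a chance to start the array degraded.
uuid="5068eccc:e5b886ea8:be8fe496:61ee1fd8"   # value passed via rd.md.uuid=
mdadm --detail --scan 2>/dev/null | grep -qi "UUID=${uuid}"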

Comment 4 Olivier LAHAYE 2018-02-14 18:13:33 UTC
Same problem for me.

Comment 9 Lukáš Nykrýn 2018-07-17 12:01:30 UTC
Looks like we are missing mdadm-last-resort unit files in initrd, but adding them does not fix anything since we are also missing the related udev rules.
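
For illustration only, the kind of addition this points at in dracut's 90mdraid module-setup.sh could look roughly like the sketch below (the unit and udev rule file names are assumptions taken from upstream mdadm/dracut, not the actual RHEL patch):

install() {
    # hypothetical sketch: copy the mdadm last-resort units into the initramfs;
    # they force-start an array that is still degraded after a grace period
    inst_multiple -o \
        "$systemdsystemunitdir"/mdadm-last-resort@.timer \
        "$systemdsystemunitdir"/mdadm-last-resort@.service
    # plus the mdadm udev rules that schedule the last-resort timer for arrays
    # that udev considers started "unsafe" (i.e. not yet complete)
    inst_rules 63-md-raid-arrays.rules 64-md-raid-assembly.rules
}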

Comment 10 Lukáš Nykrýn 2018-07-17 12:29:11 UTC
Hmm, it looks like this will also need some changes in the mdadm package; probably nothing we want to do for 7.6 at this phase.

Comment 12 schanzle 2018-12-14 22:40:45 UTC
My 7.5 systems do not experience this failure, but 7.6 systems do.

If a drive fails while the system is running and mdadm fails it, there is no problem - at boot, the device is not expected to be part of the array.

The problem occurs when a component of an md device does not exist in the initramfs boot phase.

It would be really nice to have this fixed so that RAID1 provides the expected resiliency.

I have systems I can test with.

Thanks!

Comment 13 Michal Žejdl 2019-03-05 08:48:17 UTC
We have the same experience as schanzle.

A clean 7.5 install, root FS on top of MD RAID1 across two block devices, no LVM, one device removed: the system starts after a timeout.
A clean 7.6 install, the same settings, one device removed: the boot fails and ends at a dracut prompt.


Workaround using dracut from 7.5 base:

yum downgrade dracut-033-535.el7.x86_64.rpm dracut-config-rescue-033-535.el7.x86_64.rpm dracut-network-033-535.el7.x86_64.rpm kexec-tools-2.0.15-13.el7.x86_64.rpm

dracut -f

echo "exclude=dracut dracut-config-rescue dracut-network kexec-tools" >>/etc/yum.conf


In short:

7.5 base dracut-033-535.el7.x86_64.rpm works
7.6 base dracut-033-554.el7.x86_64.rpm fails to boot

Comment 14 Chris Schanzle 2019-03-05 16:20:34 UTC
I sincerely hope someone is working on this serious regression introduced in 7.6 so that it can be fixed in 7.7.

With the current situation, a RAID1 system is now *more likely* to fail to boot than a single-drive system (since the probability that one of two drives fails is greater than the probability that a single drive fails).

This regression defeats some core reasons to set up a RAID1 system and requires significant skill to get the system back up and running (boot rescue media).  [My users do not (and should not) have root access.]  I don't have physical access to all systems and walking someone through the recovery process would be rather tedious.

At least the likelihood of a drive failing or being removed while the system is not running is relatively low for systems running 24/7.

Another "gotcha" scenario is when a drive gets a bad block on a non-/boot md (the 'other' 98% of the drive) and that component fails out, I get md-monitor emails leading me to think "I need to replace that drive."  But if the drive hasn't failed out of the relatively small /boot md, you visit the system with new drive, shut down the system, replace the drive...it won't boot back up.  UGH!  [fix: put old drive back in, boot up, fail all components of the drive, try replacement process again.]

I do not mean to come across as a whining, ungrateful user.  I realize this isn't simple to fix and I greatly appreciate the efforts.  I just hope it is getting the attention it deserves, which is not clear from the comments above.

Thank you!

Comment 15 Lukáš Nykrýn 2019-04-25 08:35:57 UTC
Should be fixed in 7.7.

