Description of problem:
The RHEL 7.6 dracut-initqueue script has a default timeout of 180 seconds (as defined in the RDRETRY variable), which is higher than the timeout of the systemd root mount service (90 seconds). This can lead to an unbootable system when root resides on a degraded software RAID1 device (the user is dropped to an emergency shell). See https://bugzilla.redhat.com/show_bug.cgi?id=1451660# for an example of the problem. Note that this only happens when the RAID device expects to be healthy but the array is unexpectedly found degraded during boot.
Passing "rd.retry=30" at boot time fixes the degraded array boot problem, as the array is force-started before the systemd root mount service times out. Moreover, the long dracut rd.retry timeout is inconsistent with the dracut.cmdline(7) man page, which states that the timeout should be 30 seconds.
Version-Release number of selected component (if applicable):
dracut-033-554.el7.x86_64
How reproducible:
Install the OS on an MD RAID1 array, shut down the system, remove a drive, and power on the system. The user is dropped to an emergency shell.
Steps to Reproduce:
1. Install the OS on an MD RAID1 array
2. Shut down the system
3. Remove a drive (with system powered off)
4. Power on the system
5. After ~90s, the user is dropped to an emergency shell
Actual results:
OS does not boot properly when a RAID1 array is unexpectedly found degraded.
Expected results:
RAID1 should have no problem booting from a degraded array.
Additional info:
I traced the problem to how mdadm --incremental, the dracut timeout (rd.retry), and the systemd default timeout interact:
- mdadm --incremental will not start/run an array which is unexpectedly found degraded;
- dracut should force-start the array after 2/3 of the timeout value has passed. With the current RHEL default, this amounts to 180/3*2 = 120s;
- systemd expects to mount the root filesystem within 90s. If it does not succeed, it aborts the dracut script and drops to an emergency shell. Since 90s is lower than the 120s at which dracut would act, dracut never gets a chance to force-start the array.
Lowering the rd.retry timeout (setting it as the man page suggests) enables dracut to force-start the array, allowing the systemd service to succeed.
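The timing interaction described above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not dracut code: the function names are hypothetical, and the 2/3 force-start ratio and the 90s systemd timeout are the values stated in this report. Real boots also depend on how long devices take to appear.

```python
# Sketch of the timing race described in this report (hypothetical helper names).
# Assumptions taken from the report: dracut force-starts a degraded array after
# 2/3 of rd.retry, and systemd gives up on the root mount after 90 seconds.

SYSTEMD_ROOT_MOUNT_TIMEOUT = 90  # seconds, as stated in the report


def force_start_time(rd_retry: int) -> int:
    """Seconds into boot at which dracut would force-start the array."""
    return rd_retry * 2 // 3


def array_started_in_time(rd_retry: int,
                          systemd_timeout: int = SYSTEMD_ROOT_MOUNT_TIMEOUT) -> bool:
    """True if the force-start happens before systemd times out."""
    return force_start_time(rd_retry) < systemd_timeout


if __name__ == "__main__":
    for rd_retry in (180, 60, 30):
        print(rd_retry, force_start_time(rd_retry), array_started_in_time(rd_retry))
```

Under these assumptions, the 180s default puts the force-start at 120s, after the 90s systemd timeout, while 60s or 30s put it at 40s or 20s, comfortably before it.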
Hi Lukas, besides merging the commit you posted above, may I suggest decreasing the rd.retry time to 30s by default? This would:
- leave plenty of time for devices to appear;
- match the manpage documented value;
- prevent any new or recurring systemd timeout bug from triggering the degraded array issue (and possibly other issues too).
Thanks.
To be honest, I am a bit afraid to do that. I've learned that enterprise setups are often quite crazy and take ages to boot. So I really don't want to decrease the limit, since I am afraid that we will get a lot of regression reports.
Sure, I can understand that. However, if dracut does not time out *before* systemd, even more issues can arise. Maybe setting it to 60s (rather than 30s) would be a reasonable trade-off. Anyway, the man page should be updated.
Sure, we should fix the manpage. And the patch should make sure that system *never* times out.
Comment 10 RHEL Program Management
2021-03-15 07:35:18 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.