1701504 – Reduce dracut timeout (rd.retry)


Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	dracut
Sub Component:
Version:	7.6
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Lukáš Nykrýn
QA Contact:	qe-baseos-daemons
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-04-19 09:51 UTC by g.danti
Modified:	2021-03-15 07:35 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-03-15 07:35:18 UTC
Target Upstream Version:
Embargoed:

Description g.danti 2019-04-19 09:51:48 UTC

Description of problem:
The RHEL 7.6 dracut-initqueue script has a default timeout of 180 seconds (defined in the RDRETRY variable), which is higher than the timeout of the systemd root mount service (90 seconds). This can lead to an unbootable system when root resides on a degraded software RAID1 device (the user is dropped to an emergency shell). See https://bugzilla.redhat.com/show_bug.cgi?id=1451660# for an example of the problem. Note that this only happens when the RAID device expects to be healthy but the array is unexpectedly found degraded during boot.

Passing "rd.retry=30" at boot time fixes the degraded array boot problem, because the array is force-started before the systemd root mount service times out. Moreover, the long dracut rd.retry timeout is inconsistent with the dracut.cmdline(7) man page, which states that the timeout should be 30 seconds.

Version-Release number of selected component (if applicable):
dracut-033-554.el7.x86_64

How reproducible:
Install the OS on an MD RAID1 array, shut down the system, remove a drive, and power on the system. The user is dropped to an emergency shell.

Steps to Reproduce:
1. Install the OS on an MD RAID1 array
2. Shut down the system
3. Remove a drive (with the system powered off)
4. Power on the system
5. After ~90s, the user is dropped into an emergency shell

Actual results:
OS does not boot properly when a RAID1 array is unexpectedly found degraded.

Expected results:
RAID1 should have no problem booting from a degraded array.

Additional info:
I traced the problem to how mdadm --incremental, the dracut timeout (rd.retry), and the systemd default timeout interact:
- mdadm --incremental will not start/run an array that is unexpectedly found degraded;
- dracut should force-start the array after 2/3 of the timeout value has passed. With the current RHEL default, this amounts to 180/3*2 = 120s;
- systemd expects to mount the root filesystem in at most 90s. If it does not succeed, it aborts the dracut script and drops to an emergency shell. Because 90s is lower than the 120s force-start point, dracut never gets a chance to force-start the array.
Lowering the rd.retry timeout (setting it as the man page suggests) enables dracut to force-start the array, allowing the systemd service to succeed.
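
The timing interaction described above can be checked with a small calculation. This is only a sketch of the arithmetic in this report, not dracut or systemd code: the function names are hypothetical, the 2/3 force-start rule and the 90s root mount timeout are taken from the description above, and 60s is included only because it is a value discussed for rd.retry.

```python
# Sketch of the timing arithmetic from this report (not actual dracut code).
# Assumption: dracut force-starts a degraded array after 2/3 of rd.retry.
# Assumption: the systemd root mount timeout is 90 seconds.

SYSTEMD_ROOT_MOUNT_TIMEOUT = 90  # seconds, as stated in the report


def force_start_delay(rd_retry: int) -> int:
    """Seconds before dracut force-starts a degraded array (2/3 of rd.retry)."""
    return 2 * rd_retry // 3


def can_force_start(rd_retry: int, systemd_timeout: int = SYSTEMD_ROOT_MOUNT_TIMEOUT) -> bool:
    """True if the force-start happens before the systemd timeout expires."""
    return force_start_delay(rd_retry) < systemd_timeout


if __name__ == "__main__":
    for rd_retry in (180, 60, 30):
        delay = force_start_delay(rd_retry)
        verdict = "boots" if can_force_start(rd_retry) else "emergency shell"
        print(f"rd.retry={rd_retry}: force-start at {delay}s -> {verdict}")
```

Under these assumptions, rd.retry=180 puts the force-start at 120s, after the 90s systemd timeout, while rd.retry=30 puts it at 20s, well before it.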

Comment 2 Lukáš Nykrýn 2019-04-24 16:56:09 UTC

Thank you for this report! You are right that the problem is that rd.retry is bigger than the systemd timeout, but I never thought about it, because we should always override the systemd timeout to infinity. And that is where the bug is. Our version of the rootfs generator creates the drop-in only at the beginning, but not after systemd is reloaded.
We need to get https://github.com/dracutdevs/dracut/commit/f53ede36fb26716301d57706f889124ca20f3397#diff-3415611359dc29c3121b680f81e08ff2 into RHEL.

Comment 3 g.danti 2019-04-24 19:47:31 UTC

Hi Lukas, besides merging the commit you posted above, may I suggest decreasing the default rd.retry time to 30s? This would:
- leave plenty of time for devices to appear;
- match the value documented in the man page;
- prevent any new/recurring bug in the systemd timeout from triggering the degraded array issue (and/or other issues too).
Thanks.

Comment 4 Lukáš Nykrýn 2019-04-24 20:10:28 UTC

To be honest, I am a bit afraid to do that. I've learned that enterprise setups are often quite crazy and take ages to boot. So I really don't want to decrease the limit, since I am afraid that we would get a lot of reports about regressions.

Comment 5 g.danti 2019-04-24 20:42:44 UTC

Sure, I can understand that. However, if dracut times out *before* systemd, even more issues can arise. Maybe setting it to 60s (rather than 30s) would be a reasonable trade-off. Anyway, the man page should be updated.

Comment 6 Lukáš Nykrýn 2019-04-24 20:47:30 UTC

Sure, we should fix the man page. And the patch should make sure that systemd *never* times out.

Comment 10 RHEL Program Management 2021-03-15 07:35:18 UTC

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
