Bug 2278534
| Summary: | dracut-install triggered by kernel-core scriptlet or dracut regenerate hangs for an unlimited amount of time | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Persona non grata <nobody+496708> |
| Component: | dracut | Assignee: | dracut-maint-list |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | urgent | Priority: | unspecified |
| Version: | 40 | CC: | dracut-maint-list, gmaxwell, honza, jamacku, janne-fdr, lnykryn, marcan, mironov.ivan, ngompa13, pvalena, teohhanhui |
| Hardware: | x86_64 | OS: | Linux |
| Keywords: | Desktop, Regression | Last Closed: | 2024-08-11 10:36:00 UTC |

Description
Persona non grata
2024-05-01 21:49:31 UTC
Used filesystem is BTRFS. Used graphics cards are Intel Arc A380 and Nvidia RTX 4080.

So the initramfs generation does not hang indefinitely. "time sudo dracut -vvv -f --kver=KERNELVERSION" finished after a total of about 55 minutes. I will try to further diagnose where the issue comes from but do not have that many ideas, help would be appreciated.

*** Bug 2280321 has been marked as a duplicate of this bug. ***

We're getting lots of user reports of unbootable systems since this started happening, because if the initramfs generation gets interrupted for any reason (machine sleep, running out of battery, user impatience, etc.) then the default kernel ends up missing its initramfs and boot fails. This is a major UX issue. If kernel installation and initramfs generation were at least atomic to some extent (the kernel is installed to /boot and the GRUB menu only *after* the initramfs is successfully generated) it wouldn't be that bad, but these two things together end up causing a lot of user pain.

OK, I found the culprit:
/usr/lib/dracut/modules.d/50drm/module-setup.sh
```
for i in /sys/bus/{pci/devices,platform/devices,virtio/devices,soc/devices/soc?,vmbus/devices}/*/modalias; do
    [[ -e $i ]] || continue
    [[ -n $(< "$i") ]] || continue
    # shellcheck disable=SC2046
    if hostonly="" dracut_instmods --silent -s "drm_crtc_init|drm_dev_register|drm_encoder_init" -S "iw_handler_get_spy" $(< "$i"); then
        if strstr "$(modinfo -F filename $(< "$i") 2> /dev/null)" radeon.ko; then
            hostonly='' instmods amdkfd
        fi
    fi
done
```
This bit of code goes through *every single device on the system* (288 on my system) and calls dracut_instmods for each, even for duplicate modaliases. That call takes seconds, because it *itself* goes through every modalias and, for each device, reads /lib/modules/$KVER/modules.*.bin.
There is a ridiculous O(n^2) behavior here with a horrible constant factor. No wonder this takes anywhere from minutes to hours depending on your particular system.
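
The duplicate-modalias point above can be illustrated with a small sketch. It assumes that running the expensive step once per distinct modalias would be enough; `unique_modaliases` and its `sysroot` argument are hypothetical names made up for illustration, they are not part of dracut, and this is not necessarily what any upstream fix does.

```bash
#!/bin/bash
# Hypothetical helper (not part of dracut): print each distinct, non-empty
# modalias exactly once, in first-seen order. The sysfs root is a parameter
# so the function can be exercised against a fake directory tree.
unique_modaliases() {
    local sysroot="${1:-/sys}" f alias
    local -A seen=()
    for f in "$sysroot"/bus/{pci/devices,platform/devices,virtio/devices,soc/devices/soc?,vmbus/devices}/*/modalias; do
        [[ -e $f ]] || continue                    # unmatched globs stay literal, skip them
        alias=$(< "$f")
        [[ -n $alias ]] || continue                # skip devices with an empty modalias
        [[ -n ${seen[$alias]:-} ]] && continue     # already emitted this modalias
        seen[$alias]=1
        printf '%s\n' "$alias"
    done
}
```

Comparing `unique_modaliases | wc -l` with the raw count of modalias files gives a rough idea of how many of the per-device calls are redundant on a given machine.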
This module's code hasn't changed since 2021... https://github.com/dracut-ng/dracut-ng/commits/main/modules.d/50drm/module-setup.sh

Err, 2022 (in released dracut versions).

It looks like this commit may improve things: https://github.com/dracut-ng/dracut-ng/commit/80f2caf4f5ee47a708b5e4bd65c28e3f8ff1b9c8

It does. We should get that released ASAP given how painful this is.

Re the module not changing recently, it's possible that the second n factor was introduced by dracut_instmods more recently, and the module always had one n factor.

I've created https://github.com/redhat-plumbers/dracut-fedora/pull/26 with a backport for rawhide / f40.

Was this an issue in f39 as well? Timing-wise, I started to see this around the time I started updating to f40, but I can't say whether I noticed this issue only after updating to f40.

As far as I remember, even the initial release of Fedora 40 did not have this issue. The problem occurred only after installing from the Live ISO and then doing an update.

Just adding a report of my experience: I upgraded an F38 host to F40 using dnf system-upgrade today. After rebooting, it made it to the kernel-core script and stopped. I waited an hour and a half before terminating it with a control-alt-delete, as there appeared to be nothing else I could do. It came up with an initramfs failure. I booted into the prior kernel and attempted a dnf reinstall of the F40 kernel, which also 'hung'... but since I was on a working system I was able to debug and got enough information to bring me to this bug (e.g. that it was dracut-install that was spinning). I manually applied https://github.com/dracut-ng/dracut-ng/commit/80f2caf4f5ee47a708b5e4bd65c28e3f8ff1b9c8 (the PR linked in this thread) and the reinstall was able to complete successfully. There was still a pretty obvious delay on the scripts step, but it wasn't one that would have caused me any concern during the install. I'm now running the fc40 kernel.

On the system in question, a 128-core EPYC host with a lot of storage on it:

    # for i in /sys/bus/{pci/devices,platform/devices,virtio/devices,soc/devices/soc?,vmbus/devices}/*/modalias; do [[ -e $i ]] && [[ -n $(< "$i") ]] && echo $i; done | wc -l
    14540

https://bodhi.fedoraproject.org/updates/FEDORA-2024-18215bc41f (dracut-102) brings the initramfs generation time back to acceptable levels. It is still slower than dracut-059 but no longer brokenly slow (I saw 28 minutes). On 2 tested Apple silicon machines it took 50 / 120 seconds.

I think this can be closed now. Further optimizations are on the way upstream, but this is no longer a major UX issue.