|Summary:||System hangs in shutdown stage - mdmon killed by dracut shutdown script|
|Product:||[Fedora] Fedora||Reporter:||Dmitriy Kargapolov <bugzilla>|
|Status:||NEW ---||QA Contact:||Fedora Extras Quality Assurance <extras-qa>|
|Version:||34||CC:||dracut-maint-list, jonathan, zbyszek|
Description Dmitriy Kargapolov 2021-05-02 23:35:41 UTC
Created attachment 1778770 rdsosreport.txt

Description of problem:
After upgrading to FC34, the system hangs on every shutdown. The last message on the console is “Unmounting /oldroot timed out.”

Version-Release number of selected component (if applicable):
dracut-053-5.fc34.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. Shut down the system, for example by issuing the “shutdown now” command.

Actual results:
The system hangs after the “Unmounting /oldroot timed out.” message.

Expected results:
The system completes shutdown.

Additional info:
Following the “Debugging dracut on shutdown” recommendations, I found that the process never reached the “shutdown” entry point of the emergency shell, while the “pre-shutdown” entry point is placed _before_ any attempt to unmount /oldroot. So I altered the /usr/lib/dracut/modules.d/99shutdown/shutdown.sh script, adding an emergency shell entry point just after the call to the internal function umount_a() and before the call to the internal function _check_shutdown(), which invokes the various shutdown hooks.

After modifying the script, I used the following commands to run the process:

    dracut -f
    mkdir -p /run/initramfs/etc/cmdline.d
    echo "rd.debug" > /run/initramfs/etc/cmdline.d/debug.conf
    touch /run/initramfs/.need_shutdown
    shutdown -H now

Unfortunately, I couldn’t figure out how to save the debug output from the dracut shutdown script, but I noticed the following:

1. After "umount /oldroot" timed out, the unmount appeared to have actually succeeded; at least, there was no "/oldroot" entry in /proc/mounts.

2. Still, the umount process was alive and could not be killed with SIGKILL. Something had it blocked:

    root 17933 0.0 0.0 3876 1180 ? D 23:12 0:00 umount /oldroot

3. No process was found using /oldroot and preventing it from being properly unmounted.

4. After exiting the emergency shell and letting shutdown proceed, I found the command on which it finally hangs: "mdadm -vv --wait-clean --scan"

5. Repeating the test, I ran "mdadm -vv --wait-clean --scan" manually from the emergency shell with the same result: the command never returned and could not be killed.

The hanging command is part of the /usr/lib/dracut/modules.d/90mdraid/md-shutdown.sh script (which is invoked as the hook /usr/lib/dracut/hooks/shutdown/30-md-shutdown.sh in the initramfs). It is unclear whether the partially completed unmount caused mdadm --wait-clean to hang, or whether the latter has a problem of its own. I also do not think anything is wrong with my hardware, since everything worked fine with the latest FC33.
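The checks behind observations 1 and 2 above can be repeated from the emergency shell with something like the following. This is only a minimal sketch: the function names is_mounted and filter_d_state are made up for illustration, and it assumes that "ps aux" (with STAT as the eighth column, as in the process line quoted above) is available in the initramfs.

```shell
#!/bin/sh
# Hypothetical helpers, not part of dracut.

# is_mounted MOUNTPOINT [MOUNTS_FILE]
# Succeeds if MOUNTPOINT is listed as a mount point in the mounts table
# (second field of each line). MOUNTS_FILE defaults to /proc/mounts.
is_mounted() {
    while read -r _dev mp _rest; do
        [ "$mp" = "$1" ] && return 0
    done < "${2:-/proc/mounts}"
    return 1
}

# filter_d_state
# Reads "ps aux" formatted lines on stdin and prints PID, STAT and the
# remaining fields for every process whose state starts with D
# (uninterruptible sleep). Such processes do not react to SIGKILL
# until the kernel call they are blocked in returns.
filter_d_state() {
    while read -r _user pid _cpu _mem _vsz _rss _tty stat rest; do
        case "$stat" in
            D*) printf '%s %s %s\n' "$pid" "$stat" "$rest" ;;
        esac
    done
}

# Usage from the emergency shell:
#   is_mounted /oldroot && echo "/oldroot is still mounted"
#   ps aux | filter_d_state
```

For a process reported this way, /proc/<pid>/wchan (and /proc/<pid>/stack, where the kernel provides it) may show what it is blocked on, which could help tell whether umount and mdadm are waiting on the same thing.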
Comment 1 Dmitriy Kargapolov 2021-05-02 23:36:57 UTC
Created attachment 1778771 process list from emergency shell
Comment 2 Dmitriy Kargapolov 2021-05-03 01:54:35 UTC
I altered killall_proc_mountpoint() in /usr/lib/dracut/modules.d/99base/dracut-lib.sh, commenting out the line that kills the processes suspected of using the given mount point (/oldroot) and printing information about each such process instead. The only process found was '@usr/sbin/mdmon --offroot --takeover md127'. I guess it should not be killed (especially with -9), because it serves the RAID array underlying the filesystem still mounted as /oldroot. The mdmon man page (section START UP AND SHUTDOWN) says: "At shutdown time, mdmon should not be killed along with other processes." I am not sure why this scenario is even possible.
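One possible way to avoid killing mdmon here would be to skip any process whose argv[0] starts with '@', which is what mdmon's --offroot option sets and what systemd treats as a "do not kill at shutdown" marker. The following is only a sketch of that idea, not the actual dracut code: is_offroot_process, filter_killable_pids and the PROC_ROOT variable are hypothetical names, and it assumes "head -c" is available in the initramfs.

```shell
#!/bin/sh
# Hypothetical helpers, not part of dracut-lib.sh.

# is_offroot_process PID
# Succeeds if the first byte of the process's command line is '@'.
# PROC_ROOT is an illustrative override (default /proc) that makes the
# function testable outside a live system.
is_offroot_process() {
    cmdline_file="${PROC_ROOT:-/proc}/$1/cmdline"
    [ -r "$cmdline_file" ] || return 1
    # Only the first byte is read, so the NUL separators in cmdline
    # never reach the shell.
    first_char=$(head -c 1 "$cmdline_file" 2>/dev/null)
    [ "$first_char" = "@" ]
}

# filter_killable_pids
# Reads one PID per line on stdin and prints only those that are not
# marked with '@', i.e. the ones a kill loop could safely signal.
filter_killable_pids() {
    while read -r pid; do
        is_offroot_process "$pid" || printf '%s\n' "$pid"
    done
}
```

A kill loop could then pipe its candidate PIDs through filter_killable_pids before sending any signal, so that '@usr/sbin/mdmon --offroot --takeover md127' would be left running while the array is still in use.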
Comment 3 Dmitriy Kargapolov 2021-05-29 15:21:33 UTC
I would expect some attention to this problem, even if it is not exactly a dracut issue. It is hard to say what exactly caused it, so I would add mdadm and initscripts to the list of components related to this bug, but unfortunately I couldn't find a way to select multiple components. I selected dracut because most of the scripts involved belong to that package.