Bug 1956133

Summary: System hangs in shutdown stage - mdmon killed by dracut shutdown script
Product: [Fedora] Fedora
Component: dracut
Version: 34
Hardware: x86_64
OS: Linux
Type: Bug
Status: NEW
Severity: unspecified
Priority: unspecified
Reporter: Dmitriy Kargapolov <bugzilla>
Assignee: dracut-maint-list
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: dracut-maint-list, jonathan, zbyszek
Attachments:
process list from emergency shell (flags: none)

Description Dmitriy Kargapolov 2021-05-02 23:35:41 UTC
Created attachment 1778770

Description of problem:
After upgrading to FC34, the system hangs at the shutdown stage every time. The last message on the console is: “Unmounting /oldroot timed out.”

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
Just shut down the system, for example by issuing the “shutdown now” command.

Actual results:
The system hangs after the “Unmounting /oldroot timed out.” message.

Expected results:
The system is expected to complete the shutdown.

Additional info:
Following the “Debugging dracut on shutdown” recommendations, I found that the process never reached the “shutdown” entry point of the emergency shell, while the “pre-shutdown” entry point is placed _before_ any attempt to unmount /oldroot. So I altered the /usr/lib/dracut/modules.d/99shutdown/shutdown.sh script, adding an emergency shell entry point just after the call to the internal function umount_a() and before the call to the internal function _check_shutdown(), which invokes the various shutdown hooks. After modifying the script, I used the following commands to run the process:

dracut -f
mkdir -p /run/initramfs/etc/cmdline.d
echo "rd.debug" > /run/initramfs/etc/cmdline.d/debug.conf
touch /run/initramfs/.need_shutdown
shutdown -H now

Unfortunately, I couldn’t figure out how to save the debug output from the dracut shutdown script, but I noticed the following:

1. After "umount /oldroot" timed out, it looked like the unmount had actually succeeded; at least there was no "/oldroot" entry in /proc/mounts.

2. Still, the umount process was alive and could not be killed with SIGKILL. It was in uninterruptible sleep (state D in the ps output below), so something was blocking it.

root       17933  0.0  0.0   3876  1180 ?        D    23:12   0:00 umount /oldroot

3. No process was found that was using /oldroot and preventing it from being properly unmounted.

4. After exiting the emergency shell and letting the shutdown proceed, I found the command that finally hangs: "mdadm -vv --wait-clean --scan"

5. Repeating the test, I ran "mdadm -vv --wait-clean --scan" manually from the emergency shell with the same result: the command never returned and could not be killed.
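
Observations 1 and 2 above can be re-checked from the emergency shell with a few lines of shell. This is only a sketch: the function names is_mounted and list_d_state are made up for illustration, and their optional arguments (an alternative mounts file, an alternative /proc root) exist only so the functions can be tried against test data.

```shell
#!/bin/sh
# Hypothetical helpers, not part of dracut.

# Return 0 if mount point $1 is listed in the mounts table.
# $2 (optional) is the table to read; defaults to /proc/mounts.
is_mounted() {
    _mnt=$1
    _file=${2:-/proc/mounts}
    # Each line: device mountpoint fstype options dump pass
    while read -r _dev _dir _rest; do
        [ "$_dir" = "$_mnt" ] && return 0
    done < "$_file"
    return 1
}

# Print the PID of every process in uninterruptible sleep (state "D").
# Such processes do not react to SIGKILL until the kernel call they
# are blocked in returns.
# $1 (optional) is the proc root to scan; defaults to /proc.
list_d_state() {
    _root=${1:-/proc}
    for _stat in "$_root"/[0-9]*/stat; do
        [ -r "$_stat" ] || continue
        IFS= read -r _line < "$_stat" || continue
        # Format: "pid (comm) state ..."; comm may contain spaces, so
        # cut everything up to the last ") " before reading the state.
        _pid=${_line%% *}
        _rest=${_line##*") "}
        _state=${_rest%% *}
        if [ "$_state" = "D" ]; then
            echo "$_pid"
        fi
    done
    return 0
}
```

For example, running `is_mounted /oldroot || echo "not mounted"` and then `list_d_state` would show whether the umount PID is still blocked even though the mount entry is gone.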

The hanging command is part of the /usr/lib/dracut/modules.d/90mdraid/md-shutdown.sh script (which is invoked as a hook /usr/lib/dracut/hooks/shutdown/30-md-shutdown.sh in the initramfs).

It is unclear whether the partially completed unmount caused mdadm --wait-clean to hang, or whether the latter has a problem of its own. I also doubt that anything is wrong with my hardware, since everything worked fine with the latest FC33.
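
One way to narrow this down might be to look at the array state the kernel reports in sysfs before running the hook. The sketch below assumes that /sys is still mounted in the shutdown initramfs and that the array is md127 (the name that appears in the mdmon command line in comment 2); md_state is a hypothetical helper, and its second argument only exists so it can be pointed at a test tree.

```shell
#!/bin/sh
# Hypothetical helper, not part of dracut or mdadm.
# Print array_state and metadata_version for md device $1.
# $2 (optional) is the sysfs root; defaults to /sys.
md_state() {
    _md=$1
    _sys=${2:-/sys}
    _base=$_sys/block/$_md/md
    if [ ! -d "$_base" ]; then
        echo "$_md: no md directory under $_sys" >&2
        return 1
    fi
    # Fall back to "unknown" if an attribute cannot be read.
    IFS= read -r _state < "$_base/array_state" || _state=unknown
    IFS= read -r _meta < "$_base/metadata_version" || _meta=unknown
    echo "$_md array_state=$_state metadata_version=$_meta"
}
```

Calling `md_state md127` from the emergency shell prints one line. Arrays that are handled by mdmon use external metadata, so their metadata_version value starts with "external:".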

Comment 1 Dmitriy Kargapolov 2021-05-02 23:36:57 UTC
Created attachment 1778771
process list from emergency shell

Comment 2 Dmitriy Kargapolov 2021-05-03 01:54:35 UTC
I altered killall_proc_mountpoint() in /usr/lib/dracut/modules.d/99base/dracut-lib.sh, commenting out the line that kills each process suspected of using the given mount point (/oldroot) and printing information about the process instead.

The only process found was '@usr/sbin/mdmon --offroot --takeover md127'.

I guess it should not be killed (especially with -9), because it is needed by the RAID array that underlies the filesystem still mounted as /oldroot. The mdmon man page (section START UP AND SHUTDOWN) says that "At shutdown time, mdmon should not be killed along with other processes."

I am not sure why this scenario is even possible.
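
A read-only variant of such a scan could look like the sketch below. It is not dracut's killall_proc_mountpoint(); report_mount_users is a hypothetical name, the readlink tool is assumed to be available, and the second argument is only there to allow testing against a fake /proc tree. The leading '@' in the process name matters here: mdmon's --offroot option sets the first character of argv[0] to '@' to mark the process as one that should not be shut down with the rest, so the sketch labels such processes "keep" instead of treating them as kill candidates.

```shell
#!/bin/sh
# Hypothetical diagnostic, not part of dracut: list processes that hold
# a reference under mount point $1, without killing anything.
# $2 (optional) is the proc root to scan; defaults to /proc.
report_mount_users() {
    _mnt=$1
    _root=${2:-/proc}
    for _dir in "$_root"/[0-9]*; do
        [ -d "$_dir" ] || continue
        _pid=${_dir##*/}
        _hit=
        # exe, root, cwd and every open fd are symlinks; a target at or
        # below the mount point means the process is using it.
        for _link in "$_dir/exe" "$_dir/root" "$_dir/cwd" "$_dir"/fd/*; do
            _target=$(readlink "$_link" 2>/dev/null) || continue
            case $_target in
                "$_mnt" | "$_mnt"/*)
                    _hit=1
                    break
                    ;;
            esac
        done
        [ -n "$_hit" ] || continue
        [ -r "$_dir/cmdline" ] || continue
        # cmdline is NUL-separated; turn NULs into spaces for display.
        _cmd=$(tr '\0' ' ' < "$_dir/cmdline")
        case $_cmd in
            @*) echo "$_pid keep $_cmd" ;;
            *) echo "$_pid kill-candidate $_cmd" ;;
        esac
    done
    return 0
}
```

Running `report_mount_users /oldroot` in the emergency shell would then print one line per process, with the mdmon instance shown as "keep".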

Comment 3 Dmitriy Kargapolov 2021-05-29 15:21:33 UTC
I would expect some attention to this problem, even if it is not strictly a dracut issue. It is hard to say what exactly causes it, so I would add mdadm and initscripts to the list of related components, but unfortunately I couldn't find a way to select multiple components.
I selected dracut because most of the scripts involved belong to this package.