looks like with dracut-037-11.git20140402.fc20.x86_64 shutdown sometimes hangs with the last message "disassembling mdraid devices", at least in the combination dracut-037 / kernel 3.14 with /boot, rootfs and data on RAID1/RAID10

that's not every time, otherwise i would not have given positive karma, but way too often given that i did not face this issue in the past
also, sometimes several (random) services take very long to stop at shutdown, some of them up to the systemd timeout - recently seen on a virtual machine too. sounds similar to https://bugzilla.redhat.com/show_bug.cgi?id=1073714
I downgraded dracut:

rpm -q dracut
dracut-034-64.git20131205.fc20.x86_64

but that didn't fix the problem:

Mai 02 21:34:17 x systemd[1]: session-1.scope stopping timed out. Killing.

BTW, the first time this happened was:

Abr 21 05:20:43 x systemd[1]: session-1.scope stopping timed out. Killing.
Created attachment 893203
photo of hanging shutdown

look at the picture - that is for 100% sure dracut

what is it waiting for?

* all filesystems are unmounted
* swap is disabled
* raid devices are detached

so what does it have to "disassemble mdraid" for minutes, hours and sometimes forever?

frankly, in the meantime i can reproduce this on 8 out of 10 shutdowns on my office workstation - the kernel is alive for sure, because CTRL+ALT+PRINT+S responds with "SysRq: Emergency Sync" and "Emergency Sync complete"
Yeah, my impression is that the difference between dracut-034 and dracut-037 is that the timeout is shorter in 034, so we don't see the 60-second wait, but the timeout is still there. That is what my tests suggest, though I'm not sure about it.
that i WOL'ed my workstation in the office on sunday to sync some data, typed "systemctl poweroff" around 15:00 and the "disassemble mdraid" still was on the screen and the machine not powered off 19 hours later can hardly be called a "timeout" :-)
In my case I just get a service timeout, not a hang: after one minute, or two at most, the system proceeds and shuts down without problems. I saw in journalctl:

systemd[1]: session-1.scope stopping timed out. Killing.

Maybe we have a different problem.
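To tell how often these scope timeouts really occur, the journal can be filtered for the message quoted above. A minimal sketch, assuming the message text stays exactly as quoted in this report; `count_stop_timeouts` is a made-up helper name, not part of systemd:

```shell
#!/bin/sh
# Hypothetical helper: count "stopping timed out" messages read from stdin.
# Assumes the systemd message format quoted in this report.
count_stop_timeouts() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function usable under "set -e".
    grep -c 'stopping timed out\. Killing\.' || true
}
```

With a persistent journal, something like `journalctl -b -1 | count_stop_timeouts` should give the count for the previous boot.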
(In reply to Harald Reindl from comment #5)
> that i WOL'ed my workstation in the office on sunday to sync some data,
> typed "systemctl poweroff" around 15:00 and the "disassemble mdraid" still
> was on the screen and the machine not powered off 19 hours later can hardly
> be called a "timeout" :-)

This should be fixed with
http://git.kernel.org/cgit/boot/dracut/dracut.git/commit/?id=4e58a1ffc760e5c54e6cae5924a2439cae196848
(In reply to Harald Hoyer from comment #7)
> This should be fixed with
> http://git.kernel.org/cgit/boot/dracut/dracut.git/commit/?id=4e58a1ffc760e5c54e6cae5924a2439cae196848

s/should/might/
can we have an update for F20 or at least a scratch build? i had the same today (WOL'ed my workstation on saturday from home to apply updates and keep both machines in sync) and needed to call the office to have my computer hard powered off, so that the sync of yesterday's work could start while i was on my way to the subway :-(
since there is still no build, some additional info:

after blindly hammering CTRL+ALT+PRINT+S and other disabled or invalid SysRq combinations, the output changes from "Disassembling mdraid devices" to "shutdown : line 90: 6035 quit" and "dracut: Waiting for mdraid devices to be clean", and continuing to hammer around with SysRq combinations leads to a reboot

well, nice workaround if you are in front of the physical machine, but unacceptable when rebooting remote machines hundreds of miles away from your location with nobody to call there for a hard power off

so this should be handled as *very critical*

shutdown / reboot in general is a problem in F20, be it systemd or dracut. not so long ago systemd froze ssh clients, now it blindly kills processes like VMware guests which are supposed to be suspended before shutdown. that was perfectly clean in F19, while it started to be broken a long time after the switch to systemd in F15. don't get me wrong, but the "optimizations" of the last months leave a bad taste in the mouth of admins seeking the rock-stable systems they are used to
(In reply to Harald Reindl from comment #10)
> since there is still no build some additional infos:

Oh, while you are at it: can you debug the shutdown, so that we can fix the real culprit? It could be dracut or mdadm.

info "Waiting for mdraid devices to be clean."
mdadm $_offroot -vv --wait-clean --scan| vinfo

Please follow:
https://www.kernel.org/pub/linux/utils/boot/dracut/dracut.html#debugging-dracut-on-shutdown
(In reply to Harald Hoyer from comment #11)
> Please follow:
> https://www.kernel.org/pub/linux/utils/boot/dracut/dracut.html#debugging-dracut-on-shutdown

oh, and you might want to add "rd.debug"

# echo "rd.debug" >> /run/initramfs/etc/cmdline.d/debug.conf
where is the debug information supposed to be stored after a reboot, if

echo "rd.debug" >> /run/initramfs/etc/cmdline.d/debug.conf

was run before? as you can see in the photo, all filesystems are already unmounted at this point
well, i created the script below and rebooted the machine 10 times. i see debug messages for a very short time frame and the machine reboots successfully without *any* delay - so the complete behavior with debugging on is different, and it's pretty clear that it is dracut itself

[root@rh:~]$ cat /scripts/dracut-debug.sh
#!/usr/bin/bash
mkdir -p /run/initramfs/etc/cmdline.d/
echo "rd.debug" >> /run/initramfs/etc/cmdline.d/debug.conf
(In reply to Harald Reindl from comment #14)
> well, i created that script below and rebooted the machine 10 times, i see
> for a very short time frame debug-messages and without *any* delay the
> machine reboots successful - so the complete behavior with debugging on is
> different and it's pretty clear dracut itself
>
> [root@rh:~]$ cat /scripts/dracut-debug.sh
> #!/usr/bin/bash
> mkdir -p /run/initramfs/etc/cmdline.d/
> echo "rd.debug" >> /run/initramfs/etc/cmdline.d/debug.conf

and the reverse is true also, that it stalls, if you don't do that?

alt-sysrq-t should display the running processes.
alt-sysrq-t should display the processes in 'D' state.

Maybe mdadm resolved the "clean" state, if there is some time in between the commands?
(In reply to Harald Hoyer from comment #15)
> alt-sysrq-t should display the processes in 'D' state.

sorry, "w" according to http://en.wikipedia.org/wiki/Magic_SysRq_key

alt-sysrq-w should display the processes in 'D' state.
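Whether a SysRq combination works from the keyboard depends on the `kernel.sysrq` sysctl, which is a bitmask. A rough sketch for checking a value against one function group follows. The bit values are taken from the kernel's sysrq documentation (0 disables everything, 1 enables everything, 8 is process dumps such as `t` and `w`, 16 is sync, 128 is reboot/poweroff); `sysrq_allows` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: does a kernel.sysrq value permit a function group?
#   $1 = value as read from /proc/sys/kernel/sysrq
#   $2 = bit to test (assumed from the kernel sysrq docs:
#        8 = process dumps (t, w), 16 = sync (s), 128 = reboot/poweroff)
sysrq_allows() {
    mask=$1
    bit=$2
    # A value of exactly 1 means "all functions enabled".
    [ "$mask" -eq 1 ] && return 0
    # Otherwise the value is a bitmask of allowed function groups.
    [ $(( mask & bit )) -ne 0 ]
}
```

For example, `sysrq_allows "$(cat /proc/sys/kernel/sysrq)" 8` tells whether alt-sysrq-w is allowed. If the value on the affected machine is 16, only the sync combination would respond, which would match the "SysRq: Emergency Sync" output reported earlier.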
> and the reverse is true also, that it stalls, if you don't do that?

8 out of 10 times, and mostly when i want to reboot a remote machine, that's why i wrote that bug report

what about your comment https://bugzilla.redhat.com/show_bug.cgi?id=1092937#c8 and a scratch build?

in the time we spend discussing and trying an end-user debug initrd i could have made 50 reboots on two physical machines from yesterday to now
BTW: did you look at the photo i attached some days ago?
https://bugzilla.redhat.com/attachment.cgi?id=893203

* All file systems unmounted
* All swaps deactivated
* All loop devices detached
* All DM devices detached
* Storage is finalized

i don't get what there is to wait for
there is nothing left to do and no unwritten data at all
(In reply to Harald Reindl from comment #18)
> BTW: did you look at the photo i attached some days ago?
> https://bugzilla.redhat.com/attachment.cgi?id=893203
>
> * All file systems unmounted
> * All swaps deactivated
> * All loop devices detached
> * All DM devices detached
> * Storage is finalized
>
> i don't get what is there to wait for?
> there is nothing left to do and no unwritten data at all

same thing.. most likely hanging in:

mdadm --wait-clean --scan

And because there is no pidof() involved in the shutdown anymore, I don't think this has anything to do with comment 8.
What you can do to test if this is the culprit:

edit
/usr/lib/dracut/modules.d/90mdraid/md-shutdown.sh

comment out the wait-clean and recreate the initramfs with:

# dracut -f

and see if it still hangs.
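The edit described above could be scripted roughly like this. It is only a sketch: `disable_wait_clean` is a made-up helper name, it assumes GNU sed (for `-i`), and it comments out every line that mentions `--wait-clean`, keeping a `.orig` copy of the untouched script:

```shell
#!/bin/sh
# Hypothetical helper: comment out every line mentioning --wait-clean in the
# script given as $1. Assumes GNU sed for in-place editing.
disable_wait_clean() {
    script=$1
    # Keep one backup of the pristine file; never overwrite it on reruns.
    if [ ! -e "$script.orig" ]; then
        cp -p "$script" "$script.orig" || return 1
    fi
    # Put '#' in front of the first non-blank character of matching lines
    # that are not commented out already.
    sed -i '/--wait-clean/ s/^\([[:space:]]*\)\([^#[:space:]]\)/\1#\2/' "$script"
}
```

Usage would be along the lines of `disable_wait_clean /usr/lib/dracut/modules.d/90mdraid/md-shutdown.sh && dracut -f`. Copying `md-shutdown.sh.orig` back and running `dracut -f` again reverts the test.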
(In reply to Harald Hoyer from comment #20)
> What you can do to test, if this is the culprit:
>
> edit
> /usr/lib/dracut/modules.d/90mdraid/md-shutdown.sh
>
> comment out the wait-clean and recreate the initramfs with:
>
> # dracut -f
>
> and see, if it still hangs.

you can also "echo" some debug messages around it, which should go to the console
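One way to put such messages around a command is a small wrapper that announces the command before it starts and again after it returns, so a hang shows which step never came back. This is a sketch only: `trace_step` and the `TRACE_OUT` variable are invented for illustration, and it assumes /dev/console is still writable at that point of the shutdown:

```shell
#!/bin/sh
# Hypothetical wrapper: print a message before and after running a command.
# Messages are appended to the file named in TRACE_OUT
# (assumed default: /dev/console).
trace_step() {
    out=${TRACE_OUT:-/dev/console}
    echo "md-shutdown: starting: $*" >> "$out"
    rc=0
    "$@" || rc=$?
    echo "md-shutdown: finished (rc=$rc): $*" >> "$out"
    return "$rc"
}
```

In md-shutdown.sh the mdadm calls could then be prefixed with it, for instance `trace_step mdadm $_offroot -vv --stop --scan | vinfo`. If "starting" appears without a matching "finished", that command is the one hanging.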
Created attachment 895238
photo with echo "rd.debug" >> /run/initramfs/etc/cmdline.d/debug.conf

ok, now i "managed" to get a photo with debug, hopefully that gives some insight

the "yes != yes" looks, hmm, strange -> dracut-lib.sh@49

the same happened (most likely) on my remote machine too while trying to reboot with rd.debug after the update to kernel-3.14.4 from koji, where i thought "damned thing knows when i am far away from the office, well, photo tomorrow", but that one decided after around 20 minutes to finish the reboot
ok, so the machine hangs in "mdadm -vv --stop --scan" ignore the yes != yes... that is only from info()/vinfo(), which should pipe the output to the log file and the console. reassigning to mdadm. Don't know what changed in mdadm or the kernel. The last change in md-shutdown was in 2012, so I don't think this is a dracut bug.
hmm - the first kernel 3.14 and dracut-037 arrived here at the same time

maybe https://bugzilla.redhat.com/show_bug.cgi?id=1096414 (raid-check with 3.14 freezes machine) has a common root cause i don't understand right now
Created attachment 901132
strace -fittryTv -s 111111 mdadm $_offroot -vv --stop --scan
I've run into the same issue.

dracut-037-11.git20140402.fc20.x86_64
mdadm-3.3-4.fc20.x86_64
Linux version 3.14.4-200.fc20.x86_64 (mockbuild@bkernel02) (gcc version 4.8.2 20131212 (Red Hat 4.8.2-7) (GCC) ) #1 SMP Tue May 13 13:51:08 UTC 2014

Here is the output from a modified /usr/lib/dracut/modules.d/90mdraid/md-shutdown.sh, where

mdadm $_offroot -vv --stop --scan

is running under strace -fittryTv -s 111111:
https://bugzilla.redhat.com/attachment.cgi?id=901132
https://bugzilla.redhat.com/show_bug.cgi?id=1096414
https://bugzilla.redhat.com/show_bug.cgi?id=1092937

*both* seem to be fixed with 3.14.5-200.fc20.x86_64, although i am unable to find the relevant change in the upstream kernel changelog

however, i rebooted my workstation 40 times after the update from koji and did 4 raid-check runs without any freeze on two different machines

if it comes back i will re-open this bug
re-opened, my co-developer's machine was hanging the whole night at "disassembling raid devices" and after 6 SYSRQ+S (emergency sync) it decided to shut down - unacceptable in case of remote machines
the two commits below smell related

https://www.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.14.6
____________________________________________________________________________________________

commit 0bc4091108e8f2e65faef3082e5261f2c35cd2b4
Author: NeilBrown <neilb>
Date: Tue May 6 09:36:08 2014 +1000

md: avoid possible spinning md thread at shutdown.

commit 0f62fb220aa4ebabe8547d3a9ce4a16d3c045f21 upstream.

If an md array with externally managed metadata (e.g. DDF or IMSM) is in use, then we should not set safemode==2 at shutdown because:

1/ this is ineffective: user-space need to be involved in any 'safemode' handling,
2/ The safemode management code doesn't cope with safemode==2 on external metadata and md_check_recover enters an infinite loop.

Even at shutdown, an infinite-looping process can be problematic, so this could cause shutdown to hang.

Signed-off-by: NeilBrown <neilb>
Signed-off-by: Greg Kroah-Hartman <gregkh>
____________________________________________________________________________________________

commit 8c7311a1c4a8d804bde91b00a2f2c1a22a954c30
Author: NeilBrown <neilb>
Date: Mon May 5 13:34:37 2014 +1000

md/raid10: call wait_barrier() for each request submitted.

commit cc13b1d1500656a20e41960668f3392dda9fa6e2 upstream.

wait_barrier() includes a counter, so we must call it precisely once (unless balanced by allow_barrier()) for each request submitted. Since

commit 20d0189b1012a37d2533a87fb451f7852f2418d1
block: Introduce new bio_split()

in 3.14-rc1, we don't call it for the extra requests generated when we need to split a bio.

When this happens the counter goes negative, any resync/recovery will never start, and "mdadm --stop" will hang.

Reported-by: Chris Murphy <lists>
Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1
Cc: Kent Overstreet <kmo>
Signed-off-by: NeilBrown <neilb>
Signed-off-by: Greg Kroah-Hartman <gregkh>
____________________________________________________________________________________________
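The counter problem described in the second commit can be pictured with a toy model. This is not kernel code, only arithmetic under the assumption that every submitted request drops one reference when it completes, while the buggy path takes just one reference for the whole bio; `pending_after_io` is an invented name:

```shell
#!/bin/sh
# Toy model only, not kernel code.
#   $1 = "buggy" or "fixed"
#   $2 = number of requests one bio is split into
# Prints the value of the pending counter after all requests completed.
pending_after_io() {
    mode=$1
    parts=$2
    pending=0
    if [ "$mode" = fixed ]; then
        # wait_barrier() called once per submitted request
        pending=$(( pending + parts ))
    else
        # wait_barrier() called only once for the whole bio
        pending=$(( pending + 1 ))
    fi
    # every completed request releases one reference
    pending=$(( pending - parts ))
    echo "$pending"
}
```

`pending_after_io buggy 2` prints -1, the negative counter the commit message talks about, while `pending_after_io fixed 2` prints 0. With a counter that never returns to zero, anything waiting for the array to become idle, such as "mdadm --stop", would wait forever.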
closed again - 3.14.6 really fixed it. moved around some TB of data over days while repeatedly running check/resync on a 4 TB RAID10, no freeze and no hang at shutdown
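For remote machines it may help to check, before rebooting, whether the running kernel already is 3.14.6 or newer. A small sketch, assuming GNU `sort -V` is available and that everything after the first "-" in the `uname -r` string can be ignored; `kernel_at_least` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: succeed if kernel release $1 (uname -r style) is at
# least version $2. Assumes GNU sort with -V (version sort).
kernel_at_least() {
    have=${1%%-*}    # "3.14.5-200.fc20.x86_64" -> "3.14.5"
    want=$2
    # The smaller of the two versions sorts first; if that is the wanted
    # version, the running kernel is equal or newer.
    lowest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}
```

For example: `kernel_at_least "$(uname -r)" 3.14.6 || echo "kernel predates 3.14.6, shutdown may hang"`.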