Bug 1092937
Summary: | "mdadm --stop" of the root device takes a loooong time
---|---
Product: | [Fedora] Fedora
Component: | mdadm
Version: | 20
Hardware: | x86_64
OS: | Linux
Status: | CLOSED CURRENTRELEASE
Severity: | unspecified
Priority: | unspecified
Reporter: | Harald Reindl <h.reindl>
Assignee: | Jes Sorensen <Jes.Sorensen>
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
CC: | agk, amigo.elite, dledford, dracut-maint-list, harald, jblawn, Jes.Sorensen, jonathan, sergio
Keywords: | Reopened
Doc Type: | Bug Fix
Type: | Bug
Last Closed: | 2014-06-12 17:51:05 UTC
Description
Harald Reindl
2014-04-30 09:52:08 UTC
also, sometimes several (random) services take very long to stop at shutdown, in some cases up to the systemd timeouts - recently seen on a virtual machine as well

sounds similar to https://bugzilla.redhat.com/show_bug.cgi?id=1073714

I downgraded dracut

rpm -q dracut
dracut-034-64.git20131205.fc20.x86_64

but that didn't fix the problem:

Mai 02 21:34:17 x systemd[1]: session-1.scope stopping timed out. Killing.

btw, the first time this happened was:

Abr 21 05:20:43 x systemd[1]: session-1.scope stopping timed out. Killing.

Created attachment 893203 [details]
photo of hanging shutdown
look at the picture - that is 100% for sure dracut
what is it waiting for?
* all filesystems are unmounted
* swap is disabled
* raid devices are detached
* so what is left to "disassemble mdraid" for minutes, hours and sometimes forever?
frankly, by now I can reproduce this on 8 out of 10 shutdowns of my office workstation - the kernel is alive for sure, because CTRL+ALT+PRINT+S responds with "SysRq: Emergency Sync" and "Emergency Sync complete"
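When a shell is still available (for example over ssh before the final shutdown stage), the same SysRq functions can be requested without a keyboard by writing the key letter to /proc/sysrq-trigger. The following is only a minimal sketch: the function name `sysrq` and the `SYSRQ_TRIGGER` override are hypothetical and exist so the helper can be tried against a scratch file; on a real system it has to run as root.

```shell
#!/usr/bin/bash
# Hypothetical helper: request a SysRq function from a shell instead of the keyboard.
# SYSRQ_TRIGGER is an assumption of this sketch; it defaults to the real kernel interface.
sysrq() {
    local key="$1"
    local trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

    # Accept exactly one lowercase letter or digit, e.g. "s" or "w".
    case "$key" in
        [a-z0-9]) ;;
        *) echo "usage: sysrq <single key, e.g. s or w>" >&2; return 2 ;;
    esac

    # The kernel acts on the character written to the trigger file.
    printf '%s' "$key" > "$trigger"
}

# Usage (as root):
#   sysrq s    # emergency sync, same as CTRL+ALT+PRINT+S
#   sysrq w    # dump blocked ('D' state) tasks to the kernel log
```

The output of the request ends up in the kernel log, so it is only useful while the console or `dmesg` can still be read.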
yeah, my feeling is that the difference between dracut-034 and dracut-037 is that the timeout is shorter in 034, so we don't see the 60 second wait for the timeout, but the timeout is still there. that is what I see in my tests, but I am not sure.

that I WOL'ed my workstation in the office on sunday to sync some data, typed "systemctl poweroff" around 15:00, and 19 hours later "disassemble mdraid" was still on the screen and the machine not powered off can hardly be called a "timeout" :-)

in my case I just have a service timeout, not a hang: after one or two minutes at most the system carried on and shut down without problems. I saw in journalctl:

systemd[1]: session-1.scope stopping timed out. Killing.

maybe we have a different problem.

(In reply to Harald Reindl from comment #5)

This should be fixed with
http://git.kernel.org/cgit/boot/dracut/dracut.git/commit/?id=4e58a1ffc760e5c54e6cae5924a2439cae196848

(In reply to Harald Hoyer from comment #7)
> This should be fixed with
> http://git.kernel.org/cgit/boot/dracut/dracut.git/commit/?id=4e58a1ffc760e5c54e6cae5924a2439cae196848

s/should/might/

can we have an update for F20 or at least a scratch build? I had the same today (WOL'ed my workstation on saturday from home to apply updates and keep both machines in sync) and needed to call the office to hard power off my computer, so I could start the sync of yesterday's work while going to the subway :-(

since there is still no build, some additional info:

after blindly hammering CTRL+ALT+PRINT+S and other disabled or invalid SysRq combinations, the output changes from "Disassembling mdraid devices" to "shutdown : line 90: 6035 quit" and "dracut: Waiting for mdraid devices to be clean", and continuing to hammer SysRq combinations leads to a reboot

well, a nice workaround if you are in front of the physical machine, but unacceptable when rebooting remote machines hundreds of miles away with nobody there to call for a hard power off

so this should be handled as *very critical*

shutdown / reboot in general is a problem in F20, be it systemd or dracut. not so long ago systemd froze ssh clients, now it blindly kills processes like VMware guests that are supposed to be suspended before shutdown. that was perfectly clean in F19; it only started to break long after the switch to systemd in F15. don't get me wrong, but the "optimizations" of the last months leave a bad taste in the mouth of admins looking for the rock-stable systems they were used to

(In reply to Harald Reindl from comment #10)

Oh, while you are at it: can you debug the shutdown, so that we can fix the real culprit? It could be dracut or mdadm.

info "Waiting for mdraid devices to be clean."
mdadm $_offroot -vv --wait-clean --scan| vinfo

Please follow:
https://www.kernel.org/pub/linux/utils/boot/dracut/dracut.html#debugging-dracut-on-shutdown

(In reply to Harald Hoyer from comment #11)

oh, and you might want to add "rd.debug"

# echo "rd.debug" >> /run/initramfs/etc/cmdline.d/debug.conf

where is the debug information supposed to be stored after a reboot, if I ran

echo "rd.debug" >> /run/initramfs/etc/cmdline.d/debug.conf

before? as you can see in the photo, all filesystems are already unmounted at this point

well, I created the script below and rebooted the machine 10 times. I see debug messages for a very short time and the machine reboots successfully without *any* delay - so the whole behavior is different with debugging on, and it's pretty clear that it is dracut itself

[root@rh:~]$ cat /scripts/dracut-debug.sh
#!/usr/bin/bash
mkdir -p /run/initramfs/etc/cmdline.d/
echo "rd.debug" >> /run/initramfs/etc/cmdline.d/debug.conf

(In reply to Harald Reindl from comment #14)

and the reverse is also true, that it stalls if you don't do that?

alt-sysrq-t should display the running processes.

alt-sysrq-t should display the processes in 'D' state.

Maybe mdadm resolved the "clean" state, if there is some time in between the commands?

(In reply to Harald Hoyer from comment #15)
> alt-sysrq-t should display the processes in 'D' state.

sorry, "w" according to http://en.wikipedia.org/wiki/Magic_SysRq_key

alt-sysrq-w should display the processes in 'D' state.

> and the reverse is true also, that it stalls, if you don't do that?
8 out of 10 times, and mostly when I want to reboot a remote machine, that's why I wrote this bug report

what about your comment https://bugzilla.redhat.com/show_bug.cgi?id=1092937#c8 and a scratch build?

in the time we have spent discussing and trying an end-user debug initrd, I could have made 50 reboots on two physical machines between yesterday and now

BTW: did you look at the photo I attached some days ago?
https://bugzilla.redhat.com/attachment.cgi?id=893203

* All file systems unmounted
* All swaps deactivated
* All loop devices detached
* All DM devices detached
* Storage is finalized

I don't get what there is to wait for. there is nothing left to do and no unwritten data at all

(In reply to Harald Reindl from comment #18)

same thing.. most likely hanging in:

mdadm --wait-clean --scan

And because there is no pidof() involved in the shutdown anymore, I don't think this has anything to do with comment 8

What you can do to test whether this is the culprit:

edit
/usr/lib/dracut/modules.d/90mdraid/md-shutdown.sh

comment out the wait-clean and recreate the initramfs with:

# dracut -f

and see if it still hangs.

(In reply to Harald Hoyer from comment #20)

you can also "echo" some debug messages around it, which should go to the console

Created attachment 895238 [details]
photo with echo "rd.debug" >> /run/initramfs/etc/cmdline.d/debug.conf
ok, now I "managed" to get a photo with debug, hopefully that gives some insight
"yes != yes" looks, hmm, strange -> dracut-lib.sh@49
most likely the same happened on my remote machine too, while trying to reboot with rd.debug after the update to kernel-3.14.4 from koji. I thought "damned thing knows when I am far away from the office, well, photo tomorrow", but that one decided to finish the reboot after around 20 minutes
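The dracut-debug.sh script quoted earlier in the thread appends "rd.debug" on every run, so debug.conf accumulates duplicate lines when it is called repeatedly within one boot. A minimal sketch of a variant that only appends when the option is missing; the function name `enable_rd_debug` and the `CMDLINE_DIR` override are hypothetical and only there so the sketch can be tried outside a real /run/initramfs.

```shell
#!/usr/bin/bash
# Hypothetical helper: enable dracut shutdown debugging at most once per boot.
# CMDLINE_DIR is an assumption of this sketch; the default is the path used in the thread.
enable_rd_debug() {
    local dir="${CMDLINE_DIR:-/run/initramfs/etc/cmdline.d}"
    local conf="$dir/debug.conf"

    mkdir -p "$dir" || return 1

    # -x matches the whole line, -F treats "rd.debug" as a literal string.
    # Append only when no such line exists yet (or the file is missing).
    grep -qxF 'rd.debug' "$conf" 2>/dev/null || echo 'rd.debug' >> "$conf"
}

# Usage (as root, before triggering the shutdown):
#   enable_rd_debug
```

Since /run is a tmpfs, the setting is gone after the reboot either way and has to be applied again before each shutdown that should be debugged.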
ok, so the machine hangs in "mdadm -vv --stop --scan"

ignore the yes != yes... that is only from info()/vinfo(), which should pipe the output to the log file and the console.

reassigning to mdadm. Don't know what changed in mdadm or the kernel. The last change in md-shutdown was in 2012, so I don't think this is a dracut bug.

hmm - the first kernel 3.14 and dracut-037 arrived here at the same time

maybe https://bugzilla.redhat.com/show_bug.cgi?id=1096414 (raid-check with 3.14 freezes machine) has a common root cause which I don't understand right now

Created attachment 901132 [details]
strace -fittryTv -s 111111 mdadm $_offroot -vv --stop --scan
I've run into the same issue.

dracut-037-11.git20140402.fc20.x86_64
mdadm-3.3-4.fc20.x86_64
Linux version 3.14.4-200.fc20.x86_64 (mockbuild@bkernel02) (gcc version 4.8.2 20131212 (Red Hat 4.8.2-7) (GCC) ) #1 SMP Tue May 13 13:51:08 UTC 2014

Here is the output from a modified /usr/lib/dracut/modules.d/90mdraid/md-shutdown.sh where mdadm $_offroot -vv --stop --scan is running under strace -fittryTv -s 111111:
https://bugzilla.redhat.com/attachment.cgi?id=901132

https://bugzilla.redhat.com/show_bug.cgi?id=1096414
https://bugzilla.redhat.com/show_bug.cgi?id=1092937

*both* seem to be fixed with 3.14.5-200.fc20.x86_64, although I am unable to find the relevant change in the upstream kernel changelog

however, after the update from koji I rebooted my workstation 40 times and did 4 raid-check runs on two different machines without any freeze

if it comes back I will re-open this bug

re-opened: my co-developer's machine was hanging the whole night at "disassembling raid devices" and only decided to shut down after 6 SYSRQ+S (emergency sync) - unacceptable in case of remote machines

the two commits below smell related
https://www.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.14.6

____________________________________________________________________________________________
commit 0bc4091108e8f2e65faef3082e5261f2c35cd2b4
Author: NeilBrown <neilb>
Date: Tue May 6 09:36:08 2014 +1000

md: avoid possible spinning md thread at shutdown.

commit 0f62fb220aa4ebabe8547d3a9ce4a16d3c045f21 upstream.

If an md array with externally managed metadata (e.g. DDF or IMSM) is in use, then we should not set safemode==2 at shutdown because:

1/ this is ineffective: user-space need to be involved in any 'safemode' handling,
2/ The safemode management code doesn't cope with safemode==2 on external metadata and md_check_recover enters an infinite loop.

Even at shutdown, an infinite-looping process can be problematic, so this could cause shutdown to hang.

Signed-off-by: NeilBrown <neilb>
Signed-off-by: Greg Kroah-Hartman <gregkh>
____________________________________________________________________________________________
commit 8c7311a1c4a8d804bde91b00a2f2c1a22a954c30
Author: NeilBrown <neilb>
Date: Mon May 5 13:34:37 2014 +1000

md/raid10: call wait_barrier() for each request submitted.

commit cc13b1d1500656a20e41960668f3392dda9fa6e2 upstream.

wait_barrier() includes a counter, so we must call it precisely once (unless balanced by allow_barrier()) for each request submitted. Since commit 20d0189b1012a37d2533a87fb451f7852f2418d1 block: Introduce new bio_split() in 3.14-rc1, we don't call it for the extra requests generated when we need to split a bio. When this happens the counter goes negative, any resync/recovery will never start, and "mdadm --stop" will hang.

Reported-by: Chris Murphy <lists>
Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1
Cc: Kent Overstreet <kmo>
Signed-off-by: NeilBrown <neilb>
Signed-off-by: Greg Kroah-Hartman <gregkh>
____________________________________________________________________________________________

closed again - 3.14.6 really fixed it. I moved around some TB of data over several days while repeatedly running check/resync on a 4 TB RAID10: no freeze and no hang at shutdown
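To check whether a machine already runs a kernel at or beyond the 3.14.6 release reported as fixed here, the version string can be compared with GNU `sort -V`. This is only a sketch under stated assumptions: the function name `kernel_at_least` is hypothetical, it compares nothing but version numbers, and it says nothing about other kernel series or about distribution kernels that carry the md fixes as backports.

```shell
#!/usr/bin/bash
# Hypothetical helper: succeed if kernel release $1 is at least version $2.
# Relies on GNU sort's -V (version sort).
kernel_at_least() {
    local have="${1%%-*}"   # drop the package suffix, e.g. "-200.fc20.x86_64"
    local want="$2"

    # If the wanted version sorts first (or both are equal), "have" is new enough.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Usage:
#   if kernel_at_least "$(uname -r)" 3.14.6; then
#       echo "kernel is at or beyond the release reported as fixed in this bug"
#   fi
```

A plain string comparison would rank 3.14.10 below 3.14.6, which is why version sort is used instead.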