Bug 1299178
Summary: | LVM "stop" jobs delay/stall shutdown, time out | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Paul DeStefano <prd-fedora> | ||||||||||||
Component: | lvm2 | Assignee: | Peter Rajnoha <prajnoha> | ||||||||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||||||||
Severity: | unspecified | Docs Contact: | |||||||||||||
Priority: | unspecified | ||||||||||||||
Version: | 26 | CC: | agk, bmarzins, bmr, bugzilla, chkr, dkeeler, dsmithhfx, dwysocha, fedora, gilwooden, heinzm, jonathan, lachuzhnikov, lnykryn, lvm-team, martin, mkyral, msnitzer, mu, pereira.vitor.manuel, pete.philips, prajnoha, prd-fedora, prockai, rockowitz, tilltotal, zkabelac | ||||||||||||
Target Milestone: | --- | ||||||||||||||
Target Release: | --- | ||||||||||||||
Hardware: | Unspecified | ||||||||||||||
OS: | Unspecified | ||||||||||||||
Whiteboard: | |||||||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||||||
Doc Text: | Story Points: | --- | |||||||||||||
Clone Of: | Environment: | ||||||||||||||
Last Closed: | 2018-05-03 09:01:49 UTC | Type: | Bug | ||||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||||
Documentation: | --- | CRM: | |||||||||||||
Verified Versions: | Category: | --- | |||||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||
Embargoed: | |||||||||||||||
Attachments: |
|
Description
Paul DeStefano
2016-01-17 02:33:56 UTC
Correction, my other computer that has un-mirrored LVM root also does this, so, I don't think it's related to that. ...but, it's not 8 LVM jobs, it's 1 job for /dev/dm-something. So, slightly different, I guess. I am observing this bug too. It's ultimately annoying. Please, boot your machine with "systemd.log_level=debug systemd.log_target=kmsg" added to your kernel command line. Then try to shutdown. Boot again and then grab the log from previous run by running "journalctl -b -1", please. We should see more debug info for that... Created attachment 1121282 [details]
Logs (slightly redacted)
Okay, here are the logs. Force of habit, I redacted names. Let me know if this helps.
(In reply to Paul DeStefano from comment #5) > Created attachment 1121282 [details] > Logs (slightly redacted) > > Okay, here are the logs. Force of habit, I redacted names. Let me know if > this helps. (from the log) Feb 04 21:00:20 <hostname> systemd[1]: lvm2-pvscan@179:4.service: Changed stop -> dead Feb 04 21:01:52 <hostname> systemd[1]: lvm2-pvscan@179:4.service: Failed to send unit remove signal for lvm2-pvscan@179:4.service: Transport endpoint is not connected To be honest, I'm not quite sure now what this is caused by - I've asked Lukas Nykryn from systemd team... He'll try to inspect this a bit more. Can you please try this for more debug: https://wiki.freedesktop.org/www/Software/systemd/Debugging/ And the suggestion on "Shutdown Completes Eventually". Sorry, I was away last week... Sure, happy to, but the link doesn't open for me; I get 403 error. Hmm, even my bookmarks to other parts of that site are dead; I've been there plenty of times, before. I'll check again later, must be temporary problem or maintenance. Try this one instead ("wiki" instead of "www" in the address): https://freedesktop.org/wiki/Software/systemd/Debugging/ Created attachment 1129605 [details]
shutdown debug log
Sorry for the delay. I thought I had uploaded this, already. Thanks for the help finding debug techniques. This is what I got from following those instructions. I hope it helps.
As always, I'm willing to do more. Just let me know.
Also, this is not a log from my *main* system, which shows 8 LVM jobs that stall. This is from my net book which also showed the problem, recently. I will send you that one, too, when I get a chance to boot with the required kernel logging options.
Created attachment 1130904 [details]
Shutdown debug log
Okay, here is the debug shutdown log from my main system with the eight stalled LVM jobs. I assume the root cause is the same as in the previous shutdown debug log I uploaded a few days ago, but it's possible there are different problems with the same symptom, so I didn't mark that log as obsoleted by this one.
I'm suffering from the same problem, but I found at least one possible root cause (and a workaround): I recognized during some unrelated debugging: if I fully boot the system up to the login manager, but do not login and just shutdown again, then the problem does not happen. Similar to Paul's debug output, I'm also seeing the log entry: [ 465.356781] systemd-cryptsetup[5471]: Device luks-... is still in use. I tested the following: 1. login (in my case cinnamon desktop environment) 2. logout 3. go to a virtual console, login as root, check for remaining processes of my regular user chkr: a few processes work still running 4. "killall -u chkr" 5. check again: kwalletd was still running 6. some more testing: kwalletd can only be killed with SIGKILL (9), SIGTERM (15) does not help So at least one part of the problem is, that a process of the logged in user is still running and prevents umounting of the filesystem. It looks like that the login manager / systemd does not fully kill all processes of the user slice. When I manually kill kwalletd when still logged in ("killall -9 kwalletd"), I can shutdown the system without delay. However, that might be only one of the problems. When checking the journal, I see some more troublesome log messages during shutdown: 1. systemd-cryptsetup[4278]: crypt_init() failed: No such device 2. Feb 27 11:56:51 systemd[1]: systemd-cryptsetup@luks.....service: Control process exited, code=exited status=1 Feb 27 11:56:51 systemd[1]: Stopped Cryptography Setup for luks-... Feb 27 11:56:51 systemd[1]: systemd-cryptsetup@luks....service: Unit entered failed state. Feb 27 11:56:51 systemd[1]: systemd-cryptsetup@luks....service: Failed with result 'exit-code'. 3. kernel: device-mapper: ioctl: unable to remove open device luks systemd-cryptsetup[4279]: Failed to deactivate: Device or resource busy 4. Even if the problem does not happen (e.g. because I did not login and directly shutdown the system via kdm, the very last log message is: "systemd-shutdown[1]: Failed to finalize DM devices, ignoring" Due to these multiple errors, it is hard to judge which part of the system is at fault here: - Should kwalletd shutdown on SIGTERM? - Should kwalletd be even started when not using KDE? - Should the desktop environment cleanup all started user processes? - Should the login manager fully cleanup all started user processes after logout? - Should systemd stop all processes in user.slice? - Should all of the above use SIGKILL in case the processes can't be removed via SIGTERM? - Are all of the mentioned 4 error messages attributed to a process which holds a file open? If not, are these systemd(-shutdown/-cryptsetup) problems? I verified my findings and the workaround on a fresh default F23 installation: - run "killall -9 kwalletd" before shutting down the system helps as a workaround After upgrading to F24, I still see these jobs on shutdown, but they don't seem to time out. The run for 20 seconds or so and then the shutdown process proceeds. This is an improvement, but I dont' think it is quite right. Does anyone recognize this? I also have this (delay on shutdown due to LVM) on F24. Had it in F23 too. I use KDE and kwallet as well. Also seeing this in f24 with a LUKS partition. Can you please attach your 'lsblk' and 'dmsetup ls --tree' before you start your shutdown on your system ? Probably add directly also: dmsetup table dmsetup status dmsetup info -c Created attachment 1210583 [details]
Output of commands lsblk, dmsetup ls --tree, dmsetup table, dmsetup status and dmsetup info -c
Attached output.txt containing output from all commands specified.
So the core trouble is - lvm2 is ATM supposed to be 'top-level' device. In you case however you are stacking crypto-device on top of LV. This is somewhat far more complex. And it's not really a bug of lvm2. This needs to be properly handle on cryptosetup layer and their deps tracking. Lvm2 can't be turned down while on it's top is sitting crypt device. Interesting. This setup was installed by the F22 installer! I then upgraded to F24 and have been getting the stop jobs shutdown problem ever since. An installer bug? A problem introduced by the upgrade? Well this needs to be handled properly on 'shutdown' dracut processing. Looking at this man-page: /usr/lib/dracut/modules.d/98dracut-systemd/dracut-shutdown.service.8 Killing all the processes and then starting unmount - doesn't look right when we use these types of stacks. There seems to be a few logical bugs in this process. Further down the road of systemd log debugging - There are events trying to 'execute' lvm2-pvscan services and updating lvmetad - this is quite imperfect for shutdown sequence. From the trace it's unclear what is actually real goal of shutdown - but just killing everything and switching to 'shutdown-initramfs' and loosing all the context isn't going to work so easily. Would be probably nice to see & compare log from f22 & f24. It seems f24 is trying to be more 'clever' and do an ultimate 'clean' shutdown - but how not enough know-how to do it proper ordering. Some running services should not be killed but 'restarted' from shutdown ramdisk. It's unclear what 'dbus' is actually doing there - it looks like most wait is because of dbug not wanting to die?? Using 'systemctl enable debug-shell.service' this should be the most effective way to see what's the awaited moment in this systemd world. See the 'systemctl list-jobs' when you feel the machine waits for 'something' then. Saying this - there is now some rather highly 'experimental' 'dbus' lvm2 support - so would be good to check if this experimental stuff isn't the part of problem. There is lvm.conf 'notify_dbus' settings - try to set to 0 and disable associated service with systemctl Next thing to try is also 'use_lvmetad=0' Is the shutdown then any better/faster ? (Note: will also likely need rebuild of initramdisk - since lvm.conf is installed in dracut) Few more associated comments and questions: It may need some inspection if there is some process waiting on 'dm' cookie decrementation. (Usually such task could be 'instantly' unblocked by this call: dmsetup udevcomplete_all -y If this unblocks 'shutdown' (in system debug shell) Which process was there sleeping and waiting on semaphore ? (collect ps -aux) There used to be some bug - where udev was not really running - but still had there some sockets ? Is udevd running or not ? I haven't been watching this bug lately because I haven't noticed this issue lately, since F24, possibly. But, I need to say that the problem with LVM 'underneath' LUKS, is not this bug. At least, that was not the situation when I originally reported this problem. This bug started in F22-ish and persisted through F23. If something changed in F24, that's believable. It's hard to know what is meant by "top" and "bottom", though. I have sd/block/md/luks/lvm. A think I follow you in the the filesystem convention, though, which seems the most logical. I'll add some of the requested command output next so you can see for yourself and tell me if I understand it correctly. I'm happy to provided new logs, though. (In reply to Zdenek Kabelac from comment #21) > Would be probably nice to see & compare log from f22 & f24. ... Would my logs be helpful if I'm not seeing this anymore? Do you want the 'shutdown-log.txt' from here (https://wiki.freedesktop.org/www/Software/systemd/Debugging/), again, or something else? Created attachment 1210747 [details]
requested system information
(In reply to Paul DeStefano from comment #23) > It's hard to know what is meant by "top" and "bottom", though. I have > sd/block/md/luks/lvm. A think I follow you in the the filesystem > convention, though, which seems the most logical. I'll add some of the > requested command output next so you can see for yourself and tell me if I > understand it correctly. In your case LV is top-level device. > I'm happy to provided new logs, though. > > (In reply to Zdenek Kabelac from comment #21) > > Would be probably nice to see & compare log from f22 & f24. ... > > Would my logs be helpful if I'm not seeing this anymore? Do you want the > 'shutdown-log.txt' from here > (https://wiki.freedesktop.org/www/Software/systemd/Debugging/), again, or > something else? When you run 'shutdown' and machine blocks - press Alt+F9 (when you have installed & enabled systemd debug service) Then you can try to figure out what's it' waiting for. 'systemctl list-jobs' and whatever else you may try to figure out. (i.e. is there something waiting on cookie ? - echo t >/proc/sysrq-trigger you may need to use serial console output for easy error-log capture. Apologies but if the NEEDINFO status is waiting on me, I don't think I can provide any more details. Getting slightly beyond my time/tech levels! I am willing to do that, but not sure when. As I say, I'm not sure it is doing it, now. I'll make a point of checking next reboot. Did you see the two debug shutdown logs I submitted already, as requested? Does it not show that information? I don't think serial logging is achievable. FYI, i'm seeing this on F25 as well. I have an lvm PV backed by a LUKS device inside a partition of an NVMe device. Hi Team! Similar issue time from time on F25. [ *** ] (1/8) A stop job is running for LVM PV <UUID> .... (elapse / 1m 30s) Linux 4.8.15-300.fc25.x86_64 #1 SMP Thu Dec 15 23:10:23 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux $ cat /etc/fedora-release Fedora release 25 (Twenty Five) $ sudo dmsetup table fedora-swap: 0 16482304 linear 253:0 2048 fedora-root: 0 104857600 linear 253:0 361900032 luks-e6711301-fcc1-4206-8a22-e38f9ccf740e: 0 466757632 crypt aes-xts-plain64 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 0 8:2 4096 fedora-home: 0 345415680 linear 253:0 16484352 luks-ae297b9d-8657-46de-95bf-c36f5a306050: 0 468854784 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 8:17 4096 $ sudo dmsetup status fedora-swap: 0 16482304 linear fedora-root: 0 104857600 linear luks-e6711301-fcc1-4206-8a22-e38f9ccf740e: 0 466757632 crypt fedora-home: 0 345415680 linear luks-ae297b9d-8657-46de-95bf-c36f5a306050: 0 468854784 crypt $ sudo dmsetup info -c Name Maj Min Stat Open Targ Event UUID fedora-swap 253 2 L--w 2 1 0 LVM-71oRC96VTjl6cjldw258kVcs9vXT7MXIttQmLdgt0y6xSXS9zfYbq3tdeKV2hb50 fedora-root 253 1 L--w 1 1 0 LVM-71oRC96VTjl6cjldw258kVcs9vXT7MXIP8tmUThMBddsU3UAWvXlS75IhyqDlF7t luks-e6711301-fcc1-4206-8a22-e38f9ccf740e 253 0 L--w 3 1 0 CRYPT-LUKS1-e6711301fcc142068a22e38f9ccf740e-luks-e6711301-fcc1-4206-8a22-e38f9ccf740e fedora-home 253 3 L--w 1 1 0 LVM-71oRC96VTjl6cjldw258kVcs9vXT7MXIsIbCUOHnk1Nn5njKpi2ivIDdi1buBW0N luks-ae297b9d-8657-46de-95bf-c36f5a306050 253 4 L--w 1 1 0 CRYPT-LUKS1-ae297b9d865746de95bfc36f5a306050-luks-ae297b9d-8657-46de-95bf-c36f5a306050 This message is a reminder that Fedora 24 is nearing its end of life. Approximately 2 (two) weeks from now Fedora will stop maintaining and issuing updates for Fedora 24. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '24'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 24 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. So, this still was happening in F25. However, in F26, I now see that, on boot, some boot 'jobs' are "removed". I suspect this will prevent this bug from appearing in F26. However, these boot jobs are *not* actually removed, they are just bypassed on boot. It would be cool if someone could just explain to me how to actually removed these bogus jobs and then I could fix this boot warning. But, at least I *think* the situation is better than it was before. Thanks! I can confirm the bug is still ongoing in Fedora 26. It has happened to me on both bare metal and virtualized instances since installing it a week ago. Unfortunately I do not have time to contribute to debugging efforts by supplying logs and other information. I've updated this to F26. Sounds like you are waiting for debug systemd info? Okay, I'll see what I can do. Kabelac says "after installing and enabling systemd debug service". Is this the same thing as debug-shell.service? Or is there a separate systemd-debug package. If the latter, I don't see it. Currently, I am getting some blocking on shutdown, but the output is much different than before. This message is a reminder that Fedora 26 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '26'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 26 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. I believe this was related to some lvmetad killing order issue that has been addressed. Close as working on current release - if the issue is still reproducible in some way - feel free to reopen against upstream version. |