Bug 1299178 - LVM "stop" jobs delay/stall shutdown, time out
LVM "stop" jobs delay/stall shutdown, time out
Status: NEW
Product: Fedora
Classification: Fedora
Component: lvm2 (Show other bugs)
26
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: Peter Rajnoha
Fedora Extras Quality Assurance
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2016-01-16 21:33 EST by Paul DeStefano
Modified: 2017-09-11 14:38 EDT (History)
27 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Logs (slightly redacted) (190.96 KB, application/x-gzip)
2016-02-05 00:32 EST, Paul DeStefano
no flags Details
shutdown debug log (253.19 KB, text/plain)
2016-02-23 00:21 EST, Paul DeStefano
no flags Details
Shutdown debug log (1015.81 KB, text/plain)
2016-02-26 15:55 EST, Paul DeStefano
no flags Details
Output of commands lsblk, dmsetup ls --tree, dmsetup table, dmsetup status and dmsetup info -c (2.15 KB, text/plain)
2016-10-14 10:52 EDT, Pete Philips
no flags Details
requested system information (2.75 KB, text/plain)
2016-10-15 04:19 EDT, Paul DeStefano
no flags Details

  None (edit)
Description Paul DeStefano 2016-01-16 21:33:56 EST
Description of problem:
At shutdown, 8 LVM "stop" jobs run which delay or stall the shutdown process.  On the console, a message is displayed 

[ *** ] (1/8) A stop job is running for LVM PV <UUID> .... (elapse / 1m 30s)

The elapse time counts to 1m 30s and then shutdown proceeds.  One of the stop jobs is for my user session, but the reset are LVM jobs.

Version-Release number of selected component (if applicable):


How reproducible:
Since F22 with LVM mirrored disks, at least.  I cannot seem to stop it and it seems to happen every time, AFAICT.

Steps to Reproduce:
1. Shutdown system
2.
3.

Actual results:
(I cannot find these messages in the journal, which is disconcerting.  But, there are in syslog.)

389779 Jan 15 22:46:03 <hostname> systemd: Timed out stoppping LVM PV E3JYkO-fERG-xC8b-TBHe-jq1a-nFIc-x7xwJC on /dev/dm-0.
389781 Jan 15 22:46:03 <hostname> systemd: sys-devices-virtual-block-dm\x2d0.device: Job sys-devices-virtual-block-dm\x2d0.device/stop timed out.
389782 Jan 15 22:46:03 <hostname> systemd: Timed out stoppping LVM PV E3JYkO-fERG-xC8b-TBHe-jq1a-nFIc-x7xwJC on /dev/dm-0.
389783 Jan 15 22:46:03 <hostname> systemd: sys-devices-virtual-block-dm\x2d0.device: Job sys-devices-virtual-block-dm\x2d0.device/stop failed with result 'timeout'.
389785 Jan 15 22:46:03 <hostname> systemd: Timed out stoppping LVM PV E3JYkO-fERG-xC8b-TBHe-jq1a-nFIc-x7xwJC on /dev/dm-0.
389787 Jan 15 22:46:03 <hostname> systemd: dev-block-253:0.device: Job dev-block-253:0.device/stop timed out.
389788 Jan 15 22:46:03 <hostname> systemd: Timed out stoppping LVM PV E3JYkO-fERG-xC8b-TBHe-jq1a-nFIc-x7xwJC on /dev/dm-0.
389789 Jan 15 22:46:03 <hostname> systemd: dev-block-253:0.device: Job dev-block-253:0.device/stop failed with result 'timeout'.
389790 Jan 15 22:46:03 <hostname> systemd: dev-dm\x2d0.device: Job dev-dm\x2d0.device/stop timed out.
389791 Jan 15 22:46:03 <hostname> systemd: Timed out stoppping LVM PV E3JYkO-fERG-xC8b-TBHe-jq1a-nFIc-x7xwJC on /dev/dm-0.
389792 Jan 15 22:46:03 <hostname> systemd: dev-dm\x2d0.device: Job dev-dm\x2d0.device/stop failed with result 'timeout'.
389794 Jan 15 22:46:03 <hostname> systemd: Timed out stoppping LVM PV E3JYkO-fERG-xC8b-TBHe-jq1a-nFIc-x7xwJC on /dev/dm-0.
389796 Jan 15 22:46:03 <hostname> systemd: session-1.scope: Stopping timed out. Killing.
389797 Jan 15 22:46:03 <hostname> systemd: Stopped Session 1 of user pdestefa.          
389798 Jan 15 22:46:03 <hostname> systemd: session-1.scope: Unit entered failed state.  


Expected results:
Important jobs should finish, not time out and get forcefully killed.

Additional info:
I'm not certain this symptom indicates a problem, and, if so, how serious a problem is indicated.  For all I know, this is exactly why the timeout was implemented for stop jobs.  But, for the LVM jobs, I am worried I'm missing something serious or that I can fix.  My VM with F22 default install & F23 upgrade doesn't do this; it has root on LVM, too, but no mirrored LVs.
Comment 1 Paul DeStefano 2016-01-17 03:37:27 EST
Correction, my other computer that has un-mirrored LVM root also does this, so, I don't think it's related to that.
Comment 2 Paul DeStefano 2016-01-17 03:38:27 EST
...but, it's not 8 LVM jobs, it's 1 job for /dev/dm-something.  So, slightly different, I guess.
Comment 3 Martin Kyral 2016-02-03 08:26:36 EST
I am observing this bug too. It's ultimately annoying.
Comment 4 Peter Rajnoha 2016-02-04 08:27:11 EST
Please, boot your machine with "systemd.log_level=debug systemd.log_target=kmsg" added to your kernel command line. Then try to shutdown. Boot again and then grab the log from previous run by running "journalctl -b -1", please. We should see more debug info for that...
Comment 5 Paul DeStefano 2016-02-05 00:32 EST
Created attachment 1121282 [details]
Logs (slightly redacted)

Okay, here are the logs.  Force of habit, I redacted names.  Let me know if this helps.
Comment 6 Peter Rajnoha 2016-02-08 05:11:22 EST
(In reply to Paul DeStefano from comment #5)
> Created attachment 1121282 [details]
> Logs (slightly redacted)
> 
> Okay, here are the logs.  Force of habit, I redacted names.  Let me know if
> this helps.

(from the log)
Feb 04 21:00:20 <hostname> systemd[1]: lvm2-pvscan@179:4.service: Changed stop -> dead

Feb 04 21:01:52 <hostname> systemd[1]: lvm2-pvscan@179:4.service: Failed to send unit remove signal for lvm2-pvscan@179:4.service: Transport endpoint is not connected

To be honest, I'm not quite sure now what this is caused by - I've asked Lukas Nykryn from systemd team... He'll try to inspect this a bit more.
Comment 7 Peter Rajnoha 2016-02-08 05:25:50 EST
Can you please try this for more debug:

  https://wiki.freedesktop.org/www/Software/systemd/Debugging/

And the suggestion on "Shutdown Completes Eventually".
Comment 8 Paul DeStefano 2016-02-15 19:24:58 EST
Sorry, I was away last week...

Sure, happy to, but the link doesn't open for me; I get 403 error.  Hmm, even my bookmarks to other parts of that site are dead; I've been there plenty of times, before.  I'll check again later, must be temporary problem or maintenance.
Comment 9 Peter Rajnoha 2016-02-16 02:58:33 EST
Try this one instead ("wiki" instead of "www" in the address):
https://freedesktop.org/wiki/Software/systemd/Debugging/
Comment 10 Paul DeStefano 2016-02-23 00:21 EST
Created attachment 1129605 [details]
shutdown debug log

Sorry for the delay.  I thought I had uploaded this, already.  Thanks for the help finding debug techniques.  This is what I got from following those instructions.  I hope it helps.

As always, I'm willing to do more.  Just let me know.

Also, this is not a log from my *main* system, which shows 8 LVM jobs that stall.  This is from my net book which also showed the problem, recently.  I will send you that one, too, when I get a chance to boot with the required kernel logging options.
Comment 11 Paul DeStefano 2016-02-26 15:55 EST
Created attachment 1130904 [details]
Shutdown debug log

Okay, here is the debug shutdown log from my main system with the eight stalled LVM jobs.  I assume the root cause is the same as in the previous shutdown debug log I uploaded a few days ago, but it's possible there are different problems with the same symptom, so I didn't mark that log as obsoleted by this one.
Comment 12 Christian Krause 2016-02-27 06:51:00 EST
I'm suffering from the same problem, but I found at least one possible root cause (and a workaround):

I recognized during some unrelated debugging: if I fully boot the system up to the login manager, but do not login and just shutdown again, then the problem does not happen.

Similar to Paul's debug output, I'm also seeing the log entry:

[  465.356781] systemd-cryptsetup[5471]: Device luks-... is still in use.

I tested the following:

1. login (in my case cinnamon desktop environment)
2. logout
3. go to a virtual console, login as root, check for remaining processes of my regular user chkr: a few processes work still running
4. "killall -u chkr"
5. check again: kwalletd was still running
6. some more testing: kwalletd can only be killed with SIGKILL (9), SIGTERM (15) does not help

So at least one part of the problem is, that a process of the logged in user is still running and prevents umounting of the filesystem. It looks like that the login manager / systemd does not fully kill all processes of the user slice.


When I manually kill kwalletd when still logged in ("killall -9 kwalletd"), I can shutdown the system without delay.


However, that might be only one of the problems. When checking the journal, I see some more troublesome log messages during shutdown:

1.
 systemd-cryptsetup[4278]: crypt_init() failed: No such device

2.
Feb 27 11:56:51 systemd[1]: systemd-cryptsetup@luks.....service: Control process exited, code=exited status=1
Feb 27 11:56:51 systemd[1]: Stopped Cryptography Setup for luks-...
Feb 27 11:56:51 systemd[1]: systemd-cryptsetup@luks....service: Unit entered failed state.
Feb 27 11:56:51 systemd[1]: systemd-cryptsetup@luks....service: Failed with result 'exit-code'. 

3.
kernel: device-mapper: ioctl: unable to remove open device luks
systemd-cryptsetup[4279]: Failed to deactivate: Device or resource busy


4.
Even if the problem does not happen (e.g. because I did not login and directly shutdown the system via kdm, the very last log message is:
"systemd-shutdown[1]: Failed to finalize DM devices, ignoring"


Due to these multiple errors, it is hard to judge which part of the system is at fault here:

- Should kwalletd shutdown on SIGTERM?
- Should kwalletd be even started when not using KDE?
- Should the desktop environment cleanup all started user processes?
- Should the login manager fully cleanup all started user processes after logout?
- Should systemd stop all processes in user.slice?
- Should all of the above use SIGKILL in case the processes can't be removed via SIGTERM?
- Are all of the mentioned 4 error messages attributed to a process which holds a file open? If not, are these systemd(-shutdown/-cryptsetup) problems?


I verified my findings and the workaround on a fresh default F23 installation:

- run "killall -9 kwalletd" before shutting down the system helps as a workaround
Comment 13 Paul DeStefano 2016-07-05 02:29:36 EDT
After upgrading to F24, I still see these jobs on shutdown, but they don't seem to time out.  The run for 20 seconds or so and then the shutdown process proceeds.  This is an improvement, but I dont' think it is quite right.  Does anyone recognize this?
Comment 14 MartinG 2016-07-16 16:05:30 EDT
I also have this (delay on shutdown due to LVM) on F24. Had it in F23 too.
I use KDE and kwallet as well.
Comment 15 Pete Philips 2016-10-14 09:58:51 EDT
Also seeing this in f24 with a LUKS partition.
Comment 16 Zdenek Kabelac 2016-10-14 10:04:11 EDT
Can you please attach   your  'lsblk' and 'dmsetup ls --tree'   before you start your shutdown on your system ?

Probably add directly also:

dmsetup table

dmsetup status

dmsetup info -c
Comment 17 Pete Philips 2016-10-14 10:52 EDT
Created attachment 1210583 [details]
Output of commands lsblk, dmsetup ls --tree, dmsetup table, dmsetup status  and dmsetup info -c

Attached output.txt containing output from all commands specified.
Comment 18 Zdenek Kabelac 2016-10-14 10:58:07 EDT
So the core trouble is -  lvm2 is ATM supposed to be 'top-level' device.

In you case however you are stacking  crypto-device on top of LV.

This is somewhat far more complex.

And it's not really a bug of lvm2.

This needs to be properly handle on cryptosetup layer and their deps tracking.

Lvm2 can't be turned down while on it's top is sitting crypt device.
Comment 19 Pete Philips 2016-10-14 11:01:20 EDT
Interesting. This setup was installed by the F22 installer! I then upgraded to F24 and have been getting the stop jobs shutdown problem ever since. An installer bug? A problem introduced by the upgrade?
Comment 20 Zdenek Kabelac 2016-10-14 11:09:52 EDT
Well this needs to be handled properly on  'shutdown' dracut  processing.

Looking at this man-page:

/usr/lib/dracut/modules.d/98dracut-systemd/dracut-shutdown.service.8

Killing all the processes and then starting unmount - doesn't look right when we use these types of stacks.

There seems to be a few logical bugs in this process.
Comment 21 Zdenek Kabelac 2016-10-14 11:29:51 EDT
Further down the road of systemd log debugging -

There are events trying to 'execute' lvm2-pvscan services and updating  lvmetad - this is quite imperfect for shutdown sequence.

From the trace it's unclear what is actually real goal of shutdown - but just killing everything and switching to 'shutdown-initramfs' and loosing all the context isn't going to work so easily.


Would be probably nice to see & compare  log from f22 & f24.
It seems f24 is trying to be more 'clever' and do an ultimate 'clean' shutdown - but how not enough know-how to do it proper ordering.

Some running services should not be killed but 'restarted' from shutdown ramdisk.

It's unclear what 'dbus' is actually doing there - it looks like most wait is because of dbug not wanting to die??


Using 'systemctl enable debug-shell.service'  this should be the most effective way to see what's the awaited moment in this systemd world.

See the 'systemctl list-jobs'  when you feel the machine waits for 'something' then.

Saying this - there is now some rather highly 'experimental'  'dbus' lvm2 support - so would be good to check if this experimental stuff isn't the part of problem.

There is  lvm.conf  'notify_dbus'  settings - try to set to 0 and disable associated  service with systemctl

Next thing to try  is also  'use_lvmetad=0'   

Is the shutdown then any better/faster ?

(Note: will also likely need rebuild of initramdisk - since lvm.conf is installed in dracut)
Comment 22 Zdenek Kabelac 2016-10-14 12:12:33 EDT
Few more associated comments and questions:

It may need some inspection if there is some process waiting on 'dm' cookie decrementation.


(Usually such task could be 'instantly' unblocked by this call:

dmsetup udevcomplete_all -y


If this unblocks 'shutdown'  (in system debug shell)
Which process was there sleeping and waiting on semaphore ?
(collect   ps -aux)


There used to be some bug - where udev was not really running - but still had there some sockets ?

Is  udevd running or not ?
Comment 23 Paul DeStefano 2016-10-15 04:17:54 EDT
I haven't been watching this bug lately because I haven't noticed this issue lately, since F24, possibly.  But, I need to say that the problem with LVM 'underneath' LUKS, is not this bug.  At least, that was not the situation when I originally reported this problem.  This bug started in F22-ish and persisted through F23.  If something changed in F24, that's believable.

It's hard to know what is meant by "top" and "bottom", though.  I have sd/block/md/luks/lvm.  A think I follow you in the the filesystem convention, though, which seems the most logical.  I'll add some of the requested command output next so you can see for yourself and tell me if I understand it correctly.

I'm happy to provided new logs, though.

(In reply to Zdenek Kabelac from comment #21)
> Would be probably nice to see & compare  log from f22 & f24. ...

Would my logs be helpful if I'm not seeing this anymore?  Do you want the 'shutdown-log.txt' from here (https://wiki.freedesktop.org/www/Software/systemd/Debugging/), again, or something else?
Comment 24 Paul DeStefano 2016-10-15 04:19 EDT
Created attachment 1210747 [details]
requested system information
Comment 25 Zdenek Kabelac 2016-10-17 04:41:32 EDT
(In reply to Paul DeStefano from comment #23)

> It's hard to know what is meant by "top" and "bottom", though.  I have
> sd/block/md/luks/lvm.  A think I follow you in the the filesystem
> convention, though, which seems the most logical.  I'll add some of the
> requested command output next so you can see for yourself and tell me if I
> understand it correctly.

In your case  LV  is top-level device.

> I'm happy to provided new logs, though.
> 
> (In reply to Zdenek Kabelac from comment #21)
> > Would be probably nice to see & compare  log from f22 & f24. ...
> 
> Would my logs be helpful if I'm not seeing this anymore?  Do you want the
> 'shutdown-log.txt' from here
> (https://wiki.freedesktop.org/www/Software/systemd/Debugging/), again, or
> something else?

When you run 'shutdown' and machine blocks -

press Alt+F9  (when you have installed & enabled systemd debug service)

Then you can try to figure out what's it' waiting for.

'systemctl list-jobs' and whatever else you may try to figure out.
(i.e. is there something waiting on cookie ? -
echo t >/proc/sysrq-trigger
you may need to use serial console output for easy error-log capture.
Comment 26 Pete Philips 2016-10-17 05:34:19 EDT
Apologies but if the NEEDINFO status is waiting on me, I don't think I can provide any more details. Getting slightly beyond my time/tech levels!
Comment 27 Paul DeStefano 2016-10-18 01:19:55 EDT
I am willing to do that, but not sure when.  As I say, I'm not sure it is doing it, now.  I'll make a point of checking next reboot.

Did you see the two debug shutdown logs I submitted already, as requested?  Does it not show that information?  I don't think serial logging is achievable.
Comment 28 Gilles Duboscq 2016-12-29 09:12:31 EST
FYI, i'm seeing this on F25 as well.
I have an lvm PV backed by a LUKS device inside a partition of an NVMe device.
Comment 29 Svyatoslav Lachuzhnikov 2017-01-11 06:14:27 EST
Hi Team!

Similar issue time from time on F25. 


[ *** ] (1/8) A stop job is running for LVM PV <UUID> .... (elapse / 1m 30s)

Linux 4.8.15-300.fc25.x86_64 #1 SMP Thu Dec 15 23:10:23 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/fedora-release 
Fedora release 25 (Twenty Five)

$ sudo dmsetup table
fedora-swap: 0 16482304 linear 253:0 2048
fedora-root: 0 104857600 linear 253:0 361900032
luks-e6711301-fcc1-4206-8a22-e38f9ccf740e: 0 466757632 crypt aes-xts-plain64 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 0 8:2 4096
fedora-home: 0 345415680 linear 253:0 16484352
luks-ae297b9d-8657-46de-95bf-c36f5a306050: 0 468854784 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 8:17 4096

$ sudo dmsetup status
fedora-swap: 0 16482304 linear 
fedora-root: 0 104857600 linear 
luks-e6711301-fcc1-4206-8a22-e38f9ccf740e: 0 466757632 crypt 
fedora-home: 0 345415680 linear 
luks-ae297b9d-8657-46de-95bf-c36f5a306050: 0 468854784 crypt 

$ sudo dmsetup info -c
Name                                      Maj Min Stat Open Targ Event  UUID                                                                                  
fedora-swap                               253   2 L--w    2    1      0 LVM-71oRC96VTjl6cjldw258kVcs9vXT7MXIttQmLdgt0y6xSXS9zfYbq3tdeKV2hb50                  
fedora-root                               253   1 L--w    1    1      0 LVM-71oRC96VTjl6cjldw258kVcs9vXT7MXIP8tmUThMBddsU3UAWvXlS75IhyqDlF7t                  
luks-e6711301-fcc1-4206-8a22-e38f9ccf740e 253   0 L--w    3    1      0 CRYPT-LUKS1-e6711301fcc142068a22e38f9ccf740e-luks-e6711301-fcc1-4206-8a22-e38f9ccf740e
fedora-home                               253   3 L--w    1    1      0 LVM-71oRC96VTjl6cjldw258kVcs9vXT7MXIsIbCUOHnk1Nn5njKpi2ivIDdi1buBW0N                  
luks-ae297b9d-8657-46de-95bf-c36f5a306050 253   4 L--w    1    1      0 CRYPT-LUKS1-ae297b9d865746de95bfc36f5a306050-luks-ae297b9d-8657-46de-95bf-c36f5a306050
Comment 30 Fedora End Of Life 2017-07-25 15:47:32 EDT
This message is a reminder that Fedora 24 is nearing its end of life.
Approximately 2 (two) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 24. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '24'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 24 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged  change the 'version' to a later Fedora
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
Comment 31 Paul DeStefano 2017-07-26 01:20:14 EDT
So, this still was happening in F25.

However, in F26, I now see that, on boot, some boot 'jobs' are "removed".  I suspect this will prevent this bug from appearing in F26.  However, these boot jobs are *not* actually removed, they are just bypassed on boot.  It would be cool if someone could just explain to me how to actually removed these bogus jobs and then I could fix this boot warning.  But, at least I *think* the situation is better than it was before.  Thanks!
Comment 32 David Smith 2017-09-11 11:14:31 EDT
I can confirm the bug is still ongoing in Fedora 26.

It has happened to me on both bare metal and virtualized instances since installing it a week ago.

Unfortunately I do not have time to contribute to debugging efforts by supplying logs and other information.
Comment 33 Paul DeStefano 2017-09-11 14:38:55 EDT
I've updated this to F26.  Sounds like you are waiting for debug systemd info?  Okay, I'll see what I can do.  Kabelac says "after installing and enabling systemd debug service".  Is this the same thing as debug-shell.service?  Or is there a separate systemd-debug package.  If the latter, I don't see it.

Currently, I am getting some blocking on shutdown, but the output is much different than before.

Note You need to log in before you can comment on or make changes to this bug.