Red Hat Bugzilla – Bug 1310608
Very slow shutdown, it takes 1 minute 30 seconds
Last modified: 2017-04-18 05:47:36 EDT
Created attachment 1129230 [details]
Description of problem:
This is not 100% reproducible, but occurs in more than 60% of shutdowns. It seems it cannot shutdown device mapper because it is busy, so it timeouts after 1:30. I suspected akmods, because the CPU is quite busy for the time it waits, but from the log it doesn't seems it's causing the problem.
Version-Release number of selected component (if applicable):
Sometimes (in more than 60% of shutdowns)
Steps to Reproduce:
Waits and timeouts after 1 minute 30 seconds, then it shutdowns.
It starts happening after upgrade to F23.
Is there any debug option howto find out what is blocking the shutdown?
Notice the time difference in journal:
úno 22 01:27:31 yarda kernel: thinkpad_acpi: EC reports that Thermal Table has changed
úno 22 01:28:56 yarda systemd: dev-dm\x2d0.device: Job dev-dm\x2d0.device/stop timed out.
Also, this seems to be the same problem as in https://github.com/systemd/systemd/issues/2691.
(In reply to Jaroslav Škarvada from comment #0)
> Is there any debug option howto find out what is blocking the shutdown?
You can try adding "systemd.log_level=debug" to the kernel command line and attach the output of the boot where the hang happened (use "journalctl -b -1 (minus one) after reboot).
The delay is caused by systemd-coredump process which needs a bit more time on my core i7 than 90 seconds to complete (it's dumping memory and xz compressing it which is really CPU demanding especially if there is a lot of RAM). The problem is that the process is killed after 90 seconds, so nothing is dumped, which is really confusing. I see some serious problems here:
- user should be alerted what's going there on the console, i.e. that's coredump process is running, just showing "waiting for device-mapper service" (or similar) and counting down is totally useless
- it should be logged to journal that coredump process started and it's waiting for its completion, but unfortunately nothing is currently logged
- and finally the coredump process shouldn't be killed after 90 seconds, or it shouldn't use xz compression when dumping, but compress the core later (e.g. during next boot) which may result in much faster completion, otherwise it's useless time waste
By observing the coredump, it seems it's caused by xfce4-power-manager crash (SIGSEGV), backtrace follows:
#0 0x00007f8890ff3b3a in strlen () at /lib64/libc.so.6
#1 0x000055b775fab868 in get_device_description (upower=0x7f8874009270, device=device@entry=0x7f88740394e0)
#2 0x000055b775fa7526 in power_manager_button_update_device_icon_and_details (button=0x55b777b04230, device=0x7f88740394e0)
#3 0x00007f88922ee7a5 in g_closure_invoke () at /lib64/libgobject-2.0.so.0
#4 0x00007f8892300851 in signal_emit_unlocked_R () at /lib64/libgobject-2.0.so.0
#5 0x00007f8892309530 in g_signal_emit_valist () at /lib64/libgobject-2.0.so.0
#6 0x00007f88923098ff in g_signal_emit () at /lib64/libgobject-2.0.so.0
#7 0x00007f88922f2cd4 in g_object_dispatch_properties_changed () at /lib64/libgobject-2.0.so.0
#8 0x00007f88922f51c1 in g_object_notify () at /lib64/libgobject-2.0.so.0
#9 0x00007f88922ee7a5 in g_closure_invoke () at /lib64/libgobject-2.0.so.0
#10 0x00007f8892300851 in signal_emit_unlocked_R () at /lib64/libgobject-2.0.so.0
#11 0x00007f8892309530 in g_signal_emit_valist () at /lib64/libgobject-2.0.so.0
#12 0x00007f88923098ff in g_signal_emit () at /lib64/libgobject-2.0.so.0
#13 0x00007f88922f2cd4 in g_object_dispatch_properties_changed () at /lib64/libgobject-2.0.so.0
#14 0x00007f88922f51c1 in g_object_notify () at /lib64/libgobject-2.0.so.0
#15 0x00007f889254692f in up_device_glue_proxy_g_properties_changed () at /lib64/libupower-glib.so.3
#16 0x00007f888c3dad30 in ffi_call_unix64 () at /lib64/libffi.so.6
#17 0x00007f888c3da79b in ffi_call () at /lib64/libffi.so.6
#18 0x00007f88922eefc9 in g_cclosure_marshal_generic () at /lib64/libgobject-2.0.so.0
#19 0x00007f88922ee7a5 in g_closure_invoke () at /lib64/libgobject-2.0.so.0
#20 0x00007f8892300dfe in signal_emit_unlocked_R () at /lib64/libgobject-2.0.so.0
#21 0x00007f8892309530 in g_signal_emit_valist () at /lib64/libgobject-2.0.so.0
#22 0x00007f88923098ff in g_signal_emit () at /lib64/libgobject-2.0.so.0
#23 0x00007f889283283d in on_name_owner_changed () at /lib64/libgio-2.0.so.0
#24 0x00007f8892822554 in emit_signal_instance_in_idle_cb () at /lib64/libgio-2.0.so.0
#25 0x00007f8891fefe3a in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#26 0x00007f8891ff01d0 in g_main_context_iterate.isra () at /lib64/libglib-2.0.so.0
#27 0x00007f8891ff04f2 in g_main_loop_run () at /lib64/libglib-2.0.so.0
#28 0x00007f8894830255 in gtk_main () at /lib64/libgtk-3.so.0
#29 0x000055b775f934db in xfpm_start (bus=<optimized out>, client_id=0x0, dump=<optimized out>) at xfpm-main.c:240
#30 0x000055b775f93b4f in main (argc=1, argv=0x7ffe2e113e48) at xfpm-main.c:447
I am not sure why it's crashing, I guess there is probably null dereference when parsing some garbage/NULL got from the upower upon it's deinitialization. Well, it's very probably easily fixable xfce4-power-manager bug, so cloning there. But I think the above mentioned systemd issues should be also addressed.
Thanks for debugging this!
I'm removing the blocking bug, since it's a separate issue.
Maybe the underlying problem is more complex and just increasing the timeout value may be not enough. Yesterday when it dumped another core with increased timeout (I increased it to 15 minutes), the dump itself took cca 2 minutes and then the CPU went idle and nothing was happening until it was killed after 15 minutes timeout. It seems there is some deadlock, probably related to the dump:
[ OK ] Started Show Plymouth Power Off Screen.
[ OK ] Stoppped Restore /run/initramfs on shutdown.
[ ***] (5 of 8) A stop job is running for LVM PV ... on /dev/dm-0 (24s / 1min 30s)[ 7987.947850] i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
[* ] (1 of 8) A stop job is running for LVM PV ... on /dev/dm-0 (1min 30s / no limit)[ 8053.333338] systemd: dev-dm\x2d0.device: Job dev-dm\x2d0.device/stop timed out.
[ *] (2 of 2) A stop job is running for Session 1 of user yarda (3min 16s / 15 min)
> i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
I am not sure what does it exactly means, but it is probably something from i915 driver and unrelated to this problem
> (1 of 8) A stop job is running for LVM PV ... on /dev/dm-0 (1min 30s / no limit)
Why no limit here?
> systemd: dev-dm\x2d0.device: Job dev-dm\x2d0.device/stop timed out
Why Job dev-dm\x2d0.device? The escape sequence looks strange in the report.
And finally I don't know for what it is waiting for. The systemd-coredump job seems to finish (even if not announced in the logs) and dumped the core. Maybe it didn't correctly exit? Also this was the first time I reproduced this.
(In reply to Jaroslav Škarvada from comment #5)
> [* ] (1 of 8) A stop job is running for LVM PV ... on /dev/dm-0 (1min
> 30s / no limit)[ 8053.333338] systemd: dev-dm\x2d0.device: Job
> dev-dm\x2d0.device/stop timed out.
This job timed out after 1:30 even I set 15 minutes:
The other job seems to have "no limit" and the third job 15 minutes. Why? Guys it seems totally broken.
And finally I am able to reproduce the deadlock even with the xfce4-power-manager problem resolved. So there is probably another issue.
It seems that the problem occur only if there is a coredump created during the machine shutdown after the abrt handler is uninstalled (i.e. after abrt shutdown). I think the systemd code for catching the coredumps should be reviewed as it seems it is causing deadlocks and prolonging the shutdown process. What worse, user/admin is not notified what's going under the hood.
This should be fixed in rawhide now. Reassigning to 24, to keep track of the issue there.
systemd-229-9.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-47bda25e7a
systemd-229-9.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-47bda25e7a
systemd-229-9.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.
Please reopen, the problem is persistant with F25, F26 Alpha, and the nightly builds.
the watchdog timer is waiting on some event. It does not report the event for which it is waiting.
If I enter a few ctl-breaks (less than 7), the timeout counter can be reduced to around 30 seconds from 90. If I do 7+ successive ctl-breaks, the watchdog timer is system is supposed exit for the reboot, but the system locks up, waiting for the event that has not arrived.
I can also reproduce it sometimes (very seldom, it's much better now), but it maybe different source of the problem than I originally reported. And taking into account account it now occurs in cca. 5% of reboots/shutdowns to me, it's practically undebugeable for me.