Bug 1310608 - Very slow shutdown, it takes 1 minute 30 seconds
Very slow shutdown, it takes 1 minute 30 seconds
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: systemd (Show other bugs)
24
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: systemd-maint
Fedora Extras Quality Assurance
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2016-02-22 05:55 EST by Jaroslav Škarvada
Modified: 2017-04-18 05:47 EDT (History)
12 users (show)

See Also:
Fixed In Version: systemd-229-9.fc24
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1310871 1452662 (view as bug list)
Environment:
Last Closed: 2016-07-23 14:49:18 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Journal (74.88 KB, text/plain)
2016-02-22 05:55 EST, Jaroslav Škarvada
no flags Details

  None (edit)
Description Jaroslav Škarvada 2016-02-22 05:55:39 EST
Created attachment 1129230 [details]
Journal

Description of problem:
This is not 100% reproducible, but occurs in more than 60% of shutdowns. It seems it cannot shutdown device mapper because it is busy, so it timeouts after 1:30. I suspected akmods, because the CPU is quite busy for the time it waits, but from the log it doesn't seems it's causing the problem.

Version-Release number of selected component (if applicable):
systemd-222-14.fc23.x86_64

How reproducible:
Sometimes (in more than 60% of shutdowns)

Steps to Reproduce:
1. poweroff

Actual results:
Waits and timeouts after 1 minute 30 seconds, then it shutdowns.

Expected results:
Immediate shutdown.

Additional info:
It starts happening after upgrade to F23.

Is there any debug option howto find out what is blocking the shutdown?

Notice the time difference in journal:
úno 22 01:27:31 yarda kernel: thinkpad_acpi: EC reports that Thermal Table has changed
úno 22 01:28:56 yarda systemd[1]: dev-dm\x2d0.device: Job dev-dm\x2d0.device/stop timed out.
Comment 1 Jan Synacek 2016-02-22 06:07:22 EST
Also, this seems to be the same problem as in https://github.com/systemd/systemd/issues/2691.
Comment 2 Jan Synacek 2016-02-22 06:09:40 EST
(In reply to Jaroslav Škarvada from comment #0)
> Is there any debug option howto find out what is blocking the shutdown?

You can try adding "systemd.log_level=debug" to the kernel command line and attach the output of the boot where the hang happened (use "journalctl -b -1 (minus one) after reboot).
Comment 3 Jaroslav Škarvada 2016-02-22 15:52:10 EST
The delay is caused by systemd-coredump process which needs a bit more time on my core i7 than 90 seconds to complete (it's dumping memory and xz compressing it which is really CPU demanding especially if there is a lot of RAM). The problem is that the process is killed after 90 seconds, so nothing is dumped, which is really confusing. I see some serious problems here:

- user should be alerted what's going there on the console, i.e. that's coredump process is running, just showing "waiting for device-mapper service" (or similar) and counting down is totally useless

- it should be logged to journal that coredump process started and it's waiting for its completion, but unfortunately nothing is currently logged

- and finally the coredump process shouldn't be killed after 90 seconds, or it shouldn't use xz compression when dumping, but compress the core later (e.g. during next boot) which may result in much faster completion, otherwise it's useless time waste

By observing the coredump, it seems it's caused by xfce4-power-manager crash (SIGSEGV), backtrace follows:

#0  0x00007f8890ff3b3a in strlen () at /lib64/libc.so.6
#1  0x000055b775fab868 in get_device_description (upower=0x7f8874009270, device=device@entry=0x7f88740394e0)
    at xfpm-power-common.c:267
#2  0x000055b775fa7526 in power_manager_button_update_device_icon_and_details (button=0x55b777b04230, device=0x7f88740394e0)
    at ../panel-plugins/power-manager-plugin/power-manager-button.c:454
#3  0x00007f88922ee7a5 in g_closure_invoke () at /lib64/libgobject-2.0.so.0
#4  0x00007f8892300851 in signal_emit_unlocked_R () at /lib64/libgobject-2.0.so.0
#5  0x00007f8892309530 in g_signal_emit_valist () at /lib64/libgobject-2.0.so.0
#6  0x00007f88923098ff in g_signal_emit () at /lib64/libgobject-2.0.so.0
#7  0x00007f88922f2cd4 in g_object_dispatch_properties_changed () at /lib64/libgobject-2.0.so.0
#8  0x00007f88922f51c1 in g_object_notify () at /lib64/libgobject-2.0.so.0
#9  0x00007f88922ee7a5 in g_closure_invoke () at /lib64/libgobject-2.0.so.0
#10 0x00007f8892300851 in signal_emit_unlocked_R () at /lib64/libgobject-2.0.so.0
#11 0x00007f8892309530 in g_signal_emit_valist () at /lib64/libgobject-2.0.so.0
#12 0x00007f88923098ff in g_signal_emit () at /lib64/libgobject-2.0.so.0
#13 0x00007f88922f2cd4 in g_object_dispatch_properties_changed () at /lib64/libgobject-2.0.so.0
#14 0x00007f88922f51c1 in g_object_notify () at /lib64/libgobject-2.0.so.0
#15 0x00007f889254692f in up_device_glue_proxy_g_properties_changed () at /lib64/libupower-glib.so.3
#16 0x00007f888c3dad30 in ffi_call_unix64 () at /lib64/libffi.so.6
#17 0x00007f888c3da79b in ffi_call () at /lib64/libffi.so.6
#18 0x00007f88922eefc9 in g_cclosure_marshal_generic () at /lib64/libgobject-2.0.so.0
#19 0x00007f88922ee7a5 in g_closure_invoke () at /lib64/libgobject-2.0.so.0
#20 0x00007f8892300dfe in signal_emit_unlocked_R () at /lib64/libgobject-2.0.so.0
#21 0x00007f8892309530 in g_signal_emit_valist () at /lib64/libgobject-2.0.so.0
#22 0x00007f88923098ff in g_signal_emit () at /lib64/libgobject-2.0.so.0
#23 0x00007f889283283d in on_name_owner_changed () at /lib64/libgio-2.0.so.0
#24 0x00007f8892822554 in emit_signal_instance_in_idle_cb () at /lib64/libgio-2.0.so.0
#25 0x00007f8891fefe3a in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#26 0x00007f8891ff01d0 in g_main_context_iterate.isra () at /lib64/libglib-2.0.so.0
#27 0x00007f8891ff04f2 in g_main_loop_run () at /lib64/libglib-2.0.so.0
#28 0x00007f8894830255 in gtk_main () at /lib64/libgtk-3.so.0
#29 0x000055b775f934db in xfpm_start (bus=<optimized out>, client_id=0x0, dump=<optimized out>) at xfpm-main.c:240
#30 0x000055b775f93b4f in main (argc=1, argv=0x7ffe2e113e48) at xfpm-main.c:447

I am not sure why it's crashing, I guess there is probably null dereference when parsing some garbage/NULL got from the upower upon it's deinitialization. Well, it's very probably easily fixable xfce4-power-manager bug, so cloning there. But I think the above mentioned systemd issues should be also addressed.
Comment 4 Jan Synacek 2016-02-23 03:04:18 EST
Thanks for debugging this!

I'm removing the blocking bug, since it's a separate issue.
Comment 5 Jaroslav Škarvada 2016-03-18 09:39:37 EDT
Maybe the underlying problem is more complex and just increasing the timeout value may be not enough. Yesterday when it dumped another core with increased timeout (I increased it to 15 minutes), the dump itself took cca 2 minutes and then the CPU went idle and nothing was happening until it was killed after 15 minutes timeout. It seems there is some deadlock, probably related to the dump:

[  OK  ] Started Show Plymouth Power Off Screen.
[  OK  ] Stoppped Restore /run/initramfs on shutdown.
[   ***] (5 of 8) A stop job is running for LVM PV ... on /dev/dm-0 (24s / 1min 30s)[ 7987.947850] i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
[*     ] (1 of 8) A stop job is running for LVM PV ... on /dev/dm-0 (1min 30s / no limit)[ 8053.333338] systemd[1]: dev-dm\x2d0.device: Job dev-dm\x2d0.device/stop timed out.
[     *] (2 of 2) A stop job is running for Session 1 of user yarda (3min 16s / 15 min)

So:

> i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment

I am not sure what does it exactly means, but it is probably something from i915 driver and unrelated to this problem

> (1 of 8) A stop job is running for LVM PV ... on /dev/dm-0 (1min 30s / no limit)

Why no limit here?

> systemd[1]: dev-dm\x2d0.device: Job dev-dm\x2d0.device/stop timed out

Why Job dev-dm\x2d0.device? The escape sequence looks strange in the report.

And finally I don't know for what it is waiting for. The systemd-coredump job seems to finish (even if not announced in the logs) and dumped the core. Maybe it didn't correctly exit? Also this was the first time I reproduced this.
Comment 6 Jaroslav Škarvada 2016-03-20 05:36:33 EDT
(In reply to Jaroslav Škarvada from comment #5)

> [*     ] (1 of 8) A stop job is running for LVM PV ... on /dev/dm-0 (1min
> 30s / no limit)[ 8053.333338] systemd[1]: dev-dm\x2d0.device: Job
> dev-dm\x2d0.device/stop timed out.

This job timed out after 1:30 even I set 15 minutes:
DefaultTimeoutStopSec=900s

The other job seems to have "no limit" and the third job 15 minutes. Why? Guys it seems totally broken.
Comment 7 Jaroslav Škarvada 2016-03-20 05:38:47 EDT
And finally I am able to reproduce the deadlock even with the xfce4-power-manager problem resolved. So there is probably another issue.
Comment 8 Jaroslav Škarvada 2016-04-05 05:50:51 EDT
It seems that the problem occur only if there is a coredump created during the machine shutdown after the abrt handler is uninstalled (i.e. after abrt shutdown). I think the systemd code for catching the coredumps should be reviewed as it seems it is causing deadlocks and prolonging the shutdown process. What worse, user/admin is not notified what's going under the hood.
Comment 10 Zbigniew Jędrzejewski-Szmek 2016-05-21 23:43:36 EDT
This should be fixed in rawhide now. Reassigning to 24, to keep track of the issue there.
Comment 11 Fedora Update System 2016-07-21 02:48:03 EDT
systemd-229-9.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-47bda25e7a
Comment 12 Fedora Update System 2016-07-21 14:53:26 EDT
systemd-229-9.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-47bda25e7a
Comment 13 Fedora Update System 2016-07-23 14:48:34 EDT
systemd-229-9.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.
Comment 14 Leslie Satenstein 2017-04-16 14:25:24 EDT
Please reopen, the problem is persistant with F25, F26 Alpha, and the nightly builds.

As well, 
the watchdog timer is waiting on some event. It does not report the event for which it is waiting.


If I enter a few ctl-breaks (less than 7), the timeout counter can be reduced to around 30 seconds from 90.  If I do 7+ successive ctl-breaks, the watchdog timer is system is supposed exit for the reboot, but the system locks up, waiting for the event that has not arrived.
Comment 15 Jaroslav Škarvada 2017-04-18 05:47:36 EDT
I can also reproduce it sometimes (very seldom, it's much better now), but it maybe different source of the problem than I originally reported. And taking into account account it now occurs in cca. 5% of reboots/shutdowns to me, it's practically undebugeable for me.

Note You need to log in before you can comment on or make changes to this bug.