Bug 1664491

Summary: Shutdown takes a long time to finish
Product: [Fedora] Fedora Reporter: Romain Failliot <romain.failliot>
Component: systemdAssignee: systemd-maint
Status: CLOSED DUPLICATE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 29CC: lnykryn, msekleta, ssahani, s, systemd-maint, zbyszek
Target Milestone: ---Keywords: Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: systemd-239-8.gite339eae.fc29 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-03-22 12:38:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Debugging output of systemd during shutdown
none
journalctl --user -b -1
none
journalctl --system -b -1 none

Description Romain Failliot 2019-01-09 01:19:59 UTC
Created attachment 1519302 [details]
Debugging output of systemd during shutdown

Description of problem:

Most of the time, 9 times out of 10 I would say, when I shutdown or reboot, it takes between a minute or two to actually shut down.

Version-Release number of selected component (if applicable):

systemd 239

How reproducible:

I don't know, I just have that since I upgraded to Fedora 29 (a few weeks after it was released).

Actual results:

It takes 1-2 min to shut down.

Expected results:

I should take 20 sec or even less.

Additional info:

I followed Systemd's wiki to debug this kind of issue (https://freedesktop.org/wiki/Software/systemd/Debugging/#shutdowncompleteseventually), but I can't really understand the output. I think the clue is somewhere near these lines, since the timestamps jump matches the time I waited until my machine actually shut down (78314 -> 78403, ~89 sec):

[78314.219163] systemd[1]: Received SIGCHLD from PID 7222 ((sd-sync)).
[78314.219210] systemd[1]: Child 7222 ((sd-sync)) died (code=exited, status=0/SUCCESS)
[78314.219389] systemd-journald[530]: Journal effective settings seal=no compress=yes compress_threshold_bytes=512B
[78314.219498] systemd-journald[530]: Journal effective settings seal=no compress=yes compress_threshold_bytes=512B
[78403.197338] systemd-journald[530]: Sent WATCHDOG=1 notification.
[78403.197659] systemd[1]: systemd-journald.service: Got notification message from PID 530 (WATCHDOG=1)

I've attached the output with this issue.

Comment 1 Romain Failliot 2019-01-09 01:31:49 UTC
If I hit ESC while waiting, here's the line I see in the terminal:

> [  ***] A stop job is running for User Manager for UID 1000 (47s/ 2min)

But I can't see which job is still running.

Comment 2 Zbigniew Jędrzejewski-Szmek 2019-01-09 08:53:36 UTC
[78403.234573] systemd[1]: Received SIGCHLD from PID 7285 (systemctl).              
[78403.234624] systemd[1]: Child 1919 (systemd) died (code=exited, status=0/SUCCESS)         <----------------------
[78403.234683] systemd[1]: user: Child 1919 belongs to user.
[78403.234697] systemd[1]: user: Main process exited, code=exited, status=0/SUCCESS
[78403.235015] systemd[1]: user: Changed stop-sigterm -> final-sigkill
[78403.235468] systemd[1]: Child 7285 (systemctl) died (code=exited, status=0/SUCCESS)       <----------------------
[78403.235513] systemd[1]: user: Child 7285 belongs to user.
[78403.235899] systemd[1]: systemd-journald.service: Received EPOLLHUP on stored fd 321 (stored), closing.
[78403.235948] systemd[1]: Received SIGCHLD from PID 1921 ((sd-pam)).
[78403.235982] systemd[1]: Child 1921 ((sd-pam)) died (code=killed, status=9/KILL)            <----------------------
[78403.236028] systemd[1]: user: Child 1921 belongs to user.
[78403.236215] systemd[1]: user: cgroup is empty
[78403.236225] systemd[1]: user: Changed final-sigkill -> dead
[78403.237526] systemd[1]: user: Job user/stop finished, result=done
[78403.237531] systemd[1]: Stopped User Manager for UID 1000.

This looks a bit similar to https://github.com/systemd/systemd/issues/10414, but I don't think it's the same issue.
It seems here that we wait for systemd --user to stop (PID 1919), and in the end it does.
systemctl (PID 7285) was sending the request to PID 1919, so it also exits.
After that, PID 1 kills the sd-pam helper process, which is ugly but OK.

But we don't know why systemd --user is waiting. My guess is that you're
using cgroups v1 which has unreliable notifications, and it misses the stop
signal from some unit.

Do you have any idea why the messages from 'systemct --user' instance are missing? Did you use
some filter (e.g. journalctl --system) for those logs?

Comment 3 Romain Failliot 2019-01-09 14:14:21 UTC
I just followed the instructions here: https://freedesktop.org/wiki/Software/systemd/Debugging/#shutdowncompleteseventually
And then attached the file from /shutdown-log.txt

I was also hoping to get an output that was a bit easier to understand. Maybe it's something with the boot options?

Comment 4 Romain Failliot 2019-01-09 14:17:16 UTC
Also, I don't know about cgroups. I had a fresh install of Fedora 28 and upgraded to 29. Is it possible that cgroups config went rogue just with one dist upgrade?

Comment 5 Zbigniew Jędrzejewski-Szmek 2019-01-09 15:02:36 UTC
> I just followed the instructions here: https://freedesktop.org/wiki/Software/systemd/Debugging/#shutdowncompleteseventually

Ah, OK. You said that before, but I missed it. This doesn't include the logs from the
systemd --user instance which is the problem in this case. But those logs
should be available normally in 'journalctl --user'. Can you paste that too? (The part
from the shutdown sequence.)

> Also, I don't know about cgroups. I had a fresh install of Fedora 28 and upgraded to 29. Is it possible that cgroups config went rogue just with one dist upgrade?

Sorry, I was unclear. Cgroups v1 is the default in Fedora. It has known bugs and shortcomings,
but we cannot update because of kubernetes and libvirt and such. I was just stating the
fact that this might be underlying cause, because on cgroups v1 we sometimes miss notifications
about empty cgroups, which can then lead to a timeout when units are stopped.

Comment 6 Fedora Update System 2019-01-11 12:54:06 UTC
systemd-239-8.gite339eae.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2019-18b3a10c7f

Comment 7 Fedora Update System 2019-01-12 02:30:44 UTC
systemd-239-8.gite339eae.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-18b3a10c7f

Comment 8 Fedora Update System 2019-01-13 02:31:16 UTC
systemd-239-8.gite339eae.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.

Comment 9 Romain Failliot 2019-02-10 18:37:30 UTC
So, sorry for the long wait.

I still have the problem, even with systemd 239.

Not entirely sure which log you needed, I've joined both outputs from `journalctl --user -b -1` and `journalctl --system -b -1`.

Comment 10 Romain Failliot 2019-02-10 18:38:21 UTC
Created attachment 1528792 [details]
journalctl --user -b -1

Comment 11 Romain Failliot 2019-02-10 18:39:24 UTC
Created attachment 1528793 [details]
journalctl --system -b -1

Comment 12 Romain Failliot 2019-03-22 12:38:58 UTC
This bug is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1656414
And it seems to come from gnome-search-tool

*** This bug has been marked as a duplicate of bug 1656414 ***