1023820 – slow shutdown unit user@0.service entered failed state

Bug 1023820 - slow shutdown unit user entered failed state

Summary: slow shutdown unit user@0.service entered failed state

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	systemd
Sub Component:
Version:	20
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	systemd-maint
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1031239 (view as bug list)
Depends On:
Blocks:	1044602
TreeView+	depends on / blocked

Reported:	2013-10-28 02:10 UTC by Chris Murphy
Modified:	2019-06-18 14:14 UTC (History)
CC List:	31 users (show)
Fixed In Version:	systemd-208-14.fc20
Clone Of:
Environment:
Last Closed:	2014-02-24 12:33:09 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
shutdown-log debug (260.92 KB, text/plain) 2013-10-28 02:26 UTC, Chris Murphy	no flags	Details
virsh console output systemd.log_level=debug etc. (49.16 KB, text/plain) 2013-11-04 06:25 UTC, Chris Murphy	no flags	Details
hang on shutdown log (270.33 KB, text/plain) 2013-11-15 20:20 UTC, Zing	no flags	Details
shutdown-log.txt (557.14 KB, text/plain) 2013-12-16 23:58 UTC, Chris Murphy	no flags	Details
View All

Description Chris Murphy 2013-10-28 02:10:24 UTC

Description of problem: Over 1 minute hang on reboot/poweroff of a cleanly installed system.


Version-Release number of selected component (if applicable):

Fedora-20-Beta-TC6-x86_64-DVD.iso


How reproducible:
Uncertain. Installed systems seem to frequently but not always have this problem. This particular VM instance always reboots slow.

Steps to Reproduce:
0. qemu/kvm, qcow2, minimal package set, installed to LVM Thin Provision using guided partitioning. Booted to the newly installed system.
1. reboot or poweroff


Actual results:
> 1 minute hang, then reboot.


Expected results:
Faster reboot.


Additional info:
Rebooting and looking at the journal there are suspicious items in a row:

19:51:05 Started Restore /run/initramfs.
19:52.35 user stopping timed out. Killing.
19:52:35 Stopped User Manager for 0.
19:53:35 United user entered failed state.

sync && reboot -f reboots instantly

Comment 1 Chris Murphy 2013-10-28 02:26:50 UTC

Created attachment 816669 [details]
shutdown-log debug

Followed this:
http://freedesktop.org/wiki/Software/systemd/Debugging/#index2h1a

Comment 2 Chris Murphy 2013-11-04 06:25:14 UTC

Created attachment 818926 [details]
virsh console output systemd.log_level=debug etc.

I'm getting this from both Live Desktop Beta TC6 and RC2, but it's "A stop job is running for User Manager for 1000user stopping timed out. Killing."

Attaching virsh console log.

Comment 3 Zing 2013-11-15 20:19:16 UTC

Same issue: 2 minute hang when trying to shutdown/reboot.  Fedora 20 Beta install onto a kvm.  It happens frequently, although not 100% of the time.

snippet of my shutdown log (will attach full log):

[   65.425420] systemd[1]: Got D-Bus request: org.freedesktop.DBus.Local.Disconnected(
) on /org/freedesktop/DBus/Local
[  153.951503] systemd[1]: user stopping timed out. Killing.
[  153.955456] systemd[1]: user changed stop-sigterm -> stop-sigkill
[  153.955565] systemd[1]: Received SIGCHLD from PID 745 (systemd).
[  153.955614] systemd[1]: Got SIGCHLD for process 745 (systemd)
[  153.955661] systemd[1]: Child 745 died (code=killed, status=9/KILL)
[  153.955664] systemd[1]: Child 745 belongs to user
[  153.955673] systemd[1]: user: main process exited, code=killed, status=9/KILL
[  153.955680] systemd[1]: user changed stop-sigkill -> failed
[  153.955971] systemd[1]: Job user/stop finished, result=done

Comment 4 Zing 2013-11-15 20:20:09 UTC

Created attachment 824690 [details]
hang on shutdown log

Comment 5 Chris Murphy 2013-12-11 07:16:30 UTC

With systemd-208-8.fc20, and kernel-3.11.10-300.fc20 it has become more transient. But when it happens the hang is longer, over 1 minute, maybe 2 minutes.

Before the reboot, from that shell:

# loginctl user-status root
root (0)
           Since: Wed 2013-12-11 01:53:47 EST; 3min 33s ago
           State: active
        Sessions: 1
                  ├─session-1.scope
                  │ ├─1065 login -- root
                  │ └─1183 -bash
                  │ └─1216 loginctl user-status root
                  │ └─1217 less
                  └─user
                    ├─1180 /usr/lib/systemd/systemd --user
                    └─1185 (sd-pam)

# loginctl show-user 0
UID=0
GID=0
Name=root
Timestamp=Wed 2013-12-11 01:53:47 EST
TimestampMonotonic=21498632
RuntimePath=/run/user/0
Service=user
Slice=user-0.slice
State=active
Sessions=1
IdleHint=no
IdleSinceHint=0
IdleSinceHintMonotonic=0


# cat /run/systemd/users/0
# This is private data. Do not parse.
NAME=root
STATE=active
RUNTIME=/run/user/0
SERVICE=user
SLICE=user-0.slice
REALTIME=1386744827536712
MONOTONIC=21498632
SESSIONS=1
SEATS=seat0
ACTIVE_SESSIONS=1
ONLINE_SESSIONS=1
ACTIVE_SEATS=seat0
ONLINE_SEATS=seat0

Comment 6 Chris Murphy 2013-12-16 23:58:36 UTC

Created attachment 837471 [details]
shutdown-log.txt

So this is happening a majority of the time with a clean installation of Fedora 20 final RC1.1 DVD ISO plus a 3.13.x kernel:

3.13.0-0.rc3.git0.2.fc21.x86_64
systemd-208-9.fc20.x86_64


All I'm doing is booting to multi-user.target and ssh'ing into the machine once, and after some time I reboot and get this hang fairly reliably.

This looks suspicious:

[  777.611981] systemd[1]: Failed to destroy cgroup /user.slice/user-0.slice/session-1.scope: Device or resource busy

And big gaps between entries even though systemd is in debug mode:

[  778.439765] systemd[1]: Got D-Bus request: org.freedesktop.DBus.Local.Disconnected() on /org/freedesktop/DBus/Local
[  804.091526] perf samples too long (2527 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[  853.846103] systemd[1]: Accepted connection on private bus.

And finally:

  867.678101] systemd[1]: user stopping timed out. Killing.

Comment 7 Adam Williamson 2013-12-17 19:42:35 UTC

*** Bug 1031239 has been marked as a duplicate of this bug. ***

Comment 8 Chris Murphy 2013-12-18 20:33:17 UTC

Possible duplicate bug 1044602.

Comment 9 Zbigniew Jędrzejewski-Szmek 2013-12-19 19:21:47 UTC

Fixed upstream in http://cgit.freedesktop.org/systemd/systemd/commit/?id=63966da.

Comment 10 Tom Horsley 2013-12-21 01:16:19 UTC

Just FYI, I created a new user, and removed the initial temp user I made during the install, and hung for a very, very, long time when I rebooted with the message: A stop job is running for User Manager for 1000. (where 1000 was the UID of the user I deleted while I had the system up).

This is on a newly installed Fedora 20 system (from the release DVD).

Comment 11 Gary Humenuk 2013-12-30 21:50:50 UTC

I have seen this error for uid 1001 (my uid) with a minimal install on a VirtualBox machine. So it's not related to X or any window manager.

The only user related processes running were:

 root owned sshd: user [priv]
 user owned /usr/lib/systemd/systemd --user
 user owned (sd-pam)
 user owned sshd: user@pts/0
 user owned -bash

I also saw it hang for about a minute on "User Manager for 0".

Comment 12 Tom Horsley 2014-01-09 02:12:53 UTC

This is ridiculous. Apparently if I merely ssh into my f20 system using an NIS defined user account, the next time I try to reboot (long after logging out on the ssh session), the system will hang with the stoopid "stop job for UID" of the UID of the NIS user I once logged in as.

Is there some way to eradicate whatever "helpful" crap systemd is doing with these per-user things?

Comment 13 Harald Reindl 2014-01-14 01:04:51 UTC

this bug is *really* a problem

* if you are on a VT and reboot a F20 machine your VT is dead
* if you reboot from knsole you need to close the window instead cursor 
  up for re-connect after the few seconds for boot
* and the *most critical* issue: if you are working with scripts to 
  control a infrastructure with 20 machines and want to reboot them all 
  with a shellscript containing reboot and sleep commands to reduce
  load on the host you simply can forget them - "reboot-backup-machines.sh"
  is useless with F20 and RHEL7

Comment 14 Tom Horsley 2014-01-14 20:08:10 UTC

I haven't tried this yet, but someone who is more desperate than me might want to give it a shot:

This page: https://wiki.archlinux.org/index.php/Systemd/User

implies that these systemd user units are started by pam.

Possibly if the lines that say:

-session     optional      pam_systemd.so

in various files under /etc/pam.d were commented out, the stupid systemd user instances would never be created and systemd wouldn't wait forever for them to stop.

Comment 15 Harald Reindl 2014-01-14 20:14:10 UTC

you are completly wrong

here you go:
https://bugzilla.redhat.com/show_bug.cgi?id=1023788

see especially comment https://bugzilla.redhat.com/show_bug.cgi?id=1023788#c2

sshd does not close the connection properly before the real shutdown happens which results in the client tries to resume/keep-alive forever

the network-manager centric is unacceptable, nobody on this planet needs NM on static configured machines and the world does not turn around mobile devices

Comment 16 Fedora Update System 2014-01-15 01:26:46 UTC

systemd-208-11.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/systemd-208-11.fc20

Comment 17 Harald Reindl 2014-01-15 03:20:41 UTC

systemd-208-11.fc20 has https://bugzilla.redhat.com/show_bug.cgi?id=1053315 as regression and does not fix https://bugzilla.redhat.com/show_bug.cgi?id=1023788

Comment 18 Harald Reindl 2014-01-15 09:46:59 UTC

systemd-208-11.fc20 is unstable besides https://bugzilla.redhat.com/show_bug.cgi?id=1053315 - i guess on a desktop machine that could be a real problem, on a headless server it's only flooding logs and log-reportings

Jan 15 05:11:01 testserver systemd[1]: Unit systemd-logind.service entered failed state.
Jan 15 05:31:01 testserver systemd[1]: Unit systemd-logind.service entered failed state.
Jan 15 05:51:01 testserver systemd[1]: Unit systemd-logind.service entered failed state.
Jan 15 06:21:01 testserver systemd[1]: Unit systemd-logind.service entered failed state.
Jan 15 06:41:01 testserver systemd[1]: Unit systemd-logind.service entered failed state.
Jan 15 07:11:01 testserver systemd[1]: Unit systemd-logind.service entered failed state.
Jan 15 08:01:01 testserver systemd[1]: Unit systemd-logind.service entered failed state.
Jan 15 08:31:01 testserver systemd[1]: Unit systemd-logind.service entered failed state.
Jan 15 10:31:01 testserver systemd[1]: Unit systemd-logind.service entered failed state.

Comment 19 Fedora Update System 2014-01-16 07:03:10 UTC

Package systemd-208-11.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing systemd-208-11.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-0902/systemd-208-11.fc20
then log in and leave karma (feedback).

Comment 20 Harald Reindl 2014-01-16 12:33:32 UTC

*why was this pushed to updates-testing* as the build is known to be broken
this forces updates-testing users from now on use
"yum --exclude=systemd\*,libgudev\* upgrade" :-(

Comment 21 Zbigniew Jędrzejewski-Szmek 2014-01-16 14:11:42 UTC

Unpushed now. It don't think your issues were *caused* by changes in this update, but they probably *expose* them. I'll have to work on this a bit more.

Comment 22 Harald Reindl 2014-01-16 14:12:46 UTC

thank you very much!

Comment 23 Harald Reindl 2014-01-17 13:30:06 UTC

maybe that is also a sign that and what goes wrong fro someone knowing about systemd internals - at least "Reached target Sound Card" is questionable in case of a reload as well as "Mounted Configuration File System"

systemd-208-9.fc20.x86_64

[root@srv-rhsoft:~]$ > /var/log/messages; systemctl daemon-reload; cat /var/log/messages 
Jan 17 14:27:26 srv-rhsoft systemd[1]: Mounted Configuration File System.
Jan 17 14:27:26 srv-rhsoft systemd[1]: Reached target Sound Card.
Jan 17 14:27:26 srv-rhsoft systemd[1]: Found device /sys/devices/virtual/block/md0.
Jan 17 14:27:26 srv-rhsoft systemd[1]: Found device /dev/md0.
Jan 17 14:27:26 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-name-localhost.localdomain:0.
Jan 17 14:27:26 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-uuid-1d691642:baed26df:1d197496:4fb00ff8.
Jan 17 14:27:26 srv-rhsoft systemd[1]: Found device /dev/disk/by-label/boot.

Comment 24 Tom Horsley 2014-01-23 15:46:02 UTC

Well, this is clever. Apparently pulseaudio now depends on these systemd user daemons, so the technique from comment #14 does solve the interminable waiting on stop jobs at reboot, but it has the side effect of disabling sound :-(.

Comment 25 Tom Horsley 2014-01-23 19:04:54 UTC

We have stop jobs running for users, stop jobs running for NFS mounts on other systems that are down and not responding, etc. How about making systemd listen for Ctrl-Alt-Delete during shutdown/reboot processing and just terminate the damn stop jobs it is waiting on?

Comment 26 cornel panceac 2014-01-25 09:06:30 UTC

i don't know if this would help, but at about 50 minutes of uptime, i've noticed this on /var/log/messages:

Jan 25 11:01:01 localhost systemd[1]: Starting user-0.slice.
Jan 25 11:01:01 localhost systemd[1]: Created slice user-0.slice.
Jan 25 11:01:01 localhost systemd[1]: Starting User Manager for 0...
Jan 25 11:01:01 localhost systemd[1]: Starting Session 2 of user root.
Jan 25 11:01:01 localhost systemd[1]: Started Session 2 of user root.
Jan 25 11:01:01 localhost systemd: Failed to open private bus connection: Failed to connect to socket /run/user/0/dbus/user_bus_socket: No such file or directory
Jan 25 11:01:01 localhost systemd: Starting Default.
Jan 25 11:01:01 localhost systemd: Reached target Default.
Jan 25 11:01:01 localhost systemd: Startup finished in 10ms.
Jan 25 11:01:01 localhost systemd[1]: Started User Manager for 0.

Comment 27 cornel panceac 2014-01-25 09:14:30 UTC

also, i don't know if it's relevant but, bfore that, i've lost the ability to resolve internet URLs.

Comment 28 Tom Horsley 2014-01-25 15:59:38 UTC

I wonder if this is related to bug 1043212? Perhaps this slow shutdown only happens if the system has been up across a cron.daily run? I get a lot of annoying timeouts and screwed up services after cron runs every night:

bug 1057811, bug 1057618, bug 1057614.

Comment 29 Harald Reindl 2014-01-25 22:24:26 UTC

@Tom Horsley: unlikely

i *heavily* use cronjobs and rarely notice the problem of this bugreport but i face it *randomly* on my testserver-VM while it is completly unpredicatble when and why it happens - never faced it on my real machines until now


[root@srv-rhsoft:~]$ cat /etc/crontab | wc -l
48

[root@srv-rhsoft:~]$ ls /etc/cron.daily/
insgesamt 56K
-rwxr-xr-x 1 root root  434 2013-08-08 15:14 0logwatch
-rwxr-xr-x 1 root root 2,2K 2014-01-22 17:45 certwatch
-rwxr-xr-x 1 root root  364 2013-09-20 23:10 dbmail
-rwxr-xr-x 1 root root  589 2013-09-03 20:45 etckeeper
-rwxr-xr-x 1 root root  12K 2014-01-13 16:23 google-chrome
-rwxr-xr-x 1 root root  180 2012-08-01 12:56 logrotate
-rwxr-x--- 1 root root  192 2013-08-03 14:37 mlocate
-rwxr-xr-x 1 root root 1,8K 2013-08-03 12:44 rkhunter
-rwxr-xr-x 1 root root  113 2013-02-25 16:49 check-updates.cron
-rwxr-xr-x 1 root root  136 2012-01-10 07:49 cleanup-uploadtemp.cron
-rwxr-xr-x 1 root root  249 2010-08-20 16:52 contact-backup.cron
-rwxr-xr-x 1 root root  618 2013-08-03 11:32 man-db.cron

https://bugzilla.redhat.com/show_bug.cgi?id=1023788#c2 i can reprdouce on *any* machine running F20/RHEL7-Beta at any time
________________________________________

@cornel panceac

"Failed to connect to socket /run/user/0/dbus/user_bus_socket" appears all the time whenever a cronjob is started using a userid not called in any cronjob before or starting whatever user-session

https://bugzilla.redhat.com/show_bug.cgi?id=1010572

Comment 30 Harald Reindl 2014-01-25 22:27:16 UTC

*that* shows systemd in F20 has a serious problem because that happens at any package-update triggering "reload" and there are many, due the dist-upgrade a large amount of the complete log was flooded with that messages and there is no single reason why target "Sound Card" can be reached again and agin while happily play music with MPD without any interruption all the time
__________________________________________________________________

[root@srv-rhsoft:~]$ > /var/log/messages; systemctl daemon-reload; sleep 2; cat /var/log/messages 
Jan 25 23:22:28 srv-rhsoft systemd[1]: Mounted Configuration File System.
Jan 25 23:22:28 srv-rhsoft systemd[1]: Reached target Sound Card.
Jan 25 23:22:28 srv-rhsoft systemd[1]: Found device /sys/devices/virtual/block/md0.
Jan 25 23:22:28 srv-rhsoft systemd[1]: Found device /dev/md0.
Jan 25 23:22:28 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-name-localhost.localdomain:0.
Jan 25 23:22:28 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-uuid-1d691642:baed26df:1d197496:4fb00ff8.
Jan 25 23:22:28 srv-rhsoft systemd[1]: Found device /dev/disk/by-label/boot.
Jan 25 23:22:28 srv-rhsoft systemd[1]: Found device /sys/devices/virtual/block/md2.
Jan 25 23:22:28 srv-rhsoft systemd[1]: Found device /dev/md2.
Jan 25 23:22:28 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-name-localhost.localdomain:2.
Jan 25 23:22:28 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-uuid-ea253255:cb915401:f32794ad:ce0fe396.
Jan 25 23:22:28 srv-rhsoft systemd[1]: Found device /dev/disk/by-label/data.
__________________________________________________________________

[root@srv-rhsoft:~]$ > /var/log/messages; systemctl daemon-reload; sleep 2; cat /var/log/messages 
Jan 25 23:25:31 srv-rhsoft systemd[1]: Mounted Configuration File System.
Jan 25 23:25:31 srv-rhsoft systemd[1]: Reached target Sound Card.
Jan 25 23:25:31 srv-rhsoft systemd[1]: Found device /sys/devices/virtual/block/md0.
Jan 25 23:25:31 srv-rhsoft systemd[1]: Found device /dev/md0.
Jan 25 23:25:31 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-name-localhost.localdomain:0.
Jan 25 23:25:31 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-uuid-1d691642:baed26df:1d197496:4fb00ff8.
Jan 25 23:25:31 srv-rhsoft systemd[1]: Found device /dev/disk/by-label/boot.
Jan 25 23:25:31 srv-rhsoft systemd[1]: Found device /sys/devices/virtual/block/md2.
Jan 25 23:25:31 srv-rhsoft systemd[1]: Found device /dev/md2.
Jan 25 23:25:31 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-name-localhost.localdomain:2.
Jan 25 23:25:31 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-uuid-ea253255:cb915401:f32794ad:ce0fe396.
Jan 25 23:25:31 srv-rhsoft systemd[1]: Found device /dev/disk/by-label/data.

Comment 31 Harald Reindl 2014-01-25 22:37:22 UTC

here you see a machine recently rebooted

uid 48 is httpd (Apache), the loglines are for sure from teh first time
call this cronjob, "Reached target Default" is a joke, that target was reached *long before* and not by start a random cronjob

[root@testserver:~]$ cat /etc/crontab | grep apache
*/15 * * * *                                apache    nice -n 19 ionice -c 3 bash /scripts/cleanup-uploadtemp.cron
____________________________

Jan 25 23:15:01 testserver systemd[1]: Starting user-48.slice.
Jan 25 23:15:01 testserver systemd[1]: Created slice user-48.slice.
Jan 25 23:15:01 testserver systemd[1]: Starting User Manager for 48...
Jan 25 23:15:01 testserver systemd[1199]: Failed to open private bus connection: Failed to connect to socket /run/user/48/dbus/user_bus_socket: No such file or directory
Jan 25 23:15:01 testserver systemd[1199]: Starting Default.
Jan 25 23:15:01 testserver systemd[1199]: Reached target Default.
Jan 25 23:15:01 testserver systemd[1199]: Startup finished in 4ms.
Jan 25 23:15:01 testserver systemd[1]: Started User Manager for 48.
Jan 25 23:16:26 testserver systemd[1]: Starting Cleanup of Temporary Directories...
Jan 25 23:16:26 testserver systemd[1]: Started Cleanup of Temporary Directories.

Comment 32 Paolo Bonzini 2014-01-27 18:15:43 UTC

I can reproduce this with 3.12 kernel but not 3.11.

Comment 33 weiw333 2014-02-04 14:07:07 UTC

http://cgit.freedesktop.org/systemd/systemd/commit/?id=63966da fixed this bug.

Comment 34 Colin Guthrie 2014-02-14 18:15:11 UTC

Commenting here as Adam posted a reference and wanted to share my experience fixing a similar bug.

(In reply to weiw333 from comment #33)
> http://cgit.freedesktop.org/systemd/systemd/commit/?id=63966da fixed this
> bug.

I don't think this is a full solution.

I also have that patch in my build to address this, but it's a bit hacky, but OK for a released version IMO. We discussed the issue on the systemd mailing list. It related to the killing spree when systemd(PID1) sends term signals to everything) One of the processes it sends a term signal to is the systemd user session which then in-turn starts spawning processes calling kill itself, but after sending the term signal PID1 also sends kill signals to other processes in the cgroup which can (if the race is triggered) include the kill process that was spawned by systemd (user) to do it's killing... bit of a complex one to explain!

This is fixed properly upstream by the killmode=mixed patch (IIRC), but a temporary solution is similar to the fix in the httpd .service file Fedora currently ships.

Some more comments on our patch here:

http://svnweb.mageia.org/packages/cauldron/systemd/current/SOURCES/0515-Temporary-work-around-for-slow-shutdown-due-to-unter.patch?revision=568430&view=markup

I've had it confirmed from some of my users that this solves the shutdown slowness for them.

It might be unrelated I guess, but seems suspiciously similar to issues reported elsewhere and discussed at length on the systemd list.

HTHs

Comment 35 cornel panceac 2014-02-15 06:37:55 UTC

can we revert the commit that introduced this bug?

Comment 36 Fedora Update System 2014-02-17 07:37:49 UTC

systemd-208-13.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/systemd-208-13.fc20

Comment 37 Harald Reindl 2014-02-17 11:53:24 UTC

forget it

https://bugzilla.redhat.com/show_bug.cgi?id=1053315

> The patches that Djalal lists are a proper fix, but they are 
> rather intrusive, so I'm not trying to backport them now. 
> Please use 'loginctl enable-linger 0' and 'loginctl enable-linger 48' 
> for the affected users, as a work-around.

that is *unacceptable* if someone maintains a lot of machines with different cronjobs running as different users

Feb 17 12:29:28 Updated: systemd-libs-208-13.fc20.x86_64
Feb 17 12:29:29 Updated: systemd-208-13.fc20.x86_64
Feb 17 12:29:29 Updated: libgudev1-208-13.fc20.x86_64
_______________________________________

Feb 17 12:45:01 testserver systemd[1]: Starting user-48.slice.
Feb 17 12:45:01 testserver systemd[1]: Created slice user-48.slice.
Feb 17 12:45:01 testserver systemd[1]: Starting User Manager for 48...
Feb 17 12:45:01 testserver systemd[1498]: Failed to open private bus connection: Failed to connect to socket /run/user/48/dbus/user_bus_socket: No such file or directory
Feb 17 12:45:01 testserver systemd[1498]: Starting Default.
Feb 17 12:45:01 testserver systemd[1498]: Reached target Default.
Feb 17 12:45:01 testserver systemd[1498]: Startup finished in 6ms.
Feb 17 12:45:01 testserver systemd[1]: Started User Manager for 48.
Feb 17 12:45:01 testserver systemd[1]: Stopping User Manager for 48...
Feb 17 12:45:01 testserver systemd[1498]: Stopping Default.
Feb 17 12:45:01 testserver systemd[1498]: Stopped target Default.
Feb 17 12:45:01 testserver systemd[1498]: Starting Shutdown.
Feb 17 12:45:01 testserver systemd[1498]: Reached target Shutdown.
Feb 17 12:45:01 testserver systemd[1498]: Starting Exit the Session...
Feb 17 12:45:01 testserver systemd[1498]: Received SIGRTMIN+24 from PID 1538 (kill).
Feb 17 12:45:01 testserver systemd[1]: Stopped User Manager for 48.
Feb 17 12:45:01 testserver systemd[1]: Stopping user-48.slice.
Feb 17 12:45:01 testserver systemd[1]: Removed slice user-48.slice.
Feb 17 12:45:02 testserver systemd[1]: Starting Cleanup of Temporary Directories...
Feb 17 12:45:03 testserver systemd[1]: Started Cleanup of Temporary Directories.

Comment 38 Harald Reindl 2014-02-17 12:00:30 UTC

> Enable/disable user lingering for one or more users. 
> If enabled for a specific user, a user manager is 
> spawned for the user at boot and kept around after 
> logouts. This allows users who are not logged in to 
> run long-running services

if that would be true the log above would not happen because there is already running a service for uid 48 (apache), the webserver itself, the log above is a cronjob running twice a hour as user apache 

[root@testserver:~]$ systemctl status httpd.service 
httpd.service - Apache Webserver
   Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled)
   Active: active (running) since Mo 2014-02-17 12:30:12 CET; 28min ago
 Main PID: 769 (httpd)

Comment 39 Colin Guthrie 2014-02-17 14:30:59 UTC

@Harald: Note your log form comment 30 has the sound.target stuff in there on daemon-reload. This is caused by a patch that is now reverted, so that bit should be somewhat solved.

Regarding this actual bug (not related to cron) can you check to see if my suggested workaround in comment 34 would solve it for you? I think it should and is included in the new build mentioned above from a couple days ago.

There is also a (more recent i.e. a few minutes ago) a fix for log spamming which is a separate issue.

Comment 40 Harald Reindl 2014-02-17 14:48:00 UTC

@Colin: 

this are not daemon-reloads, these are cronjobs
Feb 17 12:45:01 testserver systemd[1498]: Stopping Default.
Feb 17 12:45:01 testserver systemd[1498]: Stopped target Default.
Feb 17 12:45:01 testserver systemd[1498]: Starting Shutdown.
Feb 17 12:45:01 testserver systemd[1498]: Reached target Shutdown.

https://bugzilla.redhat.com/show_bug.cgi?id=1023820#c34
besides the fact that i do not use fedora http-packages on my rawhide-VM there is no httpd enabled or running at all and i faced this issue, so any change in the httpd.service unlikley fixes the issue

the shutdown problem of this bugreport does not happen that often, but given that is is unlikely httpd-related it is hardly the way to go change random systemd-units working for F18/F19 because breakage in F20/F21/RHEL7

i give http://koji.fedoraproject.org/koji/buildinfo?buildID=498731 a try on my testserver-VM in a short
____________________________________________________________

[root@rh:~]$ cat /usr/lib/systemd/system/httpd.service
[Unit]
Description=Apache Webserver
After=network.service haveged.service rngd.service

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/httpd
Environment="PATH=/usr/sbin:/usr/bin"
ExecStart=/usr/sbin/httpd $OPTIONS -D FOREGROUND
ExecReload=/usr/sbin/httpd $OPTIONS -k graceful
Restart=always
RestartSec=1
UMask=006
PrivateTmp=yes
NoNewPrivileges=yes
CapabilityBoundingSet=CAP_DAC_OVERRIDE CAP_IPC_LOCK CAP_NET_BIND_SERVICE CAP_SETGID CAP_SETUID
ReadOnlyDirectories=/etc
ReadOnlyDirectories=/usr
InaccessibleDirectories=/sys
InaccessibleDirectories=/proc
InaccessibleDirectories=/boot
InaccessibleDirectories=/home
InaccessibleDirectories=/root
InaccessibleDirectories=/usr/local/scripts
InaccessibleDirectories=/var/lib/rpm
InaccessibleDirectories=/var/lib/yum
InaccessibleDirectories=/var/spool

[Install]
WantedBy=multi-user.target
____________________________________________________________

Comment 41 Fedora Update System 2014-02-17 15:05:42 UTC

systemd-208-14.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/systemd-208-14.fc20

Comment 42 Harald Reindl 2014-02-17 15:07:03 UTC

systemd-208-14.fc20.x86_64 looks much better

i can't confirm the hanging shutdown as fixed because i can rarely reproduce it (otherwise i would became choleric longer ago) but given that it does not bring new regressions for me i am open to give positive karma in the hope others are get more happy then with the current state

it no longer floods the log everytime a cronjob is started
so it's a acceptable compromise and fixing https://bugzilla.redhat.com/show_bug.cgi?id=1053315

however, these are annoying, the first must not happen at all and the others are misleading because calling the first time a cronjob for a user hardly reaches target Default because at that time it is already reached
Feb 17 15:55:02 testserver systemd[1163]: Failed to open private bus connection: Failed to connect to socket /run/user/48/dbus/user_bus_socket: No such file or directory
Feb 17 15:55:02 testserver systemd[1163]: Starting Default.
Feb 17 15:55:02 testserver systemd[1163]: Reached target Default.
Feb 17 15:55:02 testserver systemd[1163]: Startup finished in 4ms.
____________________________________

[root@testserver:~]$ cat /var/log/cron
Feb 17 15:55:02 testserver CROND[1178]: (apache) CMD (   nice -n 19 ionice -c 3 bash /scripts/cleanup-uploadtemp.cron)
Feb 17 16:00:01 testserver CROND[1360]: (apache) CMD (   nice -n 19 ionice -c 3 bash /scripts/cleanup-uploadtemp.cron)
____________________________________

[root@testserver:~]$ cat /var/log/messages 
Feb 17 15:55:02 testserver systemd[1]: Starting user-48.slice.
Feb 17 15:55:02 testserver systemd[1]: Created slice user-48.slice.
Feb 17 15:55:02 testserver systemd[1]: Starting User Manager for 48...
Feb 17 15:55:02 testserver systemd[1163]: Failed to open private bus connection: Failed to connect to socket /run/user/48/dbus/user_bus_socket: No such file or directory
Feb 17 15:55:02 testserver systemd[1163]: Starting Default.
Feb 17 15:55:02 testserver systemd[1163]: Reached target Default.
Feb 17 15:55:02 testserver systemd[1163]: Startup finished in 4ms.
Feb 17 15:55:02 testserver systemd[1]: Started User Manager for 48.

Comment 43 Zbigniew Jędrzejewski-Szmek 2014-02-17 15:25:17 UTC

(In reply to Harald Reindl from comment #42)
> systemd-208-14.fc20.x86_64 looks much better
Thanks for testing.

> i can't confirm the hanging shutdown as fixed because i can rarely reproduce
> it (otherwise i would became choleric longer ago) but given that it does not
> bring new regressions for me i am open to give positive karma in the hope
> others are get more happy then with the current state
> 
> it no longer floods the log everytime a cronjob is started
> so it's a acceptable compromise and fixing
> https://bugzilla.redhat.com/show_bug.cgi?id=1053315
> 
> however, these are annoying, the first must not happen at all and the others
> are misleading because calling the first time a cronjob for a user hardly
> reaches target Default because at that time it is already reached
> Feb 17 15:55:02 testserver systemd[1163]: Failed to open private bus
> connection: Failed to connect to socket /run/user/48/dbus/user_bus_socket:
> No such file or directory
> Feb 17 15:55:02 testserver systemd[1163]: Starting Default.
> Feb 17 15:55:02 testserver systemd[1163]: Reached target Default.
> Feb 17 15:55:02 testserver systemd[1163]: Startup finished in 4ms.
Those are a bit misleading... You have to look at the PID to know whether it's the
system instance or the user instance. I guess the descriptions should be improved to
say something like "Default system target", and "Default user target".

Comment 44 Colin Guthrie 2014-02-17 15:54:08 UTC

(In reply to Harald Reindl from comment #40)
> the shutdown problem of this bugreport does not happen that often, but given
> that is is unlikely httpd-related it is hardly the way to go change random
> systemd-units working for F18/F19 because breakage in F20/F21/RHEL7

For the record, I'm suggesting the same workaround as is currently used in the httpd service is also applied to the user@.service unit. This has now been done in the update you're testing, but I wanted to make it clear it's not related to the httpd.service file in anyway, other than using the same technique to workaround an overly aggressive killing spree.

Comment 45 Harald Reindl 2014-02-17 16:00:41 UTC

i understand that but we *really* should avoid such workaround in systemd-units not shipped directly with the systemd-package, this will end in a mess and unpredictable behavior all over the distribution

you never know what services somebody is running

if "killmode=mixed" is the solution for a major number and in the best case with no regressions to other services than *that* should be the systemd-default and only packages with exceptions for good reasons should override this 

even if making it systemd-default behavior solves most problems the question if this only masks the "overly aggressive killing spree" root problem instead solve it - given that systemd-upstream claims to do things the right way and all other components around in the distribution needs to be fixed left and right this would just go in the wrong direction

Comment 46 Adam Williamson 2014-02-17 16:46:29 UTC

harald: the point is we can't just take the correct upstream fixes (yet) because they're tied to all the instability that means there is not yet a 209 release. but we can't just sit around and wait for them, because we really need to fix #1043212 *now*.

Comment 47 Harald Reindl 2014-02-17 16:54:46 UTC

@Adam: i understand that *but* i also asked if "killmode=mixed" as default-behavior would solve this instead change random units left and right

> because they're tied to all the instability that means 
> there is not yet a 209 release

yes and that is why i was such a bad person to *ask* if the systemd-developer *may consider* to adjust their development process, for a core-component of in the meantime most distributions it is a unacceptable release versionin 207, 208, 209 and in the meantime any downstream needs cherry picking and backporting like hell instead have upstream point releases like 208.1, 208.2, 208.3 like in the good old days which implies "that is a bugfix-only with no expected new breakage"

well, the answer of Lennart is pretty clear and in my understanding "we are doing release-and-forget and not care about downstream distributions and users around as because that would possibly slown down the development of new features and major changes" 

-------- Original-Nachricht --------
Betreff: Re: [systemd-devel] https://bugzilla.redhat.com/show_bug.cgi?id=1047614
Datum: Wed, 12 Feb 2014 21:19:02 +0100
Von: Lennart Poettering <lennart>
Organisation: Red Hat, Inc.
An: Reindl Harald <h.reindl>
Kopie (CC): Mailing-List systemd <systemd-devel.org>

On Wed, 12.02.14 20:05, Reindl Harald (h.reindl) wrote:

> https://bugzilla.redhat.com/show_bug.cgi?id=1047614
> 
> Product: 	Red Hat Enterprise Linux 7
> Component: 	systemd (Show other bugs)
> Version: 	7.0
> Hardware: 	Unspecified Unspecified
> Priority 	urgent Severity high
> 
> first reported more than 3 months ago
> https://bugzilla.redhat.com/show_bug.cgi?id=1023788
> 
> maybe systemd-upstream should consider slow down development
> and spend more energy in quality and stability

Well, firstly, it's hardly your business how we spend our time.

Secondly, this bug is fixed upstream.

Thirdly, patches count more than complaining.

Comment 48 Adam Williamson 2014-02-17 17:07:26 UTC

For the love of Pete, how many times do I have to say? This is *bugzilla*, not a mailing list or forum. Please focus.

Comment 49 Fedora Update System 2014-02-24 12:33:09 UTC

systemd-208-14.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 50 Tom Horsley 2014-03-03 12:46:40 UTC

systemd-208-14.fc20 does NOT fix this hang.

I installed (and rebooted) it a few days ago when it appeared in updates-testing. I just did a "yum update" today and got a new kernel, so I decided to reboot again, and once again it hung longer than I could stand to wait (I gave it about 5 minutes with no obvious signs of any progress like the disk rattling) with the cylon eyeball message about a stop job running for user XXXX.

In fact, it kept switching back and forth between two different user numbers, one mine, another a different user who had been logged in using NIS during the weekend testing some software on fedora 20.

Comment 51 Mikhail Strizhov 2014-03-03 17:24:17 UTC

+1. Problem still occurs.

Comment 52 Adam Williamson 2014-03-04 07:43:13 UTC

This bug is for user . Not user XXXX. I'm not sure you've ever actually been seeing this bug.

Comment 53 Harald Reindl 2014-03-04 07:57:45 UTC

> This bug is for user . Not user XXXX

and you think the underlying problem is a different?
user@service is a template

rawhides systemd-210 seems to fix a handful of the F20 bugs on the other hand it re-opens https://bugzilla.redhat.com/show_bug.cgi?id=1053315#c31 what leaves in general a bad taste

however, systemd-210 *closes* https://bugzilla.redhat.com/show_bug.cgi?id=1023788 without dirty hacks as well as maybe the side effect https://bugzilla.redhat.com/show_bug.cgi?id=1047614#c9

Comment 54 Adam Williamson 2014-03-04 08:04:50 UTC

"and you think the underlying problem is a different?"

it seems a reasonable guess that the difference between user 0 (root) and user anything else (not root) is significant given the nature of the bug, yes.

Comment 55 Tom Horsley 2014-03-07 02:21:59 UTC

OK, if per-user stop jobs are different, I've created a new Bug 1073714 for them.

Comment 56 RobbieTheK 2019-06-18 14:14:35 UTC

I posted over at a similar sounding issue at https://github.com/systemd/systemd/issues/12702. Should I create a new BZ?

Failed to start User Manager for UID 0.

getent passwd 0
root:x:0:0:root:/root:/bin/bash
Showing this:

systemctl list-units --failed
  UNIT           LOAD   ACTIVE SUB    DESCRIPTION                                                                                                                                                                                                                                               
● user loaded failed failed User Manager for UID 0  
and:

systemctl status user
● user - User Manager for UID 0
   Loaded: loaded (/usr/lib/systemd/system/user@.service; static; vendor preset: disabled)
   Active: failed (Result: protocol) since Tue 2019-06-18 09:57:56 EDT; 9min ago
     Docs: man:user@.service(5)
 Main PID: 22782 (code=exited, status=1/FAILURE)

Jun 18 09:57:56 ourdomain.edu systemd[1]: Starting User Manager for UID 0...
Jun 18 09:57:56 ourdomain.edu systemd[22782]: pam_unix(systemd-user:session): session opened for user root by (uid=0)
Jun 18 09:57:56 ourdomain.edu systemd[22782]: bind(/run/user/0/systemd/notify) failed: Address already in use
Jun 18 09:57:56 ourdomain.edu systemd[22782]: Failed to fully start up daemon: Address already in use
Jun 18 09:57:56 ourdomain.edu systemd[22783]: pam_unix(systemd-user:session): session closed for user root
Jun 18 09:57:56 ourdomain.edu systemd[1]: user: Failed with result 'protocol'.
Jun 18 09:57:56 ourdomain.edu systemd[1]: Failed to start User Manager for UID 0.

Note You need to log in before you can comment on or make changes to this bug.

ahmadsamir3891
amessina
andriy.ushakov
antonio.montagnani
atorkhov
awilliam
bruno
colin
cpanceac
hebert.soares
horsley1953
h.reindl
humenug
johannbg
lars
lnykryn
msekleta
pbonzini
platerx
plautrba
rdieter
rkudyba
samuel-rhbugs
Simon.Gerhards
strizhov
systemd-maint
theo148
vpavlin
weiw333
zbyszek
zing