Bug 1847437 - user@#.service goes into failed status with timeout
Summary: user@#.service goes into failed status with timeout
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: systemd
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.0
Assignee: David Tardon
QA Contact: Frantisek Sumsal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-16 11:33 UTC by alex
Modified: 2020-11-11 11:42 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-03 11:15:06 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
output of strace on user manager process (86.45 KB, text/plain)
2020-11-11 11:42 UTC, Christian Bartolomäus

Description alex 2020-06-16 11:33:56 UTC
Description of problem:

The systemd user service (user@.service) goes into failed status.

Version-Release number of selected component (if applicable):

systemd-239-30.el8_2.x86_64

How reproducible:

Steps to Reproduce:

1. $ sudo systemctl status user

Actual results:

● user - User Manager for UID 1000
   Loaded: loaded (/usr/lib/systemd/system/user@.service; static; vendor preset: disabled)
   Active: failed (Result: timeout) since Tue 2020-06-16 11:02:04 UTC; 25min ago
  Process: 1202 ExecStart=/usr/lib/systemd/systemd --user (code=exited, status=0/SUCCESS)
 Main PID: 1202 (code=exited, status=0/SUCCESS)
   Status: "Startup finished in 27ms."

Jun 16 11:00:04 system.example.com systemd[1202]: Stopped target Paths.
Jun 16 11:00:04 system.example.com systemd[1202]: Stopped target Timers.
Jun 16 11:00:04 system.example.com systemd[1202]: Stopped Mark boot as successful after the user session has run 2 minutes.
Jun 16 11:00:04 system.example.com systemd[1202]: Stopped target Sockets.
Jun 16 11:00:04 system.example.com systemd[1202]: Closed D-Bus User Message Bus Socket.
Jun 16 11:00:04 system.example.com systemd[1202]: Reached target Shutdown.
Jun 16 11:00:04 system.example.com systemd[1202]: Starting Exit the Session...
Jun 16 11:02:04 system.example.com systemd[1]: user: Processes still around after final SIGKILL. Entering failed mode.
Jun 16 11:02:04 system.example.com systemd[1]: user: Failed with result 'timeout'.
Jun 16 11:02:04 system.example.com systemd[1]: Stopped User Manager for UID 1000.

Expected results:

No systemd failed units.

Additional info:

This is a system joined to a FreeIPA domain, but user 1000 is a local, non-IPA user.

If I restart the service, it stays in "active (running)" status for a while, but eventually fails again.
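
A rough sketch of that restart sequence (the user@1000.service instance name is assumed from the status output above):

  # clear the failed state and restart the per-user manager for UID 1000
  sudo systemctl reset-failed user@1000.service
  sudo systemctl restart user@1000.service
  systemctl status user@1000.service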

Comment 1 alex 2020-06-16 14:05:51 UTC
Further observations:

* IPA users in other hosts seem to be affected too
* This seems to be flapping; with no intervention, those services can go to failed and then back to OK

Comment 2 David Tardon 2020-06-17 09:13:48 UTC
(In reply to alex from comment #0)
> Jun 16 11:02:04 system.example.com systemd[1]: user: Processes
> still around after final SIGKILL. Entering failed mode.
> Jun 16 11:02:04 system.example.com systemd[1]: user: Failed
> with result 'timeout'.

It says right there why the unit failed: there are processes running for user 1000 that couldn't be killed.

Comment 3 alex 2020-06-17 18:56:02 UTC
Hi,

Sorry for not providing more information.

This issue presented itself running CentOS 8 inside an LXC container. I've been able to reproduce it on a freshly provisioned container by doing the following:

1. adduser foo
2. sudo -u foo crontab -e
3. Set up a cron to be executed every 10 minutes that sleeps for a minute
4. watch systemctl status user
5. eventually, after the cron runs, and exits, something like:

Jun 17 18:51:01 systemdtest2.h2.int.pdp7.net systemd[1]: user: Killing process 1187 (systemctl) with signal SIGKILL.

will be printed, but if I check immediately for PID 1187 I don't see anything (nor on the system running the LXC container).

Two minutes after that, you see:

Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user: Processes still around after final SIGKILL. Entering failed mode.
Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user: Failed with result 'timeout'.

I suppose LXC won't be a supported configuration, but if you could provide some clues as to why this started happening after updating to CentOS 8.2, that might help me fix the issue on the LXC host...
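
As a rough sketch, the cron job from steps 2-3 above (schedule and sleep duration assumed from the description) could look like this in the test user's crontab:

  # runs every 10 minutes; each run sleeps for one minute
  */10 * * * * sleep 60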

Cheers,

Álex

Comment 4 alex 2020-06-17 18:56:37 UTC
(note, doing so on a VM presents no problem, of course)

Comment 5 David Tardon 2020-07-29 07:03:41 UTC
(In reply to alex from comment #3)
> Hi,
> 
> Sorry for not providing more information.
> 
> This issue presented itself running CentOS 8 inside an LXC container. I've
> been able to reproduce it on a freshly provisioned container by doing the
> following:
> 
> 1. adduser foo
> 2. sudo -u foo crontab -e
> 3. Set up a cron to be executed every 10 minutes that sleeps for a minute
> 4. watch systemctl status user
> 5. eventually, after the cron runs, and exits, something like:
> 
> Jun 17 18:51:01 systemdtest2.h2.int.pdp7.net systemd[1]: user:
> Killing process 1187 (systemctl) with signal SIGKILL.

This is likely bug 1860899. It's harmless.

> will be printed, but if I check immediately for PID 1187 I don't see
> anything (nor on the system running the LXC container)
> 
> Two minutes after that, you see:
> 
> Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user:
> Processes still around after final SIGKILL. Entering failed mode.
> Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user:
> Failed with result 'timeout'.

You can run

# systemd-cgls -u user-1001.slice

to see what processes are still alive in the user slice.
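
The status of the slice shows the same control-group tree, so (with the same example UID) the following should give a similar view:

# systemctl status user-1001.slice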

Comment 6 alex 2020-07-29 09:28:32 UTC
Thanks for the info. If I have some time I will play with it.

I added myself to the CC list of the other bug; if you are reasonably certain that's the root cause, feel free to close this one. If the other bug is closed I'll recheck.


The impact here is that I like to monitor for failed systemd units; this is useful because it's very easy to set up and it catches a surprising amount of problems, but on the other hand it surfaces lots of small issues which otherwise are not important. Anyway, I've disabled monitoring of user slices until this bug (or the other one) is fixed.

Comment 7 David Tardon 2020-08-03 11:15:06 UTC
(In reply to alex from comment #6)
> Thanks for the info. If I have some time I will play with it.
> 
> I added myself to the cc of the other bug- if you are reasonably certain
> that's the root cause, feel free to close this one. If the other bug is
> closed I'll recheck.

No. Bug 1860899 only causes that "Killing process ..." message. It doesn't have any other effect. In particular, it cannot cause failure of user@.service instances.

Otherwise, I still think there's no systemd bug here... Either there really is some process that couldn't be killed (perhaps because it hangs on I/O), which you can check with the systemd-cgls command I mentioned in comment 5; or there is none, in which case this is something in LXC or cgroups.
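
One way to spot processes stuck in uninterruptible I/O (state "D" in ps output; such processes cannot be killed until the I/O finishes) could be something like:

  # list processes in uninterruptible sleep ("D" in the STAT column)
  ps -eo pid,user,stat,wchan:32,comm | awk 'NR==1 || $3 ~ /^D/'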

Comment 8 Christian Bartolomäus 2020-11-11 11:41:07 UTC
Hi,

I'm pretty sure I'm hitting the same problem. I did some research and it *looks* like a systemd problem to me. But of course, I might very well be wrong there ...

First, I'm able to reproduce the problem (consistently) like this:
1. Create LXC container with vanilla CentOS 8.1.1911
2. Install and start openssh-server
3. Add test user (I did 'useradd --uid 1000 testuser') and enable SSH login for that user
4. Update systemd (from systemd-239-18.el8_1.5 to 239-31.el8_2.2) -- otherwise there is no problem
5. Log in with new user via SSH (in a second shell)
6. Exit from SSH session
7. Watch the log messages in /var/log/messages

[The 120 seconds can be changed if a different value for TimeoutStopSec is set in the unit file /lib/systemd/system/user@.service.]
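
For testing, a drop-in override is the usual way to change that timeout without editing the packaged unit file; the file name and value below are only examples:

  # hypothetical drop-in shortening the stop timeout for all user@.service instances (run as root)
  mkdir -p /etc/systemd/system/user@.service.d
  printf '[Service]\nTimeoutStopSec=30s\n' > /etc/systemd/system/user@.service.d/timeout.conf
  systemctl daemon-reload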

I've run the mentioned command after step 6. It looks like there is some leftover (not a complete session):

  $ systemd-cgls -u user-1000.slice
  Unit user-1000.slice (/user.slice/user-1000.slice):
  $

Also, there is stuff below /run/user/1000 and /sys/fs/cgroup/systemd/user.slice/user-1000.slice/user\@1000.service/:

  $ ls -la /run/user/1000
  total 0
  drwx------ 3 testuser testuser 80 Oct  9 10:11 .
  drwxr-xr-x 3 root     root     60 Oct  9 10:11 ..
  srw-rw-rw- 1 testuser testuser  0 Oct  9 10:11 bus
  drwxr-xr-x 2 testuser testuser 80 Oct  9 10:11 systemd
  $ ls -l /sys/fs/cgroup/systemd/user.slice/user-1000.slice/user\@1000.service/
  total 0
  -rw-r--r-- 1 testuser testuser 0 Nov 11 08:06 cgroup.clone_children
  -rw-r--r-- 1 testuser testuser 0 Nov 11 08:06 cgroup.procs
  drwxr-xr-x 2 testuser testuser 0 Nov 11 08:06 init.scope
  -rw-r--r-- 1 root     root     0 Nov 11 08:06 notify_on_release
  -rw-r--r-- 1 testuser testuser 0 Nov 11 08:06 tasks

All these files were removed after the configured timeout (120 seconds).

What I also did was run strace on the user manager process for the SSH session (attached before step 6):

  $ ps auxH | grep '[s]ystemd --user'
  testuser   176  0.0  1.7  93596  9408 ?        Ss   08:06   0:00 /usr/lib/systemd/systemd --user
  $ strace -p 176  ## after attaching, testuser typed 'exit' in their SSH session
  [...]

I'll attach the output from that strace command, but the following line looked interesting:

  rmdir("/sys/fs/cgroup/systemd/user.slice/user-1000.slice/user/init.scope") = -1 EBUSY (Device or resource busy)

After digging through some issues over at https://github.com/systemd/systemd/issues I found this recently merged PR, which *might* be relevant: https://github.com/systemd/systemd/pull/17422

Comment 9 Christian Bartolomäus 2020-11-11 11:42:40 UTC
Created attachment 1728300 [details]
output of strace on user manager process

