Bug 1847437 - user@#.service goes into failed status with timeout
Summary: user@#.service goes into failed status with timeout
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: systemd
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.0
Assignee: David Tardon
QA Contact: Frantisek Sumsal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-16 11:33 UTC by alex
Modified: 2020-11-11 11:42 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-03 11:15:06 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
output of strace on user manager process (86.45 KB, text/plain)
2020-11-11 11:42 UTC, Christian Bartolomäus

Description alex 2020-06-16 11:33:56 UTC
Description of problem:

The systemd user service (user@.service) goes into failed status.

Version-Release number of selected component (if applicable):

systemd-239-30.el8_2.x86_64

How reproducible:

Steps to Reproduce:

1. $ sudo systemctl status user

Actual results:

● user - User Manager for UID 1000
   Loaded: loaded (/usr/lib/systemd/system/user@.service; static; vendor preset: disabled)
   Active: failed (Result: timeout) since Tue 2020-06-16 11:02:04 UTC; 25min ago
  Process: 1202 ExecStart=/usr/lib/systemd/systemd --user (code=exited, status=0/SUCCESS)
 Main PID: 1202 (code=exited, status=0/SUCCESS)
   Status: "Startup finished in 27ms."

Jun 16 11:00:04 system.example.com systemd[1202]: Stopped target Paths.
Jun 16 11:00:04 system.example.com systemd[1202]: Stopped target Timers.
Jun 16 11:00:04 system.example.com systemd[1202]: Stopped Mark boot as successful after the user session has run 2 minutes.
Jun 16 11:00:04 system.example.com systemd[1202]: Stopped target Sockets.
Jun 16 11:00:04 system.example.com systemd[1202]: Closed D-Bus User Message Bus Socket.
Jun 16 11:00:04 system.example.com systemd[1202]: Reached target Shutdown.
Jun 16 11:00:04 system.example.com systemd[1202]: Starting Exit the Session...
Jun 16 11:02:04 system.example.com systemd[1]: user: Processes still around after final SIGKILL. Entering failed mode.
Jun 16 11:02:04 system.example.com systemd[1]: user: Failed with result 'timeout'.
Jun 16 11:02:04 system.example.com systemd[1]: Stopped User Manager for UID 1000.

Expected results:

No systemd failed units.

Additional info:

This is a system joined to a FreeIPA domain, but user 1000 is a local, non-IPA user.

If I restart the service, it stays in "active (running)" status for a while, but eventually fails again.
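
A rough sketch of that restart sequence (the user@1000.service instance name is assumed from the status output above):

  # clear the failed state and restart the per-user manager for UID 1000
  sudo systemctl reset-failed user@1000.service
  sudo systemctl restart user@1000.service
  systemctl status user@1000.service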

Comment 1 alex 2020-06-16 14:05:51 UTC
Further observations:

* IPA users in other hosts seem to be affected too
* This seems to be flapping; with no intervention, those services can go to failed and then back to OK

Comment 2 David Tardon 2020-06-17 09:13:48 UTC
(In reply to alex from comment #0)
> Jun 16 11:02:04 system.example.com systemd[1]: user: Processes
> still around after final SIGKILL. Entering failed mode.
> Jun 16 11:02:04 system.example.com systemd[1]: user: Failed
> with result 'timeout'.

It says right there why the unit failed: there are processes running for user 1000 that couldn't be killed.

Comment 3 alex 2020-06-17 18:56:02 UTC
Hi,

Sorry for not providing more information.

This issue presented itself running CentOS 8 inside an LXC container. I've been able to reproduce it on a freshly provisioned container by doing the following:

1. adduser foo
2. sudo -u foo crontab -e
3. Set up a cron to be executed every 10 minutes that sleeps for a minute
4. watch systemctl status user
5. eventually, after the cron runs, and exits, something like:

Jun 17 18:51:01 systemdtest2.h2.int.pdp7.net systemd[1]: user: Killing process 1187 (systemctl) with signal SIGKILL.

will be printed, but if I check immediately for PID 1187 I don't see anything (nor on the system running the LXC container).

Two minutes after that, you see:

Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user: Processes still around after final SIGKILL. Entering failed mode.
Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user: Failed with result 'timeout'.

I suppose LXC won't be a supported configuration, but if you could provide some clues as to why this started happening after updating to CentOS 8.2, that might help me fix the issue on the LXC host...
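
As a rough sketch, the cron job from steps 2-3 above (schedule and sleep duration assumed from the description) could look like this in the test user's crontab:

  # runs every 10 minutes; each run sleeps for one minute
  */10 * * * * sleep 60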

Cheers,

Álex

Comment 4 alex 2020-06-17 18:56:37 UTC
(note, doing so on a VM presents no problem, of course)

Comment 5 David Tardon 2020-07-29 07:03:41 UTC
(In reply to alex from comment #3)
> Hi,
> 
> Sorry for not providing more information.
> 
> This issue presented itself running CentOS 8 inside an LXC container. I've
> been able to reproduce it on a freshly provisioned container by doing the
> following:
> 
> 1. adduser foo
> 2. sudo -u foo crontab -e
> 3. Set up a cron to be executed every 10 minutes that sleeps for a minute
> 4. watch systemctl status user
> 5. eventually, after the cron runs, and exits, something like:
> 
> Jun 17 18:51:01 systemdtest2.h2.int.pdp7.net systemd[1]: user:
> Killing process 1187 (systemctl) with signal SIGKILL.

This is likely bug 1860899. It's harmless.

> will be printed, but if I check immediately for PID 1187 I don't see
> anything (nor on the system running the LXC container)
> 
> Two minutes after that, you see:
> 
> Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user:
> Processes still around after final SIGKILL. Entering failed mode.
> Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user:
> Failed with result 'timeout'.

You can run

# systemd-cgls -u user-1001.slice

to see what processes are still alive in the user slice.
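
The status of the slice shows the same control-group tree, so (with the same example UID) the following should give a similar view:

# systemctl status user-1001.slice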

Comment 6 alex 2020-07-29 09:28:32 UTC
Thanks for the info. If I have some time I will play with it.

I added myself to the CC list of the other bug; if you are reasonably certain that's the root cause, feel free to close this one. If the other bug is closed I'll recheck.


The impact here is that I like to monitor for failed systemd units; this is useful because it's very easy to set up and it catches a surprising amount of problems, but on the other hand it surfaces lots of small issues which otherwise are not important. Anyway, I've disabled monitoring of user slices until this bug (or the other one) is fixed.

Comment 7 David Tardon 2020-08-03 11:15:06 UTC
(In reply to alex from comment #6)
> Thanks for the info. If I have some time I will play with it.
> 
> I added myself to the cc of the other bug- if you are reasonably certain
> that's the root cause, feel free to close this one. If the other bug is
> closed I'll recheck.

No. Bug 1860899 only causes that "Killing process ..." message. It doesn't have any other effect. In particular, it cannot cause failure of user@.service instances.

Otherwise, I still think there's no systemd bug here... Either there really is some process that couldn't be killed (perhaps because it hangs on I/O), which you can check with the systemd-cgls command I mentioned in comment 5; or there is none, in which case this is something in LXC or cgroups.
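
One way to spot processes stuck in uninterruptible I/O (state "D" in ps output; such processes cannot be killed until the I/O finishes) could be something like:

  # list processes in uninterruptible sleep ("D" in the STAT column)
  ps -eo pid,user,stat,wchan:32,comm | awk 'NR==1 || $3 ~ /^D/'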

Comment 8 Christian Bartolomäus 2020-11-11 11:41:07 UTC
Hi,

I'm pretty sure I'm hitting the same problem. I did some research and it *looks* like a systemd problem to me. But of course, I might very well be wrong there ...

First, I'm able to reproduce the problem (consistently) like this:
1. Create LXC container with vanilla CentOS 8.1.1911
2. Install and start openssh-server
3. Add test user (I did 'useradd --uid 1000 testuser') and enable SSH login for that user
4. Update systemd (from systemd-239-18.el8_1.5 to 239-31.el8_2.2) -- otherwise there is no problem
5. Log in with new user via SSH (in a second shell)
6. Exit from SSH session
7. Watch the log messages in /var/log/messages

[The 120 seconds can be changed if a different value for TimeoutStopSec is set in the unit file /lib/systemd/system/user@.service.]
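
For testing, a drop-in override is the usual way to change that timeout without editing the packaged unit file; the file name and value below are only examples:

  # hypothetical drop-in shortening the stop timeout for all user@.service instances (run as root)
  mkdir -p /etc/systemd/system/user@.service.d
  printf '[Service]\nTimeoutStopSec=30s\n' > /etc/systemd/system/user@.service.d/timeout.conf
  systemctl daemon-reload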

I've run the mentioned command after step 6. It looks like there is some leftover (not a complete session):

  $ systemd-cgls -u user-1000.slice
  Unit user-1000.slice (/user.slice/user-1000.slice):
  $

Also, there is stuff below /run/user/1000 and /sys/fs/cgroup/systemd/user.slice/user-1000.slice/user\@1000.service/:

  $ ls -la /run/user/1000
  total 0
  drwx------ 3 testuser testuser 80 Oct  9 10:11 .
  drwxr-xr-x 3 root     root     60 Oct  9 10:11 ..
  srw-rw-rw- 1 testuser testuser  0 Oct  9 10:11 bus
  drwxr-xr-x 2 testuser testuser 80 Oct  9 10:11 systemd
  $ ls -l /sys/fs/cgroup/systemd/user.slice/user-1000.slice/user\@1000.service/
  total 0
  -rw-r--r-- 1 testuser testuser 0 Nov 11 08:06 cgroup.clone_children
  -rw-r--r-- 1 testuser testuser 0 Nov 11 08:06 cgroup.procs
  drwxr-xr-x 2 testuser testuser 0 Nov 11 08:06 init.scope
  -rw-r--r-- 1 root     root     0 Nov 11 08:06 notify_on_release
  -rw-r--r-- 1 testuser testuser 0 Nov 11 08:06 tasks

All these files were removed after the configured timeout (120 seconds).

What I also did was run strace on the user manager process for the SSH session (attached before step 6):

  $ ps auxH | grep '[s]ystemd --user'
  testuser   176  0.0  1.7  93596  9408 ?        Ss   08:06   0:00 /usr/lib/systemd/systemd --user
  $ strace -p 176  ## after attaching, testuser typed 'exit' in their SSH session
  [...]

I'll attach the output from that strace command, but the following line looked interesting:

  rmdir("/sys/fs/cgroup/systemd/user.slice/user-1000.slice/user/init.scope") = -1 EBUSY (Device or resource busy)

After digging through some issues over at https://github.com/systemd/systemd/issues I found this recently merged PR, which *might* be relevant: https://github.com/systemd/systemd/pull/17422

Comment 9 Christian Bartolomäus 2020-11-11 11:42:40 UTC
Created attachment 1728300 [details]
output of strace on user manager process

