Bug 1847437
Summary: | user@#.service goes into failed status with timeout | |
---|---|---|---
Product: | Red Hat Enterprise Linux 8 | Reporter: | alex
Component: | systemd | Assignee: | David Tardon <dtardon>
Status: | CLOSED NOTABUG | QA Contact: | Frantisek Sumsal <fsumsal>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 8.2 | CC: | alex, bartolin, dtardon, msekleta, systemd-maint-list
Target Milestone: | rc | Keywords: | Reopened
Target Release: | 8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-08-03 11:15:06 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | output of strace on user manager process (attachment 1728300) | |
Description
alex
2020-06-16 11:33:56 UTC
Further observations:

* IPA users on other hosts seem to be affected too
* This seems to be flapping; with no action those services can jump to failed and then back to OK

(In reply to alex from comment #0)
> Jun 16 11:02:04 system.example.com systemd[1]: user: Processes still around after final SIGKILL. Entering failed mode.
> Jun 16 11:02:04 system.example.com systemd[1]: user: Failed with result 'timeout'.

It says right there why the unit failed: there are processes running for user 1000 that couldn't be killed.

Hi,

Sorry for not providing more information.

This issue presented itself running CentOS 8 inside an LXC container. I've been able to reproduce it on a freshly provisioned container by doing the following:

1. adduser foo
2. sudo -u foo crontab -e
3. Set up a cron job to be executed every 10 minutes that sleeps for a minute
4. watch systemctl status user
5. Eventually, after the cron job runs and exits, something like:

Jun 17 18:51:01 systemdtest2.h2.int.pdp7.net systemd[1]: user: Killing process 1187 (systemctl) with signal SIGKILL.

will be printed, but if I check immediately for PID 1187 I don't see anything (nor on the system running the LXC container).

Two minutes after that, you see:

Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user: Processes still around after final SIGKILL. Entering failed mode.
Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user: Failed with result 'timeout'.

I suppose LXC won't be a supported configuration, but if you could provide some clues as to why this started happening after updating to CentOS 8.2, that might help me fix the issue with the LXC host...

Cheers,
Álex

(Note: doing the same on a VM presents no problem, of course.)

(In reply to alex from comment #3)
> Hi,
>
> Sorry for not providing more information.
>
> This issue presented itself running CentOS 8 inside an LXC container. I've
> been able to reproduce it on a freshly provisioned container by doing the
> following:
>
> 1. adduser foo
> 2. sudo -u foo crontab -e
> 3. Set up a cron job to be executed every 10 minutes that sleeps for a minute
> 4. watch systemctl status user
> 5. Eventually, after the cron job runs and exits, something like:
>
> Jun 17 18:51:01 systemdtest2.h2.int.pdp7.net systemd[1]: user:
> Killing process 1187 (systemctl) with signal SIGKILL.

This is likely bug 1860899. It's harmless.

> will be printed, but if I check immediately for PID 1187 I don't see
> anything (nor on the system running the LXC container)
>
> Two minutes after that, you see:
>
> Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user:
> Processes still around after final SIGKILL. Entering failed mode.
> Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user:
> Failed with result 'timeout'.

You can run

# systemd-cgls -u user-1001.slice

to see what processes are still alive in the user slice.

Thanks for the info. If I have some time I will play with it.

I added myself to the CC of the other bug; if you are reasonably certain that's the root cause, feel free to close this one. If the other bug is closed, I'll recheck.

The impact for this is that I like to monitor failed systemd units. This is useful because it's very easy to set up and it actually catches a surprising amount of stuff... but on the other hand it surfaces lots of small issues which otherwise are not important. Anyway, I disabled monitoring of user slices until this bug (or the other one) is fixed.
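For reference, a minimal version of such a failed-unit check could look like the following (the exact monitoring setup isn't described in this report, and user@1000.service is only an example unit):

$ systemctl --failed --no-legend --plain

which lists all units currently in the failed state, or, for a single unit:

$ systemctl is-failed user@1000.service

which prints the unit's state and exits with status 0 when the unit is failed.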
(In reply to alex from comment #6)
> Thanks for the info. If I have some time I will play with it.
>
> I added myself to the CC of the other bug; if you are reasonably certain
> that's the root cause, feel free to close this one. If the other bug is
> closed, I'll recheck.

No. Bug 1860899 only causes that "Killing process ..." message. It doesn't have any other effect. In particular, it cannot cause failure of user@.service instances.

Otherwise, I still think there's no systemd bug here... Either there really is some process that couldn't be killed (perhaps because it hangs on I/O), which you can check with the systemd-cgls command I mentioned in comment 5; or there is none, in which case this is something in LXC or cgroups.
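For example (assuming UID 1000, the cgroup v1 layout shown elsewhere in this report, and a made-up PID), the leftover processes, if any, can also be read directly from the cgroup hierarchy:

# find /sys/fs/cgroup/systemd/user.slice/user-1000.slice -name cgroup.procs -exec cat {} +

prints every PID still accounted to the slice or any of its child cgroups, and

# ps -o pid,stat,wchan:32,cmd -p 1234

shows whether such a process is stuck in uninterruptible sleep (a 'D' in the STAT column, typically hanging on I/O).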
Hi, I'm pretty sure I'm hitting the same problem. I did some research and it *looks* like a systemd problem to me. But of course, I might very well be wrong there ...

First, I'm able to reproduce the problem (consistently) like this:

1. Create an LXC container with vanilla CentOS 8.1.1911
2. Install and start openssh-server
3. Add a test user (I did 'useradd --uid 1000 testuser') and enable SSH login for that user
4. Update systemd (from systemd-239-18.el8_1.5 to 239-31.el8_2.2) -- otherwise there is no problem
5. Log in with the new user via SSH (in a second shell)
6. Exit from the SSH session
7. Watch the log messages in /var/log/messages

[The 120 seconds can be changed if a different value for TimeoutStopSec is set in the unit file /lib/systemd/system/user@.service.]

I've run the mentioned command after step 6. It looks like there is some leftover (not a complete session):

$ systemd-cgls -u user-1000.slice
Unit user-1000.slice (/user.slice/user-1000.slice):
$

Also, there is stuff below /run/user/1000 and /sys/fs/cgroup/systemd/user.slice/user-1000.slice/user\@1000.service/:

$ ls -la /run/user/1000
total 0
drwx------ 3 testuser testuser 80 Oct 9 10:11 .
drwxr-xr-x 3 root root 60 Oct 9 10:11 ..
srw-rw-rw- 1 testuser testuser 0 Oct 9 10:11 bus
drwxr-xr-x 2 testuser testuser 80 Oct 9 10:11 systemd

$ ls -l /sys/fs/cgroup/systemd/user.slice/user-1000.slice/user\@1000.service/
total 0
-rw-r--r-- 1 testuser testuser 0 Nov 11 08:06 cgroup.clone_children
-rw-r--r-- 1 testuser testuser 0 Nov 11 08:06 cgroup.procs
drwxr-xr-x 2 testuser testuser 0 Nov 11 08:06 init.scope
-rw-r--r-- 1 root root 0 Nov 11 08:06 notify_on_release
-rw-r--r-- 1 testuser testuser 0 Nov 11 08:06 tasks

All these files were removed after the configured timeout (120 seconds).

What I also did was run strace on the user manager process for the SSH session (before step 6):

$ ps auxH | grep '[s]ystemd --user'
testuser 176 0.0 1.7 93596 9408 ? Ss 08:06 0:00 /usr/lib/systemd/systemd --user
$ strace -p 176
## after attaching, testuser typed 'exit' in their SSH session
[...]

I'll attach the output from that strace command, but the following line looked interesting:

rmdir("/sys/fs/cgroup/systemd/user.slice/user-1000.slice/user/init.scope") = -1 EBUSY (Device or resource busy)

After digging through some issues over at https://github.com/systemd/systemd/issues I found this recently merged PR, which *might* be relevant: https://github.com/systemd/systemd/pull/17422

Created attachment 1728300 [details]
output of strace on user manager process
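As an aside on the TimeoutStopSec mentioned above: a drop-in override (rather than editing the packaged /lib/systemd/system/user@.service directly) would look roughly like this; the 30-second value is only an example:

# mkdir -p /etc/systemd/system/user@.service.d
# cat > /etc/systemd/system/user@.service.d/timeout.conf << 'EOF'
[Service]
TimeoutStopSec=30
EOF
# systemctl daemon-reload

This applies to all user@.service instances and only shortens how long systemd waits before giving up; it doesn't change why the processes stay around.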