Bug 1847437
Summary: | user@#.service goes into failed status with timeout | |
---|---|---|---
Product: | Red Hat Enterprise Linux 8 | Reporter: | alex
Component: | systemd | Assignee: | David Tardon <dtardon>
Status: | CLOSED NOTABUG | QA Contact: | Frantisek Sumsal <fsumsal>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 8.2 | CC: | alex, bartolin, dtardon, msekleta, systemd-maint-list
Target Milestone: | rc | Keywords: | Reopened
Target Release: | 8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-08-03 11:15:06 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | output of strace on user manager process (attachment 1728300) | |
Description
alex
2020-06-16 11:33:56 UTC
Further observations:

* IPA users on other hosts seem to be affected too
* This seems to be flapping; with no action those services can jump to failed and then back to OK

(In reply to alex from comment #0)
> Jun 16 11:02:04 system.example.com systemd[1]: user: Processes still around after final SIGKILL. Entering failed mode.
> Jun 16 11:02:04 system.example.com systemd[1]: user: Failed with result 'timeout'.

It says right there why the unit failed: there are processes running for user 1000 that couldn't be killed.

Hi,

Sorry for not providing more information.

This issue presented itself running CentOS 8 inside an LXC container. I've been able to reproduce it on a freshly provisioned container by doing the following:

1. adduser foo
2. sudo -u foo crontab -e
3. Set up a cron job to be executed every 10 minutes that sleeps for a minute
4. watch systemctl status user
5. Eventually, after the cron job runs and exits, something like:

Jun 17 18:51:01 systemdtest2.h2.int.pdp7.net systemd[1]: user: Killing process 1187 (systemctl) with signal SIGKILL.

will be printed, but if I check immediately for PID 1187 I don't see anything (nor on the system running the LXC container).

Two minutes after that, you see:

Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user: Processes still around after final SIGKILL. Entering failed mode.
Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user: Failed with result 'timeout'.

I suppose LXC won't be a supported configuration, but if you could provide some clues as to why this started happening after updating to CentOS 8.2, that might help me fix the issue with the LXC host...

Cheers,
Álex

(Note: doing the same on a VM presents no problem, of course.)

(In reply to alex from comment #3)
> Hi,
>
> Sorry for not providing more information.
>
> This issue presented itself running CentOS 8 inside an LXC container. I've
> been able to reproduce it on a freshly provisioned container by doing the
> following:
>
> 1. adduser foo
> 2. sudo -u foo crontab -e
> 3. Set up a cron job to be executed every 10 minutes that sleeps for a minute
> 4. watch systemctl status user
> 5. Eventually, after the cron job runs and exits, something like:
>
> Jun 17 18:51:01 systemdtest2.h2.int.pdp7.net systemd[1]: user:
> Killing process 1187 (systemctl) with signal SIGKILL.

This is likely bug 1860899. It's harmless.

> will be printed, but if I check immediately for PID 1187 I don't see
> anything (nor on the system running the LXC container)
>
> Two minutes after that, you see:
>
> Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user:
> Processes still around after final SIGKILL. Entering failed mode.
> Jun 17 18:53:01 systemdtest2.h2.int.pdp7.net systemd[1]: user:
> Failed with result 'timeout'.

You can run

# systemd-cgls -u user-1001.slice

to see what processes are still alive in the user slice.

Thanks for the info. If I have some time I will play with it.

I added myself to the CC of the other bug; if you are reasonably certain that's the root cause, feel free to close this one. If the other bug is closed, I'll recheck.

The impact for this is that I like to monitor failed systemd units. This is useful because it's very easy to set up and it actually catches a surprising amount of stuff... but on the other hand it surfaces lots of small issues which otherwise are not important. Anyway, I disabled monitoring of user slices until this bug (or the other one) is fixed.
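For reference, a minimal version of such a failed-unit check could look like the following (the exact monitoring setup isn't described in this report, and user@1000.service is only an example unit):

$ systemctl --failed --no-legend --plain

which lists all units currently in the failed state, or, for a single unit:

$ systemctl is-failed user@1000.service

which prints the unit's state and exits with status 0 when the unit is failed.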
(In reply to alex from comment #6)
> Thanks for the info. If I have some time I will play with it.
>
> I added myself to the CC of the other bug; if you are reasonably certain
> that's the root cause, feel free to close this one. If the other bug is
> closed, I'll recheck.

No. Bug 1860899 only causes that "Killing process ..." message. It doesn't have any other effect. In particular, it cannot cause failure of user@.service instances.

Otherwise, I still think there's no systemd bug here... Either there really is some process that couldn't be killed (perhaps because it hangs on I/O), which you can check with the systemd-cgls command I mentioned in comment 5; or there is none, in which case this is something in LXC or cgroups.
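For example (assuming UID 1000, the cgroup v1 layout shown elsewhere in this report, and a made-up PID), the leftover processes, if any, can also be read directly from the cgroup hierarchy:

# find /sys/fs/cgroup/systemd/user.slice/user-1000.slice -name cgroup.procs -exec cat {} +

prints every PID still accounted to the slice or any of its child cgroups, and

# ps -o pid,stat,wchan:32,cmd -p 1234

shows whether such a process is stuck in uninterruptible sleep (a 'D' in the STAT column, typically hanging on I/O).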
Hi, I'm pretty sure I'm hitting the same problem. I did some research and it *looks* like a systemd problem to me. But of course, I might very well be wrong there ...

First, I'm able to reproduce the problem (consistently) like this:

1. Create an LXC container with vanilla CentOS 8.1.1911
2. Install and start openssh-server
3. Add a test user (I did 'useradd --uid 1000 testuser') and enable SSH login for that user
4. Update systemd (from systemd-239-18.el8_1.5 to 239-31.el8_2.2) -- otherwise there is no problem
5. Log in with the new user via SSH (in a second shell)
6. Exit from the SSH session
7. Watch the log messages in /var/log/messages

[The 120 seconds can be changed if a different value for TimeoutStopSec is set in the unit file /lib/systemd/system/user@.service.]

I've run the mentioned command after step 6. It looks like there is some leftover (not a complete session):

$ systemd-cgls -u user-1000.slice
Unit user-1000.slice (/user.slice/user-1000.slice):
$

Also, there is stuff below /run/user/1000 and /sys/fs/cgroup/systemd/user.slice/user-1000.slice/user\@1000.service/:

$ ls -la /run/user/1000
total 0
drwx------ 3 testuser testuser 80 Oct 9 10:11 .
drwxr-xr-x 3 root root 60 Oct 9 10:11 ..
srw-rw-rw- 1 testuser testuser 0 Oct 9 10:11 bus
drwxr-xr-x 2 testuser testuser 80 Oct 9 10:11 systemd

$ ls -l /sys/fs/cgroup/systemd/user.slice/user-1000.slice/user\@1000.service/
total 0
-rw-r--r-- 1 testuser testuser 0 Nov 11 08:06 cgroup.clone_children
-rw-r--r-- 1 testuser testuser 0 Nov 11 08:06 cgroup.procs
drwxr-xr-x 2 testuser testuser 0 Nov 11 08:06 init.scope
-rw-r--r-- 1 root root 0 Nov 11 08:06 notify_on_release
-rw-r--r-- 1 testuser testuser 0 Nov 11 08:06 tasks

All these files were removed after the configured timeout (120 seconds).

What I also did was run strace on the user manager process for the SSH session (before step 6):

$ ps auxH | grep '[s]ystemd --user'
testuser 176 0.0 1.7 93596 9408 ? Ss 08:06 0:00 /usr/lib/systemd/systemd --user
$ strace -p 176
## after attaching, testuser typed 'exit' in their SSH session
[...]

I'll attach the output from that strace command, but the following line looked interesting:

rmdir("/sys/fs/cgroup/systemd/user.slice/user-1000.slice/user/init.scope") = -1 EBUSY (Device or resource busy)

After digging through some issues over at https://github.com/systemd/systemd/issues I found this recently merged PR, which *might* be relevant: https://github.com/systemd/systemd/pull/17422

Created attachment 1728300 [details]
output of strace on user manager process
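As an aside on the TimeoutStopSec mentioned above: a drop-in override (rather than editing the packaged /lib/systemd/system/user@.service directly) would look roughly like this; the 30-second value is only an example:

# mkdir -p /etc/systemd/system/user@.service.d
# cat > /etc/systemd/system/user@.service.d/timeout.conf << 'EOF'
[Service]
TimeoutStopSec=30
EOF
# systemctl daemon-reload

This applies to all user@.service instances and only shortens how long systemd waits before giving up; it doesn't change why the processes stay around.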