Bug 2231597
| Summary: | reboot.target times out when having lingering users | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Renaud Métrich <rmetrich> |
| Component: | systemd | Assignee: | systemd maint <systemd-maint> |
| Status: | NEW --- | QA Contact: | Frantisek Sumsal <fsumsal> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 8.8 | CC: | hartsjc, qguo, systemd-maint |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | Type: | Bug | |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description of problem: We have a customer seeing constantly the reboot.target unit time out after 30 minutes when having lingering users. During the 30 minutes, we can see a dead loop of DBus communication between systemd-logind and systemd related to the lingering user unit "user" collecting. I cannot reproduce the issue during reboot, but it's very easily reproducible by amending "user-runtime-dir" to not stop automatically, by setting "StopWhenUnneeded=false", then stopping "user" unit from the shell. This makes the following happen: 1. *systemd* stops "user", which sends a "UnitRemoved" message among others 2. *systemd-logind* enqueues the user in the garbage collection queue, then queries the state of "user" ("ActiveState" message) -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 840 10:05:05.397060 sendmsg(12<UNIX:[23940->22913]>, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\1\0\0014\0\0\0\306\r\0\0\250\0\0\0\1\1o\0004\0\0\0/org/freedesktop/systemd1/unit/user_401001_2eservice\0\0\0\0\3\1s\0\3\0\0\0Get\0\0\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\6\1s\0\30\0\0\0org.freedesktop.systemd1\0\0\0\0\0\0\0\0\10\1g\0\2ss\0", iov_len=184}, {iov_base="\35\0\0\0org.freedesktop.systemd1.Unit\0\0\0\v\0\0\0ActiveState\0", iov_len=52}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 236 <0.000036> -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 3. *systemd* answers by `inactive` but also sends messages "UnitNew/UnitRemoved" -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 1 10:05:05.402891 sendmsg(45<UNIX:[22545->22821]>, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\2\1\1\21\0\0\0r\34\0\0\37\0\0\0\5\1u\0\306\r\0\0\6\1s\0\4\0\0\0:1.5\0\0\0\0\10\1g\0\1v\0\0", iov_len=48}, {iov_base="\1s\0\0\10\0\0\0inactive\0", iov_len=17}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 65 <0.000029> 1 10:05:05.402994 sendmsg(40<UNIX:[19275->12637]>, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="PRIORITY=7\nSYSLOG_FACILITY=3\nCODE_FILE=../src/core/manager.c\nCODE_LINE=1181\nCODE_FUNC=manager_dispatch_gc_unit_queue\nUNIT=user\nSYSLOG_IDENTIFIER=systemd\n", iov_len=166}, {iov_base="MESSAGE=", iov_len=8}, {iov_base="user: Collecting.", iov_len=30}, {iov_base="\n", iov_len=1}], msg_iovlen=4, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL <unfinished ...> 1 10:05:05.403073 sendmsg(45<UNIX:[22545->22821]>, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1Q\0\0\0s\34\0\0p\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/systemd1\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\3\1s\0\7\0\0\0UnitNew\0\10\1g\0\2so\0", iov_len=128}, {iov_base="\21\0\0\0user\0\0\0004\0\0\0/org/freedesktop/systemd1/unit/user_401001_2eservice\0", iov_len=81}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL <unfinished ...> 1 10:05:05.403192 sendmsg(45<UNIX:[22545->22821]>, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1Q\0\0\0t\34\0\0x\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/systemd1\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\3\1s\0\v\0\0\0UnitRemoved\0\0\0\0\0\10\1g\0\2so\0", iov_len=136}, {iov_base="\21\0\0\0user\0\0\0004\0\0\0/org/freedesktop/systemd1/unit/user_401001_2eservice\0", iov_len=81}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL <unfinished ...> -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 4. *systemd-logind* then queries the state of "user-runtime-dir" -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 840 10:05:05.403202 sendmsg(12<UNIX:[23940->22913]>, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\1\0\0014\0\0\0\307\r\0\0\270\0\0\0\1\1o\0D\0\0\0/org/freedesktop/systemd1/unit/user_2druntime_2ddir_401001_2eservice\0\0\0\0\3\1s\0\3\0\0\0Get\0\0\0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\6\1s\0\30\0\0\0org.freedesktop.systemd1\0\0\0\0\0\0\0\0\10\1g\0\2ss\0", iov_len=200}, {iov_base="\35\0\0\0org.freedesktop.systemd1.Unit\0\0\0\v\0\0\0ActiveState\0", iov_len=52}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 252 <0.000047> -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- This is due to having the following code for user_unit_active() function (line 554): -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 546 static bool user_unit_active(User *u) { : 554 FOREACH_STRING(i, u->service, u->runtime_dir_service, u->slice) { 555 r = manager_unit_is_active(u->manager, i); 556 if (r < 0) 557 log_debug_errno(r, "Failed to determine whether unit '%s' is active, ignoring", u->service); 558 if (r != 0) 559 return true; 560 } : -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 5. *systemd* this time answers it's active, since "StopWhenUnneeded=false" was set for the reproducer During the shutdown, the same thing is happening since the runtime dir is still there. -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 1 10:05:05.403527 sendmsg(45<UNIX:[22545->22821]>, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\2\1\1\17\0\0\0u\34\0\0\37\0\0\0\5\1u\0\307\r\0\0\6\1s\0\4\0\0\0:1.5\0\0\0\0\10\1g\0\1v\0\0", iov_len=48}, {iov_base="\1s\0\0\6\0\0\0active\0", iov_len=15}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 63 <0.000028> -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 6. Because the unit is active, *systemd-logind* won't garbage collect the user -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 565 bool user_may_gc(User *u, bool drop_not_started) { : 584 /* Is this a user that shall stay around forever ("linger")? Before we say "no" to GC'ing for lingering users, let 's check 585 * if any of the three units that we maintain for this user is still around. If none of them is, 586 * there's no need to keep this user around even if lingering is enabled. */ 587 if (user_check_linger_file(u) > 0 && user_unit_active(u)) 588 return false; -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 7. *systemd-logind* then processes the 3rd message from *systemd* about unit "user" (the "UnitRemoved" message in step 3) -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 840 10:05:05.403400 <... recvmsg resumed>{msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\3\1s\0\v\0\0\0UnitRemoved\0\0\0\0\0\10\1g\0\2so\0\7\1s\0\4\0\0\0:1.0\0\0\0\0\21\0\0\0user\0\0\0004\0\0\0/org/freedesktop/systemd1/unit/user_401001_2eservice\0", iov_len=209}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 209 <0.000007> -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- This makes *systemd-logind* enqueues the user into the garbage collection queue again, causing step 2 to occur again in loop. The issue is present in Fedora 38 as well (systemd-253.7-1.fc38.x86_64). The customer proposes to change the order of the for loop on line 554 as shown below: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 546 static bool user_unit_active(User *u) { : 554 FOREACH_STRING(i, u->slice, u->runtime_dir_service, u->service) { 555 r = manager_unit_is_active(u->manager, i); 556 if (r < 0) 557 log_debug_errno(r, "Failed to determine whether unit '%s' is active, ignoring", u->service); 558 if (r != 0) 559 return true; 560 } : -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- This change fixes the issue because the slice will be up, causing no "UnitNew/UnitRemoved" to be sent by *systemd* (step 3). Additionally, systemd-logind will have less job to do to determine whether the user is still active or not, since the slice is the last thing being taking down. This change won't change the logic. Version-Release number of selected component (if applicable): systemd-253.7-1.fc38.x86_64 (Fedora 38) systemd-239 (RHEL8) How reproducible: Always Steps to Reproduce: 1. Create a lingering user # useradd lingering # loginctl enable-linger lingering 2. Change "StopWhenUnneeded=true" into "StopWhenUnneeded=false" in user-runtime-dir unit # systemctl edit user-runtime-dir@$(id -u lingering).service --full ... 3. Reboot 4. Stop the user unit # systemctl stop user@$(id -u lingering).service Actual results: "Deadloop" between systemd-logind and systemd, systemd taking a lot of CPU forever: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1 root 20 0 172532 10484 7960 R 60.0 0.3 1:01.52 systemd 690 root 20 0 310892 157412 156292 R 40.0 4.1 0:32.78 systemd-journal 810 dbus 20 0 56564 5304 4564 R 26.7 0.1 0:19.90 dbus-daemon 3310 root 20 0 140504 54696 7248 S 20.0 1.4 0:05.70 systemd-logind Expected results: No CPU looping. Quick reboot when issue happens on reboot (cannot reproduce for now).