Bug 2158167
| Summary: | systemd-logind: Assertion 'pid > 1' failed at ../src/login/logind-dbus.c:3332, function manager_start_scope() | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Renaud Métrich <rmetrich> | |
| Component: | systemd | Assignee: | Michal Sekletar <msekleta> | |
| Status: | POST --- | QA Contact: | Frantisek Sumsal <fsumsal> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | urgent | |||
| Version: | 8.7 | CC: | abetkike, ahoness, bwaldo, dtardon, fkrska, jamacku, kurathod, kwalker, msekleta, pjagtap, pwallend, qguo, sbroz, staff, systemd-maint-list, systemd-maint, tvainio, yzheng | |
| Target Milestone: | rc | Keywords: | Bugfix, Regression, Triaged, ZStream | |
| Target Release: | --- | |||
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2223602 (view as bug list) | Environment: | ||
| Last Closed: | Type: | Bug | ||
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2223602 | |||
I'm raising the Priority and Severity of this because there is a severe flaw in the new systemd-logind (session_set_leader() is new to RHEL8.7.0, was not in RHEL8.6.0).
It appears that backporting proposed commit fixes the Assertion, but leads to another issue:
we get "Failed to create session: File exists" instead, which confirms that the code fails in `hashmap_put()` which returns EEXIST due to finding the PID already:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Jan 05 09:43:37 vm-rhel8 sshd[801]: pam_systemd(sshd:session): Failed to create session: File exists
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
To reproduce easily, it's enough to recycle PIDs often and spawn backgrounded sessions, as shown below:
1. Change kernel.pid_max to 1024
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
# sysctl kernel.pid_max=1024
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
2. Spawn background sessions until PIDs recycle (leading to a new session leader to reuse a PID of some pending/closing ancient session)
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
$ while :; do ssh -n admin@vm-rhel8 "sleep 1234 &" & PID=$!; sleep 1 && kill $PID; done
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
After some time, with official systemd-239-68, the crash on systemd-logind happens:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Jan 05 10:08:15 vm-rhel8 systemd-coredump[849]: Process 343 (systemd-logind) of user 0 dumped core.
Stack trace of thread 343:
#0 0x00007f16bccd2aff raise (libc.so.6)
#1 0x00007f16bcca5ea5 abort (libc.so.6)
#2 0x00007f16bd6e18b6 log_assert_failed_realm (libsystemd-shared-239.so)
#3 0x000055904c254189 manager_start_scope (systemd-logind)
#4 0x000055904c24451d session_start_scope (systemd-logind)
#5 0x000055904c245d1f session_start (systemd-logind)
#6 0x000055904c24e055 method_create_session (systemd-logind)
#7 0x00007f16bd7b57fa method_callbacks_run (libsystemd-shared-239.so)
#8 0x00007f16bd7b700c bus_process_object (libsystemd-shared-239.so)
#9 0x00007f16bd7c846a process_message (libsystemd-shared-239.so)
#10 0x00007f16bd7c940c io_callback (libsystemd-shared-239.so)
#11 0x00007f16bd7f0299 source_dispatch (libsystemd-shared-239.so)
#12 0x00007f16bd7f206d sd_event_dispatch (libsystemd-shared-239.so)
#13 0x00007f16bd7f2278 sd_event_run (libsystemd-shared-239.so)
#14 0x000055904c23bd72 manager_run (systemd-logind)
#15 0x00007f16bccbed85 __libc_start_main (libc.so.6)
#16 0x000055904c23de5e _start (systemd-logind)
Jan 05 10:08:15 vm-rhel8 sshd[843]: pam_systemd(sshd:session): Failed to create session: Message recipient disconnected from message bus without replying
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
With patched systemd-239-68, "Failed to create session: File exists" happen instead:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Jan 05 09:43:37 vm-rhel8 sshd[801]: pam_systemd(sshd:session): Failed to create session: File exists
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Clearly there is something wrong in the new code: when a session is closing, I would expect the session leader to be reset to "0" since it's gone.
This would prevent the hashmap_put() to fail with EEXIST.
For now the code for that is in session_free():
~~~
81 Session* session_free(Session *s) {
:
125 if (pid_is_valid(s->leader))
126 (void) hashmap_remove_value(s->manager->sessions_by_leader, PID_TO_PTR(s->leader), s);
:
~~~
IMHO this should be done in session_stop() or similar, upon session leader exit.
Related BZ: #2158724 please please fix this ASAP. We see this probably about 10 times a day across our lab of maybe 40 identically imaged systems. I just wanted to add that when systemd-logind is restarted, the entire GDM/GNOME environment is aborted and the user is returned to the login screen. I happen to work on safety critical software and full autonomous/random logouts aren't very fun. We are using NVIDIA cards and it might be specific to NVIDIA that it kills Xorg when systemd-logind is lost. Hi Brett, you are right! Restarting systemd-logind kills the X sessions, I can reproduce on QEMU/KVM, with or without Wayland, this doesn't matter. In theory systemd-logind is optional, that's why it's set as optional in PAM, but in fact, it's not for X at all: as soon as "pam_systemd" is commented out in /etc/pam.d/system-auth, gdm.service fails to start. I will open a bugzilla for Desktop team to check. Because the impact is not cosmetic at all in the end, I'm increasing the bug priority and will notify systemd team. Renaud. Hi Brett (again), You state that you hit the issue in your lab frequently. So far the condition to meet the issue was to hit BZ #2158724, which is some error condition that shouldn't be met in usual case. Could you please open a case (you may state my name in the case description so that it gets redirected to me) for us to investigate? Renaud. Renaud, I have done so and the case ID is 03502124. We do hit this case quit a bit and our system does a ton of periodic constant ssh login's to localhost which is likely what exacerbates the issue. Running your test procedure and it'll crash in about 30 seconds with the max_pid so low. I've taken ownership of the new case, we shall continue in that case :) fix merged to github main branch -> https://github.com/redhat-plumbers/systemd-rhel8/pull/401 |
Description of problem: We have a customer hitting RHBZ #1653746: from time to time, systemd-logind crashes in assertion due to having pid == 0: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- systemd-logind[1099454]: Assertion 'pid > 1' failed at ../src/login/logind-dbus.c:3332, function manager_start_scope(). Aborting. -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- Coredump analysis shows "pid" is 0, which means somehow session_set_leader() failed, probably due to already having the session leader in the hashmap (Complete analysis in the attached customer case): -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- (gdb) f 3 #3 0x00005650eea4c189 in manager_start_scope (manager=<optimized out>, scope=0x5650eedb1af0 "session-23431.scope", pid=0, slice=0x5650eed007d0 "user-1000.slice", description=0x7ffc410f87c0 "Session 23431 of user ...", wants=wants@entry=0x7ffc410f8820, after=0x7ffc410f8840, requires_mounts_for=0x5650eefcd640 "/home/...", more_properties=0x5650eee61a60, error=0x7ffc410f8a50, job=0x5650eee5f090) at ../src/login/logind-dbus.c:3332 3332 assert(pid > 1); (gdb) p pid $1 = 0 (gdb) up #4 0x00005650eea3c51d in session_start_scope (s=0x5650eee5f010, properties=0x5650eee61a60, error=0x7ffc410f8a50) at ../src/login/logind-session.c:640 640 r = manager_start_scope( (gdb) p s->leader $2 = 0 (gdb) f 6 #6 0x00005650eea46055 in method_create_session (message=0x5650eee61a60, userdata=<optimized out>, error=0x7ffc410f8a50) at ../src/login/logind-dbus.c:854 854 r = session_start(session, message, error); (gdb) p leader $10 = 1197801 158 int session_set_leader(Session *s, pid_t pid) { : 163 if (!pid_is_valid(pid)) 164 return -EINVAL; 165 166 if (s->leader == pid) 167 return 0; 168 169 r = hashmap_put(s->manager->sessions_by_leader, PID_TO_PTR(pid), s); 170 if (r < 0) 171 return r; : 176 s->leader = pid; 177 (void) audit_session_from_pid(pid, &s->audit_id); 178 179 return 1; 180 } (gdb) p *session->manager->sessions_by_leader $14 = {b = {hash_ops = 0x7f2d5f0249e0 <trivial_hash_ops>, {indirect = {storage = 0x7f2d5f1da010, hash_key = "eL\221\357i\226X\203\250?\023\262y\231O<", n_entries = 3986, n_buckets = 7710, idx_lowest_entry = 0, _pad = "\000\000"}, direct = { storage = "\020\240\035_-\177\000\000eL\221\357i\226X\203\250?\023\262y\231O<\222\017\000\000\036\036\000\000\000\000\000\000\000\000"}}, type = HASHMAP_TYPE_PLAIN, has_indirect = true, n_direct_entries = 0, from_pool = true, dirty = true, cached = false}} 1203 int hashmap_put(Hashmap *h, const void *key, void *value) { : 1212 if (idx != IDX_NIL) { 1213 e = plain_bucket_at(h, idx); 1214 if (e->value == value) 1215 return 0; 1216 return -EEXIST; 1217 } : -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- It's unclear when session_set_leader() can fail, especially when such PID can already be in the hashmap. I think there is another underlying bug here. But clearly, to avoid the assertion failure, the code must be fixed to fail upon session_set_leader() failure: Currently: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 591 static int method_create_session(sd_bus_message *message, void *userdata, sd_bus_error *error) { : 787 session_set_leader(session, leader); : 854 r = session_start(session, message, error); : -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- Upstream: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 679 static int method_create_session(sd_bus_message *message, void *userdata, sd_bus_error *error) { : 892 r = session_set_leader(session, leader); 893 if (r < 0) 894 goto fail; : 961 r = session_start(session, message, error); : -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- Version-Release number of selected component (if applicable): systemd-239-68.el8 How reproducible: Regularly on customer system, no idea how to reproduce in-house