Bug 2158167
Summary: | systemd-logind: Assertion 'pid > 1' failed at ../src/login/logind-dbus.c:3332, function manager_start_scope() | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Renaud Métrich <rmetrich> | |
Component: | systemd | Assignee: | Michal Sekletar <msekleta> | |
Status: | CLOSED ERRATA | QA Contact: | Frantisek Sumsal <fsumsal> | |
Severity: | urgent | Docs Contact: | ||
Priority: | urgent | |||
Version: | 8.7 | CC: | abetkike, ahoness, bwaldo, dtardon, fkrska, jamacku, kurathod, kwalker, msekleta, pjagtap, pwallend, qguo, sbroz, staff, systemd-maint-list, systemd-maint, tvainio, yzheng | |
Target Milestone: | rc | Keywords: | Bugfix, Regression, Triaged, ZStream | |
Target Release: | --- | |||
Hardware: | All | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | systemd-239-78.el8 | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 2223602 (view as bug list) | Environment: | ||
Last Closed: | 2023-11-14 15:48:25 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 2223602 |
Description
Renaud Métrich
2023-01-04 12:44:42 UTC
I'm raising the Priority and Severity of this because there is a severe flaw in the new systemd-logind (session_set_leader() is new to RHEL8.7.0, was not in RHEL8.6.0). It appears that backporting proposed commit fixes the Assertion, but leads to another issue: we get "Failed to create session: File exists" instead, which confirms that the code fails in `hashmap_put()` which returns EEXIST due to finding the PID already: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- Jan 05 09:43:37 vm-rhel8 sshd[801]: pam_systemd(sshd:session): Failed to create session: File exists -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- To reproduce easily, it's enough to recycle PIDs often and spawn backgrounded sessions, as shown below: 1. Change kernel.pid_max to 1024 -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- # sysctl kernel.pid_max=1024 -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 2. Spawn background sessions until PIDs recycle (leading to a new session leader to reuse a PID of some pending/closing ancient session) -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- $ while :; do ssh -n admin@vm-rhel8 "sleep 1234 &" & PID=$!; sleep 1 && kill $PID; done -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- After some time, with official systemd-239-68, the crash on systemd-logind happens: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- Jan 05 10:08:15 vm-rhel8 systemd-coredump[849]: Process 343 (systemd-logind) of user 0 dumped core. Stack trace of thread 343: #0 0x00007f16bccd2aff raise (libc.so.6) #1 0x00007f16bcca5ea5 abort (libc.so.6) #2 0x00007f16bd6e18b6 log_assert_failed_realm (libsystemd-shared-239.so) #3 0x000055904c254189 manager_start_scope (systemd-logind) #4 0x000055904c24451d session_start_scope (systemd-logind) #5 0x000055904c245d1f session_start (systemd-logind) #6 0x000055904c24e055 method_create_session (systemd-logind) #7 0x00007f16bd7b57fa method_callbacks_run (libsystemd-shared-239.so) #8 0x00007f16bd7b700c bus_process_object (libsystemd-shared-239.so) #9 0x00007f16bd7c846a process_message (libsystemd-shared-239.so) #10 0x00007f16bd7c940c io_callback (libsystemd-shared-239.so) #11 0x00007f16bd7f0299 source_dispatch (libsystemd-shared-239.so) #12 0x00007f16bd7f206d sd_event_dispatch (libsystemd-shared-239.so) #13 0x00007f16bd7f2278 sd_event_run (libsystemd-shared-239.so) #14 0x000055904c23bd72 manager_run (systemd-logind) #15 0x00007f16bccbed85 __libc_start_main (libc.so.6) #16 0x000055904c23de5e _start (systemd-logind) Jan 05 10:08:15 vm-rhel8 sshd[843]: pam_systemd(sshd:session): Failed to create session: Message recipient disconnected from message bus without replying -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- With patched systemd-239-68, "Failed to create session: File exists" happen instead: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- Jan 05 09:43:37 vm-rhel8 sshd[801]: pam_systemd(sshd:session): Failed to create session: File exists -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- Clearly there is something wrong in the new code: when a session is closing, I would expect the session leader to be reset to "0" since it's gone. This would prevent the hashmap_put() to fail with EEXIST. For now the code for that is in session_free(): ~~~ 81 Session* session_free(Session *s) { : 125 if (pid_is_valid(s->leader)) 126 (void) hashmap_remove_value(s->manager->sessions_by_leader, PID_TO_PTR(s->leader), s); : ~~~ IMHO this should be done in session_stop() or similar, upon session leader exit. Related BZ: #2158724 please please fix this ASAP. We see this probably about 10 times a day across our lab of maybe 40 identically imaged systems. I just wanted to add that when systemd-logind is restarted, the entire GDM/GNOME environment is aborted and the user is returned to the login screen. I happen to work on safety critical software and full autonomous/random logouts aren't very fun. We are using NVIDIA cards and it might be specific to NVIDIA that it kills Xorg when systemd-logind is lost. Hi Brett, you are right! Restarting systemd-logind kills the X sessions, I can reproduce on QEMU/KVM, with or without Wayland, this doesn't matter. In theory systemd-logind is optional, that's why it's set as optional in PAM, but in fact, it's not for X at all: as soon as "pam_systemd" is commented out in /etc/pam.d/system-auth, gdm.service fails to start. I will open a bugzilla for Desktop team to check. Because the impact is not cosmetic at all in the end, I'm increasing the bug priority and will notify systemd team. Renaud. Hi Brett (again), You state that you hit the issue in your lab frequently. So far the condition to meet the issue was to hit BZ #2158724, which is some error condition that shouldn't be met in usual case. Could you please open a case (you may state my name in the case description so that it gets redirected to me) for us to investigate? Renaud. Renaud, I have done so and the case ID is 03502124. We do hit this case quit a bit and our system does a ton of periodic constant ssh login's to localhost which is likely what exacerbates the issue. Running your test procedure and it'll crash in about 30 seconds with the max_pid so low. I've taken ownership of the new case, we shall continue in that case :) fix merged to github main branch -> https://github.com/redhat-plumbers/systemd-rhel8/pull/401 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (systemd bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:7097 |