Bug 2158167 - systemd-logind: Assertion 'pid > 1' failed at ../src/login/logind-dbus.c:3332, function manager_start_scope()
Summary: systemd-logind: Assertion 'pid > 1' failed at ../src/login/logind-dbus.c:3332...
Keywords:
Status: POST
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: systemd
Version: 8.7
Hardware: All
OS: Linux
urgent
urgent
Target Milestone: rc
: ---
Assignee: Michal Sekletar
QA Contact: Frantisek Sumsal
URL:
Whiteboard:
Depends On:
Blocks: 2223602
TreeView+ depends on / blocked
 
Reported: 2023-01-04 12:44 UTC by Renaud Métrich
Modified: 2023-07-28 15:54 UTC (History)
18 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2223602 (view as bug list)
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github redhat-plumbers systemd-rhel8 pull 401 0 None open (#2158167) Better handle session leader information 2023-07-10 14:27:15 UTC
Github systemd systemd pull 11713 0 None Merged login: add a missing error check for session_set_leader() 2023-01-04 13:55:54 UTC
Red Hat Issue Tracker RHELPLAN-143937 0 None None None 2023-01-04 13:11:42 UTC
Red Hat Knowledge Base (Solution) 6993080 0 None None None 2023-01-06 07:19:07 UTC

Description Renaud Métrich 2023-01-04 12:44:42 UTC
Description of problem:

We have a customer hitting RHBZ #1653746: from time to time, systemd-logind crashes in assertion due to having pid == 0:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
systemd-logind[1099454]: Assertion 'pid > 1' failed at ../src/login/logind-dbus.c:3332, function manager_start_scope(). Aborting.
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Coredump analysis shows "pid" is 0, which means somehow session_set_leader() failed, probably due to already having the session leader in the hashmap (Complete analysis in the attached customer case):
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) f 3
#3  0x00005650eea4c189 in manager_start_scope (manager=<optimized out>, scope=0x5650eedb1af0 "session-23431.scope", 
    pid=0, slice=0x5650eed007d0 "user-1000.slice", 
    description=0x7ffc410f87c0 "Session 23431 of user ...", wants=wants@entry=0x7ffc410f8820, 
    after=0x7ffc410f8840, requires_mounts_for=0x5650eefcd640 "/home/...", more_properties=0x5650eee61a60, 
    error=0x7ffc410f8a50, job=0x5650eee5f090) at ../src/login/logind-dbus.c:3332
3332            assert(pid > 1);
(gdb) p pid
$1 = 0

(gdb) up
#4  0x00005650eea3c51d in session_start_scope (s=0x5650eee5f010, properties=0x5650eee61a60, error=0x7ffc410f8a50)
    at ../src/login/logind-session.c:640
640                    r = manager_start_scope(
(gdb) p s->leader
$2 = 0

(gdb) f 6
#6  0x00005650eea46055 in method_create_session (message=0x5650eee61a60, userdata=<optimized out>, 
    error=0x7ffc410f8a50) at ../src/login/logind-dbus.c:854
854            r = session_start(session, message, error);
(gdb) p leader
$10 = 1197801

 158 int session_set_leader(Session *s, pid_t pid) {
 :
 163         if (!pid_is_valid(pid))
 164                 return -EINVAL;
 165 
 166         if (s->leader == pid)
 167                 return 0;
 168 
 169         r = hashmap_put(s->manager->sessions_by_leader, PID_TO_PTR(pid), s);
 170         if (r < 0)
 171                 return r;
 :
 176         s->leader = pid;
 177         (void) audit_session_from_pid(pid, &s->audit_id);
 178 
 179         return 1;
 180 }

(gdb) p *session->manager->sessions_by_leader 
$14 = {b = {hash_ops = 0x7f2d5f0249e0 <trivial_hash_ops>, {indirect = {storage = 0x7f2d5f1da010, 
        hash_key = "eL\221\357i\226X\203\250?\023\262y\231O<", n_entries = 3986, n_buckets = 7710, 
        idx_lowest_entry = 0, _pad = "\000\000"}, direct = {
        storage = "\020\240\035_-\177\000\000eL\221\357i\226X\203\250?\023\262y\231O<\222\017\000\000\036\036\000\000\000\000\000\000\000\000"}}, type = HASHMAP_TYPE_PLAIN, has_indirect = true, n_direct_entries = 0, from_pool = true, 
    dirty = true, cached = false}}

1203 int hashmap_put(Hashmap *h, const void *key, void *value) {
 :
1212         if (idx != IDX_NIL) {
1213                 e = plain_bucket_at(h, idx);
1214                 if (e->value == value)
1215                         return 0;
1216                 return -EEXIST;
1217         }
 :
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

It's unclear when session_set_leader() can fail, especially when such PID can already be in the hashmap.
I think there is another underlying bug here.

But clearly, to avoid the assertion failure, the code must be fixed to fail upon session_set_leader() failure:

Currently:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
 591 static int method_create_session(sd_bus_message *message, void *userdata, sd_bus_error *error) {
 :
 787         session_set_leader(session, leader);
 :
 854         r = session_start(session, message, error);
 :
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Upstream:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
 679 static int method_create_session(sd_bus_message *message, void *userdata, sd_bus_error *error) {
 :
 892         r = session_set_leader(session, leader);
 893         if (r < 0)              
 894                 goto fail;
 :
 961         r = session_start(session, message, error);
 :
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Version-Release number of selected component (if applicable):

systemd-239-68.el8

How reproducible:

Regularly on customer system, no idea how to reproduce in-house

Comment 1 Renaud Métrich 2023-01-05 09:24:00 UTC
I'm raising the Priority and Severity of this because there is a severe flaw in the new systemd-logind (session_set_leader() is new to RHEL8.7.0, was not in RHEL8.6.0).

It appears that backporting proposed commit fixes the Assertion, but leads to another issue:
we get "Failed to create session: File exists" instead, which confirms that the code fails in `hashmap_put()` which returns EEXIST due to finding the PID already:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Jan 05 09:43:37 vm-rhel8 sshd[801]: pam_systemd(sshd:session): Failed to create session: File exists
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

To reproduce easily, it's enough to recycle PIDs often and spawn backgrounded sessions, as shown below:

1. Change kernel.pid_max to 1024

   -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
   # sysctl kernel.pid_max=1024
   -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

2. Spawn background sessions until PIDs recycle (leading to a new session leader to reuse a PID of some pending/closing ancient session)

   -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
   $ while :; do ssh -n admin@vm-rhel8 "sleep 1234 &" & PID=$!; sleep 1 && kill $PID; done
   -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

After some time, with official systemd-239-68, the crash on systemd-logind happens:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Jan 05 10:08:15 vm-rhel8 systemd-coredump[849]: Process 343 (systemd-logind) of user 0 dumped core.
                                                
                                                Stack trace of thread 343:
                                                #0  0x00007f16bccd2aff raise (libc.so.6)
                                                #1  0x00007f16bcca5ea5 abort (libc.so.6)
                                                #2  0x00007f16bd6e18b6 log_assert_failed_realm (libsystemd-shared-239.so)
                                                #3  0x000055904c254189 manager_start_scope (systemd-logind)
                                                #4  0x000055904c24451d session_start_scope (systemd-logind)
                                                #5  0x000055904c245d1f session_start (systemd-logind)
                                                #6  0x000055904c24e055 method_create_session (systemd-logind)
                                                #7  0x00007f16bd7b57fa method_callbacks_run (libsystemd-shared-239.so)
                                                #8  0x00007f16bd7b700c bus_process_object (libsystemd-shared-239.so)
                                                #9  0x00007f16bd7c846a process_message (libsystemd-shared-239.so)
                                                #10 0x00007f16bd7c940c io_callback (libsystemd-shared-239.so)
                                                #11 0x00007f16bd7f0299 source_dispatch (libsystemd-shared-239.so)
                                                #12 0x00007f16bd7f206d sd_event_dispatch (libsystemd-shared-239.so)
                                                #13 0x00007f16bd7f2278 sd_event_run (libsystemd-shared-239.so)
                                                #14 0x000055904c23bd72 manager_run (systemd-logind)
                                                #15 0x00007f16bccbed85 __libc_start_main (libc.so.6)
                                                #16 0x000055904c23de5e _start (systemd-logind)
Jan 05 10:08:15 vm-rhel8 sshd[843]: pam_systemd(sshd:session): Failed to create session: Message recipient disconnected from message bus without replying
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

With patched systemd-239-68, "Failed to create session: File exists" happen instead:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Jan 05 09:43:37 vm-rhel8 sshd[801]: pam_systemd(sshd:session): Failed to create session: File exists
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------



Clearly there is something wrong in the new code: when a session is closing, I would expect the session leader to be reset to "0" since it's gone.
This would prevent the hashmap_put() to fail with EEXIST.
For now the code for that is in session_free():
~~~
  81 Session* session_free(Session *s) {
  :
 125         if (pid_is_valid(s->leader))
 126                 (void) hashmap_remove_value(s->manager->sessions_by_leader, PID_TO_PTR(s->leader), s);
  :
~~~

IMHO this should be done in session_stop() or similar, upon session leader exit.

Comment 3 Renaud Métrich 2023-01-06 10:21:53 UTC
Related BZ: #2158724

Comment 4 Brett Waldo 2023-05-03 01:27:18 UTC
please please fix this ASAP. We see this probably about 10 times a day across our lab of maybe 40 identically imaged systems. I just wanted to add that when systemd-logind is restarted, the entire GDM/GNOME environment is aborted and the user is returned to the login screen. I happen to work on safety critical software and full autonomous/random logouts aren't very fun.  We are using NVIDIA cards and it might be specific to NVIDIA that it kills Xorg when systemd-logind is lost.

Comment 5 Renaud Métrich 2023-05-03 06:08:01 UTC
Hi Brett,

you are right! Restarting systemd-logind kills the X sessions, I can reproduce on QEMU/KVM, with or without Wayland, this doesn't matter.

In theory systemd-logind is optional, that's why it's set as optional in PAM, but in fact, it's not for X at all: as soon as "pam_systemd" is commented out in /etc/pam.d/system-auth, gdm.service fails to start.
I will open a bugzilla for Desktop team to check.

Because the impact is not cosmetic at all in the end, I'm increasing the bug priority and will notify systemd team.

Renaud.

Comment 6 Renaud Métrich 2023-05-03 06:11:01 UTC
Hi Brett (again),

You state that you hit the issue in your lab frequently.
So far the condition to meet the issue was to hit BZ #2158724, which is some error condition that shouldn't be met in usual case.
Could you please open a case (you may state my name in the case description so that it gets redirected to me) for us to investigate?

Renaud.

Comment 7 Brett Waldo 2023-05-03 12:52:39 UTC
Renaud, I have done so and the case ID is 03502124.  We do hit this case quit a bit and our system does a ton of periodic constant ssh login's to localhost which is likely what exacerbates the issue. Running your test procedure and it'll crash in about 30 seconds with the max_pid so low.

Comment 8 Renaud Métrich 2023-05-03 13:13:56 UTC
I've taken ownership of the new case, we shall continue in that case :)

Comment 25 Plumber Bot 2023-07-20 10:55:53 UTC
fix merged to github main branch -> https://github.com/redhat-plumbers/systemd-rhel8/pull/401


Note You need to log in before you can comment on or make changes to this bug.