RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 2158167 - systemd-logind: Assertion 'pid > 1' failed at ../src/login/logind-dbus.c:3332, function manager_start_scope()
Summary: systemd-logind: Assertion 'pid > 1' failed at ../src/login/logind-dbus.c:3332...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: systemd
Version: 8.7
Hardware: All
OS: Linux
urgent
urgent
Target Milestone: rc
: ---
Assignee: Michal Sekletar
QA Contact: Frantisek Sumsal
URL:
Whiteboard:
Depends On:
Blocks: 2223602
TreeView+ depends on / blocked
 
Reported: 2023-01-04 12:44 UTC by Renaud Métrich
Modified: 2023-11-14 18:01 UTC (History)
18 users (show)

Fixed In Version: systemd-239-78.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2223602 (view as bug list)
Environment:
Last Closed: 2023-11-14 15:48:25 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github redhat-plumbers systemd-rhel8 pull 401 0 None open (#2158167) Better handle session leader information 2023-07-10 14:27:15 UTC
Github systemd systemd pull 11713 0 None Merged login: add a missing error check for session_set_leader() 2023-01-04 13:55:54 UTC
Red Hat Issue Tracker RHELPLAN-143937 0 None None None 2023-01-04 13:11:42 UTC
Red Hat Knowledge Base (Solution) 6993080 0 None None None 2023-01-06 07:19:07 UTC
Red Hat Product Errata RHBA-2023:7097 0 None None None 2023-11-14 15:48:57 UTC

Description Renaud Métrich 2023-01-04 12:44:42 UTC
Description of problem:

We have a customer hitting RHBZ #1653746: from time to time, systemd-logind crashes in assertion due to having pid == 0:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
systemd-logind[1099454]: Assertion 'pid > 1' failed at ../src/login/logind-dbus.c:3332, function manager_start_scope(). Aborting.
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Coredump analysis shows "pid" is 0, which means somehow session_set_leader() failed, probably due to already having the session leader in the hashmap (Complete analysis in the attached customer case):
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) f 3
#3  0x00005650eea4c189 in manager_start_scope (manager=<optimized out>, scope=0x5650eedb1af0 "session-23431.scope", 
    pid=0, slice=0x5650eed007d0 "user-1000.slice", 
    description=0x7ffc410f87c0 "Session 23431 of user ...", wants=wants@entry=0x7ffc410f8820, 
    after=0x7ffc410f8840, requires_mounts_for=0x5650eefcd640 "/home/...", more_properties=0x5650eee61a60, 
    error=0x7ffc410f8a50, job=0x5650eee5f090) at ../src/login/logind-dbus.c:3332
3332            assert(pid > 1);
(gdb) p pid
$1 = 0

(gdb) up
#4  0x00005650eea3c51d in session_start_scope (s=0x5650eee5f010, properties=0x5650eee61a60, error=0x7ffc410f8a50)
    at ../src/login/logind-session.c:640
640                    r = manager_start_scope(
(gdb) p s->leader
$2 = 0

(gdb) f 6
#6  0x00005650eea46055 in method_create_session (message=0x5650eee61a60, userdata=<optimized out>, 
    error=0x7ffc410f8a50) at ../src/login/logind-dbus.c:854
854            r = session_start(session, message, error);
(gdb) p leader
$10 = 1197801

 158 int session_set_leader(Session *s, pid_t pid) {
 :
 163         if (!pid_is_valid(pid))
 164                 return -EINVAL;
 165 
 166         if (s->leader == pid)
 167                 return 0;
 168 
 169         r = hashmap_put(s->manager->sessions_by_leader, PID_TO_PTR(pid), s);
 170         if (r < 0)
 171                 return r;
 :
 176         s->leader = pid;
 177         (void) audit_session_from_pid(pid, &s->audit_id);
 178 
 179         return 1;
 180 }

(gdb) p *session->manager->sessions_by_leader 
$14 = {b = {hash_ops = 0x7f2d5f0249e0 <trivial_hash_ops>, {indirect = {storage = 0x7f2d5f1da010, 
        hash_key = "eL\221\357i\226X\203\250?\023\262y\231O<", n_entries = 3986, n_buckets = 7710, 
        idx_lowest_entry = 0, _pad = "\000\000"}, direct = {
        storage = "\020\240\035_-\177\000\000eL\221\357i\226X\203\250?\023\262y\231O<\222\017\000\000\036\036\000\000\000\000\000\000\000\000"}}, type = HASHMAP_TYPE_PLAIN, has_indirect = true, n_direct_entries = 0, from_pool = true, 
    dirty = true, cached = false}}

1203 int hashmap_put(Hashmap *h, const void *key, void *value) {
 :
1212         if (idx != IDX_NIL) {
1213                 e = plain_bucket_at(h, idx);
1214                 if (e->value == value)
1215                         return 0;
1216                 return -EEXIST;
1217         }
 :
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

It's unclear when session_set_leader() can fail, especially when such PID can already be in the hashmap.
I think there is another underlying bug here.

But clearly, to avoid the assertion failure, the code must be fixed to fail upon session_set_leader() failure:

Currently:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
 591 static int method_create_session(sd_bus_message *message, void *userdata, sd_bus_error *error) {
 :
 787         session_set_leader(session, leader);
 :
 854         r = session_start(session, message, error);
 :
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Upstream:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
 679 static int method_create_session(sd_bus_message *message, void *userdata, sd_bus_error *error) {
 :
 892         r = session_set_leader(session, leader);
 893         if (r < 0)              
 894                 goto fail;
 :
 961         r = session_start(session, message, error);
 :
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Version-Release number of selected component (if applicable):

systemd-239-68.el8

How reproducible:

Regularly on customer system, no idea how to reproduce in-house

Comment 1 Renaud Métrich 2023-01-05 09:24:00 UTC
I'm raising the Priority and Severity of this because there is a severe flaw in the new systemd-logind (session_set_leader() is new to RHEL8.7.0, was not in RHEL8.6.0).

It appears that backporting proposed commit fixes the Assertion, but leads to another issue:
we get "Failed to create session: File exists" instead, which confirms that the code fails in `hashmap_put()` which returns EEXIST due to finding the PID already:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Jan 05 09:43:37 vm-rhel8 sshd[801]: pam_systemd(sshd:session): Failed to create session: File exists
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

To reproduce easily, it's enough to recycle PIDs often and spawn backgrounded sessions, as shown below:

1. Change kernel.pid_max to 1024

   -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
   # sysctl kernel.pid_max=1024
   -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

2. Spawn background sessions until PIDs recycle (leading to a new session leader to reuse a PID of some pending/closing ancient session)

   -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
   $ while :; do ssh -n admin@vm-rhel8 "sleep 1234 &" & PID=$!; sleep 1 && kill $PID; done
   -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

After some time, with official systemd-239-68, the crash on systemd-logind happens:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Jan 05 10:08:15 vm-rhel8 systemd-coredump[849]: Process 343 (systemd-logind) of user 0 dumped core.
                                                
                                                Stack trace of thread 343:
                                                #0  0x00007f16bccd2aff raise (libc.so.6)
                                                #1  0x00007f16bcca5ea5 abort (libc.so.6)
                                                #2  0x00007f16bd6e18b6 log_assert_failed_realm (libsystemd-shared-239.so)
                                                #3  0x000055904c254189 manager_start_scope (systemd-logind)
                                                #4  0x000055904c24451d session_start_scope (systemd-logind)
                                                #5  0x000055904c245d1f session_start (systemd-logind)
                                                #6  0x000055904c24e055 method_create_session (systemd-logind)
                                                #7  0x00007f16bd7b57fa method_callbacks_run (libsystemd-shared-239.so)
                                                #8  0x00007f16bd7b700c bus_process_object (libsystemd-shared-239.so)
                                                #9  0x00007f16bd7c846a process_message (libsystemd-shared-239.so)
                                                #10 0x00007f16bd7c940c io_callback (libsystemd-shared-239.so)
                                                #11 0x00007f16bd7f0299 source_dispatch (libsystemd-shared-239.so)
                                                #12 0x00007f16bd7f206d sd_event_dispatch (libsystemd-shared-239.so)
                                                #13 0x00007f16bd7f2278 sd_event_run (libsystemd-shared-239.so)
                                                #14 0x000055904c23bd72 manager_run (systemd-logind)
                                                #15 0x00007f16bccbed85 __libc_start_main (libc.so.6)
                                                #16 0x000055904c23de5e _start (systemd-logind)
Jan 05 10:08:15 vm-rhel8 sshd[843]: pam_systemd(sshd:session): Failed to create session: Message recipient disconnected from message bus without replying
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

With patched systemd-239-68, "Failed to create session: File exists" happen instead:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Jan 05 09:43:37 vm-rhel8 sshd[801]: pam_systemd(sshd:session): Failed to create session: File exists
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------



Clearly there is something wrong in the new code: when a session is closing, I would expect the session leader to be reset to "0" since it's gone.
This would prevent the hashmap_put() to fail with EEXIST.
For now the code for that is in session_free():
~~~
  81 Session* session_free(Session *s) {
  :
 125         if (pid_is_valid(s->leader))
 126                 (void) hashmap_remove_value(s->manager->sessions_by_leader, PID_TO_PTR(s->leader), s);
  :
~~~

IMHO this should be done in session_stop() or similar, upon session leader exit.

Comment 3 Renaud Métrich 2023-01-06 10:21:53 UTC
Related BZ: #2158724

Comment 4 Brett Waldo 2023-05-03 01:27:18 UTC
please please fix this ASAP. We see this probably about 10 times a day across our lab of maybe 40 identically imaged systems. I just wanted to add that when systemd-logind is restarted, the entire GDM/GNOME environment is aborted and the user is returned to the login screen. I happen to work on safety critical software and full autonomous/random logouts aren't very fun.  We are using NVIDIA cards and it might be specific to NVIDIA that it kills Xorg when systemd-logind is lost.

Comment 5 Renaud Métrich 2023-05-03 06:08:01 UTC
Hi Brett,

you are right! Restarting systemd-logind kills the X sessions, I can reproduce on QEMU/KVM, with or without Wayland, this doesn't matter.

In theory systemd-logind is optional, that's why it's set as optional in PAM, but in fact, it's not for X at all: as soon as "pam_systemd" is commented out in /etc/pam.d/system-auth, gdm.service fails to start.
I will open a bugzilla for Desktop team to check.

Because the impact is not cosmetic at all in the end, I'm increasing the bug priority and will notify systemd team.

Renaud.

Comment 6 Renaud Métrich 2023-05-03 06:11:01 UTC
Hi Brett (again),

You state that you hit the issue in your lab frequently.
So far the condition to meet the issue was to hit BZ #2158724, which is some error condition that shouldn't be met in usual case.
Could you please open a case (you may state my name in the case description so that it gets redirected to me) for us to investigate?

Renaud.

Comment 7 Brett Waldo 2023-05-03 12:52:39 UTC
Renaud, I have done so and the case ID is 03502124.  We do hit this case quit a bit and our system does a ton of periodic constant ssh login's to localhost which is likely what exacerbates the issue. Running your test procedure and it'll crash in about 30 seconds with the max_pid so low.

Comment 8 Renaud Métrich 2023-05-03 13:13:56 UTC
I've taken ownership of the new case, we shall continue in that case :)

Comment 25 Plumber Bot 2023-07-20 10:55:53 UTC
fix merged to github main branch -> https://github.com/redhat-plumbers/systemd-rhel8/pull/401

Comment 30 errata-xmlrpc 2023-11-14 15:48:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (systemd bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7097


Note You need to log in before you can comment on or make changes to this bug.