Bug 1829867

Summary: systemd segfaults reloading while serializing a service executing a ExecStartPost command
Product: Red Hat Enterprise Linux 8 Reporter: Jan Synacek <jsynacek>
Component: systemdAssignee: systemd-maint
Status: CLOSED ERRATA QA Contact: Frantisek Sumsal <fsumsal>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 8.3CC: msekleta, pvlasin, rmetrich, systemd-maint-list
Target Milestone: rc   
Target Release: 8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-11-04 02:02:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jan Synacek 2020-04-30 13:54:13 UTC
This bug was initially created as a copy of Bug #1828953

Description of problem:

Systemd segfaults after or while reloading

systemd segfaults reloading "kernel: systemd[1]: segfault at 48 ip 000055e3aedd0e68 sp 00007ffedf6a04a0 error 4 in systemd[55e3aed93000+163000]" [rhel-7.7]

Version-Release number of selected component (if applicable):

$ grep systemd installed-rpms 
systemd-219-67.el7_7.3.x86_64                               Wed Mar 25 21:38:12 2020
systemd-libs-219-67.el7_7.3.i686                            Wed Mar 25 21:38:25 2020
systemd-libs-219-67.el7_7.3.x86_64                          Wed Mar 25 21:38:11 2020
systemd-python-219-67.el7_7.3.x86_64                        Wed Mar 25 21:38:18 2020
systemd-sysv-219-67.el7_7.3.x86_64                          Wed Mar 25 21:38:13 2020

How reproducible:

Customer tell us the problem repeats itself now and then and provided a core.

Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:


Additional info:

journald extract:

Apr 27 01:02:03 hostname systemd[1]: Reloading.
Apr 27 01:02:03 hostname systemd[1]: Starting Insights Client...
Apr 27 01:02:03 hostname systemd[1]: Reloading.
Apr 27 01:02:03 hostname systemd[1]: Reloading.
Apr 27 01:02:03 hostname kernel: systemd[1]: segfault at 48 ip 000055e3aedd0e68 sp 00007ffedf6a04a0 error 4 in systemd[55e3aed93000+163000]
Apr 27 01:02:03 hostname abrt-hook-ccpp[4278]: Only 885MiB is available on /var/spool/abrt
Apr 27 01:02:03 hostname abrt-hook-ccpp[4278]: Process 4277 (systemd) of user 0 killed by SIGSEGV - ignoring (low free space)
Apr 27 01:02:03 hostname abrt-hook-ccpp[4278]: Saved core dump of pid 4277 to core.4277 at / (19238912 bytes)
Apr 27 01:02:03 hostname systemd[1]: Caught <SEGV>, dumped core as pid 4277.
Apr 27 01:02:03 hostname systemd[1]: Freezing execution.
Apr 27 01:02:11 hostname insights-client[4256]: Starting to collect Insights data for some.hostname.net
Apr 27 01:02:11 hostname insights-client[4256]: Failed to notify init system: Connection refused
[..]
Apr 27 01:02:26 hostname insights-client[4256]: Failed to notify init system: Connection refused
Apr 27 01:02:28 hostname dbus[1712]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out

Additional information from RMetrich:

Coredump Analysis

We have systemd dying while reloading:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) bt
#0  0x00007f1645682647 in kill () at ../sysdeps/unix/syscall-template.S:81
#1  0x000055e3aedb996d in crash (sig=11) at src/core/main.c:172
#2  <signal handler called>
#3  0x000055e3aedd0e68 in service_exec_command_index (current=0x55e3b0afadf0, 
    id=_SERVICE_EXEC_COMMAND_INVALID, u=0x55e3b0b69640) at src/core/service.c:2033
#4  service_serialize_exec_command (u=u@entry=0x55e3b0b69640, f=f@entry=0x55e3b0afe1f0, 
    command=0x55e3b0afadf0) at src/core/service.c:2063
#5  0x000055e3aedd1277 in service_serialize (u=0x55e3b0b69640, f=0x55e3b0afe1f0, fds=0x55e3b0c25070)
    at src/core/service.c:2122
#6  0x000055e3aee67626 in unit_serialize (u=u@entry=0x55e3b0b69640, f=f@entry=0x55e3b0afe1f0, 
    fds=fds@entry=0x55e3b0c25070, serialize_jobs=serialize_jobs@entry=true) at src/core/unit.c:2561
#7  0x000055e3aedbf962 in manager_serialize (m=m@entry=0x55e3b0aed040, f=0x55e3b0afe1f0, 
    fds=fds@entry=0x55e3b0c25070, switching_root=switching_root@entry=false) at src/core/manager.c:2553
#8  0x000055e3aedc0998 in manager_reload (m=m@entry=0x55e3b0aed040) at src/core/manager.c:2792
#9  0x000055e3aedb6c5f in main (argc=5, argv=0x7ffedf6a1038) at src/core/main.c:1799
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Upon reload, systemd "serializes" the internal state. Here, it was serializing "insights-client.service":

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) f 4
#4  service_serialize_exec_command (u=u@entry=0x55e3b0b69640, f=f@entry=0x55e3b0afe1f0, 
    command=0x55e3b0afadf0) at src/core/service.c:2063
2063	        idx = service_exec_command_index(u, id, command);

(gdb) p s->main_pid
$1 = 4256
(gdb) p u->id
$2 = 0x55e3b0c46260 "insights-client.service"
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

systemd died while serializing ExecStartPost's 2nd command:
"
ExecStartPost=/bin/bash -c "if [ -d /sys/fs/cgroup/memory ]; then echo 1G > /sys/fs/cgroup/memory/system.slice/insights-client.service/memory.soft_limit_in_bytes; fi"
"

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) p command->argv[0]
$5 = 0x55e3b0b51860 "/bin/bash"
(gdb) p command->argv[1]
$6 = 0x55e3b0b69f60 "-c"
(gdb) p command->argv[2]
$7 = 0x55e3b0b69f80 "if [ -d /sys/fs/cgroup/memory ]; then echo 1G > /sys/fs/cgroup/memory/system.slice/insights-client.service/memory.soft_limit_in_bytes; fi"
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Relevant source code:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
2039 static int service_serialize_exec_command(Unit *u, FILE *f, ExecCommand *command) {
2040         Service *s = SERVICE(u);
2041         ServiceExecCommand id;
2042         unsigned idx;
2043         const char *type;
2044         char **arg;
2045         _cleanup_strv_free_ char **escaped_args = NULL;
2046         _cleanup_free_ char *args = NULL, *p = NULL;
2047         size_t allocated = 0, length = 0;
2048 
2049         assert(s);
2050         assert(f);
2051 
2052         if (!command)
2053                 return 0;
2054 
2055         if (command == s->control_command) {
2056                 type = "control";
2057                 id = s->control_command_id;
2058         } else {
2059                 type = "main";
2060                 id = SERVICE_EXEC_START;
2061         }
2062 
2063         idx = service_exec_command_index(u, id, command);		<--- HERE


2023 static unsigned service_exec_command_index(Unit *u, ServiceExecCommand id, ExecCommand *current) {
2024         Service *s = SERVICE(u);
2025         unsigned idx = 0;
2026         ExecCommand *first, *c;
2027 
2028         assert(s);
2029 
2030         first = s->exec_command[id];		<--- HERE 'first' is invalid
2031 
2032         /* Figure out where we are in the list by walking back to the beginning */
2033         for (c = current; c != first; c = c->command_prev)		<--- HERE
2034                 idx++;
2035 
2036         return idx;
2037 }
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Why it died? Because the "s->control_command_id" was not expected:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) p s->control_command_id
$8 = _SERVICE_EXEC_COMMAND_INVALID
(gdb) p id
$9 = _SERVICE_EXEC_COMMAND_INVALID
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

This caused the called function "service_exec_command_index" to derefence a NULL pointer:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) down
#3  0x000055e3aedd0e68 in service_exec_command_index (current=0x55e3b0afadf0, 
    id=_SERVICE_EXEC_COMMAND_INVALID, u=0x55e3b0b69640) at src/core/service.c:2033
2033	        for (c = current; c != first; c = c->command_prev)
(gdb) p current
$10 = (ExecCommand *) 0x55e3b0afadf0
(gdb) p c
$11 = (ExecCommand *) 0x0
(gdb) p first
$12 = (ExecCommand *) 0x55e3b0c48300
(gdb) p *first
$13 = {path = 0x1 <Address 0x1 out of bounds>, argv = 0x55e3b0aed4e0, exec_status = {start_timestamp = {
      realtime = 94436410693184, monotonic = 0}, exit_timestamp = {realtime = 0, monotonic = 227}, 
    pid = 100, code = 0, status = -1}, command_next = 0x0, command_prev = 0x0, ignore = 64}
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

We have 'c' now NULL and anyway 'first' was pointing to invalid memory.

What remains to understand is when this could happen.

Comment 5 Lukáš Nykrýn 2020-08-27 11:55:31 UTC
fix merged to github master branch -> https://github.com/systemd-rhel/rhel-8/pull/117

Comment 14 errata-xmlrpc 2020-11-04 02:02:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: systemd security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4553