Bug 1828953

Summary: systemd segfaults reloading while serializing a service executing a ExecStartPost command [rhel-7.7]
Product: Red Hat Enterprise Linux 7 Reporter: Juan Gamba <jgamba>
Component: systemdAssignee: Jan Synacek <jsynacek>
Status: CLOSED ERRATA QA Contact: Frantisek Sumsal <fsumsal>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.7CC: jsynacek, rmetrich, sbroz, systemd-maint-list
Target Milestone: rcKeywords: Patch, ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1837972 1837973 (view as bug list) Environment:
Last Closed: 2020-09-29 20:32:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1837972, 1837973    

Description Juan Gamba 2020-04-28 16:20:39 UTC
Description of problem:

Systemd segfaults after or while reloading

systemd segfaults reloading "kernel: systemd[1]: segfault at 48 ip 000055e3aedd0e68 sp 00007ffedf6a04a0 error 4 in systemd[55e3aed93000+163000]" [rhel-7.7]

Version-Release number of selected component (if applicable):

$ grep systemd installed-rpms 
systemd-219-67.el7_7.3.x86_64                               Wed Mar 25 21:38:12 2020
systemd-libs-219-67.el7_7.3.i686                            Wed Mar 25 21:38:25 2020
systemd-libs-219-67.el7_7.3.x86_64                          Wed Mar 25 21:38:11 2020
systemd-python-219-67.el7_7.3.x86_64                        Wed Mar 25 21:38:18 2020
systemd-sysv-219-67.el7_7.3.x86_64                          Wed Mar 25 21:38:13 2020

How reproducible:

Customer tell us the problem repeats itself now and then and provided a core.

Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:


Additional info:

journald extract:

Apr 27 01:02:03 hostname systemd[1]: Reloading.
Apr 27 01:02:03 hostname systemd[1]: Starting Insights Client...
Apr 27 01:02:03 hostname systemd[1]: Reloading.
Apr 27 01:02:03 hostname systemd[1]: Reloading.
Apr 27 01:02:03 hostname kernel: systemd[1]: segfault at 48 ip 000055e3aedd0e68 sp 00007ffedf6a04a0 error 4 in systemd[55e3aed93000+163000]
Apr 27 01:02:03 hostname abrt-hook-ccpp[4278]: Only 885MiB is available on /var/spool/abrt
Apr 27 01:02:03 hostname abrt-hook-ccpp[4278]: Process 4277 (systemd) of user 0 killed by SIGSEGV - ignoring (low free space)
Apr 27 01:02:03 hostname abrt-hook-ccpp[4278]: Saved core dump of pid 4277 to core.4277 at / (19238912 bytes)
Apr 27 01:02:03 hostname systemd[1]: Caught <SEGV>, dumped core as pid 4277.
Apr 27 01:02:03 hostname systemd[1]: Freezing execution.
Apr 27 01:02:11 hostname insights-client[4256]: Starting to collect Insights data for some.hostname.net
Apr 27 01:02:11 hostname insights-client[4256]: Failed to notify init system: Connection refused
[..]
Apr 27 01:02:26 hostname insights-client[4256]: Failed to notify init system: Connection refused
Apr 27 01:02:28 hostname dbus[1712]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out

Additional information from RMetrich:

Coredump Analysis

We have systemd dying while reloading:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) bt
#0  0x00007f1645682647 in kill () at ../sysdeps/unix/syscall-template.S:81
#1  0x000055e3aedb996d in crash (sig=11) at src/core/main.c:172
#2  <signal handler called>
#3  0x000055e3aedd0e68 in service_exec_command_index (current=0x55e3b0afadf0, 
    id=_SERVICE_EXEC_COMMAND_INVALID, u=0x55e3b0b69640) at src/core/service.c:2033
#4  service_serialize_exec_command (u=u@entry=0x55e3b0b69640, f=f@entry=0x55e3b0afe1f0, 
    command=0x55e3b0afadf0) at src/core/service.c:2063
#5  0x000055e3aedd1277 in service_serialize (u=0x55e3b0b69640, f=0x55e3b0afe1f0, fds=0x55e3b0c25070)
    at src/core/service.c:2122
#6  0x000055e3aee67626 in unit_serialize (u=u@entry=0x55e3b0b69640, f=f@entry=0x55e3b0afe1f0, 
    fds=fds@entry=0x55e3b0c25070, serialize_jobs=serialize_jobs@entry=true) at src/core/unit.c:2561
#7  0x000055e3aedbf962 in manager_serialize (m=m@entry=0x55e3b0aed040, f=0x55e3b0afe1f0, 
    fds=fds@entry=0x55e3b0c25070, switching_root=switching_root@entry=false) at src/core/manager.c:2553
#8  0x000055e3aedc0998 in manager_reload (m=m@entry=0x55e3b0aed040) at src/core/manager.c:2792
#9  0x000055e3aedb6c5f in main (argc=5, argv=0x7ffedf6a1038) at src/core/main.c:1799
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Upon reload, systemd "serializes" the internal state. Here, it was serializing "insights-client.service":

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) f 4
#4  service_serialize_exec_command (u=u@entry=0x55e3b0b69640, f=f@entry=0x55e3b0afe1f0, 
    command=0x55e3b0afadf0) at src/core/service.c:2063
2063	        idx = service_exec_command_index(u, id, command);

(gdb) p s->main_pid
$1 = 4256
(gdb) p u->id
$2 = 0x55e3b0c46260 "insights-client.service"
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

systemd died while serializing ExecStartPost's 2nd command:
"
ExecStartPost=/bin/bash -c "if [ -d /sys/fs/cgroup/memory ]; then echo 1G > /sys/fs/cgroup/memory/system.slice/insights-client.service/memory.soft_limit_in_bytes; fi"
"

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) p command->argv[0]
$5 = 0x55e3b0b51860 "/bin/bash"
(gdb) p command->argv[1]
$6 = 0x55e3b0b69f60 "-c"
(gdb) p command->argv[2]
$7 = 0x55e3b0b69f80 "if [ -d /sys/fs/cgroup/memory ]; then echo 1G > /sys/fs/cgroup/memory/system.slice/insights-client.service/memory.soft_limit_in_bytes; fi"
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Relevant source code:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
2039 static int service_serialize_exec_command(Unit *u, FILE *f, ExecCommand *command) {
2040         Service *s = SERVICE(u);
2041         ServiceExecCommand id;
2042         unsigned idx;
2043         const char *type;
2044         char **arg;
2045         _cleanup_strv_free_ char **escaped_args = NULL;
2046         _cleanup_free_ char *args = NULL, *p = NULL;
2047         size_t allocated = 0, length = 0;
2048 
2049         assert(s);
2050         assert(f);
2051 
2052         if (!command)
2053                 return 0;
2054 
2055         if (command == s->control_command) {
2056                 type = "control";
2057                 id = s->control_command_id;
2058         } else {
2059                 type = "main";
2060                 id = SERVICE_EXEC_START;
2061         }
2062 
2063         idx = service_exec_command_index(u, id, command);		<--- HERE


2023 static unsigned service_exec_command_index(Unit *u, ServiceExecCommand id, ExecCommand *current) {
2024         Service *s = SERVICE(u);
2025         unsigned idx = 0;
2026         ExecCommand *first, *c;
2027 
2028         assert(s);
2029 
2030         first = s->exec_command[id];		<--- HERE 'first' is invalid
2031 
2032         /* Figure out where we are in the list by walking back to the beginning */
2033         for (c = current; c != first; c = c->command_prev)		<--- HERE
2034                 idx++;
2035 
2036         return idx;
2037 }
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Why it died? Because the "s->control_command_id" was not expected:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) p s->control_command_id
$8 = _SERVICE_EXEC_COMMAND_INVALID
(gdb) p id
$9 = _SERVICE_EXEC_COMMAND_INVALID
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

This caused the called function "service_exec_command_index" to derefence a NULL pointer:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
(gdb) down
#3  0x000055e3aedd0e68 in service_exec_command_index (current=0x55e3b0afadf0, 
    id=_SERVICE_EXEC_COMMAND_INVALID, u=0x55e3b0b69640) at src/core/service.c:2033
2033	        for (c = current; c != first; c = c->command_prev)
(gdb) p current
$10 = (ExecCommand *) 0x55e3b0afadf0
(gdb) p c
$11 = (ExecCommand *) 0x0
(gdb) p first
$12 = (ExecCommand *) 0x55e3b0c48300
(gdb) p *first
$13 = {path = 0x1 <Address 0x1 out of bounds>, argv = 0x55e3b0aed4e0, exec_status = {start_timestamp = {
      realtime = 94436410693184, monotonic = 0}, exit_timestamp = {realtime = 0, monotonic = 227}, 
    pid = 100, code = 0, status = -1}, command_next = 0x0, command_prev = 0x0, ignore = 64}
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

We have 'c' now NULL and anyway 'first' was pointing to invalid memory.

What remains to understand is when this could happen.

Comment 2 Renaud Métrich 2020-04-29 12:08:06 UTC
I checked all the sources, I don't understand how this could happen since s->control_command and s->control_command_id are "bound" together:
when s->control_command_id is changed to _SERVICE_EXEC_COMMAND_INVALID, s->control_command is always set to NULL.

But from the coredump, this wasn't the case.

I'm trying to set up a reproducer but failed so far.
I believe a condition would be to have a reload while a ExecStartPost executes.
Additionally at least ExecStartPost are required and the 2nd one should execute at the time of the reload.

Comment 4 Lukáš Nykrýn 2020-04-30 10:12:39 UTC
fix merged to github master branch -> https://github.com/systemd-rhel/rhel-7/pull/108

Comment 14 errata-xmlrpc 2020-09-29 20:32:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: systemd security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4007