Bug 1828953
| Summary: | systemd segfaults reloading while serializing a service executing a ExecStartPost command [rhel-7.7] | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Juan Gamba <jgamba> | |
| Component: | systemd | Assignee: | Jan Synacek <jsynacek> | |
| Status: | CLOSED ERRATA | QA Contact: | Frantisek Sumsal <fsumsal> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | urgent | |||
| Version: | 7.7 | CC: | jsynacek, rmetrich, sbroz, systemd-maint-list | |
| Target Milestone: | rc | Keywords: | Patch, ZStream | |
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1837972 1837973 (view as bug list) | Environment: | ||
| Last Closed: | 2020-09-29 20:32:28 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1837972, 1837973 | |||
I checked all the sources, I don't understand how this could happen since s->control_command and s->control_command_id are "bound" together: when s->control_command_id is changed to _SERVICE_EXEC_COMMAND_INVALID, s->control_command is always set to NULL. But from the coredump, this wasn't the case. I'm trying to set up a reproducer but failed so far. I believe a condition would be to have a reload while a ExecStartPost executes. Additionally at least ExecStartPost are required and the 2nd one should execute at the time of the reload. fix merged to github master branch -> https://github.com/systemd-rhel/rhel-7/pull/108 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Low: systemd security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:4007 |
Description of problem: Systemd segfaults after or while reloading systemd segfaults reloading "kernel: systemd[1]: segfault at 48 ip 000055e3aedd0e68 sp 00007ffedf6a04a0 error 4 in systemd[55e3aed93000+163000]" [rhel-7.7] Version-Release number of selected component (if applicable): $ grep systemd installed-rpms systemd-219-67.el7_7.3.x86_64 Wed Mar 25 21:38:12 2020 systemd-libs-219-67.el7_7.3.i686 Wed Mar 25 21:38:25 2020 systemd-libs-219-67.el7_7.3.x86_64 Wed Mar 25 21:38:11 2020 systemd-python-219-67.el7_7.3.x86_64 Wed Mar 25 21:38:18 2020 systemd-sysv-219-67.el7_7.3.x86_64 Wed Mar 25 21:38:13 2020 How reproducible: Customer tell us the problem repeats itself now and then and provided a core. Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: journald extract: Apr 27 01:02:03 hostname systemd[1]: Reloading. Apr 27 01:02:03 hostname systemd[1]: Starting Insights Client... Apr 27 01:02:03 hostname systemd[1]: Reloading. Apr 27 01:02:03 hostname systemd[1]: Reloading. Apr 27 01:02:03 hostname kernel: systemd[1]: segfault at 48 ip 000055e3aedd0e68 sp 00007ffedf6a04a0 error 4 in systemd[55e3aed93000+163000] Apr 27 01:02:03 hostname abrt-hook-ccpp[4278]: Only 885MiB is available on /var/spool/abrt Apr 27 01:02:03 hostname abrt-hook-ccpp[4278]: Process 4277 (systemd) of user 0 killed by SIGSEGV - ignoring (low free space) Apr 27 01:02:03 hostname abrt-hook-ccpp[4278]: Saved core dump of pid 4277 to core.4277 at / (19238912 bytes) Apr 27 01:02:03 hostname systemd[1]: Caught <SEGV>, dumped core as pid 4277. Apr 27 01:02:03 hostname systemd[1]: Freezing execution. Apr 27 01:02:11 hostname insights-client[4256]: Starting to collect Insights data for some.hostname.net Apr 27 01:02:11 hostname insights-client[4256]: Failed to notify init system: Connection refused [..] Apr 27 01:02:26 hostname insights-client[4256]: Failed to notify init system: Connection refused Apr 27 01:02:28 hostname dbus[1712]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out Additional information from RMetrich: Coredump Analysis We have systemd dying while reloading: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- (gdb) bt #0 0x00007f1645682647 in kill () at ../sysdeps/unix/syscall-template.S:81 #1 0x000055e3aedb996d in crash (sig=11) at src/core/main.c:172 #2 <signal handler called> #3 0x000055e3aedd0e68 in service_exec_command_index (current=0x55e3b0afadf0, id=_SERVICE_EXEC_COMMAND_INVALID, u=0x55e3b0b69640) at src/core/service.c:2033 #4 service_serialize_exec_command (u=u@entry=0x55e3b0b69640, f=f@entry=0x55e3b0afe1f0, command=0x55e3b0afadf0) at src/core/service.c:2063 #5 0x000055e3aedd1277 in service_serialize (u=0x55e3b0b69640, f=0x55e3b0afe1f0, fds=0x55e3b0c25070) at src/core/service.c:2122 #6 0x000055e3aee67626 in unit_serialize (u=u@entry=0x55e3b0b69640, f=f@entry=0x55e3b0afe1f0, fds=fds@entry=0x55e3b0c25070, serialize_jobs=serialize_jobs@entry=true) at src/core/unit.c:2561 #7 0x000055e3aedbf962 in manager_serialize (m=m@entry=0x55e3b0aed040, f=0x55e3b0afe1f0, fds=fds@entry=0x55e3b0c25070, switching_root=switching_root@entry=false) at src/core/manager.c:2553 #8 0x000055e3aedc0998 in manager_reload (m=m@entry=0x55e3b0aed040) at src/core/manager.c:2792 #9 0x000055e3aedb6c5f in main (argc=5, argv=0x7ffedf6a1038) at src/core/main.c:1799 -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- Upon reload, systemd "serializes" the internal state. Here, it was serializing "insights-client.service": -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- (gdb) f 4 #4 service_serialize_exec_command (u=u@entry=0x55e3b0b69640, f=f@entry=0x55e3b0afe1f0, command=0x55e3b0afadf0) at src/core/service.c:2063 2063 idx = service_exec_command_index(u, id, command); (gdb) p s->main_pid $1 = 4256 (gdb) p u->id $2 = 0x55e3b0c46260 "insights-client.service" -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- systemd died while serializing ExecStartPost's 2nd command: " ExecStartPost=/bin/bash -c "if [ -d /sys/fs/cgroup/memory ]; then echo 1G > /sys/fs/cgroup/memory/system.slice/insights-client.service/memory.soft_limit_in_bytes; fi" " -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- (gdb) p command->argv[0] $5 = 0x55e3b0b51860 "/bin/bash" (gdb) p command->argv[1] $6 = 0x55e3b0b69f60 "-c" (gdb) p command->argv[2] $7 = 0x55e3b0b69f80 "if [ -d /sys/fs/cgroup/memory ]; then echo 1G > /sys/fs/cgroup/memory/system.slice/insights-client.service/memory.soft_limit_in_bytes; fi" -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- Relevant source code: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 2039 static int service_serialize_exec_command(Unit *u, FILE *f, ExecCommand *command) { 2040 Service *s = SERVICE(u); 2041 ServiceExecCommand id; 2042 unsigned idx; 2043 const char *type; 2044 char **arg; 2045 _cleanup_strv_free_ char **escaped_args = NULL; 2046 _cleanup_free_ char *args = NULL, *p = NULL; 2047 size_t allocated = 0, length = 0; 2048 2049 assert(s); 2050 assert(f); 2051 2052 if (!command) 2053 return 0; 2054 2055 if (command == s->control_command) { 2056 type = "control"; 2057 id = s->control_command_id; 2058 } else { 2059 type = "main"; 2060 id = SERVICE_EXEC_START; 2061 } 2062 2063 idx = service_exec_command_index(u, id, command); <--- HERE 2023 static unsigned service_exec_command_index(Unit *u, ServiceExecCommand id, ExecCommand *current) { 2024 Service *s = SERVICE(u); 2025 unsigned idx = 0; 2026 ExecCommand *first, *c; 2027 2028 assert(s); 2029 2030 first = s->exec_command[id]; <--- HERE 'first' is invalid 2031 2032 /* Figure out where we are in the list by walking back to the beginning */ 2033 for (c = current; c != first; c = c->command_prev) <--- HERE 2034 idx++; 2035 2036 return idx; 2037 } -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- Why it died? Because the "s->control_command_id" was not expected: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- (gdb) p s->control_command_id $8 = _SERVICE_EXEC_COMMAND_INVALID (gdb) p id $9 = _SERVICE_EXEC_COMMAND_INVALID -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- This caused the called function "service_exec_command_index" to derefence a NULL pointer: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- (gdb) down #3 0x000055e3aedd0e68 in service_exec_command_index (current=0x55e3b0afadf0, id=_SERVICE_EXEC_COMMAND_INVALID, u=0x55e3b0b69640) at src/core/service.c:2033 2033 for (c = current; c != first; c = c->command_prev) (gdb) p current $10 = (ExecCommand *) 0x55e3b0afadf0 (gdb) p c $11 = (ExecCommand *) 0x0 (gdb) p first $12 = (ExecCommand *) 0x55e3b0c48300 (gdb) p *first $13 = {path = 0x1 <Address 0x1 out of bounds>, argv = 0x55e3b0aed4e0, exec_status = {start_timestamp = { realtime = 94436410693184, monotonic = 0}, exit_timestamp = {realtime = 0, monotonic = 227}, pid = 100, code = 0, status = -1}, command_next = 0x0, command_prev = 0x0, ignore = 64} -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- We have 'c' now NULL and anyway 'first' was pointing to invalid memory. What remains to understand is when this could happen.