Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1760149

Summary: systemctl daemon-reload crashes systemd "free(): invalid next size (fast)" malloc
Product: Red Hat Enterprise Linux 7
Reporter: Rick White <rickatnight11>
Component: systemd
Assignee: systemd-maint
Status: CLOSED ERRATA
QA Contact: Frantisek Sumsal <fsumsal>
Severity: urgent
Priority: urgent
Version: 7.7
CC: bwelterl, cbesson, hmatsumo, jsynacek, msekleta, redhat-bugzilla, rmetrich, sbroz, systemd-maint-list
Target Milestone: rc
Keywords: EasyFix, Patch, Reproducer, ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: systemd-219-73.el7.1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1778083 (view as bug list)
Environment:
Last Closed: 2020-03-31 20:02:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1778083
Attachments:
  backtrace from ABRT (flags: none)

Description Rick White 2019-10-10 01:10:53 UTC
Created attachment 1624112 [details]
backtrace from ABRT

Description of problem:

Under certain circumstances running `systemctl daemon-reload` can crash systemd due to:

```
...
Thread 1 (LWP 15825):
#0  0x00007f5312567607 in kill () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x0000559568e8d95d in crash (sig=6) at src/core/main.c:172
        rl = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        sa = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
        pid = <optimized out>
        __func__ = "crash"
        __PRETTY_FUNCTION__ = "crash"
#2  <signal handler called>
No locals.
#3  0x00007f5312567337 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
        resultvar = 0
        pid = 1
        selftid = 1
#4  0x00007f5312568a28 in __GI_abort () at abort.c:90
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x7f53126b9c31, sa_sigaction = 0x7f53126b9c31}, sa_mask = {__val = {3, 140736349168843, 5, 139994768051687, 1, 139994768060401, 3, 140736349168820, 12, 139994768060405, 2, 139994768060405, 2, 140736349169632, 14073634916
9632, 140736349171392}}, sa_flags = 14, sa_restorer = 0x7fffbc1969e0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#5  0x00007f53125a9e87 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f53126bc3b8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
        ap = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7fffbc196ed0, reg_save_area = 0x7fffbc196de0}}
        ap_copy = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7fffbc196ed0, reg_save_area = 0x7fffbc196de0}}
        fd = 2
        on_2 = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
        written = <optimized out>
#6  0x00007f53125b2679 in malloc_printerr (ar_ptr=0x7f53128f8760 <main_arena>, ptr=<optimized out>, str=0x7f53126bc530 "free(): invalid next size (fast)", action=3) at malloc.c:4967
        buf = "000055956ad9dac0"
        cp = <optimized out>
        ar_ptr = 0x7f53128f8760 <main_arena>
        ptr = <optimized out>
        str = 0x7f53126bc530 "free(): invalid next size (fast)"
        action = 3
#7  _int_free (av=0x7f53128f8760 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3843
        size = <optimized out>
        fb = <optimized out>
        nextchunk = <optimized out>
        nextsize = <optimized out>
        nextinuse = <optimized out>
        prevsize = <optimized out>
        bck = <optimized out>
        fwd = <optimized out>
        errstr = 0x7f53126bc530 "free(): invalid next size (fast)"
        locked = <optimized out>
...
```

Version-Release number of selected component (if applicable):

systemd-219-67.el7_7.1

How reproducible:

Roughly 15% of new servers provisioned with the same workflow (which takes a base el7.7 install and, through Puppet, deploys a litany of packages and config changes, performing a `systemctl daemon-reload` midway through) are hitting this bug.  Simply restarting the server resolves the issue.

Steps to Reproduce:

It's difficult to say which specific part of our provisioning workflow triggers this behavior, since it doesn't happen 100% of the time, but I will continue digging and try to reproduce it more specifically.

Actual results:

systemd crashes


Expected results:

systemd doesn't crash

Additional info:

Full backtrace attached

Comment 2 Jan Synacek 2019-10-16 08:52:43 UTC
According to the CentOS tracker, this issue should have been fixed by https://github.com/systemd-rhel/rhel-7/pull/38/commits/3a3c7374893751d2974cbb1b05dfbf71b49df590, which we have already merged.

Comment 3 Rick White 2019-10-29 14:13:51 UTC
We're running the latest `systemd-219-67.el7_7.2.x86_64`, which contains the above fix, but we're still seeing this behavior.  I'm not sure if the above fix is related.

Comment 4 Renaud Métrich 2019-11-21 15:03:09 UTC
Hi Rick,

Do you have a Red Hat subscription? If so please create a support case, I will have a look into this.
Otherwise, could you please share a coredump of systemd (latest, systemd-219-67.el7_7.2.x86_64)?

Thanks,

Renaud.

Comment 5 Michael Chapman 2019-11-22 04:13:24 UTC
I am able to reproduce this reliably on systemd-219-67.el7_7.2.x86_64.

The reproducer steps are in https://bugs.centos.org/view.php?id=16441#c35733 .

Comment 6 Renaud Métrich 2019-11-22 10:48:32 UTC
Thanks, I can also reproduce this now:

# cat << EOF >/etc/systemd/system/foobar-a.service
[Unit]
After=foobar-b.socket
[Service]
ExecStart=/bin/sleep 600
Restart=always
EOF

# cat << EOF >/etc/systemd/system/foobar-b.socket
[Socket]
ListenDatagram=/tmp/socket
EOF

# systemctl daemon-reload
# systemctl start foobar-a.service

repro.sh:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
#!/bin/bash

ln -sf /etc/systemd/system/foobar-{a,b}.service; systemctl daemon-reload; systemctl daemon-reload
rm -f /etc/systemd/system/foobar-b.service; systemctl daemon-reload
systemctl daemon-reexec
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

I'm working on it.

The issue seems related to the unit (here "foobar-a.service") having a broken cgroup path "/sys/fs/cgroup/systemd/sys/fs/cgroup/systemd".
Initially the path is good, but it seems to change while re-loading or re-execing.

Comment 7 Renaud Métrich 2019-11-22 12:14:27 UTC
For some reason, rebuilding systemd linked with the "mcheck" library makes the issue disappear entirely:

$ git diff
diff --git a/systemd.spec b/systemd.spec
index a446e0b..d489b7e 100644
--- a/systemd.spec
+++ b/systemd.spec
@@ -7,7 +7,7 @@
 Name:           systemd
 Url:            http://www.freedesktop.org/wiki/Software/systemd
 Version:        219
-Release:        67%{?dist}.2
+Release:        67%{?dist}.2.mcheck
 # For a breakdown of the licensing, see README
 License:        LGPLv2+ and MIT and GPLv2+
 Summary:        A System and Service Manager
@@ -1011,6 +1011,7 @@ CONFIGURE_OPTS=(
 %ifarch s390 s390x ppc %{power64} aarch64
     --disable-lto
 %endif
+    --with-ldflags=-lmcheck
 )
 
 %configure "${CONFIGURE_OPTS[@]}"

Comment 8 Renaud Métrich 2019-11-23 12:45:01 UTC
I finally found where the memory was being corrupted; I had to spend several hours instrumenting the code with traces.

When a unit is freed (which happens upon reload), its cgroup_path was freed but not reset to NULL.
Because of this, a /sys/fs/cgroup/... path allocated later during unit destruction could reuse the memory location still referenced by the dangling cgroup_path pointer, crashing systemd later on.

See PR https://github.com/systemd-rhel/rhel-7/pull/58 for details.

Comment 9 Michael Chapman 2019-11-24 22:11:19 UTC
While this patch stops systemd from crashing, I do not believe it solves the underlying problem.

This is with that patch applied:

    root@test:~# systemctl start foobar-a.service
    root@test:~# systemctl daemon-reexec
    root@test:~# ln -sf /etc/systemd/system/foobar-{a,b}.service; systemctl daemon-reload; systemctl daemon-reload
    root@test:~# rm -f /etc/systemd/system/foobar-b.service; systemctl daemon-reload
    root@test:~# systemctl status foobar-a.service
    * foobar-a.service
       Loaded: loaded (/etc/systemd/system/foobar-a.service; static; vendor preset: disabled)
       Active: active (running) since Mon 2019-11-25 09:04:36 AEDT; 11s ago
     Main PID: 18552 (sleep)
       CGroup: /system.slice/foobar-a.service
               `-18552 /bin/sleep 600

    Nov 25 09:04:36 test systemd[1]: Started foobar-a.service.
    root@test:~# systemctl daemon-reexec
    root@test:~# ln -sf /etc/systemd/system/foobar-{a,b}.service; systemctl daemon-reload; systemctl daemon-reload
    root@test:~# rm -f /etc/systemd/system/foobar-b.service; systemctl daemon-reload
    root@test:~# systemctl status foobar-a.service
    * foobar-a.service
       Loaded: loaded (/etc/systemd/system/foobar-a.service; static; vendor preset: disabled)
       Active: active (running) since Mon 2019-11-25 09:04:36 AEDT; 18s ago
     Main PID: 18552 (sleep)
       CGroup: /system.slice/foobar-a.service
               `-18552 /bin/sleep 600

    Nov 25 09:04:36 test systemd[1]: Started foobar-a.service.
    root@test:~# systemctl daemon-reexec
    root@test:~# ln -sf /etc/systemd/system/foobar-{a,b}.service; systemctl daemon-reload; systemctl daemon-reload
    root@test:~# rm -f /etc/systemd/system/foobar-b.service; systemctl daemon-reload
    root@test:~# systemctl status foobar-a.service
    * foobar-a.service
       Loaded: loaded (/etc/systemd/system/foobar-a.service; static; vendor preset: disabled)
       Active: active (running) since Mon 2019-11-25 09:04:36 AEDT; 23s ago
     Main PID: 18552 (sleep)
       CGroup: /system.slice/foobar-a.service
               `-18552 /bin/sleep 600

    Nov 25 09:04:36 test systemd[1]: Started foobar-a.service.
    root@test:~# systemctl daemon-reexec
    root@test:~# ln -sf /etc/systemd/system/foobar-{a,b}.service; systemctl daemon-reload; systemctl daemon-reload
    root@test:~# rm -f /etc/systemd/system/foobar-b.service; systemctl daemon-reload
    root@test:~# systemctl status foobar-a.service
    * foobar-a.service
       Loaded: loaded (/etc/systemd/system/foobar-a.service; static; vendor preset: disabled)
       Active: inactive (dead)
     Main PID: 18552
       CGroup: /system.slice/foobar-a.service
               `-18552 /bin/sleep 600

    Nov 25 09:04:36 test systemd[1]: Started foobar-a.service.

It took four goes to hit the bug, but the final `systemctl status foobar-a.service` output shows the problem. The service is marked "inactive (dead)", yet the service's main process is still running!

In fact, systemd will happily let you start an additional copy of the service:

    root@test:~# systemctl start foobar-a.service
    root@test:~# systemctl status foobar-a.service
    * foobar-a.service
       Loaded: loaded (/etc/systemd/system/foobar-a.service; static; vendor preset: disabled)
       Active: active (running) since Mon 2019-11-25 09:08:40 AEDT; 1s ago
     Main PID: 18861 (sleep)
       CGroup: /system.slice/foobar-a.service
               `-18552 /bin/sleep 600
               `-18861 /bin/sleep 600

    Nov 25 09:08:40 test systemd[1]: Started foobar-a.service.

I think there must be something else still clobbering the state of foobar-a.service.

Comment 10 Michael Chapman 2019-11-24 23:01:38 UTC
The bug specifically occurs when the additional link is added to the unit (i.e. not when that link is removed):

    root@test:~# systemctl daemon-reexec
    root@test:~# systemctl status foobar-a.service; ln -sf /etc/systemd/system/foobar-{a,b}.service; systemctl daemon-reload; systemctl status foobar-a.service
    * foobar-a.service
       Loaded: loaded (/etc/systemd/system/foobar-a.service; static; vendor preset: disabled)
       Active: active (running) since Mon 2019-11-25 09:58:42 AEDT; 54s ago
     Main PID: 19170 (sleep)
       CGroup: /system.slice/foobar-a.service
               `-19170 /bin/sleep 600

    Nov 25 09:58:42 test systemd[1]: Started foobar-a.service.
    * foobar-a.service
       Loaded: loaded (/etc/systemd/system/foobar-a.service; enabled; vendor preset: disabled)
       Active: inactive (dead)
     Main PID: 19170 (sleep)
       CGroup: /system.slice/foobar-a.service
               `-19170 /bin/sleep 600

    Nov 25 09:58:42 test systemd[1]: Started foobar-a.service.

Comment 11 Michael Chapman 2019-11-24 23:06:34 UTC
And with a `kill 19170` immediately after that, systemd crashed with an "Uh, main process died at wrong time" assertion:

    (gdb) bt 
    #0  0x00007f0fa47e9607 in kill () at ../sysdeps/unix/syscall-template.S:81
    #1  0x000055d8b367926d in crash (sig=6) at src/core/main.c:172
    #2  <signal handler called>
    #3  0x00007f0fa47e9337 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
    #4  0x00007f0fa47eaa28 in __GI_abort () at abort.c:90
    #5  0x000055d8b36e3cf2 in log_assert_failed_unreachable (text=text@entry=0x55d8b375eeb8 "Uh, main process died at wrong time.", 
        file=file@entry=0x55d8b375c830 "src/core/service.c", line=line@entry=2753, 
        func=func@entry=0x55d8b375f6e0 <__PRETTY_FUNCTION__.18132> "service_sigchld_event") at src/shared/log.c:759
    #6  0x000055d8b36991ff in service_sigchld_event (u=0x55d8b552d780, pid=<optimized out>, code=<optimized out>, status=15) at src/core/service.c:2753
    #7  0x000055d8b3679d3e in invoke_sigchld_event (m=m@entry=0x55d8b5452490, u=u@entry=0x55d8b552d780, si=si@entry=0x7ffc245e5410) at src/core/manager.c:1844
    #8  0x000055d8b367a096 in manager_dispatch_sigchld (m=m@entry=0x55d8b5452490) at src/core/manager.c:1896
    #9  0x000055d8b3680fcf in manager_dispatch_signal_fd (source=<optimized out>, fd=<optimized out>, revents=<optimized out>, userdata=0x55d8b5452490)
        at src/core/manager.c:2155
    #10 0x000055d8b3722bc0 in source_dispatch (s=s@entry=0x55d8b5453030) at src/libsystemd/sd-event/sd-event.c:2115
    #11 0x000055d8b3723c5a in sd_event_dispatch (e=0x55d8b5452c20) at src/libsystemd/sd-event/sd-event.c:2472
    #12 0x000055d8b3723dff in sd_event_run (e=<optimized out>, timeout=<optimized out>) at src/libsystemd/sd-event/sd-event.c:2501
    #13 0x000055d8b3681cf3 in manager_loop (m=0x55d8b5452490) at src/core/manager.c:2274
    #14 0x000055d8b3675f88 in main (argc=4, argv=0x7ffc245e60c8) at src/core/main.c:1784

Comment 12 Michael Chapman 2019-11-25 03:54:47 UTC
I'm pretty sure I can explain how this bug occurs now, though I am not able to suggest an easy fix.

First, we can simplify the test case a bit so that we just need two service units, no socket units:

    # cat << EOF >/etc/systemd/system/foobar-a.service
    [Unit]
    Wants=foobar-b.service
    [Service]
    ExecStart=/bin/sleep 600
    Restart=always
    EOF
    # systemctl daemon-reload
    # systemctl start foobar-a.service

Here foobar-a.service refers to an initially-nonexistent foobar-b.service.

We can dump systemd's internal state with `systemd-analyze dump`. This iterates through the units, in whatever order they happen to be in the internal hashmap. Let's say it outputs:

    # systemd-analyze dump
    ...
    -> Unit foobar-a.service:
            Description: foobar-a.service
            Instance: n/a
            Unit Load State: loaded
            Unit Active State: active
    ...
    -> Unit foobar-b.service:
            Description: foobar-b.service
            Instance: n/a
            Unit Load State: not-found
            Unit Active State: inactive
    ...

We want foobar-b to come _after_ foobar-a. If it comes before, we can use `systemctl daemon-reexec` to reinitialize everything, and with luck the hashmap ordering will be different.

Next we create that symlink and reload systemd:

    # ln -sf /etc/systemd/system/foobar-{a,b}.service
    # systemctl daemon-reload

systemd serializes its state, unloads and frees most of its data structures, then uses the serialized state to reinitialize everything. The key loop is in manager.c:

   2730         for (;;) {
   2731                 _cleanup_free_ char *line = NULL;
   2732                 Unit *u;
   2733 
   2734                 /* Start marker */
   2735                 r = read_line(f, LONG_LINE_MAX, &line);
   2736                 if (r < 0)
   2737                         return log_error_errno(r, "Failed to read serialization line: %m");
   2738                 if (r == 0)
   2739                         break;
   2740 
   2741                 r = manager_load_unit(m, strstrip(line), NULL, NULL, &u);
   2742                 if (r < 0)
   2743                         goto finish;
   2744 
   2745                 r = unit_deserialize(u, f, fds);
   2746                 if (r < 0)
   2747                         goto finish;
   2748         }

Note that it first loads the unit before deserializing the unit's state. Since foobar-a comes first in the serialization stream, it is loaded first and deserialized first. When foobar-b is loaded, systemd follows the symlink and ends up simply adding foobar-b as an additional name onto this unit. Then it deserializes the state for foobar-b. Since foobar-b is now just an additional name on the foobar-a unit, the deserialized state for foobar-b ends up clobbering the state of foobar-a.

Comment 13 Michael Chapman 2019-11-25 04:45:17 UTC
I can reproduce this last crash on my Fedora system, so it looks like an upstream bug. I've opened https://github.com/systemd/systemd/issues/14141 to track it.

Comment 14 Renaud Métrich 2019-11-25 07:50:02 UTC
Hi Michael,

Thank you for the additional details.
My patch fixes the segfault, but yes, not the other problem you mentioned, which apparently happens upstream as well and should be tracked through another BZ.

Renaud.

Comment 17 Lukáš Nykrýn 2019-11-28 10:52:05 UTC
fix merged to github master branch -> https://github.com/systemd-rhel/rhel-7/pull/58 -> post

Comment 23 errata-xmlrpc 2020-03-31 20:02:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1117