Bug 1970945
| Summary: | systemd coredump in message_extend_fields -> realloc | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Michele Baldessari <michele> | ||||
| Component: | systemd | Assignee: | Michal Sekletar <msekleta> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Frantisek Sumsal <fsumsal> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 8.6 | CC: | achernet, adam.winberg, bdobreli, bwelterl, byodlows, dciabrin, ealcaniz, ffotorel, fkrska, fsun, gerald.prock, gnaik, h13958451065, horst.thaller, jamacku, jcalhoun, jeckersb, jwennerberg, mmatsuya, msekleta, ohudlick, peter.vreman, pratshar, qguo, rhodain, rmetrich, sbalasub, sbroz, shobbs, siggib, skolosov, smeyer, soutteri, stanislav.moravec, systemd-maint-list, tartosh, tcrider, toneata, yferszt, yoliynyk, yvan.broccard, yzheng | ||||
| Target Milestone: | rc | Keywords: | Triaged, ZStream | ||||
| Target Release: | --- | Flags: | pm-rhel:
mirror+
|
||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | systemd-239-53.el8 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | |||||||
| : | 2024903 2026244 (view as bug list) | Environment: | |||||
| Last Closed: | 2022-05-10 15:25:47 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 2024903, 2026244 | ||||||
| Attachments: |
|
||||||
May be similar to BZ #1976833 (cannot tell since it's a memory coruption in malloc structures). I've prepared the test build of systemd (latest RHEL-8.4) that is instrumented using AdressSanitizer and UndefinedBehaviorSanitizer, https://msekleta.fedorapeople.org/systemd-rhel84-asan-ubsan/ To install the test packages run, yum-config-manager --add-repo https://msekleta.fedorapeople.org/systemd-rhel84-asan-ubsan/ yum --nogpgcheck update -y systemd After installing test rpms it is necessary to set sanitizer options (passed to systemd as env vars) on kernel command line and reboot. grubby --update-kernel ALL --args='ASAN_OPTIONS=strict_string_checks=1:detect_stack_use_after_return=1:check_initialization_order=1:strict_init_order=1:halt_on_error=0:log_path=/run/log/systemd-asan.log' grubby --update-kernel ALL --args='UBSAN_OPTIONS=print_stacktrace=1:print_summary=1:halt_on_error=0:log_path=/run/log/systemd-ubsan.log' Note that any memory or undefined behavior related errors will be logged to files /run/log/systemd-asan.log.1 and /run/log/systemd-ubsan.log.1. Whoever is going to install the builds should setup monitoring and watch out for presence of those files as their existence means we have triggered systemd bug. *** Bug 1976833 has been marked as a duplicate of this bug. *** HEADS UP:
The ASAN enabled systemd will crash on boot if embedded in the initramfs.
This is due to libasan trying to read /proc/self/mountinfo, which fails since /proc is not mounted yet:
~~~
==1==AddressSanitizer CHECK failed: ../../../../libsanitizer/sanitizer_common/sanitizer_procmaps_common.cc:75 "((data_.proc_self_maps.len)) > ((0))" (0x0, 0x0)
<empty stack>
X.XXXXXXX Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[...]
~~~
I've submitted a dracut PR to "fix" this: https://github.com/dracutdevs/dracut/pull/1602
Meanwhile, if you hit this, the recovery steps are to perform the following operations:
1. Boot with "rdinit=/bin/sh"
2. At the prompt, mount /proc
# mount -t proc proc /proc
3. Execute systemd
# exec /usr/lib/systemd/systemd
Attached is a patch for dracut to make sure that systemd can load properly: # patch /usr/lib/dracut/modules.d/00systemd/module-setup.sh 00systemd.patch patching file /usr/lib/dracut/modules.d/00systemd/module-setup.sh Created attachment 1822995 [details]
Patch for dracut to let systemd ASAN start properly in the initramfs as well
I have seen two systemd crashes on Rocky Linux 8.4, running systemd-239-45.el8_4.3.x86_64. The backtrace isn't identical but it's similar and also points to sd_bus message handling. I have encountered this problem on 2 VMs, one in VMware (2vCPU/4GB) and the other in RHV (4vCPU/32GB). First occurred in the middle of the day, second in the middle of the night, both under idle conditions. On both accounts this seems to have been triggered by a sysstat run. Here you can see the journal and a backtrace. https://pastebin.com/5yBSmmbP Encountered this systemd crash on DigitalOcean Rocky Linux 8.4:
[root@test]# systemctl --version
systemd 239 (239-45.el8_4.3)
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=legacy
[root@test]# coredumpctl info
Failed to check if any systemd-coredump@.service units are running: Connection timed out
PID: 195921 (systemd)
UID: 0 (root)
GID: 0 (root)
Signal: 6 (ABRT)
Timestamp: Sun 2021-10-03 20:56:55 BST (1 day 14h ago)
Command Line: /usr/lib/systemd/systemd --system --deserialize 32
Executable: /usr/lib/systemd/systemd
Control Group: /init.scope
Unit: init.scope
Slice: -.slice
Boot ID: f14aab64ca624f5b9204bea3cc6b726b
Machine ID: be1642266afc4781a3ce84889079d86b
Hostname: test
Storage: /var/lib/systemd/coredump/core.systemd.0.f14aab64ca624f5b9204bea3cc6b726b.195921.1633291015000000.lz4
Message: Process 195921 (systemd) of user 0 dumped core.
Stack trace of thread 195921:
#0 0x00007f2b0c50f68b kill (libc.so.6)
#1 0x000055de92e41f8a crash (systemd)
#2 0x00007f2b0c8afb30 __restore_rt (libpthread.so.0)
#3 0x00007f2b0c50f38f raise (libc.so.6)
#4 0x00007f2b0c4f9dc5 abort (libc.so.6)
#5 0x00007f2b0c5524f7 __libc_message (libc.so.6)
#6 0x00007f2b0c5595fc malloc_printerr (libc.so.6)
#7 0x00007f2b0c559e3c unlink_chunk.isra.2 (libc.so.6)
#8 0x00007f2b0c55cb91 _int_malloc (libc.so.6)
#9 0x00007f2b0c55dc82 malloc (libc.so.6)
#10 0x00007f2b0c59c4ba __alloc_dir (libc.so.6)
#11 0x00007f2b0c59c5bd opendir_tail (libc.so.6)
#12 0x00007f2b0dc88487 cg_enumerate_subgroups (libsystemd-shared-239.so)
#13 0x000055de92ec27f2 unit_watch_pids_in_path (systemd)
#14 0x000055de92e970ff service_enter_signal (systemd)
#15 0x000055de92e9ae65 service_sigchld_event (systemd)
#16 0x000055de92e7d740 manager_dispatch_sigchld (systemd)
#17 0x00007f2b0dd35abd source_dispatch (libsystemd-shared-239.so)
#18 0x00007f2b0dd37b8d sd_event_dispatch (libsystemd-shared-239.so)
#19 0x00007f2b0dd37d98 sd_event_run (libsystemd-shared-239.so)
#20 0x000055de92e86ac2 manager_loop (systemd)
#21 0x000055de92e3d481 main (systemd)
#22 0x00007f2b0c4fb4a3 __libc_start_main (libc.so.6)
#23 0x000055de92e3f0fe _start (systemd)
Found another one, this time CentOS 8.4, again triggered by sysstat.
Sep 21 22:30:33 localhost systemd[1]: Starting system activity accounting tool...
Sep 21 22:30:33 localhost systemd-coredump[633002]: Due to PID 1 having crashed coredump collection will now be turned off.
Sep 21 22:30:33 localhost systemd[1]: Caught <ABRT>, dumped core as pid 633001.
systemctl --version
systemd 239 (239-45.el8_4.3)
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=legacy
Failed to check if any systemd-coredump@.service units are running: Connection timed out
PID: 633001 (systemd)
UID: 0 (root)
GID: 0 (root)
Signal: 6 (ABRT)
Timestamp: Tue 2021-09-21 22:30:33 GMT (1 weeks 6 days ago)
Command Line: /usr/lib/systemd/systemd --switched-root --system --deserialize 17
Executable: /usr/lib/systemd/systemd
Control Group: /init.scope
Unit: init.scope
Slice: -.slice
Boot ID: dd20eda7cb844c77aa39ca13e9f01dbc
Machine ID: 5a0c87db412d45d3b3f32bd9298e3d81
Hostname: localhost.localdomain
Storage: /var/lib/systemd/coredump/core.systemd.0.dd20eda7cb844c77aa39ca13e9f01dbc.633001.1632263433000000.lz4
Message: Process 633001 (systemd) of user 0 dumped core.
Stack trace of thread 633001:
#0 0x00007fcf4dbee67b kill (libc.so.6)
#1 0x000056487f647f7a crash (systemd)
#2 0x00007fcf4df8eb20 __restore_rt (libpthread.so.0)
#3 0x00007fcf4dbee37f raise (libc.so.6)
#4 0x00007fcf4dbd8db5 abort (libc.so.6)
#5 0x00007fcf4dc314e7 __libc_message (libc.so.6)
#6 0x00007fcf4dc385ec malloc_printerr (libc.so.6)
#7 0x00007fcf4dc3bc7d _int_malloc (libc.so.6)
#8 0x00007fcf4dc3d8d6 __libc_calloc (libc.so.6)
#9 0x00007fcf4f3ccf4c sd_bus_message_new (libsystemd-shared-239.so)
#10 0x00007fcf4f3cd384 sd_bus_message_new_signal (libsystemd-shared-239.so)
#11 0x00007fcf4f3d9fe4 emit_properties_changed_on_interface (libsystemd-shared-239.so)
#12 0x00007fcf4f3ddc97 sd_bus_emit_properties_changed_strv (libsystemd-shared-239.so)
#13 0x000056487f65195c send_changed_signal (systemd)
#14 0x000056487f65a787 bus_foreach_bus (systemd)
#15 0x000056487f653924 bus_unit_send_change_signal (systemd)
#16 0x000056487f68ca3b manager_loop (systemd)
#17 0x000056487f643471 main (systemd)
#18 0x00007fcf4dbda493 __libc_start_main (libc.so.6)
#19 0x000056487f6450ee _start (systemd)
Not sure what triggered my coredump, but the lines in /var/log/messages immediately before it were: Oct 3 20:56:54 test systemd[1]: Starting dnf makecache... Oct 3 20:56:55 test dnf[195919]: Metadata cache refreshed recently. Oct 3 20:56:55 test systemd-coredump[195922]: Due to PID 1 having crashed coredump collection will now be turned off. Hi all, Just to be clear, the fact that it happens "randomly" is just due to how the memory allocator allocates memory: at some point in time the corruption happens and later, once malloc() wants to reuse the corrupted memory slot, systemd crashes. It may happen in a child of system as well, between fork and exec when systemd executes a ExecStart command for example (once exec'ed memory is replaced so corruption is gone as well). That's why it can be noticed with "sysstat" or "dnf makecache": it's just still systemd at that time, shortly after fork(). Renaud. Can we ask a customer to try systemd-239-45 (rhel-8.4 version) on top of rhel-8.3 and/or systemd-239-41 (version from rhel-8.3) on top of rhel-8.4? We would like to rule out the possibility, that the bug is in a different component. I was able to make sense out of provided sanitiser reports. I mapped addresses to symbols using gdb and top of the backtrace points to bug in function unit_fail_if_noncanonical(), specifically to line https://github.com/redhat-plumbers/systemd-rhel8/blob/master/src/core/unit.c#L4788. If chase_symlinks() fails then we free uninitialised pointer. Depending on the state of stack at the time of the function call the bug can subsequently trigger various crash behaviours. That is consistent with what we see here, i.e. some report abort() while other report segfaults. The bug is already fixed upstream. https://github.com/systemd/systemd/commit/58d9d89b4b41189bdcea86c2ad5cf708b7d54aca I think we should release z-stream update for RHEL-8.4 with the above fix together with the backport of https://github.com/systemd/systemd/pull/20670. Here are test rpms with the backport of patches that I've mentioned in comment #55, https://msekleta.fedorapeople.org/bz1970945-memory-corruption-test-rpms/ fix merged to github master branch -> https://github.com/redhat-plumbers/systemd-rhel8/pull/235 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (systemd bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:2069 *** Bug 2005790 has been marked as a duplicate of this bug. *** hello, I failed to start systemd in the way described above. I updated the dracut and compiled the 249 version of systemd. After the __lsan_do_recoverable_leak_check of ASAN is executed or the memory overflows, an error is reported. The error is as follows: localhost login: localhost login: [ 74.683917][ T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100 [ 74.685137][ T1] CPU: 9 PID: 1 Comm: systemd Kdump: loaded Tainted: G O 5.10.0-60.18.0.50.h551.eulerosv2r11.x86_64 #1 [ 74.687019][ T1] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 74.688009][ T1] Call Trace: [ 74.688540][ T1] dump_stack+0x57/0x6e [ 74.689257][ T1] panic+0x196/0x3ec [ 74.689863][ T1] do_exit.cold+0x14/0xaf [ 74.690556][ T1] ? sched_clock+0x5/0x10 [ 74.691230][ T1] do_group_exit+0x33/0xa0 [ 74.691955][ T1] __x64_sys_exit_group+0x14/0x20 [ 74.692738][ T1] do_syscall_64+0x30/0x40 [ 74.693419][ T1] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 74.694374][ T1] RIP: 0033:0x7f8db492b0e0 [ 74.695079][ T1] Code: 05 c3 0f 1f 84 00 00 00 00 00 b8 18 00 00 00 0f 05 c3 0f 1f 84 00 00 00 00 00 50 58 b8 e7 00 00 00 48 83 ec 08 48 63 ff 0f 05 <e8> 3b f8 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec 28 64 48 [ 74.698149][ T1] RSP: 002b:00007ffd99c06a00 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7 [ 74.699524][ T1] RAX: ffffffffffffffda RBX: 00007f8db5234618 RCX: 00007f8db492b0e0 [ 74.700775][ T1] RDX: 000000000000000f RSI: 0000000000000000 RDI: 0000000000000001 [ 74.701992][ T1] RBP: 00007f8db5234618 R08: 00007f8dae000094 R09: 0000000000000000 [ 74.703222][ T1] R10: 00007ffd99c064a0 R11: 0000000000000206 R12: 00007ffd99c06a90 [ 74.704509][ T1] R13: 0000000000000000 R14: 0000000000000000 R15: 000061c000000880 [ 74.705821][ T1] kernel fault(0x5) notification starting on CPU 9 [ 74.706909][ T1] kernel fault(0x5) notification finished on CPU 9 [ 74.709001][ T1] [kbox] catch panic event, panic reason:Attempted to kill init! exitcode=0x00000100 [ 74.710091][ T1] [kbox] catch panic event, start logging [ 74.710750][ T1] [kbox] stack: [ 74.711150][ T1] [kbox] ffffc01840013d68 0000000000000001 ffffc01840010000 ffffc01840014000 [ 74.712160][ T1] [kbox] 0000000000000000 20000000fffffffe 3030303030303030 3030303030303030 [ 74.713180][ T1] [kbox] 2020202020202020 2020202020202020 2020202020200030 3030303230323032 [ 74.714203][ T1] [kbox] 2020202020202020 2020202020202020 2020202020202020 0020202020202020 [ 74.715224][ T1] [kbox] fb7984b329126600 0000000000000040 0000000000000014 ffffffffc0d09d69 [ 74.716238][ T1] [kbox] 000000000000012c ffffffffc0d09efb ffffffffa69ddda0 0000000000000000 [ 74.717250][ T1] [kbox] ffffffffa632f5f8 ffffffffa510bbb6 ffffffffa50587d9 ffff9990801fa001 [ 74.718265][ T1] [kbox] ffffc01840013ed8 0000000000000001 0000000000000000 ffffffffa5a4e45c [ 74.719280][ T1] [kbox] 0000001100000010 ffffc01840013ee8 ffffc01840013e88 fb7984b329126600 [ 74.720298][ T1] [kbox] 0000000000000000 0000000000000100 0000000000000000 ffff999081b75a00 [ 74.721312][ T1] [kbox] 0000000000000000 0000000000000000 ffff9990801fa0c0 ffff9990801fa001 [ 74.722328][ T1] [kbox] 0000000000000001 0000000000000000 0000000000000100 ffffffffa5a4e976 [ 74.723352][ T1] [kbox] ffffffffa50273f5 0000000000000100 ffff9990803150c0 ffff9990801fa0c0 [ 74.724371][ T1] [kbox] 0000000000000000 ffffffffa50e62d3 0000000000000000 ffffc01840013f58 [ 74.725392][ T1] [kbox] 0000000000000000 0000000000000000 ffffffffa50e6354 ffffffffa5a96a40 [ 74.726403][ T1] [kbox] 0000000000000000 ffffffffa5c00099 000061c000000880 0000000000000000 [ 74.727417][ T1] [kbox] 0000000000000000 00007ffd99c06a90 00007f8db5234618 00007f8db5234618 [ 74.728434][ T1] [kbox] 0000000000000206 00007ffd99c064a0 0000000000000000 00007f8dae000094 [ 74.729449][ T1] [kbox] ffffffffffffffda 00007f8db492b0e0 000000000000000f 0000000000000000 [ 74.730469][ T1] [kbox] 0000000000000001 00000000000000e7 00007f8db492b0e0 0000000000000033 (In reply to huyubiao from comment #73) > hello, > I failed to start systemd in the way described above. > I updated the dracut and compiled the 249 version of systemd. After the > __lsan_do_recoverable_leak_check of ASAN is executed or the memory > overflows, an error is reported. The error is as follows: > > localhost login: > localhost login: [ 74.683917][ T1] Kernel panic - not syncing: > Attempted to kill init! exitcode=0x00000100 > [ 74.685137][ T1] CPU: 9 PID: 1 Comm: systemd Kdump: loaded Tainted: G #1 > [ 74.687019][ T1] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 > [ 74.688009][ T1] Call Trace: > [ 74.688540][ T1] dump_stack+0x57/0x6e > [ 74.689257][ T1] panic+0x196/0x3ec > [ 74.689863][ T1] do_exit.cold+0x14/0xaf > [ 74.690556][ T1] ? sched_clock+0x5/0x10 > [ 74.691230][ T1] do_group_exit+0x33/0xa0 > [ 74.691955][ T1] __x64_sys_exit_group+0x14/0x20 > [ 74.692738][ T1] do_syscall_64+0x30/0x40 > [ 74.693419][ T1] entry_SYSCALL_64_after_hwframe+0x61/0xc6 > [ 74.694374][ T1] RIP: 0033:0x7f8db492b0e0 > [ 74.695079][ T1] Code: 05 c3 0f 1f 84 00 00 00 00 00 b8 18 00 00 00 0f > 05 c3 0f 1f 84 00 00 00 00 00 50 58 b8 e7 00 00 00 48 83 ec 08 48 63 ff 0f > 05 <e8> 3b f8 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec 28 64 48 > [ 74.698149][ T1] RSP: 002b:00007ffd99c06a00 EFLAGS: 00000206 ORIG_RAX: > 00000000000000e7 > [ 74.699524][ T1] RAX: ffffffffffffffda RBX: 00007f8db5234618 RCX: > 00007f8db492b0e0 > [ 74.700775][ T1] RDX: 000000000000000f RSI: 0000000000000000 RDI: > 0000000000000001 > [ 74.701992][ T1] RBP: 00007f8db5234618 R08: 00007f8dae000094 R09: > 0000000000000000 > [ 74.703222][ T1] R10: 00007ffd99c064a0 R11: 0000000000000206 R12: > 00007ffd99c06a90 > [ 74.704509][ T1] R13: 0000000000000000 R14: 0000000000000000 R15: > 000061c000000880 > [ 74.705821][ T1] kernel fault(0x5) notification starting on CPU 9 > [ 74.706909][ T1] kernel fault(0x5) notification finished on CPU 9 > [ 74.709001][ T1] [kbox] catch panic event, panic reason:Attempted to > kill init! exitcode=0x00000100 > [ 74.710091][ T1] [kbox] catch panic event, start logging > [ 74.710750][ T1] [kbox] stack: > [ 74.711150][ T1] [kbox] ffffc01840013d68 0000000000000001 > ffffc01840010000 ffffc01840014000 > [ 74.712160][ T1] [kbox] 0000000000000000 20000000fffffffe > 3030303030303030 3030303030303030 > [ 74.713180][ T1] [kbox] 2020202020202020 2020202020202020 > 2020202020200030 3030303230323032 > [ 74.714203][ T1] [kbox] 2020202020202020 2020202020202020 > 2020202020202020 0020202020202020 > [ 74.715224][ T1] [kbox] fb7984b329126600 0000000000000040 > 0000000000000014 ffffffffc0d09d69 > [ 74.716238][ T1] [kbox] 000000000000012c ffffffffc0d09efb > ffffffffa69ddda0 0000000000000000 > [ 74.717250][ T1] [kbox] ffffffffa632f5f8 ffffffffa510bbb6 > ffffffffa50587d9 ffff9990801fa001 > [ 74.718265][ T1] [kbox] ffffc01840013ed8 0000000000000001 > 0000000000000000 ffffffffa5a4e45c > [ 74.719280][ T1] [kbox] 0000001100000010 ffffc01840013ee8 > ffffc01840013e88 fb7984b329126600 > [ 74.720298][ T1] [kbox] 0000000000000000 0000000000000100 > 0000000000000000 ffff999081b75a00 > [ 74.721312][ T1] [kbox] 0000000000000000 0000000000000000 > ffff9990801fa0c0 ffff9990801fa001 > [ 74.722328][ T1] [kbox] 0000000000000001 0000000000000000 > 0000000000000100 ffffffffa5a4e976 > [ 74.723352][ T1] [kbox] ffffffffa50273f5 0000000000000100 > ffff9990803150c0 ffff9990801fa0c0 > [ 74.724371][ T1] [kbox] 0000000000000000 ffffffffa50e62d3 > 0000000000000000 ffffc01840013f58 > [ 74.725392][ T1] [kbox] 0000000000000000 0000000000000000 > ffffffffa50e6354 ffffffffa5a96a40 > [ 74.726403][ T1] [kbox] 0000000000000000 ffffffffa5c00099 > 000061c000000880 0000000000000000 > [ 74.727417][ T1] [kbox] 0000000000000000 00007ffd99c06a90 > 00007f8db5234618 00007f8db5234618 > [ 74.728434][ T1] [kbox] 0000000000000206 00007ffd99c064a0 > 0000000000000000 00007f8dae000094 > [ 74.729449][ T1] [kbox] ffffffffffffffda 00007f8db492b0e0 > 000000000000000f 0000000000000000 > [ 74.730469][ T1] [kbox] 0000000000000001 00000000000000e7 > 00007f8db492b0e0 0000000000000033 |
Description of problem: today we observed a coredump of systemd on a test ceph node (running in a rhel8.4 VM). The machine had been booted a day before and was relatively idle (besides a couple of ceph containers running). [root@ceph-1 coredump]# coredumpctl info Failed to check if any systemd-coredump@.service units are running: Connection timed out PID: 30920 (systemd) UID: 0 (root) GID: 0 (root) Signal: 11 (SEGV) Timestamp: Fri 2021-06-11 07:19:38 UTC (6h ago) Command Line: /usr/lib/systemd/systemd --switched-root --system --deserialize 17 Executable: /usr/lib/systemd/systemd Control Group: /init.scope Unit: init.scope Slice: -.slice Boot ID: 042badcdc3d64775bd0b9f227bd23172 Machine ID: 84415a69e17940f58f15b2a01ec76373 Hostname: ceph-1 Storage: /var/lib/systemd/coredump/core.systemd.0.042badcdc3d64775bd0b9f227bd23172.30920.1623395978000000.lz4 Message: Process 30920 (systemd) of user 0 dumped core. Stack trace of thread 30920: #0 0x00007f072a82967b kill (libc.so.6) #1 0x0000560b02df2f7a crash (systemd) #2 0x00007f072abc9b20 __restore_rt (libpthread.so.0) #3 0x00007f072a876264 _int_malloc (libc.so.6) #4 0x00007f072a8773af _int_realloc (libc.so.6) #5 0x00007f072a87867b realloc (libc.so.6) #6 0x00007f072c006076 message_extend_fields (libsystemd-shared-239.so) #7 0x00007f072c00644f message_append_field_string (libsystemd-shared-239.so) #8 0x00007f072c0089a0 message_new_reply (libsystemd-shared-239.so) #9 0x00007f072c00a312 sd_bus_message_new_method_error (libsystemd-shared-239.so) #10 0x00007f072bff913e sd_bus_reply_method_error (libsystemd-shared-239.so) #11 0x00007f072bff9653 sd_bus_reply_method_errno (libsystemd-shared-239.so) #12 0x00007f072c002f1d bus_maybe_reply_error (libsystemd-shared-239.so) #13 0x00007f072c016340 object_find_and_run (libsystemd-shared-239.so) #14 0x00007f072c017afc bus_process_object (libsystemd-shared-239.so) #15 0x00007f072c026a2a bus_process_internal (libsystemd-shared-239.so) #16 0x00007f072c02871c io_callback (libsystemd-shared-239.so) #17 0x00007f072c04fdf9 source_dispatch (libsystemd-shared-239.so) #18 0x00007f072c051bed sd_event_dispatch (libsystemd-shared-239.so) #19 0x00007f072c051df8 sd_event_run (libsystemd-shared-239.so) #20 0x0000560b02e37ab2 manager_loop (systemd) #21 0x0000560b02dee471 main (systemd) #22 0x00007f072a815493 __libc_start_main (libc.so.6) #23 0x0000560b02df00ee _start (systemd) (gdb) bt #0 0x00007f072a82967b in kill () at ../sysdeps/unix/syscall-template.S:78 #1 0x0000560b02df2f7a in crash (sig=11) at ../src/core/main.c:194 #2 <signal handler called> #3 _int_malloc (av=av@entry=0x7f072abb1bc0 <main_arena>, bytes=bytes@entry=33) at malloc.c:3655 #4 0x00007f072a8773af in _int_realloc (av=av@entry=0x7f072abb1bc0 <main_arena>, oldp=oldp@entry=0x560b03441770, oldsize=oldsize@entry=32, nb=nb@entry=48) at malloc.c:4612 #5 0x00007f072a87867b in __GI___libc_realloc (oldmem=0x560b03441780, bytes=bytes@entry=40) at malloc.c:3238 #6 0x00007f072c006076 in message_extend_fields (m=m@entry=0x560b034460b0, align=align@entry=8, sz=sz@entry=16, add_offset=add_offset@entry=false) at ../src/libsystemd/sd-bus/bus-message.c:163 #7 0x00007f072c00644f in message_append_field_string (m=0x560b034460b0, h=h@entry=6, type=type@entry=115 's', s=0x560b0348b100 ":1.1197", ret=0x560b034460e0) at ../src/libsystemd/sd-bus/bus-message.c:252 #8 0x00007f072c0089a0 in message_new_reply (call=0x560b0347c9d0, type=<optimized out>, m=0x7fff0ad56d50) at ../src/libsystemd/sd-bus/bus-message.c:725 #9 0x00007f072c00a312 in sd_bus_message_new_method_error (call=call@entry=0x560b0347c9d0, m=m@entry=0x7fff0ad56d80, e=e@entry=0x7fff0ad56f00) at ../src/libsystemd/sd-bus/bus-message.c:755 #10 0x00007f072bff913e in sd_bus_reply_method_error (call=0x560b0347c9d0, e=0x7fff0ad56f00) at ../src/libsystemd/sd-bus/bus-convenience.c:181 #11 0x00007f072bff9653 in sd_bus_reply_method_errno (call=0x560b0347c9d0, error=-3, p=0x7fff0ad56f00) at ../src/libsystemd/sd-bus/bus-convenience.c:236 #12 0x00007f072c002f1d in bus_maybe_reply_error (m=m@entry=0x560b0347c9d0, r=r@entry=-3, error=error@entry=0x7fff0ad56f00) at ../src/libsystemd/sd-bus/bus-internal.c:341 #13 0x00007f072c016340 in method_callbacks_run (found_object=0x7fff0ad56fb7, require_fallback=false, c=<optimized out>, m=0x560b0347c9d0, bus=0x560b03435a90) at ../src/libsystemd/sd-bus/bus-objects.c:409 #14 object_find_and_run (bus=0x560b03435a90, m=0x560b0347c9d0, p=0x560b0348b058 "/org/freedesktop/systemd1", require_fallback=false, found_object=0x7fff0ad56fb7) at ../src/libsystemd/sd-bus/bus-objects.c:1268 #15 0x00007f072c017afc in bus_process_object (bus=bus@entry=0x560b03435a90, m=m@entry=0x560b0347c9d0) at ../src/libsystemd/sd-bus/bus-objects.c:1388 #16 0x00007f072c026a2a in process_message (m=0x560b0347c9d0, bus=0x560b03435a90) at ../src/libsystemd/sd-bus/sd-bus.c:2705 #17 process_running (ret=0x0, priority=<optimized out>, hint_priority=<optimized out>, bus=0x560b03435a90) at ../src/libsystemd/sd-bus/sd-bus.c:2747 #18 bus_process_internal (bus=0x560b03435a90, hint_priority=<optimized out>, priority=<optimized out>, ret=0x0) at ../src/libsystemd/sd-bus/sd-bus.c:2966 #19 0x00007f072c02871c in io_callback (s=<optimized out>, fd=<optimized out>, revents=<optimized out>, userdata=0x560b03435a90) at ../src/libsystemd/sd-bus/sd-bus.c:3346 #20 0x00007f072c04fdf9 in source_dispatch (s=0x560b03402de0) at ../src/libsystemd/sd-event/sd-event.c:3365 #21 0x00007f072c051bed in sd_event_dispatch (e=0x560b032c1000) at ../src/libsystemd/sd-event/sd-event.c:3784 #22 0x00007f072c051df8 in sd_event_run (e=0x560b032c1000, timeout=18446744073709551615) at ../src/libsystemd/sd-event/sd-event.c:3844 #23 0x0000560b02e37ab2 in manager_loop (m=m@entry=0x560b032c0a70) at ../src/core/manager.c:2884 #24 0x0000560b02dee471 in invoke_main_loop (ret_error_message=0x7fff0ad57378, ret_switch_root_init=<synthetic pointer>, ret_switch_root_dir=<synthetic pointer>, ret_fds=0x7fff0ad57388, ret_shutdown_verb=<synthetic pointer>, ret_retval=<synthetic pointer>, ret_reexecute=<synthetic pointer>, saved_rlimit_memlock=0x7fff0ad573f0, saved_rlimit_nofile=0x7fff0ad573e0, m=0x560b032c0a70) at ../src/core/main.c:1665 #25 main (argc=<optimized out>, argv=0x7fff0ad57658) at ../src/core/main.c:2606 (gdb) fr 6 #6 0x00007f072c006076 in message_extend_fields (m=m@entry=0x560b034460b0, align=align@entry=8, sz=sz@entry=16, add_offset=add_offset@entry=false) at ../src/libsystemd/sd-bus/bus-message.c:163 163 np = realloc(m->header, ALIGN8(new_size)); (gdb) p m->header $1 = (struct bus_header *) 0x560b03441780 (gdb) p *( m->header) $2 = {endian = 108 'l', type = 3 '\003', flags = 1 '\001', version = 1 '\001', {dbus1 = {body_size = 0, serial = 0, fields_size = 0}, dbus2 = {_reserved = 0, cookie = 0}}} (gdb) Version-Release number of selected component (if applicable): systemd-239-45.el8.x86_64 Full coredump file and sosreport (taken after a VM reset as the system was fairly hosed) are here: http://file.rdu.redhat.com/~mbaldess/systemd-coredump-rhel-8.4/ (Note we've only seen this once so far, so there is no way to easily reproduce this)