Bug 1930875 - systemd-oomd.service: Failed with result 'core-dump'
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 34
Hardware: armv7hl
OS: Linux
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-02-19 16:55 UTC by Paul Whalen
Modified: 2021-12-01 19:08 UTC

Fixed In Version: systemd-248~rc2-1.fc34
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-02 16:31:50 UTC
Type: Bug
Embargoed:


Attachments:
systemd-oomd backtrace (6.25 KB, text/plain), 2021-02-19 16:55 UTC, Paul Whalen

Description Paul Whalen 2021-02-19 16:55:45 UTC
Created attachment 1758217
systemd-oomd backtrace

Description of problem:

When booting F34 on armhfp, systemd-oomd fails: 

Feb 19 11:43:08 bpi systemd[1]: systemd-oomd.service: Failed with result 'core-dump'.
Feb 19 11:43:08 bpi systemd[1]: systemd-oomd.service: Scheduled restart job, restart counter is at 10.
Feb 19 11:43:08 bpi systemd[1]: Stopped Userspace Out-Of-Memory (OOM) Killer.
Feb 19 11:43:08 bpi systemd[1]: systemd-oomd.service: Start request repeated too quickly.
Feb 19 11:43:08 bpi systemd[1]: systemd-oomd.service: Failed with result 'core-dump'.
Feb 19 11:43:08 bpi systemd[1]: Failed to start Userspace Out-Of-Memory (OOM) Killer.


Version-Release number of selected component (if applicable):
systemd-247.3-2.fc34

How reproducible:
Most of the time. 

Actual results:

coredumpctl info
           PID: 913 (systemd-oomd)
           UID: 997 (systemd-oom)
           GID: 994 (systemd-oom)
        Signal: 6 (ABRT)
     Timestamp: Fri 2021-02-19 11:43:06 EST (6min ago)
  Command Line: /usr/lib/systemd/systemd-oomd
    Executable: /usr/lib/systemd/systemd-oomd
 Control Group: /system.slice/systemd-oomd.service
          Unit: systemd-oomd.service
         Slice: system.slice
       Boot ID: e6738df9a18541aba4ebfbd30c90dca1
    Machine ID: 42aadbaf03e24900b4ac58ea5d562588
      Hostname: bpi
       Storage: /var/lib/systemd/coredump/core.systemd-oomd.997.e6738df9a18541aba4ebfbd30c90dca1.913.1613752986000000.zst
       Message: Process 913 (systemd-oomd) of user 997 dumped core.
                
                Stack trace of thread 913:
                #0  0x00000000b6b060d4 raise (libc.so.6 + 0x320d4)

Additional info:
backtrace attached

Comment 1 Zbigniew Jędrzejewski-Szmek 2021-02-19 17:11:31 UTC
Maybe https://github.com/systemd/systemd/pull/18328? I was supposed to backport that anyway.

Comment 2 Anita Zhang 2021-02-23 09:57:24 UTC
I don't *think* https://github.com/systemd/systemd/pull/18328 would fix this, since it changes neither how systemd-oomd behaves nor how the pid1 varlink server behaves. The backtrace suggests stack smashing in `process_managed_oom_reply()`, but I'm not seeing anything obvious. I'll try to reproduce this in a VM.

Comment 3 Anita Zhang 2021-02-25 11:04:31 UTC
This was kind of tricky. What happened was that `process_managed_oom_reply()` used `json_dispatch_unsigned()` to parse the value and store it into reply.limit (which is of type unsigned). But `json_dispatch_unsigned()` actually casts the destination pointer to type uintmax_t* (and not type unsigned*, as the name suggests). On armv7l, uintmax_t is 8 bytes and unsigned is 4 bytes, so the 8-byte store overruns the 4-byte field, hence the stack smash.
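
For anyone following along, here is a minimal sketch of that size mismatch. It is not the systemd code: `store_as_uintmax()`, `bytes_past_field()` and `clobbered_neighbours()` are hypothetical names made up for illustration, and a byte array stands in for the stack frame so the overrun can be observed without undefined behavior.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a dispatcher that always stores a uintmax_t,
 * whatever type the caller's field really has. memcpy keeps this sketch
 * well-defined; the bug described above was a store through a cast pointer. */
static void store_as_uintmax(void *dest, uintmax_t value) {
    memcpy(dest, &value, sizeof value);
}

/* Number of bytes written beyond a destination field of field_size bytes. */
static size_t bytes_past_field(size_t field_size) {
    return sizeof(uintmax_t) > field_size ? sizeof(uintmax_t) - field_size : 0;
}

/* Models a 4-byte field followed by neighbouring stack bytes.
 * Returns how many neighbouring bytes changed after the store. */
static size_t clobbered_neighbours(void) {
    unsigned char frame[sizeof(uintmax_t)];
    size_t field = sizeof(uint32_t), changed = 0;

    assert(field <= sizeof frame);
    memset(frame, 0xAA, sizeof frame);  /* 0xAA marks untouched bytes */
    store_as_uintmax(frame, 1);         /* caller believes this writes 4 bytes */

    for (size_t i = field; i < sizeof frame; i++)
        if (frame[i] != 0xAA)
            changed++;
    return changed;
}
```

Every byte after the modelled 4-byte field is overwritten by the store. In a real stack frame those bytes can belong to the stack protector canary, which would be consistent with the SIGABRT in the coredump.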

This was inadvertently fixed by https://github.com/systemd/systemd/pull/18659 (in systemd v248~rc2) because poettering changed reply.limit to be uint32_t and changed the parser to `json_dispatch_uint32()` to match the uint32_t type used for permyriad conversion.

Comment 4 Zbigniew Jędrzejewski-Szmek 2021-02-25 11:34:26 UTC
Shouldn't this be fixed in json_dispatch_unsigned()? Seems like an invitation for errors.

Comment 5 Anita Zhang 2021-02-26 02:59:10 UTC
An invitation for errors indeed. I submitted https://github.com/systemd/systemd/pull/18809
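
As a rough illustration of what a dispatcher whose name matches its behavior could look like (this is an assumption for discussion, not necessarily what the PR does), the parsed value can be range-checked and then stored through a correctly typed pointer. `dispatch_to_unsigned()` is a hypothetical helper, not a systemd function.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: store an already-parsed value into an unsigned field.
 * The typed pointer lets the compiler reject a mismatched destination, and
 * the range check rejects values that do not fit instead of truncating. */
static int dispatch_to_unsigned(uintmax_t parsed, unsigned *ret) {
    assert(ret != NULL);

    if (parsed > UINT_MAX)
        return -ERANGE;     /* value does not fit; *ret is left untouched */

    *ret = (unsigned) parsed;
    return 0;
}
```

A caller would pass `&reply.limit` directly, so the store is exactly `sizeof(unsigned)` bytes on every architecture.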

Comment 6 Zbigniew Jędrzejewski-Szmek 2021-03-02 16:31:50 UTC
> This was inadvertently fixed by https://github.com/systemd/systemd/pull/18659 (in systemd v248~rc2)

Comment 7 RobbieTheK 2021-12-01 19:07:00 UTC
After upgrading to F35 (systemd-oomd-defaults-249.7-2.fc35.noarch), systemd-oomd.service times out on start.

The unit systemd-oomd.service has successfully entered the 'dead' state.
Subject: A stop job for unit systemd-oomd.service has finished
A stop job for unit systemd-oomd.service has finished.
Subject: A start job for unit systemd-oomd.service has begun execution
A start job for unit systemd-oomd.service has begun execution.
Dec 01 13:57:52  systemd[1]: systemd-oomd.service: Main process exited, code=killed, status=9/KILL
An ExecStart= process belonging to unit systemd-oomd.service has exited.
Dec 01 13:57:52 systemd[1]: systemd-oomd.service: Failed with result 'signal'.
Dec 01 13:59:44 systemd[1]: systemd-oomd.service: start operation timed out. Terminating.
Dec 01 14:00:25 systemd[1]: systemd-oomd.service: Failed with result 'timeout'.


strace -p 152886
strace: Process 152886 attached
ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8

And it just hangs.

Should I open a new bug?

Comment 8 RobbieTheK 2021-12-01 19:08:21 UTC
Update to strace:
strace -p 152886
strace: Process 152886 attached
ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\0013\3\0\0\r\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1003}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1003
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1\264\3\0\0\16\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1132}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1132
recvmsg(3, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1C\3\0\0\17\0\0\0\266\0\0\0\1\1o\0/\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1011}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1011
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1\204\3\0\0\20\0\0\0\266\0\0\0\1\1o\0/\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1076}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1076
recvmsg(3, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1[\3\0\0\21\0\0\0\266\0\0\0\1\1o\0/\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1035}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1035
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1\204\3\0\0\22\0\0\0\266\0\0\0\1\1o\0/\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1076}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1076
recvmsg(3, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1L\0\0\0\23\0\0\0\246\0\0\0\1\1o\0\31\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1\0\0\0\0\0\0\0"..., iov_len=236}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 236
recvmsg(3, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\0013\3\0\0\24\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1003}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1003
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1\244\3\0\0\25\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1116}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1116
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1S\0\0\0\26\0\0\0\242\0\0\0\1\1o\0\31\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1\0\0\0\0\0\0\0"..., iov_len=243}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 243
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\1\4\0013\0\0\0\4\0\0\0\250\0\0\0\1\1o\0007\0\0\0/org/fre"..., iov_len=184}, {iov_base=" \0\0\0org.freedesktop.systemd1.Ser"..., iov_len=51}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 235
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\0013\3\0\0\27\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1003}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1003
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1\204\3\0\0\30\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1084}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1084
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1L\0\0\0\31\0\0\0\246\0\0\0\1\1o\0\31\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1\0\0\0\0\0\0\0"..., iov_len=236}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 236
recvmsg(3, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=24, tv_nsec=999623000}, NULL, 8) = 1 ([{fd=3, revents=POLLIN}], left {tv_sec=24, tv_nsec=999621709})
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\0013\3\0\0\32\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1003}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1003
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1\224\3\0\0\33\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1100}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1100
recvmsg(3, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=24, tv_nsec=999296000}, NULL, 8) = 1 ([{fd=3, revents=POLLIN}], left {tv_sec=24, tv_nsec=999275306})
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\2\1\1\20\0\0\0\34\0\0\0007\0\0\0\5\1u\0\4\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\7\1s\0\30\0\0\0org.freedesktop.systemd1"..., iov_len=64}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 64
writev(2, [{iov_base="Job for systemd-oomd.service fai"..., iov_len=67}, {iov_base="\n", iov_len=1}], 2Job for systemd-oomd.service failed because a timeout was exceeded.
) = 68
writev(2, [{iov_base="See \"systemctl status systemd-oo"..., iov_len=99}, {iov_base="\n", iov_len=1}], 2See "systemctl status systemd-oomd.service" and "journalctl -xeu systemd-oomd.service" for details.
) = 100
close(3)                                = 0
kill(152887, SIGTERM)                   = 0
kill(152887, SIGCONT)                   = 0
waitid(P_PID, 152887, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=152887, si_uid=0, si_status=0, si_utime=0, si_stime=0}, WEXITED, NULL) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=152887, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
exit_group(1)                           = ?
+++ exited with 1 +++
[1]+  Exit 1                  systemctl start systemd-oomd.service

