Created attachment 1758217
systemd-oomd backtrace

Description of problem:
When booting F34 on armhfp, systemd-oomd fails:

Feb 19 11:43:08 bpi systemd[1]: systemd-oomd.service: Failed with result 'core-dump'.
Feb 19 11:43:08 bpi systemd[1]: systemd-oomd.service: Scheduled restart job, restart counter is at 10.
Feb 19 11:43:08 bpi systemd[1]: Stopped Userspace Out-Of-Memory (OOM) Killer.
Feb 19 11:43:08 bpi systemd[1]: systemd-oomd.service: Start request repeated too quickly.
Feb 19 11:43:08 bpi systemd[1]: systemd-oomd.service: Failed with result 'core-dump'.
Feb 19 11:43:08 bpi systemd[1]: Failed to start Userspace Out-Of-Memory (OOM) Killer.

Version-Release number of selected component (if applicable):
systemd-247.3-2.fc34

How reproducible:
Most of the time.

Actual results:
coredumpctl info
PID: 913 (systemd-oomd)
UID: 997 (systemd-oom)
GID: 994 (systemd-oom)
Signal: 6 (ABRT)
Timestamp: Fri 2021-02-19 11:43:06 EST (6min ago)
Command Line: /usr/lib/systemd/systemd-oomd
Executable: /usr/lib/systemd/systemd-oomd
Control Group: /system.slice/systemd-oomd.service
Unit: systemd-oomd.service
Slice: system.slice
Boot ID: e6738df9a18541aba4ebfbd30c90dca1
Machine ID: 42aadbaf03e24900b4ac58ea5d562588
Hostname: bpi
Storage: /var/lib/systemd/coredump/core.systemd-oomd.997.e6738df9a18541aba4ebfbd30c90dca1.913.1613752986000000.zst
Message: Process 913 (systemd-oomd) of user 997 dumped core.

Stack trace of thread 913:
#0 0x00000000b6b060d4 raise (libc.so.6 + 0x320d4)

Additional info:
backtrace attached
Maybe https://github.com/systemd/systemd/pull/18328? I was supposed to backport that anyway.
I don't *think* https://github.com/systemd/systemd/pull/18328 would fix this, since it changes neither how systemd-oomd behaves nor how the pid1 varlink server behaves. The backtrace suggests stack smashing in process_managed_oom_reply(), but I'm not seeing anything obvious. I'll try to reproduce this in a VM.
This was kind of tricky. What happened was that `process_managed_oom_reply()` used `json_dispatch_unsigned()` to parse the value and store it into reply.limit (which is of type unsigned). But `json_dispatch_unsigned()` actually casts the output pointer to type uintmax_t* (and not type unsigned* like the name suggests). On armv7l uintmax_t is 8 bytes and unsigned is 4 bytes, so the 8-byte store overruns the 4-byte field, hence the stack smash. This was inadvertently fixed by https://github.com/systemd/systemd/pull/18659 (in systemd v248~rc2) because poettering changed reply.limit to be uint32_t and changed the parser to `json_dispatch_uint32()` to match the uint32_t type used for permyriad conversion.
Shouldn't this be fixed in json_dispatch_unsigned()? Seems like an invitation for errors.
An invitation for errors indeed. I submitted https://github.com/systemd/systemd/pull/18809
> This was inadvertently fixed by https://github.com/systemd/systemd/pull/18659 (in systemd v248~rc2)
After upgrading to F35 (systemd-oomd-defaults-249.7-2.fc35.noarch), systemd-oomd.service times out.

The unit systemd-oomd.service has successfully entered the 'dead' state.
Subject: A stop job for unit systemd-oomd.service has finished
A stop job for unit systemd-oomd.service has finished.
Subject: A start job for unit systemd-oomd.service has begun execution
A start job for unit systemd-oomd.service has begun execution.
Dec 01 13:57:52 systemd[1]: systemd-oomd.service: Main process exited, code=killed, status=9/KILL
An ExecStart= process belonging to unit systemd-oomd.service has exited.
Dec 01 13:57:52 systemd[1]: systemd-oomd.service: Failed with result 'signal'.
Dec 01 13:59:44 systemd[1]: systemd-oomd.service: start operation timed out. Terminating.
Dec 01 14:00:25 systemd[1]: systemd-oomd.service: Failed with result 'timeout'.

strace -p 152886
strace: Process 152886 attached
ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8

And just hangs. Should I open a new bug?
Update to strace:

strace -p 152886
strace: Process 152886 attached
ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\0013\3\0\0\r\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1003}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1003
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1\264\3\0\0\16\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1132}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1132
recvmsg(3, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1C\3\0\0\17\0\0\0\266\0\0\0\1\1o\0/\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1011}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1011
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1\204\3\0\0\20\0\0\0\266\0\0\0\1\1o\0/\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1076}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1076
recvmsg(3, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1[\3\0\0\21\0\0\0\266\0\0\0\1\1o\0/\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1035}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1035
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1\204\3\0\0\22\0\0\0\266\0\0\0\1\1o\0/\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1076}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1076
recvmsg(3, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1L\0\0\0\23\0\0\0\246\0\0\0\1\1o\0\31\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1\0\0\0\0\0\0\0"..., iov_len=236}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 236
recvmsg(3, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\0013\3\0\0\24\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1003}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1003
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1\244\3\0\0\25\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1116}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1116
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1S\0\0\0\26\0\0\0\242\0\0\0\1\1o\0\31\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1\0\0\0\0\0\0\0"..., iov_len=243}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 243
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\1\4\0013\0\0\0\4\0\0\0\250\0\0\0\1\1o\0007\0\0\0/org/fre"..., iov_len=184}, {iov_base=" \0\0\0org.freedesktop.systemd1.Ser"..., iov_len=51}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 235
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\0013\3\0\0\27\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1003}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1003
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1\204\3\0\0\30\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1084}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1084
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1L\0\0\0\31\0\0\0\246\0\0\0\1\1o\0\31\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1\0\0\0\0\0\0\0"..., iov_len=236}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 236
recvmsg(3, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=24, tv_nsec=999623000}, NULL, 8) = 1 ([{fd=3, revents=POLLIN}], left {tv_sec=24, tv_nsec=999621709})
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\0013\3\0\0\32\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1003}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1003
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1\224\3\0\0\33\0\0\0\276\0\0\0\1\1o\0007\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1/unit/s"..., iov_len=1100}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 1100
recvmsg(3, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=24, tv_nsec=999296000}, NULL, 8) = 1 ([{fd=3, revents=POLLIN}], left {tv_sec=24, tv_nsec=999275306})
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\2\1\1\20\0\0\0\34\0\0\0007\0\0\0\5\1u\0\4\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\7\1s\0\30\0\0\0org.freedesktop.systemd1"..., iov_len=64}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 64
writev(2, [{iov_base="Job for systemd-oomd.service fai"..., iov_len=67}, {iov_base="\n", iov_len=1}], 2Job for systemd-oomd.service failed because a timeout was exceeded.
) = 68
writev(2, [{iov_base="See \"systemctl status systemd-oo"..., iov_len=99}, {iov_base="\n", iov_len=1}], 2See "systemctl status systemd-oomd.service" and "journalctl -xeu systemd-oomd.service" for details.
) = 100
close(3) = 0
kill(152887, SIGTERM) = 0
kill(152887, SIGCONT) = 0
waitid(P_PID, 152887, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=152887, si_uid=0, si_status=0, si_utime=0, si_stime=0}, WEXITED, NULL) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=152887, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
exit_group(1) = ?
+++ exited with 1 +++
[1]+ Exit 1 systemctl start systemd-oomd.service