Bug 1844939 - Some systemd services randomly killed (KILL)
Summary: Some systemd services randomly killed (KILL)
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.3.z
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.6.0
Assignee: Jonathan Lebon
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1186913
 
Reported: 2020-06-08 05:12 UTC by Brendan Shirren
Modified: 2023-10-06 20:27 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-21 20:59:23 UTC
Target Upstream Version:
Embargoed:


Attachments
journalctl - 1min around issue (51.41 KB, text/plain)
2020-06-08 05:19 UTC, Brendan Shirren

Description Brendan Shirren 2020-06-08 05:12:02 UTC
Description of problem:

Some services are randomly killed (SIGKILL) for no identifiable reason.

This is mostly a placeholder bug to record further investigation.

Version-Release number of selected component (if applicable):

OCP 4.3.10
Linux infra-0 4.18.0-147.5.1.el8_1.x86_64 #1 SMP Tue Jan 14 15:50:19 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux


How reproducible: unable to reproduce.


Steps to Reproduce:
1. N/A

Actual results:

[2018635.227288] systemd[1]: NetworkManager.service: Main process exited, code=killed, status=9/KILL

Expected results:

No services receive KILL signal.

Additional info:

May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.178:761): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=auditd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.184:762): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=vgauthd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.184:763): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.184:764): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=vmtoolsd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.197:765): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=rpc-statd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.211:766): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-udevd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.239:767): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=irqbalance comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'


May 18 18:43:19 infra-0 systemd[1]: NetworkManager.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: sshd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: auditd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: vgauthd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: chronyd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: vmtoolsd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: zabbix-agent.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: rpc-statd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: systemd-udevd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: irqbalance.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: sssd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: rpcbind.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: dbus.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: polkit.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
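
All of these units lost their main process to SIGKILL within the same second, which points at a single external sender rather than the services themselves. A minimal sketch, not from the original report, of an audit rule that would capture the sending process if it happened again (the key name is just an example):

  # record every kill(2) call that delivers signal 9 (x86_64), tagged for searching
  auditctl -a always,exit -F arch=b64 -S kill -F a1=9 -k sigkill-trace
  # after the next occurrence, list the recorded events (includes sender pid/exe)
  ausearch -k sigkill-trace -i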

Comment 1 Brendan Shirren 2020-06-08 05:19:58 UTC
Created attachment 1695987 [details]
journalctl - 1min around issue

Comment 2 Jonathan Lebon 2020-06-08 13:57:24 UTC
This is likely the OOM killer. Can you provide full logs from boot start? Also, how much memory is provided to the machines?
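
For reference, a minimal sketch of the data collection being asked for here (standard journalctl/free invocations; the output file name is just an example):

  journalctl -b -o short-precise > node-journal-current-boot.txt   # full journal for the current boot
  journalctl -k -b | grep -iE 'out of memory|oom'                  # kernel-side OOM traces, if any
  free -h                                                          # memory available on the node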

Comment 3 Brendan Shirren 2020-06-09 01:33:21 UTC
@Jonathan thanks for checking. It has me baffled: I'm not seeing signs of an OOM kill, segfault, page fault, hung tasks, or any of the usual suspects.

Infra node = 16GB RAM / 4 CPU. Will attach journalctl logs covering the period of the incident (May 18 18:43 to 18:47 AEST).

-- Logs begin at Fri 2020-05-15 17:38:10 AEST, end at Tue 2020-05-19 07:48:32 AEST. --

May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.033026    1370 eviction_manager.go:229] eviction manager: synchronize housekeeping
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.056977    1370 helpers.go:781] eviction manager: observations: signal=memory.available, available: 12662468Ki, capacity: 16421332Ki, time: 2020-05-18 18:41:10.034400571 +1000 AEST m=+2018428.815971594
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057198    1370 helpers.go:781] eviction manager: observations: signal=allocatableMemory.available, available: 14987256Ki, capacity: 15909332Ki, time: 2020-05-18 18:41:10.056903168 +1000 AEST m=+2018428.838474157
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057291    1370 helpers.go:781] eviction manager: observations: signal=nodefs.available, available: 37761312Ki, capacity: 133156844Ki, time: 2020-05-18 18:41:10.034400571 +1000 AEST m=+2018428.815971594
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057370    1370 helpers.go:781] eviction manager: observations: signal=nodefs.inodesFree, available: 63914439, capacity: 66583488, time: 2020-05-18 18:41:10.034400571 +1000 AEST m=+2018428.815971594
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057455    1370 helpers.go:781] eviction manager: observations: signal=imagefs.available, available: 37761312Ki, capacity: 133156844Ki, time: 2020-05-18 18:41:10.034400571 +1000 AEST m=+2018428.815971594
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057534    1370 helpers.go:781] eviction manager: observations: signal=imagefs.inodesFree, available: 63914439, capacity: 66583488, time: 2020-05-18 18:41:10.034400571 +1000 AEST m=+2018428.815971594
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057608    1370 helpers.go:781] eviction manager: observations: signal=pid.available, available: 4193622, capacity: 4Mi, time: 2020-05-18 18:41:10.055950277 +1000 AEST m=+2018428.837521278
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057792    1370 eviction_manager.go:320] eviction manager: no resources are starved
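
The eviction manager observations above show roughly 12.6 GiB of 16 GiB memory available and end with "no resources are starved", which argues against a kubelet-driven eviction at that moment. A minimal sketch, not from the original report, for extracting just the incident window before attaching it:

  journalctl --since "2020-05-18 18:43:00" --until "2020-05-18 18:47:00" -o short-precise > incident-window.txt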

Comment 4 Micah Abbott 2020-06-16 19:25:02 UTC
@Brendan please provide the full journal from the affected node(s).

Comment 7 Jonathan Lebon 2020-07-31 19:29:22 UTC
@Brendan, have you been able to reproduce this again since that first time it happened?

Comment 8 Brendan Shirren 2020-08-10 06:16:43 UTC
I've never seen it reproduced, sorry; it only happened once on a customer VM. Starting to suspect an issue at the VM level (hypervisor), but there are no certain signs.
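
If the hypervisor theory is pursued, one VM-level mechanism worth ruling out is host memory ballooning or swapping; the vgauthd/vmtoolsd units suggest VMware guest tools are present, so a minimal sketch (open-vm-tools commands, not from the original report) would be:

  vmware-toolbox-cmd stat balloon    # guest memory currently reclaimed by the host balloon driver
  vmware-toolbox-cmd stat swap       # guest memory swapped out by the host
  vmware-toolbox-cmd stat memlimit   # memory limit imposed by the host, if any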

Comment 9 Jonathan Lebon 2020-08-21 20:59:23 UTC
Ack. Going to close this for now. Feel free to re-open if you hit it again and have more information! (Deep down I still think this is the OOM killer and the journal just doesn't have any traces of it.)
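
One cheap cross-check of the OOM theory that does not depend on the journal is the kernel's own counter, which persists until the node reboots (a sketch, assuming the node has not been rebooted since the incident):

  grep oom_kill /proc/vmstat    # non-zero means the kernel OOM killer fired at some point this boot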

