Bug 1844939

Summary: Some systemd services randomly killed (KILL)
Product: OpenShift Container Platform
Reporter: Brendan Shirren <bshirren>
Component: RHCOS
Assignee: Jonathan Lebon <jlebon>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Michael Nguyen <mnguyen>
Severity: low
Priority: low
Version: 4.3.z
CC: bbreard, bshirren, dornelas, imcleod, jlebon, jligon, mharri, miabbott, nstielau
Target Milestone: ---
Target Release: 4.6.0
Hardware: x86_64
OS: Linux
Last Closed: 2020-08-21 20:59:23 UTC
Type: Bug
Bug Blocks: 1186913

Attachments:
journalctl - 1min around issue

Description Brendan Shirren 2020-06-08 05:12:02 UTC
Description of problem:

Some services are randomly killed (SIGKILL) for no identifiable reason.

This is mostly a placeholder bug to record further investigation.

Version-Release number of selected component (if applicable):

OCP 4.3.10
Linux infra-0 4.18.0-147.5.1.el8_1.x86_64 #1 SMP Tue Jan 14 15:50:19 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux


How reproducible: unable to reproduce.


Steps to Reproduce:
1. N/A

Actual results:

[2018635.227288] systemd[1]: NetworkManager.service: Main process exited, code=killed, status=9/KILL

Expected results:

No services receive KILL signal.

Additional info:

May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.178:761): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=auditd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.184:762): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=vgauthd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.184:763): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.184:764): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=vmtoolsd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.197:765): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=rpc-statd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.211:766): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-udevd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.239:767): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=irqbalance comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'


May 18 18:43:19 infra-0 systemd[1]: NetworkManager.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: sshd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: auditd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: vgauthd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: chronyd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: vmtoolsd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: zabbix-agent.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: rpc-statd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: systemd-udevd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: irqbalance.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: sssd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: rpcbind.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: dbus.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: polkit.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
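The kills above land on many unrelated units at the same timestamp. The affected units can be listed mechanically by extracting the unit name from each `status=9/KILL` line; a minimal sketch (the sample input below is two lines from this report; on a live node you would pipe `journalctl` output in instead):

```shell
#!/bin/sh
# List units whose main process was SIGKILLed, from systemd's
# "Main process exited, code=killed, status=9/KILL" journal lines.
grep 'status=9/KILL' <<'EOF' | sed -E 's/.*systemd\[1\]: ([^:]+): Main process.*/\1/' | sort -u
May 18 18:43:19 infra-0 systemd[1]: NetworkManager.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: sshd.service: Main process exited, code=killed, status=9/KILL
EOF
```

A broad, simultaneous spread like this usually points at an external actor (OOM killer, cgroup kill, or the hypervisor) rather than a fault in any single service.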

Comment 1 Brendan Shirren 2020-06-08 05:19:58 UTC
Created attachment 1695987 [details]
journalctl - 1min around issue

Comment 2 Jonathan Lebon 2020-06-08 13:57:24 UTC
This is likely the OOM killer. Can you provide full logs from boot start? Also, how much memory is provided to the machines?
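For reference, OOM-killer activity leaves distinctive kernel messages that can be grepped for. The commands in the comments below are standard `journalctl`/`dmesg` usage (not taken from this report); the sample line here only demonstrates the pattern:

```shell
#!/bin/sh
# On the affected node one would run (illustrative, requires the real journal):
#   journalctl -k -b | grep -ciE 'out of memory|oom-killer|killed process'
#   dmesg | grep -i oom
# Demonstrate the pattern against a sample OOM-style kernel line:
sample='kernel: Out of memory: Killed process 1234 (chronyd) total-vm:120000kB'
printf '%s\n' "$sample" | grep -ciE 'out of memory|oom-killer|killed process'
```

A count of zero across the full boot's kernel messages would argue against the OOM-killer hypothesis, which is why the complete journal matters here.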

Comment 3 Brendan Shirren 2020-06-09 01:33:21 UTC
@Jonathan thanks for checking. It has me baffled: I'm not seeing signs of an OOM kill, segfault, page fault, hung tasks, or any of the usual suspects.

Infra node = 16GB RAM / 4 CPU. I will attach journalctl logs covering the period of the incident (May 18 18:43 to 18:47 AEST).

-- Logs begin at Fri 2020-05-15 17:38:10 AEST, end at Tue 2020-05-19 07:48:32 AEST. --

May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.033026    1370 eviction_manager.go:229] eviction manager: synchronize housekeeping
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.056977    1370 helpers.go:781] eviction manager: observations: signal=memory.available, available: 12662468Ki, capacity: 16421332Ki, time: 2020-05-18 18:41:10.034400571 +1000 AEST m=+2018428.815971594
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057198    1370 helpers.go:781] eviction manager: observations: signal=allocatableMemory.available, available: 14987256Ki, capacity: 15909332Ki, time: 2020-05-18 18:41:10.056903168 +1000 AEST m=+2018428.838474157
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057291    1370 helpers.go:781] eviction manager: observations: signal=nodefs.available, available: 37761312Ki, capacity: 133156844Ki, time: 2020-05-18 18:41:10.034400571 +1000 AEST m=+2018428.815971594
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057370    1370 helpers.go:781] eviction manager: observations: signal=nodefs.inodesFree, available: 63914439, capacity: 66583488, time: 2020-05-18 18:41:10.034400571 +1000 AEST m=+2018428.815971594
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057455    1370 helpers.go:781] eviction manager: observations: signal=imagefs.available, available: 37761312Ki, capacity: 133156844Ki, time: 2020-05-18 18:41:10.034400571 +1000 AEST m=+2018428.815971594
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057534    1370 helpers.go:781] eviction manager: observations: signal=imagefs.inodesFree, available: 63914439, capacity: 66583488, time: 2020-05-18 18:41:10.034400571 +1000 AEST m=+2018428.815971594
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057608    1370 helpers.go:781] eviction manager: observations: signal=pid.available, available: 4193622, capacity: 4Mi, time: 2020-05-18 18:41:10.055950277 +1000 AEST m=+2018428.837521278
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057792    1370 eviction_manager.go:320] eviction manager: no resources are starved
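The eviction-manager lines above can be read as signal/threshold comparisons: each observed signal (e.g. memory.available) is checked against an eviction threshold, and "no resources are starved" means every signal cleared its threshold. A rough sanity check on the memory.available figures from the log (the 100Mi threshold is kubelet's default hard-eviction value, assumed here, not stated in this report):

```shell
#!/bin/sh
# memory.available vs capacity, from the hyperkube lines above.
available_ki=12662468
capacity_ki=16421332
threshold_ki=102400   # assumed kubelet default: memory.available < 100Mi
pct=$((available_ki * 100 / capacity_ki))
echo "memory.available: ${pct}% of capacity"
if [ "$available_ki" -gt "$threshold_ki" ]; then
    echo "above eviction threshold: no memory pressure seen by kubelet"
fi
```

So at 18:41, two minutes before the kills, the kubelet saw roughly 77% of memory free; whatever killed the services was not a kubelet eviction.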

Comment 4 Micah Abbott 2020-06-16 19:25:02 UTC
@Brendan please provide the full journal from the affected node(s).

Comment 7 Jonathan Lebon 2020-07-31 19:29:22 UTC
@Brendan, have you been able to reproduce this again since that first time it happened?

Comment 8 Brendan Shirren 2020-08-10 06:16:43 UTC
I've never seen it reproduced, sorry; it only happened once on a customer VM. I'm starting to suspect an issue at the VM level (hypervisor), but there are no certain signs.

Comment 9 Jonathan Lebon 2020-08-21 20:59:23 UTC
Ack. Going to close this for now. Feel free to re-open if you hit it again and have more information! (Deep down I still think this is the OOM killer and the journal just doesn't have any traces of it.)
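If this recurs, forcing journald to keep a persistent journal would ensure any OOM traces survive a reboot. A minimal drop-in (standard journald option; the drop-in filename is illustrative, not something configured in this bug):

```ini
# /etc/systemd/journald.conf.d/persistent.conf
[Journal]
Storage=persistent
```

With `Storage=persistent` the journal is written to /var/log/journal instead of the volatile /run/journal, so kernel messages from before a crash or reboot remain available to `journalctl -b -1`.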