Description of problem:
Some services are randomly killed (SIGKILL) for no identifiable reason. This is mostly a placeholder bug to record further investigation.

Version-Release number of selected component (if applicable):
OCP 4.3.10
Linux infra-0 4.18.0-147.5.1.el8_1.x86_64 #1 SMP Tue Jan 14 15:50:19 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Unable to reproduce.

Steps to Reproduce:
1. N/A
2.
3.

Actual results:
[2018635.227288] systemd[1]: NetworkManager.service: Main process exited, code=killed, status=9/KILL

Expected results:
No services receive a KILL signal.

Additional info:
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.178:761): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=auditd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.184:762): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=vgauthd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.184:763): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.184:764): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=vmtoolsd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.197:765): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=rpc-statd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.211:766): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-udevd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 kernel: audit: type=1131 audit(1589791399.239:767): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=irqbalance comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 18 18:43:19 infra-0 systemd[1]: NetworkManager.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: sshd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: auditd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: vgauthd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: chronyd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: vmtoolsd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: zabbix-agent.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: rpc-statd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: systemd-udevd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: irqbalance.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: sssd.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: rpcbind.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: dbus.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: polkit.service: Main process exited, code=killed, status=9/KILL
May 18 18:43:19 infra-0 systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Created attachment 1695987 [details] journalctl - 1min around issue
This is likely the OOM killer. Can you provide full logs from boot start? Also, how much memory is provided to the machines?
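For reference, OOM kills leave distinctive kernel-tagged lines in the journal, so they are easy to confirm or rule out from a saved dump. A minimal sketch (the heredoc stands in for `journalctl -k > journal.log`; both sample lines are illustrative, not taken from the affected node):

```shell
# Scan a saved journal dump for OOM-killer signatures.
# Sample input in place of a real `journalctl -k > journal.log` capture.
cat > journal.log <<'EOF'
May 18 18:43:19 infra-0 systemd[1]: chronyd.service: Main process exited, code=killed, status=9/KILL
May 18 18:40:02 infra-0 kernel: Out of memory: Killed process 4321 (example)
EOF

# The OOM killer logs through the kernel, so match only kernel-tagged lines;
# plain "status=9/KILL" systemd lines do not identify the sender.
grep -iE 'kernel:.*(out of memory|oom-kill|invoked oom-killer)' journal.log
```

If this matches nothing across the whole boot, the SIGKILLs came from something other than the kernel OOM killer.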
@Jonathan thanks for checking. It has me baffled: no signs of OOM kill, segfault, page fault, hung tasks, or any of the usual suspects. Infra node = 16GB RAM / 4 CPU. Will attach journalctl logs covering the period of the incident (May 18 18:43 to 18:47 AEST).

-- Logs begin at Fri 2020-05-15 17:38:10 AEST, end at Tue 2020-05-19 07:48:32 AEST. --
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.033026 1370 eviction_manager.go:229] eviction manager: synchronize housekeeping
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.056977 1370 helpers.go:781] eviction manager: observations: signal=memory.available, available: 12662468Ki, capacity: 16421332Ki, time: 2020-05-18 18:41:10.034400571 +1000 AEST m=+2018428.815971594
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057198 1370 helpers.go:781] eviction manager: observations: signal=allocatableMemory.available, available: 14987256Ki, capacity: 15909332Ki, time: 2020-05-18 18:41:10.056903168 +1000 AEST m=+2018428.838474157
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057291 1370 helpers.go:781] eviction manager: observations: signal=nodefs.available, available: 37761312Ki, capacity: 133156844Ki, time: 2020-05-18 18:41:10.034400571 +1000 AEST m=+2018428.815971594
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057370 1370 helpers.go:781] eviction manager: observations: signal=nodefs.inodesFree, available: 63914439, capacity: 66583488, time: 2020-05-18 18:41:10.034400571 +1000 AEST m=+2018428.815971594
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057455 1370 helpers.go:781] eviction manager: observations: signal=imagefs.available, available: 37761312Ki, capacity: 133156844Ki, time: 2020-05-18 18:41:10.034400571 +1000 AEST m=+2018428.815971594
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057534 1370 helpers.go:781] eviction manager: observations: signal=imagefs.inodesFree, available: 63914439, capacity: 66583488, time: 2020-05-18 18:41:10.034400571 +1000 AEST m=+2018428.815971594
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057608 1370 helpers.go:781] eviction manager: observations: signal=pid.available, available: 4193622, capacity: 4Mi, time: 2020-05-18 18:41:10.055950277 +1000 AEST m=+2018428.837521278
May 18 18:41:10 infra-0 hyperkube[1370]: I0518 18:41:10.057792 1370 eviction_manager.go:320] eviction manager: no resources are starved
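Converting the kubelet eviction-manager Ki figures above confirms the node had ample memory headroom two minutes before the kills, i.e. no memory pressure from kubelet's point of view. A quick arithmetic sketch (values copied from the log):

```shell
# Convert the eviction-manager memory figures from Ki to GiB (1 GiB = 1048576 Ki).
awk 'BEGIN {
  avail = 12662468   # memory.available in Ki, from the log above
  cap   = 16421332   # memory capacity in Ki, from the log above
  printf "memory.available: %.1f GiB of %.1f GiB capacity (%.0f%% free)\n",
         avail / 1048576, cap / 1048576, 100 * avail / cap
}'
# prints: memory.available: 12.1 GiB of 15.7 GiB capacity (77% free)
```

With roughly three quarters of memory free and "no resources are starved", a kubelet-initiated eviction is effectively ruled out, which is consistent with the lack of OOM traces.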
@Brendan please provide the full journal from the affected node(s).
@Brendan, have you been able to reproduce this again since that first time it happened?
I've never seen it reproduced, sorry; it only happened once, on a customer VM. Starting to suspect an issue at the VM (hypervisor) level, but there are no definite signs.
Ack. Going to close this for now; feel free to re-open if you hit it again and have more information. (Deep down I still think this is the OOM killer and the journal just doesn't have any traces of it.)