This is strange, as the kubelet should not be managing systemd-coredump. Maybe this is an OOM situation? Has another process been crashing on the node?
As a test of a successful coredump, I booted a 4.6 cluster in AWS and triggered a coredump of a `sleep` process. When I inspected the journal for the related messages, I saw similar entries for `systemd-coredump` being reported by `hyperkube`:

```
sh-4.4# journalctl -b | grep coredump
...
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.870574 1540 factory.go:212] Using factory "raw" for container "/system.slice/system-systemd\\x2dcoredump.slice"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.871041 1540 manager.go:987] Added container: "/system.slice/system-systemd\\x2dcoredump.slice" (aliases: [], namespace: "")
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.871392 1540 container.go:490] Start housekeeping for container "/system.slice/system-systemd\\x2dcoredump.slice"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.873739 1540 factory.go:212] Using factory "raw" for container "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.874117 1540 manager.go:987] Added container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.874388 1540 container.go:490] Start housekeeping for container "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump"
Jul 12 20:59:16 ip-10-0-221-19 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Jul 12 20:59:17 ip-10-0-221-19 systemd-coredump[32960]: Process 32765 (sleep) of user 0 dumped core.
Jul 12 20:59:17 ip-10-0-221-19 systemd[1]: systemd-coredump: Consumed 407ms CPU time
Jul 12 20:59:17 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:17.373096 1540 manager.go:1044] Destroyed container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
```

What's not clear from the customer log messages is why the coredump service is timing out, or why we are seeing the `systemd-coredump[1822262]: Failed to send coredump datagram: Connection reset by peer` messages. @Derrick, could you get someone from the CEE org with `systemd` expertise to see if they can help here?
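For context, here is a minimal sketch of how a test coredump like the one above can be triggered on a node. It assumes `kernel.core_pattern` pipes to `systemd-coredump` (the RHCOS default); the commands are illustrative and not necessarily the exact steps I ran:

```
# Allow core dumps in this shell, then crash a throwaway process
sh-4.4# ulimit -c unlimited
sh-4.4# sleep 600 &
sh-4.4# kill -SEGV $!

# Verify systemd-coredump captured it
sh-4.4# coredumpctl list
sh-4.4# journalctl -b | grep coredump
```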
This is weird:

```
Jul 12 20:59:17 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:17.373096 1540 manager.go:1044] Destroyed container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
```

It looks like the kubelet is misinterpreting the systemd-coredump cgroup as a container it started, so it begins managing the cgroup and ends up killing it? Very odd. Tossing to Ryan to triage for the kubelet.
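For whoever triages this: a hedged sketch of how to inspect the slice that cAdvisor's "raw" factory is picking up. Paths assume a cgroup v1 hierarchy on the node, and the escaped slice name is taken from the log above:

```
# systemd's view of the transient coredump slice
sh-4.4# systemctl status 'system-systemd\x2dcoredump.slice'

# The raw cgroup directory that cAdvisor walks; watch whether the kubelet
# tears it down while a dump is still in flight
sh-4.4# ls '/sys/fs/cgroup/memory/system.slice/system-systemd\x2dcoredump.slice/'
```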
Checked on 4.10.0-0.nightly-2022-01-20-082726 on a couple of clusters.

```
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-20-082726   True        False         70m     Cluster version is 4.10.0-0.nightly-2022-01-20-082726
```
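One plausible way to verify on such a cluster (a sketch, not necessarily the exact steps used here) is to repeat the earlier `sleep` test on a node and confirm the dump completes without the datagram error:

```
sh-4.4# ulimit -c unlimited
sh-4.4# sleep 600 &
sh-4.4# kill -SEGV $!

# Expect a "dumped core" entry and no "Failed to send coredump datagram" errors
sh-4.4# journalctl -b | grep -E 'dumped core|Failed to send coredump datagram'
```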
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056