Bug 1978528

Summary: systemd-coredump started and failed intermittently for unknown reasons

Product: OpenShift Container Platform
Component: Node
Node sub component: Kubelet
Version: 4.6
Target Release: 4.10.0
Hardware: x86_64
OS: Linux
Severity: low
Priority: low
Status: CLOSED ERRATA
Type: Bug
Last Closed: 2022-03-12 04:35:46 UTC

Reporter: Xingbin Li <xingli>
Assignee: Swarup Ghosh <swghosh>
QA Contact: Sunil Choudhary <schoudha>
CC: aos-bugs, dornelas, harpatil, jligon, miabbott, mrussell, nstielau, pehunt, rphillips, smilner, svanka, travier

Doc Type: Bug Fix
Doc Text:
    Cause: cadvisor reporting coredump messages from the systemd namespace
    Consequence: systemd-coredump messages in kubelet logs
    Fix: cadvisor raw factory filters out the systemd namespace
    Result:

Comment 2 Timothée Ravier 2021-07-07 11:19:32 UTC
This is strange as the kubelet should not be managing systemd-coredump. Maybe this is an OOM situation? Has there been another process crashing on the node?
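
If it helps, one way to check the node for OOM kills and prior core dumps is roughly the following (a sketch; `<node-name>` is a placeholder for the affected node):

```
# Sketch: look for OOM-killer activity and recorded core dumps on the node
oc debug node/<node-name> -- chroot /host /bin/bash -c '
  # kernel messages from the OOM killer for the current boot, if any
  journalctl -k --no-pager | grep -iE "out of memory|oom-kill" || echo "no OOM messages"
  # core dumps recorded by systemd-coredump, if coredumpctl is present
  coredumpctl list --no-pager 2>/dev/null || echo "no core dumps recorded"
'
```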

Comment 4 Micah Abbott 2021-07-12 21:06:51 UTC
As a test of a successful coredump, I booted a 4.6 cluster in AWS and triggered a coredump of a `sleep` process.
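
Roughly the following triggers such a dump (a sketch, not the exact commands from the test session):

```
# On the node, e.g. via `oc debug node/<node-name>` followed by `chroot /host`:
# make sure core dumps are not disabled by the core file size limit
ulimit -c unlimited
# start a throwaway sleep and send it SIGSEGV so systemd-coredump catches it
sleep 300 &
kill -s SIGSEGV $!
```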

When I inspected the journal for the related messages, I could see similar entries for `systemd-coredump` being reported by `hyperkube`:

```
sh-4.4# journalctl -b | grep coredump
...
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.870574    1540 factory.go:212] Using factory "raw" for container "/system.slice/system-systemd\\x2dcoredump.slice"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.871041    1540 manager.go:987] Added container: "/system.slice/system-systemd\\x2dcoredump.slice" (aliases: [], namespace: "")
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.871392    1540 container.go:490] Start housekeeping for container "/system.slice/system-systemd\\x2dcoredump.slice"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.873739    1540 factory.go:212] Using factory "raw" for container "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.874117    1540 manager.go:987] Added container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.874388    1540 container.go:490] Start housekeeping for container "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump"
Jul 12 20:59:16 ip-10-0-221-19 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Jul 12 20:59:17 ip-10-0-221-19 systemd-coredump[32960]: Process 32765 (sleep) of user 0 dumped core.
Jul 12 20:59:17 ip-10-0-221-19 systemd[1]: systemd-coredump: Consumed 407ms CPU time
Jul 12 20:59:17 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:17.373096    1540 manager.go:1044] Destroyed container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
```

What's not clear from the customer log messages is why the coredump service is timing out or why we are seeing the `systemd-coredump[1822262]: Failed to send coredump datagram: Connection reset by peer` messages.
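
If someone does dig into the systemd side, a possible starting point on the node is something like this (a sketch, not taken from the customer case):

```
# Sketch: inspect the systemd-coredump socket and per-dump service instances
systemctl status systemd-coredump.socket --no-pager
systemctl list-units --all --no-pager 'systemd-coredump@*'
# journal entries from the coredump handler itself (glob unit match)
journalctl -u 'systemd-coredump@*' --no-pager
# processing limits that can cause dumps to be dropped or truncated
cat /etc/systemd/coredump.conf
```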

@Derrick could you get someone from the CEE org with `systemd` expertise to see if they can help here?

Comment 11 Peter Hunt 2021-08-02 14:47:01 UTC
This is weird:
```
Jul 12 20:59:17 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:17.373096    1540 manager.go:1044] Destroyed container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
```

It looks like the kubelet is misinterpreting systemd-coredump as a container it started, so it begins managing the cgroup and ends up killing it? Very odd. Tossing to Ryan to triage for the kubelet.
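
To separate cadvisor's bookkeeping from real systemd-coredump failures, something like this on the affected node may help (a sketch):

```
# Sketch: split cadvisor/raw-factory messages (logged by the kubelet binary,
# hyperkube) from messages emitted by systemd-coredump itself
journalctl -b --no-pager | grep coredump | grep hyperkube
journalctl -b --no-pager | grep coredump | grep -v hyperkube
```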

Comment 19 Sunil Choudhary 2022-01-21 05:47:14 UTC
Checked on 4.10.0-0.nightly-2022-01-20-082726 on a couple of clusters.

```
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-20-082726   True        False         70m     Cluster version is 4.10.0-0.nightly-2022-01-20-082726
```
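
One way to spot-check the fix on a node (a sketch; `<node-name>` is a placeholder) is to trigger a core dump again and confirm the kubelet no longer logs the raw factory picking up the transient systemd-coredump slice:

```
# Sketch: trigger a dump and check what the kubelet logs about it
oc debug node/<node-name> -- chroot /host /bin/bash -c '
  ulimit -c unlimited
  sleep 60 & kill -s SIGSEGV $!
  sleep 5
  # expect the systemd-coredump entry, but no raw-factory / housekeeping
  # lines for system-systemd\x2dcoredump.slice from the kubelet
  journalctl -b --no-pager | grep coredump | tail -n 20
'
```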

Comment 22 errata-xmlrpc 2022-03-12 04:35:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056