Bug 1978528 - systemd-coredump started and failed intermittently for unknown reasons
Summary: systemd-coredump started and failed intermittently for unknown reasons
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.10.0
Assignee: Swarup Ghosh
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-02 05:15 UTC by Xingbin Li
Modified: 2024-12-20 20:23 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: cadvisor reports coredump messages from the systemd namespace.
Consequence: systemd-coredump messages appear in the kubelet logs.
Fix: the cadvisor raw factory filters out the systemd namespace.
Result:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:35:46 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift kubernetes pull 1049 (open): Bug 1978528: UPSTREAM: <drop>: bump cadvisor for 2957, 2999 and 2979 upstream patches (last updated 2021-11-12 09:11:08 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-12 04:36:15 UTC)

Comment 2 Timothée Ravier 2021-07-07 11:19:32 UTC
This is strange, as the kubelet should not be managing systemd-coredump. Maybe this is an OOM situation? Has another process been crashing on the node?

Comment 4 Micah Abbott 2021-07-12 21:06:51 UTC
As a test of a successful coredump, I booted a 4.6 cluster in AWS and triggered a coredump of a `sleep` process.
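
For reference, forcing a core dump of a `sleep` process takes only a couple of commands; something along these lines works (illustrative only, the exact invocation may have differed, and `coredumpctl` is assumed to be available on the node):

```
sh-4.4# ulimit -c unlimited            # make sure the shell's soft core limit is not zero
sh-4.4# sleep 600 &
sh-4.4# kill -s SIGSEGV $!             # SIGSEGV's default action is to terminate the process and dump core
sh-4.4# coredumpctl list | tail -n 1   # confirm systemd-coredump captured it
```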

When I inspected the journal for the related messages, I could see similar entries for `systemd-coredump` being reported by `hyperkube`:

```
sh-4.4# journalctl -b | grep coredump
...
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.870574    1540 factory.go:212] Using factory "raw" for container "/system.slice/system-systemd\\x2dcoredump.slice"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.871041    1540 manager.go:987] Added container: "/system.slice/system-systemd\\x2dcoredump.slice" (aliases: [], namespace: "")
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.871392    1540 container.go:490] Start housekeeping for container "/system.slice/system-systemd\\x2dcoredump.slice"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.873739    1540 factory.go:212] Using factory "raw" for container "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.874117    1540 manager.go:987] Added container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.874388    1540 container.go:490] Start housekeeping for container "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump"
Jul 12 20:59:16 ip-10-0-221-19 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Jul 12 20:59:17 ip-10-0-221-19 systemd-coredump[32960]: Process 32765 (sleep) of user 0 dumped core.
Jul 12 20:59:17 ip-10-0-221-19 systemd[1]: systemd-coredump: Consumed 407ms CPU time
Jul 12 20:59:17 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:17.373096    1540 manager.go:1044] Destroyed container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
```

What's not clear from the customer log messages is why the coredump service is timing out or why we are seeing the `systemd-coredump[1822262]: Failed to send coredump datagram: Connection reset by peer` messages.

@Derrick, could you get someone from the CEE org with `systemd` expertise to see if they can help here?

Comment 11 Peter Hunt 2021-08-02 14:47:01 UTC
This is weird:
```
Jul 12 20:59:17 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:17.373096    1540 manager.go:1044] Destroyed container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
```

It looks like the kubelet is misinterpreting systemd-coredump as a container it started, so it begins managing the cgroup and ends up killing it? Very odd. Tossing to Ryan to triage for the kubelet.
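
What that looks like on the node: systemd creates a transient `system-systemd\x2dcoredump.slice` cgroup for the dump, and cadvisor's "raw" factory picks up the new cgroup and adds it as a container (the `factory.go` / `manager.go` lines in comment 4). A rough way to see both sides of this while a dump is in flight (cgroup v1 layout assumed):

```
sh-4.4# journalctl -b | grep 'factory "raw" for container' | grep coredump        # kubelet/cadvisor picking up the slice
sh-4.4# ls -d /sys/fs/cgroup/systemd/system.slice/system-systemd*coredump.slice   # the transient cgroup it is tracking (only present around a dump)
```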

Comment 19 Sunil Choudhary 2022-01-21 05:47:14 UTC
Checked on 4.10.0-0.nightly-2022-01-20-082726 on a couple of clusters.

```
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-20-082726   True        False         70m     Cluster version is 4.10.0-0.nightly-2022-01-20-082726
```
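
A quick way to spot-check a node on the fixed build is to trigger a dump as in comment 4 and check whether the kubelet journal still reports the systemd-coredump slice the way comment 4 shows (`<node-name>` is a placeholder):

```
$ oc debug node/<node-name> -- chroot /host journalctl -b | grep 'factory "raw" for container' | grep coredump
```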

Comment 22 errata-xmlrpc 2022-03-12 04:35:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

