Bug 1978528 - systemd-coredump started and failed intermittently for unknown reasons
Summary: systemd-coredump started and failed intermittently for unknown reasons
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.10.0
Assignee: Swarup Ghosh
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-02 05:15 UTC by Xingbin Li
Modified: 2024-12-20 20:23 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: cadvisor reports coredump messages from the systemd namespace.
Consequence: systemd-coredump messages appear in the kubelet logs.
Fix: the cadvisor raw factory filters out the systemd namespace.
Result:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:35:46 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift kubernetes pull 1049 (open): Bug 1978528: UPSTREAM: <drop>: bump cadvisor for 2957, 2999 and 2979 upstream patches (last updated 2021-11-12 09:11:08 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-12 04:36:15 UTC)

Comment 2 Timothée Ravier 2021-07-07 11:19:32 UTC
This is strange, as the kubelet should not be managing systemd-coredump. Maybe this is an OOM situation? Has another process been crashing on the node?

Comment 4 Micah Abbott 2021-07-12 21:06:51 UTC
As a test of a successful coredump, I booted a 4.6 cluster in AWS and triggered a coredump of a `sleep` process.
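
For reference, forcing a core dump of a `sleep` process takes only a couple of commands; something along these lines works (illustrative only, the exact invocation may have differed, and `coredumpctl` is assumed to be available on the node):

```
sh-4.4# ulimit -c unlimited            # make sure the shell's soft core limit is not zero
sh-4.4# sleep 600 &
sh-4.4# kill -s SIGSEGV $!             # SIGSEGV's default action is to terminate the process and dump core
sh-4.4# coredumpctl list | tail -n 1   # confirm systemd-coredump captured it
```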

When I inspected the journal for the related messages, I could see similar entries for `systemd-coredump` being reported by `hyperkube`:

```
sh-4.4# journalctl -b | grep coredump
...
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.870574    1540 factory.go:212] Using factory "raw" for container "/system.slice/system-systemd\\x2dcoredump.slice"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.871041    1540 manager.go:987] Added container: "/system.slice/system-systemd\\x2dcoredump.slice" (aliases: [], namespace: "")
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.871392    1540 container.go:490] Start housekeeping for container "/system.slice/system-systemd\\x2dcoredump.slice"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.873739    1540 factory.go:212] Using factory "raw" for container "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.874117    1540 manager.go:987] Added container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.874388    1540 container.go:490] Start housekeeping for container "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump"
Jul 12 20:59:16 ip-10-0-221-19 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Jul 12 20:59:17 ip-10-0-221-19 systemd-coredump[32960]: Process 32765 (sleep) of user 0 dumped core.
Jul 12 20:59:17 ip-10-0-221-19 systemd[1]: systemd-coredump: Consumed 407ms CPU time
Jul 12 20:59:17 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:17.373096    1540 manager.go:1044] Destroyed container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
```

What's not clear from the customer log messages is why the coredump service is timing out or why we are seeing the `systemd-coredump[1822262]: Failed to send coredump datagram: Connection reset by peer` messages.

@Derrick, could you get someone from the CEE org with `systemd` expertise to see if they can help here?

Comment 11 Peter Hunt 2021-08-02 14:47:01 UTC
This is weird:
```
Jul 12 20:59:17 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:17.373096    1540 manager.go:1044] Destroyed container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
```

It looks like the kubelet is misinterpreting systemd-coredump as a container it started, so it begins managing the cgroup and ends up killing it? Very odd. Tossing to Ryan to triage for the kubelet.
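
What that looks like on the node: systemd creates a transient `system-systemd\x2dcoredump.slice` cgroup for the dump, and cadvisor's "raw" factory picks up the new cgroup and adds it as a container (the `factory.go` / `manager.go` lines in comment 4). A rough way to see both sides of this while a dump is in flight (cgroup v1 layout assumed):

```
sh-4.4# journalctl -b | grep 'factory "raw" for container' | grep coredump        # kubelet/cadvisor picking up the slice
sh-4.4# ls -d /sys/fs/cgroup/systemd/system.slice/system-systemd*coredump.slice   # the transient cgroup it is tracking (only present around a dump)
```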

Comment 19 Sunil Choudhary 2022-01-21 05:47:14 UTC
Checked on 4.10.0-0.nightly-2022-01-20-082726 on a couple of clusters.

```
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-20-082726   True        False         70m     Cluster version is 4.10.0-0.nightly-2022-01-20-082726
```
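
A quick way to spot-check a node on the fixed build is to trigger a dump as in comment 4 and check whether the kubelet journal still reports the systemd-coredump slice the way comment 4 shows (`<node-name>` is a placeholder):

```
$ oc debug node/<node-name> -- chroot /host journalctl -b | grep 'factory "raw" for container' | grep coredump
```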

Comment 22 errata-xmlrpc 2022-03-12 04:35:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

