Bug 1978528

Summary: systemd-coredump started and failed intermittently for unknown reasons

Product: OpenShift Container Platform
Component: Node
Node sub component: Kubelet
Version: 4.6
Target Release: 4.10.0
Hardware: x86_64
OS: Linux
Severity: low
Priority: low
Status: CLOSED ERRATA
Type: Bug
Last Closed: 2022-03-12 04:35:46 UTC

Reporter: Xingbin Li <xingli>
Assignee: Swarup Ghosh <swghosh>
QA Contact: Sunil Choudhary <schoudha>
CC: aos-bugs, dornelas, harpatil, jligon, miabbott, mrussell, nstielau, pehunt, rphillips, smilner, svanka, travier

Doc Type: Bug Fix
Doc Text:
    Cause: cadvisor reporting coredump messages from the systemd namespace
    Consequence: systemd-coredump messages in kubelet logs
    Fix: cadvisor raw factory filters out the systemd namespace
    Result:

Comment 2 Timothée Ravier 2021-07-07 11:19:32 UTC
This is strange as the kubelet should not be managing systemd-coredump. Maybe this is an OOM situation? Has there been another process crashing on the node?
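
If it helps, one way to check the node for OOM kills and prior core dumps is roughly the following (a sketch; `<node-name>` is a placeholder for the affected node):

```
# Sketch: look for OOM-killer activity and recorded core dumps on the node
oc debug node/<node-name> -- chroot /host /bin/bash -c '
  # kernel messages from the OOM killer for the current boot, if any
  journalctl -k --no-pager | grep -iE "out of memory|oom-kill" || echo "no OOM messages"
  # core dumps recorded by systemd-coredump, if coredumpctl is present
  coredumpctl list --no-pager 2>/dev/null || echo "no core dumps recorded"
'
```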

Comment 4 Micah Abbott 2021-07-12 21:06:51 UTC
As a test of a successful coredump, I booted a 4.6 cluster in AWS and triggered a coredump of a `sleep` process.
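
Roughly the following triggers such a dump (a sketch, not the exact commands from the test session):

```
# On the node, e.g. via `oc debug node/<node-name>` followed by `chroot /host`:
# make sure core dumps are not disabled by the core file size limit
ulimit -c unlimited
# start a throwaway sleep and send it SIGSEGV so systemd-coredump catches it
sleep 300 &
kill -s SIGSEGV $!
```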

When I inspected the journal for the related messages, I could see similar entries for `systemd-coredump` being reported by `hyperkube`:

```
sh-4.4# journalctl -b | grep coredump
...
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.870574    1540 factory.go:212] Using factory "raw" for container "/system.slice/system-systemd\\x2dcoredump.slice"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.871041    1540 manager.go:987] Added container: "/system.slice/system-systemd\\x2dcoredump.slice" (aliases: [], namespace: "")
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.871392    1540 container.go:490] Start housekeeping for container "/system.slice/system-systemd\\x2dcoredump.slice"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.873739    1540 factory.go:212] Using factory "raw" for container "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.874117    1540 manager.go:987] Added container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.874388    1540 container.go:490] Start housekeeping for container "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump"
Jul 12 20:59:16 ip-10-0-221-19 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Jul 12 20:59:17 ip-10-0-221-19 systemd-coredump[32960]: Process 32765 (sleep) of user 0 dumped core.
Jul 12 20:59:17 ip-10-0-221-19 systemd[1]: systemd-coredump: Consumed 407ms CPU time
Jul 12 20:59:17 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:17.373096    1540 manager.go:1044] Destroyed container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
```

What's not clear from the customer log messages is why the coredump service is timing out or why we are seeing the `systemd-coredump[1822262]: Failed to send coredump datagram: Connection reset by peer` messages.
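
If someone does dig into the systemd side, a possible starting point on the node is something like this (a sketch, not taken from the customer case):

```
# Sketch: inspect the systemd-coredump socket and per-dump service instances
systemctl status systemd-coredump.socket --no-pager
systemctl list-units --all --no-pager 'systemd-coredump@*'
# journal entries from the coredump handler itself (glob unit match)
journalctl -u 'systemd-coredump@*' --no-pager
# processing limits that can cause dumps to be dropped or truncated
cat /etc/systemd/coredump.conf
```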

@Derrick could you get someone from the CEE org with `systemd` expertise to see if they can help here?

Comment 11 Peter Hunt 2021-08-02 14:47:01 UTC
This is weird:
```
Jul 12 20:59:17 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:17.373096    1540 manager.go:1044] Destroyed container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
```

It looks like the kubelet is misinterpreting systemd-coredump as a container it started, so it begins managing the cgroup and ends up killing it? Very odd. Tossing to Ryan to triage for the kubelet.
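
To separate cadvisor's bookkeeping from real systemd-coredump failures, something like this on the affected node may help (a sketch):

```
# Sketch: split cadvisor/raw-factory messages (logged by the kubelet binary,
# hyperkube) from messages emitted by systemd-coredump itself
journalctl -b --no-pager | grep coredump | grep hyperkube
journalctl -b --no-pager | grep coredump | grep -v hyperkube
```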

Comment 19 Sunil Choudhary 2022-01-21 05:47:14 UTC
Checked on 4.10.0-0.nightly-2022-01-20-082726 on a couple of clusters.

```
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-20-082726   True        False         70m     Cluster version is 4.10.0-0.nightly-2022-01-20-082726
```
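
One way to spot-check the fix on a node (a sketch; `<node-name>` is a placeholder) is to trigger a core dump again and confirm the kubelet no longer logs the raw factory picking up the transient systemd-coredump slice:

```
# Sketch: trigger a dump and check what the kubelet logs about it
oc debug node/<node-name> -- chroot /host /bin/bash -c '
  ulimit -c unlimited
  sleep 60 & kill -s SIGSEGV $!
  sleep 5
  # expect the systemd-coredump entry, but no raw-factory / housekeeping
  # lines for system-systemd\x2dcoredump.slice from the kubelet
  journalctl -b --no-pager | grep coredump | tail -n 20
'
```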

Comment 22 errata-xmlrpc 2022-03-12 04:35:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056