Bug 1978528
| Summary: | systemd-coredump started and failed intermittently for unknown reasons | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Xingbin Li <xingli> |
| Component: | Node | Assignee: | Swarup Ghosh <swghosh> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | low | | |
| Priority: | low | CC: | aos-bugs, dornelas, harpatil, jligon, miabbott, mrussell, nstielau, pehunt, rphillips, smilner, svanka, travier |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |

Doc Text:
Cause: cadvisor reports coredump cgroups created in the systemd namespace.
Consequence: systemd-coredump messages appear in the kubelet logs.
Fix: the cadvisor raw factory now filters out the systemd namespace.
Result:
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-12 04:35:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
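The Doc Text above is terse, so here is a minimal sketch of the kind of filtering it describes. The helper name and the exact match rule are illustrative assumptions, not the actual cadvisor change: the idea is that the raw factory skips cgroup paths systemd creates for its own transient units (such as the `systemd-coredump` slice) so they are never tracked as containers.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldIgnoreCgroup is a hypothetical helper illustrating the fix:
// skip cgroups that belong to systemd's own transient units, such as
// systemd-coredump, so the raw factory never tracks them as containers.
func shouldIgnoreCgroup(path string) bool {
	// systemd escapes '-' in unit names as \x2d, so the coredump slice
	// appears in cgroup paths as "system-systemd\x2dcoredump.slice".
	return strings.Contains(path, `systemd\x2dcoredump`)
}

func main() {
	for _, p := range []string{
		`/system.slice/system-systemd\x2dcoredump.slice/systemd-coredump`,
		`/system.slice/crio.service`,
	} {
		fmt.Printf("%s -> ignore=%v\n", p, shouldIgnoreCgroup(p))
	}
}
```

With such a filter in place, the `Added container` / `Start housekeeping` / `Destroyed container` lines quoted in the comments below would never be emitted for the coredump slice.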
Comment 2
Timothée Ravier
2021-07-07 11:19:32 UTC
As a test of a successful coredump, I booted a 4.6 cluster in AWS and triggered a coredump of a `sleep` process. When I inspected the journal for the related messages, I saw similar entries for `systemd-coredump` being reported by `hyperkube`:

```
sh-4.4# journalctl -b | grep coredump
...
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.870574 1540 factory.go:212] Using factory "raw" for container "/system.slice/system-systemd\\x2dcoredump.slice"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.871041 1540 manager.go:987] Added container: "/system.slice/system-systemd\\x2dcoredump.slice" (aliases: [], namespace: "")
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.871392 1540 container.go:490] Start housekeeping for container "/system.slice/system-systemd\\x2dcoredump.slice"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.873739 1540 factory.go:212] Using factory "raw" for container "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump"
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.874117 1540 manager.go:987] Added container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
Jul 12 20:59:16 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:16.874388 1540 container.go:490] Start housekeeping for container "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump"
Jul 12 20:59:16 ip-10-0-221-19 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Jul 12 20:59:17 ip-10-0-221-19 systemd-coredump[32960]: Process 32765 (sleep) of user 0 dumped core.
Jul 12 20:59:17 ip-10-0-221-19 systemd[1]: systemd-coredump: Consumed 407ms CPU time
Jul 12 20:59:17 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:17.373096 1540 manager.go:1044] Destroyed container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
```

What's not clear from the customer's log messages is why the coredump service is timing out, or why we are seeing the `systemd-coredump[1822262]: Failed to send coredump datagram: Connection reset by peer` messages.

@Derrick, could you get someone from the CEE org with `systemd` expertise to see if they can help here?

This is weird:

```
Jul 12 20:59:17 ip-10-0-221-19 hyperkube[1540]: I0712 20:59:17.373096 1540 manager.go:1044] Destroyed container: "/system.slice/system-systemd\\x2dcoredump.slice/systemd-coredump" (aliases: [], namespace: "")
```

It looks like the kubelet is misinterpreting `systemd-coredump` as a container it started, so it begins managing the cgroup and ends up killing it? Very odd. Tossing to Ryan to triage for the kubelet.

Checked on 4.10.0-0.nightly-2022-01-20-082726 on a couple of clusters:

```
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-20-082726   True        False         70m     Cluster version is 4.10.0-0.nightly-2022-01-20-082726
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
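The reproduction described above can be sketched as a short shell session. This is a generic sketch, not the reporter's exact commands: it assumes you are in a root shell on the node and that `systemd-coredump` is the registered core handler (as on RHCOS); PIDs and journal output will differ per node.

```shell
# Start a throwaway process and crash it with SIGSEGV to trigger a core dump.
sleep 300 &
pid=$!
kill -SEGV "$pid"
wait "$pid"
echo "exit status: $?"   # 128 + SIGSEGV(11) = 139

# Then inspect the journal for the handler's activity:
# journalctl -b | grep coredump
```

The exit status check works regardless of whether a dump is actually written, so it is a cheap way to confirm the signal was delivered before digging through the journal.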