Bug 2003197 - CRI-O leaks some children PIDs
Summary: CRI-O leaks some children PIDs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On: 2002434
Blocks: 2003199 2009752
Reported: 2021-09-10 14:58 UTC by Peter Hunt
Modified: 2021-10-18 17:51 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: systemd was under load, causing CRI-O to fail to move conmon to the systemd cgroup.
Consequence: A bug in CRI-O leaked the conmon process, leaving a zombie.
Fix: Don't leak the conmon process.
Result: No zombies under CRI-O, even if systemd is overloaded.
Clone Of: 2002434
Environment:
Last Closed: 2021-10-18 17:51:28 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | cri-o cri-o pull 5295 | 0 | None | None | None | 2021-09-10 14:59:48 UTC
Github | cri-o cri-o pull 5309 | 0 | None | Merged | [release-1.22] server: do not wait forever on conmon cgroup move fail | 2021-09-15 14:32:06 UTC
Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:51:49 UTC

Description Peter Hunt 2021-09-10 14:58:52 UTC
+++ This bug was initially created as a clone of Bug #2002434 +++

Description of problem:
Occasionally, CRI-O may leak a child PID of a process it creates. These situations are unusual and tough to reproduce. The most common one is when systemd fails to move conmon to the conmon cgroup for some reason.
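
The general failure mode described above (a parent starts a child, a follow-up step fails, and the child is never waited on) can be illustrated with a minimal sketch. This is not CRI-O's code (CRI-O is written in Go); it is a Python illustration under stated assumptions. The names `start_monitored`, `failing_move_to_cgroup`, and `CgroupMoveError` are hypothetical, and the cgroup move is simulated by a function that always raises.

```python
import subprocess
import sys


class CgroupMoveError(Exception):
    """Hypothetical error standing in for a failed cgroup move."""


def failing_move_to_cgroup(pid):
    # Hypothetical stand-in for asking systemd to move a PID into a cgroup.
    # It always fails, to simulate systemd being overloaded.
    raise CgroupMoveError(f"could not move pid {pid}")


def start_monitored(argv, move_to_cgroup, timeout=5.0):
    """Start a child, run a follow-up step, and reap the child if that step fails."""
    child = subprocess.Popen(argv)
    try:
        move_to_cgroup(child.pid)
    except Exception:
        # If the parent returned here without kill() + wait(), the child
        # would be leaked; once it exited it would remain a zombie until
        # the parent reaped it or exited itself.
        child.kill()
        child.wait(timeout=timeout)
        raise
    return child


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    try:
        start_monitored(sleeper, failing_move_to_cgroup)
    except CgroupMoveError as err:
        print(f"follow-up step failed, child was reaped: {err}")
```

The bounded `timeout` on `wait()` is a design choice in this sketch so that cleanup cannot block indefinitely; it is an assumption of the example, not a statement about how the attached PRs work.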

I don't have a great reproducer, but this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1994444 (though branched away to allow for other bugs to be investigated there).

Version-Release number of selected component (if applicable):
All released CRI-O versions

--- Additional comment from Peter Hunt on 2021-09-10 13:29:31 UTC ---

PR merged

--- Additional comment from OpenShift Automated Release Tooling on 2021-09-10 14:39:39 UTC ---

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.10 release created.

Comment 1 Peter Hunt 2021-09-10 14:59:48 UTC
fixed in attached PR

Comment 3 Peter Hunt 2021-09-14 16:55:03 UTC
oops, we need a 4.9 variant of https://github.com/cri-o/cri-o/pull/5306 as well

Comment 5 Peter Hunt 2021-09-15 14:32:26 UTC
PR merged

Comment 7 Sunil Choudhary 2021-09-21 15:56:12 UTC
I see this is tough to reproduce. Verifying it based on some sanity checks on 4.9.0-0.nightly-2021-09-20-203004

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-20-203004   True        False         175m    Cluster version is 4.9.0-0.nightly-2021-09-20-203004

Comment 8 Michael Burke 2021-09-30 20:26:53 UTC
Peter --

Can you review my proposed release note for this issue? I see the doc text you added. Thank you. However, I think it needs to be cleaned up a bit.

* Previously, a bug in CRI-O caused CRI-O to leak a child PID of a process it created. As a result, under load, systemd could create a significant number of zombie processes. CRI-O was fixed to prevent the leakage. As a result, these zombie processes are no longer being created.

Is there a consequence of the zombie processes? Node failure or such?

Thank you for your help.

Comment 9 Peter Hunt 2021-10-01 13:51:50 UTC
the blurb content LGTM!

The consequence of zombies *could* be node failure if the node runs out of PIDs, which is quite unlikely. More likely than not, it'll just hold entries in the kernel process table, look bad and be wasteful.
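
For anyone who wants to check a node for leftover zombies, a rough sketch follows. It assumes a Linux `/proc` filesystem, where the state field in `/proc/<pid>/stat` is `Z` for a zombie. The function names `parse_stat_state` and `count_zombies` are hypothetical helpers written for this illustration, not part of CRI-O or any OpenShift tooling.

```python
import os


def parse_stat_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the last ')'.
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_zombies(proc_root="/proc"):
    """Count processes under proc_root whose state is 'Z' (zombie)."""
    zombies = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as 'self' or 'sys'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as stat_file:
                state = parse_stat_state(stat_file.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "Z":
            zombies += 1
    return zombies


if __name__ == "__main__":
    print(f"zombie processes: {count_zombies()}")
```

The count can be compared against the PID limit in `/proc/sys/kernel/pid_max` to judge how close a node is to the exhaustion case mentioned above.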

Comment 12 errata-xmlrpc 2021-10-18 17:51:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

