Bug 2003197

Summary: CRI-O leaks some children PIDs
Product: OpenShift Container Platform
Reporter: Peter Hunt <pehunt>
Component: Node
Assignee: Peter Hunt <pehunt>
Node sub component: CRI-O
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium    
Priority: high
CC: aos-bugs, mburke, schoudha
Version: 4.10   
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Systemd experiencing load, causing CRI-O to fail to move conmon to the systemd cgroup
Consequence: A bug in CRI-O that leaked the conmon process, causing a zombie
Fix: Don't leak the conmon process
Result: No zombies under CRI-O, even if systemd is overloaded
Story Points: ---
Clone Of: 2002434
Environment:
Last Closed: 2021-10-18 17:51:28 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2002434    
Bug Blocks: 2003199, 2009752    

Description Peter Hunt 2021-09-10 14:58:52 UTC
+++ This bug was initially created as a clone of Bug #2002434 +++

Description of problem:
Occasionally, CRI-O may leak a child PID of a process it creates. These situations are weird and tough to reproduce. The most common one is when systemd fails to move conmon to the conmon cgroup for some reason.
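
For readers unfamiliar with how a failed post-start step can leave a zombie behind, here is a minimal sketch of the general pattern in Python. It is not CRI-O's code (CRI-O is written in Go) and not the change made in the attached PR; `move_to_cgroup` and `start_monitor` are hypothetical names used only for illustration. The point it shows: if the parent gives up on a child after an error, it still has to wait on the child, otherwise the exited child stays in the process table as a zombie.

```python
import subprocess
import sys


def move_to_cgroup(pid: int) -> None:
    """Hypothetical stand-in for a post-start step that can fail under load."""
    raise RuntimeError(f"could not move pid {pid} to its cgroup")


def start_monitor(post_start=move_to_cgroup) -> subprocess.Popen:
    """Start a child process; if the post-start step fails, kill AND reap it."""
    # The sleeping interpreter stands in for a long-lived monitor process.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        post_start(child.pid)
    except Exception:
        child.kill()
        # Without this wait(), the killed child would remain a zombie
        # for as long as the parent keeps running.
        child.wait()
        raise
    return child
```

Calling `start_monitor()` with the default failing step raises `RuntimeError` and leaves no child behind; passing a `post_start` that succeeds returns the running child, which the caller is then responsible for waiting on.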

I don't have a great reproducer, but this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1994444 (though branched away to allow for other bugs to be investigated there).

Version-Release number of selected component (if applicable):
All released CRI-O versions

--- Additional comment from Peter Hunt on 2021-09-10 13:29:31 UTC ---

PR merged

--- Additional comment from OpenShift Automated Release Tooling on 2021-09-10 14:39:39 UTC ---

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.10 release created.

Comment 1 Peter Hunt 2021-09-10 14:59:48 UTC
fixed in attached PR

Comment 3 Peter Hunt 2021-09-14 16:55:03 UTC
oops, we need a 4.9 variant of https://github.com/cri-o/cri-o/pull/5306 as well

Comment 5 Peter Hunt 2021-09-15 14:32:26 UTC
PR merged

Comment 7 Sunil Choudhary 2021-09-21 15:56:12 UTC
I see this is tough to reproduce. Verifying it based on some sanity checks on 4.9.0-0.nightly-2021-09-20-203004

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-20-203004   True        False         175m    Cluster version is 4.9.0-0.nightly-2021-09-20-203004

Comment 8 Michael Burke 2021-09-30 20:26:53 UTC
Peter --

Can you review my proposed release note for this issue? I see the doc text you added. Thank you. However, I think it needs to be cleaned up a bit.

* Previously, a bug in CRI-O caused CRI-O to leak a child PID of a process it created. As a result, when systemd was under load, a significant number of zombie processes could be created. CRI-O was fixed to prevent the leakage. As a result, these zombie processes are no longer being created.

Is there a consequence of the zombie processes? Node failure or such?

Thank you for your help.

Comment 9 Peter Hunt 2021-10-01 13:51:50 UTC
the blurb content LGTM!

The consequence of zombies *could* be node failure if the node runs out of PIDs, which is quite unlikely. More likely than not, it'll just hold entries in the kernel process table, look bad and be wasteful.
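
To get a rough sense of how many process-table entries zombies are holding on a node, something like the following Python sketch could be used. It assumes a Linux `/proc` filesystem; `process_state` and `count_zombies` are hypothetical helper names, not part of any CRI-O or OpenShift tooling.

```python
from pathlib import Path


def process_state(stat_text: str) -> str:
    """Return the state letter from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST closing parenthesis.
    return stat_text.rsplit(")", 1)[1].split()[0]


def count_zombies(proc_root: str = "/proc") -> int:
    """Count processes in state 'Z' (zombie) under proc_root."""
    zombies = 0
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric directory names are PIDs
        try:
            if process_state((entry / "stat").read_text()) == "Z":
                zombies += 1
        except (OSError, IndexError):
            continue  # process exited (or entry unreadable) mid-scan
    return zombies
```

Comparing the result of `count_zombies()` with the value in `/proc/sys/kernel/pid_max` shows how far a node is from running out of PIDs.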

Comment 12 errata-xmlrpc 2021-10-18 17:51:28 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759