+++ This bug was initially created as a clone of Bug #2002434 +++

Description of problem:
Occasionally, CRI-O may leak a child pid of a process it creates. These situations are weird and tough to reproduce. The most common one is if systemd fails to move conmon to the conmon cgroup for some reason. I don't have a great reproducer, but this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1994444 (though branched away to allow for other bugs to be investigated there).

Version-Release number of selected component (if applicable):
All released CRI-O versions

--- Additional comment from Peter Hunt on 2021-09-10 13:29:31 UTC ---
PR merged

--- Additional comment from OpenShift Automated Release Tooling on 2021-09-10 14:39:39 UTC ---
Elliott changed bug status from MODIFIED to ON_QA. This bug is expected to ship in the next 4.10 release created.
fixed in attached PR
oops, we need a 4.9 variant of https://github.com/cri-o/cri-o/pull/5306 as well
PR merged
I see this is tough to reproduce. Verifying it based on some sanity checks on 4.9.0-0.nightly-2021-09-20-203004

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-20-203004   True        False         175m    Cluster version is 4.9.0-0.nightly-2021-09-20-203004
Peter -- Can you review my proposed release note for this issue? I see the doc text you added. Thank you. However, I think it needs to be cleaned up a bit.

* Previously, a bug in CRI-O caused CRI-O to leak a child pid of a process it created. As a result, under load systemd could create a significant number of zombie processes. CRI-O was fixed to prevent the leakage. As a result, these zombie processes are no longer being created.

Is there a consequence of the zombie processes? Node failure or such? Thank you for your help.
the blurb content LGTM! The consequence of zombies *could* be node failure if the node runs out of PIDs, which is quite unlikely. More likely than not, it'll just hold entries in the kernel process table, look bad and be wasteful.
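For anyone who wants to sanity-check a node for leftover zombies, here is a minimal sketch. It is not part of the CRI-O fix and not how the bug was verified; the helper names `proc_state` and `count_zombies` are made up for illustration. It assumes a Linux `/proc` filesystem, where the field right after the parenthesised command name in `/proc/<pid>/stat` is the process state and `Z` means zombie.

```python
import os


def proc_state(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split after the LAST ')'.
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    return rest[0]


def count_zombies(proc_root="/proc"):
    """Count entries under proc_root whose state is 'Z' (zombie)."""
    zombies = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are pids
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                if proc_state(f.read()) == "Z":
                    zombies += 1
        except (OSError, ValueError, IndexError):
            # The process exited between listing and reading,
            # or the stat file was missing or malformed; skip it.
            continue
    return zombies


if __name__ == "__main__":
    print(count_zombies())
```

A count that keeps growing over time on a busy node would be consistent with the leak described in this bug, though zombies can come from other sources too.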
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759