Bug 1967808

Summary: Readiness "exec" probes cause zombie processes on certain container images
Product: OpenShift Container Platform Reporter: Santhana Gopala Krishnan Iyer <saniyer>
Component: Node Assignee: Peter Hunt <pehunt>
Node sub component: CRI-O QA Contact: MinLi <minmli>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: urgent CC: acardena, aos-bugs, dhellmann, eminguez, erismith, hgao, jteagno+bugzilla, pehunt, spasquie
Version: 4.6   
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-18 17:32:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Comment 2 Peter Hunt 2021-06-09 15:52:14 UTC
*** Bug 1967807 has been marked as a duplicate of this bug. ***

Comment 3 Peter Hunt 2021-06-11 18:50:36 UTC
haven't gotten a moment to look at this yet

Comment 4 Peter Hunt 2021-07-02 20:38:04 UTC
sorry, still haven't

Comment 6 Peter Hunt 2021-07-08 19:24:25 UTC
*** Bug 1980522 has been marked as a duplicate of this bug. ***

Comment 7 Peter Hunt 2021-07-14 15:44:48 UTC
many of these issues should have been mitigated in the fixes for https://bugzilla.redhat.com/show_bug.cgi?id=1952137

Can we have the pod spec associated with this amq image, so we can test that we don't get zombies in 4.6.36?

Comment 9 Peter Hunt 2021-07-15 20:11:44 UTC
Hey folks,

I've emerged from deep within the code to say that the fixes in 4.6.36 do improve the situation but don't totally fix it, and what remains is the application's fault. Here's why:

Liveness and readiness exec probes eventually call `runc exec` to spawn the exec process in the container. `runc exec` will join the namespaces of the parent container. If the container is in a private pid namespace (as is the default), that means the exec process will be a child of the container process. If that process is killed without being reaped, then its zombie will live in the process table until either its parent reaps it (calls wait()), or that parent is killed.
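
To make the zombie mechanism above concrete, here is a minimal Python sketch for a Linux host. It is not taken from cri-o or runc; the helper names `proc_state` and `demo_zombie` are made up for illustration. A child exits, the parent deliberately does not call wait(), and the child sits in state "Z" in /proc until the parent finally reaps it.

```python
import os
import time
from typing import Optional, Tuple


def proc_state(pid: int) -> Optional[str]:
    """Return the one-letter state from /proc/<pid>/stat, or None if the pid is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The state field is the first field after the parenthesised command name.
    return stat.rsplit(")", 1)[1].split()[0]


def demo_zombie() -> Tuple[Optional[str], Optional[str]]:
    """Fork a child that exits at once; report its state before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, nobody has waited on us yet

    # Parent: do NOT wait yet. Poll until the kernel marks the child as a zombie.
    deadline = time.monotonic() + 5
    while proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    before = proc_state(pid)

    os.waitpid(pid, 0)  # reaping removes the entry from the process table
    after = proc_state(pid)
    return before, after


if __name__ == "__main__":
    print(demo_zombie())
```

In the readiness probe case the parent never makes that wait() call and is never killed, so the "Z" entry stays around.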

This is less of an issue for liveness probes, as a timed out liveness probe eventually results in the container being killed, thus having the second condition satisfied. But for readiness probes, the container is kept alive, thus keeping the zombies "alive".

There was a bug that was finally fixed in 4.6.36 that cut conmon out of the middle. There was a period of 4.6 where conmon was at risk of being zombified (as shown in https://bugzilla.redhat.com/show_bug.cgi?id=1967808#c1). However, 4.6.36 now makes it so that *only* the container process is zombified. There is nothing further that cri-o can do about this. It is up to the application author (amq in this case) to have the pid 1 of the container (the initial container process) reap the exec processes, or to be in the pod pid namespace (where the pod infra container is pid 1 and does the reaping).
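
As a rough illustration of what "have pid 1 reap the exec processes" means, here is a minimal Python sketch of a non-blocking reaper loop. The function names `reap_children` and `spawn_short_lived` are hypothetical, and this is not how any particular init or the amq image does it. A real pid 1 would typically run such a loop whenever it receives SIGCHLD; the demo only forks direct children so that it can run outside a container.

```python
import os
import time
from typing import List


def reap_children() -> List[int]:
    """Reap every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def spawn_short_lived(n: int) -> List[int]:
    """Fork n children that exit immediately, standing in for finished exec probes."""
    pids = []
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)
        pids.append(pid)
    return pids


if __name__ == "__main__":
    pending = set(spawn_short_lived(3))
    deadline = time.monotonic() + 5
    while pending and time.monotonic() < deadline:
        pending -= set(reap_children())
        time.sleep(0.01)
    print("unreaped children:", len(pending))
```

Small init programs written for containers do essentially this, alongside forwarding signals to the real application.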

Comment 13 MinLi 2021-08-05 09:52:29 UTC
According to Comment 12, the zombie processes caused by the amq image can only be resolved by installing an init container into the image.
Setting to verified.

Comment 19 errata-xmlrpc 2021-10-18 17:32:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 20 Red Hat Bugzilla 2023-09-15 01:08:59 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days