Bug 1967808 - Readiness "exec" probes causes zombie process on certain container images
Summary: Readiness "exec" probes causes zombie process on certain container images
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Peter Hunt
QA Contact: MinLi
URL:
Whiteboard:
Duplicates: 1967807
Depends On:
Blocks:
Reported: 2021-06-04 04:53 UTC by Santhana Gopala Krishnan Iyer
Modified: 2024-12-20 20:10 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:32:54 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github cri-o cri-o pull 4943 0 None closed oci: do not use conmon for exec sync 2021-07-14 15:44:44 UTC
Github cri-o cri-o pull 4999 0 None closed oci: fix issues with exec 2021-07-14 15:44:44 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:33:12 UTC

Comment 2 Peter Hunt 2021-06-09 15:52:14 UTC
*** Bug 1967807 has been marked as a duplicate of this bug. ***

Comment 3 Peter Hunt 2021-06-11 18:50:36 UTC
haven't gotten a moment to look at this yet

Comment 4 Peter Hunt 2021-07-02 20:38:04 UTC
sorry, still haven't

Comment 6 Peter Hunt 2021-07-08 19:24:25 UTC
*** Bug 1980522 has been marked as a duplicate of this bug. ***

Comment 7 Peter Hunt 2021-07-14 15:44:48 UTC
Many of these issues should have been mitigated by the fixes for https://bugzilla.redhat.com/show_bug.cgi?id=1952137

Can we have the pod spec associated with this amq image, so we can test that we don't get zombies in 4.6.36?

Comment 9 Peter Hunt 2021-07-15 20:11:44 UTC
Hey folks,

I've emerged from deep within the code to say that the fixes in 4.6.36 improve the situation but don't totally fix it, and what remains is the application's fault. Here's why:

Liveness and readiness exec probes eventually call `runc exec` to spawn the probe process inside the container. `runc exec` joins the namespaces of the target container. If the container is in a private pid namespace (as is the default), that means the exec process will be a child of the container process. If that exec process is killed without being reaped, its zombie will live in the process table until either its parent reaps it (calls wait()) or that parent is killed.
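
To see the mechanism in isolation, here is a minimal Python sketch. It is not cri-o or runc code. It assumes a Linux host with /proc mounted, and the names `proc_state` and `zombie_lifecycle` are made up for illustration. A child exits before anyone waits on it, sits in state Z, and disappears once its parent calls wait():

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the pid is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    # Format is "pid (comm) state ..."; comm may itself contain ")" or spaces,
    # so split on the last ")" before reading the state field.
    return data.rsplit(")", 1)[1].split()[0]


def zombie_lifecycle(timeout=5.0):
    """Fork a child that exits at once; return its state before and after wait()."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, before the parent has waited

    # Parent: poll until the kernel reports the child as a zombie.
    deadline = time.monotonic() + timeout
    state = proc_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)

    os.waitpid(pid, 0)  # reaping removes the entry from the process table
    return state, proc_state(pid)


if __name__ == "__main__":
    before, after = zombie_lifecycle()
    print("state before wait():", before)
    print("state after wait(): ", after)
```

In the probe case the parent that never calls wait() is the container's pid 1, which is why the entry sticks around.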

This is less of an issue for liveness probes, as a timed-out liveness probe eventually results in the container being killed, which satisfies the second condition. For readiness probes, though, the container is kept alive, which keeps the zombies "alive" too.
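
If you want to check whether zombies are piling up, something along these lines could be used. It is a rough sketch that assumes a Linux /proc and a Python interpreter in whatever pid namespace you run it from. `list_zombies` is a hypothetical helper, not part of any OpenShift tooling. It scans /proc and reports every process in state Z together with its parent pid:

```python
import os


def list_zombies():
    """Return (pid, ppid, comm) for every process currently in state Z."""
    zombies = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/stat") as f:
                data = f.read()
        except OSError:
            continue  # process went away (or is unreadable) mid-scan
        # "pid (comm) state ppid ..."; split on the last ")" because comm
        # may contain parentheses or spaces.
        head, tail = data.rsplit(")", 1)
        comm = head.split("(", 1)[1]
        fields = tail.split()
        state, ppid = fields[0], int(fields[1])
        if state == "Z":
            zombies.append((int(entry), ppid, comm))
    return zombies


if __name__ == "__main__":
    for pid, ppid, comm in list_zombies():
        print(f"zombie pid={pid} ppid={ppid} comm={comm}")
```

A zombie count that grows with every timed-out readiness probe, with the container's pid 1 listed as the parent, would match the behavior described here.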

A fix that finally landed in 4.6.36 cuts conmon out of the middle. There was a period of 4.6 where conmon itself was at risk of being zombified (as shown in https://bugzilla.redhat.com/show_bug.cgi?id=1967808#c1). As of 4.6.36, *only* the container process is zombified. There is nothing further that cri-o can do about this. It is up to the application author (amq in this case) to have pid 1 of the container (the initial container process) reap the exec processes, or to run in the pod pid namespace (where the pod infra container is pid 1 and does the reaping).
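
As a rough illustration of what "pid 1 reaps the exec processes" means, here is a minimal init-style wrapper in Python. It is only a sketch under assumptions. `run_as_init` is a hypothetical name, it needs Python 3.9+ for `os.waitstatus_to_exitcode`, and it does no signal forwarding, which a real init would need. It starts the workload and then blocks in wait(), which also collects any process that gets reparented to it:

```python
import os
import subprocess
import sys


def run_as_init(argv):
    """Start the workload, then reap every child until the workload itself exits."""
    workload = subprocess.Popen(argv)
    while True:
        # Blocks until any child exits. When this process is pid 1 of a pid
        # namespace, that includes orphans that were reparented to it.
        pid, status = os.wait()
        if pid == workload.pid:
            # Propagate the workload's exit code (negative if killed by a signal).
            return os.waitstatus_to_exitcode(status)
        # Any other pid was a stray child; reaping it is all that is needed.


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python mini_init.py COMMAND [ARGS...]
    sys.exit(run_as_init(sys.argv[1:]))
```

An image would normally use an existing small init rather than a script like this. The other option, joining the pod pid namespace, is requested in Kubernetes with `shareProcessNamespace: true` in the pod spec.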

Comment 13 MinLi 2021-08-05 09:52:29 UTC
According to comment 12, the zombie processes caused by the amq image can only be resolved by installing an init container into the image.
Setting to verified.

Comment 19 errata-xmlrpc 2021-10-18 17:32:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 20 Red Hat Bugzilla 2023-09-15 01:08:59 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.

