Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1980522

Summary: zombied processes due to failed readiness/liveness probes
Product: OpenShift Container Platform
Reporter: Albert Cardenas <acardena>
Component: Containers
Assignee: Tom Sweeney <tsweeney>
Status: CLOSED DUPLICATE
QA Contact: pmali
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.6
CC: aos-bugs, dwalsh, gwest, jokerman, pehunt
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-08 19:24:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Albert Cardenas 2021-07-08 18:04:32 UTC
Description of problem:

When pods are deployed with containers or sidecars that have liveness/readiness probes configured, and a probe does not respond within the configured timeout, the container gets reaped and restarted, but the probe command that was executed is left in a defunct state.

When this occurs, it may take hundreds of restarts per container before it becomes healthy. This results in roughly 5000 defunct/zombied processes on the nodes, which contributes to resource contention and an inability to schedule more pods as the system's load and I/O are taxed.

The only remediation is to cordon and drain the node, then reboot it. Once the node is uncordoned and Verizon performs a deployment, the issue begins to recur.
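As a rough way to quantify the problem on an affected node, the defunct processes can be counted by scanning `/proc` for processes in the Z state. This is a minimal diagnostic sketch (Linux-only, not an official OpenShift tool; the function name is our own):

```python
import os

def count_zombies():
    """Count processes in the 'Z' (zombie/defunct) state by scanning /proc."""
    count = 0
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-PID entries such as /proc/meminfo
        try:
            with open(f"/proc/{entry}/stat") as f:
                # /proc/<pid>/stat is "pid (comm) state ..."; comm itself may
                # contain spaces or parens, so split on the LAST ')'.
                state = f.read().rsplit(")", 1)[1].split()[0]
        except OSError:
            continue  # process exited while we were scanning
        if state == "Z":
            count += 1
    return count

print(count_zombies())
```

On a healthy node this should print a number near zero; in the state described above it would be in the thousands.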

Version-Release number of selected component (if applicable):
OCP 4.6.17

How reproducible:
Verizon Specific

Steps to Reproduce:
Verizon specific; N/A

Actual results:


Expected results:
No zombied processes / No resource contention on worker

Comment 2 Peter Hunt 2021-07-08 19:24:27 UTC
we're tracking this in 1967808

*** This bug has been marked as a duplicate of bug 1967808 ***

Comment 3 Peter Hunt 2021-07-14 13:44:44 UTC
On further thought, I think this bug is actually distinct from https://bugzilla.redhat.com/show_bug.cgi?id=1967808, and is instead fixed by the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1952137, which shipped in 4.6.36.

1967808 seems to be more about a bunk container: the PID 1 in the container is not reaping processes, and we're not yet being smart about handling the reparenting that results.
1952137 was about excessive processes being created when a node is under load. Previously, CRI-O would call conmon, which would call runc, and those conmon processes would not be correctly reaped by CRI-O. The fixes for that (https://github.com/cri-o/cri-o/pull/4943 and https://github.com/cri-o/cri-o/pull/4999) cut conmon out of the middle. This results in fewer processes being created, and also in better handling of CRI-O's children, preventing zombies.
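For context on the reaping behavior described above: a child whose parent never calls wait() remains in the kernel's process table as a zombie, which is the mechanism behind both bugs. A minimal Python demonstration of the effect (a standalone sketch, not CRI-O or conmon code):

```python
import os
import time

# Fork a child that exits immediately. Until the parent calls waitpid(),
# the exited child stays in the process table as a zombie ("defunct").
pid = os.fork()
if pid == 0:
    os._exit(0)  # child: exit without being waited on yet

time.sleep(0.1)  # give the child time to exit

# /proc/<pid>/stat is "pid (comm) state ..."; the state field is 'Z' here.
with open(f"/proc/{pid}/stat") as f:
    state = f.read().rsplit(")", 1)[1].split()[0]
print(state)  # 'Z': exited but not yet reaped

os.waitpid(pid, 0)  # reaping the child removes the zombie entry
print(os.path.exists(f"/proc/{pid}/stat"))  # False: entry is gone
```

A parent that forgets the waitpid() step (or a PID 1 that ignores orphans reparented to it) accumulates one such entry per exited child, which is exactly the buildup seen on the affected nodes.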

While investigating this bug in general, we attempted to create many containers with many exec probes (1000 deployments) to overwhelm the system. Without the fixes for 1952137, many zombies were created; after updating to 4.6.36, there were no detectable zombie processes. Thus, I am updating the bug this one is marked as a duplicate of.

*** This bug has been marked as a duplicate of bug 1952137 ***