Bug 2078634

Summary: CRI-O not killing Calico CNI stalled (zombie) processes.
Product: OpenShift Container Platform
Reporter: Akash Dubey <adubey>
Component: Node
Assignee: Peter Hunt <pehunt>
Node sub component: CRI-O
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: aos-bugs, mmarkand, nagrawal, nclear, openshift-bugs-escalate, pehunt
Version: 4.6.z
Target Milestone: ---
Target Release: 4.11.0
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-10 11:08:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2095405
Attachments:
ps -aux - before and after deleting pod - This file shows the duplicate "bird" and "calico-node" processes before and after deleting the calico-node pod (Flags: none)

Description Akash Dubey 2022-04-25 21:04:38 UTC
Created attachment 1874951 [details]
ps -aux - before and after deleting pod - This file shows the duplicate "bird" and "calico-node" processes before and after deleting the calico-node pod

Description of problem:
Duplicate defunct Calico processes aren't being removed by CRI-O.


Version-Release number of selected component (if applicable):
Calico CNI v3.20 on OCP 4.6.26

How reproducible:
The issue is seen on the customer's cluster running the version specified above; there is no standalone reproducer.

Steps to Reproduce:
No reliable reproduction steps are known; see comments 4 and 7.

Actual results:
The customer sees a large number of such bird/bird6 processes, for example:
sh-4.4# ps -aux | grep bird
root 52098 0.0 0.0 10028 1432 ? Sl Mar23 4:21 bird6 -R -s /var/run/calico/bird6.ctl -d -c /etc/calico/confd/config/bird6.cfg
root 52099 0.1 0.0 11220 2412 ? Sl Mar23 14:41 bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg

Expected results:
There should be no defunct (zombie) processes left on the node.

Additional info:
As a workaround, the customer can clear the issue by manually deleting the affected pods, but it is not clear how long that fix lasts.

We stated that all processes within a container are to be managed by that container's own code, and the writer of the container image is responsible for reaping zombie processes. OCP manages only the processes that are part of the container, according to the pod's restartPolicy, and deletes the container and its associated processes when the container exits or is manually deleted.
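
For illustration only, here is a minimal sketch in Go (not the Calico or CRI-O code; "/usr/bin/real-workload" is a placeholder) of how an image's PID 1 can reap children that get reparented to it:

package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	// Catch SIGCHLD so we know when any child (including reparented ones) exits.
	sigc := make(chan os.Signal, 1)
	signal.Notify(sigc, syscall.SIGCHLD)

	// Placeholder for the image's real main process (an assumption of this sketch).
	cmd := exec.Command("/usr/bin/real-workload")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		os.Exit(1)
	}

	for range sigc {
		for {
			// Reap every exited child so none linger in the defunct (Z) state.
			var ws syscall.WaitStatus
			pid, err := syscall.Wait4(-1, &ws, syscall.WNOHANG, nil)
			if pid <= 0 || err != nil {
				break
			}
			if pid == cmd.Process.Pid {
				// The main workload is gone; propagate its exit status.
				os.Exit(ws.ExitStatus())
			}
		}
	}
}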

Since Calico is our partner, they have opened a TSAConnect ticket with us to collaborate and have an open discussion on the issue.

On the call, Aadhil (the Tigera/Calico contact) showed that the defunct processes, which should exist only within the container, were actually visible outside of it. They are probably caused by the health check command, but we (RH engineering) need to confirm this.
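
As a purely hypothetical illustration of the mechanism (not the actual Calico health check), a command that starts a child and never waits on it leaves that child showing as <defunct> for as long as the parent lives; if the parent then exits, the zombie is reparented to whatever acts as init and must be reaped there:

package main

import (
	"os/exec"
	"time"
)

func main() {
	// Start a short-lived child...
	cmd := exec.Command("true")
	_ = cmd.Start()

	// ...and deliberately never call cmd.Wait(): the exited child stays in
	// the defunct (Z) state, visible in ps, for as long as this parent lives.
	time.Sleep(10 * time.Minute)
}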

I am also attaching the ps outputs that Aadhil showed us on the call. I would request someone from engineering to join the call to get a better understanding of the issue.

Looking forward to hearing from you.

Regards
Akash

Comment 4 Peter Hunt 2022-04-29 20:32:37 UTC
I've given this some thought, and this does read as a bug. Generally, I recommend folks design their containers such that PID 1 in the container will reap any children it creates. However, there are situations where PID 1 can't do that (OOM kill is one that comes immediately to mind) and we shouldn't leak processes in these cases either.

I need to put together a reproducer to properly fix this (any help with that from partners or customers would be greatly appreciated), but I have a suspicion about how to do it.
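
For context, one generic cleanup technique is to signal whatever PIDs are still listed in the exited container's cgroup. The Go sketch below is only an illustration of that technique, not necessarily the change that was actually merged for this bug, and the cgroup path is a placeholder:

package main

import (
	"bufio"
	"errors"
	"fmt"
	"os"
	"strconv"
	"syscall"
)

// killLeftovers signals every PID still listed in the given cgroup directory,
// so that no container processes are leaked after the container has exited.
func killLeftovers(cgroupDir string) error {
	f, err := os.Open(cgroupDir + "/cgroup.procs") // one PID per line
	if err != nil {
		return err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		pid, convErr := strconv.Atoi(sc.Text())
		if convErr != nil {
			continue
		}
		// ESRCH just means the process is already gone; anything else is reported.
		if kerr := syscall.Kill(pid, syscall.SIGKILL); kerr != nil && !errors.Is(kerr, syscall.ESRCH) {
			fmt.Fprintf(os.Stderr, "kill %d: %v\n", pid, kerr)
		}
	}
	return sc.Err()
}

func main() {
	// Placeholder path; real pod/container cgroup layouts differ between setups.
	_ = killLeftovers("/sys/fs/cgroup/kubepods.slice/example-container")
}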

Comment 7 Akash Dubey 2022-05-04 13:24:59 UTC
Our partner has stated that they could not reproduce the issue on their end. However, the issue persists on customer clusters. Perhaps we can take a look at a customer's cluster as part of testing.

Comment 28 Sunil Choudhary 2022-06-07 09:58:42 UTC
Since we don't have a reproducer, marking this verified based on the tests run in the customer environment in comment #24.

Comment 36 Sunil Choudhary 2022-06-15 11:08:37 UTC
Updated 4.11 PR merged.

Comment 38 errata-xmlrpc 2022-08-10 11:08:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069