Created attachment 1874951 [details]
ps -aux - before and after deleting pod

This file shows the duplicate "bird" and "calico-node" processes before and after deleting the calico-node pod.

Description of problem:
Duplicate defunct Calico processes aren't being removed by CRI-O.

Version-Release number of selected component (if applicable):
Calico CNI v3.20 on OCP 4.6.26

How reproducible:
The customer hits this issue on their cluster on the version specified above.

Steps to Reproduce:
1.
2.
3.

Actual results:
They see many such bird/bird6 processes:

sh-4.4# ps -aux | grep bird
root 52098 0.0 0.0 10028 1432 ? Sl Mar23 4:21 bird6 -R -s /var/run/calico/bird6.ctl -d -c /etc/calico/confd/config/bird6.cfg
root 52099 0.1 0.0 11220 2412 ? Sl Mar23 14:41 bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg

Expected results:
There shouldn't be any defunct processes.

Additional info:
As a workaround, the customer can clear the processes by manually deleting the pods, but it is unclear how long that fix lasts.

We stated that all processes within a container are to be managed by that container's code, and that the writer of the container image is responsible for reaping zombie processes. OCP manages only the processes that are part of the container, according to the pod's restartPolicy, and deletes the container and its associated processes when the container exits or is manually deleted.

Since Calico is our partner, they have opened a TSAConnect ticket with us to collaborate and have an open discussion on the issue. Aadhil (Tigera/Calico contact) showed on the call that defunct processes that should be confined to the container were actually visible outside of it. These might be caused by the health check command, but we (RH engineering) need to confirm this. I am also attaching the ps outputs that Aadhil showed us on the call.

I request that someone from engineering join the call to get a better understanding of the issue. Looking forward to hearing from you.

Regards,
Akash
I've given this some thought, and this does read as a bug. Generally, I recommend folks design their containers such that PID 1 in the container reaps any children it creates. However, there are situations where PID 1 can't do that (an OOM kill is one that comes immediately to mind), and we shouldn't leak processes in those cases either. I need to put together a reproducer to fix this properly (any help with this from partners or customers would be greatly appreciated), but I have a suspicion about how to do it.
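The "PID 1 reaps any children it creates" pattern mentioned above can be sketched as follows. This is a hedged, minimal Python illustration of the general technique (a SIGCHLD handler draining waitpid with WNOHANG), not Calico's actual entrypoint code:

```python
import os
import signal
import time

def reap_children(signum, frame):
    """SIGCHLD handler: collect every exited child so none stay defunct."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children at all
        if pid == 0:
            return  # remaining children are still running

if __name__ == "__main__":
    signal.signal(signal.SIGCHLD, reap_children)
    # Demo: fork a child that exits immediately. The handler reaps it,
    # so it never lingers in the Z (defunct) state.
    child = os.fork()
    if child == 0:
        os._exit(0)
    time.sleep(0.5)
```

A real container image would typically delegate this to a small init such as tini rather than hand-rolling it, but the underlying mechanism is the same.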
Our partner has stated that the reproducer isn't successful on their end. However, the issue persists on customer clusters. Maybe we can take a look at their cluster as part of testing.
Since we don't have a reproducer, marking this VERIFIED based on the tests run in the customer environment (comment #24).
Updated 4.11 PR merged.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069