Bug 2078634 - CRI-O not killing Calico CNI stalled (zombie) processes.
Summary: CRI-O not killing Calico CNI stalled (zombie) processes.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6.z
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 2095405
 
Reported: 2022-04-25 21:04 UTC by Akash Dubey
Modified: 2022-08-10 11:08 UTC (History)
6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:08:16 UTC
Target Upstream Version:
Embargoed:


Attachments
ps -aux - before and after deleting pod - This file shows the duplicate "bird" and "calico-node" processes before and after deleting the calico-node pod (46.06 KB, text/plain)
2022-04-25 21:04 UTC, Akash Dubey
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github cri-o cri-o pull 5895 0 None Merged oci: kill children of container if it is in the host pid namespace 2022-06-14 13:11:46 UTC
Github cri-o cri-o pull 5943 0 None Merged [1.24] oci: kill children of container if it is in the host pid namespace 2022-06-14 13:11:46 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:08:32 UTC

Description Akash Dubey 2022-04-25 21:04:38 UTC
Created attachment 1874951 [details]
ps -aux - before and after deleting pod - This file shows the duplicate "bird" and "calico-node" processes before and after deleting the calico-node pod

Description of problem:
Duplicate defunct Calico processes aren't being removed by CRI-O.


Version-Release number of selected component (if applicable):
Calico CNI v3.20 on OCP 4.6.26

How reproducible:
The customer is hitting this issue on their cluster running the version specified above.

Steps to Reproduce:
1.
2.
3.

Actual results:
The customer sees a large number of such leftover processes (bird/bird6), for example:
sh-4.4# ps -aux | grep bird
root 52098 0.0 0.0 10028 1432 ? Sl Mar23 4:21 bird6 -R -s /var/run/calico/bird6.ctl -d -c /etc/calico/confd/config/bird6.cfg
root 52099 0.1 0.0 11220 2412 ? Sl Mar23 14:41 bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg

Expected results:
There shouldn't be defunct processes.

Additional info:
As a workaround, the customer can clear the defunct processes by manually deleting the pods, but it is not clear how long that fix lasts.

We stated that all processes within a container are to be managed by that container's own code, and that the writer of the container image is responsible for reaping zombie processes. OCP manages only the processes that are part of the container, according to the pod's restartPolicy, and deletes the container and its associated processes when the container exits or is manually deleted.
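For illustration only, here is a minimal sketch (my own assumption, not Calico's actual entrypoint; the /usr/bin/calico-node path is just a placeholder) of what "PID 1 reaps its children" means in practice, written as a tiny Go init process:

package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	// Get notified whenever any child process changes state.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGCHLD)

	// Start the real workload (placeholder path, for illustration only).
	cmd := exec.Command("/usr/bin/calico-node")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		os.Exit(1)
	}
	workload := cmd.Process.Pid

	for range sigs {
		// Reap every exited child, not just the direct workload,
		// so no zombie entries accumulate in the process table.
		for {
			var status syscall.WaitStatus
			pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
			if pid <= 0 || err != nil {
				break
			}
			if pid == workload {
				os.Exit(status.ExitStatus())
			}
		}
	}
}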

Since Calico is our partner, they have opened a TSAConnect ticket with us to collaborate and have an open discussion on the issue.

Aadhil (the Tigera Calico contact) showed on the call that defunct processes that should be inside the container were actually visible outside of it. They are probably spawned by the health check command, but we (RH engineering) need to confirm this.

I am also attaching the ps outputs that Aadhil showed us on the call. I request someone from engineering to join the call to get a better understanding of the issue.

Looking forward to hearing from you.

Regards
Akash

Comment 4 Peter Hunt 2022-04-29 20:32:37 UTC
I've given this some thought, and this does read as a bug. Generally, I recommend folks design their containers such that PID 1 in the container will reap any children it creates. However, there are situations where PID 1 can't do that (OOM kill is one that comes immediately to mind) and we shouldn't leak processes in these cases either.

I need to put together a reproducer to properly fix this (any help with that from partners or customers would be greatly appreciated), but I have a suspicion about how to do it.
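To make the suspicion concrete, here is a rough sketch of the general approach (an illustration under my assumptions only; the killChildren helper and the /proc scan are mine, not the eventual cri-o patch): when a container shares the host PID namespace, its children are visible in the host's /proc, so the runtime can find and signal every process whose parent is the container's PID before tearing the container down.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

// killChildren sends sig to every process whose parent is parentPID.
func killChildren(parentPID int, sig syscall.Signal) error {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return err
	}
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a process directory
		}
		data, err := os.ReadFile(filepath.Join("/proc", e.Name(), "status"))
		if err != nil {
			continue // process may have already exited
		}
		for _, line := range strings.Split(string(data), "\n") {
			if strings.HasPrefix(line, "PPid:") {
				ppid, _ := strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, "PPid:")))
				if ppid == parentPID {
					_ = syscall.Kill(pid, sig)
				}
				break
			}
		}
	}
	return nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: killchildren <pid>")
		os.Exit(1)
	}
	pid, _ := strconv.Atoi(os.Args[1])
	if err := killChildren(pid, syscall.SIGKILL); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}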

Comment 7 Akash Dubey 2022-05-04 13:24:59 UTC
Our partner has stated that they cannot reproduce the issue on their end; however, it persists on customer clusters. Perhaps we can take a look at their cluster as part of testing.

Comment 28 Sunil Choudhary 2022-06-07 09:58:42 UTC
Since we don't have a reproducer, marking this verified based on tests in the customer environment (see comment #24).

Comment 36 Sunil Choudhary 2022-06-15 11:08:37 UTC
Updated 4.11 PR merged.

Comment 38 errata-xmlrpc 2022-08-10 11:08:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

