Bug 2078634

Summary: CRI-O not killing Calico CNI stalled (zombie) processes.
Product: OpenShift Container Platform
Reporter: Akash Dubey <adubey>
Component: Node
Assignee: Peter Hunt <pehunt>
Node sub component: CRI-O
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: aos-bugs, mmarkand, nagrawal, nclear, openshift-bugs-escalate, pehunt
Version: 4.6.z
Target Milestone: ---
Target Release: 4.11.0
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-10 11:08:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2095405
Attachments:
ps -aux - before and after deleting pod - This file shows the duplicate "bird" and "calico-node" processes before and after deleting the calico-node pod (Flags: none)

Description Akash Dubey 2022-04-25 21:04:38 UTC
Created attachment 1874951 [details]
ps -aux - before and after deleting pod - This file shows the duplicate "bird" and "calico-node" processes before and after deleting the calico-node pod

Description of problem:
Duplicate defunct Calico processes aren't being removed by CRI-O.


Version-Release number of selected component (if applicable):
Calico CNI v3.20 on OCP 4.6.26

How reproducible:
The issue is seen on the customer's cluster running the version specified above; there is no standalone reproducer.

Steps to Reproduce:
No reliable reproduction steps are known; see comments 4 and 7.

Actual results:
The customer sees a large number of such bird/bird6 processes, for example:
sh-4.4# ps -aux | grep bird
root 52098 0.0 0.0 10028 1432 ? Sl Mar23 4:21 bird6 -R -s /var/run/calico/bird6.ctl -d -c /etc/calico/confd/config/bird6.cfg
root 52099 0.1 0.0 11220 2412 ? Sl Mar23 14:41 bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg

Expected results:
There should be no defunct (zombie) processes left on the node.

Additional info:
As a workaround, the customer can clear the issue by manually deleting the affected pods, but it is not clear how long that fix lasts.

We stated that all processes within a container are to be managed by that container's own code, and the writer of the container image is responsible for reaping zombie processes. OCP manages only the processes that are part of the container, according to the pod's restartPolicy, and deletes the container and its associated processes when the container exits or is manually deleted.
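
For illustration only, here is a minimal sketch in Go (not the Calico or CRI-O code; "/usr/bin/real-workload" is a placeholder) of how an image's PID 1 can reap children that get reparented to it:

package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	// Catch SIGCHLD so we know when any child (including reparented ones) exits.
	sigc := make(chan os.Signal, 1)
	signal.Notify(sigc, syscall.SIGCHLD)

	// Placeholder for the image's real main process (an assumption of this sketch).
	cmd := exec.Command("/usr/bin/real-workload")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		os.Exit(1)
	}

	for range sigc {
		for {
			// Reap every exited child so none linger in the defunct (Z) state.
			var ws syscall.WaitStatus
			pid, err := syscall.Wait4(-1, &ws, syscall.WNOHANG, nil)
			if pid <= 0 || err != nil {
				break
			}
			if pid == cmd.Process.Pid {
				// The main workload is gone; propagate its exit status.
				os.Exit(ws.ExitStatus())
			}
		}
	}
}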

Since Calico is our partner, they have opened a TSAConnect ticket with us to collaborate and have an open discussion on the issue.

On the call, Aadhil (the Tigera/Calico contact) showed that the defunct processes, which should exist only within the container, were actually visible outside of it. They are probably caused by the health check command, but we (RH engineering) need to confirm this.
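
As a purely hypothetical illustration of the mechanism (not the actual Calico health check), a command that starts a child and never waits on it leaves that child showing as <defunct> for as long as the parent lives; if the parent then exits, the zombie is reparented to whatever acts as init and must be reaped there:

package main

import (
	"os/exec"
	"time"
)

func main() {
	// Start a short-lived child...
	cmd := exec.Command("true")
	_ = cmd.Start()

	// ...and deliberately never call cmd.Wait(): the exited child stays in
	// the defunct (Z) state, visible in ps, for as long as this parent lives.
	time.Sleep(10 * time.Minute)
}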

I am also attaching the ps outputs that Aadhil showed us on the call. I would request someone from engineering to join the call to get a better understanding of the issue.

Looking forward to hearing from you.

Regards
Akash

Comment 4 Peter Hunt 2022-04-29 20:32:37 UTC
I've given this some thought, and this does read as a bug. Generally, I recommend folks design their containers such that PID 1 in the container will reap any children it creates. However, there are situations where PID 1 can't do that (OOM kill is one that comes immediately to mind) and we shouldn't leak processes in these cases either.

I need to put together a reproducer to properly fix this (any help with that from partners or customers would be greatly appreciated), but I have a suspicion about how to do it.
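
For context, one generic cleanup technique is to signal whatever PIDs are still listed in the exited container's cgroup. The Go sketch below is only an illustration of that technique, not necessarily the change that was actually merged for this bug, and the cgroup path is a placeholder:

package main

import (
	"bufio"
	"errors"
	"fmt"
	"os"
	"strconv"
	"syscall"
)

// killLeftovers signals every PID still listed in the given cgroup directory,
// so that no container processes are leaked after the container has exited.
func killLeftovers(cgroupDir string) error {
	f, err := os.Open(cgroupDir + "/cgroup.procs") // one PID per line
	if err != nil {
		return err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		pid, convErr := strconv.Atoi(sc.Text())
		if convErr != nil {
			continue
		}
		// ESRCH just means the process is already gone; anything else is reported.
		if kerr := syscall.Kill(pid, syscall.SIGKILL); kerr != nil && !errors.Is(kerr, syscall.ESRCH) {
			fmt.Fprintf(os.Stderr, "kill %d: %v\n", pid, kerr)
		}
	}
	return sc.Err()
}

func main() {
	// Placeholder path; real pod/container cgroup layouts differ between setups.
	_ = killLeftovers("/sys/fs/cgroup/kubepods.slice/example-container")
}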

Comment 7 Akash Dubey 2022-05-04 13:24:59 UTC
Our partner has stated that they could not reproduce the issue on their end. However, the issue persists on customer clusters. Perhaps we can take a look at a customer's cluster as part of testing.

Comment 28 Sunil Choudhary 2022-06-07 09:58:42 UTC
Since we don't have a reproducer, marking this verified based on the tests run in the customer environment in comment #24.

Comment 36 Sunil Choudhary 2022-06-15 11:08:37 UTC
Updated 4.11 PR merged.

Comment 38 errata-xmlrpc 2022-08-10 11:08:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069