Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2019346

Summary:

Zombie process accumulation and "Argument list too long" errors

Product:

OpenShift Container Platform

Reporter:

Michal Minar <miminar>

Component:

Node

Assignee:

Peter Hunt <pehunt>

Node sub component:

CRI-O

QA Contact:

MinLi <minmli>

Status:

CLOSED ERRATA

Docs Contact:

Severity:

high

Priority:

high

CC:

aos-bugs, minmli, nagrawal, openshift-bugs-escalate, palshure, pehunt, srengan

Version:

4.8

Keywords:

Reopened

Target Milestone:

---

Target Release:

4.10.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

No Doc Update

Doc Text:

Story Points:

---

Clone Of:

Clones:

2032466

Environment:

Last Closed:

2022-03-10 16:24:30 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

2032466

Attachments:

Description	Flags
zombie producers in grafana	none
grafana multus zombie spikes	none

Description Michal Minar 2021-11-02 10:18:53 UTC

Description of problem:
  After some uptime, containers on a node fail to start with an "Argument list too long" error, and zombie processes start to accumulate at a high rate. The time until the issue occurs varies: the first time it happened after 17 days of uptime, the second time after 10 days.

Version-Release number of selected component (if applicable):
  4.8.13

How reproducible:
  Occurs some time after 10 days of uptime on an OCS node

Steps to Reproduce:
1. Install OCP 4.8 on-premise.
2. Dedicate 3 nodes to the OCS cluster.
3. Deploy OCS 4.8.
4. Consume an S3 bucket and generate some traffic.
5. Monitor the nodes (especially zombie processes).
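
For step 5, one possible way to watch the zombie count on a node is to scan /proc directly. The following is only a minimal sketch, not the monitoring setup used in this report; the function names (parse_stat, iter_processes, count_zombies) are hypothetical, and it assumes a Linux /proc layout as described in proc(5), where the third field of /proc/<pid>/stat is the process state and "Z" marks a zombie.

```python
import os

def parse_stat(stat_text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = stat_text.index("(")
    rparen = stat_text.rindex(")")
    pid = int(stat_text[:lparen].strip())
    comm = stat_text[lparen + 1:rparen]
    fields = stat_text[rparen + 1:].split()
    return pid, comm, fields[0], int(fields[1])

def iter_processes(proc_root="/proc"):
    """Yield (pid, comm, state, ppid) for every readable process."""
    entries = os.listdir(proc_root) if os.path.isdir(proc_root) else []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as f:
                yield parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it

def count_zombies(processes):
    """Count processes whose state is 'Z' (zombie)."""
    return sum(1 for _pid, _comm, state, _ppid in processes if state == "Z")

if __name__ == "__main__":
    print(count_zombies(iter_processes()))
```

Running this periodically and comparing consecutive counts gives the accumulation rate per minute.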

Actual results:
  After some time, zombies start to accumulate at a rate of roughly 10 per minute

Expected results:
  The node is stable with a reasonable number of zombie processes (<200) and allows new containers to run as long as there are enough resources

Additional info:
  Pretty much the same as https://bugzilla.redhat.com/show_bug.cgi?id=1994444 only with OCP 4.8
  Running systemctl daemon-reload resolves the "Argument list too long" error, and newly created containers start to run
  node remains responsive (ssh to node is relatively fast)

Comment 2 Peter Hunt 2021-11-02 17:15:47 UTC

Based on the note about the zombies, it looks like this may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2003199. Can you try 4.8.16 to see if it has the same troubles?

*** This bug has been marked as a duplicate of bug 2003199 ***

Comment 4 Michal Minar 2021-11-29 15:06:36 UTC

Created attachment 1844029
zombie producers in grafana

Time window shows the start of zombie accumulation on ocs-beworker1

Comment 5 Peter Hunt 2021-11-30 14:43:43 UTC

sorry if this has already been answered, but do you have more information about what the zombie processes are?

Comment 6 Michal Minar 2021-11-30 15:50:46 UTC

The latest stats* show only 2 significant contributors on ocs-beworker1:
- conmon (parent cri-o) 99k
- conmon (parent multus) 71k

* taken just before restarting the nodes
Attachment 1844029 shows only the former, 30 minutes after the issue started.
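
A breakdown like the one above (zombie name grouped by parent) can be reproduced roughly as follows. This is a sketch under assumptions, not the query behind the Grafana panel: read_proc_table and zombie_producers are hypothetical names, and the process names that show up depend on what the kernel reports in /proc/<pid>/stat on the node.

```python
import os
from collections import Counter

def read_proc_table(proc_root="/proc"):
    """Return {pid: (comm, state, ppid)} built from /proc/<pid>/stat."""
    table = {}
    entries = os.listdir(proc_root) if os.path.isdir(proc_root) else []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as f:
                text = f.read()
            # comm sits between the first '(' and the last ')'
            rparen = text.rindex(")")
            comm = text[text.index("(") + 1:rparen]
            rest = text[rparen + 1:].split()
            table[int(entry)] = (comm, rest[0], int(rest[1]))
        except (OSError, ValueError, IndexError):
            continue  # process vanished or stat was unreadable
    return table

def zombie_producers(table):
    """Count zombies grouped by (zombie comm, parent comm)."""
    counts = Counter()
    for comm, state, ppid in table.values():
        if state != "Z":
            continue
        parent = table.get(ppid)
        counts[(comm, parent[0] if parent else "?")] += 1
    return counts

if __name__ == "__main__":
    for (comm, parent), n in zombie_producers(read_proc_table()).most_common(10):
        print(f"{n:8d}  {comm} (parent {parent})")
```

Grouping by parent matters here because a zombie only goes away once its parent reaps it, so the parent name points at the process that is failing to wait on its children.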

Comment 7 Peter Hunt 2021-11-30 15:58:50 UTC

wait just to verify, conmon is a child of *multus*? that is unexpected.

Comment 8 Michal Minar 2021-11-30 16:55:03 UTC

Created attachment 1844219
grafana multus zombie spikes

Yes, that's what I see. Also interesting are the bursts of those multus child zombies, which occur at ~12-hour intervals, as this attachment shows.
Is there something else I can collect that would reveal more (e.g. full cmdline)?
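
On collecting the full cmdline: per proc(5), /proc/<pid>/cmdline is empty for a zombie, so the zombie's own command line is no longer available, but the parent's is. A minimal sketch of reading it follows; split_cmdline and read_cmdline are hypothetical helper names, and the pid passed in would be the zombie's ppid.

```python
import os

def split_cmdline(raw):
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into arguments."""
    if raw.endswith(b"\0"):
        raw = raw[:-1]  # drop the terminating NUL after the last argument
    if not raw:
        return []  # empty file: zombie or kernel thread
    return [arg.decode(errors="replace") for arg in raw.split(b"\0")]

def read_cmdline(pid, proc_root="/proc"):
    """Return the argument list of a process, or [] if it cannot be read."""
    try:
        with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as f:
            return split_cmdline(f.read())
    except OSError:
        return []

if __name__ == "__main__":
    # Example: print the command line of this script's own parent.
    print(read_cmdline(os.getppid()))
```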

Comment 26 MinLi 2022-01-25 09:15:22 UTC

Deployed OpenShift Data Foundation on an AWS cluster using the template private-templates/functionality-testing/aos-4_10/ipi-on-aws/versioned-installer with vm_type: 'm5.4xlarge'.
After 48 hours, there was no zombie process accumulation; setting to verified.


$ oc get clusterversion 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-22-102609   True        False         26h     Cluster version is 4.10.0-0.nightly-2022-01-22-102609

Comment 32 errata-xmlrpc 2022-03-10 16:24:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 33 Red Hat Bugzilla 2023-09-15 01:49:44 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days