Bug 2019346 - zombie processes accumulation and Argument list too long
Summary: zombie processes accumulation and Argument list too long
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Peter Hunt
QA Contact: MinLi
URL:
Whiteboard:
Depends On:
Blocks: 2032466
Reported: 2021-11-02 10:18 UTC by Michal Minar
Modified: 2023-09-15 01:49 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2032466
Environment:
Last Closed: 2022-03-10 16:24:30 UTC
Target Upstream Version:
Embargoed:


Attachments
zombie producers in grafana (159.70 KB, image/png)
2021-11-29 15:06 UTC, Michal Minar
grafana multus zombie spikes (78.93 KB, image/png)
2021-11-30 16:55 UTC, Michal Minar


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | cri-o cri-o pull 5500 | 0 | None | Merged | oci: always reap conmon zombies | 2021-12-14 14:20:21 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:24:44 UTC

Description Michal Minar 2021-11-02 10:18:53 UTC
Description of problem:
  After a period of uptime, containers on a node fail to start with an "Argument list too long" error, and zombie processes start to accumulate at a high rate. The time until the issue occurs varies: the first time it happened after 17 days of uptime, the second time after 10 days.

Version-Release number of selected component (if applicable):
  4.8.13

How reproducible:
  Some time after 10 days of uptime on an OCS node

Steps to Reproduce:
1. Install OCP 4.8 on-premise.
2. Dedicate 3 nodes to the OCS cluster.
3. Deploy OCS 4.8.
4. Consume an S3 bucket and generate some traffic.
5. Monitor the nodes (especially zombie processes).
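
For step 5, one way to track the zombie count on a node is to scan /proc directly. The following is a minimal sketch, not part of the original report; the function names (parse_stat, iter_processes, count_zombies) are made up for illustration, and it assumes a Linux node where /proc/<pid>/stat is readable.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lpar = text.index("(")
    rpar = text.rindex(")")
    pid = int(text[:lpar])
    comm = text[lpar + 1:rpar]
    fields = text[rpar + 1:].split()
    return pid, comm, fields[0], int(fields[1])


def iter_processes(proc_root="/proc"):
    """Yield (pid, comm, state, ppid) for every process visible under proc_root."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as f:
                yield parse_stat(f.read())
        except OSError:
            # The process exited (or became unreadable) while we were scanning.
            continue


def count_zombies(processes):
    """Count entries whose state field is 'Z' (zombie)."""
    return sum(1 for _pid, _comm, state, _ppid in processes if state == "Z")


# Example use on a node (hypothetical):
#   print(count_zombies(iter_processes()))
```

Sampling this count once a minute would be enough to notice the growth rate described under "Actual results".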

Actual results:
  After some time, zombies start to accumulate at a rate of roughly 10 per minute

Expected results:
  The node is stable with a reasonable number of zombie processes (<200) and can run new containers as long as there are enough resources

Additional info:
  Pretty much the same as https://bugzilla.redhat.com/show_bug.cgi?id=1994444 only with OCP 4.8
  systemctl daemon-reload resolves the "Argument list too long" error, and newly created containers start to run
  The node remains responsive (ssh to the node is relatively fast)

Comment 2 Peter Hunt 2021-11-02 17:15:47 UTC
Based on the note about the zombies, it looks like this may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2003199. Can you try 4.8.16 to see whether it has the same problem?

*** This bug has been marked as a duplicate of bug 2003199 ***

Comment 4 Michal Minar 2021-11-29 15:06:36 UTC
Created attachment 1844029 [details]
zombie producers in grafana

Time window shows the start of zombie accumulation on ocs-beworker1

Comment 5 Peter Hunt 2021-11-30 14:43:43 UTC
Sorry if this has already been answered, but do you have more information about what the zombie processes are?

Comment 6 Michal Minar 2021-11-30 15:50:46 UTC
The latest stats* show only 2 significant contributors on ocs-beworker1:
- conmon (parent cri-o) 99k
- conmon (parent multus) 71k

* taken just before restarting the nodes
Attachment 1844029 [details] shows only the former, 30 minutes after the issue started.
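
A breakdown like the one above (zombies grouped by their own name and their parent's name) can be produced from ps output. The following is a rough sketch and not the method actually used for these stats; the function name zombies_by_parent and the sample data are made up for illustration. It assumes procps ps and the output of `ps -eo pid=,ppid=,stat=,comm=`.

```python
from collections import Counter


def zombies_by_parent(ps_output):
    """Count zombies per (zombie name, parent name) from ps text output.

    Expects one process per line with the columns pid, ppid, stat, comm.
    """
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, ppid, stat, comm = parts
        comm = comm.strip()
        # procps marks zombies by appending " <defunct>" to the name.
        if comm.endswith(" <defunct>"):
            comm = comm[: -len(" <defunct>")]
        rows.append((int(pid), int(ppid), stat, comm))

    names = {pid: comm for pid, _ppid, _stat, comm in rows}
    counts = Counter()
    for _pid, ppid, stat, comm in rows:
        if stat.startswith("Z"):
            # "?" stands in for a parent that is missing from the listing.
            counts[(comm, names.get(ppid, "?"))] += 1
    return counts


# Made-up sample, only to show the expected input shape.
SAMPLE = """\
    1     0 Ss   systemd
  100     1 Ssl  crio
  200     1 Sl   multus
  101   100 Z    conmon <defunct>
  102   100 Z    conmon <defunct>
  201   200 Z    conmon <defunct>
"""
```

On a node, the input could be captured with subprocess.run(["ps", "-eo", "pid=,ppid=,stat=,comm="], capture_output=True, text=True).stdout and passed to zombies_by_parent; calling most_common() on the result lists the largest contributors first.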

Comment 7 Peter Hunt 2021-11-30 15:58:50 UTC
Wait, just to verify: conmon is a child of *multus*? That is unexpected.

Comment 8 Michal Minar 2021-11-30 16:55:03 UTC
Created attachment 1844219 [details]
grafana multus zombie spikes

Yes, that's what I see. Also interesting are the bursts of multus child zombies that occur roughly every 12 hours, as this attachment shows.
Is there something else I can collect that would reveal more (e.g. full cmdline)?

Comment 26 MinLi 2022-01-25 09:15:22 UTC
Deployed OpenShift Data Foundation on an AWS cluster using template private-templates/functionality-testing/aos-4_10/ipi-on-aws/versioned-installer with vm_type: 'm5.4xlarge'.
After 48 hours, no zombie process accumulation was observed. Setting to verified.


$ oc get clusterversion 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-22-102609   True        False         26h     Cluster version is 4.10.0-0.nightly-2022-01-22-102609

Comment 32 errata-xmlrpc 2022-03-10 16:24:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 33 Red Hat Bugzilla 2023-09-15 01:49:44 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

