Bug 2019346 - zombie processes accumulation and Argument list too long [NEEDINFO]
Summary: zombie processes accumulation and Argument list too long
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Peter Hunt
QA Contact: MinLi
URL:
Whiteboard:
Depends On:
Blocks: 2032466
 
Reported: 2021-11-02 10:18 UTC by Michal Minar
Modified: 2022-07-26 06:38 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2032466
Environment:
Last Closed: 2022-03-10 16:24:30 UTC
Target Upstream Version:
Embargoed:
pehunt: needinfo? (srengan)


Attachments
zombie producers in grafana (159.70 KB, image/png), 2021-11-29 15:06 UTC, Michal Minar
grafana multus zombie spikes (78.93 KB, image/png), 2021-11-30 16:55 UTC, Michal Minar


Links
System                  ID                     Private  Priority  Status  Summary                          Last Updated
Github                  cri-o cri-o pull 5500  0        None      Merged  oci: always reap conmon zombies  2021-12-14 14:20:21 UTC
Red Hat Product Errata  RHSA-2022:0056         0        None      None    None                             2022-03-10 16:24:44 UTC

Description Michal Minar 2021-11-02 10:18:53 UTC
Description of problem:
  After a node has been up for some time, containers on it fail to start with an "Argument list too long" error and zombie processes begin to accumulate at a high rate. The time to onset varies: the first occurrence came after 17 days of uptime, the second after 10 days.

Version-Release number of selected component (if applicable):
  4.8.13

How reproducible:
  Occurs some time after about 10 days of uptime on an OCS node

Steps to Reproduce:
1. Install OCP 4.8 on-premise.
2. Dedicate 3 nodes to the OCS cluster.
3. Deploy OCS 4.8.
4. Consume an S3 bucket and generate some traffic.
5. Monitor the nodes (especially the number of zombie processes).
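
For step 5, one way to watch the zombie count is to scan /proc directly. The following is a minimal sketch, not the reporter's actual monitoring setup (the attachments suggest Grafana was used). It assumes a Linux node with Python 3 available; the function names parse_stat, count_zombies and log_zombie_rate are made up for illustration.

```python
import os
import time


def parse_stat(text):
    """Split one /proc/<pid>/stat line into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate it via the first '(' and the last ')'.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen])
    comm = text[lparen + 1:rparen]
    fields = text[rparen + 1:].split()
    return pid, comm, fields[0], int(fields[1])


def count_zombies(proc_root="/proc"):
    """Count processes whose state field is 'Z' (zombie)."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                _, _, state, _ = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process vanished or the line was unreadable
        if state == "Z":
            count += 1
    return count


def log_zombie_rate(samples=5, interval=60.0):
    """Print the zombie count and its change since the previous sample."""
    previous = None
    for _ in range(samples):
        current = count_zombies()
        delta = "" if previous is None else f" ({current - previous:+d})"
        print(f"{time.strftime('%H:%M:%S')} zombies={current}{delta}")
        previous = current
        time.sleep(interval)
```

With the default 60-second interval, a node in the failing state described in this report would be expected to show the count growing by roughly 10 between samples.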

Actual results:
  After some time, zombies start to accumulate at a rate of roughly 10 per minute

Expected results:
  The node is stable with a reasonable number of zombie processes (<200) and can run new containers as long as there are enough resources

Additional info:
  Pretty much the same as https://bugzilla.redhat.com/show_bug.cgi?id=1994444, only with OCP 4.8
  systemctl daemon-reload resolves the "Argument list too long" error, and newly created containers start to run
  The node remains responsive (ssh to the node is relatively fast)

Comment 2 Peter Hunt 2021-11-02 17:15:47 UTC
Based on the note about the zombies, it looks like this may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2003199. Can you try 4.8.16 to see whether it has the same trouble?

*** This bug has been marked as a duplicate of bug 2003199 ***

Comment 4 Michal Minar 2021-11-29 15:06:36 UTC
Created attachment 1844029 [details]
zombie producers in grafana

The time window shows the start of zombie accumulation on ocs-beworker1.

Comment 5 Peter Hunt 2021-11-30 14:43:43 UTC
Sorry if this has already been answered, but do you have more information about what the zombie processes are?

Comment 6 Michal Minar 2021-11-30 15:50:46 UTC
The latest stats* show only 2 significant contributors on ocs-beworker1:
- conmon (parent cri-o) 99k
- conmon (parent multus) 71k

* taken just before restarting the nodes
Attachment 1844029 [details] shows only the former, 30 minutes after the issue started.
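
A per-parent breakdown like the one above can be obtained by grouping zombie entries in /proc by their own command name and their parent's command name. This is a minimal sketch under assumptions, not the tool used to produce the numbers in this comment: it assumes Linux and Python 3, the helper names read_processes and zombies_by_parent are hypothetical, and a parent that cannot be found is reported as "?".

```python
import os
from collections import Counter


def read_processes(proc_root="/proc"):
    """Return {pid: (comm, state, ppid)} for every readable process."""
    procs = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                text = f.read()
            # comm sits between the first '(' and the last ')'.
            rparen = text.rindex(")")
            comm = text[text.index("(") + 1:rparen]
            state, ppid = text[rparen + 1:].split()[:2]
            procs[int(entry)] = (comm, state, int(ppid))
        except (OSError, ValueError):
            continue  # the process exited while it was being read
    return procs


def zombies_by_parent(procs):
    """Count zombies keyed by (zombie comm, parent comm)."""
    counts = Counter()
    for comm, state, ppid in procs.values():
        if state == "Z":
            parent = procs.get(ppid, ("?",))[0]
            counts[(comm, parent)] += 1
    return counts


def print_top_zombie_producers(limit=10):
    """Print the largest (zombie, parent) groups, biggest first."""
    counts = zombies_by_parent(read_processes())
    for (comm, parent), n in counts.most_common(limit):
        print(f"{n:8d}  {comm} (parent {parent})")
```

Because both names come from the comm field of /proc/<pid>/stat, they are the kernel's short process names, which may differ from the component names used in this report.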

Comment 7 Peter Hunt 2021-11-30 15:58:50 UTC
Wait, just to verify: conmon is a child of *multus*? That is unexpected.

Comment 8 Michal Minar 2021-11-30 16:55:03 UTC
Created attachment 1844219 [details]
grafana multus zombie spikes

Yes, that's what I see. Also interesting are the bursts of multus child zombies that recur roughly every 12 hours, as this attachment shows.
Is there something else I can collect that would reveal more (e.g. full cmdline)?

Comment 26 MinLi 2022-01-25 09:15:22 UTC
Deployed OpenShift Data Foundation on an AWS cluster using the template private-templates/functionality-testing/aos-4_10/ipi-on-aws/versioned-installer with vm_type: 'm5.4xlarge'.
After 48 hours there is no zombie process accumulation, so setting this to verified!


$ oc get clusterversion 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-22-102609   True        False         26h     Cluster version is 4.10.0-0.nightly-2022-01-22-102609

Comment 32 errata-xmlrpc 2022-03-10 16:24:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

