Bug 2019346

Summary: zombie processes accumulation and Argument list too long
Product: OpenShift Container Platform Reporter: Michal Minar <miminar>
Component: Node Assignee: Peter Hunt <pehunt>
Node sub component: CRI-O QA Contact: MinLi <minmli>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: aos-bugs, minmli, nagrawal, openshift-bugs-escalate, palshure, pehunt, srengan
Version: 4.8 Keywords: Reopened
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 2032466 Environment:
Last Closed: 2022-03-10 16:24:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Bug Depends On:    
Bug Blocks: 2032466    
Attachments:
Description Flags
zombie producers in grafana none
grafana multus zombie spikes none

Description Michal Minar 2021-11-02 10:18:53 UTC
Description of problem:
  After some uptime, containers on a node fail to start with an "Argument list too long" error, and zombie processes start to accumulate at a high rate. The time until the issue occurs varies: the first time it happened after 17 days of uptime, the second time after 10 days.

Version-Release number of selected component (if applicable):
  4.8.13

How reproducible:
  Some time after 10 days of uptime on an OCS node

Steps to Reproduce:
1. Install OCP 4.8 on-premises.
2. Dedicate 3 nodes to the OCS cluster.
3. Deploy OCS 4.8.
4. Consume an S3 bucket and generate some traffic.
5. Monitor the nodes (especially zombie processes).
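
One way to watch the zombie count in step 5 is to scan /proc directly. The following is a minimal sketch, not the tooling used in this report: it relies only on the documented layout of /proc/<pid>/stat, where the third field is the process state and 'Z' marks a zombie. The names parse_stat, count_zombies and proc_root are hypothetical.

```python
import os
import re

# /proc/<pid>/stat starts with "pid (comm) state ppid ...". The command name
# may itself contain spaces or parentheses, so the greedy group below runs up
# to the last ")" that is followed by a state character and a parent PID.
_STAT_RE = re.compile(r"^(\d+) \((.*)\) (\S) (\d+)", re.DOTALL)


def parse_stat(text):
    """Return (pid, comm, state, ppid) parsed from /proc/<pid>/stat contents."""
    match = _STAT_RE.match(text)
    if match is None:
        raise ValueError("unrecognised stat contents: %r" % text[:40])
    pid, comm, state, ppid = match.groups()
    return int(pid), comm, state, int(ppid)


def count_zombies(proc_root="/proc"):
    """Count processes under proc_root whose state is 'Z' (zombie)."""
    zombies = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                _, _, state, _ = parse_stat(handle.read())
        except (OSError, ValueError):
            continue  # the process exited mid-scan or stat was unreadable
        if state == "Z":
            zombies += 1
    return zombies
```

Calling count_zombies() on the node (for example from a debug shell) returns the current number of zombies; sampling it periodically gives the kind of curve shown in the Grafana attachments.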

Actual results:
  After some time, zombies start to accumulate at a rate of roughly 10 per minute.
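
A rate like this can be estimated from two or more timestamped zombie counts. The sketch below assumes samples are (unix_seconds, zombie_count) pairs in chronological order; the function name and the sample values in the usage note are hypothetical and are not measurements from this bug.

```python
def zombies_per_minute(samples):
    """Average zombie growth per minute between the first and last sample.

    samples is a sequence of (unix_seconds, zombie_count) pairs in
    chronological order (an assumed input format).
    """
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (c1 - c0) * 60.0 / (t1 - t0)
```

For example, zombies_per_minute([(0, 150), (600, 250)]) gives 10.0, that is, 100 new zombies over 10 minutes.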

Expected results:
  The node is stable with a reasonable number of zombie processes (<200) and allows new containers to run as long as there are enough resources.

Additional info:
  Pretty much the same as https://bugzilla.redhat.com/show_bug.cgi?id=1994444, only with OCP 4.8.
  Running systemctl daemon-reload resolves the "Argument list too long" error, and newly created containers start to run.
  The node remains responsive (ssh to the node is relatively fast).

Comment 2 Peter Hunt 2021-11-02 17:15:47 UTC
Based on the note about the zombies, it looks like this may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2003199. Can you try 4.8.16 to see if it has the same troubles?

*** This bug has been marked as a duplicate of bug 2003199 ***

Comment 4 Michal Minar 2021-11-29 15:06:36 UTC
Created attachment 1844029
zombie producers in grafana

The time window shows the start of zombie accumulation on ocs-beworker1.

Comment 5 Peter Hunt 2021-11-30 14:43:43 UTC
Sorry if this has already been answered, but do you have more information about what the zombie processes are?

Comment 6 Michal Minar 2021-11-30 15:50:46 UTC
The latest stats* show only 2 significant contributors on ocs-beworker1:
- conmon (parent cri-o) 99k
- conmon (parent multus) 71k

* taken just before restarting the nodes
Attachment 1844029 shows just the former 30 minutes after the issue started.
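
A breakdown by command and parent, like the one above, can be approximated from ps output. This is a minimal sketch and not necessarily how the stats in this comment were collected: it assumes procps ps, invoked with separate -o options so that the header-less column syntax is unambiguous, and it assumes that ps appends " <defunct>" to the command name of a zombie. The names read_ps and zombie_producers are hypothetical.

```python
import subprocess
from collections import Counter

# One header-less column per -o option: pid, parent pid, state, command name.
PS_COMMAND = ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "stat=", "-o", "comm="]
DEFUNCT_SUFFIX = " <defunct>"


def read_ps():
    """Run ps and return its raw text output."""
    result = subprocess.run(PS_COMMAND, check=True, capture_output=True, text=True)
    return result.stdout


def zombie_producers(ps_output):
    """Count zombies per (command, parent command) pair from ps output text."""
    rows = {}
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # blank or truncated line
        pid, ppid, stat, comm = fields
        comm = comm.strip()
        if comm.endswith(DEFUNCT_SUFFIX):
            comm = comm[: -len(DEFUNCT_SUFFIX)]
        rows[int(pid)] = (int(ppid), stat, comm)

    counts = Counter()
    for ppid, stat, comm in rows.values():
        if not stat.startswith("Z"):
            continue
        parent = rows.get(ppid)
        # "?" marks a parent that was not present in the same ps snapshot.
        counts[(comm, parent[2] if parent else "?")] += 1
    return counts
```

On a node, zombie_producers(read_ps()).most_common(5) lists the five largest (command, parent command) groups.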

Comment 7 Peter Hunt 2021-11-30 15:58:50 UTC
Wait, just to verify: conmon is a child of *multus*? That is unexpected.

Comment 8 Michal Minar 2021-11-30 16:55:03 UTC
Created attachment 1844219
grafana multus zombie spikes

Yes, that's what I see. Also interesting are the bursts of multus child zombies occurring at ~12-hour intervals, as this attachment shows.
Is there something else I can collect that would reveal more (e.g. the full cmdline)?

Comment 26 MinLi 2022-01-25 09:15:22 UTC
Deployed OpenShift Data Foundation on an AWS cluster with template private-templates/functionality-testing/aos-4_10/ipi-on-aws/versioned-installer and vm_type: 'm5.4xlarge'.
After 48 hours, there is no zombie process accumulation; setting to verified!


$ oc get clusterversion 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-22-102609   True        False         26h     Cluster version is 4.10.0-0.nightly-2022-01-22-102609

Comment 32 errata-xmlrpc 2022-03-10 16:24:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 33 Red Hat Bugzilla 2023-09-15 01:49:44 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.