Description of problem:
After some uptime, containers on a node fail to start with an "Argument list too long" error, and zombie processes start to accumulate at a high rate. The time until the issue occurs varies: the first time it happened after 17 days of uptime, the second time after 10 days.

Version-Release number of selected component (if applicable):
4.8.13

How reproducible:
Some time after 10 days on an OCS node

Steps to Reproduce:
1. Install OCP 4.8 on-premise.
2. Dedicate 3 nodes to the OCS cluster.
3. Deploy OCS 4.8.
4. Consume an S3 bucket and generate some traffic.
5. Monitor the nodes (especially zombie processes).

Actual results:
After some time, zombies start to accumulate at a rate of roughly 10 per minute.

Expected results:
The node is stable with a reasonable number of zombie processes (<200) and allows new containers to run as long as there are enough resources.

Additional info:
Pretty much the same as https://bugzilla.redhat.com/show_bug.cgi?id=1994444, only with OCP 4.8.
systemctl daemon-reload resolves the "Argument list too long" error, and newly created containers start to run.
The node remains responsive (ssh to the node is relatively fast).
Based on the note about the zombies, it looks like this may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2003199. Can you try 4.8.16 to see if it has the same troubles?

*** This bug has been marked as a duplicate of bug 2003199 ***
Created attachment 1844029 [details]
zombie producers in grafana

The time window shows the start of zombie accumulation on ocs-beworker1.
Sorry if this has already been answered, but do you have more information about what the zombie processes are?
The latest stats* show only 2 significant contributors on ocs-beworker1:
- conmon (parent cri-o) 99k
- conmon (parent multus) 71k

* taken just before the nodes were restarted

Attachment 1844029 [details] shows just the former, 30 minutes after the issue started.
Wait, just to verify: conmon is a child of *multus*? That is unexpected.
Created attachment 1844219 [details]
grafana multus zombie spikes

Yes, that's what I see. Also interesting are the bursts of those multus child zombies occurring at ~12-hour intervals, as this attachment shows. Is there something else I can collect that would reveal more (e.g. full cmdline)?
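For anyone who wants to collect the same per-parent breakdown directly on a node, here is a minimal sketch that groups zombies by their own command name and their parent's command name by reading /proc/<pid>/stat. It is not the tooling used for the Grafana graphs above; the function names (parse_stat, read_procs, count_zombies) are made up for illustration, and it assumes a Linux /proc layout. Note that /proc/<pid>/cmdline is empty for a zombie, so the parent is usually the more informative thing to record.

```python
import os
from collections import Counter


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    fields = text[right + 1:].split()
    return pid, comm, fields[0], int(fields[1])


def read_procs(proc_root="/proc"):
    """Return {pid: (comm, state, ppid)} for every numeric entry under proc_root."""
    procs = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as f:
                pid, comm, state, ppid = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process went away or stat was unreadable; skip it
        procs[pid] = (comm, state, ppid)
    return procs


def count_zombies(proc_root="/proc"):
    """Count processes in state 'Z', keyed by (comm, parent comm)."""
    procs = read_procs(proc_root)
    counts = Counter()
    for comm, state, ppid in procs.values():
        if state == "Z":
            # "?" marks a parent that was not found in this scan.
            parent_comm = procs.get(ppid, ("?", "", 0))[0]
            counts[(comm, parent_comm)] += 1
    return counts


if __name__ == "__main__":
    for (comm, parent), n in count_zombies().most_common():
        print(f"{n:8d}  {comm} (parent {parent})")
```

Running it periodically (for example from a debug shell on the node) and comparing the counts between runs should show which parent is leaking children and at what rate.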
Deployed OpenShift Data Foundation on an AWS cluster with template private-templates/functionality-testing/aos-4_10/ipi-on-aws/versioned-installer and vm_type: 'm5.4xlarge'. After 48 hours there is no zombie process accumulation, so setting this to verified!

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-22-102609   True        False         26h     Cluster version is 4.10.0-0.nightly-2022-01-22-102609
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days