Bug 1848524 - conmon processes continue to spawn as zombies but never close
Summary: conmon processes continue to spawn as zombies but never close
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Peter Hunt
QA Contact: Weinan Liu
URL:
Whiteboard:
Depends On:
Blocks: 1852064
Reported: 2020-06-18 13:37 UTC by guy chen
Modified: 2021-06-04 08:25 UTC (History)

Fixed In Version:
Doc Type: Release Note
Doc Text:
Clone Of:
Clones: 1852064
Environment:
Last Closed: 2020-10-27 16:08:08 UTC
Target Upstream Version:
nchoudhu: needinfo? (bbreard)
mdolezel: needinfo-


Links
System ID Private Priority Status Summary Last Updated
Github containers conmon pull 183 0 None closed add options to handle children differently 2021-01-27 09:36:20 UTC
Github cri-o cri-o pull 3908 0 None closed keep conmons created by execsyncs as direct children 2021-01-27 09:36:20 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:08:29 UTC

Description guy chen 2020-06-18 13:37:07 UTC
Description of problem:
process leak occurs - conmon processes continue to spawn as zombies but never close

Version-Release number of selected component (if applicable):
4.5

How reproducible:
Always

Steps to Reproduce:
I ran a scale-up test on an r730xd server (40 cores and 128G RAM).
Started 90 VMs on the node.
When examining the CNV node, I saw the following issue:
About 23K zombie processes (defunct processes with no parent).
On closer look, I saw that all the zombies are conmon processes that monitor the pods:
[core@f25-h03-000-r730xd ~]$ ps -ef | grep defunct | grep -c conmon
23158
The number of zombie processes increases at a rate of ~30K per hour and they are not being cleaned up.
We have around 5 VMs that are stuck in scheduled mode; investigating whether this is related.
Note that we have updated the kernel from 4.18.0-147.8_1.x86_64 to 4.18.0-179.el8.x86_64 to overcome bug 1847070.
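
The `ps | grep` count above can also be taken per process state, which avoids matching unrelated lines. The following is a minimal sketch, not part of the original report: it assumes the input is the text produced by `ps -eo pid,ppid,stat,comm`, and the names `count_defunct` and `SAMPLE` (including its PIDs) are made up for illustration.

```python
# Hypothetical sample of `ps -eo pid,ppid,stat,comm` output (PIDs are invented).
SAMPLE = """\
    PID    PPID STAT COMMAND
      1       0 Ss   systemd
   4021       1 Z    conmon <defunct>
   4022       1 Z    conmon <defunct>
   4100       1 Ssl  crio
   4200    4100 S    conmon
"""


def count_defunct(ps_output, command="conmon"):
    """Count zombie processes named `command` in ps-style text."""
    count = 0
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        _pid, _ppid, stat, comm = fields
        # A state starting with "Z" marks a zombie; ps appends "<defunct>" to the name.
        if stat.startswith("Z") and comm.split()[0] == command:
            count += 1
    return count


print(count_defunct(SAMPLE))
```

On a live node the same function could be fed the real output, for example via `subprocess.run(["ps", "-eo", "pid,ppid,stat,comm"], capture_output=True, text=True).stdout`.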

Actual results:
Process leak occurs

Expected results:
No process leak occurs

Additional info:

Comment 1 Peter Hunt 2020-06-18 14:57:47 UTC
can I get access to this node?

Comment 2 Peter Hunt 2020-06-18 17:09:52 UTC
systemd on the node is totally hosed. I am unable to get any information about scopes coming from it. it's possible it's not garbage collecting the conmon processes, so they're reporting as defunct. I have checked, and systemd is still reporting as running but maybe it got into a bad state that renders it unable to garbage collect its children

copying some systemd folks for assistance

Comment 3 Peter Hunt 2020-06-19 15:02:54 UTC
considering:
systemd on the node is unreachable
nodes that aren't hosed clean up conmons properly
systemd version on 4.5 and 4.4 seem to be the same (systemd 239-27)
conmon's parent is systemd when they're orphaned

I am considering this a systemd bug, as well as not strictly a 4.5 problem (though it's possible it's a RHEL 8.2 bug).

moving to systemd component

Comment 10 Michal Sekletar 2020-06-22 21:20:13 UTC
Constant re-reading of /proc/1/mountinfo suggests that we may have a bug in event dispatch code. I'd say that signal-fd event source should be dispatched with higher priority than mountinfo event source.
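
The starvation concern in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions and is not systemd's event loop code: the function `simulate`, the source names, and the rule that the lowest number is dispatched first (with ties going to the mountinfo source) are all chosen here for illustration.

```python
def simulate(priorities, iterations, exits_per_iteration=1):
    """Return the number of unreaped children after `iterations` loop turns.

    `priorities` maps a source name to an integer; the ready source with the
    lowest value is dispatched. "mountinfo" is modelled as always ready, and
    "sigchld" is ready whenever at least one child is waiting to be reaped.
    """
    zombies = 0
    for _ in range(iterations):
        zombies += exits_per_iteration  # children that exited during this turn
        ready = ["mountinfo"]
        if zombies:
            ready.append("sigchld")
        # min() returns the first minimal item, so a tie favours "mountinfo".
        chosen = min(ready, key=lambda name: priorities[name])
        if chosen == "sigchld":
            zombies = 0  # the handler reaps every pending child in one go
    return zombies


starved = simulate({"mountinfo": 0, "sigchld": 0}, iterations=100)
prioritised = simulate({"mountinfo": 0, "sigchld": -5}, iterations=100)
print(starved, prioritised)
```

In this model, a mountinfo source that is always ready and never loses a tie keeps the child-exit handler from running, so zombies accumulate; giving the signal source the better priority lets it run every turn.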

Comment 14 Dan Kenigsberg 2020-06-23 09:27:02 UTC
> The number of zombie processes increases at a rate of ~30K per hour and not cleaning up.

Peter, do you consider this rate of conmon process creation as expected? Could there be a tight loop somewhere forking too many of them?

Comment 17 Peter Hunt 2020-06-24 00:10:52 UTC
yes, it is expected there are this many conmon processes. Each VM (with around 80 causing these issues) is running a probe (is my container running?) every two seconds, which causes cri-o to run a conmon process
Though, there are instances where CRI-O has orphaned conmons (where it had not yet reparented to systemd). I will look into leads on this tomorrow.
Further, containers associated with these orphaned processes seem to have trouble being cleaned up (and some stick around after the VMs are "cleaned up"). I'll also look into this.
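
For a rough sense of scale, the probe-driven exec rate described above can be estimated as follows. This is back-of-the-envelope arithmetic only: it assumes one exec probe per VM (`probes_per_pod` is a hypothetical parameter) and uses the figures quoted in this report (about 80 VMs, one probe every two seconds).

```python
def execs_per_hour(pods, period_seconds, probes_per_pod=1):
    """Estimate how many probe executions (one conmon each) start per hour."""
    return pods * probes_per_pod * 3600 / period_seconds


launches = execs_per_hour(pods=80, period_seconds=2)
print(launches)            # 144000.0
print(30_000 / launches)   # share of launches matching ~30K zombies per hour
```

Under those assumptions, the reported growth of ~30K zombies per hour would correspond to roughly one conmon launch in five being left unreaped.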

Comment 19 Peter Hunt 2020-06-24 21:50:01 UTC
This PR mitigates the problem https://github.com/containers/conmon/pull/183

though there is still some weird behaviour we are investigating

Comment 20 guy chen 2020-06-25 06:46:58 UTC
Additional info - we have run the following scenarios to see if the problem is reproducing :

 * Run VMs which only use containerDisk
 * Run pods which just run `sleep` with restartPolicy: Never
 * Run pods which just run `sleep` with restartPolicy: Never and readinessProbe failing


Container disk VMs reproduced the bug exactly as the NFS PVC VMs did, so the bug is not related to the NFS mount or the PVC.
In the second scenario, the bug did not reproduce.
In the third scenario, the problem did not reproduce, but there was a spike to 1500 zombies at 90 pods before it recovered, so it seems to be related to the exec probes.

Comment 21 guy chen 2020-06-25 08:00:27 UTC
I have also run the following scenario :

 * Run pods which run `sleep 3600` 10 times, with restartPolicy: Never and readinessProbe failing

After 10 minutes the bug reproduced: the zombie count started to climb and reached 45K.
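
A reproducer pod of the kind described in this comment might be generated as in the sketch below. The exact manifest used in the test is not given in this report, so everything here is an assumption: the image name is a placeholder, the failing probe is modelled as an exec probe running `false`, the 1-second period is a guess, and `make_repro_pod` is a hypothetical helper.

```python
import json


def make_repro_pod(name, image="registry.example.com/busybox:latest", period_seconds=1):
    """Build a pod manifest that sleeps and has a readiness probe that always fails."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "sleeper",
                    "image": image,  # placeholder image, not from the report
                    "command": ["sleep", "3600"],
                    "readinessProbe": {
                        # An exec probe makes the runtime exec into the container each period.
                        "exec": {"command": ["false"]},
                        "periodSeconds": period_seconds,
                    },
                }
            ],
        },
    }


print(json.dumps(make_repro_pod("zombie-repro-0"), indent=2))
```

The JSON printed by the last line is a regular pod manifest, so it could be saved to a file and created with the usual client tooling.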

Comment 22 Peter Hunt 2020-06-26 20:38:32 UTC
As an update here:

it seems the best way to mitigate this problem (besides lowering the frequency of readiness probes) is to drop the double fork from conmon

Having conmon detach from cri-o and reparent to systemd for exec sync calls (probes) is not really needed for the call to work. cri-o still waits for all of the output from the container, so it's not like the goroutine can continue until conmon is done.

Further, removing the fork largely fixes the defunct PIDs problem. The VM I was testing on got some errors about kvm devices not being available, but I was able to reproduce the problem with runc (running ~125 pods that sleep and have readiness probes happening every second) and solve the problem with runc by dropping the fork.
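
The difference between the two spawning styles can be sketched as below. This is an illustration only and is neither conmon's nor cri-o's actual code: the helper names `spawn_direct` and `spawn_double_fork` are hypothetical, and the worker simply exits instead of monitoring a container.

```python
import os


def spawn_direct(exit_code=0):
    """Fork once. The caller remains the parent and is responsible for reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # stand-in for the monitor process doing its work
    return pid


def spawn_double_fork(exit_code=0):
    """Fork twice so the worker is detached from the caller; return the worker PID."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        worker = os.fork()
        if worker == 0:
            os.close(write_fd)
            os._exit(exit_code)  # stand-in for the monitor process doing its work
        os.write(write_fd, str(worker).encode())
        os._exit(0)  # the intermediate exits, orphaning the worker
    os.close(write_fd)
    worker_pid = int(os.read(read_fd, 32))
    os.close(read_fd)
    os.waitpid(pid, 0)  # the caller can only reap the intermediate process
    return worker_pid


child = spawn_direct(exit_code=3)
_, status = os.waitpid(child, 0)  # direct child: the caller reaps it itself
print(os.WEXITSTATUS(status))

orphan = spawn_double_fork()
print(orphan)  # reaping this PID is normally up to init or a subreaper
```

With a single fork the caller collects the exit status itself, so no zombie depends on another process being responsive. With the double fork the worker is normally reparented (to systemd on the node described here), and a `waitpid` on it from the original caller typically fails with `ChildProcessError`.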

As such, I've opened two pull requests, both attached. They're the respective conmon and cri-o fixes needed. Once we get those in master, we can evaluate feasibility of backporting to 4.5.z

Comment 23 Nelly Credi 2020-06-28 09:44:12 UTC
Good to see this update @Peter.
Can we also create a clone, so the fix will get backported into 4.5? 
and potentially also a clone for 4.4 if it is also going to get RHCOS based on el8.2?

Comment 27 Fabian Deutsch 2020-07-01 13:51:29 UTC
Peter,

two questions:
1.a) When speaking about the probes, are these only the liveness/readiness probes that are defined on the pod, or does runc do additional probes on the containers?
1.b) Which of the probes is causing the trouble?

2) Is the probing interval configurable?

Comment 28 Peter Hunt 2020-07-01 14:09:18 UTC
1a: just liveness/readiness probes
1b: the liveness probes on the launcher pods are causing the problems
2: yes, you can edit the pod spec and make the probes less frequent. I would recommend that in general; 1s is quite frequent
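
Making the probes less frequent comes down to raising `periodSeconds` on each probe in the pod spec. The sketch below applies that change to a manifest held as a Python dictionary; `relax_probes` and the sample pod are hypothetical, and the 10-second value is only an example, not a recommendation from this report.

```python
import copy


def relax_probes(pod, period_seconds):
    """Return a copy of `pod` with every liveness/readiness probe period raised."""
    updated = copy.deepcopy(pod)  # leave the caller's manifest untouched
    for container in updated["spec"]["containers"]:
        for kind in ("livenessProbe", "readinessProbe"):
            if kind in container:
                container[kind]["periodSeconds"] = period_seconds
    return updated


# Hypothetical pod with a 1-second exec liveness probe.
pod = {
    "spec": {
        "containers": [
            {
                "name": "compute",
                "livenessProbe": {"exec": {"command": ["true"]}, "periodSeconds": 1},
            }
        ]
    }
}
relaxed = relax_probes(pod, period_seconds=10)
print(relaxed["spec"]["containers"][0]["livenessProbe"]["periodSeconds"])
```

Each probe run starts one conmon, so a tenfold longer period means about a tenth as many conmon launches per container.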

Comment 32 Peter Hunt 2020-07-31 21:29:41 UTC
this fell off my radar for 4.6, I will revisit next sprint

Comment 33 Peter Hunt 2020-08-20 15:46:37 UTC
https://github.com/cri-o/cri-o/pull/3908 merged, this should be fixed (assuming 4.6 has conmon 2.0.20, which I believe it does)

Comment 44 guy chen 2020-09-14 09:36:15 UTC
I have tested the fix on 4.5.6 - conmon version 2.0.20 - fixed.
Created up to 100 VMs, in batches and one by one, and the problem did not reproduce.

Comment 47 Weinan Liu 2020-09-27 15:13:40 UTC
as per: https://bugzilla.redhat.com/show_bug.cgi?id=1852064#c17
I checked by increasing the stress on the node, creating more than 200 pods which run `sleep 3600` with restartPolicy: Never and readinessProbe succeeding.
The pods were then waiting in Pending state due to low resources. I see the defunct processes get cleaned up.

$ oc version
Client Version: 4.5.2
Server Version: 4.6.0-0.nightly-2020-09-26-202331
Kubernetes Version: v1.19.0+e465e66

Comment 49 errata-xmlrpc 2020-10-27 16:08:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

