Description of problem:
Process leak: conmon processes keep being spawned, become zombies (defunct), and are never reaped.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
I have run a scale-up test on an r730xd server (40 cores and 128G RAM).
Started 90 VMs on the node.
When examining the CNV node, I saw the following issue:
230K zombie processes (defunct processes with no parent).
On closer inspection, I saw that all of the zombies are conmon processes that monitor the pods:
[core@f25-h03-000-r730xd ~]$ ps -ef | grep defunct | grep -c conmon
The number of zombie processes increases at a rate of ~30K per hour, and they are not being cleaned up.
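The counts above were gathered with `ps`. A small sketch of the kind of check used (the `count_zombie_conmons` helper name is mine, not something from the node):

```shell
# Count defunct conmon processes by process state rather than by the
# "<defunct>" string: a zombie's state (STAT) column starts with "Z".
count_zombie_conmons() {
    ps -eo stat=,comm= | awk '$1 ~ /^Z/ && $2 ~ /conmon/' | wc -l
}

count_zombie_conmons
```

Run periodically (e.g. under `watch`), this is enough to observe the ~30K/hour growth rate described above.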
We have around 5 VMs that are stuck in Scheduled state; investigating whether this is related.
Note that we have updated the kernel from 4.18.0-147.8_1.x86_64 to 4.18.0-179.el8.x86_64 to overcome bug 1847070.
Expected results:
No process leak occurs

Actual results:
Process leak occurs
can I get access to this node?
systemd on the node is totally hosed. I am unable to get any information about scopes from it. It's possible it is not garbage collecting the conmon processes, which is why they're reported as defunct. I checked, and systemd still reports as running, but it may have gotten into a bad state that renders it unable to reap its children.
copying some systemd folks for assistance
* systemd on the node is unreachable
* nodes that aren't hosed clean up conmons properly
* the systemd version on 4.5 and 4.4 seems to be the same (systemd 239-27)
* conmon's parent is systemd once they're orphaned
I am considering this a systemd bug, as well as not strictly a 4.5 problem (though it's possible it's a RHEL 8.2 bug).
moving to systemd component
Constant re-reading of /proc/1/mountinfo suggests that we may have a bug in the event dispatch code. I'd say the signal-fd event source should be dispatched with higher priority than the mountinfo event source.
> The number of zombie processes increases at a rate of ~30K per hour and not cleaning up.
Peter, do you consider this rate of conmon process creation as expected? Could there be a tight loop somewhere forking too many of them?
Yes, this many conmon processes is expected. Each VM (with around 80 causing these issues) runs a probe ("is my container running?") every two seconds, and each probe causes CRI-O to run a conmon process.
That said, there are instances where CRI-O has orphaned conmons (ones that had not yet been reparented to systemd). I will look into leads on this tomorrow.
Further, containers associated with these orphaned processes seem to have trouble being cleaned up (and some stick around after the VMs are "cleaned up"). I'll also look into this.
This PR mitigates the problem https://github.com/containers/conmon/pull/183
though there is still some weird behaviour we are investigating
Additional info - we have run the following scenarios to see if the problem reproduces:
* Run VMs which only use containerDisk
* Run pods which just run `sleep` with restartPolicy: Never
* Run pods which just run `sleep` with restartPolicy: Never and a failing readinessProbe
The containerDisk VMs reproduced the bug exactly as the NFS PVC VMs did, so the bug is not related to the NFS mount or the PVC.
In the second scenario, the bug did not reproduce.
In the third scenario, the problem was not reproduced either, but there was a spike to 1500 zombies at 90 pods before it recovered, so it seems to be related to the exec probes.
I have also run the following scenario:
* Run pods which run `sleep 3600` 10 times, with restartPolicy: Never and a failing readinessProbe
After 10 minutes the bug reproduced: the zombie count started to climb and reached 45K.
As an update here:
it seems the best way to mitigate this problem (besides lowering the frequency of readiness probes) is to drop the double fork from conmon
Having conmon detach from cri-o and reparent to systemd for exec sync calls (probes) is not really needed for the call to work. cri-o still waits for all of the output from the container, so it's not like the goroutine can continue until conmon is done.
Further, removing the fork largely fixes the defunct PIDs problem. The VM I was testing on got some errors about kvm devices not being available, but I was able to reproduce the problem with runc (running ~125 pods that sleep and have readiness probes happening every second) and solve the problem with runc by dropping the fork.
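The reparenting behaviour being dropped can be illustrated with a plain shell double fork (a sketch of the pattern only, not conmon itself): when the intermediate process exits, its child is orphaned and adopted by PID 1 (or the nearest subreaper), which then becomes responsible for reaping it.

```shell
# The subshell plays the intermediate process: it backgrounds a child
# and exits immediately, so the child loses its original parent and is
# reparented (typically to PID 1 / systemd, or the nearest subreaper).
( sleep 5 & echo "child pid: $!" )
sleep 1   # give the kernel a moment to reparent the orphan
pid=$(pgrep -nx sleep)                       # newest process named sleep
ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
echo "orphan $pid now has parent $ppid (this shell is $$)"
kill "$pid" 2>/dev/null                      # clean up the demo child
```

In conmon's case, removing this second fork keeps conmon a direct child of CRI-O for exec sync calls, so CRI-O (rather than an overloaded systemd) reaps it when it exits.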
As such, I've opened two pull requests, both attached. They're the respective conmon and CRI-O fixes needed. Once we get those into master, we can evaluate the feasibility of backporting to 4.5.z.
Good to see this update @Peter.
Can we also create a clone, so the fix will get backported into 4.5?
and potentially also a clone for 4.4 if it is also going to get RHCOS based on el8.2?
1.a) When speaking about the probes - are these only the liveness/readiness probes that are defined on the pod, or does runc do additional probes on the containers?
1.b) Which of the probes is causing the trouble?
2) Is the probing interval configurable?
1a: just liveness/readiness probes
1b: the liveness probe on the launcher pods are causing the problems
2: yes, you can edit the pod spec and make the probes less frequent. I would recommend that in general, 1s is quite frequent
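For reference, the probe period lives in the pod spec. A minimal sketch of a less aggressive exec probe (the probe command and values here are illustrative, not the scale test's actual spec):

```shell
# Write out a readinessProbe fragment probing every 10s instead of
# every 1-2s; this would be merged into the pod (or launcher) template.
cat <<'EOF' > readiness-probe.yaml
readinessProbe:
  exec:
    command: ["cat", "/tmp/ready"]   # illustrative probe command
  periodSeconds: 10     # the Kubernetes default; the scale test used 1-2s
  failureThreshold: 3
EOF

grep -c 'periodSeconds: 10' readiness-probe.yaml   # prints 1
```

Each tick of an exec probe spawns a conmon, so going from 1s to 10s cuts the conmon creation rate on these nodes by 10x.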
this fell off my radar for 4.6, I will revisit next sprint
https://github.com/cri-o/cri-o/pull/3908 merged, this should be fixed (assuming 4.6 has conmon 2.0.20, which I believe it does)
I have tested the fix on 4.5.6 (conmon version 2.0.20) - fixed.
Created up to 100 VMs, in batches and one by one, and the problem did not reproduce.
as per: https://bugzilla.redhat.com/show_bug.cgi?id=1852064#c17
I checked by increasing the stress on the node, creating more than 200 pods which run `sleep 3600`, with restartPolicy: Never and a succeeding readinessProbe.
The pods were then waiting in Pending state due to low resources. I see the defunct processes get cleaned up.
$ oc version
Client Version: 4.5.2
Server Version: 4.6.0-0.nightly-2020-09-26-202331
Kubernetes Version: v1.19.0+e465e66
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.