On OpenShift Online nodes, I'm seeing a lot of defunct "runuser" processes with parent PID corresponding to mcollectived. It looks like the problem is that when a process times out, we simply kill it and do not follow that with a Process.wait or Process.waitpid.
On a related note, is there a reason we're immediately doing a SIGKILL here, instead of attempting something less destructive first?
An issue we've seen in the past is that mcollective has its own internal timeout. When that timeout happens, the thread handling your mcollective call is just nuked and there's no chance to wait/waitpid. The timeout to SIGKILL is intended to wipe out processes ahead of the mcollective timeout to avoid three problems we've seen.
1. Two conflicting tasks running simultaneously. Ex: configure and destroy running on the same gear; the gear is left behind because configure is holding the polydir /tmp open when destroy calls userdel.
2. Processes (ex: git receive) that get stuck on a file descriptor, intercept SIGINT, and will linger forever.
3. The SIGKILL timeout is occasionally up to 100 seconds late.
Basically, when that timeout fires it's an emergency, because you may have no time left in your thread for a waitpid, and all processes related to the task must go away before the broker makes a conflicting request. Every so often, the timeouts get adjusted in the code without an understanding of the other timeouts in the system, especially the mcollective internal timeout.
Pull request: https://github.com/openshift/origin-server/pull/3864 Stage pull request: https://github.com/openshift/origin-server/pull/3865 The underlying issue of what's happening to the processes will be opened in another ticket.
Commit pushed to master at https://github.com/openshift/origin-server https://github.com/openshift/origin-server/commit/45bbe71c42e8a8c7ae82fff70108a5ac964dc671 Bug 1018009 - spawn a separate thread to waitpid after killing.
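For readers following along, here is a minimal sketch of the approach named in the commit message: send SIGKILL right away, then reap the child from a separate thread so the calling thread never blocks in waitpid. This is an illustration only, not the code from the commit; the helper name kill_and_reap and the "sleep 60" child are hypothetical.

```ruby
# Hypothetical helper (not the code from the commit): kill a timed-out
# child immediately, then reap it from a background thread so the caller
# returns without blocking.
def kill_and_reap(pid)
  Process.kill('KILL', pid)
  Thread.new do
    begin
      # Collect the exit status so the kernel can drop the defunct entry.
      Process.waitpid(pid)
    rescue Errno::ECHILD
      # Someone else already reaped the child; nothing left to do.
      nil
    end
  end
rescue Errno::ESRCH
  # The process no longer exists, so there is nothing to kill or reap.
  nil
end

# Usage: stand-in for a command that has exceeded its timeout.
pid = Process.spawn('sleep', '60')
reaper = kill_and_reap(pid)
# Joining here is only for the demonstration; real callers would return
# immediately and let the thread finish on its own.
reaper.join if reaper
```

Ruby's built-in Process.detach(pid) also starts a thread that waits on the child, so it may be a simpler choice where no custom error handling is needed.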
Tested on devenv-stage_496. Added a sleep to the cartridge setup script to simulate the mcollective timeout, then created an app with the cartridge. After the timeout, checked the runuser processes: no defunct process is left. Moving bug to verified.