Bug 1018009

Summary: mcollective operations fail to reap child processes after timeout
Product: OpenShift Online
Reporter: Andy Grimm <agrimm>
Component: Containers
Assignee: Rob Millner <rmillner>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: unspecified
Priority: unspecified
Version: 2.x
CC: bmeng, jgoulding, mfisher, rmillner
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-10-17 13:35:05 UTC

Description Andy Grimm 2013-10-11 00:45:50 UTC
On OpenShift Online nodes, I'm seeing a lot of defunct "runuser" processes whose parent PID corresponds to mcollectived.  It looks like the problem is that when a process times out, we simply kill it and never follow up with Process.wait or Process.waitpid.
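A minimal Ruby sketch of that failure mode (not the mcollective code itself; assumes a Linux /proc filesystem): a killed child stays defunct until the parent reaps it.

```ruby
# Sketch only: a SIGKILLed child lingers as a zombie (<defunct>)
# until the parent calls waitpid. Assumes Linux's /proc.
pid = Process.spawn("sleep", "60")
Process.kill("KILL", pid)
sleep 0.2  # let the kernel transition the child to zombie state

state = File.read("/proc/#{pid}/stat").split[2]
puts "state before reaping: #{state}"    # "Z" == zombie

Process.waitpid(pid)                     # reap: the defunct entry disappears
puts "still in /proc? #{File.exist?("/proc/#{pid}")}"
```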

Comment 1 Andy Grimm 2013-10-11 01:06:20 UTC
On a related note, is there a reason we're immediately sending SIGKILL here, instead of attempting something less destructive first?
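For reference, the "less destructive" approach would look roughly like this (a hypothetical helper, not the shipped code): SIGTERM first, a grace period, then SIGKILL, reaping in every path.

```ruby
# Hypothetical gentler escalation (not the shipped code): try SIGTERM,
# give the child a grace period, then fall back to SIGKILL, reaping
# with waitpid in every path so no zombie is left behind.
def kill_and_reap(pid, grace: 2)
  Process.kill("TERM", pid)
  deadline = Time.now + grace
  while Time.now < deadline
    return pid if Process.waitpid(pid, Process::WNOHANG)  # exited and reaped
    sleep 0.05
  end
  Process.kill("KILL", pid)  # grace expired; SIGKILL cannot be caught
  Process.waitpid(pid)       # blocking reap is now safe
  pid
rescue Errno::ESRCH, Errno::ECHILD
  pid  # process already gone or already reaped
end
```

As Comment 2 explains, this escalation is exactly what the code avoids: some children intercept the gentler signal and would eat the whole grace period.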

Comment 2 Rob Millner 2013-10-11 17:32:31 UTC
An issue we've seen in the past is that mcollective has its own internal timeout.  When that timeout happens, the thread handling your mcollective call is just nuked and there's no chance to wait/waitpid.

The timeout to SIGKILL is intended to wipe out processes ahead of the mcollective timeout to avoid three problems we've seen.

1. Two conflicting tasks running simultaneously.
  Ex: configure and destroy running on the same gear; the gear is left behind because configure is holding the polydir /tmp open when destroy calls userdel.

2. Processes (ex: git receive) that get stuck on a file descriptor, intercept SIGINT, and linger forever.

3. The SIGKILL timeout is occasionally up to 100 seconds late.


Basically, when that timeout fires it's an emergency: you may have no time left in your thread for a waitpid, and all processes related to the task must go away before the broker makes a conflicting request.

Every so often, the timeouts get adjusted in the code without an understanding of the other timeouts in the system - especially the mcollective internal timeout.
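The budgeting constraint can be illustrated with a small Timeout wrapper (illustrative only; the numbers and helper name here are assumptions, not the real configuration): the action-level limit must fire with margin to spare before mcollective's internal timeout, so the thread still exists to clean up.

```ruby
require 'timeout'

# Illustrative only: an action-level limit chosen below mcollective's
# internal timeout (both numbers are made up, not the real config).
MCOLLECTIVE_TIMEOUT = 180
CLEANUP_MARGIN      = 30

def run_action(limit = MCOLLECTIVE_TIMEOUT - CLEANUP_MARGIN)
  Timeout.timeout(limit) { yield }
  :ok
rescue Timeout::Error
  # We raised before mcollective's own timeout, so this thread is still
  # alive and has CLEANUP_MARGIN seconds to kill and reap children.
  :timed_out
end
```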

Comment 3 Rob Millner 2013-10-11 18:57:43 UTC
Pull request:
https://github.com/openshift/origin-server/pull/3864

Stage pull request:
https://github.com/openshift/origin-server/pull/3865


The underlying issue of what's happening to the processes will be opened in another ticket.

Comment 4 openshift-github-bot 2013-10-11 21:04:16 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/45bbe71c42e8a8c7ae82fff70108a5ac964dc671
Bug 1018009 - spawn a separate thread to waitpid after killing.
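A minimal sketch of the shape that commit message describes (the helper name is invented; see the pull request for the real change): after SIGKILL, hand the reap off to a background thread so the mcollective thread never blocks in waitpid.

```ruby
# Sketch of the fix's shape (invented helper name; see the PR for the
# real code): SIGKILL the child, then reap it from a separate thread so
# the calling thread is never stuck in a blocking waitpid when
# mcollective's internal timeout nukes it.
def kill_and_reap_async(pid)
  Process.kill("KILL", pid)
  Thread.new do
    Process.waitpid(pid)   # blocks only this background thread
  rescue Errno::ECHILD
    nil                    # already reaped elsewhere
  end
end
```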

Comment 5 Meng Bo 2013-10-12 10:01:58 UTC
Tested on devenv-stage_496:

1. Add a sleep to the cartridge setup script to simulate the mcollective timeout.
2. Create an app with the cartridge.
3. After the timeout, check for leftover runuser processes.

No defunct runuser processes are left behind.
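The check in the last step can be scripted roughly like this (assumes a procps-style `ps`; this helper is not part of the original test plan):

```ruby
# Rough zombie check (assumes procps-style `ps`): list processes in
# "Z" (defunct) state whose command name matches.
def defunct_procs(name)
  `ps -eo stat=,comm=`.lines.map(&:split).select do |stat, comm|
    stat.to_s.start_with?("Z") && comm == name
  end
end

puts defunct_procs("runuser").empty? ? "clean" : "defunct runuser processes remain"
```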

Move bug to verified.