Bug 1018009

Summary: mcollective operations fail to reap child processes after timeout
Product: OpenShift Online
Reporter: Andy Grimm <agrimm>
Component: Containers
Assignee: Rob Millner <rmillner>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: unspecified
Priority: unspecified
Version: 2.x
CC: bmeng, jgoulding, mfisher, rmillner
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-10-17 13:35:05 UTC

Description Andy Grimm 2013-10-11 00:45:50 UTC
On OpenShift Online nodes, I'm seeing a lot of defunct "runuser" processes whose parent PID corresponds to mcollectived.  It looks like the problem is that when a process times out, we simply kill it and never follow up with Process.wait or Process.waitpid.
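A minimal Ruby sketch of that failure mode (not the mcollective code itself; assumes a Linux /proc filesystem): a killed child stays defunct until the parent reaps it.

```ruby
# Sketch only: a SIGKILLed child lingers as a zombie (<defunct>)
# until the parent calls waitpid. Assumes Linux's /proc.
pid = Process.spawn("sleep", "60")
Process.kill("KILL", pid)
sleep 0.2  # let the kernel transition the child to zombie state

state = File.read("/proc/#{pid}/stat").split[2]
puts "state before reaping: #{state}"    # "Z" == zombie

Process.waitpid(pid)                     # reap: the defunct entry disappears
puts "still in /proc? #{File.exist?("/proc/#{pid}")}"
```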

Comment 1 Andy Grimm 2013-10-11 01:06:20 UTC
On a related note, is there a reason we're immediately sending SIGKILL here, instead of attempting something less destructive first?
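For reference, the "less destructive" approach would look roughly like this (a hypothetical helper, not the shipped code): SIGTERM first, a grace period, then SIGKILL, reaping in every path.

```ruby
# Hypothetical gentler escalation (not the shipped code): try SIGTERM,
# give the child a grace period, then fall back to SIGKILL, reaping
# with waitpid in every path so no zombie is left behind.
def kill_and_reap(pid, grace: 2)
  Process.kill("TERM", pid)
  deadline = Time.now + grace
  while Time.now < deadline
    return pid if Process.waitpid(pid, Process::WNOHANG)  # exited and reaped
    sleep 0.05
  end
  Process.kill("KILL", pid)  # grace expired; SIGKILL cannot be caught
  Process.waitpid(pid)       # blocking reap is now safe
  pid
rescue Errno::ESRCH, Errno::ECHILD
  pid  # process already gone or already reaped
end
```

As Comment 2 explains, this escalation is exactly what the code avoids: some children intercept the gentler signal and would eat the whole grace period.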

Comment 2 Rob Millner 2013-10-11 17:32:31 UTC
An issue we've seen in the past is that mcollective has its own internal timeout.  When that timeout happens, the thread handling your mcollective call is just nuked and there's no chance to wait/waitpid.

The timeout to SIGKILL is intended to wipe out processes ahead of the mcollective timeout to avoid three problems we've seen.

1. Two conflicting tasks running simultaneously.
  Ex: configure and destroy running on the same gear; the gear is left behind because configure is holding the polydir /tmp open when destroy calls userdel.

2. Processes (ex: git receive) that get stuck on a file descriptor, intercept SIGINT, and linger forever.

3. The SIGKILL timeout is occasionally up to 100 seconds late.


Basically, when that timeout fires it's an emergency: you may have no time left in your thread for a waitpid, and all processes related to the task must go away before the broker makes a conflicting request.

Every so often, the timeouts get adjusted in the code without an understanding of the other timeouts in the system - especially the mcollective internal timeout.
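The budgeting constraint can be illustrated with a small Timeout wrapper (illustrative only; the numbers and helper name here are assumptions, not the real configuration): the action-level limit must fire with margin to spare before mcollective's internal timeout, so the thread still exists to clean up.

```ruby
require 'timeout'

# Illustrative only: an action-level limit chosen below mcollective's
# internal timeout (both numbers are made up, not the real config).
MCOLLECTIVE_TIMEOUT = 180
CLEANUP_MARGIN      = 30

def run_action(limit = MCOLLECTIVE_TIMEOUT - CLEANUP_MARGIN)
  Timeout.timeout(limit) { yield }
  :ok
rescue Timeout::Error
  # We raised before mcollective's own timeout, so this thread is still
  # alive and has CLEANUP_MARGIN seconds to kill and reap children.
  :timed_out
end
```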

Comment 3 Rob Millner 2013-10-11 18:57:43 UTC
Pull request:
https://github.com/openshift/origin-server/pull/3864

Stage pull request:
https://github.com/openshift/origin-server/pull/3865


The underlying issue of what's happening to the processes will be opened in another ticket.

Comment 4 openshift-github-bot 2013-10-11 21:04:16 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/45bbe71c42e8a8c7ae82fff70108a5ac964dc671
Bug 1018009 - spawn a separate thread to waitpid after killing.
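A minimal sketch of the shape that commit message describes (the helper name is invented; see the pull request for the real change): after SIGKILL, hand the reap off to a background thread so the mcollective thread never blocks in waitpid.

```ruby
# Sketch of the fix's shape (invented helper name; see the PR for the
# real code): SIGKILL the child, then reap it from a separate thread so
# the calling thread is never stuck in a blocking waitpid when
# mcollective's internal timeout nukes it.
def kill_and_reap_async(pid)
  Process.kill("KILL", pid)
  Thread.new do
    Process.waitpid(pid)   # blocks only this background thread
  rescue Errno::ECHILD
    nil                    # already reaped elsewhere
  end
end
```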

Comment 5 Meng Bo 2013-10-12 10:01:58 UTC
Tested on devenv-stage_496:

1. Add a sleep to the cartridge setup script to simulate the mcollective timeout.
2. Create an app with the cartridge.
3. After the timeout, check for leftover runuser processes.

No defunct runuser processes are left behind.
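The check in the last step can be scripted roughly like this (assumes a procps-style `ps`; this helper is not part of the original test plan):

```ruby
# Rough zombie check (assumes procps-style `ps`): list processes in
# "Z" (defunct) state whose command name matches.
def defunct_procs(name)
  `ps -eo stat=,comm=`.lines.map(&:split).select do |stat, comm|
    stat.to_s.start_with?("Z") && comm == name
  end
end

puts defunct_procs("runuser").empty? ? "clean" : "defunct runuser processes remain"
```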

Move bug to verified.