1018009 – mcollective operations fail to reap child processes after timeout

Summary: mcollective operations fail to reap child processes after timeout

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Online
Classification:	Red Hat
Component:	Containers
Sub Component:
Version:	2.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Rob Millner
QA Contact:	libra bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-10-11 00:45 UTC by Andy Grimm
Modified:	2016-11-08 03:47 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-10-17 13:35:05 UTC
Target Upstream Version:
Embargoed:

Description Andy Grimm 2013-10-11 00:45:50 UTC

On OpenShift Online nodes, I'm seeing a lot of defunct "runuser" processes with parent PID corresponding to mcollectived.  It looks like the problem is that when a process times out, we simply kill it and do not follow that with a Process.wait or Process.waitpid.
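
For illustration, a minimal Ruby sketch of the missing step: a killed child stays in the process table as defunct until its parent collects the exit status. The helper name kill_and_reap is hypothetical and is not taken from the OpenShift code base.

```ruby
# Hypothetical helper: kill a child process and then reap it so that
# no <defunct> entry is left behind under the parent.
def kill_and_reap(pid)
  Process.kill('KILL', pid)
  # wait2 blocks until the child has exited and returns [pid, Process::Status].
  _, status = Process.wait2(pid)
  status
end

pid = Process.spawn('sleep', '60')
status = kill_and_reap(pid)
puts "signaled=#{status.signaled?} termsig=#{status.termsig}"
```

The blocking wait is the simplest form; it assumes the calling thread still has time left to wait after the kill.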

Comment 1 Andy Grimm 2013-10-11 01:06:20 UTC

On a related note, is there a reason we're immediately sending SIGKILL here, instead of attempting something less destructive first?
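
As a sketch of what "less destructive first" could look like: send SIGTERM, poll for exit during a short grace period, and escalate to SIGKILL only if the process is still alive. The helper name terminate and the 2 second grace period are assumptions chosen for illustration, not values from the node code.

```ruby
# Hypothetical helper: try SIGTERM first, fall back to SIGKILL after a grace period.
def terminate(pid, grace: 2.0)
  Process.kill('TERM', pid)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + grace
  loop do
    # WNOHANG makes wait2 return nil immediately if the child has not exited yet.
    reaped, status = Process.wait2(pid, Process::WNOHANG)
    return status if reaped
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep 0.05
  end
  # The child ignored or blocked SIGTERM: force it down and reap it.
  Process.kill('KILL', pid)
  Process.wait2(pid)[1]
end

status = terminate(Process.spawn('sleep', '60'))
puts "terminated by signal #{status.termsig}"
```

The grace period adds to the total time spent in the caller, so it only fits where the enclosing timeout leaves that much headroom.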

Comment 2 Rob Millner 2013-10-11 17:32:31 UTC

An issue we've seen in the past is that mcollective has its own internal timeout.  When that timeout happens, the thread handling your mcollective call is just nuked and there's no chance to wait/waitpid.

The timeout to SIGKILL is intended to wipe out processes ahead of the mcollective timeout to avoid three problems we've seen.

1. Two conflicting tasks running simultaneously.
  Ex: configure and destroy running on the same gear; the gear is left behind because configure is holding the polydir /tmp open when destroy calls userdel.

2. Processes (ex: git receive) that get stuck on a file descriptor, intercept SIGINT, and linger forever.

3. The SIGKILL timeout is occasionally up to 100 seconds late.


Basically, when that timeout fires, it's an emergency: you may have no time left in your thread for a waitpid, and all processes related to the task must go away before the broker makes a conflicting request.

Every so often, the timeouts get adjusted in the code without an understanding of the other timeouts in the system - especially the mcollective internal timeout.
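
One way to keep the timeouts consistent is to derive the kill deadline from the mcollective timeout instead of setting it independently. The sketch below is only an illustration: kill_deadline, reap_margin, and the 240 second figure are hypothetical, and the only number taken from this report is the up to 100 seconds of lateness mentioned in item 3.

```ruby
# Hypothetical helper: compute when the SIGKILL timer must fire so that the
# kill (even if late) and the reap both finish before mcollective's own timeout.
def kill_deadline(mco_timeout, max_lateness: 100, reap_margin: 5)
  deadline = mco_timeout - max_lateness - reap_margin
  unless deadline.positive?
    raise ArgumentError,
          "mcollective timeout of #{mco_timeout}s leaves no room to kill and reap"
  end
  deadline
end

# 240 is an assumed mcollective timeout, used only to exercise the helper.
puts kill_deadline(240)
```

Raising on a non-positive deadline turns a silent misconfiguration into an immediate error when someone adjusts one timeout without the others.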

Comment 3 Rob Millner 2013-10-11 18:57:43 UTC

Pull request:
https://github.com/openshift/origin-server/pull/3864

Stage pull request:
https://github.com/openshift/origin-server/pull/3865


The underlying issue of what's happening to the processes will be opened in another ticket.

Comment 4 openshift-github-bot 2013-10-11 21:04:16 UTC

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/45bbe71c42e8a8c7ae82fff70108a5ac964dc671
Bug 1018009 - spawn a separate thread to waitpid after killing.
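
A minimal sketch of the approach named in the commit message, assuming plain Ruby threads; the actual change in the commit may differ. The helper name kill_and_reap_async is hypothetical.

```ruby
# Hypothetical helper: kill the child, then reap it from a separate thread so
# the caller returns immediately even if its own time budget is exhausted.
def kill_and_reap_async(pid)
  Process.kill('KILL', pid)
  Thread.new do
    begin
      # Blocks in the background thread only; returns the Process::Status.
      Process.wait2(pid)[1]
    rescue Errno::ECHILD
      # The child was already reaped elsewhere; nothing left to do.
      nil
    end
  end
end

reaper = kill_and_reap_async(Process.spawn('sleep', '60'))
status = reaper.value # joining here is only for demonstration
puts "reaped, signaled=#{status.signaled?}"
```

Ruby's standard Process.detach(pid) provides the same kind of background reaping thread and could replace the hand-written Thread.new here.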

Comment 5 Meng Bo 2013-10-12 10:01:58 UTC

Tested on devenv-stage_496:

1. Add a sleep to the cartridge setup script to simulate the mcollective timeout.
2. Create an app with the cartridge.
3. After the timeout, check for runuser processes.

No defunct processes are left.

Moving bug to verified.
