988901 – oo-admin-ctl-gears stopgear exits with 0 status before all processes are stopped

Summary: oo-admin-ctl-gears stopgear exits with 0 status before all processes are stopped

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Online
Classification:	Red Hat
Component:	Containers
Sub Component:
Version:	2.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Rob Millner
QA Contact:	libra bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-07-26 16:26 UTC by Sten Turpin
Modified:	2015-05-14 23:24 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-11-14 22:12:04 UTC
Target Upstream Version:
Embargoed:

Description Sten Turpin 2013-07-26 16:26:19 UTC

Description of problem: Issued oo-admin-ctl-gears stopgear <uuid>. The command exited with status 0, but processes belonging to the gear kept running.


Version-Release number of selected component (if applicable): openshift-origin-node-util-1.11.8-1.el6oso.noarch


How reproducible: sometimes - on gears with long-running, high-cpu processes


Steps to Reproduce:
1. Find a gear that is constantly running at its cgroup-limited CPU threshold (in this case, it was a mysqld_safe process) 
2. Issue oo-admin-ctl-gears stopgear <uuid>

Actual results:
oo-admin-ctl-gears stopgear exited with a return code of 0, but the mysql cartridge continued to run, now unconstrained by cgroups.


Expected results:
Processes owned by the gear stop before oo-admin-ctl-gears exits, or oo-admin-ctl-gears returns a warning that processes remain and leaves the errant processes constrained by cgroups.

Comment 1 Thomas Wiest 2013-07-27 13:40:41 UTC

Just to underscore this further.

The problem is that we had a gear that was using all of its cgroup-allotted CPU resources. We thought that the gear might just need to be restarted.

The box was overloaded, so we wanted to stop the gear, let the system recover a bit, then restart the gear.

What happened instead was we stopped the gear, the stopgear operation took the cgroups constraints off of the gear, but failed to kill the processes.

This gear, which was using as many CPU resources as it could get, was suddenly no longer constrained by cgroups and could use all available CPU resources on the box.

So, in effect, stopgear exacerbated the problem.

Eventually we had to kill off the gear processes manually.

If stopgear fails to kill the gear processes, it should:
1) return a non-zero exit code
2) put the remaining processes back in cgroups
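
The two requests above could be sketched roughly as follows. This is only an illustration in Python, not the actual oo-admin-ctl-gears code: the function names, the idea of identifying gear processes by the owning UID of their /proc entry, and the path of the cgroup `tasks` file passed in are all assumptions made for the example. It relies on the cgroup v1 convention that writing a PID to a group's `tasks` file moves that process into the group.

```python
import os
import sys


def pids_owned_by(uid, proc_root="/proc"):
    """Return the sorted PIDs whose /proc/<pid> directory is owned by uid."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            if os.stat(os.path.join(proc_root, name)).st_uid == uid:
                pids.append(int(name))
        except FileNotFoundError:
            continue  # the process exited while we were scanning
    return sorted(pids)


def verify_stopped(uid, cgroup_tasks_file, proc_root="/proc"):
    """Hypothetical post-stop check.

    Returns 0 when no process owned by uid remains. Otherwise writes each
    surviving PID back into the given cgroup tasks file, prints a warning,
    and returns 1 so the caller can exit non-zero.
    """
    survivors = pids_owned_by(uid, proc_root)
    if not survivors:
        return 0
    for pid in survivors:
        try:
            # One PID per write, as the cgroup v1 tasks interface expects.
            with open(cgroup_tasks_file, "a") as tasks:
                tasks.write(f"{pid}\n")
        except OSError:
            pass  # the process may have exited in the meantime
    print(f"warning: processes still running for uid {uid}: {survivors}",
          file=sys.stderr)
    return 1
```

A caller would end with something like `sys.exit(verify_stopped(gear_uid, tasks_path))`, where `gear_uid` and `tasks_path` are placeholders for whatever the node uses to identify the gear and its cgroup.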

Comment 2 Rob Millner 2013-07-27 17:05:58 UTC

Interesting problem.  We go through several motions, including freezing the cgroup to ensure we can deliver SIGKILL to every process owned by the gear and that the gear can't outrun the killer ("Fri 13th mode").
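
For readers unfamiliar with the freeze-then-kill approach described here, a minimal sketch follows. It is not the code used by oo-admin-ctl-gears; it assumes a cgroup v1 freezer hierarchy in which the gear's group directory contains `freezer.state` and `tasks` files, and the function name, the `kill` parameter (injected to make the sketch testable), and the timeout are hypothetical.

```python
import os
import signal
import time


def freeze_kill_thaw(cgroup_dir, kill=os.kill, timeout=5.0):
    """Freeze a cgroup, send SIGKILL to every task in it, then thaw it.

    While the group is frozen its tasks cannot fork, so the PID list read
    from the tasks file cannot go stale. The queued SIGKILLs take effect
    once the group is thawed. Returns the list of PIDs that were signalled.
    """
    state_file = os.path.join(cgroup_dir, "freezer.state")
    tasks_file = os.path.join(cgroup_dir, "tasks")

    with open(state_file, "w") as f:
        f.write("FROZEN")
    try:
        # The kernel may report FREEZING for a while; wait for FROZEN.
        deadline = time.monotonic() + timeout
        while True:
            with open(state_file) as f:
                if f.read().strip() == "FROZEN":
                    break
            if time.monotonic() > deadline:
                raise TimeoutError(f"{cgroup_dir} did not reach FROZEN")
            time.sleep(0.05)

        with open(tasks_file) as f:
            pids = [int(line) for line in f if line.strip()]

        for pid in pids:
            try:
                kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # already gone
    finally:
        # Always thaw, otherwise the signalled tasks stay frozen forever.
        with open(state_file, "w") as f:
            f.write("THAWED")
    return pids
```

Note that this only reaches processes still listed in the group's `tasks` file; a process that has left the cgroup, as apparently happened in this report, would not be signalled.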

What processes escaped?

Comment 3 Thomas Wiest 2013-07-27 21:20:18 UTC

It was a mysqld_safe process. The gear looked like it was a mysql db only gear.

Comment 4 Rob Millner 2013-07-30 21:08:31 UTC

Have not been able to reproduce this issue on devenv.  Next time it is observed, please leave the process running and collect all logs related to the gear so that it can be diagnosed.  Thanks!

Comment 5 Rob Millner 2013-11-14 22:12:04 UTC

A "forcestopgear" command was added to oo-admin-ctl-gears to stop gear processes that are both stuck and have escaped cgroups.  A Trello card was filed to make it harder to escape cgroups.

I believe there's nothing left for this ticket.
