Bug 988901 - oo-admin-ctl-gears stopgear exits with 0 status before all processes are stopped
Summary: oo-admin-ctl-gears stopgear exits with 0 status before all processes are stopped
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: low
Target Milestone: ---
Assignee: Rob Millner
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-07-26 16:26 UTC by Sten Turpin
Modified: 2015-05-14 23:24 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-11-14 22:12:04 UTC
Target Upstream Version:
Embargoed:



Description Sten Turpin 2013-07-26 16:26:19 UTC
Description of problem: Issued oo-admin-ctl-gears stopgear <uuid>. The command exited with status 0, but processes belonging to the gear were still running.


Version-Release number of selected component (if applicable): openshift-origin-node-util-1.11.8-1.el6oso.noarch


How reproducible: Sometimes, on gears with long-running, high-CPU processes


Steps to Reproduce:
1. Find a gear that is constantly running at its cgroup-limited CPU threshold (in this case, it was running a mysqld_safe process).
2. Issue oo-admin-ctl-gears stopgear <uuid>.

Actual results:
oo-admin-ctl-gears stopgear exited with a return code of 0, but the mysql cartridge continued to run, now no longer constrained by cgroups.


Expected results:
Either all processes owned by the gear stop before oo-admin-ctl-gears exits, or oo-admin-ctl-gears warns that processes remain and leaves the errant processes constrained by cgroups.

Comment 1 Thomas Wiest 2013-07-27 13:40:41 UTC
Just to underscore this further.

The problem is that we had a gear that was using all of its cgroup-allotted CPU resources. We thought that the gear might just need to be restarted.

The box was overloaded, so we wanted to stop the gear, let the system recover a bit, then restart the gear.

What happened instead was that we stopped the gear, and the stopgear operation took the cgroup constraints off the gear but failed to kill its processes.

This gear, which was already using as much CPU as it could get, was suddenly no longer constrained by cgroups and could use all available CPU resources on the box.

So, in effect, stopgear exacerbated the problem.

Eventually we had to kill off the gear processes manually.

If stopgear fails to kill the gear processes, it should:
1) return a non-zero exit code
2) put the remaining processes back in cgroups
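
The first of these two points could be sketched roughly as follows. This is only an illustration in Python, not the oo-admin-ctl-gears code: the helper names `remaining_pids` and `stop_exit_code` are hypothetical, and it assumes each gear runs under its own Unix UID and that the owner of a /proc/<pid> directory normally matches the UID of the process.

```python
import os


def remaining_pids(uid, proc_root="/proc"):
    """Return the sorted PIDs under proc_root whose directory is owned by uid.

    Hypothetical helper: assumes the gear runs under a dedicated UID.
    """
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "cpuinfo"
        try:
            owner = os.stat(os.path.join(proc_root, entry)).st_uid
        except OSError:
            continue  # the process exited between listdir() and stat()
        if owner == uid:
            pids.append(int(entry))
    return sorted(pids)


def stop_exit_code(uid, proc_root="/proc"):
    """Return 0 if no process owned by uid remains, 1 otherwise."""
    return 1 if remaining_pids(uid, proc_root) else 0
```

A stop command written this way would report failure whenever `remaining_pids` is non-empty after the kill attempt, so the caller knows not to treat the gear as stopped.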

Comment 2 Rob Millner 2013-07-27 17:05:58 UTC
Interesting problem. We go through several motions, including freezing the cgroup, to ensure that we can deliver SIGKILL to every process owned by the gear and that the gear can't outrun the killer ("Fri 13th mode").

What processes escaped?
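
For readers unfamiliar with the freeze-then-kill approach, here is a minimal sketch in Python. It is not the node's actual implementation. It assumes a cgroup v1 freezer hierarchy in which the gear's directory exposes `freezer.state` and `cgroup.procs`; the function name `freeze_kill_thaw`, the directory argument, and the injectable `kill` parameter are illustrative assumptions.

```python
import os
import signal
import time


def freeze_kill_thaw(cgroup_dir, kill=os.kill, timeout=5.0):
    """Freeze a cgroup, send SIGKILL to each listed PID, then thaw it.

    Assumes cgroup_dir is a cgroup v1 freezer directory (an assumption,
    not the documented OpenShift layout). Returns the PIDs signalled.
    """
    state_path = os.path.join(cgroup_dir, "freezer.state")
    procs_path = os.path.join(cgroup_dir, "cgroup.procs")

    with open(state_path, "w") as f:
        f.write("FROZEN")
    try:
        # The kernel may report FREEZING for a while; wait for FROZEN.
        deadline = time.monotonic() + timeout
        while True:
            with open(state_path) as f:
                if f.read().strip() == "FROZEN":
                    break
            if time.monotonic() > deadline:
                raise TimeoutError("cgroup did not reach FROZEN state")
            time.sleep(0.05)

        # While frozen, no process can fork, so this list is stable.
        with open(procs_path) as f:
            pids = [int(line) for line in f if line.strip()]

        for pid in pids:
            try:
                kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # already gone
    finally:
        # Always thaw; the queued SIGKILLs take effect once tasks resume.
        with open(state_path, "w") as f:
            f.write("THAWED")
    return pids
```

Note that a process that has already been moved out of the gear's cgroup will not appear in `cgroup.procs`, so a sketch like this cannot reach it.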

Comment 3 Thomas Wiest 2013-07-27 21:20:18 UTC
It was a mysqld_safe process. The gear looked like it was a mysql db only gear.

Comment 4 Rob Millner 2013-07-30 21:08:31 UTC
Have not been able to reproduce this issue on devenv.  Next time it is observed, please leave the process running and collect all logs related to the gear so that it can be diagnosed.  Thanks!

Comment 5 Rob Millner 2013-11-14 22:12:04 UTC
A "forcestopgear" command was added to oo-admin-ctl-gears to stop gear processes that are both stuck and have escaped cgroups.  A Trello card was filed to make it harder to escape cgroups.

I believe there's nothing left for this ticket.

