Bug 988901 - oo-admin-ctl-gears stopgear exits with 0 status before all processes are stopped
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Priority: high
Severity: low
Assigned To: Rob Millner
QA Contact: libra bugs
Reported: 2013-07-26 12:26 EDT by Sten Turpin
Modified: 2015-05-14 19:24 EDT
Doc Type: Bug Fix
Last Closed: 2013-11-14 17:12:04 EST
Type: Bug

Attachments: None
Description Sten Turpin 2013-07-26 12:26:19 EDT
Description of problem: Issued oo-admin-ctl-gears stopgear <uuid>. The command exited with status 0, but processes belonging to the gear remained running.

Version-Release number of selected component (if applicable): openshift-origin-node-util-1.11.8-1.el6oso.noarch

How reproducible: sometimes - on gears with long-running, high-cpu processes

Steps to Reproduce:
1. Find a gear that is constantly running at its cgroup-limited CPU threshold (in this case, it was a mysqld_safe process).
2. Issue oo-admin-ctl-gears stopgear <uuid>.

Actual results:
oo-admin-ctl-gears stopgear exited with a return code of 0, but the mysql cartridge continued to run, now uncontrolled by cgroups.

Expected results:
Processes owned by the gear stop before oo-admin-ctl-gears exits, or oo-admin-ctl-gears returns a warning that processes remain and leaves the errant processes constrained by cgroups.
Comment 1 Thomas Wiest 2013-07-27 09:40:41 EDT
Just to underscore this further.

The problem is that we had a gear that was using all of its cgroup-allotted cpu resources. We thought that the gear might just need to be restarted.

The box was overloaded, so we wanted to stop the gear, let the system recover a bit, then restart the gear.

What happened instead was that we stopped the gear, and the stopgear operation took the cgroups constraints off of the gear but failed to kill the processes.

This gear, which was using as many cpu resources as it could get, was suddenly no longer constrained by cgroups and could use all available cpu resources on the box.

So, in effect, stopgear exacerbated the problem.

Eventually we had to kill off the gear processes manually.

If stopgear fails to kill the gear processes, it should:
1) return a non-zero exit code
2) put the remaining processes back in cgroups
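
The two requests above can be illustrated with a rough Python sketch. This is not the oo-admin-ctl-gears code (which is not shown in this report); the function names, the exit-code policy, and the cgroup tasks-file path are all hypothetical. It assumes a Linux /proc where each /proc/<pid> directory is owned by the process's user, which holds for ordinary processes, and a cgroup v1 style "tasks" file that accepts one PID per write.

```python
import os


def pids_owned_by(uid, proc_root="/proc"):
    """Return sorted PIDs under proc_root whose directory is owned by uid."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            owner = os.stat(os.path.join(proc_root, name)).st_uid
        except OSError:
            continue  # the process exited between listdir() and stat()
        if owner == uid:
            pids.append(int(name))
    return sorted(pids)


def reattach_to_cgroup(pids, tasks_file):
    """Write each PID to a cgroup v1 style tasks file, one PID per write.

    tasks_file is a hypothetical path; the real gear cgroup location is
    not given in this report.
    """
    for pid in pids:
        with open(tasks_file, "a") as f:
            f.write("%d\n" % pid)


def stop_status(remaining_pids):
    """Hypothetical exit-code policy: 0 only when no gear process survives."""
    return 0 if not remaining_pids else 1
```

A stop routine built along these lines would call pids_owned_by() with the gear's uid after the kill attempt, re-attach any survivors, and exit with stop_status() so that callers can tell the stop did not complete.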
Comment 2 Rob Millner 2013-07-27 13:05:58 EDT
Interesting problem. We go through several motions, including freezing the cgroup, to ensure that we can deliver SIGKILL to every process owned by the gear and that the gear can't outrun the killer ("Fri 13'th mode").

What processes escaped?
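
For readers unfamiliar with the freeze-then-kill approach mentioned in this comment, here is a minimal Python sketch of the general idea. It is not the actual stopgear implementation. It assumes a cgroup v1 freezer directory containing "freezer.state" and "tasks" files; the cgroup_dir argument and the function name are hypothetical, and the kill parameter exists only so the logic can be exercised without signalling real processes.

```python
import os
import signal


def freeze_kill_thaw(cgroup_dir, kill=os.kill):
    """Freeze a cgroup, send SIGKILL to each listed task, then thaw it.

    Freezing first means no task can fork new children while the signals
    are being queued. Returns the list of PIDs that were signalled.
    """
    state = os.path.join(cgroup_dir, "freezer.state")
    tasks = os.path.join(cgroup_dir, "tasks")
    pids = []
    with open(state, "w") as f:
        f.write("FROZEN")
    # A real freezer may report FREEZING for a while; this sketch does
    # not wait for the state to settle.
    try:
        with open(tasks) as f:
            pids = [int(line) for line in f if line.strip()]
        for pid in pids:
            try:
                kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the task is already gone
    finally:
        # Always thaw, otherwise the tasks stay frozen indefinitely.
        with open(state, "w") as f:
            f.write("THAWED")
    return pids
```

Note that this only reaches processes still listed in the cgroup's tasks file, so a process that has already left the cgroup would not be signalled by a routine like this one.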
Comment 3 Thomas Wiest 2013-07-27 17:20:18 EDT
It was a mysqld_safe process. The gear looked like it was a mysql db only gear.
Comment 4 Rob Millner 2013-07-30 17:08:31 EDT
Have not been able to reproduce this issue on devenv.  Next time it is observed, please leave the process running and collect all logs related to the gear so that it can be diagnosed.  Thanks!
Comment 5 Rob Millner 2013-11-14 17:12:04 EST
A "forcestopgear" command was added to oo-admin-ctl-gears to stop gear processes that are both stuck and have escaped cgroups.  A Trello card was filed to make it harder to escape cgroups.

I believe there's nothing left for this ticket.
