Description of problem:
issued oo-admin-ctl-gears stopgear <uuid>. Process exited with 0 status, but processes belonging to the gear remained.

Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.11.8-1.el6oso.noarch

How reproducible:
sometimes - on gears with long-running, high-cpu processes

Steps to Reproduce:
1. Find a gear that is constantly running at its cgroup-limited CPU threshold (in this case, it was a mysqld_safe process)
2. issued oo-admin-ctl-gears stopgear <uuid>

Actual results:
oo-admin-ctl-gears stopgear exited with return code of 0, but mysql cartridge continued to run, now uncontrolled by cgroups.

Expected results:
processes owned by the gear stop before oo-admin-ctl-gears exits, or oo-admin-ctl-gears returns a warning that processes remain, and leaves the errant processes constrained by cgroups.
Just to underscore this further: the problem is that we had a gear that was using all of its cgroup-allotted CPU resources. We thought that the gear might just need to be restarted. The box was overloaded, so we wanted to stop the gear, let the system recover a bit, then restart the gear. What happened instead was that we stopped the gear, and the stopgear operation took the cgroups constraints off of the gear but failed to kill the processes. This gear, which was using as many CPU resources as it could get, was suddenly no longer constrained by cgroups and could use all available CPU resources on the box. So, in effect, stopgear exacerbated the problem. Eventually we had to kill off the gear processes manually. If stopgear fails to kill the gear processes, it should: 1) return a non-zero exit code, and 2) put the remaining processes back in cgroups.
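For illustration only, the two requested behaviours could look roughly like the sketch below. It is not the oo-admin-ctl-gears code. The function names (pids_owned_by, reattach, finish_stop) and the cgroup_dir parameter are hypothetical, and it assumes that every gear process runs under the gear's own UID, that /proc/<pid> is owned by that UID, and that writing a PID to the cgroup's cgroup.procs file moves the process into that cgroup.

```python
import os


def pids_owned_by(uid, proc_root="/proc"):
    """Return the sorted PIDs under proc_root whose directory is owned by uid."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            if os.stat(os.path.join(proc_root, name)).st_uid == uid:
                pids.append(int(name))
        except OSError:
            continue  # the process exited while we were scanning
    return sorted(pids)


def reattach(pids, cgroup_dir):
    """Move each PID into the cgroup by writing it to cgroup.procs."""
    procs_file = os.path.join(cgroup_dir, "cgroup.procs")
    for pid in pids:
        try:
            # One PID per write, as the cgroup interface expects.
            with open(procs_file, "w") as fh:
                fh.write(str(pid))
        except ProcessLookupError:
            continue  # the process exited before it could be moved


def finish_stop(uid, cgroup_dir, proc_root="/proc"):
    """Return (exit_code, leftover_pids) for the end of a stop operation."""
    leftovers = pids_owned_by(uid, proc_root)
    if leftovers:
        # Hypothetical policy: constrain the survivors again, report failure.
        reattach(leftovers, cgroup_dir)
        return 1, leftovers
    return 0, []
```

A caller would pass the first element of the result to sys.exit, so that a stop which leaves processes behind is visible to whoever ran the command.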
Interesting problem. We go through several motions, including freezing the cgroup to ensure we can deliver SIGKILL to every process owned by the gear and that the gear can't outrun the killer ("Fri 13th mode"). What processes escaped?
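As a rough sketch of the freeze, kill, thaw sequence described above (not the actual node code), assuming a cgroup v1 freezer hierarchy where the gear's directory contains freezer.state and cgroup.procs. The function name freeze_kill_thaw and the cgroup_dir parameter are hypothetical, and the kill argument is injectable only so the sketch can be exercised without signalling real processes.

```python
import os
import signal


def freeze_kill_thaw(cgroup_dir, kill=os.kill):
    """Freeze the cgroup, SIGKILL every listed PID, then thaw it again."""
    state = os.path.join(cgroup_dir, "freezer.state")
    # Freeze first so no process can fork new children while we iterate.
    with open(state, "w") as fh:
        fh.write("FROZEN")
    # A real implementation would poll freezer.state until it reads FROZEN,
    # since the kernel may report FREEZING for a while; omitted here.
    killed = []
    try:
        with open(os.path.join(cgroup_dir, "cgroup.procs")) as fh:
            pids = [int(line) for line in fh if line.strip()]
        for pid in pids:
            try:
                kill(pid, signal.SIGKILL)
                killed.append(pid)
            except ProcessLookupError:
                pass  # already gone
    finally:
        # Always thaw: the queued SIGKILLs only take effect once the tasks run.
        with open(state, "w") as fh:
            fh.write("THAWED")
    return killed
```

A process that has already left the gear's cgroup is not listed in cgroup.procs, so this sequence never signals it. That would be consistent with the escaped mysqld_safe reported here, although the thread does not establish how it escaped.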
It was a mysqld_safe process. The gear looked like a MySQL-database-only gear.
Have not been able to reproduce this issue on devenv. Next time it is observed, please leave the process running and collect all logs related to the gear so that it can be diagnosed. Thanks!
A "forcestopgear" command was added to oo-admin-ctl-gears to stop gear processes that are both stuck and have escaped cgroups. A Trello card was filed to make it harder to escape cgroups. I believe there's nothing left for this ticket.