+++ This bug was initially created as a clone of Bug #1160494 +++

Description of problem:
One of the primary reasons for running "forcestop" on a gear is to handle situations where "runuser" fails or where the unprivileged user cannot kill processes for some reason. The fallback is to pkill processes from _outside_ the gear. However, when the current version of the code fails to stop a cartridge, it raises a ShellExecutionException before "pkill" is run. As a result, if a gear hits a ulimit (specifically nproc), an administrator must log in and clean up manually.

Version-Release number of selected component (if applicable):
rubygem-openshift-origin-node-1.31.9-1.el6oso.noarch

How reproducible:
Always

Steps to Reproduce:
1. Run a gear with 250 process threads.
2. Attempt to stop, restart, or forcestop the gear.

Actual results:
All attempts fail.

Expected results:
forcestop should kill all gear processes and allow a subsequent start/restart to succeed.

--- Additional comment from Jhon Honce on 2014-11-04 20:47:33 EST ---

oo-admin-ctl-gears uses ApplicationContainer#stop_gear(), not ApplicationContainer#force_stop(), for better control of gear state. stop_gear() does not handle the ShellExecutionException properly.

--- Additional comment from Jhon Honce on 2014-11-07 17:16:09 EST ---

Fixed in https://github.com/openshift/origin-server/pull/5937

--- Additional comment from openshift-github-bot on 2014-11-07 18:08:47 EST ---

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/26e43acb8358e1d415abe185b968910cfd54c651
Bug 1160494 - Protect Ops stop_gear from cartridge errors

--- Additional comment from Liu Ruikai on 2014-11-09 21:59:43 EST ---

force-stop now works on devenv_5288

--- Additional comment from Liu Ruikai on 2014-11-09 22:25:55 EST ---

Verified as follows:

0. Check the max processes limit:
   [app0-ruliu0.dev.rhcloud.com 54606c4629133999e5000012]\> ulimit -u
   250

1. Run as many processes as possible:
   [app0-ruliu0.dev.rhcloud.com 54606c4629133999e5000012]\> cat /tmp/1.sh
   #!/bin/bash
   for i in `seq 0 249`; do
       sleep 3600 &
   done
   [app0-ruliu0.dev.rhcloud.com 54606c4629133999e5000012]\> /tmp/1.sh

2. Force stop the gear:
   [root@ip-10-231-32-4 ~]# oo-admin-ctl-gears forcestopgear 54606c4629133999e5000012
   The gear is stopped and all sleep processes are killed.

3. Start and restart the gear:
   [root@ip-10-231-32-4 ~]# oo-admin-ctl-gears startgear 54606c4629133999e5000012
   [root@ip-10-231-32-4 ~]# oo-admin-ctl-gears restartgear 54606c4629133999e5000012
   Both commands succeed and the gear is started.
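The failure mode from the original description (stop_gear raising ShellExecutionException before the outside-the-gear pkill runs) and the general shape of the fix can be illustrated with a small bash sketch. This is not the actual node code from pull request 5937, which is Ruby; the function name `force_stop_gear` and its two command arguments are hypothetical. The point is that a failed graceful stop is recorded but does not prevent the kill step from running.

```bash
#!/bin/bash
# Hypothetical sketch of the "stop, then always kill from outside" pattern.
# force_stop_gear and its arguments are illustrative names, not OpenShift APIs.

# Usage: force_stop_gear STOP_FN KILL_FN
#   STOP_FN: command that stops the cartridges as the gear user; it may fail,
#            e.g. when runuser cannot fork because nproc is exhausted.
#   KILL_FN: command that kills every gear process from outside the gear.
# Returns 1 if the graceful stop failed, 0 otherwise.
force_stop_gear() {
  local stop_fn=$1
  local kill_fn=$2
  local stop_failed=0

  # Testing the command in an "if" keeps a failure from aborting the function
  # (even under set -e); this is the analogue of rescuing the exception
  # instead of letting it propagate past the kill step.
  if ! "$stop_fn"; then
    echo "warning: cartridge stop failed; falling back to kill" >&2
    stop_failed=1
  fi

  # The fallback runs whether or not the graceful stop succeeded.
  "$kill_fn"
  return "$stop_failed"
}
```

In the scenario from this bug, KILL_FN would be a wrapper around something like `pkill -9 -u <gear uuid>` run from outside the gear, so the cleanup does not depend on the gear user being able to fork.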
Verified and passed on puddle-2-2-2014-11-24:

1) Add a fork script that exhausts nproc.
2) "rhc app stop php" fails with a suggestion message:
   [anli@broker ~]$ rhc app stop php
   Resources unavailable for operation. You may need to run 'rhc force-stop-app -a php' and retry.
   Failed to execute: 'control stop' for /var/lib/openshift/547439dde5fed5d73e00009c/php
3) "rhc app force-stop php" stops the app without error:
   [anli@broker ~]$ rhc app force-stop php
   RESULT: php force stopped
4) The ssh session started before step 1) shows:
   [anli@broker ~]$ rhc ssh php
   Connecting to 547439dde5fed5d73e00009c.com.cn ...
   bash: fork: retry: Resource temporarily unavailable
   bash: fork: retry: Resource temporarily unavailable
   bash: fork: retry: Resource temporarily unavailable
   bash: fork: retry: Resource temporarily unavailable
   Connection to php-anlidom.ose22-manual.com.cn closed by remote host.
   Connection to php-anlidom.ose22-manual.com.cn closed.
5) After force-stop, the app can be started, accepts ssh logins, can create files, and can be accessed:
   [anli@broker ~]$ rhc app start php
   RESULT: php started
   [anli@broker ~]$ rhc ssh php
   [php-anlidom.ose22-manual.com.cn 547439dde5fed5d73e00009c]\> cd /tmp/
   [php-anlidom.ose22-manual.com.cn tmp]\> touch abc
   [php-anlidom.ose22-manual.com.cn tmp]\> ls -1
   abc
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2014-1979.html