Bug 1162192

Summary: Unhandled ShellExecutionException prevents oo-admin-ctl-gears forcestopgear from pkill'ing gear processes
Product: OpenShift Container Platform Reporter: Brenton Leanhardt <bleanhar>
Component: Containers Assignee: Brenton Leanhardt <bleanhar>
Status: CLOSED ERRATA QA Contact: libra bugs <libra-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 2.2.0 CC: adellape, agrimm, anli, jhonce, jokerman, libra-bugs, libra-onpremise-devel, mmccomas, pruan, ruliu
Target Milestone: --- Keywords: NeedsTestCase, Upstream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: rubygem-openshift-origin-node-1.32.3.1-1 Doc Type: Bug Fix
Doc Text:
A bug in the oo-admin-ctl-gears tool prevented the forcestopgear command from killing all gear processes. As a result, the forcestopgear command could leave processes running. This bug fix updates the oo-admin-ctl-gears tool to ensure all processes are killed successfully.
Story Points: ---
Clone Of: 1160494 Environment:
Last Closed: 2014-12-10 13:25:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1160494    
Bug Blocks:    

Description Brenton Leanhardt 2014-11-10 13:25:41 UTC
+++ This bug was initially created as a clone of Bug #1160494 +++

Description of problem:

One of the primary cases for running "forcestop" on a gear is to deal with situations where "runuser" fails or where the unprivileged user cannot kill processes for some reason.  The fallback is to pkill processes from _outside_ the gear.  However, when the current version of the code fails to stop a cartridge, it raises a ShellExecutionException before the "pkill" is run.

The result is that if a gear hits a ulimit (specifically nproc), an administrator must log in and clean up.

Version-Release number of selected component (if applicable):

rubygem-openshift-origin-node-1.31.9-1.el6oso.noarch

How reproducible:

Always

Steps to Reproduce:
1. Run a gear with 250 processes or threads, enough to reach the gear's nproc limit
2. Attempt to stop, restart, or forcestop the gear

Actual results:

All attempts will fail.

Expected results:

forcestop should kill all gear processes, and allow a subsequent start/restart to work successfully.

--- Additional comment from Jhon Honce on 2014-11-04 20:47:33 EST ---

oo-admin-ctl-gears uses ApplicationContainer#stop_gear() not ApplicationContainer#force_stop() for better control of gear state. This method does not handle the ShellExecutionException properly.
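
The failure mode described above can be illustrated with a minimal Ruby sketch. This is not the code from the pull request: SketchGear, stop_cartridges, and kill_gear_processes are hypothetical names, and ShellExecutionException is redefined locally as a stand-in. It only shows why an unrescued exception from the cartridge stop skips the fallback kill, and how rescuing it lets the kill run.

```ruby
# Local stand-in for the exception raised when a shell command fails
# (assumption: the real class lives in the node library).
class ShellExecutionException < StandardError; end

# Hypothetical, simplified gear. It is not the real ApplicationContainer.
class SketchGear
  attr_reader :log

  def initialize(cartridge_stop_fails:)
    @cartridge_stop_fails = cartridge_stop_fails
    @log = []
  end

  # Stands in for running the cartridge's 'control stop' as the gear user.
  def stop_cartridges
    raise ShellExecutionException, "control stop failed" if @cartridge_stop_fails
    @log << :cartridges_stopped
  end

  # Stands in for pkill'ing the gear's processes from outside the gear.
  def kill_gear_processes
    @log << :processes_killed
  end

  # Unprotected shape: the exception escapes before the kill is reached.
  def stop_gear_unprotected
    stop_cartridges
    kill_gear_processes
  end

  # Protected shape: a cartridge failure is recorded, then the kill still runs.
  def stop_gear_protected
    begin
      stop_cartridges
    rescue ShellExecutionException
      @log << :cartridge_stop_failed
    end
    kill_gear_processes
  end
end
```

With a gear whose cartridge stop fails, stop_gear_unprotected raises and never records :processes_killed, while stop_gear_protected records the failure and then the kill.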

--- Additional comment from Jhon Honce on 2014-11-07 17:16:09 EST ---

Fixed in https://github.com/openshift/origin-server/pull/5937

--- Additional comment from openshift-github-bot on 2014-11-07 18:08:47 EST ---

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/26e43acb8358e1d415abe185b968910cfd54c651
Bug 1160494 - Protect Ops stop_gear from cartridge errors

--- Additional comment from Liu Ruikai on 2014-11-09 21:59:43 EST ---

force-stop now works on devenv_5288

--- Additional comment from Liu Ruikai on 2014-11-09 22:25:55 EST ---

Verified as follows:

0. Check the max processes limit:
[app0-ruliu0.dev.rhcloud.com 54606c4629133999e5000012]\> ulimit -u
250

1. Run as many processes as possible:
[app0-ruliu0.dev.rhcloud.com 54606c4629133999e5000012]\> cat /tmp/1.sh
#!/bin/bash

for i in `seq 0 249`; do
    sleep 3600 &
done
[app0-ruliu0.dev.rhcloud.com 54606c4629133999e5000012]\> /tmp/1.sh

2. Force stop the gear:
[root@ip-10-231-32-4 ~]# oo-admin-ctl-gears forcestopgear 54606c4629133999e5000012
The gear is then stopped and all sleep processes are killed.

3. Start and restart the gear:
[root@ip-10-231-32-4 ~]# oo-admin-ctl-gears startgear 54606c4629133999e5000012
[root@ip-10-231-32-4 ~]# oo-admin-ctl-gears restartgear 54606c4629133999e5000012
Both commands succeed and the gear is now started.
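
One way to confirm from the node that nothing is left running after forcestopgear is to count the processes owned by the gear's user. The sketch below is a hypothetical helper, not part of oo-admin-ctl-gears: count_processes_for_uid is an invented name, it relies on the Linux /proc filesystem, and it takes a numeric UID.

```ruby
# Count processes owned by a numeric UID by scanning /proc (Linux only).
# The owner of /proc/<pid> is normally the process's effective UID.
def count_processes_for_uid(uid)
  Dir.glob("/proc/[0-9]*").count do |dir|
    begin
      File.stat(dir).uid == uid
    rescue Errno::ENOENT, Errno::ESRCH
      false # the process exited between the glob and the stat
    end
  end
end
```

Assuming the gear's Unix user is named after the gear UUID, the UID could be looked up with Etc.getpwnam (after require 'etc') and passed to this helper. A result of 0 after forcestopgear would match the outcome reported in step 2.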

Comment 3 Anping Li 2014-11-25 08:27:35 UTC
Verified and passed on puddle-2-2-2014-11-24

1) Add a fork script to exhaust nproc.
2) "rhc app stop php" fails with a suggestion message.
[anli@broker ~]$ rhc app stop php
Resources unavailable for operation. You may need to run 'rhc force-stop-app -a php' and retry.
Failed to execute: 'control stop' for /var/lib/openshift/547439dde5fed5d73e00009c/php
3) "rhc app force-stop php" can stop the app without error.
[anli@broker ~]$ rhc app force-stop php
RESULT:
php force stopped

4) The ssh session started before step 1) behaves as follows:
[anli@broker ~]$ rhc ssh php
Connecting to 547439dde5fed5d73e00009c.com.cn ...
bash: fork: retry: Resource temporarily unavailable
bash: fork: retry: Resource temporarily unavailable
bash: fork: retry: Resource temporarily unavailable
bash: fork: retry: Resource temporarily unavailable
Connection to php-anlidom.ose22-manual.com.cn closed by remote host.
Connection to php-anlidom.ose22-manual.com.cn closed.



5) After force-stop, the app can be started, connected to over ssh, files can be created, and the app can be accessed.
[anli@broker ~]$ rhc app start php
RESULT:
php started

[anli@broker ~]$ rhc ssh php
[php-anlidom.ose22-manual.com.cn 547439dde5fed5d73e00009c]\> cd /tmp/
[php-anlidom.ose22-manual.com.cn tmp]\> touch abc
[php-anlidom.ose22-manual.com.cn tmp]\> ls -1
abc

Comment 5 errata-xmlrpc 2014-12-10 13:25:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2014-1979.html