Bug 1190856

Summary: RFE: add a "stop-and-lock" operation to oo-admin-ctl-gears
Product: OpenShift Online
Reporter: Andy Grimm <agrimm>
Component: Containers
Assignee: Jhon Honce <jhonce>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: medium
Priority: medium
Version: 1.x
CC: bmeng, jgoulding, jokerman, mmccomas
Target Milestone: ---
Keywords: NeedsTestCase
Target Release: 2.x
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Clone Of:
: 1202512
Last Closed: 2015-04-21 18:01:02 UTC
Type: Bug
Bug Blocks: 1202512

Description Andy Grimm 2015-02-09 19:30:48 UTC
Description of problem:

In some cases, it is desirable to have a "stopgear" operation on nodes which creates a .stop_lock file.  I would even argue that outside of watchman (and maybe upgrades) this is the more common desired behavior.

The main problem with the current behavior is this:

Suppose you have a node running at full capacity.  You use "oo-admin-ctl-gears stopgear" to stop 10 of its gears for various reasons -- they are misbehaving due to bad code, they hit quota, or maybe they are just secondary gears with idle frontends.  After you do that, 10 more active gears get placed on the node.  The next time the node is rebooted, the gears you administratively stopped start again, and aside from the fact that you have to rediscover these problem gears, you are also now 10 gears over capacity.  In the worst case, you get into a state where the node is so far over capacity that the attempt to start all of the gears that are not stop-locked causes a crash, and the node reboots repeatedly until someone manually intervenes.
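
The arithmetic of that scenario can be sketched as follows. This is only an illustration: the function name and the capacity of 100 gears are hypothetical, and the only assumption about the node is that gears without a .stop_lock are started on reboot.

```python
def active_after_reboot(capacity, stopped, placed, stop_locked):
    """Count the gears a node tries to run after a reboot.

    capacity    -- gears running when the node was full (hypothetical number)
    stopped     -- gears an operator stopped administratively
    placed      -- new active gears placed on the node afterwards
    stop_locked -- True if the administrative stop also wrote .stop_lock
    """
    running_before_reboot = capacity - stopped + placed
    # Without a stop lock, the administratively stopped gears start again.
    restarted = 0 if stop_locked else stopped
    return running_before_reboot + restarted


# Current behavior: 10 gears over a capacity of 100.
print(active_after_reboot(100, 10, 10, stop_locked=False))
# Requested behavior: the node stays at capacity.
print(active_after_reboot(100, 10, 10, stop_locked=True))
```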


Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.33.4-1.el6oso.noarch

Comment 1 Jhon Honce 2015-02-11 17:00:04 UTC
.stop_lock is the flag for when a gear is stopped by the user/developer vs. an administrative stop. It is currently the only way a Node knows the difference.

Comment 2 Andy Grimm 2015-02-11 18:33:07 UTC
So far the only thing I'm aware of stop_lock being used for is to tell "startall" whether or not to start the gear when the node boots, and that is precisely the thing I am trying to affect here.  If stop_lock has other uses, then we need to add another flag and make startall key off the other flag (or a combination of the two flags).

Comment 3 Andy Grimm 2015-02-13 15:28:57 UTC
In reference to our IRC conversation, I'm fine with adding content to the file to differentiate between administrative stops (potentially with a reason given for the stop) and user-initiated stops.
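
A minimal sketch of that idea, not the code from the eventual fix: every function name here is hypothetical, and the convention that an empty .stop_lock means a user stop while a non-empty one carries an administrative reason is an assumption made for illustration. The only behavior taken from this thread is that "startall" skips gears that have a .stop_lock.

```python
from pathlib import Path

STOP_LOCK = ".stop_lock"  # file name from this bug; its directory is passed in by the caller


def write_stop_lock(runtime_dir, reason=""):
    """Create the stop lock; a non-empty reason marks an administrative stop (assumed convention)."""
    lock = Path(runtime_dir) / STOP_LOCK
    lock.write_text(reason + "\n" if reason else "")
    return lock


def stop_kind(runtime_dir):
    """Return None (no lock), "user" (empty lock) or "admin" (lock with content)."""
    lock = Path(runtime_dir) / STOP_LOCK
    if not lock.exists():
        return None
    return "admin" if lock.read_text().strip() else "user"


def gears_to_start(runtime_dirs):
    """Hypothetical startall filter: skip every gear that has a stop lock, whatever its content."""
    return [d for d in runtime_dirs if stop_kind(d) is None]
```

Keeping the reason inside the existing file means startall needs no second flag: it still only checks whether the file exists, and anything that wants to tell the two kinds of stop apart reads the content.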

Comment 4 Jhon Honce 2015-02-23 22:24:40 UTC
Fixed in https://github.com/openshift/origin-server/pull/6083

Comment 5 openshift-github-bot 2015-02-23 23:07:45 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/e42f7414fa7426397683111ec5880fd520f251ab
Bug 1190856 - Allow Operator to stop gear with .stop_lock

Comment 6 Meng Bo 2015-02-25 07:23:57 UTC
Checked on devenv_5449: the .stop_lock file is generated with the given message as its content.

[root@ip-10-136-82-250 runtime]# cat .stop_lock 
TEST_MESSAGE

The gear can be stopped successfully with the new option, and it can then be started by the user.

Moving bug to verified.