Bug 1202512

Summary: RFE: add a "stop-and-lock" operation to oo-admin-ctl-gears
Product: OpenShift Container Platform
Reporter: Brenton Leanhardt <bleanhar>
Component: Containers
Assignee: Brenton Leanhardt <bleanhar>
Status: CLOSED ERRATA
QA Contact: libra bugs <libra-bugs>
Severity: medium
Docs Contact:
Priority: high
Version: 2.2.0
CC: adellape, agrimm, anli, bmeng, jhonce, jokerman, libra-bugs, libra-onpremise-devel, mmccomas, pruan
Target Milestone: ---
Keywords: NeedsTestCase
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openshift-origin-node-util-1.34.4.1-1.el6op
Doc Type: Enhancement
Doc Text:
Previously, gears stopped by administrators using the "oo-admin-ctl-gears stopgear" command would always be restarted by a subsequent "oo-admin-ctl-gears startall" command or the next time the node was rebooted. In certain situations, this could be undesirable depending on the node capacity. This enhancement adds an additional command, "oo-admin-ctl-gears stoplockgear", which allows administrators to stop a gear and add a .stop_lock file. The presence of a .stop_lock file ensures that the gear does not start during operations that take .stop_lock files into account, such as "oo-admin-ctl-gears startall", or after a node reboot. A message explaining why the gear should not be started in the future can also be written to the .stop_lock file by passing the "--message" option to the command.
Story Points: ---
Clone Of: 1190856
Environment:
Last Closed: 2015-04-06 17:06:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1190856
Bug Blocks:

Description Brenton Leanhardt 2015-03-16 19:00:51 UTC
+++ This bug was initially created as a clone of Bug #1190856 +++

Description of problem:

In some cases, it is desirable to have a "stopgear" operation on nodes which creates a .stop_lock file.  I would even argue that outside of watchman (and maybe upgrades) this is the more common desired behavior.

The main problem with the current behavior is this:

Suppose you have a node running at full capacity.  You use "oo-admin-ctl-gears stopgear" to stop 10 of its gears for various reasons -- they are misbehaving due to bad code, they hit quota, or maybe they are just secondary gears with idle frontends.  After you do that, 10 more active gears get placed on the node.  The next time the node is rebooted, the gears you administratively stopped start again, and aside from the fact that you have to rediscover these problem gears, you are also now 10 gears over capacity.  In the worst case, you get into a state where the node is so far over capacity that the attempt to start all of the gears which are not stop-locked causes a crash, and the node reboots repeatedly until someone manually intervenes.


Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.33.4-1.el6oso.noarch

--- Additional comment from Jhon Honce on 2015-02-11 12:00:04 EST ---

.stop_lock is the flag for when a gear is stopped by the user/developer vs. an administrative stop. It is currently the only way a Node knows the difference.

--- Additional comment from Andy Grimm on 2015-02-11 13:33:07 EST ---

So far the only thing I'm aware of stop_lock being used for is to tell "startall" whether or not to start the gear when the node boots, and that is precisely the thing I am trying to affect here.  If stop_lock has other uses, then we need to add another flag and make startall key off the other flag (or a combination of the two flags).
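
To make the behavior under discussion concrete, here is a minimal Python sketch of a "startall"-style selection that skips stop-locked gears. It is an illustration only, not the oo-admin-ctl-gears implementation: the function names and the demo directory layout are hypothetical, and the lock location (<gear home>/app-root/runtime/.stop_lock) is assumed from the path shown in the verification comment on this bug.

```python
import os
import tempfile

# Assumption: the lock file lives at <gear home>/app-root/runtime/.stop_lock,
# as in the path shown in the verification comment on this bug.
STOP_LOCK_RELPATH = os.path.join("app-root", "runtime", ".stop_lock")


def is_stop_locked(gear_home):
    """Return True if the gear directory contains a .stop_lock file."""
    return os.path.exists(os.path.join(gear_home, STOP_LOCK_RELPATH))


def gears_to_start(base_dir):
    """Return the sorted names of gear directories that have no .stop_lock."""
    names = []
    for name in sorted(os.listdir(base_dir)):
        gear_home = os.path.join(base_dir, name)
        if os.path.isdir(gear_home) and not is_stop_locked(gear_home):
            names.append(name)
    return names


def make_demo_node():
    """Build a throwaway directory tree with three hypothetical gears."""
    base = tempfile.mkdtemp()
    for name, locked in (("gear-a", False), ("gear-b", True), ("gear-c", False)):
        runtime = os.path.join(base, name, "app-root", "runtime")
        os.makedirs(runtime)
        if locked:
            # An empty file is enough to mark the gear as stop-locked here.
            open(os.path.join(runtime, ".stop_lock"), "w").close()
    return base


if __name__ == "__main__":
    print(gears_to_start(make_demo_node()))
```

In this sketch only the presence of the file matters, which matches the description above of how "startall" treats .stop_lock.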

--- Additional comment from Andy Grimm on 2015-02-13 10:28:57 EST ---

In reference to our IRC conversation, I'm fine with adding content to the file to differentiate between administrative stops (potentially with a reason given for the stop) and user-initiated stops.

--- Additional comment from Jhon Honce on 2015-02-23 17:24:40 EST ---

Fixed in https://github.com/openshift/origin-server/pull/6083

--- Additional comment from openshift-github-bot on 2015-02-23 18:07:45 EST ---

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/e42f7414fa7426397683111ec5880fd520f251ab
Bug 1190856 - Allow Operator to stop gear with .stop_lock

--- Additional comment from Meng Bo on 2015-02-25 02:23:57 EST ---

Checked on devenv_5449: the .stop_lock file is generated with the given message as its content.

[root@ip-10-136-82-250 runtime]# cat .stop_lock 
TEST_MESSAGE

The gear can be stopped successfully with the new option, and can still be started by the user.

Moving bug to verified.

Comment 3 Anping Li 2015-03-17 09:25:16 UTC
Verified; passes on puddle-2-2-2015-03-16.

1) A new option, "stoplockgear", was added.
2) "oo-admin-ctl-gears stoplockgear" creates the .stop_lock file, and the message is written to it:
 oo-admin-ctl-gears  stoplockgear 5507c11f4add7185960000b2 --message " Create Lock File"

[root@node2 runtime]# cat /var/lib/openshift/5507c11f4add7185960000b2/app-root/runtime/.stop_lock 
 Create Lock File[root@node2 runtime]#
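
Because the message ends up as the plain content of .stop_lock, an operator could later list the stop-locked gears on a node together with the recorded reasons. The following Python sketch shows one way to do that. It is not part of oo-admin-ctl-gears: the helper names and demo gears are hypothetical, and the lock location is assumed from the path shown above. On a real node the base directory would presumably be /var/lib/openshift, as in the transcript.

```python
import os
import tempfile

# Assumption: same lock location as in the transcript above.
STOP_LOCK_RELPATH = os.path.join("app-root", "runtime", ".stop_lock")


def write_stop_lock(gear_home, message=""):
    """Create the lock file and store the message verbatim (hypothetical helper)."""
    os.makedirs(os.path.join(gear_home, "app-root", "runtime"), exist_ok=True)
    with open(os.path.join(gear_home, STOP_LOCK_RELPATH), "w") as f:
        f.write(message)


def stop_lock_report(base_dir):
    """Map each stop-locked gear name to the text stored in its .stop_lock."""
    report = {}
    for name in sorted(os.listdir(base_dir)):
        lock = os.path.join(base_dir, name, STOP_LOCK_RELPATH)
        if os.path.isfile(lock):
            with open(lock) as f:
                report[name] = f.read()
    return report


def build_example():
    """Three hypothetical gears: unlocked, locked with a message, locked without one."""
    base = tempfile.mkdtemp()
    os.makedirs(os.path.join(base, "gear-a", "app-root", "runtime"))
    write_stop_lock(os.path.join(base, "gear-b"), " Create Lock File")
    write_stop_lock(os.path.join(base, "gear-c"))
    return base


if __name__ == "__main__":
    for gear, reason in stop_lock_report(build_example()).items():
        print(gear, repr(reason))
```

The sketch reports the stored text as-is and does not try to classify a stop as administrative or user-initiated, since this bug does not specify what an empty lock file means.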

Comment 5 errata-xmlrpc 2015-04-06 17:06:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0779.html