Bug 1006557

Summary: stop_lock gears have status of "started"
Product: OpenShift Online
Component: Image
Version: 2.x
Status: CLOSED CURRENTRELEASE
Severity: low
Priority: medium
Hardware: Unspecified
OS: Unspecified
Reporter: Matt Woodson <mwoodson>
Assignee: Paul Morie <pmorie>
QA Contact: libra bugs <libra-bugs>
CC: bmeng, chunchen, dmcphers, mwoodson, pmorie, twiest, xtian
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-01-30 00:48:39 UTC
Bug Blocks: 1207486

Description Matt Woodson 2013-09-10 19:54:26 UTC
Description of problem:

We have found many gears that have a .stop_lock file but whose .state file shows "started". These gears also have no running processes.

# ll /var/lib/openshift/4cd359f0852e4146ba8a6600de5f213b/app-root/runtime/.stop_lock
-rw-r--r--. 1 4cd359f0852e4146ba8a6600de5f213b 4cd359f0852e4146ba8a6600de5f213b 0 Jun  5 16:49 /var/lib/openshift/4cd359f0852e4146ba8a6600de5f213b/app-root/runtime/.stop_lock

# cat /var/lib/openshift/4cd359f0852e4146ba8a6600de5f213b/app-root/runtime/.state 
started

# ps -u 4cd359f0852e4146ba8a6600de5f213b
  PID TTY          TIME CMD
# 

This is causing alerts to go off because we show a high number of gears in the "started" state, but without any processes.
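
The condition behind the alert (state "started", no processes owned by the gear user) can be sketched as below. This is only an illustration and not the monitoring check actually in use: the function names are hypothetical, and counting numeric /proc entries by owner is an assumed stand-in for `ps -u`.

```python
import os


def read_state(runtime_dir):
    """Return the contents of .state, stripped and lowercased, or None."""
    try:
        with open(os.path.join(runtime_dir, ".state")) as fh:
            return fh.read().strip().lower()
    except OSError:
        return None


def count_processes(uid, proc_root="/proc"):
    """Count numeric /proc entries owned by uid (roughly what `ps -u` lists)."""
    count = 0
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            if os.stat(os.path.join(proc_root, name)).st_uid == uid:
                count += 1
        except OSError:
            # The process exited between listdir() and stat().
            continue
    return count


def started_without_processes(runtime_dir, uid, proc_root="/proc"):
    """True when .state says 'started' but the gear user owns no processes."""
    return read_state(runtime_dir) == "started" and count_processes(uid, proc_root) == 0
```

For the gear above, `runtime_dir` would be its app-root/runtime directory and `uid` the numeric ID of the gear user.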


NOTE: As part of the fix for this bug, please add a check for this to oo-accept-node so that bugs like this are found with the OpenShift unit tests.



Version-Release number of selected component (if applicable):
rhc-node-1.13.6-1.el6oso.x86_64


How reproducible:
unknown, found in PROD


Steps to Reproduce:
1. unknown, found in PROD


Actual results:
Gears with a .stop_lock file and "started" in the .state file.


Expected results:
Gears with a .stop_lock file should always be in the "stopped" state.

Comment 1 Jhon Honce 2013-09-10 20:47:17 UTC
Any change of state, i.e. starting the application, will reset the values correctly.

Are these V1 gears upgraded to V2?

Comment 2 Matt Woodson 2013-09-10 21:03:09 UTC
(In reply to Jhon Honce from comment #1)
> Any change of state, ie starting the application will reset the values
> correctly.
> 

The .state file is updated when stopping or starting one of these gears.



> Are these V1 gears upgraded to V2?

On the host I looked at, all of the gears in this state have been migrated from V1 to V2. To confirm that a gear had been migrated to V2, I checked for the existence of .env/CARTRIDGE_VERSION_2.

Comment 3 Jhon Honce 2013-09-10 21:18:46 UTC
This has only been found on applications created as V1.

Resolving the issue would require writing a script to traverse all gears on a node and ensure that .stop_lock and .state are consistent.
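
A minimal sketch of such a traversal, in Python, might look like the following. It only reports gears; it does not change anything. The base directory and file locations are taken from the paths in the description, the .env/CARTRIDGE_VERSION_2 marker from comment 2, and the function name is hypothetical. "Inconsistent" here means only the case reported in this bug (a .stop_lock present while .state reads "started").

```python
import os

# Paths relative to a gear directory, as shown in the bug description.
STATE_FILE = os.path.join("app-root", "runtime", ".state")
STOP_LOCK = os.path.join("app-root", "runtime", ".stop_lock")
V2_MARKER = os.path.join(".env", "CARTRIDGE_VERSION_2")


def find_inconsistent_gears(base_dir="/var/lib/openshift"):
    """Yield (gear_dir, state, is_v2) for gears with a stop lock but state 'started'."""
    for name in sorted(os.listdir(base_dir)):
        gear_dir = os.path.join(base_dir, name)
        # Skip symlinks and plain files; only real directories can be gears.
        if os.path.islink(gear_dir) or not os.path.isdir(gear_dir):
            continue
        try:
            with open(os.path.join(gear_dir, STATE_FILE)) as fh:
                state = fh.read().strip().lower()
        except OSError:
            continue  # no readable .state, so not treated as a gear
        locked = os.path.exists(os.path.join(gear_dir, STOP_LOCK))
        if locked and state == "started":
            is_v2 = os.path.exists(os.path.join(gear_dir, V2_MARKER))
            yield gear_dir, state, is_v2
```

Calling `list(find_inconsistent_gears())` as root on a node would give the affected gear directories, along with whether each carries the V2 marker.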

Comment 4 Paul Morie 2013-11-05 19:03:21 UTC
I have modified watchman to detect and correct this when it happens.
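
For illustration only, the correction can be sketched in Python as below. This is not the watchman code from the commit. It assumes the behaviour suggested by the watchman log message quoted in comment 7 (the stop lock is deleted when the gear state is "started"). The function name is hypothetical, and the real check may take further conditions into account.

```python
import os


def remove_stale_stop_lock(gear_dir):
    """Delete .stop_lock when .state reads 'started'. Return True if it was removed."""
    runtime = os.path.join(gear_dir, "app-root", "runtime")
    lock = os.path.join(runtime, ".stop_lock")
    try:
        with open(os.path.join(runtime, ".state")) as fh:
            state = fh.read().strip().lower()
    except OSError:
        return False  # no readable state, so leave the gear alone
    if state != "started" or not os.path.exists(lock):
        return False  # stopped gears keep their lock; nothing to do without one
    os.remove(lock)
    return True
```

A gear whose .state reads "stopped" is left untouched, which matches the expectation that stopped gears keep their .stop_lock.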

Comment 5 openshift-github-bot 2013-11-05 21:09:09 UTC
Commit pushed to master at https://github.com/openshift/li

https://github.com/openshift/li/commit/300f8aaa02a1d3631f2c181e633c276a526d9c17
Fix bug 1006557: make watchman check for hanging stop_lock files

Comment 7 Meng Bo 2013-11-08 07:04:54 UTC
Checked on devenv_4003,

1. Touch .stop_lock for a started gear,
[root@ip-10-239-15-120 runtime]# ls -al
total 28
drwxr-x---. 5 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov  8 01:59 .
drwxr-xr-x. 4 root                     527c7e73aef9e993f300000c 4096 Nov  8 01:02 ..
drwxr-x---. 2 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov  8 01:02 build-dependencies
lrwxrwxrwx. 1 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c    7 Nov  8 01:02 data -> ../data
drwxr-x---. 3 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov  8 01:02 dependencies
drwxr-x---. 6 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov  8 01:04 repo
-rw-r-----. 1 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c    8 Nov  8 01:28 .state
-rw-r-----. 1 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c    8 Nov  8 01:28 .stop_lock


2. Check the /var/log/messages for the rhc-watchman log
Found:
Nov  8 02:00:10 ip-10-239-15-120 rhc-watchman[2051]: watchman deleted stop lock for user 527c7e73aef9e993f300000c because the state of the gear was STARTED

3. Check the gear's runtime directory again; the .stop_lock has been removed automatically.
[root@ip-10-239-15-120 runtime]# ls -al
total 24
drwxr-x---. 5 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov  8 02:00 .
drwxr-xr-x. 4 root                     527c7e73aef9e993f300000c 4096 Nov  8 01:02 ..
drwxr-x---. 2 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov  8 01:02 build-dependencies
lrwxrwxrwx. 1 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c    7 Nov  8 01:02 data -> ../data
drwxr-x---. 3 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov  8 01:02 dependencies
drwxr-x---. 6 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov  8 01:04 repo
-rw-r-----. 1 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c    8 Nov  8 01:28 .state


Stopped gears are not impacted.

Move bug to verified.