Description of problem:
There are many gears we have found that have a .stop_lock file but their .state file shows started. Also, they don't have any running processes.

# ll /var/lib/openshift/4cd359f0852e4146ba8a6600de5f213b/app-root/runtime/.stop_lock
-rw-r--r--. 1 4cd359f0852e4146ba8a6600de5f213b 4cd359f0852e4146ba8a6600de5f213b 0 Jun 5 16:49 /var/lib/openshift/4cd359f0852e4146ba8a6600de5f213b/app-root/runtime/.stop_lock
# cat /var/lib/openshift/4cd359f0852e4146ba8a6600de5f213b/app-root/runtime/.state
started
# ps -u 4cd359f0852e4146ba8a6600de5f213b
PID TTY TIME CMD
#

This is causing alerts to go off because we show a high number of gears in the "started" state, but without any processes.

NOTE: As part of the fix for this bug, please add a check for this to oo-accept-node so that bugs like this are found with the OpenShift unit tests.

Version-Release number of selected component (if applicable):
rhc-node-1.13.6-1.el6oso.x86_64

How reproducible:
unknown, found in PROD

Steps to Reproduce:
1. unknown, found in PROD

Actual results:
Gears with a .stop_lock file and "started" in the .state file.

Expected results:
Gears with .stop_lock file should always be in a "stopped" state.
Any change of state, i.e., starting the application, will reset the values correctly.

Are these V1 gears upgraded to V2?
(In reply to Jhon Honce from comment #1)
> Any change of state, ie starting the application will reset the values
> correctly.
>

The .state file is updated when stopping/starting one of these gears.

> Are these V1 gears upgraded to V2?

On the host I looked at, all of the gears in this state have been migrated from V1 to V2. To tell whether a gear was migrated to V2, I checked for the existence of .env/CARTRIDGE_VERSION_2.
This has only been found on V1-created applications. Resolving the issue would require writing a script to traverse all gears on a Node and ensure that .stop_lock and .state are consistent.
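A minimal sketch of such a consistency scan, in Python, assuming the gear layout shown in the description (/var/lib/openshift/<gear uuid>/app-root/runtime/). The function name find_inconsistent_gears is hypothetical, and the sketch only reports the combination seen in this bug (.stop_lock present while .state reads "started"); it does not change anything on disk.

```python
from pathlib import Path


def find_inconsistent_gears(base_dir="/var/lib/openshift"):
    """Return the names of gears that have a .stop_lock file while
    their .state file says "started" (the mismatch from this bug).

    Hypothetical helper: the directory layout is taken from the paths
    in the bug description, not from the node tooling itself.
    """
    base = Path(base_dir)
    if not base.is_dir():
        return []

    inconsistent = []
    for gear in sorted(base.iterdir()):
        runtime = gear / "app-root" / "runtime"
        stop_lock = runtime / ".stop_lock"
        state_file = runtime / ".state"
        # Skip anything that is not a gear or has no stop lock.
        if not stop_lock.exists() or not state_file.is_file():
            continue
        if state_file.read_text().strip() == "started":
            inconsistent.append(gear.name)
    return inconsistent


if __name__ == "__main__":
    for name in find_inconsistent_gears():
        print("inconsistent gear:", name)
```

Run as root on a node, this would print one line per affected gear; what to do with each one (fix .state or remove the lock) is left to the operator.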
I have modified watchman to detect and correct this when it happens.
Commit pushed to master at https://github.com/openshift/li

https://github.com/openshift/li/commit/300f8aaa02a1d3631f2c181e633c276a526d9c17
Fix bug 1006557: make watchman check for hanging stop_lock files
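For illustration only, the corrective behaviour could be sketched in Python as below. This is not the code from the commit above; it is an assumption based on the watchman log line quoted in the verification comment ("deleted stop lock ... because the state of the gear was STARTED"). The function name clear_stale_stop_lock is hypothetical.

```python
from pathlib import Path


def clear_stale_stop_lock(gear_dir):
    """Delete .stop_lock if the gear's .state file says "started".

    Returns True when a lock was removed, False otherwise.
    Hypothetical sketch; the real watchman logic lives in the commit.
    """
    runtime = Path(gear_dir) / "app-root" / "runtime"
    stop_lock = runtime / ".stop_lock"
    state_file = runtime / ".state"

    # Nothing to do without both files.
    if not (stop_lock.exists() and state_file.is_file()):
        return False

    # Leave stopped gears (and any other state) untouched.
    if state_file.read_text().strip().lower() != "started":
        return False

    stop_lock.unlink()
    return True
```

A stopped gear keeps its lock under this sketch, which matches the expectation that only gears marked as started are corrected.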
Checked on devenv_4003.

1. Touch .stop_lock for a started gear:

[root@ip-10-239-15-120 runtime]# ls -al
total 28
drwxr-x---. 5 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov 8 01:59 .
drwxr-xr-x. 4 root 527c7e73aef9e993f300000c 4096 Nov 8 01:02 ..
drwxr-x---. 2 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov 8 01:02 build-dependencies
lrwxrwxrwx. 1 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 7 Nov 8 01:02 data -> ../data
drwxr-x---. 3 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov 8 01:02 dependencies
drwxr-x---. 6 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov 8 01:04 repo
-rw-r-----. 1 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 8 Nov 8 01:28 .state
-rw-r-----. 1 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 8 Nov 8 01:28 .stop_lock

2. Check /var/log/messages for the rhc-watchman log. Found:

Nov 8 02:00:10 ip-10-239-15-120 rhc-watchman[2051]: watchman deleted stop lock for user 527c7e73aef9e993f300000c because the state of the gear was STARTED

3. Check the gear dir again; the .stop_lock has been removed automatically:

[root@ip-10-239-15-120 runtime]# ls -al
total 24
drwxr-x---. 5 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov 8 02:00 .
drwxr-xr-x. 4 root 527c7e73aef9e993f300000c 4096 Nov 8 01:02 ..
drwxr-x---. 2 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov 8 01:02 build-dependencies
lrwxrwxrwx. 1 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 7 Nov 8 01:02 data -> ../data
drwxr-x---. 3 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov 8 01:02 dependencies
drwxr-x---. 6 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 4096 Nov 8 01:04 repo
-rw-r-----. 1 527c7e73aef9e993f300000c 527c7e73aef9e993f300000c 8 Nov 8 01:28 .state

A stopped gear is not impacted.

Moving bug to verified.