Bug 1061926

Summary: Gears idle but not in idlerdb
Product: OpenShift Online
Reporter: Andy Grimm <agrimm>
Component: Containers
Assignee: Jhon Honce <jhonce>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 2.x
CC: bmeng, corinthianmonthly, jgoulding, jhonce, lzhang
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-05-15 15:28:09 UTC
Type: Bug
Attachments:
Description: Error Log When Trying To Restart An Idled Application
Flags: none

Description Andy Grimm 2014-02-05 21:38:38 UTC
Description of problem:

We had a situation today where two gears on a node had "idle" in their state file, but they also had a .stop_lock file, and they were not in the idlerdb.  The net effect of this is that the app will not unidle when accessed.

I worked around the problem by removing the .stop_lock file and running "oo-admin-ctl-gears idlegear <uuid>", which then added the gear to the idlerdb.

I'm still researching how widespread this problem is in our environment.

Version-Release number of selected component (if applicable):

openshift-origin-node-util-1.18.8-1.el6oso.noarch

Comment 1 Jhon Honce 2014-02-07 18:25:05 UTC
Please update with your research.

Comment 2 Andy Grimm 2014-02-07 19:46:12 UTC
Research shows that over 99% of our idle apps currently have a .stop_lock file.
Approximately 1135 of them are also not in their node's idler.txt, which means they will not properly unidle.
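
The audit described above (idle gears that are missing from the node's idler.txt) could be scripted roughly as follows. This is only a sketch under stated assumptions, not the tooling used in this bug: the state file is assumed to be app-root/runtime/.state under each gear directory, idler.txt is assumed to hold one whitespace-separated entry per line with the lookup key first, and key_for_gear is a hypothetical hook for mapping a gear UUID to whatever key the file really uses (for example an app hostname).

```python
import os
from typing import Callable, List, Set


def load_idler_keys(idler_txt: str) -> Set[str]:
    """Collect the first token of every non-empty, non-comment line."""
    keys = set()
    with open(idler_txt) as fh:
        for line in fh:
            parts = line.split()
            if parts and not parts[0].startswith("#"):
                keys.add(parts[0])
    return keys


def read_state(gear_dir: str) -> str:
    """Return the gear's recorded state, or '' if no state file exists.

    ASSUMPTION: the state file lives at app-root/runtime/.state.
    """
    path = os.path.join(gear_dir, "app-root", "runtime", ".state")
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return ""


def find_unlisted_idle_gears(
    gear_base: str,
    idler_keys: Set[str],
    key_for_gear: Callable[[str], str] = lambda uuid: uuid,
) -> List[str]:
    """List gear UUIDs whose state is 'idle' but whose key is not in idler_keys.

    key_for_gear is a hypothetical hook; by default the UUID itself is the key.
    """
    missing = []
    for uuid in sorted(os.listdir(gear_base)):
        gear_dir = os.path.join(gear_base, uuid)
        if not os.path.isdir(gear_dir):
            continue
        if read_state(gear_dir) == "idle" and key_for_gear(uuid) not in idler_keys:
            missing.append(uuid)
    return missing
```

Both the gear base directory and the idler.txt path are passed in by the caller, so the sketch hard-codes no node paths.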

Comment 3 Lei Zhang 2014-02-10 08:03:12 UTC
Tested on devenv_4352. After idling one gear with 'oo-admin-ctl-gears idlegear $gearid', the command prints 'Idling gear $gearid... [OK]', but visiting the app in a browser (http://myjbossews10-chunchen.dev.rhcloud.com/) shows "The page isn't redirecting properly" and the gear stays in idle status. Unidling the gear with 'oo-admin-ctl-gears unidlegear $gearid' then fails with an error message:


[root@ip-10-225-9-204 ~]# oo-admin-ctl-gears idlegear 52f87f331de5448356000258
Idling gear 52f87f331de5448356000258 ... [ OK ]
[root@ip-10-225-9-204 ~]# oo-admin-ctl-gears unidlegear 52f87f331de5448356000258
error: Gear is locked: 52f87f331de5448356000258. Use --trace to view backtrace

Comment 4 Jhon Honce 2014-02-10 18:22:38 UTC
(In reply to Andy Grimm from comment #2)
> Research shows that over 99% of our idle apps currently have a .stop_lock
> file.
> Approximately 1135 of them are also not in their node's idler.txt, which
> means they will not properly unidle.

Idled gears _should_ have a stop_lock and be in idler.txt.

Comment 5 Andy Grimm 2014-02-13 00:59:23 UTC
The problem appears to be that gear moves do not adjust the idler database on the target system.

Comment 6 Andy Grimm 2014-02-13 01:28:42 UTC
Never mind my previous comment.  The pattern just looked suspicious because I picked a newer node to look at, where the majority of the idle gears not in the idler db happened to have been moved from elsewhere.

Comment 8 corinthianmonthly 2014-03-13 13:45:11 UTC
Created attachment 874003 [details]
Error Log When Trying To Restart An Idled Application

Comment 9 Jhon Honce 2014-03-18 23:25:00 UTC
Fix in https://github.com/openshift/origin-server/pull/4996

Comment 10 openshift-github-bot 2014-03-19 00:53:56 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/7e08d4d4fc40ce39566b870a06f5eb22c5d1cf3f
Bug 1061926 - Ensure frontend unidled if backend unidled

Comment 11 Meng Bo 2014-03-20 12:11:37 UTC
Checked on devenv_4540. After a gear is idled by the auto-idler, it has a .stop_lock file under app-root/runtime/ and its info can be found in idler.txt and idler.db.

After accessing the app's DNS name, the gear info was cleared from idler.txt and idler.db, and .stop_lock was also removed from app-root/runtime/.

Moving bug to verified.

Comment 12 Andy Grimm 2014-03-20 13:25:54 UTC
I apologize that my mention of the stop_lock file may have confused the issue here somewhat, but the real problem that we need to address here is not the stop_lock file.  It is that, on idle, a gear sometimes does not end up in the idlerdb.  If a gear is not in the idlerdb, then the openshift-idler rewrite map will not redirect requests for it to the restorer script.
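
To make the failure mode concrete, here is a toy model of the lookup described above. It is not the real Apache rewrite configuration. The map contents, the hostnames, and the returned strings are all illustrative assumptions.

```python
from typing import Dict, Set


def handle_request(host: str, idler_map: Dict[str, str], running: Set[str]) -> str:
    """Toy model of frontend routing for one request.

    idler_map stands in for the idler rewrite map (key -> gear UUID) and
    running is the set of hosts whose backend is up. Both are hypothetical
    stand-ins, not the real data structures.
    """
    if host in idler_map:
        # Listed as idled: hand the request to the restorer, which unidles.
        return "redirect to restorer for gear " + idler_map[host]
    if host in running:
        return "proxy to gear"
    # Stopped backend with no idler entry: nothing will ever wake the gear.
    return "error: backend not running"


# A gear that was idled but never recorded in the map stays unreachable.
idler_map = {"listed.example.com": "52f87f331de5448356000258"}
print(handle_request("listed.example.com", idler_map, set()))
print(handle_request("unlisted.example.com", idler_map, set()))
```

The second call shows the state reported in this bug: the gear is stopped, but because it has no map entry the request is never sent to the restorer.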

Comment 13 Jhon Honce 2014-03-20 16:45:38 UTC
Andy, the code change commit addresses your issue and has nothing to do with stop lock.  If the issue resurfaces after this code is deployed, please reopen.

Comment 14 Andy Grimm 2014-04-14 18:58:09 UTC
Reopening, as we are back to having 91 gears in this state in OpenShift Online.

Comment 16 Andy Grimm 2014-04-14 19:27:52 UTC
I suspect that the most reliable way to reproduce this is to create an app with the following properties:

1) the app pings itself via the frontend apache server every second
2) the app's control script sleeps a few seconds in its "stop" function before it actually shuts down the web server process
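
The two properties above could be approximated with something like the following. This is a sketch only: a real cartridge control script is normally a shell script, so slow_stop is a Python stand-in for its "stop" function, and the URL, the delay, and the function names are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List, Optional


def fetch_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status for url, or None if the request failed outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except OSError:
        return None


def self_ping(
    url: str,
    count: int,
    interval: float = 1.0,
    fetch: Callable[[str], Optional[int]] = fetch_status,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Optional[int]]:
    """Property 1: request the app's own public URL once per interval."""
    results = []
    for i in range(count):
        results.append(fetch(url))
        if i < count - 1:
            sleep(interval)
    return results


def slow_stop(
    stop_server: Callable[[], None],
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Property 2: wait a few seconds before actually stopping the server."""
    sleep(delay)
    stop_server()
```

The idea is that while slow_stop is still sleeping during an idle, a self-ping can arrive through the frontend and trigger an unidle, which would give the overlap suspected here.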

Comment 17 Jhon Honce 2014-04-15 19:08:31 UTC
Lock file fix introduced in https://github.com/openshift/origin-server/pull/5274

Comment 18 openshift-github-bot 2014-04-16 18:26:00 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/3b94f0e7085940f283a86a0eb0683947a934a9cc
Bug 1061926 - Use lock file to prevent race between idle/unidle

Comment 19 Meng Bo 2014-04-17 11:44:34 UTC
Checked on devenv_4692: the gear.{uuid} lock file is now used for gear idle/unidle.

Ran idle/unidle in parallel for 4 gears. A gear.{uuid} lock was created for each gear, and every gear's entry appeared in idler.txt after idle and was removed after unidle.

Moving bug to verified.


# ls /var/lock/
gear.534fe0e7fd382ff534000085
gear.534fe10bfd382ff534000099
gear.534fe128fd382ff5340000ad
gear.534fe14cfd382ff5340000c1
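
For readers unfamiliar with the technique, a per-gear lock file that serializes idle and unidle can be sketched as below. This is not the code from the pull request (origin-server is written in Ruby). It only illustrates the general pattern with flock, using the gear.{uuid} naming seen in the listing above. The idler_entries set is a hypothetical stand-in for the idler database.

```python
import fcntl
import os
from contextlib import contextmanager
from typing import Iterator, Set


@contextmanager
def gear_lock(uuid: str, lock_dir: str = "/var/lock") -> Iterator[str]:
    """Hold an exclusive flock on <lock_dir>/gear.<uuid> for the with-block."""
    path = os.path.join(lock_dir, "gear." + uuid)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no one else holds it
        yield path
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def idle_gear(uuid: str, idler_entries: Set[str], lock_dir: str) -> None:
    """Hypothetical idle: stop the gear and record it while holding the lock."""
    with gear_lock(uuid, lock_dir):
        # ... stop the gear here ...
        idler_entries.add(uuid)


def unidle_gear(uuid: str, idler_entries: Set[str], lock_dir: str) -> None:
    """Hypothetical unidle: start the gear and clear its entry under the lock."""
    with gear_lock(uuid, lock_dir):
        # ... start the gear here ...
        idler_entries.discard(uuid)
```

Because both operations take the same per-gear lock, an unidle triggered during an idle waits until the idle has finished recording the gear. In this sketch the lock file is left in place after release, so finding gear.{uuid} files in the lock directory does not by itself mean a lock is held.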