974268 – oo-accept-node fails when gears are created / deleted while it's running

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Online
Classification:	Red Hat
Component:	Containers
Sub Component:
Version:	2.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Rob Millner
QA Contact:	libra bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-06-13 19:39 UTC by Thomas Wiest
Modified:	2015-05-14 23:21 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-06-24 14:54:52 UTC
Target Upstream Version:
Embargoed:

Description Thomas Wiest 2013-06-13 19:39:44 UTC

Description of problem:
If a gear is created or deleted while oo-accept-node is running, oo-accept-node will error on parts of the new gear that are missing.

On ex-nodes that are either really slow, or have a lot of gear creates / deletes happening, this causes many transient errors.

Since we're monitoring oo-accept-node, this causes a lot of false alerts for us.

We need oo-accept-node to _only_ flag real problems, not transient changes.

This is especially bad on ex-nodes with thousands of gears, where oo-accept-node can take over 10 minutes to run.


Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.9.11-1.el6oso.noarch


How reproducible:
Very reproducible, either on a really slow box where gear creates are happening, or on a box with a lot of gear creates / deletes happening.


Steps to Reproduce:
1. Load up a box with a lot of gears (like 4000+)
2. While running oo-accept-node, create or delete a gear on the system
3. Notice that this is flagged by oo-accept-node.


Actual results:
oo-accept-node flags gears that are either being created or deleted as errors.


Expected results:
oo-accept-node needs to only flag real problems.

Comment 1 Rob Millner 2013-06-14 23:13:32 UTC

Tested on my C9 node that was creating 4000 gears, 5 at a time.

https://github.com/openshift/origin-server/pull/2858

Comment 2 openshift-github-bot 2013-06-15 01:48:42 UTC

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/f2e95067fba4b8c55120043f1318d5d9250769c3
Bug 974268 - Squash error messages for gears which have been created or destroyed while the accept-node script is run.
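A minimal sketch of one way to squash such messages, not the code from the commit above: snapshot the set of gear UUIDs before and after the checks, and report a failure only if the gear appears in both snapshots. The function name and the sample UUIDs are hypothetical; the message text is the one oo-accept-node prints for missing quotas.

```python
def stable_failures(failures, gears_before, gears_after):
    """Keep only failures for gears that existed for the whole run.

    failures     -- list of (gear_uuid, message) tuples
    gears_before -- gear UUIDs seen when the run started
    gears_after  -- gear UUIDs seen when the run finished
    """
    # A gear created mid-run is missing from gears_before; a gear
    # deleted mid-run is missing from gears_after. Either way it is
    # excluded from the intersection.
    stable = set(gears_before) & set(gears_after)
    return [(uuid, msg) for uuid, msg in failures if uuid in stable]


# Hypothetical data: gear 12312351 was created while the checks ran.
before = {"1001", "1002"}
after = {"1001", "1002", "12312351"}
failures = [
    ("12312351", "does not have quotas imposed"),
    ("1002", "does not have quotas imposed"),
]
remaining = stable_failures(failures, before, after)
print(remaining)
```

Only the failure for gear 1002 survives, because that gear was present at both ends of the run. This approach does not catch a gear that is deleted and recreated under the same UUID during the run.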

Comment 3 Meng Bo 2013-06-17 10:22:16 UTC

Tested on devenv_3368 with the following method:


[root@ip-10-60-129-152 ~]# for i in `seq 1 100` ;do oo-app-create --with-app-uuid 123123$i --with-container-uuid 123123$i --with-namespace dom1  --with-app-name app$i & done

While the oo-app-create commands are running, use oo-accept-node to check for transient issues.

[root@ip-10-60-129-152 ~]# oo-accept-node 
FAIL: user 12312351 does not have quotas imposed
FAIL: user 12312381 does not have quotas imposed
2 ERRORS
[root@ip-10-60-129-152 ~]# oo-accept-node 
PASS

It reports the gear issues on the first run, and PASSes on the following try.

Assigning the bug back.

Comment 4 Rob Millner 2013-06-17 21:33:16 UTC

Narrowed down the set of places where the user list and quotas can get out of sync.  Also now using the lock file from unix_user.rb as another way to determine if a gear create/delete ran.

Used the above script, and its mirror image with oo-app-destroy, in a loop 10 times.  The oo-accept-node script running in a loop no longer fails with the following pull request.

https://github.com/openshift/origin-server/pull/2867
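The lock-file idea described in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the pull request: the lock directory, the `<uuid>.lock` naming, and both function names are hypothetical, and the real lock files are managed by unix_user.rb. The start time is recorded before any data is collected, so a lock file touched at or after that time marks the gear as created or deleted during the run.

```python
import os
import tempfile
import time


def changed_during_run(uuid, start_time, lock_dir):
    """True if the gear's lock file was touched at or after start_time."""
    lock_path = os.path.join(lock_dir, uuid + ".lock")  # hypothetical naming
    try:
        return os.path.getmtime(lock_path) >= start_time
    except OSError:
        # No lock file (or it vanished mid-check): no evidence of a
        # create/delete, so the failure is treated as real.
        return False


def filter_transient(failures, start_time, lock_dir):
    """Drop failures for gears whose lock file changed during the run."""
    return [(uuid, msg) for uuid, msg in failures
            if not changed_during_run(uuid, start_time, lock_dir)]


with tempfile.TemporaryDirectory() as lock_dir:
    start_time = time.time()  # recorded before any other collection

    # "new1": lock touched after the run started (gear create in flight).
    new_lock = os.path.join(lock_dir, "new1.lock")
    open(new_lock, "w").close()
    os.utime(new_lock, (start_time + 5, start_time + 5))

    # "old1": lock last touched an hour before the run started.
    old_lock = os.path.join(lock_dir, "old1.lock")
    open(old_lock, "w").close()
    os.utime(old_lock, (start_time - 3600, start_time - 3600))

    failures = [
        ("new1", "does not have quotas imposed"),
        ("old1", "does not have quotas imposed"),
        ("old2", "does not have quotas imposed"),  # no lock file at all
    ]
    real = filter_transient(failures, start_time, lock_dir)

print(real)
```

Here the failure for new1 is dropped, while old1 and old2 are still reported. A missing lock file is deliberately treated as "not transient", which is why the lock file works only as an additional signal alongside other checks.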

Comment 5 openshift-github-bot 2013-06-18 02:18:24 UTC

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/3b2d0950c41dc82436a89224f57b16773e042e80
Bug 974268 - Narrow the window where user and quota data can get out of sync and set the start time prior to any other collection.  Deal with a race condition with the lock files in unix_user.

Comment 6 Meng Bo 2013-06-18 07:08:31 UTC

Checked on devenv_3375,

oo-accept-node no longer reports errors for either oo-app-create or oo-app-destroy when multiple operations are run in parallel.


Moving bug to verified.
