Bug 974268 - oo-accept-node fails when gears are created / deleted while it's running
Summary: oo-accept-node fails when gears are created / deleted while it's running
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Rob Millner
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-06-13 19:39 UTC by Thomas Wiest
Modified: 2015-05-14 23:21 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-06-24 14:54:52 UTC
Target Upstream Version:
Embargoed:



Description Thomas Wiest 2013-06-13 19:39:44 UTC
Description of problem:
If a gear is created or deleted while oo-accept-node is running, oo-accept-node will error on parts of the new gear that are missing.

On ex-nodes that are either really slow, or have a lot of gear creates / deletes happening, this causes many transient errors.

Since we're monitoring oo-accept-node, this causes a lot of false alerts for us.

We need oo-accept-node to _only_ flag real problems, not transient changes.

This is especially bad on ex-nodes with thousands of gears, where oo-accept-node can take over 10 minutes to run.


Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.9.11-1.el6oso.noarch


How reproducible:
Very reproducible on either a really slow box where gear creates are happening, or a box with a lot of


Steps to Reproduce:
1. Load up a box with a lot of gears (like 4000+)
2. While running oo-accept-node, create or delete a gear on the system
3. Notice that this is flagged by oo-accept-node.


Actual results:
oo-accept-node flags gears that are either being created or deleted as errors.


Expected results:
oo-accept-node needs to only flag real problems.

Comment 1 Rob Millner 2013-06-14 23:13:32 UTC
Tested on my C9 node that was creating 4000 gears, 5 at a time.

https://github.com/openshift/origin-server/pull/2858

Comment 2 openshift-github-bot 2013-06-15 01:48:42 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/f2e95067fba4b8c55120043f1318d5d9250769c3
Bug 974268 - Squash error messages for gears which have been created or destroyed while the accept-node script is run.
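One way to picture the idea in this commit message is the rough sketch below. It is not the code from the commit: the function names, the marker file, and the use of a gear home directory's modification time are all assumptions made for illustration. The sketch records a start marker before checking and drops any failure whose gear path is missing or changed after that marker.

```shell
#!/bin/bash
# Hypothetical sketch; none of these names come from oo-accept-node itself.

# Return 0 (true) if the path is missing or was modified after the start marker.
changed_since_start() {
    local path=$1 start_marker=$2
    # A path that vanished mid-run is treated as a gear being deleted.
    [ -e "$path" ] || return 0
    # -nt is true when $path has a newer modification time than the marker.
    [ "$path" -nt "$start_marker" ]
}

# Print a FAIL line only when the gear path was stable for the whole run.
report_failure() {
    local gear_home=$1 start_marker=$2 message=$3
    if changed_since_start "$gear_home" "$start_marker"; then
        return 0   # transient: gear appeared, changed, or disappeared during the check
    fi
    echo "FAIL: $message"
}
```

A directory's modification time only changes when entries directly inside it are added or removed, so it is a coarse stand-in here for "this gear was touched during the run".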

Comment 3 Meng Bo 2013-06-17 10:22:16 UTC
Tested on devenv_3368 with the following method:


[root@ip-10-60-129-152 ~]# for i in `seq 1 100` ;do oo-app-create --with-app-uuid 123123$i --with-container-uuid 123123$i --with-namespace dom1  --with-app-name app$i & done

While the oo-app-create commands are running, use oo-accept-node to check for transient issues:

[root@ip-10-60-129-152 ~]# oo-accept-node 
FAIL: user 12312351 does not have quotas imposed
FAIL: user 12312381 does not have quotas imposed
2 ERRORS
[root@ip-10-60-129-152 ~]# oo-accept-node 
PASS

It reports the gear issues on the first run, and PASSes on the following try.

Assigning the bug back.

Comment 4 Rob Millner 2013-06-17 21:33:16 UTC
Narrowed down the set of places where the user list and quotas can get out of sync.  Also now using the lock file from unix_user.rb as another way to determine if a gear create/delete ran.

Used the above script, and its mirror image with oo-app-destroy, in a loop 10 times.  The oo-accept-node script running in a loop no longer fails with the following pull request.

https://github.com/openshift/origin-server/pull/2867
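The lock file check described in this comment might look roughly like the sketch below. It is only an illustration under stated assumptions: the lock file naming (`<uuid>.lock`), the lock directory, the flat user and quota list files, and both function names are hypothetical and are not taken from unix_user.rb or the pull request. The FAIL text mirrors the output shown in Comment 3. The start marker is assumed to be created before either list is collected.

```shell
#!/bin/bash
# Hypothetical sketch; lock file naming and file layout are assumptions.

# True if a lock file for the gear exists and is newer than the start marker,
# i.e. a gear create/delete took the lock after the check began.
locked_during_run() {
    local uuid=$1 lock_dir=$2 start_marker=$3
    local lock_file="$lock_dir/$uuid.lock"   # assumed naming scheme
    [ -e "$lock_file" ] && [ "$lock_file" -nt "$start_marker" ]
}

# Print a FAIL line for each user without a quota entry, skipping gears
# whose lock file shows create/delete activity during the run.
users_missing_quota() {
    local users_file=$1 quotas_file=$2 lock_dir=$3 start_marker=$4
    local uuid
    while IFS= read -r uuid; do
        # User has a quota entry: nothing to report.
        grep -qxF -- "$uuid" "$quotas_file" && continue
        # Gear was being created or deleted while we ran: not a real problem.
        locked_during_run "$uuid" "$lock_dir" "$start_marker" && continue
        echo "FAIL: user $uuid does not have quotas imposed"
    done < "$users_file"
}
```

Creating the marker before collecting the user list and the quota data matters: any gear operation that overlaps either collection then leaves a lock file newer than the marker.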

Comment 5 openshift-github-bot 2013-06-18 02:18:24 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/3b2d0950c41dc82436a89224f57b16773e042e80
Bug 974268 - Narrow the window where user and quota data can get out of sync and set the start time prior to any other collection.  Deal with a race condition with the lock files in unix_user.

Comment 6 Meng Bo 2013-06-18 07:08:31 UTC
Checked on devenv_3375.

oo-accept-node no longer reports errors for either oo-app-create or oo-app-destroy when multiple operations run in parallel.


Moving bug to verified.

