Bug 974268 - oo-accept-node fails when gears are created / deleted while it's running
Status: CLOSED CURRENTRELEASE
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Assigned To: Rob Millner
QA Contact: libra bugs
Depends On:
Blocks:
Reported: 2013-06-13 15:39 EDT by Thomas Wiest
Modified: 2015-05-14 19:21 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-06-24 10:54:52 EDT
Type: Bug


Attachments: None
Description Thomas Wiest 2013-06-13 15:39:44 EDT
Description of problem:
If a gear is created or deleted while oo-accept-node is running, oo-accept-node reports errors for the parts of that gear that do not exist yet (or no longer exist).

On ex-nodes that are either really slow, or have a lot of gear creates / deletes happening, this causes many transient errors.

Since we're monitoring oo-accept-node, this causes a lot of false alerts for us.

We need oo-accept-node to _only_ flag real problems, not transient changes.

This is especially bad on ex-nodes with thousands of gears, where oo-accept-node can take over 10 minutes to run.


Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.9.11-1.el6oso.noarch


How reproducible:
Very on either a really slow box where gear creates are happening, or a box with a lot of 


Steps to Reproduce:
1. Load up a box with a lot of gears (like 4000+)
2. While running oo-accept-node, create or delete a gear on the system
3. Notice that this is flagged by oo-accept-node.


Actual results:
oo-accept-node flags gears that are either being created or deleted as errors.


Expected results:
oo-accept-node needs to only flag real problems.
Comment 1 Rob Millner 2013-06-14 19:13:32 EDT
Tested on my C9 node that was creating 4000 gears, 5 at a time.

https://github.com/openshift/origin-server/pull/2858
Comment 2 openshift-github-bot 2013-06-14 21:48:42 EDT
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/f2e95067fba4b8c55120043f1318d5d9250769c3
Bug 974268 - Squash error messages for gears which have been created or destroyed while the accept-node script is run.
Comment 3 Meng Bo 2013-06-17 06:22:16 EDT
Tested on devenv_3368 with the following method:


[root@ip-10-60-129-152 ~]# for i in `seq 1 100` ;do oo-app-create --with-app-uuid 123123$i --with-container-uuid 123123$i --with-namespace dom1  --with-app-name app$i & done

While the oo-app-create commands are still running, run oo-accept-node to check for transient failures:

[root@ip-10-60-129-152 ~]# oo-accept-node 
FAIL: user 12312351 does not have quotas imposed
FAIL: user 12312381 does not have quotas imposed
2 ERRORS
[root@ip-10-60-129-152 ~]# oo-accept-node 
PASS

It reports errors for the in-progress gears on the first run, then reports PASS on the next run.
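
The difference between the two runs above can be made explicit by comparing their FAIL lines. The following is a minimal sketch only, assuming the output format shown in the transcript (one `FAIL: <message>` line per problem); `fail_lines` and `classify` are hypothetical helper names and are not part of oo-accept-node.

```python
def fail_lines(output):
    """Return the set of FAIL messages found in one run's output."""
    prefix = "FAIL: "
    return {
        line.strip()[len(prefix):]
        for line in output.splitlines()
        if line.strip().startswith(prefix)
    }


def classify(first_run, second_run):
    """Split first-run failures into transient and persistent ones."""
    first, second = fail_lines(first_run), fail_lines(second_run)
    return {
        "transient": sorted(first - second),   # gone on the second run
        "persistent": sorted(first & second),  # still failing
    }


# Output copied from the two runs in this comment.
first = """FAIL: user 12312351 does not have quotas imposed
FAIL: user 12312381 does not have quotas imposed
2 ERRORS"""
second = "PASS"

result = classify(first, second)
print(result["transient"])
print(result["persistent"])
```

With the transcript's output, both quota failures land in the transient list and the persistent list is empty, which matches the behaviour described here.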

Assigning the bug back.
Comment 4 Rob Millner 2013-06-17 17:33:16 EDT
Narrowed down the set of places where the user list and quotas can get out of sync.  Also now using the lock file from unix_user.rb as another way to determine if a gear create/delete ran.

Used the above script, and its mirror image with oo-app-destroy, in a loop 10 times.  The oo-accept-node script running in a loop no longer fails with the following pull request.

https://github.com/openshift/origin-server/pull/2867
Comment 5 openshift-github-bot 2013-06-17 22:18:24 EDT
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/3b2d0950c41dc82436a89224f57b16773e042e80
Bug 974268 - Narrow the window where user and quota data can get out of sync and set the start time prior to any other collection.  Deal with a race condition with the lock files in unix_user.
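
The approach described in comments 4 and 5 (record a start time before collecting anything, then use lock-file activity to tell whether a gear was created or deleted during the run) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the pull requests: the `.lock_<uuid>` naming, the directory layout, and the function names `in_flux` and `real_errors` are all hypothetical.

```python
import os
import tempfile


def in_flux(gear_home, lock_file, start_time):
    """Guess whether a gear was created or deleted after start_time."""
    if not os.path.exists(gear_home):
        return True  # home is gone: gear was deleted during the run
    try:
        # Lock touched at or after the start: a create/delete ran meanwhile.
        return os.stat(lock_file).st_mtime >= start_time
    except FileNotFoundError:
        return False  # no lock activity seen; treat the error as real


def real_errors(errors, base_dir, start_time):
    """Keep only (uuid, message) errors for gears that were not in flux."""
    kept = []
    for uuid, message in errors:
        home = os.path.join(base_dir, uuid)
        lock = os.path.join(base_dir, ".lock_" + uuid)  # hypothetical naming
        if not in_flux(home, lock, start_time):
            kept.append((uuid, message))
    return kept


def demo():
    # The start time must be fixed before any other data is collected.
    start_time = 1000.0
    with tempfile.TemporaryDirectory() as base:
        for uuid, lock_mtime in (("stable", 500.0), ("creating", 1500.0)):
            os.mkdir(os.path.join(base, uuid))
            lock = os.path.join(base, ".lock_" + uuid)
            open(lock, "w").close()
            os.utime(lock, (lock_mtime, lock_mtime))
        errors = [
            ("stable", "does not have quotas imposed"),
            ("creating", "does not have quotas imposed"),
            ("deleted", "does not have quotas imposed"),  # no home directory
        ]
        return real_errors(errors, base, start_time)


print(demo())  # [('stable', 'does not have quotas imposed')]
```

Only the gear whose lock file predates the start time is still reported; the gear being created and the gear that has been deleted are squashed as transient.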
Comment 6 Meng Bo 2013-06-18 03:08:31 EDT
Checked on devenv_3375,

oo-accept-node no longer reports errors while multiple oo-app-create and oo-app-destroy operations run in parallel.


Moving the bug to verified.
