Bug 974268 - oo-accept-node fails when gears are created / deleted while it's running
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assigned To: Rob Millner
QA Contact: libra bugs
Depends On:
Reported: 2013-06-13 15:39 EDT by Thomas Wiest
Modified: 2015-05-14 19:21 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2013-06-24 10:54:52 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Description Thomas Wiest 2013-06-13 15:39:44 EDT
Description of problem:
If a gear is created or deleted while oo-accept-node is running, oo-accept-node will error on parts of the new gear that are missing.

On ex-nodes that are either really slow, or have a lot of gear creates / deletes happening, this causes many transient errors.

Since we're monitoring oo-accept-node, this causes a lot of false alerts for us.

We need oo-accept-node to _only_ flag real problems, not transient changes.

This is especially bad on ex-nodes with thousands of gears, where oo-accept-node can take over 10 minutes to run.

Version-Release number of selected component (if applicable):

How reproducible:
Very reproducible, on either a really slow box where gear creates are happening, or a box with a lot of gear creates / deletes.

Steps to Reproduce:
1. Load up a box with a lot of gears (like 4000+)
2. While running oo-accept-node, create or delete a gear on the system
3. Notice that this is flagged by oo-accept-node.

Actual results:
oo-accept-node flags gears that are either being created or deleted as errors.

Expected results:
oo-accept-node needs to only flag real problems.
Comment 1 Rob Millner 2013-06-14 19:13:32 EDT
Tested on my C9 node that was creating 4000 gears, 5 at a time.

Comment 2 openshift-github-bot 2013-06-14 21:48:42 EDT
Commit pushed to master at https://github.com/openshift/origin-server

Bug 974268 - Squash error messages for gears which have been created or destroyed while the accept-node script is run.
Comment 3 Meng Bo 2013-06-17 06:22:16 EDT
Tested on devenv_3368 with the following method:

[root@ip-10-60-129-152 ~]# for i in `seq 1 100` ;do oo-app-create --with-app-uuid 123123$i --with-container-uuid 123123$i --with-namespace dom1  --with-app-name app$i & done

While the oo-app-create commands are running, use oo-accept-node to check for transient issues.

[root@ip-10-60-129-152 ~]# oo-accept-node 
FAIL: user 12312351 does not have quotas imposed
FAIL: user 12312381 does not have quotas imposed
[root@ip-10-60-129-152 ~]# oo-accept-node 

It reports the gear issue on the first run, and passes on the following try.

Assign the bug back.
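
The fail-then-pass pattern above is what distinguishes a transient error from a real one. As a monitoring-side illustration only (this is not part of the fix, and not something oo-accept-node does itself), a wrapper could run the check twice and report only the FAIL lines present in both runs. The function name persistent_failures and the RECHECK_DELAY variable are hypothetical; the only thing taken from this bug is that failures are printed as lines starting with "FAIL:".

```shell
# Hypothetical helper: print only the FAIL lines that appear in BOTH of two
# consecutive runs of a check command. A line seen in just one run is treated
# as transient (for example, a gear that was mid-create).
persistent_failures() {
    pf_tmp=$(mktemp) || return 1

    # "$@" is the check command, e.g. oo-accept-node. Its exit status is
    # ignored; only the FAIL lines are compared.
    "$@" 2>&1 | grep '^FAIL:' | sort -u > "$pf_tmp"

    # Give in-flight gear creates / deletes time to finish (assumed delay).
    sleep "${RECHECK_DELAY:-30}"

    # comm -12 keeps only the lines common to both sorted inputs.
    "$@" 2>&1 | grep '^FAIL:' | sort -u | comm -12 "$pf_tmp" -

    rm -f "$pf_tmp"
}

# Usage sketch:
#   failures=$(persistent_failures oo-accept-node)
#   [ -n "$failures" ] && echo "$failures"
```

On a box where a full run takes more than 10 minutes this doubles the check time, so it is at best a stopgap until the script itself ignores gears that changed during the run.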
Comment 4 Rob Millner 2013-06-17 17:33:16 EDT
Narrowed down the set of places where the user list and quotas can get out of sync.  Also now using the lock file from unix_user.rb as another way to determine if a gear create/delete ran.

Used the above script, and its mirror image with oo-app-destroy, in a loop 10 times.  The oo-accept-node script running in a loop no longer fails with the following pull request.

Comment 5 openshift-github-bot 2013-06-17 22:18:24 EDT
Commit pushed to master at https://github.com/openshift/origin-server

Bug 974268 - Narrow the window where user and quota data can get out of sync and set the start time prior to any other collection.  Deal with a race condition with the lock files in unix_user.
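
The approach described in comments 4 and 5 (record the start time before collecting anything, then use the gear's lock file as a sign that a create/delete ran) can be sketched roughly as below. This is an illustration under stated assumptions, not the code from the commit: the function name gear_changed_since, the argument layout, and the use of file modification times are all hypothetical, and the real lock file handling lives in unix_user.rb.

```shell
# Hypothetical sketch; names and paths are illustrative and are not taken
# from oo-accept-node or unix_user.rb.
#
# Succeeds (exit 0) when a gear looks like it was created or destroyed after
# the check started, so a failure reported for it can be squashed.
gear_changed_since() {
    start_epoch=$1   # epoch seconds, recorded before any data collection
    gear_home=$2     # gear home directory
    lock_file=$3     # per-gear lock file touched by create/delete (assumed)

    # The gear vanished while the check was running: it was destroyed.
    [ -d "$gear_home" ] || return 0

    for path in "$gear_home" "$lock_file"; do
        [ -e "$path" ] || continue
        # stat -c %Y prints the modification time in epoch seconds (GNU stat).
        mtime=$(stat -c %Y "$path") || continue
        if [ "$mtime" -ge "$start_epoch" ]; then
            return 0
        fi
    done
    return 1
}

# Usage sketch: take the timestamp first, then collect users and quotas.
#   check_start=$(date +%s)
#   ... collect user list and quota data ...
#   gear_changed_since "$check_start" "$home" "$lock" \
#       || echo "FAIL: user $uuid does not have quotas imposed"
```

Taking the timestamp before any collection matters: if it were taken after the user list was read, a gear created in between would look older than the check and still be flagged. A directory's modification time also changes when entries are added to it later, so this sketch can squash more than strictly necessary.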
Comment 6 Meng Bo 2013-06-18 03:08:31 EDT
Checked on devenv_3375,

oo-accept-node no longer reports errors for either oo-app-create or oo-app-destroy when multiple operations run in parallel.

Move bug to verified.
