Bug 974268

Summary: oo-accept-node fails when gears are created / deleted while it's running
Product: OpenShift Online
Reporter: Thomas Wiest <twiest>
Component: Containers
Assignee: Rob Millner <rmillner>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: medium
Priority: medium
Version: 2.x
CC: bmeng, mfisher, xtian, yadu
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2013-06-24 14:54:52 UTC
Type: Bug

Description Thomas Wiest 2013-06-13 19:39:44 UTC
Description of problem:
If a gear is created or deleted while oo-accept-node is running, oo-accept-node reports errors for the parts of that gear that do not exist yet or have already been removed.

On ex-nodes that are either really slow, or have a lot of gear creates / deletes happening, this causes many transient errors.

Since we're monitoring oo-accept-node, this causes a lot of false alerts for us.

We need oo-accept-node to _only_ flag real problems, not transient changes.

This is especially bad on ex-nodes with thousands of gears, where oo-accept-node can take over 10 minutes to run.


Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.9.11-1.el6oso.noarch


How reproducible:
Very reproducible, on either a really slow box where gear creates are happening, or a box with a lot of gear creates / deletes.


Steps to Reproduce:
1. Load up a box with a lot of gears (like 4000+)
2. While running oo-accept-node, create or delete a gear on the system
3. Notice that this is flagged by oo-accept-node.


Actual results:
oo-accept-node flags gears that are either being created or deleted as errors.


Expected results:
oo-accept-node needs to only flag real problems.

Comment 1 Rob Millner 2013-06-14 23:13:32 UTC
Tested on my C9 node that was creating 4000 gears, 5 at a time.

https://github.com/openshift/origin-server/pull/2858

Comment 2 openshift-github-bot 2013-06-15 01:48:42 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/f2e95067fba4b8c55120043f1318d5d9250769c3
Bug 974268 - Squash error messages for gears which have been created or destroyed while the accept-node script is run.
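
One way to picture the idea of squashing errors for gears that came or went during the run is sketched below. This is only an illustration and is not taken from the commit: the function name, the callables `list_gears` and `checks`, and the `(uuid, message)` failure shape are all hypothetical. It snapshots the gear list before and after the checks and keeps a failure only when its gear appears in both snapshots.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical failure record: (gear uuid, error message).
Failure = Tuple[str, str]


def run_squashing_transients(
    list_gears: Callable[[], Iterable[str]],
    checks: Iterable[Callable[[], List[Failure]]],
) -> List[Failure]:
    """Run every check, then drop failures for gears that were created
    or destroyed while the checks were running."""
    # Snapshot of gear uuids before any check collects data.
    before: Set[str] = set(list_gears())

    failures: List[Failure] = []
    for check in checks:
        failures.extend(check())

    # Second snapshot, taken after all checks have finished.
    after: Set[str] = set(list_gears())

    # A gear missing from either snapshot changed during the run,
    # so any failure reported against it is treated as transient.
    stable = before & after
    return [(uuid, msg) for (uuid, msg) in failures if uuid in stable]
```

A gear that is created and destroyed entirely between the two snapshots is also filtered out, since it appears in neither. A gear that is both present and modified during the run is not caught by this approach.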

Comment 3 Meng Bo 2013-06-17 10:22:16 UTC
Tested on devenv_3368 with the following method:


[root@ip-10-60-129-152 ~]# for i in `seq 1 100` ;do oo-app-create --with-app-uuid 123123$i --with-container-uuid 123123$i --with-namespace dom1  --with-app-name app$i & done

While the oo-app-create commands are running, run oo-accept-node to check for transient errors.

[root@ip-10-60-129-152 ~]# oo-accept-node 
FAIL: user 12312351 does not have quotas imposed
FAIL: user 12312381 does not have quotas imposed
2 ERRORS
[root@ip-10-60-129-152 ~]# oo-accept-node 
PASS

It reports the gear errors on the first run, and PASS on the next run.

Assigning the bug back.

Comment 4 Rob Millner 2013-06-17 21:33:16 UTC
Narrowed down the set of places where the user list and quotas can get out of sync.  Also now using the lock file from unix_user.rb as another way to determine if a gear create/delete ran.

Used the above script, and its mirror image with oo-app-destroy, in a loop 10 times.  The oo-accept-node script running in a loop no longer fails with the following pull request.

https://github.com/openshift/origin-server/pull/2867

Comment 5 openshift-github-bot 2013-06-18 02:18:24 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/3b2d0950c41dc82436a89224f57b16773e042e80
Bug 974268 - Narrow the window where user and quota data can get out of sync and set the start time prior to any other collection.  Deal with a race condition with the lock files in unix_user.
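
The two ideas in this commit message (record the start time before collecting anything, and use a lock file as evidence that a gear create/delete ran) might look roughly like the sketch below. It is a guess at the shape of the technique, not the code from the pull request: the function names and the `home_of` / `lock_of` path helpers are hypothetical, and the real lock file location used by unix_user.rb is not assumed here.

```python
import os
from typing import Callable, Iterable, List, Tuple

# Hypothetical failure record: (gear uuid, error message).
Failure = Tuple[str, str]


def touched_since(path: str, start: float) -> bool:
    """True if path exists and was modified at or after start."""
    try:
        return os.stat(path).st_mtime >= start
    except OSError:
        # A missing or unreadable path gives no evidence of recent activity.
        return False


def filter_transient_failures(
    failures: Iterable[Failure],
    start: float,
    home_of: Callable[[str], str],
    lock_of: Callable[[str], str],
) -> List[Failure]:
    """Keep only failures for gears with no sign of a create/delete
    since start (the time recorded before any data was collected)."""
    kept: List[Failure] = []
    for uuid, msg in failures:
        # Either the gear home or its lock file changing after the scan
        # began marks the failure as transient.
        if touched_since(home_of(uuid), start) or touched_since(lock_of(uuid), start):
            continue
        kept.append((uuid, msg))
    return kept
```

In use, `start` would be taken (for example with `time.time()`) as the very first step of the scan, so that anything modified afterwards compares as newer. On filesystems with coarse timestamp granularity, a change made within the same second as `start` could be missed, so rounding `start` down is a reasonable precaution.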

Comment 6 Meng Bo 2013-06-18 07:08:31 UTC
Checked on devenv_3375,

oo-accept-node no longer reports errors while multiple oo-app-create or oo-app-destroy operations are running in parallel.


Moving the bug to VERIFIED.