Description of problem:
If a gear is created or deleted while oo-accept-node is running, oo-accept-node will error on parts of the new gear that are missing. On ex-nodes that are either really slow, or have a lot of gear creates / deletes happening, this causes many transient errors. Since we're monitoring oo-accept-node, this causes a lot of false alerts for us. We need oo-accept-node to _only_ flag real problems, not transient changes. This is especially bad on ex-nodes with thousands of gears, where oo-accept-node can take over 10 minutes to run.

Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.9.11-1.el6oso.noarch

How reproducible:
Very on either a really slow box where gear creates are happening, or a box with a lot of

Steps to Reproduce:
1. Load up a box with a lot of gears (like 4000+)
2. While running oo-accept-node, create or delete a gear on the system
3. Notice that this is flagged by oo-accept-node.

Actual results:
oo-accept-node flags gears that are either being created or deleted as errors.

Expected results:
oo-accept-node needs to only flag real problems.
Tested on my C9 node that was creating 4000 gears, 5 at a time. https://github.com/openshift/origin-server/pull/2858
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/f2e95067fba4b8c55120043f1318d5d9250769c3
Bug 974268 - Squash error messages for gears which have been created or destroyed while the accept-node script is run.
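The idea of squashing messages for gears that changed during the run can be illustrated with a minimal Ruby sketch. This is not the code from the commit: the method name `squash_transient`, the shape of the failure hashes, and the two-snapshot comparison are all assumptions made for illustration.

```ruby
require 'set'

# Hypothetical helper, not taken from oo-accept-node itself.
# failures:     array of hashes like { uuid: "...", msg: "..." } (assumed shape)
# gears_before: gear UUIDs listed when the run started
# gears_after:  gear UUIDs listed when the run finished
def squash_transient(failures, gears_before, gears_after)
  before = Set.new(gears_before)
  after  = Set.new(gears_after)
  # Gears present in only one snapshot were created or destroyed mid-run.
  changed = (before - after) | (after - before)
  failures.reject { |f| changed.include?(f[:uuid]) }
end

failures = [
  { uuid: "12312351", msg: "does not have quotas imposed" },
  { uuid: "stable01", msg: "does not have quotas imposed" }
]
# Gear 12312351 appeared during the run, so only stable01 is still reported.
remaining = squash_transient(failures, ["stable01"], ["stable01", "12312351"])
remaining.each { |f| puts "FAIL: user #{f[:uuid]} #{f[:msg]}" }
```

A snapshot comparison like this only catches gears whose creation or deletion straddles the whole run; a gear that is both created and fully populated between the two snapshots would need a different signal.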
Tested on devenv_3368 with the following method:

[root@ip-10-60-129-152 ~]# for i in `seq 1 100` ;do oo-app-create --with-app-uuid 123123$i --with-container-uuid 123123$i --with-namespace dom1 --with-app-name app$i & done

While the oo-app-create commands are running, use oo-accept-node to check for transient issues.

[root@ip-10-60-129-152 ~]# oo-accept-node
FAIL: user 12312351 does not have quotas imposed
FAIL: user 12312381 does not have quotas imposed
2 ERRORS
[root@ip-10-60-129-152 ~]# oo-accept-node
PASS

It reports gear errors on the first run and PASS on the following run. Assigning the bug back.
Narrowed down the set of places where the user list and quotas can get out of sync. Also now using the lock file from unix_user.rb as another way to determine whether a gear create/delete ran. Ran the above script, and its mirror image with oo-app-destroy, in a loop 10 times. With the following pull request, oo-accept-node running in a loop no longer fails. https://github.com/openshift/origin-server/pull/2867
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/3b2d0950c41dc82436a89224f57b16773e042e80
Bug 974268 - Narrow the window where user and quota data can get out of sync and set the start time prior to any other collection. Deal with a race condition with the lock files in unix_user.
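A minimal Ruby sketch of the approach this commit message describes: record the start time before collecting anything, then treat a gear as in flux if its lock file was modified at or after that time. The lock directory layout, the `.lock` file naming, and the method name `gear_in_flux?` are hypothetical; the real lock handling lives in unix_user.rb and may differ.

```ruby
# Hypothetical helper; the lock file name and location are assumptions,
# not the layout actually used by unix_user.rb.
def gear_in_flux?(lock_dir, uuid, start_time)
  lock = File.join(lock_dir, "#{uuid}.lock")
  # A lock touched at or after start_time suggests a create/delete ran
  # while the check was in progress.
  File.mtime(lock) >= start_time
rescue Errno::ENOENT
  # The lock file can disappear between listing and stat. This sketch
  # treats a missing lock as "no evidence of a change".
  false
end

# Record the start time before any user or quota data is collected,
# so that every later observation can be compared against it.
start_time = Time.now
```

Returning `false` for a missing lock file is a design choice in this sketch; a stricter checker could instead treat a vanished lock as a sign that a delete just completed.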
Checked on devenv_3375: oo-accept-node no longer reports errors while multiple oo-app-create or oo-app-destroy operations run in parallel. Moving bug to verified.