Bug 878205 - concurrent user actions result in inconsistencies in gear DB
Status: CLOSED ERRATA
Product: Atomic Enterprise Platform and OpenShift Enterprise
Classification: Red Hat
Component: Kubernetes
Version: 1.1.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 1.2
Assigned To: Luke Meyer
QA Contact: libra bugs
Keywords: Triaged
Depends On: 855307
Blocks:
Reported: 2012-11-19 15:45 EST by Luke Meyer
Modified: 2013-07-09 15:49 EDT (History)
4 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: MongoDB access was not performed in a way that always guaranteed consistency. Consequence: If multiple alterations were performed on a user's application(s) concurrently, some of them could be overwritten (and thus lost) by others, leaving MongoDB inconsistent with the reality of the gears on the node. The canonical example: if the same app was scaled up by two separate logins concurrently, one of the gears would not be known to MongoDB. Fix: Distributed locking mechanisms were introduced with the DB schema and model refactor that went into OSE 1.2. Upgrade to OSE 1.2. Result: User actions are queued so that consistency is maintained.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-07-09 15:49:27 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments: None
Description Luke Meyer 2012-11-19 15:45:54 EST
Description of problem:
There are race conditions that can cause changes to MongoDB to be overwritten by concurrent changes to the same user's apps/gears. This can leave gears on node hosts that are unreferenced in MongoDB, or vice versa.
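The lost-update pattern behind this can be sketched as follows. This is illustrative Python against a simplified in-memory stand-in for the app document, not broker code; all names here are hypothetical:

```python
# Illustrative only: a dict stands in for the MongoDB application document.
db = {"app": {"gears": ["gear-1"]}}

def scale_up(snapshot, new_gear):
    # Read-modify-write against a possibly stale snapshot of the document.
    return {"gears": snapshot["gears"] + [new_gear]}

# Two logins read the same snapshot of the app concurrently...
snap_a = dict(db["app"])
snap_b = dict(db["app"])

# ...and each writes back its own whole-document update.
db["app"] = scale_up(snap_a, "gear-2")   # first write lands
db["app"] = scale_up(snap_b, "gear-3")   # second write clobbers the first

# gear-2 was created on a node host but is now unknown to the database.
print(db["app"]["gears"])   # ['gear-1', 'gear-3']
```

The second writer never saw gear-2, so its full-document write silently discards it; the node host still runs the gear.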

Version-Release number of selected component (if applicable):
OSE 1.0

How reproducible:
In the upstream bug, this was reliably reproduced by manually triggering multiple concurrent scale-up events against a scaled app. There are probably other concurrent user actions with similar results.

Additional info:
This sort of problem can be detected by regularly running the "oo-admin-chk" command on the broker. Administrative action will be required to adjust gear usage counts (oo-admin-ctl-user), remove phantom apps from MongoDB (oo-admin-ctl-app), or remove unreferenced gears from node hosts.
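The distributed-locking fix described in the Doc Text above can be sketched as an optimistic compare-and-swap on a document version field. This is the general technique only, in illustrative Python, not the actual Mongoid-based broker implementation:

```python
import threading

# Illustrative stand-in for a MongoDB document carrying a version counter.
db = {"app": {"gears": ["gear-1"], "version": 0}}
lock = threading.Lock()  # emulates the database's atomic update primitive

def try_scale_up(new_gear):
    """Retry read-modify-write until our compare-and-swap wins."""
    while True:
        snap = dict(db["app"])                     # read
        updated = {
            "gears": snap["gears"] + [new_gear],   # modify
            "version": snap["version"] + 1,
        }
        with lock:  # the check-and-write below must be atomic
            if db["app"]["version"] == snap["version"]:
                db["app"] = updated                # swap succeeds
                return
        # Another writer committed first; reread and retry.

try_scale_up("gear-2")
try_scale_up("gear-3")
print(db["app"]["gears"])   # ['gear-1', 'gear-2', 'gear-3']
```

Because a stale snapshot fails the version check and forces a reread, a concurrent scale-up can no longer overwrite another's gear record; the loser retries instead of clobbering.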
Comment 2 Gaoyun Pei 2013-03-27 03:06:30 EDT
Found an issue belonging to this bug against puddle:
http://buildvm-devops.usersys.redhat.com/puddle/build/OpenShiftEnterprise/1.1.z/2013-03-21.1/

Description of problem:
After triggering 3 scale-up events at the same time, "oo-admin-chk" reports an error about an inconsistency between the node and MongoDB.

How reproducible:
always

Steps to Reproduce:
1.Create scalable app and disable auto-scaling

2.Trigger 3 scale-up events at same time
for i in `seq 1 3 `; do curl -k -X POST -H 'Accept: application/xml' -d event=scale-up --user gpei@redhat.com:redhat https://broker.rhn.com/broker/rest/domains/1010/applications/app/events &  done

3.Run oo-admin-chk on broker
[root@broker ~]# oo-admin-chk 
Check failed.
FAIL: user gpei@redhat.com has a mismatch in consumed gears (5) and actual gears (4)!
Gear 2c61a7ebb19a4a68a7bd2c8b5454f298 exists on node [node1.rhn.com, uid:1154] but does not exist in mongo database

Actual results:
Some gears exist on the node but do not exist in MongoDB.
Comment 3 Gaoyun Pei 2013-03-27 03:14:54 EDT
Sometimes, after triggering 3 or 5 scale-up events at the same time, when I check the gear number of the scalable app via the REST API, the result does not match the real number of gears on the nodes.

QE would like to use this bug to track the multiple-concurrent-scale-up issue.
Comment 5 xjia 2013-05-02 21:02:28 EDT
Version:
http://buildvm-devops.usersys.redhat.com/puddle/build/OpenShiftEnterprise/1.2/2013-05-02.1

Verify:
Scaled up 10 times: 4 failed, 6 succeeded.
The failed attempts reported:
Application is currently busy performing another operation. Please try again in a minute.

Regardless, the data in MongoDB is in accordance with the actual gear info.
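A client can absorb that "currently busy" response with a simple retry-and-backoff loop. Sketch only: `BusyError` and the `scale_up` callable are hypothetical stand-ins, not part of the rhc tooling or REST API:

```python
import time

class BusyError(Exception):
    """Hypothetical: raised when the broker reports the app is busy."""

def retry_scale_up(scale_up, attempts=5, delay=0.0):
    """Call scale_up(), retrying while the broker is busy with queued work."""
    for n in range(attempts):
        try:
            return scale_up()
        except BusyError:
            time.sleep(delay * (2 ** n))   # back off before the next try
    raise RuntimeError("application stayed busy after %d attempts" % attempts)

# Usage: a stub broker that is busy twice, then succeeds.
calls = {"n": 0}
def fake_scale_up():
    calls["n"] += 1
    if calls["n"] < 3:
        raise BusyError()
    return "scaled"

result = retry_scale_up(fake_scale_up)
print(result)   # scaled
```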

[root@broker ~]#
[root@broker ~]# rhc apps
php @ http://php-jia1.osev2.com/ (uuid: 518309ae4052a73a05000006)
-----------------------------------------------------------------
  Created: 5:49 PM
  Gears:   6 (defaults to small)
  Git URL: ssh://518309ae4052a73a05000006@php-jia1.osev2.com/~/git/php.git/
  SSH:     518309ae4052a73a05000006@php-jia1.osev2.com

  php-5.3 (PHP 5.3)
  -----------------
    Scaling: x6 (minimum: 1, maximum: available) on small gears

  haproxy-1.4 (OpenShift Web Balancer)
  ------------------------------------
    Gears: Located with php-5.3

You have 1 applications

[root@broker ~]#  oo-admin-chk  -v
Started at: 2013-05-02 17:57:33 -0700
Time to fetch mongo data: 0.01s
Total gears found in mongo: 6
Time to get all gears from nodes: 20.277s
Total gears found on the nodes: 6
Total nodes that responded : 2
Checking application gears and ssh keys on corresponding nodes:
518309ae4052a73a05000006 : String...    OK
51830a0a4052a73a05000028 : String...    OK
51830a404052a73a05000035 : String...    OK
51830a774052a73a05000042 : String...    OK
51830ab14052a73a0500004f : String...    OK
51830aef4052a7f30a000002 : String...    OK
Checking node gears in application database:
51830a0a4052a73a05000028...     OK
51830aef4052a7f30a000002...     OK
51830a404052a73a05000035...     OK
518309ae4052a73a05000006...     OK
51830a774052a73a05000042...     OK
51830ab14052a73a0500004f...     OK
Success
Total time: 20.287s
Finished at: 2013-05-02 17:57:53 -0700
Comment 7 errata-xmlrpc 2013-07-09 15:49:27 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2013-1031.html
