Bug 878205 - concurrent user actions result in inconsistencies in gear DB
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 1.1.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
: ---
Assignee: Luke Meyer
QA Contact: libra bugs
URL:
Whiteboard:
Depends On: 855307
Blocks:
Reported: 2012-11-19 20:45 UTC by Luke Meyer
Modified: 2017-03-08 17:35 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: MongoDB access was not done in a way that always guaranteed consistency.
Consequence: If multiple alterations were performed to a user's application(s) concurrently, it was possible for some of them to be overwritten (and thus lost) by others, making MongoDB inconsistent with the reality of the gears on the node. The canonical example was the same app being scaled up by two separate logins concurrently, which left one of the gears unknown to MongoDB.
Fix: Distributed locking mechanisms were introduced with the DB schema and model refactor that went into OSE 1.2. Upgrade to OSE 1.2.
Result: User actions should be successfully queued for consistency.
Clone Of:
Environment:
Last Closed: 2013-07-09 19:49:27 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHEA-2013:1031
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat OpenShift Enterprise 1.2 Infrastructure Release Advisory
Last Updated: 2013-07-09 23:48:02 UTC

Description Luke Meyer 2012-11-19 20:45:54 UTC
Description of problem:
Essentially, there are some race conditions currently that can cause changes to MongoDB to be overwritten by concurrent changes to the same user's apps/gears. This may cause gears to exist on node hosts that are unreferenced by MongoDB, or vice versa.
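
The race is a classic lost update: two requests read the same gear list, each appends its own gear, and the second write overwrites the first. The following is a minimal sketch of that interleaving and of serializing it with a lock. It is not the broker's code: `GearStore`, `racy_scale_up_pair`, and `locked_scale_up` are hypothetical names, and the in-process `threading.Lock` is only a local stand-in for the distributed locking mentioned in the Doc Text.

```python
import threading


class GearStore:
    """In-memory stand-in for one application's gear list in the datastore."""

    def __init__(self, gears=None):
        self.gears = list(gears or [])
        self.lock = threading.Lock()  # local stand-in for a distributed lock

    def read(self):
        return list(self.gears)

    def write(self, gears):
        self.gears = list(gears)


def racy_scale_up_pair(store, gear_a, gear_b):
    """Replay the bad interleaving: both requests read before either writes."""
    seen_by_a = store.read()
    seen_by_b = store.read()
    store.write(seen_by_a + [gear_a])
    store.write(seen_by_b + [gear_b])  # overwrites the record that held gear_a


def locked_scale_up(store, gear):
    """Hold the lock across the whole read-modify-write cycle."""
    with store.lock:
        store.write(store.read() + [gear])


racy = GearStore(["gear-1"])
racy_scale_up_pair(racy, "gear-2", "gear-3")
print(racy.gears)  # gear-2 was created but is no longer recorded

safe = GearStore(["gear-1"])
threads = [
    threading.Thread(target=locked_scale_up, args=(safe, gear))
    for gear in ("gear-2", "gear-3")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(safe.gears))
```

In the racy case the store ends up without gear-2 even though that gear would already exist on a node, which matches the "exists on node but not in MongoDB" symptom.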

Version-Release number of selected component (if applicable):
OSE 1.0

How reproducible:
In the upstream bug, this was reliably reproduced by manually triggering multiple concurrent scale-up events against a scaled app. There are probably other cases of concurrent user actions with similar results.

Additional info:
This sort of problem can be detected by regular monitoring of the "oo-admin-chk" command on the broker. Administrative action will be required to adjust gear usage counts (oo-admin-ctl-user), remove phantom apps from the MongoDB (oo-admin-ctl-app), or remove unreferenced gears from node hosts.
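
For regular monitoring, the output of "oo-admin-chk" could be parsed into something an alerting job can act on. The sketch below is an assumption-laden example, not a supported tool: the line formats are taken only from the failure output quoted in Comment 2 of this report, `parse_chk_output` is a hypothetical helper, and other versions of the command may print different text.

```python
import re

# Line formats copied from the oo-admin-chk output quoted in this report.
# They are assumptions about the tool's output, not a documented interface.
GEAR_RE = re.compile(
    r"^Gear (?P<uuid>[0-9a-f]+) exists on node "
    r"\[(?P<node>[^,\]]+), uid:(?P<uid>\d+)\] "
    r"but does not exist in mongo database$"
)
MISMATCH_RE = re.compile(
    r"^FAIL: user (?P<user>\S+) has a mismatch in consumed gears "
    r"\((?P<consumed>\d+)\) and actual gears \((?P<actual>\d+)\)!$"
)


def parse_chk_output(text):
    """Collect gear-count mismatches and unreferenced gears from the output."""
    report = {"failed": False, "mismatches": [], "unreferenced_gears": []}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "Check failed.":
            report["failed"] = True
            continue
        m = MISMATCH_RE.match(line)
        if m:
            report["mismatches"].append(
                (m["user"], int(m["consumed"]), int(m["actual"]))
            )
            continue
        m = GEAR_RE.match(line)
        if m:
            report["unreferenced_gears"].append(
                (m["uuid"], m["node"], int(m["uid"]))
            )
    return report


sample = """Check failed.
FAIL: user gpei has a mismatch in consumed gears (5) and actual gears (4)!
Gear 2c61a7ebb19a4a68a7bd2c8b5454f298 exists on node [node1.rhn.com, uid:1154] but does not exist in mongo database
"""
report = parse_chk_output(sample)
print(report)
```

The parsed result only identifies what needs attention; the cleanup itself still requires the administrative commands named above.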

Comment 2 Gaoyun Pei 2013-03-27 07:06:30 UTC
Found one issue belonging to this bug against puddle:
http://buildvm-devops.usersys.redhat.com/puddle/build/OpenShiftEnterprise/1.1.z/2013-03-21.1/

Description of problem:
After triggering 3 scale-up events at the same time, "oo-admin-chk" reports an error about the inconsistency between the node and MongoDB.

How reproducible:
always

Steps to Reproduce:
1. Create a scalable app and disable auto-scaling

2. Trigger 3 scale-up events at the same time
for i in `seq 1 3`; do curl -k -X POST -H 'Accept: application/xml' -d event=scale-up --user gpei:redhat https://broker.rhn.com/broker/rest/domains/1010/applications/app/events & done

3. Run oo-admin-chk on the broker
[root@broker ~]# oo-admin-chk 
Check failed.
FAIL: user gpei has a mismatch in consumed gears (5) and actual gears (4)!
Gear 2c61a7ebb19a4a68a7bd2c8b5454f298 exists on node [node1.rhn.com, uid:1154] but does not exist in mongo database

Actual results:
Some gears exist on the node but do not exist in MongoDB.

Comment 3 Gaoyun Pei 2013-03-27 07:14:54 UTC
Sometimes, after I trigger 3 or 5 scale-up events at the same time and then check the gear count of the scalable app via the REST API, the result does not match the real number of gears on the nodes.

QE would like to use this bug to track the multiple concurrent scale-up issue.

Comment 5 xjia 2013-05-03 01:02:28 UTC
Version:
http://buildvm-devops.usersys.redhat.com/puddle/build/OpenShiftEnterprise/1.2/2013-05-02.1

Verify:
Scaled up 10 times: 4 attempts failed and 6 succeeded.
Each failed attempt returned:
Application is currently busy performing another operation. Please try again in a minute.

Either way, the data in MongoDB is in accordance with the actual gear info.
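
Since rejected requests are told to try again, a client that issues scale-up events could retry on the busy message. This is a minimal sketch under stated assumptions: `send_scale_up` is a hypothetical callable that performs the request and returns the response text, and matching on the message string quoted above is an assumption about how the broker reports the condition.

```python
import time

# Message text as quoted in this comment; how the broker delivers it
# (status code, response body format) is not covered by this sketch.
BUSY_MESSAGE = "Application is currently busy performing another operation."


def scale_up_with_retry(send_scale_up, attempts=5, delay_seconds=60.0,
                        sleep=time.sleep):
    """Call send_scale_up until it stops reporting the busy message.

    Returns (attempt_number, response_text) for the first non-busy response
    and raises RuntimeError if every attempt was rejected as busy.
    """
    for attempt in range(1, attempts + 1):
        response = send_scale_up()
        if BUSY_MESSAGE not in response:
            return attempt, response
        if attempt < attempts:
            sleep(delay_seconds)  # the message suggests waiting about a minute
    raise RuntimeError(f"still busy after {attempts} attempts")


# Usage with a fake sender: busy twice, then accepted.
busy = BUSY_MESSAGE + " Please try again in a minute."
responses = iter([busy, busy, "ok"])
result = scale_up_with_retry(lambda: next(responses), sleep=lambda _s: None)
print(result)
```

Injecting `sleep` keeps the helper testable without real waiting; a real caller would leave the default in place.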

[root@broker ~]#
[root@broker ~]# rhc apps
php @ http://php-jia1.osev2.com/ (uuid: 518309ae4052a73a05000006)
-----------------------------------------------------------------
  Created: 5:49 PM
  Gears:   6 (defaults to small)
  Git URL: ssh://518309ae4052a73a05000006.com/~/git/php.git/
  SSH:     518309ae4052a73a05000006.com

  php-5.3 (PHP 5.3)
  -----------------
    Scaling: x6 (minimum: 1, maximum: available) on small gears

  haproxy-1.4 (OpenShift Web Balancer)
  ------------------------------------
    Gears: Located with php-5.3

You have 1 applications

[root@broker ~]#  oo-admin-chk  -v
Started at: 2013-05-02 17:57:33 -0700
Time to fetch mongo data: 0.01s
Total gears found in mongo: 6
Time to get all gears from nodes: 20.277s
Total gears found on the nodes: 6
Total nodes that responded : 2
Checking application gears and ssh keys on corresponding nodes:
518309ae4052a73a05000006 : String...    OK
51830a0a4052a73a05000028 : String...    OK
51830a404052a73a05000035 : String...    OK
51830a774052a73a05000042 : String...    OK
51830ab14052a73a0500004f : String...    OK
51830aef4052a7f30a000002 : String...    OK
Checking node gears in application database:
51830a0a4052a73a05000028...     OK
51830aef4052a7f30a000002...     OK
51830a404052a73a05000035...     OK
518309ae4052a73a05000006...     OK
51830a774052a73a05000042...     OK
51830ab14052a73a0500004f...     OK
Success
Total time: 20.287s
Finished at: 2013-05-02 17:57:53 -0700

Comment 7 errata-xmlrpc 2013-07-09 19:49:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2013-1031.html

