Created attachment 643767 [details]
mco log (see kwoodson for full log)

Description of problem:
During some gear moves and district compaction, I noticed that a gear existed in mongo but not on the disk where it should have been. Upon further investigation I noticed that there were two move logs with the same gear uuid. These two moves were called at the same time, as I was attempting to automate the moves by using more than one ex-srv at the same time. Looking into the logs, it appears that the first move succeeded and the second move failed. The second move then proceeded to deconfigure the application on the same node: it removed the user, the gear directory, and the gear's DNS entry. The mongo record was the only trace left of this gear.

Version-Release number of selected component (if applicable):
Current, 2.0.19

How reproducible:
Very reproducible.

Steps to Reproduce:
1. Create an application.
2. Call move on the application twice at the same time.
3. Verify the application's state.

Actual results:
Application was deleted from the node.

Expected results:
The second move should fail, since the gear is already being moved, and the application should survive the move.

Additional info:
I discussed this issue with the development team, and they mentioned that a locking mechanism is needed to handle situations where an API call has been made while a gear is being moved, scaled, created, or deleted.
I think this needs to become a user story, as it requires a locking mechanism to be added for gear control. DLMs (distributed lock managers) are tricky to get right and impose more rigorous clustering semantics (quorum and fencing) than we're likely to want to put up with. The least DLM-like solution would be some kind of local per-gear operation lock that mcollective holds for the duration of a single operation. Holding this bug for a day to see if there's any further discussion, and then I'll make it a user story.
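To make the "local gear operation lock" idea concrete, here is a minimal sketch (my own illustration, not OpenShift code; the lock directory and helper name are invented) of a per-gear flock held for the duration of one operation, so a second concurrent move/scale/delete on the same gear fails fast instead of racing:

```ruby
#!/usr/bin/env ruby
# Hypothetical sketch of a local per-gear operation lock. An exclusive,
# non-blocking flock on a gear-specific lock file is held for the whole
# operation; a concurrent operation on the same gear is rejected.
require 'fileutils'
require 'tmpdir'

LOCK_DIR = File.join(Dir.tmpdir, 'gear-locks')  # assumed location

def with_gear_lock(gear_uuid)
  FileUtils.mkdir_p(LOCK_DIR)
  File.open(File.join(LOCK_DIR, gear_uuid), File::RDWR | File::CREAT, 0o600) do |f|
    # LOCK_NB makes flock return false instead of blocking, so the second
    # caller errors out immediately rather than silently racing the first.
    unless f.flock(File::LOCK_EX | File::LOCK_NB)
      raise "gear #{gear_uuid} is already being operated on"
    end
    begin
      yield
    ensure
      f.flock(File::LOCK_UN)
    end
  end
end

# A second operation on the same gear, while the first holds the lock,
# is rejected instead of deconfiguring the gear out from under it.
result = []
with_gear_lock('51073c1eb96195b708000005') do
  result << :first_op_ran
  begin
    with_gear_lock('51073c1eb96195b708000005') { result << :second_op_ran }
  rescue RuntimeError
    result << :second_op_rejected
  end
end
puts result.inspect  # => [:first_op_ran, :second_op_rejected]
```

Because this lock is purely node-local, it only serializes operations landing on the same node; it avoids the quorum/fencing burden of a DLM at the cost of not protecting cross-node races.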
I am not quite sure I understand how this would happen. The move steps shouldn't remove anything from the source unless the move was successful.
Regardless, the short answer is: don't do that. But whatever issue you hit should be resolved by the model refactor.
(In reply to comment #2)
> I am not quite sure I understand how this would happen. The move steps
> shouldn't remove anything from the source unless the move was successful.

This happened multiple times. If the delay of the second move is anywhere from immediate up to around 30 seconds behind the original move _and_ both moves target the same destination node within a district, the move has the potential to delete the gear entirely. I saw this happen several times.

Yes, the workaround is not to issue moves for the same UUID at the same time. In my case, the overlap came from a list of UUIDs that was divided into two separate move scripts and then executed at the same time.

During my discussion with ramr, he mentioned that this could be the case for all API calls that deal with a gear: moves, scale up/down, and parallel creates/deletes.
Your comment about moving to the same node makes sense, since it would be a successful target for one move and an unsuccessful target for the other (each deleting their counterpart). I was thinking about this case after my last comment. Are you specifying the move target in this case, or is it random luck that the same gears are moving to the same nodes?
This is inside of a district, so the range of nodes the gear could migrate to is (depending on the district) anywhere from 4 to 10 nodes currently. Therefore, when the moves were running at the same time, the broker often chose the same node. Not 100% of the time, but often enough to trigger this case.
Fixed in model refactor.
Verified on devenv_2737

1. Call move on this application a little early; the move succeeds, and the log shows the lock being taken:

2013-01-28 22:16:26.451 [DEBUG] MOPED: 127.0.0.1:27017 COMMAND database=openshift_broker_dev command={:findAndModify=>"locks", :query=>{"user_id"=>"51073ae8b96195b708000001", "app_ids"=>{"$nin"=>["51073c1eb96195b708000005"]}}, :new=>true, :update=>{"$push"=>{:app_ids=>"51073c1eb96195b708000005"}}} (0.5264ms) (pid:12997

2. Call move on the same application a little later, at the same time; the second move fails (it hangs until the first move finishes, then reports the failure):

[root@ip-10-112-79-167 ~]# oo-admin-move --gear_uuid 51073c1eb96195b708000005 -i ip-10-118-193-171
URL: http://q2php-qgong6.dev.rhcloud.com
Login: qgong
App UUID: 51073c1eb96195b708000005
Gear UUID: 51073c1eb96195b708000005
DEBUG: Source district uuid: 51073b60b96195b610000001
DEBUG: Destination district uuid: 51073b60b96195b610000001
DEBUG: District unchanged keeping uid
DEBUG: Getting existing app 'q2php' status before moving
DEBUG: Error performing status on existing app on try 1: Node execution failure (invalid exit code from node). If the problem persists please contact Red Hat support.
DEBUG: Error performing status on existing app on try 2: Node execution failure (invalid exit code from node). If the problem persists please contact Red Hat support.
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-msg-broker-mcollective-1.4.1/lib/openshift/mcollective_application_container_proxy.rb:2539:in `parse_result': Node execution failure (invalid exit code from node). If the problem persists please contact Red Hat support. (OpenShift::NodeException)
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-msg-broker-mcollective-1.4.1/lib/openshift/mcollective_application_container_proxy.rb:2676:in `run_cartridge_command'
	from /var/www/openshift/broker/lib/express/broker/mcollective_ext.rb:12:in `run_cartridge_command'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-msg-broker-mcollective-1.4.1/lib/openshift/mcollective_application_container_proxy.rb:1120:in `status'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-msg-broker-mcollective-1.4.1/lib/openshift/mcollective_application_container_proxy.rb:2095:in `block in get_cart_status'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-msg-broker-mcollective-1.4.1/lib/openshift/mcollective_application_container_proxy.rb:2239:in `block in do_with_retry'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-msg-broker-mcollective-1.4.1/lib/openshift/mcollective_application_container_proxy.rb:2237:in `each'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-msg-broker-mcollective-1.4.1/lib/openshift/mcollective_application_container_proxy.rb:2237:in `do_with_retry'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-msg-broker-mcollective-1.4.1/lib/openshift/mcollective_application_container_proxy.rb:2094:in `get_cart_status'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-msg-broker-mcollective-1.4.1/lib/openshift/mcollective_application_container_proxy.rb:2070:in `get_app_status'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-msg-broker-mcollective-1.4.1/lib/openshift/mcollective_application_container_proxy.rb:1768:in `move_gear'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-msg-broker-mcollective-1.4.1/lib/openshift/mcollective_application_container_proxy.rb:1746:in `block in move_gear_secure'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.4.1/app/models/application.rb:1005:in `run_in_application_lock'
	from /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-msg-broker-mcollective-1.4.1/lib/openshift/mcollective_application_container_proxy.rb:1745:in `move_gear_secure'
	from /usr/sbin/oo-admin-move:110:in `<main>'
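The findAndModify log line above shows the pattern that fixed the race: the app id is pushed into the user's "locks" document only when the query's $nin condition confirms it is not already there, so the check and the push are one atomic step. Here is a sketch of that compare-and-set semantics (my own illustration, not the broker source; class and method names are invented, and an in-memory hash guarded by a Mutex stands in for MongoDB):

```ruby
#!/usr/bin/env ruby
# Illustrative model of the atomic app-lock acquisition seen in the MOPED log:
# findAndModify(query: {"app_ids" => {"$nin" => [app_id]}},
#               update: {"$push" => {:app_ids => app_id}})
# Only one caller can move the app id from "absent" to "present".
require 'thread'

class AppLockStore
  def initialize
    @mutex = Mutex.new
    @locks = Hash.new { |h, k| h[k] = [] }  # user_id => locked app_ids
  end

  # Atomic test-and-set: the $nin check and the $push happen under one
  # mutex, mirroring findAndModify's single-document atomicity.
  def acquire(user_id, app_id)
    @mutex.synchronize do
      return false if @locks[user_id].include?(app_id)  # $nin matched nothing
      @locks[user_id] << app_id                          # $push took the lock
      true
    end
  end

  def release(user_id, app_id)
    @mutex.synchronize { @locks[user_id].delete(app_id) }
  end
end

store = AppLockStore.new
user = '51073ae8b96195b708000001'
app  = '51073c1eb96195b708000005'
puts store.acquire(user, app)  # first move takes the lock   => true
puts store.acquire(user, app)  # concurrent move is refused  => false
store.release(user, app)
puts store.acquire(user, app)  # free again after release    => true
```

This is why the second oo-admin-move in the verification above waits and then fails: it cannot acquire the application lock until the first move releases it, and by then the gear's state no longer matches what the second move expects.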