Description of problem:
The oo-admin-clear-pending-ops script deletes more data than just the pending_op_groups object. When it deletes the pending_op_groups object, it also removes other data, such as the contents of group_instances and group_overrides.

Version-Release number of selected component (if applicable):
devenv_stage_313

How reproducible:
Always

Steps to Reproduce:
1. Get the app's pending_op_groups data. I captured it from http://$instance/datastore/ while the application was being created.
2. After the application has been created, paste the pending_op_groups data back into the application.
3. Set all the created_at times for this application back by 1 hour.
4. Execute the command oo-admin-clear-pending-ops on the instance:
[root@ip-10-202-34-146 ~]# oo-admin-clear-pending-ops
Executing op for app (5139abda3e20f162fb00014e) - #<PendingAppOpGroup _id: 5139abda3e20f162fb00014f, _type: nil, created_at: 2013-03-07 09:14:02 UTC, updated_at: 2013-03-08 09:14:02 UTC, op_type: "add_features", args: {"features"=>["php-5.3"], "group_overrides"=>[], "init_git_url"=>nil}, parent_op_id: nil, num_gears_added: 1.0, num_gears_removed: 0.0, num_gears_created: 1.0, num_gears_destroyed: 0.0, num_gears_rolled_back: 0.0, user_agent: "rhc/1.6.1 (ruby 1.8.7; x86_64-linux) (2.3.2, ruby 1.8.7 (2011-06-30) [x86_64-linux])">
Execution failed. Rolling back.. complete.
5. After the deletion, check with oo-admin-chk:
[root@ip-10-202-34-146 ~]# id -u 5139abda3e20f162fb00014e
513
[root@ip-10-202-34-146 ~]# oo-admin-chk
Started at: 2013-03-08 04:36:36 -0500
Time to fetch mongo data: 0.014s
Total gears found in mongo: 16
Time to get all gears from nodes: 20.702s
Total gears found on the nodes: 17
Total nodes that responded : 1
Check failed.
Gear 5139abda3e20f162fb00014e exists on node ip-10-202-34-146 (uid: 513) but does not exist in mongo database
Total time: 20.722s
Finished at: 2013-03-08 04:36:57 -0500
6. Check the data of this application in mongodb.

Actual results:
....
],
"group_instances": [ ],
"group_overrides": [ ],
"init_git_url": null,
"name": "q2php",
"pending_op_groups": [ ],
...

Expected results:
The application's mongo data should stay the same as it was before the delete; group_instances and group_overrides should still contain data.

Additional info:
That is because the op_group that was pasted as 'stuck' was 'add_features'. So it rolled back the features and emptied the application. That's what it is supposed to do.

As a further enhancement, the following can be done:
1. If the op_group was 'add_features' and a rollback is done, then the app should be deleted too.
2. If the op_group was 'delete' and an execute is performed, then the app should be deleted too.

Long term, the issue is that cartridge hooks are not re-entrant. That should be fixed.

Keeping the bug open and thinking about other alternatives.
Commits pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/5ecd2fe772d6f71bbfbc75d66316a46cc58a1281
fix bug919379

https://github.com/openshift/origin-server/commit/64013e8c90ec66a42a7d7dfa60005c99e380a633
Merge pull request #1605 from rajatchopra/master
fix bug919379 - clear-pending-ops marks delete_app op
Tested on devenv_2978, same error as before.

@Rajat: regarding this point:
1. If the op_group was 'add_features' and a rollback is done, then the app should be deleted too.
On rollback, the group_instance data is deleted, but the data on the node (such as the uid and the folder) is not. Is this by design?
Fixed with https://github.com/openshift/origin-server/pull/1773

If a rollback fails, the app will now NOT be deleted from mongo unless all of its gears have been cleaned up.
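The guard described in this comment can be illustrated with a minimal sketch. It is not the code from the pull request: ToyApp, deletable_after_rollback? and finish_rollback are hypothetical names, and a plain Hash stands in for the mongo collection.

```ruby
# Hypothetical illustration only; not the origin-server implementation.
# An app record with the list of gears that still exist for it.
ToyApp = Struct.new(:name, :gears)

# The app record may be removed only once no gears remain.
def deletable_after_rollback?(app)
  app.gears.empty?
end

# Called after a rollback: delete the record from the store (a Hash
# standing in for mongo) only if every gear has been cleaned up.
def finish_rollback(store, app)
  return :kept unless deletable_after_rollback?(app)
  store.delete(app.name)
  :deleted
end

store = { "q2php" => ToyApp.new("q2php", ["5139abda3e20f162fb00014e"]) }
app = store["q2php"]

first = finish_rollback(store, app)   # a gear is still present
app.gears.clear                       # gear cleaned up on the node
second = finish_rollback(store, app)  # now safe to delete

puts first
puts second
```

The point of the guard is that the mongo record stays available as long as a gear exists on a node, so a check like oo-admin-chk would not find a gear that has no matching record.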
Reassigned on devenv_2998, same error as before.

If an op with "op_type": "create_group_instance" is already in "state": "completed", why does op.execute still need to run and then raise an exception that triggers the rollback? The rollback removes the group_instance data.

@Rajat, do we need to filter out the ops whose "state" is "completed"?
op.execute is done on the op_group (like add_features) and not on a particular op within it (e.g. create_group_instance). The 'eligible_ops' function in pending_app_op_group.rb ensures that only non-complete ops are really executed. If you copy-paste the op_group after it was actually executed, then op_group.execute will start from the point at which the copy was made.

To me, the sequence of operations is this:
1. An op_group with add_features is created in mongo.
2. The op_group goes ahead and executes up to create_group_instance, but init_gear/create_gear/configure etc. are still not complete.
3. A copy operation is done from mongo at this point.
4. The op_group goes ahead and completes itself, thereby creating a gear etc.
5. Now what was copied in step 3 is pasted back, so mongo shows that group_instance is complete but the gears still need to be created.
6. The oo-admin-clear-pending-ops script comes around and executes this op_group. It fails at creating the gear because the gear already exists.
7. All completed ops are rolled back, which means group_instance is emptied (but the gear is not deleted).

If the above is true, then we would see what you are seeing, but the scenario is flawed in that it would never occur in reality. Can you verify this by providing the broker/mcollective logs from when the oo-admin-clear-pending-ops operation takes place?
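The sequence above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the origin-server code: Op, ToyOpGroup, GearExistsError and clear_pending are hypothetical stand-ins, and only the name eligible_ops is taken from the comment above. It starts from the stale state of step 5, where mongo records create_group_instance as completed while the gear already exists on the node.

```ruby
# Hypothetical stand-ins for illustration; not the origin-server classes.
Op = Struct.new(:name, :state)

class GearExistsError < StandardError; end

class ToyOpGroup
  attr_reader :ops, :group_instances, :node_gears

  def initialize(ops, group_instances:, node_gears:)
    @ops = ops
    @group_instances = group_instances # what mongo records
    @node_gears = node_gears           # what actually exists on the node
  end

  # Only ops that are not yet marked completed are candidates to run.
  def eligible_ops
    ops.reject { |op| op.state == :completed }
  end

  # Run the remaining ops; creating a gear fails if one already exists.
  def execute
    eligible_ops.each do |op|
      if op.name == :create_gear && !node_gears.empty?
        raise GearExistsError, "gear already exists on node"
      end
      op.state = :completed
    end
  end

  # Undo every op recorded as completed, newest first. The gear on the
  # node is untouched because create_gear never completed in this copy.
  def rollback
    ops.select { |op| op.state == :completed }.reverse_each do |op|
      group_instances.clear if op.name == :create_group_instance
      op.state = :rolledback
    end
  end
end

# Mimics the script's behaviour: try to execute, roll back on failure.
def clear_pending(group)
  group.execute
  :executed
rescue GearExistsError
  group.rollback
  :rolled_back
end

stale = ToyOpGroup.new(
  [Op.new(:create_group_instance, :completed), Op.new(:create_gear, :init)],
  group_instances: ["gi-1"],
  node_gears: ["5139abda3e20f162fb00014e"]
)

result = clear_pending(stale)
puts result
puts stale.group_instances.inspect
puts stale.node_gears.inspect
```

In this toy model the run ends with an empty group_instances list and the gear still present on the node, which matches the mismatch that oo-admin-chk reports in the original description.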
@Rajat, yes, the above is true, so I agree with your assessment. Here are the logs captured after the above steps, while running "oo-admin-clear-pending-ops".
Created attachment 716989 development.log
Created attachment 716990 mcollective-client.log
Created attachment 716992 mcollective.log