Bug 919379 - The oo-admin-clear-pending-ops script will delete more data besides pending_op_groups object
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OKD
Classification: Red Hat
Component: Pod
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Assignee: Rajat Chopra
QA Contact: libra bugs
Reported: 2013-03-08 09:59 UTC by Rony Gong 🔥
Modified: 2015-05-15 02:16 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-04-02 14:27:03 UTC
Target Upstream Version:
Embargoed:


Attachments
development.log (58.20 KB, text/plain)
2013-03-27 10:08 UTC, Rony Gong 🔥
mcollective-client.log (22.37 KB, text/plain)
2013-03-27 10:09 UTC, Rony Gong 🔥
mcollective.log (19.11 KB, text/plain)
2013-03-27 10:10 UTC, Rony Gong 🔥

Description Rony Gong 🔥 2013-03-08 09:59:45 UTC
Description of problem:
The oo-admin-clear-pending-ops script will delete more data besides pending_op_groups object

When it deletes the pending_op_groups object, the script also removes other data, such as the contents of group_instances and group_overrides.

Version-Release number of selected component (if applicable):
devenv_stage_313

How reproducible:
Always

Steps to Reproduce:
1. While creating an application, capture the app's pending_op_groups data.
I got it from http://$instance/datastore/
2. After the application is created, paste the pending_op_groups data back into the application.
3. Set every created_at time for this application to 1 hour earlier.
4. Run oo-admin-clear-pending-ops on the instance:
[root@ip-10-202-34-146 ~]# oo-admin-clear-pending-ops  
Executing op for app (5139abda3e20f162fb00014e) - #<PendingAppOpGroup _id: 5139abda3e20f162fb00014f, _type: nil, created_at: 2013-03-07 09:14:02 UTC, updated_at: 2013-03-08 09:14:02 UTC, op_type: "add_features", args: {"features"=>["php-5.3"], "group_overrides"=>[], "init_git_url"=>nil}, parent_op_id: nil, num_gears_added: 1.0, num_gears_removed: 0.0, num_gears_created: 1.0, num_gears_destroyed: 0.0, num_gears_rolled_back: 0.0, user_agent: "rhc/1.6.1 (ruby 1.8.7; x86_64-linux) (2.3.2, ruby 1.8.7 (2011-06-30) [x86_64-linux])"> 
Execution failed. Rolling back.. complete.

5. After the delete, check with oo-admin-chk:
[root@ip-10-202-34-146 ~]# id -u 5139abda3e20f162fb00014e
513
[root@ip-10-202-34-146 ~]# oo-admin-chk
Started at: 2013-03-08 04:36:36 -0500
Time to fetch mongo data: 0.014s
Total gears found in mongo: 16
Time to get all gears from nodes: 20.702s
Total gears found on the nodes: 17
Total nodes that responded : 1
Check failed.
 Gear 5139abda3e20f162fb00014e exists on node ip-10-202-34-146 (uid: 513) but does not exist in mongo database
Total time: 20.722s
Finished at: 2013-03-08 04:36:57 -0500

6. Check the data of this application in mongodb
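
The failure reported by oo-admin-chk in step 5 amounts to a set difference between the gears recorded in mongo and the gears found on the nodes. The sketch below only illustrates that comparison; it is not the oo-admin-chk implementation, and the function name and the placeholder gear ids "aaa" and "bbb" are hypothetical.

```python
def gears_missing_from_mongo(mongo_gears, node_gears):
    """Map each gear found on a node but absent from the datastore to its node."""
    return {gear: node for gear, node in node_gears.items() if gear not in mongo_gears}


# Hypothetical data: two gears known to both sides, plus the gear from this report.
mongo_gears = {"aaa", "bbb"}
node_gears = {
    "aaa": "ip-10-202-34-146",
    "bbb": "ip-10-202-34-146",
    "5139abda3e20f162fb00014e": "ip-10-202-34-146",
}

orphans = gears_missing_from_mongo(mongo_gears, node_gears)
for gear, node in orphans.items():
    print(f"Gear {gear} exists on node {node} but does not exist in mongo database")
```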

Actual results:
....

     
  ],
   "group_instances": [
     
  ],
   "group_overrides": [
     
  ],
   "init_git_url": null,
   "name": "q2php",
   "pending_op_groups": [
     
  ],
...
Expected results:
The application's mongo data should stay the same as before the delete; that is, group_instances and group_overrides should still contain data.


Additional info:

Comment 1 Rajat Chopra 2013-03-08 16:17:46 UTC
That is because the op_group that was pasted as 'stuck' was 'add_features'. So it rolled back the features and emptied the application. That's what it is supposed to do.

As a further enhancement, the following can be done:

1. If the op_group was 'add_features' and a rollback is done, then the app should be deleted too.
2. If the op_group was 'delete' and an execute is performed, then the app should be deleted too.

Long term, the issue is that cartridge hooks are not re-entrant. That should be fixed.


Keeping the bug open and thinking about other alternatives.

Comment 2 openshift-github-bot 2013-03-11 21:32:18 UTC
Commits pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/5ecd2fe772d6f71bbfbc75d66316a46cc58a1281
fix bug919379

https://github.com/openshift/origin-server/commit/64013e8c90ec66a42a7d7dfa60005c99e380a633
Merge pull request #1605 from rajatchopra/master

fix bug919379 - clear-pending-ops marks delete_app op

Comment 3 Rony Gong 🔥 2013-03-21 10:45:51 UTC
Tested on devenv_2978; same error as before.
@Rajat:
Regarding this point:
1. If the op_group was 'add_features' and a rollback is done, then the app should be deleted too.
On rollback, this deletes the group_instance data but does not delete the data on the node, such as the uid and folder.
Is this by design?

Comment 4 Rajat Chopra 2013-03-23 00:01:43 UTC
Fixed with https://github.com/openshift/origin-server/pull/1773
If a rollback fails, the app would now NOT get deleted from mongo, unless all gears have been cleared up.
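
The guard described above can be pictured roughly as follows. This is a minimal sketch of the rule only, not the code from the pull request; the function name, the dictionary standing in for mongo, and the gear list are all hypothetical.

```python
def delete_app_if_gears_cleared(apps, app_id, gears_left_on_nodes):
    """Delete the app record only when no gears remain on any node.

    `apps` is a plain dict standing in for the mongo collection (assumption).
    Returns True if the record was deleted, False if it was kept.
    """
    if gears_left_on_nodes:
        # Rollback did not clean everything up: keep the record so the
        # leftover gears can still be traced back to an application.
        return False
    apps.pop(app_id, None)
    return True


apps = {"5139abda3e20f162fb00014e": {"name": "q2php"}}

# A gear is still present on a node, so the record is kept.
kept = delete_app_if_gears_cleared(apps, "5139abda3e20f162fb00014e", ["gear-1"])

# Once all gears are cleared, the record can be removed.
removed = delete_app_if_gears_cleared(apps, "5139abda3e20f162fb00014e", [])
```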

Comment 5 Rony Gong 🔥 2013-03-26 11:08:58 UTC
Reassigned on devenv_2998; same error as before.
If an op with "op_type": "create_group_instance" already has "state": "completed", why does op.execute still need to run and then raise a "roll back" exception? That removes the group_instance data.

@Rajat, do we need to filter out ops whose "state" is "completed"?

Comment 6 Rajat Chopra 2013-03-26 13:40:42 UTC
op.execute is done on op_group (like add_features) and not on a particular op within (e.g. create_group_instance). The 'eligible_ops' function in pending_app_op_group.rb ensures that only non-complete ops are really executed.
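
As a rough illustration of that filtering, the sketch below keeps only the ops that are not yet complete. It is a simplified model and not the actual 'eligible_ops' code from pending_app_op_group.rb; the PendingOp class and the "init" state name are assumptions, while "completed" and the op types come from this report.

```python
from dataclasses import dataclass


@dataclass
class PendingOp:
    op_type: str
    state: str  # "completed" is from the report; "init" is an assumed placeholder


def eligible_ops(ops):
    """Return only the ops that still need to run, preserving their order."""
    return [op for op in ops if op.state != "completed"]


ops = [
    PendingOp("create_group_instance", "completed"),
    PendingOp("init_gear", "init"),
    PendingOp("create_gear", "init"),
]

pending = eligible_ops(ops)
print([op.op_type for op in pending])
```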

If you do a copy-paste of the op_group after it was actually executed, then op_group.execute will start from the point at which the copy was made. To me, the sequence of operations is this:

1. op_group with add_features is created in mongo
2. op_group goes ahead and executes until create_group_instance, but init_gear/create_gear/configure etc are still not complete.
3. A copy operation is done from mongo at this point
4. the op_group goes ahead and completes itself, thereby creating a gear etc..
5. Now a paste is done of what was copied in #3, so mongo shows that group_instance is complete but gears need to be created
6. admin_clear_pending_ops script comes around and executes this op_group - it fails at creating the gear because gear already exists
7. all completed ops are rolled back, which means group_instance is emptied (but the gear is not deleted)


If the above is true, then we would see what you are seeing, but this sequence would never occur in reality. Can you verify this by providing the broker/mcollective logs from when the admin-clear-pending-ops operation takes place?

Comment 7 Rony Gong 🔥 2013-03-27 10:07:19 UTC
@Rajat, the above is true, so I agree with your assessment.
Here are the logs from running "oo-admin-clear-pending-ops" after the above steps:

Comment 8 Rony Gong 🔥 2013-03-27 10:08:06 UTC
Created attachment 716989 [details]
development.log

Comment 9 Rony Gong 🔥 2013-03-27 10:09:02 UTC
Created attachment 716990 [details]
mcollective-client.log

Comment 10 Rony Gong 🔥 2013-03-27 10:10:46 UTC
Created attachment 716992 [details]
mcollective.log

