Created attachment 645039 [details]
move log for this gear

Description of problem:
When moving a scaled gear recently, the gear became out of sync with mongo. The gear in question is listed in additional info. We moved this gear off of ex-std-node74, but mongo still says the scaled gears exist on node74, which they do not; they are now located elsewhere. For instance, a1fd466408244af08404dde6fc446e98 is on ex-std-node36. We need tools that function properly and update mongo when moves of scaled gears occur. We also need tools to fix the problems after they occur with scaled gears. Currently this gear is in disarray, as we cannot repair it with the current tool set or general ops knowledge.

Version-Release number of selected component (if applicable):
current, 2.0.19.1

How reproducible:

Steps to Reproduce:
1. Create a scaled application.
2. Scale it up so it has multiple "subgears".
3. Move the application's main gear.

Actual results:
The main gear migrated successfully. The site is still up. Mongo points to the incorrect locations of the scaled gears.

Expected results:
oo-admin-move should successfully move the gears and handle scaled gears correctly by updating mongo properly.
Additional info:

App Name: redhatchallenge
App UUID: aa70ada22855472a8bf5a6d12a1d93b9
Creation Time: 2012-08-04 05:49:18 AM
URL: http://redhatchallenge-rhc.rhcloud.com

Gear[0]
  Server Identity: ex-std-node21.prod.rhcloud.com
  Gear UUID: aa70ada22855472a8bf5a6d12a1d93b9
  Gear UID: 4554

Group Instance[1]:
  Server Identity: ex-std-node74.prod.rhcloud.com
  Gear UUID: a1fd466408244af08404dde6fc446e98
  Gear UID: 2565
  Gear[1]
    Server Identity: ex-std-node74.prod.rhcloud.com
    Gear UUID: 5e6e28a066e548aca075eaf394b7527f
    Gear UID: 3179
  Gear[2]
    Server Identity: ex-std-node74.prod.rhcloud.com
    Gear UUID: 89181dce383e44cfa3355b88caa3c861
    Gear UID: 5110
  Gear[3]
    Server Identity: ex-std-node74.prod.rhcloud.com
    Gear UUID: 53a0d8c17e0549168eac836670317413
    Gear UID: 3745
  Gear[4]
    Server Identity: ex-std-node49.prod.rhcloud.com
    Gear UUID: 36b74e9eb31c44b58650444d2a0d0292
    Gear UID: 2798
  Gear[5]
    Server Identity: ex-std-node74.prod.rhcloud.com
    Gear UUID: 2e0f630d33614afd9184b0cea886e2ee
    Gear UID: 3180
  Gear[6]
    Server Identity: ex-std-node74.prod.rhcloud.com
    Gear UUID: 4ebc1b95f48a401ab452e0154639c2a7
    Gear UID: 3184
  Gear[7]
    Server Identity: ex-std-node74.prod.rhcloud.com
    Gear UUID: c2db3776227549a5b481c120c6feee1d
    Gear UID: 3825

Group Instance[2]:
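The mismatch reported above (mongo records a gear on ex-std-node74 while it actually lives on ex-std-node36) can be detected mechanically by comparing the two views of gear placement. A minimal sketch of such a consistency check, assuming ops can gather both mappings by some means (mongo query, per-node inventory); the function name is hypothetical and not part of any existing OpenShift tool:

```python
# Hypothetical helper: flag gears whose node recorded in mongo disagrees
# with the node they actually run on. Both inputs map gear UUID -> server
# identity; collecting them is assumed to happen elsewhere.

def find_stale_gears(mongo_view, actual_view):
    """Return {gear_uuid: (mongo_node, actual_node)} for every mismatch."""
    stale = {}
    for uuid, mongo_node in mongo_view.items():
        actual_node = actual_view.get(uuid)
        if actual_node is not None and actual_node != mongo_node:
            stale[uuid] = (mongo_node, actual_node)
    return stale

# Example using the gear from this bug: mongo still points at ex-std-node74,
# but the gear was found on ex-std-node36 after the move.
mongo_view = {"a1fd466408244af08404dde6fc446e98": "ex-std-node74.prod.rhcloud.com"}
actual_view = {"a1fd466408244af08404dde6fc446e98": "ex-std-node36.prod.rhcloud.com"}
print(find_stale_gears(mongo_view, actual_view))
```

Any gear the check reports would then be a candidate for manual repair (e.g. removal with the updated admin tooling).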
Rajat, the log looks right. Perhaps your theory about concurrent scale-ups is relevant.
The logs suggest nothing went wrong with the move itself. One way things can go wrong is with concurrent operations; in this case a likely guess is that a scale-up operation was happening while the move (of a different gear of the app) was in progress. A new method has been added to oo-admin-ctl-app that can remove such a broken gear. More information in bug #876330.
The original concurrency problems will be fixed with the model_refactor code. Until then, ops will have to manually check for broken gears and use the updated oo-admin-ctl-app script.
Verified on devenv_stage_254:

[root@ip-10-202-193-65 data]# oo-admin-ctl-app -c removegear -l qgong -a qsphp -g 0fb4256d47cb4b61a0115148a82e8cd6
Success