Description of problem:
When trying to move applications from one node to another, oo-admin-move occasionally attempts to move the wrong UUID. It isn't random: for certain apps, it consistently tries to move the same wrong UUID.

Version-Release number of selected component (if applicable):
openshift-origin-broker-util-1.13.11-1.el6oso.noarch

How reproducible:
Out of 800 apps, 60 have this problem. For those 60 apps, it is 100% reproducible.

Steps to Reproduce:
1. oo-admin-move --gear_uuid <UUID>

Actual results:
# oo-admin-move --gear_uuid <UUID>
URL: http://<CORRECT URL>
Login: <CORRECT LOGIN>
App UUID: <WRONG APP UUID>
Gear UUID: <WRONG GEAR UUID>
DEBUG: Source district uuid: <CORRECT DISTRICT ID>
DEBUG: Destination district uuid: <CORRECT DISTRICT ID>
DEBUG: Getting existing app '<CORRECT APP NAME>' status before moving
DEBUG: Error performing status on existing app on try 1: Node execution failure (invalid exit code from node).
DEBUG: Error performing status on existing app on try 2: Node execution failure (invalid exit code from node).
Node execution failure (invalid exit code from node).

Expected results:
# oo-admin-move --gear_uuid <UUID>
URL: http://<CORRECT URL>
Login: <CORRECT LOGIN>
App UUID: <CORRECT APP UUID>
Gear UUID: <CORRECT GEAR UUID>
DEBUG: Source district uuid: <CORRECT DISTRICT ID>
DEBUG: Destination district uuid: <CORRECT DISTRICT ID>
DEBUG: Getting existing app '<CORRECT APP NAME>' status before moving
... and so forth with a successful move ...

Additional info:
I have tried all the options that oo-admin-move has; they consistently give the wrong App UUID and Gear UUID on the troublesome gears, and end with the same failure. This is happening on three different nodes that I am currently aware of, so I think it's a universal bug, not just one node.
Evidently the <WRONG APP UUID> is a red herring. I just retried some of the failed moves. One of them successfully moved the correct app, even with the <WRONG APP UUID>. Another continued to fail with the same errors, but reported the <CORRECT APP UUID>.
Further investigation: the node is not returning the right information. If we don't try the move, but just do a status, we get the following.

Normal app:
# oo-admin-ctl-app -l <LOGIN ID> -a <APP NAME> -c status
Application is either stopped or inaccessible
# echo $?
0

Failing app:
# oo-admin-ctl-app -l <LOGIN ID> -a <APP NAME> -c status
DEBUG OUTPUT:
Failed to execute: 'control status' for /var/lib/openshift/<UUID>/python
Command return code: 7
Success
# echo $?
0

Side note: the vast majority of these are python apps, but not all.
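For reference, the "Command return code: 7" in the failing case matches curl's exit code 7, "failed to connect to host" (CURLE_COULDNT_CONNECT per the curl(1) man page), which hints that the cartridge's control script is surfacing a curl failure rather than a real status. A quick way to observe that exit code, assuming nothing is listening on the chosen port:

```shell
#!/bin/bash
# Hitting a port with no listener makes curl fail to connect.
# Per curl(1), a connection failure is exit code 7 (CURLE_COULDNT_CONNECT).
curl -s --max-time 2 http://127.0.0.1:65530/
echo "curl exit code: $?"   # typically prints 7 when nothing is listening
```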
All of the remaining failures are python apps. I have managed to move all of the non-python apps. Only a handful (10 out of 2000+) of python apps were successfully moved; all the rest give this "Command return code: 7" error.
The following two commits (master, stage) fix the "Command return code: 7" error. It was due to the exit code of the curl commands interacting poorly with the control script running with "-e". https://github.com/openshift/origin-server/pull/3529 https://github.com/openshift/origin-server/pull/3530
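To illustrate the failure mode the commits address, here is a minimal sketch (not the actual python cartridge control script; the endpoint and messages are hypothetical): under `set -e`, any command that exits non-zero aborts the script immediately with that command's exit code, so a curl that cannot reach the app's HTTP endpoint kills the status check with curl's own exit code 7 instead of letting the script report a clean status.

```shell
#!/bin/bash
# Illustrative sketch only -- not the actual cartridge control script.
set -e

# Broken pattern: under -e, a curl connect failure (exit 7) aborts the
# whole script with code 7 before any status line can be printed.
status_broken() {
    curl -s --max-time 2 "http://127.0.0.1:8080/server-status"
    echo "Application is running"
}

# One common fix is to mask the command's exit status so -e does not
# fire, capture it explicitly, and report status from the captured value.
status_fixed() {
    local out rc=0
    out=$(curl -s --max-time 2 "http://127.0.0.1:8080/server-status") || rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "Application is either stopped or inaccessible"
    else
        echo "$out"
        echo "Application is running"
    fi
}

status_fixed
```

With the `|| rc=$?` guard, the script always exits 0 from a status check and the caller (oo-admin-ctl-app / oo-admin-move) sees a valid status line rather than an "invalid exit code from node".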
The above commits should have resolved the underlying issue in the ticket. Moving to Q/E.
From the IRC discussion, it appears as though the problem is not resolved.
Release ticket updated to request an mcollective reload.
Tested on devenv-stage_461 with multi-node. Moved about 50 gears across all python versions combined. No errors found, and checking status for the python apps returns the correct result.

[root@ip-10-152-133-8 ~]# oo-admin-ctl-app -l bmeng -a py271 -c status
Application is running
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
178   178  178   178    0     0  12241      0 --:--:-- --:--:-- --:--:-- 13692
Total Accesses: 0
Total kBytes: 0
Uptime: 5199
ReqPerSec: 0
BytesPerSec: 0
BusyWorkers: 1
IdleWorkers: 0
Scoreboard: W...........................................................
[root@ip-10-152-133-8 ~]# echo $?
0

Moving bug to verified.