Bug 858092 - rhc-admin-move fails when gears are idle
Summary: rhc-admin-move fails when gears are idle
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OKD
Classification: Red Hat
Component: Pod
Version: 2.x
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Rajat Chopra
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-09-18 00:07 UTC by Kenny Woodson
Modified: 2015-05-15 02:04 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-11-06 18:48:47 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
move log and mco log (3.48 KB, text/plain)
2012-09-18 00:07 UTC, Kenny Woodson
no flags

Description Kenny Woodson 2012-09-18 00:07:24 UTC
Created attachment 613821 [details]
move log and mco log

Description of problem:

During a recent district compaction, around 297 gears failed to move.  I spoke with rchopra, who said the moves would succeed if we un-idled the gears before moving them.  Un-idling first did in fact let the gears move successfully.

rhc-admin-move is an important tool we use to keep costs low while still providing availability for active applications.  It needs to handle these failures gracefully.

Version-Release number of selected component (if applicable):


How reproducible:

This is currently reproducible on ex-std-node50.prod, where I have ~300 failed moves.  Not all of them are necessarily due to idleness, but after un-idling the first 100 I was able to move almost all of them without problems.
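The manual workaround (un-idle each failed gear, then retry the move) can be sketched as a small script.  This is only an illustration: `unidle-gear` is a hypothetical placeholder for whatever un-idle command the node provides, and the command names are overridable so the logic can be exercised without a broker; the `--gear_uuid`/`-i` flags are the ones shown in the verification transcript below.

```shell
#!/bin/bash
# Sketch of the un-idle-then-move workaround (assumed helper names).
# move_idled_gears FILE HOST: for each gear UUID listed in FILE,
# un-idle it and then move it to HOST.
move_idled_gears() {
  local file=$1 host=$2
  local unidle=${UNIDLE_CMD:-unidle-gear}   # hypothetical un-idle helper
  local move=${MOVE_CMD:-rhc-admin-move}    # flags as seen in the transcript
  while read -r gear_uuid; do
    [ -z "$gear_uuid" ] && continue
    # if un-idling fails, skip the move rather than let it fail later
    "$unidle" "$gear_uuid" || { echo "un-idle failed for $gear_uuid" >&2; continue; }
    "$move" --gear_uuid "$gear_uuid" -i "$host"
  done < "$file"
}
```

Feeding it the list of failed gear UUIDs would automate what was done by hand for the first 100 gears.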

Steps to Reproduce:
1.  Run rhc-admin-move on an application while it is idled
2.  The log reports "Moved failed"
3.  Verify
  
Actual results:

rhc-admin-move fails when attempting to move idled applications during district compaction and node shuffling.

Expected results:

rhc-admin-move should be improved so that moves of both active and idled gears succeed.  If that means un-idling an application before the move, it would be worth it.

Additional info:

Comment 1 Rajat Chopra 2012-09-19 18:22:14 UTC
Fixed with commit #42ffb8b in crankcase.repo!
The idle step is now backgrounded in the move hook; this should work around the timeout issue.
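The backgrounding technique described here is the standard pattern of detaching a slow step so a remotely invoked hook can return before its RPC timeout.  A generic sketch (the helper name and log path are illustrative, not taken from the actual move hook):

```shell
#!/bin/bash
# run_backgrounded CMD [ARGS...]: launch a slow step detached so the
# caller (e.g. an mcollective-invoked hook) returns immediately instead
# of blocking until the step finishes and tripping the RPC timeout.
run_backgrounded() {
  # nohup + & detaches the job; output goes to a log, not the hook's stdout
  nohup "$@" >>/tmp/move_hook.log 2>&1 &
  echo $!   # report the background pid so callers can track it if needed
}
```

The hook then only pays the cost of forking the job, not of waiting for the idle/un-idle step itself.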

The real issue here may be apache's graceful restart queueing up on its file lock. That is a more difficult problem to solve.

Comment 2 Rony Gong 🔥 2012-09-21 07:51:14 UTC
Verified on devenv_2209; checked moving an idled app both within and across districts.
[root@ip-10-122-34-55 ~]# rhc-admin-move --gear_uuid 0ecc072d7747470ea995cf292f112ae5 -i ip-10-122-34-55
URL: http://qperl-qgong9.dev.rhcloud.com
Login: qgong
App UUID: 0ecc072d7747470ea995cf292f112ae5
Gear UUID: 0ecc072d7747470ea995cf292f112ae5
DEBUG: Source district uuid: c4d4fbda69ba485ea64034e3089b1f15
DEBUG: Destination district uuid: c4d4fbda69ba485ea64034e3089b1f15
DEBUG: District unchanged keeping uid
DEBUG: Getting existing app 'qperl' status before moving
DEBUG: Gear component 'perl-5.10' was idle
DEBUG: Creating new account for gear 'qperl' on ip-10-122-34-55
DEBUG: Moving content for app 'qperl', gear 'qperl' to ip-10-122-34-55
Identity added: /var/www/stickshift/broker/config/keys/rsync_id_rsa (/var/www/stickshift/broker/config/keys/rsync_id_rsa)
Warning: Permanently added '10.122.34.55' (RSA) to the list of known hosts.
Agent pid 19656
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 19656 killed;
DEBUG: Performing cartridge level move for 'perl-5.10' on ip-10-122-34-55
DEBUG: Fixing DNS and mongo for gear 'qperl' after move
DEBUG: Changing server identity of 'qperl' from 'ip-10-90-246-107' to 'ip-10-122-34-55'
DEBUG: Deconfiguring old app 'qperl' on ip-10-90-246-107 after move
Successfully moved 'qperl' with gear uuid '0ecc072d7747470ea995cf292f112ae5' from 'ip-10-90-246-107' to 'ip-10-122-34-55'

