Bug 858092

Summary: rhc-admin-move fails when gears are idle
Product: OKD Reporter: Kenny Woodson <kwoodson>
Component: Pod    Assignee: Rajat Chopra <rchopra>
Status: CLOSED CURRENTRELEASE QA Contact: libra bugs <libra-bugs>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 2.x    CC: bmeng, qgong, twiest
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-11-06 18:48:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
  move log and mco log

Description Kenny Woodson 2012-09-18 00:07:24 UTC
Created attachment 613821 [details]
move log and mco log

Description of problem:

During a recent district compaction we were moving gears, and around 297 gears failed to move.  I spoke with rchopra, who said the moves would succeed if we un-idled the gears before moving them.  Un-idling first did in fact allow the gears to move.

rhc-admin-move is a significant tool we use to keep costs low while keeping active applications available.  It needs to handle these failures gracefully.

Version-Release number of selected component (if applicable):


How reproducible:

This is currently reproducible on ex-std-node50.prod, where I have ~300 failed moves.  Not all of them are necessarily due to idleness, but after un-idling the first 100 I was able to move almost all of them without problems.

Steps to Reproduce:
1.  Run rhc-admin-move on an application while it is idled
2.  The log will say, "Moved failed"
3.  Verify
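The failing invocation can be sketched as follows; the gear UUID and node address are taken from the verification run in comment 2, and the command is only echoed here as a dry run, since an actual broker is needed to perform a move:

```shell
# Dry-run sketch of the failing move (values from the verification log).
# On a real broker, run the command itself rather than echoing it.
GEAR_UUID=0ecc072d7747470ea995cf292f112ae5
NODE=ip-10-122-34-55
echo "rhc-admin-move --gear_uuid $GEAR_UUID -i $NODE"
# While the gear is idled, this move fails with "Moved failed" in the log.
```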
  
Actual results:

rhc-admin-move fails when attempting to move idled applications during district compaction and node shuffling.

Expected results:

rhc-admin-move should be improved so that moves of both active and idle gears succeed.  If that means un-idling an application before the move, it would be worth it.

Additional info:

Comment 1 Rajat Chopra 2012-09-19 18:22:14 UTC
Fixed with commit #42ffb8b in crankcase.repo!
The idling step is now backgrounded in the move hook, which should get around the timeout issue.
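The pattern behind the fix, as described, is to run the slow idle/un-idle work in the background so the move hook returns before the remote-call timeout fires. A minimal shell sketch of that pattern, where slow_step is a stand-in for the real un-idle logic (an assumption, not the actual hook code in crankcase.repo):

```shell
# Background a slow step so the calling hook returns quickly instead of
# blocking until the step completes and tripping a timeout.
slow_step() { sleep 2; }   # stand-in for the real un-idle work

slow_step &                # run in the background; the hook does not wait
echo "hook returned before slow_step finished (pid $!)"
```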

The real issue here may be the apache graceful queueing up on its file lock. That is a more difficult problem to solve.

Comment 2 Rony Gong 🔥 2012-09-21 07:51:14 UTC
Verified on devenv_2209; checked moving an idle app both within and across districts.
[root@ip-10-122-34-55 ~]# rhc-admin-move --gear_uuid 0ecc072d7747470ea995cf292f112ae5 -i ip-10-122-34-55
URL: http://qperl-qgong9.dev.rhcloud.com
Login: qgong
App UUID: 0ecc072d7747470ea995cf292f112ae5
Gear UUID: 0ecc072d7747470ea995cf292f112ae5
DEBUG: Source district uuid: c4d4fbda69ba485ea64034e3089b1f15
DEBUG: Destination district uuid: c4d4fbda69ba485ea64034e3089b1f15
DEBUG: District unchanged keeping uid
DEBUG: Getting existing app 'qperl' status before moving
DEBUG: Gear component 'perl-5.10' was idle
DEBUG: Creating new account for gear 'qperl' on ip-10-122-34-55
DEBUG: Moving content for app 'qperl', gear 'qperl' to ip-10-122-34-55
Identity added: /var/www/stickshift/broker/config/keys/rsync_id_rsa (/var/www/stickshift/broker/config/keys/rsync_id_rsa)
Warning: Permanently added '10.122.34.55' (RSA) to the list of known hosts.
Agent pid 19656
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 19656 killed;
DEBUG: Performing cartridge level move for 'perl-5.10' on ip-10-122-34-55
DEBUG: Fixing DNS and mongo for gear 'qperl' after move
DEBUG: Changing server identity of 'qperl' from 'ip-10-90-246-107' to 'ip-10-122-34-55'
DEBUG: Deconfiguring old app 'qperl' on ip-10-90-246-107 after move
Successfully moved 'qperl' with gear uuid '0ecc072d7747470ea995cf292f112ae5' from 'ip-10-90-246-107' to 'ip-10-122-34-55'