Bug 1007085

Summary: oo-admin-move does not fail and rollback on rsync errors
Product: OpenShift Online
Reporter: Matt Woodson <mwoodson>
Component: Pod
Assignee: Dan McPherson <dmcphers>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: urgent
Priority: unspecified
Version: 2.x
CC: dmcphers, fweimer, lxia, zzhao
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2013-09-19 16:50:59 UTC
Type: Bug

Description Matt Woodson 2013-09-11 21:36:31 UTC
Description of problem:

During oo-admin-move operations we are seeing rsync connection failures. The script does NOT catch these failures and continues with the move, so apps are not moved correctly or completely.

This is impacting live customer data.
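
For illustration only, here is a minimal Python sketch of the behavior this report asks for: treat a non-zero exit status from the copy step as fatal, roll back, and never reach the finalize step. This is not the oo-admin-move code (which is not shown in this report); `MoveError`, `run_checked`, `move_gear`, and the `finalize`/`rollback` callbacks are hypothetical names used only for this example.

```python
import subprocess
import sys


class MoveError(Exception):
    """Raised when a step of the (hypothetical) move fails."""


def run_checked(cmd):
    """Run cmd (a list of arguments) and raise MoveError on a non-zero exit."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise MoveError(
            f"{cmd[0]} exited with code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


def move_gear(copy_cmd, finalize, rollback):
    """Copy the content, then finalize; roll back if the copy fails.

    copy_cmd stands in for the rsync invocation. finalize and rollback
    are hypothetical callbacks for the DNS/datastore update and the
    cleanup of the destination, respectively.
    """
    try:
        run_checked(copy_cmd)
    except MoveError:
        rollback()  # undo the partial move before reporting the failure
        raise
    finalize()  # only reached when the copy exited with status 0
```

With a copy command that exits with status 255, as in the logs below, `move_gear` calls `rollback`, re-raises the error, and never calls `finalize`.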

Version-Release number of selected component (if applicable):

openshift-origin-broker-util-1.13.11-1.el6oso.noarch


How reproducible:

Not Sure

Steps to Reproduce:
1. Not sure


Additional info:

I have seen cases where an rsync connection error causes a rollback, but there are many cases where it does not.

Here are examples from the move logs. Note the "rsync error" lines, after which the script continues processing and reports a successful move.

==============================================================================
Tue Sep 10 20:35:48 EDT 2013
URL: XXXXXX
Login: XXXXX
App UUID: 511a77bff2cb835c28001123
Gear UUID: 511a77bff2cb835c28001123
DEBUG: Source district uuid: d5cdcbf8c1af482594451573783958f5
DEBUG: Destination district uuid: 522e193ae0b8cd6380000001
DEBUG: Getting existing app 'sekharapps' status before moving
DEBUG: Gear component 'php-5.3' was stopped
DEBUG: Reserved uid '2701' on district: '522e193ae0b8cd6380000001'
DEBUG: Creating new account for gear 'XXXXXXXX' on ex-std-node256.prod.rhcloud.com
DEBUG: Moving content for app 'XXXXXXXX', gear 'XXXXXXXX' to ex-std-node256.prod.rhcloud.com
Identity added: /var/www/openshift/broker/config/keys/rsync_id_rsa (/var/www/openshift/broker/config/keys/rsync_id_rsa)
ssh: connect to host 10.77.1.19 port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
Agent pid 23646
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 23646 killed;
DEBUG: Moving system components for app 'XXXXXXXX', gear 'XXXXXXXX' to ex-std-node256.prod.rhcloud.com
Identity added: /var/www/openshift/broker/config/keys/rsync_id_rsa (/var/www/openshift/broker/config/keys/rsync_id_rsa)
Agent pid 24733
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 24733 killed;
DEBUG: Fixing DNS and mongo for gear 'XXXXX' after move
DEBUG: Changing server identity of 'XXXXXX' from 'ex-std-node20.prod.rhcloud.com' to 'ex-std-node256.prod.rhcloud.com'
DEBUG: Deconfiguring old app 'XXXXXX' on ex-std-node20.prod.rhcloud.com after move
Successfully moved gear with uuid 'ed2039e4b1024195a3b3e44f7b362016' of app 'XXXXX' from 'ex-std-node20.prod.rhcloud.com' to 'ex-std-node256.prod.rhcloud.com'
==============================================================================

Tue Sep 10 19:51:59 EDT 2013
URL: http://XXXXXX.rhcloud.com
Login: XXXXXX
App UUID: 511a7957f2cb831848004c80
Gear UUID: 511a7957f2cb831848004c80
DEBUG: Source district uuid: d5cdcbf8c1af482594451573783958f5
DEBUG: Destination district uuid: 522e193ae0b8cd6380000001
DEBUG: Getting existing app 'XXXXXX' status before moving
DEBUG: Gear component 'jbossas-7' was stopped
DEBUG: Reserved uid '6285' on district: '522e193ae0b8cd6380000001'
DEBUG: Creating new account for gear 'XXXXXX' on ex-std-node256.prod.rhcloud.com
DEBUG: Moving content for app 'XXXXXX', gear 'XXXXXX' to ex-std-node256.prod.rhcloud.com
Identity added: /var/www/openshift/broker/config/keys/rsync_id_rsa (/var/www/openshift/broker/config/keys/rsync_id_rsa)
ssh: connect to host 10.77.1.19 port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
Agent pid 27318
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 27318 killed;
DEBUG: Moving system components for app 'XXXXXX', gear 'XXXXXX' to ex-std-node256.prod.rhcloud.com
Identity added: /var/www/openshift/broker/config/keys/rsync_id_rsa (/var/www/openshift/broker/config/keys/rsync_id_rsa)
Agent pid 28597
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 28597 killed;
DEBUG: Fixing DNS and mongo for gear 'XXXXXX' after move
DEBUG: Changing server identity of 'XXXXXX' from 'ex-std-node20.prod.rhcloud.com' to 'ex-std-node256.prod.rhcloud.com'
DEBUG: Deconfiguring old app 'XXXXXX' on ex-std-node20.prod.rhcloud.com after move
Successfully moved gear with uuid 'bf4c984494454e53bef5783e52075579' of app 'XXXXXX' from 'ex-std-node20.prod.rhcloud.com' to 'ex-std-node256.prod.rhcloud.com'
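
The pattern in the logs above (an "rsync error:" line followed later by "Successfully moved gear") can be searched for mechanically to find moves that need re-checking. The following is a rough Python sketch under the assumption that the move logs look like the excerpts in this report; `find_suspect_moves` is a hypothetical helper, not part of any OpenShift tool.

```python
import re

# A record separator line made up only of '=' characters.
SEPARATOR = re.compile(r"^=+$")
# The success line, capturing the gear UUID it reports.
MOVED = re.compile(r"^Successfully moved gear with uuid '([0-9a-f]+)'")


def find_suspect_moves(log_text):
    """Return gear UUIDs reported as moved after an rsync error was logged."""
    suspects = []
    saw_rsync_error = False
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        if SEPARATOR.match(line):
            # A separator starts or ends a record, so reset the flag.
            saw_rsync_error = False
        elif line.startswith("rsync error:"):
            saw_rsync_error = True
        else:
            match = MOVED.match(line)
            if match:
                if saw_rsync_error:
                    suspects.append(match.group(1))
                # The success line closes the record even without a separator.
                saw_rsync_error = False
    return suspects
```

Run against the two excerpts above, this would flag both gears, since each record logs an rsync error before its success line.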

Comment 1 Dan McPherson 2013-09-12 13:41:19 UTC
https://github.com/openshift/origin-server/pull/3626

Comment 3 zhaozhanqi 2013-09-13 08:37:15 UTC
Tested this bug on devenv_3780; it has been fixed.

1) Remove the directory on the destination node
2) Move the gear

[root@ip-10-154-184-93 lib]# oo-admin-move --gear_uuid 523286b7bef23b6cea000007 -i ip-10-184-6-242
URL: http://zqphp-zqd.dev.rhcloud.com
Login: zzhao
App UUID: 523286b7bef23b6cea000007
Gear UUID: 523286b7bef23b6cea000007
DEBUG: Source district uuid: c0a525681c2411e3aad322000a9ab85d
DEBUG: Destination district uuid: NONE
DEBUG: Getting existing app 'zqphp' status before moving
DEBUG: Gear component 'php-5.3' was running
DEBUG: Stopping existing app cartridge 'php-5.3' before moving
DEBUG: Force stopping existing app cartridge 'php-5.3' before moving
DEBUG: Reserved uid '' on district: 'NONE'
DEBUG: Creating new account for gear 'zqphp' on ip-10-184-6-242
DEBUG: Moving failed.  Rolling back gear 'zqphp' in 'zqphp' with delete on 'ip-10-184-6-242'
Node execution failure (invalid exit code from node).

The move now quits immediately on the rsync error and does not log "DEBUG: Moving system components for app".
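
The check described above (rollback logged, no later move steps, no success line) could be automated along these lines. This is only a sketch that assumes the output format shown in this comment; `move_aborted_cleanly` is a hypothetical helper name.

```python
def move_aborted_cleanly(output):
    """Return True if the output shows a rollback and no further move steps."""
    lines = [line.strip() for line in output.splitlines()]
    rolled_back = any(l.startswith("DEBUG: Moving failed.") for l in lines)
    continued = any(
        l.startswith("DEBUG: Moving system components for app") for l in lines
    )
    succeeded = any(l.startswith("Successfully moved gear") for l in lines)
    return rolled_back and not continued and not succeeded
```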