Bug 1056394 - The snapshot save/restore failed after gear is moved
Summary: The snapshot save/restore failed after gear is moved
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: ImageStreams
Version: 2.0.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Brenton Leanhardt
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks: 1057952
 
Reported: 2014-01-22 05:17 UTC by Anping Li
Modified: 2017-03-08 17:36 UTC (History)
6 users

Fixed In Version: rubygem-openshift-origin-container-selinux-0.4.1.2-1.el6op rubygem-openshift-origin-msg-broker-mcollective-1.17.6-1.el6op openshift-origin-msg-node-mcollective-1.17.6-1.el6op rubygem-openshift-origin-node-1.17.5.10-1.el6op rubygem-openshift-origin-controller-1.17.12.3-1.el6op (see Comment 13; the field was truncated by a length limit)
Doc Type: Bug Fix
Doc Text:
Attempts to restore an application snapshot would fail when restoring from a snapshot that was created after a cartridge was moved, due to empty deployments that were created during the move. This bug fix corrects the cartridge move logic so that the empty deployments are no longer created. Note that this only applies to gears created after applying this fix. For existing applications experiencing this issue, the ~/app-deployments/ directory must be searched for any empty directories, which then must be removed with the rmdir command.
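The manual cleanup described above can be sketched as a small shell helper. This is a sketch only; clean_empty_deployments is a hypothetical name, and the directory layout (timestamped deployment dirs, a by-id index, and a current symlink) is taken from the listings in this bug:

```shell
# Hypothetical helper for the manual workaround above: remove empty
# deployment directories under a gear's app-deployments directory.
clean_empty_deployments() {
  dir="$1"
  for d in "$dir"/*/; do
    # skip symlinks such as 'current' and the by-id index directory
    [ -L "${d%/}" ] && continue
    [ "$(basename "$d")" = "by-id" ] && continue
    # rmdir only succeeds on empty directories, so real deployments
    # are left untouched
    rmdir "$d" 2>/dev/null && echo "removed empty deployment: $d"
  done
}

# On an affected gear this would be run as the gear user, e.g.:
#   clean_empty_deployments "$HOME/app-deployments"
```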
Clone Of:
: 1057952 (view as bug list)
Environment:
Last Closed: 2014-02-25 15:43:25 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2014:0209 0 normal SHIPPED_LIVE Red Hat OpenShift Enterprise 2.0.3 bugfix and enhancement update 2014-02-25 20:40:32 UTC

Description Anping Li 2014-01-22 05:17:56 UTC
Description of problem:
After a 'move cartridge' operation, do 'snapshot save'; the snapshot can't be restored and fails with 'error: 422 Unprocessable Entity.'
It seems a temporary deployment timestamp directory is created during the cartridge move, and the 'current' symlink does not point to the last deployment. After deleting the last deployment, or updating the deployment with 'git push', snapshot save/restore works fine.
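The stale-symlink symptom can be checked from the gear with a short sketch (current_is_latest is a hypothetical name; it assumes the timestamped directory layout shown in this report, where names sort chronologically):

```shell
# Hypothetical check for the symptom described above: compare the
# 'current' symlink with the newest timestamped deployment directory.
current_is_latest() {
  dir="$1"
  # timestamped names sort chronologically, so a lexicographic sort works
  latest=$(ls -1d "$dir"/20* 2>/dev/null | sort | tail -n 1)
  if [ "$(readlink -f "$dir/current")" = "$(readlink -f "$latest")" ]; then
    echo "current points at the latest deployment"
  else
    echo "current is stale"
  fi
}
```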

Version-Release number of selected component (if applicable):
puddle-2-0-2-2014-01-16

How reproducible:
Always

Steps to Reproduce:
1. Create an app, e.g. php1.
2. Move the cartridge to another node.
3. Do not 'git push' after the move.
4. Check ~/app-deployments.
5. rhc snapshot save php1
6. rhc snapshot restore php1

Actual results:
Step 4:
The 'current' symlink does not point to the latest deployment:
drwxr-x---.  5 52df298a4945dc0dbd0003ea 52df298a4945dc0dbd0003ea 4.0K Jan 21 21:41 2014-01-21_21-33-18.136
drwxr-x---.  5 52df298a4945dc0dbd0003ea 52df298a4945dc0dbd0003ea 4.0K Jan 21 21:41 2014-01-21_21-35-34.236
drwxr-x---.  5 52df298a4945dc0dbd0003ea 52df298a4945dc0dbd0003ea 4.0K Jan 21 21:48 2014-01-21_21-48-32.227
drwxr-xr-x.  2 52df298a4945dc0dbd0003ea 52df298a4945dc0dbd0003ea 4.0K Jan 21 21:41 by-id
lrwxrwxrwx.  1 52df298a4945dc0dbd0003ea 52df298a4945dc0dbd0003ea   23 Jan 21 21:41 current -> 2014-01-21_21-35-34.236

Step 6: The snapshot restore failed with the error message below:
[ose215@dhcp-9-237 ~]$ rhc snapshot restore t1
Restoring from snapshot t1.tar.gz...
Removing old git repo: ~/git/t1.git/
Removing old data dir: ~/app-root/data/*
Restoring ~/git/t1.git and ~/app-root/data
httpd: Could not reliably determine the server's fully qualified domain name, using nd216.oseanli.cn for ServerName
error: 422 Unprocessable Entity. Use --trace to view backtrace
Error in trying to restore snapshot. You can try to restore manually by running:
cat 't1.tar.gz' | ssh 52df298a4945dc0dbd0003ea.cn 'restore INCLUDE_GIT

Expected results:
1. The temporary deployment timestamp directory is handled correctly.
2. The snapshot can be saved and restored.


Additional info:
The bug can be reproduced in Online.

Comment 2 Anping Li 2014-01-26 07:52:37 UTC
The bug can't be reproduced in Online devenv-stage_655. It seems other bug fixes also affect this problem.

Comment 3 Anping Li 2014-01-28 02:41:07 UTC
Although the bug can't be reproduced in Online devenv-stage_655, there is another low-severity bug with these reproduction steps in Online: https://bugzilla.redhat.com/show_bug.cgi?id=1057976.

Comment 5 Brenton Leanhardt 2014-01-28 19:09:46 UTC
From what I see the error is occurring during deployment verification:

My broken gear has two deployments after following the steps in the description:

2014-01-28_13-41-48.285
2014-01-28_13-43-32.100

The first one has the following in the metadata.json:

{"git_ref":"master","git_sha1":"b15ec05","id":"b70979a6","hot_deploy":null,"force_clean_build":null,"activations":[1390934519.0726473,1390934722.984616],"checksum":"82eb881b7b17e68e57168111d3bb6369ee1a04e3"}

The second has:
{"git_ref":"master","git_sha1":null,"id":null,"hot_deploy":null,"force_clean_build":null,"activations":[],"checksum":null}

I haven't been able to reproduce this upstream.  The interesting thing is when I try to reproduce this upstream the second deployment doesn't even have a metadata.json.
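A quick way to spot such blank deployments is a diagnostic sketch like the following (not part of any fix; find_blank_deployments is a hypothetical name, and the grep patterns assume the compact JSON formatting shown above):

```shell
# Hypothetical diagnostic: list deployment directories whose metadata.json
# looks blank (a null id and an empty activations list), matching the
# second deployment's metadata shown above.
find_blank_deployments() {
  dir="$1"
  for m in "$dir"/*/metadata.json; do
    [ -f "$m" ] || continue
    if grep -q '"id":null' "$m" && grep -q '"activations":\[\]' "$m"; then
      dirname "$m"
    fi
  done
}
```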

Comment 6 Brenton Leanhardt 2014-01-29 00:57:52 UTC
This is a mess.  Upstream had a number of changes that masked this problem.  I backported the following commits and now it's working like Online/Origin:

commit 2a7ca5491b59bbcbbaa7504cd0c383215b28465a
Author: Paul Morie <pmorie>
Date:   Mon Jan 27 10:26:16 2014 -0500

    Fix bug 1055653 for cases when httpd is down

commit 19e2995306bff7bea037823675f5cf279bafe880
Author: Paul Morie <pmorie>
Date:   Tue Jan 21 16:05:29 2014 -0500

    Fix bug 1055653 and improve post-receive output readability

commit 836bb408aa7fff6a7605fedf55fc0294a771e9b6
Author: Ben Parees <bparees>
Date:   Tue Dec 17 00:11:52 2013 -0500

    Bug 1033523 - The hot_deploy marker/--hot-deploy option can not take
    effect when deploying app with binary deployment

commit a568bd63147b15e71ba41734421340eeee8b2b99
Author: jhadvig <jhadvig>
Date:   Wed Dec 4 17:11:30 2013 +0100

    Bug 1038129 - Gear is not started after restore when hot_deploy marker is present

commit a96ef04aa5a69db4e3d92c4bd6a6f4324bf50bcc
Author: Jhon Honce <jhonce>
Date:   Thu Jan 16 12:19:29 2014 -0700

    Bug 1054403 - Reset empty metadata.json file
    
    * Use defaults if file is empty


However, there is still a problem. Even in Online, a blank deployment is being created on application move. When the node tries registering it with the Broker, the result is a 422 Unprocessable Entity. Upstream recently added logic that ignores this error and continues. I suspect the root cause is that this blank deployment should never be created.

Comment 7 Brenton Leanhardt 2014-02-07 16:36:50 UTC
You can disregard the commits in Comment #6. Those fixes were cloned as separate bugs. The blank deployment dir bug was solved upstream in the following commits:

commit b0add52171ce19bb66f1f644940656e511355cc8
Author: Brenton Leanhardt <bleanhar>
Date:   Mon Feb 3 13:30:07 2014 -0500

    Insure --with-initial-deployment-dir defaults to true in case the args isn't supplied.
    
    This is the handle the case of an out of date broker that doesn't pass the
    argument.  We can't have the default behavior change.

commit 5b6c8f1e177d98c0d1c52e6a76d57aaf9d2021b0
Author: Brenton Leanhardt <bleanhar>
Date:   Mon Feb 3 13:05:18 2014 -0500

    --with-initial-deployment-dir only applies to gear creation

commit 9f1e9e744a236befe66ab01bf69c1527c61d5dd0
Author: Brenton Leanhardt <bleanhar>
Date:   Wed Jan 29 16:08:18 2014 -0500

    Fixing libvirt_container to match the new create semantics

commit 26e5ad2e66670ac77d0621975562119911a0a120
Author: Brenton Leanhardt <bleanhar>
Date:   Wed Jan 29 15:22:55 2014 -0500

    Adding a unit test

commit d350fb02e3323fdf10e28db8f5c29dd8b90a6747
Author: Brenton Leanhardt <bleanhar>
Date:   Wed Jan 29 10:17:15 2014 -0500

    First pass at avoiding deployment dir create on app moves

Comment 10 Brenton Leanhardt 2014-02-10 18:32:24 UTC
In addition to Comment #7 the following upstream commit was required for the backport:

commit baeec29c1a7db0b07bf77354f5b02e35790f6156
Author: Jhon Honce <jhonce>
Date:   Wed Jan 22 15:08:19 2014 -0700

    Node Platform - Optionally generate application key

Comment 13 Brenton Leanhardt 2014-02-10 22:42:00 UTC
Looks like the "fixed in version" field has a limit:

rubygem-openshift-origin-container-selinux-0.4.1.2-1.el6op
rubygem-openshift-origin-msg-broker-mcollective-1.17.6-1.el6op
openshift-origin-msg-node-mcollective-1.17.6-1.el6op
rubygem-openshift-origin-node-1.17.5.10-1.el6op
rubygem-openshift-origin-controller-1.17.12.3-1.el6op

Comment 14 Anping Li 2014-02-11 09:23:37 UTC
Verified and passed on puddle 2014-02-10. The results are as below:
1. The snapshot can be saved/restored after move:
[ose215@dhcp-9-237 ~]$ rhc snapshot save sruby1
Pulling down a snapshot to sruby1.tar.gz...
Creating and sending tar.gz

RESULT:
Success
[ose215@dhcp-9-237 ~]$ rhc snapshot restore sruby1
Restoring from snapshot sruby1.tar.gz...
Removing old git repo: ~/git/sruby1.git/
Removing old data dir: ~/app-root/data/*
Restoring ~/git/sruby1.git and ~/app-root/data
httpd: Could not reliably determine the server's fully qualified domain name, using nd216.oseanli.cn for ServerName
Activation status: success

RESULT:
Success
2. No temporary deployment version after restore.
[sruby1-hanli1dom.oseanli.cn app-deployments]\> ls -lah
total 16K
drwxr-xr-x.  4 52f9ea824945dc480e000187 52f9ea824945dc480e000187 4.0K Feb 11 04:16 .
drwxr-x---. 14 root                     52f9ea824945dc480e000187 4.0K Feb 11  2014 ..
drwxr-x---.  5 52f9ea824945dc480e000187 52f9ea824945dc480e000187 4.0K Feb 11 04:16 2014-02-11_04-16-58.916
drwxr-xr-x.  2 52f9ea824945dc480e000187 52f9ea824945dc480e000187 4.0K Feb 11 04:16 by-id
lrwxrwxrwx.  1 52f9ea824945dc480e000187 52f9ea824945dc480e000187   23 Feb 11 04:16 current -> 2014-02-11_04-16-58.916

Comment 15 Jason DeTiberus 2014-02-12 01:35:29 UTC
Extended test failures highlighted that the gear ssh configs were being created with the wrong permissions after the previous commits. This was causing issues when scaling up (the actual failure was when copying the user environment variables to the newly created gear).

The following commits are also needed:
commit e4065bf88ed8a8798129f94cd02e36365aa467d4
Author: Jhon Honce <jhonce>
Date:   Thu Jan 23 14:04:57 2014 -0700

    Bug 1049044 - Create more of .openshift_ssh environment

commit 64369335f74aaf4cbdbfb9e163b526e4304079c5
Author: Jhon Honce <jhonce>
Date:   Thu Jan 23 11:04:06 2014 -0700

    Bug 1049044 - Restore setting ssh config settings for gear
    
    * Setting the file permissions was incorrectly removed
    * https://bugzilla.redhat.com/show_bug.cgi?id=1049044#c8

The following PR has been submitted with the fix: https://github.com/openshift/enterprise-server/pull/229
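A sketch of the kind of permission check the extended tests imply (check_ssh_perms is a hypothetical name; the expected modes, 600 for the key pair and 660 for config and known_hosts, are assumed from the verified listing in Comment 18):

```shell
# Hypothetical check that a gear's ~/.openshift_ssh files carry the
# expected permissions (assumption: 600 for id_rsa/id_rsa.pub, 660 for
# config/known_hosts, per the verified listing in Comment 18).
check_ssh_perms() {
  dir="$1"
  rc=0
  for f in id_rsa id_rsa.pub; do
    # stat -c %a prints the octal mode (GNU coreutils)
    [ "$(stat -c %a "$dir/$f")" = "600" ] || { echo "bad mode on $f"; rc=1; }
  done
  for f in config known_hosts; do
    [ "$(stat -c %a "$dir/$f")" = "660" ] || { echo "bad mode on $f"; rc=1; }
  done
  return $rc
}
```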

Comment 16 Brenton Leanhardt 2014-02-12 13:50:47 UTC
rubygem-openshift-origin-node-1.17.5.11-1.el6op has been built.

Comment 18 Anping Li 2014-02-13 05:23:44 UTC
Verified on puddle-2-0-3-2014-02-12 with the following steps:
1. Create a scaled app, sry19.
2. Check .openshift_ssh after the move:
[sry19-hanli1dom.oseanli.cn .openshift_ssh]\> ls -lah
total 16K
drwxr-x---.  2 52fc53684945dcb34300006b 52fc53684945dcb34300006b 4.0K Feb 13 00:06 .
drwxr-x---. 16 root                     52fc53684945dcb34300006b 4.0K Feb 13 00:06 ..
-rw-rw----.  1 52fc53684945dcb34300006b 52fc53684945dcb34300006b    0 Feb 13 00:06 config
-rw-------.  1 52fc53684945dcb34300006b 52fc53684945dcb34300006b 1.7K Feb 13 00:06 id_rsa
-rw-------.  1 52fc53684945dcb34300006b 52fc53684945dcb34300006b  423 Feb 13 00:06 id_rsa.pub
-rw-rw----.  1 52fc53684945dcb34300006b 52fc53684945dcb34300006b    0 Feb 13 00:06 known_hosts
3. Snapshot save and restore:
[ose215@dhcp-9-237 ~]$ rhc snapshot save sry19
Pulling down a snapshot to sry19.tar.gz...
Creating and sending tar.gz

RESULT:
Success
[ose215@dhcp-9-237 ~]$ rhc snapshot restore sry19
Restoring from snapshot sry19.tar.gz...
Removing old git repo: ~/git/sry19.git/
Removing old data dir: ~/app-root/data/*
Restoring ~/git/sry19.git and ~/app-root/data
httpd: Could not reliably determine the server's fully qualified domain name, using nd217.oseanli.cn for ServerName
Activation status: success

RESULT:
Success

4. Check .openshift_ssh after the restore:
[sry19-hanli1dom.oseanli.cn .openshift_ssh]\> ls -lah
total 16K
-rw-rw----. 52fc53684945dcb34300006b 52fc53684945dcb34300006b system_u:object_r:openshift_var_lib_t:s0:c1,c163 config
-rw-------. 52fc53684945dcb34300006b 52fc53684945dcb34300006b system_u:object_r:openshift_var_lib_t:s0:c1,c163 id_rsa
-rw-------. 52fc53684945dcb34300006b 52fc53684945dcb34300006b system_u:object_r:openshift_var_lib_t:s0:c1,c163 id_rsa.pub
-rw-rw----. 52fc53684945dcb34300006b 52fc53684945dcb34300006b system_u:object_r:openshift_var_lib_t:s0:c1,c163 known_hosts
5. Scale up the app:
[ose215@dhcp-9-237 ~]$ rhc app show sry19 --gears
ID                       State   Cartridges           Size  SSH URL
------------------------ ------- -------------------- ----- ----------------------------------------------------------------------
52fc53684945dcb34300006b started ruby-1.9 haproxy-1.4 small 52fc53684945dcb34300006b.cn
52fc55944945dcb343000095 started ruby-1.9 haproxy-1.4 small 52fc55944945dcb343000095.cn

6. SSH to the second gear:
[ose215@dhcp-9-237 ~]$ ssh 52fc55944945dcb343000095.cn
The authenticity of host '52fc55944945dcb343000095-hanli1dom.oseanli.cn (10.66.78.216)' can't be established.
RSA key fingerprint is a3:ba:c0:09:5f:0f:13:50:8e:1e:2e:95:7f:66:ae:7c.
Are you sure you want to continue connecting (yes/no)? yes

Comment 20 errata-xmlrpc 2014-02-25 15:43:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0209.html

