Bug 962801 - "git push origin" dies while syncing with leaf gears (scaled Rails app)
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Mrunal Patel
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-05-14 13:07 UTC by Boris Mironov
Modified: 2015-05-14 23:18 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-06 18:27:36 UTC
Target Upstream Version:
Embargoed:


Attachments:
Gemfile before change (1.45 KB, text/plain)
2013-05-14 13:08 UTC, Boris Mironov
Gemfile.lock before change (4.31 KB, text/plain)
2013-05-14 13:09 UTC, Boris Mironov
Gemfile after change (1.46 KB, text/plain)
2013-05-14 13:09 UTC, Boris Mironov
Gemfile.lock after change (4.32 KB, text/plain)
2013-05-14 13:10 UTC, Boris Mironov

Description Boris Mironov 2013-05-14 13:07:51 UTC
Description of problem:
"git push origin" dies while pushing changes to leaf gears of scaled Rails app.

Version-Release number of selected component (if applicable):
www.openshift.com as of May 13, 2013

How reproducible:
Very often, especially when new gems or new gem versions are added to the Gemfile

Steps to Reproduce:
1. Create a scaled Rails app on openshift.com (preferably with small gears)
2. Replace Gemfile and Gemfile.lock in your local clone with the attached "after change" versions
3. git push origin
  
Actual results:
Deployment dies with the following output:

--------------------snip ------------------------
remote: Your bundle is complete! It was installed into ./vendor/bundle
remote: Precompiling with 'bundle exec rake assets:precompile'
remote: Running .openshift/action_hooks/build
remote: Running .openshift/action_hooks/build
remote: MySQL already running
remote: Running .openshift/action_hooks/deploy
remote: Database server found at 518ad7bb4382ec74f8000005-ag47.rhcloud.com. initializing...
remote: SSH_CMD: ssh 518c6b5f5973caa19000003d.70.78
remote: SSH_CMD: ssh 518c6b5f5973caa19000003e.147.104
Read from remote host rails-ag47.rhcloud.com: Connection reset by peer
fatal: The remote end hung up unexpectedly
error: error in sideband demultiplexer
To ssh://518ad7bb4382ec74f8000002.com/~/git/rails.git/
   fdf3694..cab3939  master -> master
error: failed to push some refs to 'ssh://518ad7bb4382ec74f8000002.com/~/git/rails.git/'


My ~/.ssh/config:
Host rails-ag47.rhcloud.com
	User 518ad7bb4382ec74f8000002
	IdentityFile ~/.ssh/id_rsa.OpenShift_AG47
	ServerAliveInterval 180
	ServerAliveCountMax 10
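
For reference, a minimal sketch of what the two ServerAlive* settings above mean on the client side, going by the ssh_config(5) semantics: the client sends a probe after ServerAliveInterval seconds without data from the server, and disconnects after ServerAliveCountMax consecutive unanswered probes. The function name is made up for illustration; this is arithmetic only and does not touch any real connection.

```python
def keepalive_window(interval_s: int, count_max: int) -> dict:
    """Summarize client-side keepalive timing for two ssh_config values.

    interval_s maps to ServerAliveInterval, count_max to ServerAliveCountMax.
    An interval of 0 means the client sends no probes at all
    (that is the OpenSSH default).
    """
    if interval_s < 0 or count_max < 0:
        raise ValueError("values must be non-negative")
    if interval_s == 0:
        return {
            "probes_enabled": False,
            "first_probe_s": None,
            "disconnect_after_s": None,
        }
    return {
        "probes_enabled": True,
        # The first probe goes out after this many seconds of silence.
        "first_probe_s": interval_s,
        # Approximate time until the client gives up on a dead server.
        "disconnect_after_s": interval_s * count_max,
    }


print(keepalive_window(180, 10))
# {'probes_enabled': True, 'first_probe_s': 180, 'disconnect_after_s': 1800}
```

With these values the client stays silent for up to 180 seconds before its first probe, so a network device that drops connections idle for less than that would presumably not be helped by this configuration.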




Expected results:
Successful deployment of application

Additional info:

I believe the issue is the SSH configuration on the head gear. I do not have access to ~/.ssh/config there, hence I cannot be sure.

When I log in to the head gear after a failure and execute "rsync ruby-1.9/repo/ <my leaf gear>:ruby-1.9/repo/", there are still some files that are not synchronized between gears. Since my vendor/bundle is bigger than 100 MB, it takes a significant amount of time to compress its content, ship it, and decompress it on the leaf gear. In this case rsync to both leaf gears is started in parallel, so 2 gzip sessions are compressing 100 MB at the same time on the head gear.

Comment 1 Boris Mironov 2013-05-14 13:08:42 UTC
Created attachment 747694 [details]
Gemfile before change

Comment 2 Boris Mironov 2013-05-14 13:09:14 UTC
Created attachment 747695 [details]
Gemfile.lock before change

Comment 3 Boris Mironov 2013-05-14 13:09:37 UTC
Created attachment 747696 [details]
Gemfile after change

Comment 4 Boris Mironov 2013-05-14 13:10:02 UTC
Created attachment 747697 [details]
Gemfile.lock after change

Comment 5 Mrunal Patel 2013-05-24 18:26:58 UTC
Did you modify the post-receive / pre-receive hooks on the gear?
Another possibility is a corrupted repository:
https://help.github.com/articles/fixing-egit-corruption

Comment 6 Boris Mironov 2013-05-24 19:50:53 UTC
Hi Mrunal,

1) No, I did not modify those hooks.

2) My repository seems to be OK (not corrupted) because I can use "push" and "pull" from my desktop.



I believe the issue is caused by idle periods between gears, when there is no SSH traffic between them for longer than the default interval. Hence, the solution could be as simple as adding 2 lines to ~/.ssh/config on the head gear for communication with all leaf gears:

ServerAliveInterval 180
ServerAliveCountMax 10

Of course, the exact values should be agreed with the OS admins. Unfortunately, I do not have access to this file and cannot prove or disprove the theory.
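
A sketch of the kind of stanza meant here, rendered for a list of leaf gear hosts. The host names are placeholders and the helper name is hypothetical; the real gear addresses, and how OpenShift manages ~/.ssh/config on the head gear, are assumptions that cannot be checked from outside.

```python
def render_keepalive_stanza(hosts, interval_s=180, count_max=10):
    """Return ssh_config text that enables client keepalives for hosts."""
    if not hosts:
        raise ValueError("at least one host pattern is required")
    lines = [
        # A Host line may list several whitespace-separated patterns.
        "Host " + " ".join(hosts),
        f"\tServerAliveInterval {interval_s}",
        f"\tServerAliveCountMax {count_max}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder host patterns, not real gear addresses.
print(render_keepalive_stanza(["leaf-gear-1.example", "leaf-gear-2.example"]), end="")
```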

Thanks,
Boris

Comment 7 Mrunal Patel 2013-05-24 23:00:39 UTC
I modified the delayed_job calls to be nohup'ed and git push worked fine for the app. Could you git clone from the app and try a push again?

Comment 8 Boris Mironov 2013-05-25 03:57:47 UTC
I did it without success. See my report in bug #962807

Thanks,
Boris

Comment 9 Mrunal Patel 2013-05-25 16:49:45 UTC
Boris,
It looks like git push itself is failing because of a timeout. Could you take out 
ServerAliveInterval and ServerAliveCountMax or set them to a much higher value?

I don't have those settings and git push worked fine for me. It did take ~7min
for it to complete - much higher than 180 seconds.

remote:     https://www.openshift.com/legal
remote: 
remote:     *********************************************************************
remote: 
remote:     Welcome to OpenShift shell
remote: 
remote:     This shell will assist you in managing OpenShift applications.
remote: 
remote:     !!! IMPORTANT !!! IMPORTANT !!! IMPORTANT !!!
remote:     Shell access is quite powerful and it is possible for you to
remote:     accidentally damage your application.  Proceed with care!
remote:     If worse comes to worst, destroy your application with 'rhc app delete'
remote:     and recreate it
remote:     !!! IMPORTANT !!! IMPORTANT !!! IMPORTANT !!!
remote: 
remote:     Type "help" for more info.
remote: 
remote: Starting services
remote: WARNING: This ssh terminal was started without a tty.
remote:           It is highly recommended to login with: ssh -t
remote: Done
remote: + for rpccall in '"${OPENSHIFT_SYNC_GEARS_POST[@]}"'
remote: + ssh 518c6b5f5973caa19000003e.147.104 post_deploy.sh
remote: Running .openshift/action_hooks/post_deploy
remote: Exit code: 0
remote: hot_deploy_added=false
remote: Done
remote: Running .openshift/action_hooks/post_deploy
To ssh://518ad7bb4382ec74f8000002.com/~/git/rails.git/
   dc83e85..a75ad63  master -> master

real	7m8.676s
user	0m0.033s
sys	0m0.039s
[mrunal@localhost rails]$ 
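
To put the timing above next to the 180-second interval, here is a small sketch that converts the "real" value printed by the shell's time builtin into seconds. The parser only handles the NmS.SSSs form shown in this output, and the function name is made up.

```python
import re


def parse_real_time(value: str) -> float:
    """Convert a bash `time` duration such as '7m8.676s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


elapsed = parse_real_time("7m8.676s")
# Print the elapsed seconds and whether they exceed the 180 s interval.
print(round(elapsed, 3), elapsed > 180)
```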



Thanks,
Mrunal

Comment 10 Boris Mironov 2013-05-26 05:00:55 UTC
Hi Mrunal,

Thanks for the update. I have updated my SSH configuration (~/.ssh/config) to:

	ServerAliveInterval 60
	ServerAliveCountMax 15

and tested it twice. Both "git push origin" were successful.

Let me check it a few more times from different locations (network configurations, e.g. wired, wireless, ...).

I also found that the yui-compressor gem requires a Java JDK. This could cause a slowdown on the OpenShift side when it starts on a small gear that is short on memory.


Thanks again,
Boris

Comment 11 Boris Mironov 2013-05-26 18:17:44 UTC
Hi Mrunal,

I tested "git push" 2 more times (even over a cellular network by pairing with an iPhone). Both times it was successful.

So far I have tried it 4 times with a 100% success rate.

Please do not close the bug for another week, so I can try it even more times.

Thanks,
Boris

Comment 12 Boris Mironov 2013-05-29 16:08:55 UTC
Hi Mrunal,

Could you please check the same deployment (as in Comment #9) against a scaled Rails application deployed on 2 small gears? rails-ag47.rhcloud.com uses 3 medium gears.

Very often "git push" was dying during the "rake assets:precompile" phase. This is when the Rails application gets all static assets (JS, CSS) minified, compiled and compressed. In the background, the 'yui-compressor' gem starts a Java JDK to carry out some tasks. On a small gear with 512 MB of RAM this can take a significant amount of time. During this phase there is no communication between my desktop and the OpenShift gear, which is why the SSH channel would time out after 3 minutes (default configuration). Setting the ServerAlive* parameters rectifies it.

Another Achilles' heel of the deployment is the syncing phase between head and leaf gears. Here the application code gets pushed by "rsync" between gears. If it is a new deployment (or a significant change in the Gemfile) with a number of gems, it can take a significant amount of time. For example, in my application there is more than 100 MB of gem files in the "vendor/bundle/ruby/1.9.1/gems/" directory. The current "rsync" is configured to compress all data between gears, hence it has to compress 100 MB of files, ship them over to another gear and decompress them there. All of this is done via SSH, which adds encryption and decryption of the network data. If I have a 3-gear application, this task is done in 2 parallel streams, one for each leaf gear. Needless to say, it can take more than 3 minutes, and then the SSH channel between gears will time out.

To make a long story short: would you consider adding keepalive pings to the default SSH gear configuration? It can be done in the SSH server configuration or in the SSH client configuration (~/.ssh/config). I would prefer the second one.

Thanks,
Boris

Comment 13 Mrunal Patel 2013-06-17 23:27:25 UTC
Some of the enhancements that we are making to the platform will handle these cases better. Lowering severity as this isn't considered a blocker for the next release.

Comment 14 Xiaoli Tian 2013-06-19 09:29:03 UTC
Moving it back to assigned status since there are still some enhancements pending.

Comment 15 Mrunal Patel 2013-09-04 19:40:30 UTC
Boris,
Are you seeing any issues with your application?

Thanks,
Mrunal

Comment 16 Mrunal Patel 2013-09-06 18:27:36 UTC
I am going to close the bug for now. Please re-open or open a new bug if you see the issue again.

