Bug 1124416

Summary: Could not scale a rails application
Product: OpenShift Online Reporter: Hugo Amorim <hugoamorimlyra>
Component: ImageAssignee: Maciej Szulik <maszulik>
Status: CLOSED CANTFIX QA Contact: libra bugs <libra-bugs>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 2.xCC: bparees, decarr, dmcphers, hugoamorimlyra, jokerman, mmccomas, ringrum
Target Milestone: ---Keywords: UpcomingRelease
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-06-12 01:38:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Hugo Amorim 2014-07-29 12:48:35 UTC
Description of problem:

Could not scale a rails application keep return this error:

DEBUG: Response did not include a message from server: invalid character at "<html><bod" Server returned an unexpected error code: 504

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. rhc app scale-up -a {app} -d

Actual results:

DEBUG: Request POST https://openshift.redhat.com/broker/rest/application/{{app_token}}/events
DEBUG:    code 504 300246 ms
DEBUG: Response did not include a message from server: invalid character at "<html><bod"
Server returned an unexpected error code: 504

Expected results:


Additional info:

Comment 1 Derek Carr 2014-07-29 15:47:51 UTC
I tried to reproduce the described scenario as follows on devenv_5022:

Steps to Reproduce:

1. rhc setup --server https://localhost/broker/rest/api
2. rhc create-app myapp ruby-2.0 -s
3. rhc app scale-up -a myapp -d

Actual results:

[root@ip-10-63-41-92 ~]# rhc app scale-up -a myapp -d
DEBUG: Using config file /root/.openshift/express.conf
DEBUG: Git config 'git config --get rhc.app-id' returned ''
DEBUG: Git config 'git config --get rhc.app-name' returned ''
DEBUG: Git config 'git config --get rhc.domain-name' returned ''
DEBUG: Authenticating with RHC::Auth::Token
DEBUG: Connecting to https://localhost/broker/rest/api
DEBUG: Getting all domains
DEBUG: Client supports API versions 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7
DEBUG: Using token authentication
DEBUG: Created new httpclient
DEBUG: Request GET https://localhost/broker/rest/api
DEBUG: SSL Verification failed -- Using self signed cert
DEBUG:    code 200   24 ms
DEBUG: Server supports API versions 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7
DEBUG:    Using API version 1.7
DEBUG: Client API version 1.7 is not current. Refetching API
DEBUG: Using token authentication
DEBUG: Request GET https://localhost/broker/rest/api
DEBUG:    code 200   13 ms
DEBUG: Using token authentication
DEBUG: Request GET https://localhost/broker/rest/domains
DEBUG:    code 200   18 ms
DEBUG: Using token authentication
DEBUG: Request GET https://localhost/broker/rest/domain/temp/application/myapp
DEBUG:    code 200   85 ms
DEBUG: Using token authentication
DEBUG: Request POST https://localhost/broker/rest/application/53d7f975676b7cd98e000029/events
DEBUG:    code 200 20627 ms
RESULT:
myapp scaled up

Comment 2 Hugo Amorim 2014-07-29 15:55:02 UTC
Derek, i removed the compiled assets process in deployment and this steps works like you show. But when I testing the autoscale with load test i get this error:

I, [2014-07-29T11:22:25.726508 #173588]  INFO -- : Starting haproxy_ctld
I, [2014-07-29T11:22:25.748485 #173588]  INFO -- : add-gear - capacity: 462.5% gear_count: 1 sessions: 74 up_thresh: 90.0%
I, [2014-07-29T11:27:10.972963 #173588]  INFO -- : add-gear - exit_code: 1  output: The add-gear request failed with http_code: 500

Comment 3 Hugo Amorim 2014-07-29 16:50:44 UTC
I'm try with jenkins the same error occurs with auto-scale:

I, [2014-07-29T12:42:29.305580 #262170]  INFO -- : add-gear - capacity: 112.5% gear_count: 1 sessions: 18 up_thresh: 90.0%
I, [2014-07-29T12:48:05.932454 #262170]  INFO -- : add-gear - exit_code: 1  output: The add-gear request failed with http_code: 504

Comment 4 Hugo Amorim 2014-07-29 18:02:15 UTC
Manual scale works!

Autoscaling with traffic keep generate this error:

add-gear - capacity: 100.0% gear_count: 1 sessions: 16 up_thresh: 90.0%
I, [2014-07-29T13:58:54.944096 #426382]  INFO -- : add-gear - exit_code: 1  output: The add-gear request failed with http_code: 504

Comment 5 Derek Carr 2014-07-29 20:12:41 UTC
Can you provide the following information so we can debug further:

1) user login (not password)
2) domain
3) application name attempting to scale

Thanks,

Comment 6 Lili Nader 2014-07-29 21:00:00 UTC
It looks like you have set both max and min on this app/cartridge to 2.  That means you cannot scale up or down.  What is your intention behind this setting?

To change the settings you can run

rhc cartridge scale ruby-1.9 --min <integer> --max <integer> -a saro

Comment 7 Hugo Amorim 2014-07-29 21:42:25 UTC
Lili, I really set to 2! Until I dont figure out what happens with auto scale.


(In reply to Lili Nader from comment #6)
> It looks like you have set both max and min on this app/cartridge to 2. 
> That means you cannot scale up or down.  What is your intention behind this
> setting?
> 
> To change the settings you can run
> 
> rhc cartridge scale ruby-1.9 --min <integer> --max <integer> -a saro

Comment 8 Hugo Amorim 2014-07-29 21:43:21 UTC
hugoamorimlyra
www.sistemasaro.com.br
sistemasaro

(In reply to Derek Carr from comment #5)
> Can you provide the following information so we can debug further:
> 
> 1) user login (not password)
> 2) domain
> 3) application name attempting to scale
> 
> Thanks,

Comment 9 Lili Nader 2014-07-30 17:34:09 UTC
Hi Hugo, by setting min and max to the same number, you have essentially disabled autoscaling.  Because autoscaling cannot scale-up above 2 gears and it cannot scale-down below 2 gears.  

Please change max to be greater than min value.

Comment 10 Hugo Amorim 2014-07-30 17:43:33 UTC
Lili, I understand that. But when i changed to 1-16 a example and started a load test the autoscale see the traffic and called a "action" to start another Gear, but that gear don't succeed in this action, keep return this errors:

add-gear - capacity: 100.0% gear_count: 1 sessions: 16 up_thresh: 90.0%
I, [2014-07-29T13:58:54.944096 #426382]  INFO -- : add-gear - exit_code: 1  output: The add-gear request failed with http_code: 504

I, [2014-07-29T11:22:25.726508 #173588]  INFO -- : Starting haproxy_ctld
I, [2014-07-29T11:22:25.748485 #173588]  INFO -- : add-gear - capacity: 462.5% gear_count: 1 sessions: 74 up_thresh: 90.0%
I, [2014-07-29T11:27:10.972963 #173588]  INFO -- : add-gear - exit_code: 1  output: The add-gear request failed with http_code: 500

Comment 11 Lili Nader 2014-07-31 04:51:48 UTC
Hugo, where are you getting these log entries?  When I look at our system logs I see

A success message on scale up

INFO src=useraction RESULT=SUCCESS STATUS=ok TIMESTAMP=1406654982 DATE=2014-07-29 TIME=13:29:42 ACTION=SCALE_UP_APPLICATION REQ_ID=15fce493774a8b2ea9758d6444c4b74f USER_ID=53d401b65973cac8de00018d LOGIN=hugoamorimlyra APP_UUID=53d7d1dae0b8cd1aed0003f9 DOMAIN=manguebits Application saro has scaled to 3

A failure of scale up due to time out which is ok because it will try again

INFO src=useraction RESULT=FAILURE STATUS=internal_server_error TIMESTAMP=1406656774 DATE=2014-07-29 TIME=13:59:34 ACTION=SCALE_UP_APPLICATION REQ_ID=9080f46c18a8ea1c6e4b515b1c3fb433 USER_ID=53d401b65973cac8de00018d LOGIN=hugoamorimlyra APP_UUID=53d7d1dae0b8cd1aed0003f9 DOMAIN=manguebits Unable to complete the requested operation due to: The server ex-std-node19.prod.rhcloud.com that your application is running on failed to respond in time.  This may be due to a system restart.


And a failure to scale down due to min=2

INFO src=useraction RESULT=SUCCESS STATUS=unprocessable_entity TIMESTAMP=1406657441 DATE=2014-07-29 TIME=14:10:41 ACTION=SCALE_DOWN_APPLICATION REQ_ID=32f12cece39302888ef1c98f1486e3a1 USER_ID=53d401b65973cac8de00018d LOGIN=hugoamorimlyra APP_UUID=53d7d1dae0b8cd1aed0003f9 DOMAIN=manguebits Cannot scale down below gear limit of 2.

Comment 12 Hugo Amorim 2014-07-31 11:39:49 UTC
I'm see this logs on haproxy_ctld.log!

I configure the application to scale to 1-3, now the application has one gear, and started a load test. Nothing again the scale see the traffic but nothing to successfully scale up. Again the log in haproxy_ctld.log:

I, [2014-07-31T07:21:13.706729 #254681]  INFO -- : Starting haproxy_ctld
I, [2014-07-31T07:24:59.022322 #254681]  INFO -- : add-gear - capacity: 112.5% gear_count: 1 sessions: 18 up_thresh: 90.0%
I, [2014-07-31T07:25:38.681584 #254681]  INFO -- : add-gear - exit_code: 1  output: The add-gear request failed with http_code: 502

The status of my application for almost 30 minutes was http://cubeupload.com/im/lwwirh.png and nothing to scale.

I will keep this settings 1-3, please try to make some request and see if you successfully scale the application.

I appreciate your help and attention.


(In reply to Lili Nader from comment #11)
> Hugo, where are you getting these log entries?  When I look at our system
> logs I see
> 
> A success message on scale up
> 
> INFO src=useraction RESULT=SUCCESS STATUS=ok TIMESTAMP=1406654982
> DATE=2014-07-29 TIME=13:29:42 ACTION=SCALE_UP_APPLICATION
> REQ_ID=15fce493774a8b2ea9758d6444c4b74f USER_ID=53d401b65973cac8de00018d
> LOGIN=hugoamorimlyra APP_UUID=53d7d1dae0b8cd1aed0003f9
> DOMAIN=manguebits Application saro has scaled to 3
> 
> A failure of scale up due to time out which is ok because it will try again
> 
> INFO src=useraction RESULT=FAILURE STATUS=internal_server_error
> TIMESTAMP=1406656774 DATE=2014-07-29 TIME=13:59:34
> ACTION=SCALE_UP_APPLICATION REQ_ID=9080f46c18a8ea1c6e4b515b1c3fb433
> USER_ID=53d401b65973cac8de00018d LOGIN=hugoamorimlyra
> APP_UUID=53d7d1dae0b8cd1aed0003f9 DOMAIN=manguebits Unable to complete the
> requested operation due to: The server ex-std-node19.prod.rhcloud.com that
> your application is running on failed to respond in time.  This may be due
> to a system restart.
> 
> 
> And a failure to scale down due to min=2
> 
> INFO src=useraction RESULT=SUCCESS STATUS=unprocessable_entity
> TIMESTAMP=1406657441 DATE=2014-07-29 TIME=14:10:41
> ACTION=SCALE_DOWN_APPLICATION REQ_ID=32f12cece39302888ef1c98f1486e3a1
> USER_ID=53d401b65973cac8de00018d LOGIN=hugoamorimlyra
> APP_UUID=53d7d1dae0b8cd1aed0003f9 DOMAIN=manguebits Cannot scale down below
> gear limit of 2.

Comment 13 Hugo Amorim 2014-07-31 11:49:58 UTC
I keep the load test for a little while:

I, [2014-07-31T07:24:59.022322 #254681]  INFO -- : add-gear - capacity: 112.5% gear_count: 1 sessions: 18 up_thresh: 90.0%
I, [2014-07-31T07:25:38.681584 #254681]  INFO -- : add-gear - exit_code: 1  output: The add-gear request failed with http_code: 502
I, [2014-07-31T07:35:39.770020 #254681]  INFO -- : add-gear - capacity: 518.75% gear_count: 1 sessions: 83 up_thresh: 90.0%
I, [2014-07-31T07:37:22.708368 #254681]  INFO -- : add-gear - exit_code: 1  output: The add-gear request failed with http_code: 500
I, [2014-07-31T07:47:24.147964 #254681]  INFO -- : add-gear - capacity: 506.25% gear_count: 1 sessions: 81 up_thresh: 90.0%
I, [2014-07-31T07:48:54.480351 #254681]  INFO -- : add-gear - exit_code: 1  output: The add-gear request failed with http_code: 500

He keep trying to scale!

Comment 14 Hugo Amorim 2014-07-31 12:07:39 UTC
In the last try:

I, [2014-07-31T07:24:59.022322 #254681]  INFO -- : add-gear - capacity: 112.5% gear_count: 1 sessions: 18 up_thresh: 90.0%
I, [2014-07-31T07:25:38.681584 #254681]  INFO -- : add-gear - exit_code: 1  output: The add-gear request failed with http_code: 502
I, [2014-07-31T07:35:39.770020 #254681]  INFO -- : add-gear - capacity: 518.75% gear_count: 1 sessions: 83 up_thresh: 90.0%
I, [2014-07-31T07:37:22.708368 #254681]  INFO -- : add-gear - exit_code: 1  output: The add-gear request failed with http_code: 500
I, [2014-07-31T07:47:24.147964 #254681]  INFO -- : add-gear - capacity: 506.25% gear_count: 1 sessions: 81 up_thresh: 90.0%
I, [2014-07-31T07:48:54.480351 #254681]  INFO -- : add-gear - exit_code: 1  output: The add-gear request failed with http_code: 500
I, [2014-07-31T07:58:55.849197 #254681]  INFO -- : add-gear - capacity: 506.25% gear_count: 1 sessions: 81 up_thresh: 90.0%
I, [2014-07-31T08:04:29.522432 #254681]  INFO -- : add-gear - exit_code: 1  output: The add-gear request failed with http_code: 504


He change the code and a worst thing happens he make me available gears down for a little time!

Comment 15 Lili Nader 2014-07-31 18:14:51 UTC
Hugo, thanks this very helpful!  I think it indicates that the haproxy cartridge itself is having problems adding a new gear.

Anyway, you can share with us you code and load test so we can recreate this issue is a dev environment?  If not that is ok too

Comment 16 Hugo Amorim 2014-07-31 18:41:18 UTC
Lili for load test i'm using ApacheBench with this simple command:

ab -n 10000 -c 40 https://www.sistemasaro.com.br/

The code I dont have permission to share. Sorry.

Some info about the stack may help:

Rails 4
Delayed Job
Postgres

-The only thing i change on openshift config was:

-Add marker disable_asset_compilation 

-Add on post_start_ruby 
cd $OPENSHIFT_REPO_DIR
RAILS_ENV=production script/delayed_job start &

-Add on  pre_stop_ruby
cd $OPENSHIFT_REPO_DIR
RAILS_ENV=production script/delayed_job stop &

Comment 17 Derek Carr 2014-08-08 17:08:03 UTC
It appears that the haproxy cartridge itself is having the problem adding the gear.

Comment 18 Ben Parees 2014-08-08 17:32:07 UTC
Maciej, please take a look and see if this is a cartridge issue (problem creating the additional gear).  (See if you can recreate it first)

Comment 19 Maciej Szulik 2014-08-15 20:04:28 UTC
Hugo, can you please provide us with more information regarding your setup? From previous comments I remember you can't provide us with the code, but can you give more or less information what kind of calls are being made on the main page? Which cartridges exactly you're using, as well as what's inside your job? Did you try to test your application without that job? The reason for me asking all this is that I've spent last two days testing various combination of ruby carts, db calls and scaling and unfortunately wasn't able to reproduce the problem.
Thanks,

Comment 20 Hugo Amorim 2014-09-01 12:56:55 UTC
Cartridges:
 Ruby 1.9
 Web Load Balancer
 PostgreSQL 9.2

The jobs is only a delaydjob work to send email with mandrill api.
The application is a simple crud application only http request.

Comment 21 Robert Ingrum 2015-05-15 20:49:21 UTC
Has there been a resolution to this?  I'm getting the same issue on a Django application.

Comment 22 Ben Parees 2015-05-15 21:14:41 UTC
Robert, if fails when autoscaling, or manually scaling?

Can you tell us more about your app?  any other cartridges besides python?

Comment 23 Robert Ingrum 2015-05-15 21:18:31 UTC
Of course.  It fails both when autoscaling, and when manually scaling.

One gear has Python 2.7, Web Load Balancer, and Cron 1.4.  The other has PostgreSQL 9.2.  I'm trying to scale up the Python gear.

Here is a copy of the console log when I try to scale up using rhc:

C:\Users\Robert>rhc cartridge-scale python --min 2 -a production -d
DEBUG: Using config file C:/Users/Robert/.openshift/express.conf
DEBUG: Git config 'git config --get rhc.app-id' returned ''
DEBUG: Git config 'git config --get rhc.app-name' returned ''
DEBUG: Git config 'git config --get rhc.domain-name' returned ''
DEBUG: Authenticating with RHC::Auth::Token
DEBUG: Connecting to https://openshift.redhat.com/broker/rest/api
DEBUG: Getting all domains
DEBUG: Client supports API versions 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7
DEBUG: Created new httpclient
DEBUG: Request GET https://openshift.redhat.com/broker/rest/api
DEBUG:    code 200 1880 ms
DEBUG: Server supports API versions 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7
DEBUG:    Using API version 1.7
DEBUG: Client API version 1.7 is not current. Refetching API
DEBUG: Request GET https://openshift.redhat.com/broker/rest/api
DEBUG:    code 200  112 ms
DEBUG: Using token authentication
DEBUG: Request GET https://openshift.redhat.com/broker/rest/domains
DEBUG:    code 200  220 ms
DEBUG: Using token authentication
DEBUG: Request GET https://openshift.redhat.com/broker/rest/domain/purplewall/application/production?include=cartridges
DEBUG:    code 200  352 ms
Using python-2.7 (Python 2.7) for 'python'
This operation will run until the application is at the minimum scale and may take several minutes.
Setting scale range for python-2.7 ... DEBUG: Setting scales = scales_from: 2
DEBUG: Using token authentication
DEBUG: Request PUT https://openshift.redhat.com/broker/rest/application/52f2c6d7e0b8cd42a200001a/cartridge/python-2.7
DEBUG:    code 500 189132 ms

Unable to complete the requested operation due to: An invalid exit code (1) was returned from the server ex-std-node506.prod.rhcloud.com.  This indicates an unexpected problem during the execution of your request.
Reference ID: 7264cb26da5a89ce552d6ce6fa334687

Comment 24 Ben Parees 2015-05-15 21:24:39 UTC
That looks a bit different than the original issue here.  That sounds like something our operations team may need to look into.  Can you open a new bug and paste in your last comment? You can assign it to me initially and i'll get the right team to look at it.

thanks.

Comment 25 Robert Ingrum 2015-05-15 21:25:35 UTC
Will do, thank you for the quick response.