Bug 836973 - Sometimes jbossas-7 auto scaling up doesn't work fine
Status: CLOSED CURRENTRELEASE
Product: OpenShift Origin
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified  OS: Unspecified
Priority: medium  Severity: medium
Assigned To: Rajat Chopra
QA Contact: libra bugs
Keywords: Triaged
Reported: 2012-07-02 07:17 EDT by jizhao
Modified: 2015-05-14 18:56 EDT (History)
3 users (show)

Doc Type: Bug Fix
Last Closed: 2012-07-13 19:43:22 EDT
Type: Bug


Attachments
haproxy status screenshot (71.36 KB, image/png)
2012-07-10 08:18 EDT, jizhao
development.log during auto scaling up (26.42 KB, text/plain)
2012-07-11 04:46 EDT, jizhao
screenshot of haproxy-status (127.52 KB, image/png)
2012-07-11 09:50 EDT, jizhao

Description jizhao 2012-07-02 07:17:18 EDT
Description of problem:
Create a jbossas-7 scalable app and use "ab" to generate load that triggers auto scaling up. Sometimes the scale-up does not work correctly.

Version-Release number of selected component (if applicable):
devenv_1869

How reproducible:
Sometimes

Steps to Reproduce:
1. Create a scalable jbossas-7 app:
$ rhc app create -a app1 -t jbossas-7 -s -d
2. Create "index.jsp" under <git_repo>/src/main/webapp/ and remove "index.html":
$ rm <git_repo>/src/main/webapp/index.html
==> <git_repo>/src/main/webapp/index.jsp <==
<%@ page contentType="text/html" language="java" import="java.sql.*" %>
<%@ page import="javax.naming.*" %>
<%@ page import="javax.sql.*" %>
<%@ page import="java.util.*" %>
<%
Map map = System.getenv();
out.print(map.get("OPENSHIFT_GEAR_DNS"));
%>
3. Use "ab" to generate load:
$ ab -c 200 -n 100000 <app_url>
4. Tail "scale_events.log":
$ rhc app tail -a app1 -f app1/logs/scale_events.log -o '-n +1'

Use OpenShift shell to check the app's processes.
$ ps -ef

Check <app_url>/haproxy-status/. 

  
Actual results:
Sometimes the gear-up is retried several times.
==> scale_events.log <==
I, [2012-07-02T06:32:03.437359 #14154]  INFO -- : GEAR_UP - capacity: 630.0% gear_count: 1 sessions: 63 up_thresh: 90.0%
D, [2012-07-02T06:34:16.238530 #14154] DEBUG -- : GEAR_INFO - capacity: 0.0% gear_count: 1 sessions: 0 up/remove_thresh: 90.0%/49.9% sec_left_til_remove: 0 gear_remove_thresh: 0/20
I, [2012-07-02T06:34:22.271034 #14154]  INFO -- : GEAR_UP - capacity: 1670.0% gear_count: 1 sessions: 167 up_thresh: 90.0%
D, [2012-07-02T06:36:27.662974 #14154] DEBUG -- : GEAR_INFO - capacity: 1390.0% gear_count: 1 sessions: 139 up/remove_thresh: 90.0%/49.9% sec_left_til_remove: 0 gear_remove_thresh: 0/20
I, [2012-07-02T06:36:30.686516 #14154]  INFO -- : GEAR_UP - capacity: 1020.0% gear_count: 1 sessions: 102 up_thresh: 90.0%
I, [2012-07-02T06:37:53.988507 #3249]  INFO -- : Starting haproxy_ctld
D, [2012-07-02T06:37:54.009961 #3249] DEBUG -- : GEAR_INFO - capacity: 0.0% gear_count: 2 sessions: 0 up/remove_thresh: 90.0%/49.9% sec_left_til_remove: 120 gear_remove_thresh: 0/20
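The GEAR_UP/GEAR_INFO lines above suggest haproxy_ctld computes capacity as sessions relative to a per-gear session budget (63 sessions on 1 gear reads as 630%, i.e. roughly 10 sessions per gear) and triggers GEAR_UP when capacity crosses up_thresh. A minimal Ruby sketch of that decision, inferred from the log format only; the 10-sessions-per-gear unit and the method names are assumptions, not the actual haproxy_ctld source:

```ruby
# Hypothetical reconstruction of the scaling decision seen in scale_events.log.
# MAX_SESSIONS_PER_GEAR is inferred from "sessions: 63" => "capacity: 630.0%".
MAX_SESSIONS_PER_GEAR = 10.0
UP_THRESHOLD          = 90.0   # up_thresh in the log, percent
REMOVE_THRESHOLD      = 49.9   # remove_thresh in the log, percent

def capacity_pct(sessions, gear_count)
  (sessions / (gear_count * MAX_SESSIONS_PER_GEAR)) * 100.0
end

def scaling_action(sessions, gear_count)
  cap = capacity_pct(sessions, gear_count)
  if cap > UP_THRESHOLD
    :gear_up
  elsif cap < REMOVE_THRESHOLD && gear_count > 1
    :gear_down_candidate   # real daemon also waits sec_left_til_remove
  else
    :steady
  end
end

# 63 sessions on 1 gear => 630.0%, matching the first GEAR_UP line above
puts scaling_action(63, 1)   # => gear_up
```

Under load this flaps: each GEAR_UP attempt that fails (or times out) leaves gear_count at 1 while sessions stay high, so the threshold fires again on the next sample, which matches the repeated GEAR_UP lines in the log.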

On the haproxy-status page, during the gear-up operation the default gear went UP and DOWN several times. After the gear-up finished, both gears kept going UP and DOWN.

In the output of ps -ef, sometimes /usr/libexec/stickshift/cartridges/embedded/haproxy-1.4/info/bin/add-gear appears and disappears again and again, and sometimes three "/usr/libexec/stickshift/cartridges/jbossas-7/info/bin/sync_gears.sh" processes exist at the same time.
<----------output------------->
500       1087  1072  0 06:36 ?        00:00:00 /bin/bash /usr/libexec/stickshift/cartridges/jbossas-7/info/bin/sync_gears.sh e9853accdc1b4220bf6b0475d594723a@10.10.65.181
500       1095  1087  0 06:36 ?        00:00:00 /bin/bash /usr/libexec/stickshift/cartridges/jbossas-7/info/bin/sync_gears.sh e9853accdc1b4220bf6b0475d594723a@10.10.65.181
500       1096  1095  0 06:36 ?        00:00:00 /bin/bash /usr/libexec/stickshift/cartridges/jbossas-7/info/bin/sync_gears.sh e9853accdc1b4220bf6b0475d594723a@10.10.65.181
500       1365  1096  0 06:36 ?        00:00:00 /bin/bash /usr/libexec/stickshift/cartridges/embedded/haproxy-1.4/info/bin/ssh e9853accdc1b4220bf6b0475d594723a@10.10.65.18
500       1366  1365  0 06:36 ?        00:00:00 /usr/bin/ssh -q -o BatchMode=yes -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/lib/stickshift/21774ba7ead94cab881d
500      14148     1  1 06:26 ?        00:00:07 /usr/sbin/haproxy -f /var/lib/stickshift/21774ba7ead94cab881da54569e9b324//haproxy-1.4/conf/haproxy.cfg
500      14154     1  0 06:26 ?        00:00:00 haproxy_ctld                                                                    
500      19628 19609  0 06:28 ?        00:00:00 sshd: 21774ba7ead94cab881da54569e9b324@pts/1
500      19629 19628  0 06:28 ?        00:00:00 /bin/bash --init-file /usr/bin/rhcsh -i
500      20017 19993  0 06:29 ?        00:00:00 sshd: 21774ba7ead94cab881da54569e9b324@pts/2
500      20018 20017  0 06:29 ?        00:00:00 /bin/bash --init-file /usr/bin/rhcsh -i
500      29528 29515  0 06:34 ?        00:00:00 /bin/bash /usr/libexec/stickshift/cartridges/jbossas-7/info/bin/sync_gears.sh e52bea3355594643b801eecaa251212b@10.10.65.181
500      31353 29528  0 06:35 ?        00:00:00 cat /var/lib/stickshift/21774ba7ead94cab881da54569e9b324/app1//tmp/sync_gears.output.sbiR2YbbxMHHIV7d
500      32407 14154  2 06:36 ?        00:00:01 /usr/bin/ruby /usr/libexec/stickshift/cartridges/embedded/haproxy-1.4/info/bin/add-gear -n jizhao4 -a app1 -u 21774ba7ead94
<---------------------------->


Expected results:
The app should be auto scaled up without error.

Additional info:
I've also tried the php-5.3 cartridge and it works fine.
Comment 1 Bill DeCoste 2012-07-06 21:45:47 EDT
The only way I could reproduce this bug was to reduce the scale-up timeout, so I have increased the previous timeout.
Comment 2 jizhao 2012-07-10 08:18:13 EDT
Created attachment 597323 [details]
haproxy status screenshot

During GEAR_UP, both gears go UP and DOWN.
Comment 3 jizhao 2012-07-10 08:21:54 EDT
Still getting the same error on devenv_1882. I've attached the haproxy status screenshot. Changing bug status back to ASSIGNED.
Comment 4 jizhao 2012-07-10 08:46:47 EDT
Sometimes, I can't destroy the app after testing.
Here is the development.log on the server:
http://pastebin.test.redhat.com/96308
Comment 5 Bill DeCoste 2012-07-10 11:17:48 EDT
Hi Jiangtao,

Hoping you can provide me with some additional info on this bug. I was unable to recreate the original bug (only 1 JBoss instance/session after a SCALE_UP event) until I manually decreased the scale-up timeout. I increased that timeout in case it was the cause of the problem you were seeing.

Are you still seeing that problem or is there another problem? Are you hitting the backup stats page after a scale-up? I saw the haproxy stats attachment in the bug.

I haven't been able to recreate any problem. After a scale-up, I am able to successfully hit ROOT.war and never see the haproxy stats page that would indicate a load-balancing failure.

Thanks -Bill
Comment 6 Bill DeCoste 2012-07-10 12:17:28 EDT
I notice that AS7 is throwing the error below when running the load test. I've asked the jbossweb team to see if there's a way to increase the max active sessions, but I don't see anything in the jbossweb schema. In the meantime, could you:

1) Attach the server.log file for the 1st JBoss instance when you start the load test. I would like to see whether you are hitting the same error; it may be causing the failure.
2) Add the following line to the top of your JSP. This will stop Tomcat from creating a new session every time the JSP is hit:
<%@ page session="false" %>

2012/07/10 11:53:39,394 FATAL [org.apache.jasper.runtime.JspFactoryImpl] (http--127.0.250.129-8080-47) Exception initializing page context: java.lang.IllegalStateException: createSession: Too many active sessions
	at org.apache.catalina.session.StandardManager.createSession(StandardManager.java:293) [jbossweb-7.0.10.Final.jar:]
	at org.apache.catalina.connector.Request.doGetSession(Request.java:2665) [jbossweb-7.0.10.Final.jar:]
	at org.apache.catalina.connector.Request.getSession(Request.java:2375) [jbossweb-7.0.10.Final.jar:]
	at org.apache.catalina.connector.RequestFacade.getSession(RequestFacade.java:841) [jbossweb-7.0.10.Final.jar:]
	at org.apache.catalina.connector.RequestFacade.getSession(RequestFacade.java:852) [jbossweb-7.0.10.Final.jar:]
	at org.apache.jasper.runtime.PageContextImpl._initialize(PageContextImpl.java:146) [jbossweb-7.0.10.Final.jar:]
	at org.apache.jasper.runtime.PageContextImpl.initialize(PageContextImpl.java:124) [jbossweb-7.0.10.Final.jar:]
	at org.apache.jasper.runtime.JspFactoryImpl.internalGetPageContext(JspFactoryImpl.java:106) [jbossweb-7.0.10.Final.jar:]
	at org.apache.jasper.runtime.JspFactoryImpl.getPageContext(JspFactoryImpl.java:62) [jbossweb-7.0.10.Final.jar:]
	at org.apache.jsp.index_jsp._jspService(index_jsp.java:49)
	at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70) [jbossweb-7.0.10.Final.jar:]
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:847) [jboss-servlet-api_3.0_spec-1.0.0.Final.jar:1.0.0.Final]
	at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:369) [jbossweb-7.0.10.Final.jar:]
	at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:326) [jbossweb-7.0.10.Final.jar:]
	at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:253) [jbossweb-7.0.10.Final.jar:]
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:847) [jboss-servlet-api_3.0_spec-1.0.0.Final.jar:1.0.0.Final]
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:329) [jbossweb-7.0.10.Final.jar:]
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248) [jbossweb-7.0.10.Final.jar:]
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275) [jbossweb-7.0.10.Final.jar:]
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161) [jbossweb-7.0.10.Final.jar:]
	at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:154) [jboss-as-web-7.1.0.Final.jar:7.1.0.Final]
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155) [jbossweb-7.0.10.Final.jar:]
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) [jbossweb-7.0.10.Final.jar:]
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) [jbossweb-7.0.10.Final.jar:]
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:368) [jbossweb-7.0.10.Final.jar:]
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877) [jbossweb-7.0.10.Final.jar:]
	at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:671) [jbossweb-7.0.10.Final.jar:]
	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:930) [jbossweb-7.0.10.Final.jar:]
	at java.lang.Thread.run(Thread.java:679) [rt.jar:1.6.0_24]
Comment 7 Bill DeCoste 2012-07-10 14:52:15 EDT
Because of the LOW_MEMORY setting, the number of sessions is limited to 1024 by default. To override this, add the following below </extensions> in standalone.xml:

<system-properties>
   <property name="org.apache.catalina.session.StandardManager.MAX_ACTIVE_SESSIONS" value="-1"/>
</system-properties>

This should resolve the active-session limitation.
Comment 8 Bill DeCoste 2012-07-10 16:18:32 EDT
Not able to recreate any scale-up problems, but there are consistent problems with scale-down on the latest devenv:

I, [2012-07-10T16:12:17.469959 #16426]  INFO -- : GEAR_UP - capacity: 700.0% gear_count: 1 sessions: 70 up_thresh: 90.0%
I, [2012-07-10T16:13:01.933128 #23223]  INFO -- : Starting haproxy_ctld
D, [2012-07-10T16:13:01.962894 #23223] DEBUG -- : GEAR_INFO - capacity: 0.0% gear_count: 2 sessions: 1 up/remove_thresh: 90.0%/49.9% sec_left_til_remove: 120 gear_remove_thresh: 0/20
I, [2012-07-10T16:13:23.214079 #23223]  INFO -- : GEAR_UP - capacity: 110.0% gear_count: 2 sessions: 23 up_thresh: 90.0%
D, [2012-07-10T16:13:28.067165 #23223] DEBUG -- : GEAR_UP - add-gear: exit: 0  stdout: Already at the maximum number of gears allowed for either the app or your account.

D, [2012-07-10T16:13:28.216436 #23223] DEBUG -- : GEAR_INFO - capacity: 0.0% gear_count: 2 sessions: 0 up/remove_thresh: 90.0%/49.9% sec_left_til_remove: 115 gear_remove_thresh: 0/20
I, [2012-07-10T16:15:25.458342 #23223]  INFO -- : GEAR_DOWN - capacity: 0.0% gear_count: 2 sessions: 0 remove_thresh: 49.9%
D, [2012-07-10T16:16:38.017702 #23223] DEBUG -- : GEAR_DOWN - remove-gear: exit: 256  stdout: 
D, [2012-07-10T16:16:38.030316 #23223] DEBUG -- : GEAR_INFO - capacity: 0.0% gear_count: 2 sessions: 0 up/remove_thresh: 90.0%/49.9% sec_left_til_remove: 0 gear_remove_thresh: 20/20
I, [2012-07-10T16:17:38.222018 #23223]  INFO -- : GEAR_DOWN - capacity: 0.0% gear_count: 2 sessions: 0 remove_thresh: 49.9%
Comment 9 Bill DeCoste 2012-07-10 17:00:08 EDT
Another data point: if there is no traffic and I manually scale up, then the scale-down works fine. It looks like scale-down only fails under load.
Comment 10 Bill DeCoste 2012-07-10 17:46:21 EDT
The scale-down issue only occurs when JBoss is configured to allow unlimited sessions. When the JSP has <%@ page session="false" %>, it behaves correctly. I am moving this to QA to confirm the original issues and will continue to look into the session/scale-down problem.
Comment 11 jizhao 2012-07-11 04:46:57 EDT
Created attachment 597505 [details]
development.log during auto scaling up

I've attached the development.log during auto scaling up.
Besides, I also get some error messages in the app's scale_events.log:

D, [2012-07-11T04:36:22.949483 #5076] DEBUG -- : GEAR_INFO - capacity: 10.0% gear_count: 1 sessions: 1 up/remove_thresh: 90.0%/49.9% sec_left_til_remove: 84 gear_remove_thresh: 0/20
I, [2012-07-11T04:36:26.002934 #5076]  INFO -- : GEAR_UP - capacity: 380.0% gear_count: 1 sessions: 38 up_thresh: 90.0%
/usr/libexec/stickshift/cartridges/embedded/haproxy-1.4/info/bin/add-gear:61:in `initialize': undefined method `http' for #<RestClient::InternalServerError:0x7fd369f72be8> (NoMethodError)
	from /usr/libexec/stickshift/cartridges/embedded/haproxy-1.4/info/bin/add-gear:211:in `new'
	from /usr/libexec/stickshift/cartridges/embedded/haproxy-1.4/info/bin/add-gear:211
D, [2012-07-11T04:38:35.470239 #5076] DEBUG -- : GEAR_UP - add-gear: exit: 256  stdout: 
D, [2012-07-11T04:38:35.503193 #5076] DEBUG -- : GEAR_INFO - capacity: 0.0% gear_count: 1 sessions: 0 up/remove_thresh: 90.0%/49.9% sec_left_til_remove: 0 gear_remove_thresh: 0/20
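The NoMethodError in the traceback above is the classic pattern of a rescue clause calling a method (`http`) that the rescued exception object does not define, which turns an HTTP 500 from the broker into a crash of add-gear itself. A minimal Ruby sketch of that failure shape; the exception class and method names below are stand-ins, not the real RestClient or add-gear code:

```ruby
# Stand-in for RestClient::InternalServerError in the traceback above.
class FakeInternalServerError < StandardError; end

def add_gear_sketch
  # Imagine a REST call to the broker failing with a 500 here.
  raise FakeInternalServerError, "500 Internal Server Error"
rescue FakeInternalServerError => e
  # BUG: the exception object has no `http` method, so the error handler
  # itself raises NoMethodError instead of reporting the 500.
  e.http
end

begin
  add_gear_sketch
rescue NoMethodError => e
  puts "add-gear crashed while handling the error: #{e.class}"
end
```

The net effect matches the log: the original HTTP failure is lost, add-gear exits non-zero (the `exit: 256` line), and haproxy_ctld simply retries GEAR_UP on the next sample.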
Comment 12 Bill DeCoste 2012-07-11 08:18:35 EDT
I haven't been able to reproduce any error, but I'll keep trying. The error above is one reason why we should not be using Ruby.
Comment 13 Bill DeCoste 2012-07-11 08:27:03 EDT
Can you please provide the exact steps you used to recreate this problem? Were the session errors corrected, and if so, how?
Comment 14 Bill DeCoste 2012-07-11 08:31:54 EDT
What environment/ami are you using to create this error?
Comment 15 jizhao 2012-07-11 09:50:25 EDT
Created attachment 597578 [details]
screenshot of haproxy-status
Comment 16 jizhao 2012-07-11 09:53:09 EDT
The error I'm hitting now is a little different from the original one. Instead of two gears going UP and DOWN, the default gear goes DOWN.


Version-Release number of selected component (if applicable):
devenv_1884

How reproducible:
Always


Steps to Reproduce:
1. Create a scalable jbossas-7 app:
$ rhc app create -a app1 -t jbossas-7 -s -d
2. Create "index.jsp" under <git_repo>/src/main/webapp/ and remove "index.html":
$ rm <git_repo>/src/main/webapp/index.html
==> <git_repo>/src/main/webapp/index.jsp <==
<%@ page session="false" %>
<%@ page contentType="text/html" language="java" %>
<%@ page import="javax.naming.*" %>
<%@ page import="java.util.*" %>
<%
Map map = System.getenv();
out.print(map.get("OPENSHIFT_GEAR_DNS"));
%>
3. Add the following to <git_repo>/.openshift/config/standalone.xml:
    <system-properties>
       <property name="org.apache.catalina.session.StandardManager.MAX_ACTIVE_SESSIONS" value="-1"/>
    </system-properties>

4. Use the OpenShift shell to log in to the app, stop haproxy_ctld, and restart it with --debug so we get a detailed haproxy log:
$ ssh <app_uuid>@<app_host>
$ haproxy_ctld_daemon stop
$ haproxy_ctld --debug &
5. Use "ab" to generate load:
$ ab -c 200 -n 100000 <app_url>
6. Tail "scale_events.log" and check <app_url>/haproxy-status/:
$ rhc app tail -a app1 -f app1/logs/scale_events.log -o '-n +1'


Actual results:
On my side, in <app_url>/haproxy-status/, after the total session count grows to 26251 the default gear goes DOWN. If I then visit <app_url>/, I get the haproxy-status page instead of the app. Here is the screenshot: https://bugzilla.redhat.com/attachment.cgi?id=597578
In scale_events.log, I get:
....
D, [2012-07-11T09:33:51.146001 #30386] DEBUG -- : GEAR_INFO - capacity: 0.0% gear_count: 1 sessions: 0 up/remove_thresh: 90.0%/49.9% sec_left_til_remove: 24 gear_remove_thresh: 0/20
I, [2012-07-11T09:33:54.160020 #30386]  INFO -- : GEAR_UP - capacity: 200.0% gear_count: 1 sessions: 20 up_thresh: 90.0%
D, [2012-07-11T09:36:15.816112 #30386] DEBUG -- : GEAR_UP - add-gear: exit: 256  stdout: 
D, [2012-07-11T09:36:15.824645 #30386] DEBUG -- : GEAR_INFO - capacity: 0.0% gear_count: 1 sessions: 0 up/remove_thresh: 90.0%/49.9% sec_left_til_remove: 0 gear_remove_thresh: 0/20
....

Here's the development.log: http://pastebin.test.redhat.com/96516
Comment 17 Bill DeCoste 2012-07-11 13:18:31 EDT
Still unable to reproduce (latest AMI tested is ami-1b852a72). The development log shows a failure in parsing an mcollective response.
Comment 18 Bill DeCoste 2012-07-11 17:55:24 EDT
Have recreated on a medium instance ... stay tuned
Comment 19 Bill DeCoste 2012-07-11 18:16:28 EDT
The mcollective timeout causes the original JBoss instance to be deleted, leaving only an haproxy gear with 0 backing JBoss instances.
Comment 20 Bill DeCoste 2012-07-11 18:41:34 EDT
http://pastebin.com/J53AsZyN
Comment 21 Bill DeCoste 2012-07-11 18:43:56 EDT
D, [2012-07-11T17:41:55.197647 #9381] DEBUG -- : GEAR_INFO - capacity: 10.0% gear_count: 1 sessions: 1 up/remove_thresh: 90.0%/49.9% sec_left_til_remove: 0 gear_remove_thresh: 0/20
I, [2012-07-11T17:41:58.237579 #9381]  INFO -- : GEAR_UP - capacity: 230.0% gear_count: 1 sessions: 23 up_thresh: 90.0%
D, [2012-07-11T17:44:17.593840 #9381] DEBUG -- : GEAR_UP - add-gear: exit: 256  stdout: 
D, [2012-07-11T17:44:17.623614 #9381] DEBUG -- : GEAR_INFO - capacity: 0.0% gear_count: 1 sessions: 0 up/remove_thresh: 90.0%/49.9% sec_left_til_remove: 0 gear_remove_thresh: 0/20
Comment 22 Bill DeCoste 2012-07-11 20:14:02 EDT
Post-fix, you will still see exit-256 failures for scale-up at times on a medium instance. The app should keep on rolling, however.
Comment 23 Rajat Chopra 2012-07-11 20:26:35 EDT
Fixed with rev#06250abd651f4b4be3fe28cd09834ce4b18ba30e in crankcase.repo.
Comment 24 jizhao 2012-07-12 05:23:43 EDT
Verified on devenv-stage_223.
