Description of problem:
The RHQ server, if used with over a hundred (or a thousand) agents, cannot handle enough load out of the box to cleanly do an upgrade. This may not be customer-typical, but it also doesn't seem likely to do much harm to adjust these settings even on smaller installations. Although I cannot identify which change is most important, the following need to be tuned:

1) Increase the default size of the storage node memory usage. I would say that for about 1000 nodes, around 5 GB of heap memory for Cassandra is good, though I think the installer should simply pick a good number based on the local free memory size (a sketch of such a heuristic follows at the end of this report).

Example error:

01:44:14,249 ERROR [org.jboss.as.ejb3.invocation] (http-/0.0.0.0:7080-64) JBAS014134: EJB Invocation failed on component ResourceManagerBean for method public abstract void org.rhq.enterprise.server.resource.ResourceManagerLocal.addResourceError(org.rhq.core.domain.resource.ResourceError): javax.ejb.EJBException: JBAS014516: Failed to acquire a permit within 5 MINUTES
    at org.jboss.as.ejb3.pool.strictmax.StrictMaxPool.get(StrictMaxPool.java:109) [jboss-as-ejb3-7.2.0.Alpha1-redhat-4.jar:7.2.0.Alpha1-redhat-4]
    at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:47) [jboss-as-ejb3-7.2.0.Alpha1-redhat-4.jar:7.2.0.Alpha1-redhat-4]

2) Increase the size of the EJB pool. What happened with the 4.5.1 -> 4.9 upgrade was that the number of inventory requests went up substantially in a short time, which caused many, many timeouts.

<strict-max-pool name="slsb-strict-max-pool" max-pool-size="2000" instance-acquisition-timeout="1" instance-acquisition-timeout-unit="MINUTES"/>

3) Increase the out-of-box communication limits:

rhq.server.startup.web.max-connections=1000
rhq.server.agent-downloads-limit=45
rhq.server.client-downloads-limit=5
rhq.communications.global-concurrency-limit=200
rhq.server.concurrency-limit.inventory-report=25
rhq.server.concurrency-limit.availability-report=25
rhq.server.concurrency-limit.inventory-sync=25
rhq.server.concurrency-limit.content-report=25
rhq.server.concurrency-limit.content-download=25
rhq.server.concurrency-limit.measurement-report=25
rhq.server.concurrency-limit.measurement-schedule-request=25
rhq.server.concurrency-limit.configuration-update=25

Version-Release number of selected component (if applicable):
4.9 (upgraded from 4.5.1)
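For illustration of the "pick a heap size from local memory" idea in item 1, here is a minimal sketch of such a heuristic. The class name, the one-quarter-of-RAM ratio, the 512 MB / 8 GB clamps, and the use of total physical memory (rather than free memory) are my assumptions, not anything the RHQ installer actually does; it also assumes a HotSpot/OpenJDK JVM for the com.sun.management cast:

    import java.lang.management.ManagementFactory;

    public class StorageHeapSizer {
        // Recommend a Cassandra heap: one quarter of physical RAM,
        // clamped to the range [512 MB, 8 GB].
        public static long recommendedHeapMb() {
            com.sun.management.OperatingSystemMXBean os =
                    (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
            long totalMb = os.getTotalPhysicalMemorySize() / (1024 * 1024);
            return Math.max(512, Math.min(totalMb / 4, 8192));
        }

        public static void main(String[] args) {
            System.out.println("-Xmx" + recommendedHeapMb() + "m");
        }
    }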
Minimally, this should be considered as documentation for JON 3.2.
I cloned this into a JON 3.2 BZ (removing the blocker on this, and adding it to the new JON BZ).
For 2000 agents, the numbers need to be increased:

rhq.server.startup.web.max-connections=2000
rhq.server.agent-downloads-limit=45
rhq.server.client-downloads-limit=5
rhq.communications.global-concurrency-limit=500
rhq.server.concurrency-limit.inventory-report=100
rhq.server.concurrency-limit.availability-report=100
rhq.server.concurrency-limit.inventory-sync=100
rhq.server.concurrency-limit.content-report=100
rhq.server.concurrency-limit.content-download=100
rhq.server.concurrency-limit.measurement-report=100
rhq.server.concurrency-limit.measurement-schedule-request=100
rhq.server.concurrency-limit.configuration-update=100

I think the biggest problem is that you don't know when you have hit one of these limits; nothing obvious shows up in the logs.
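The two sets of numbers above suggest the limits should scale roughly linearly with agent count. A rough sketch of that as a helper; the class is hypothetical and the ratios are interpolated from the 1000- and 2000-agent examples, not tested values (the download-limit properties are left out since they are the same in both lists):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ConcurrencyLimits {
        // Scale limits with agent count: roughly one web connection per agent,
        // one global-concurrency permit per 4 agents, and one permit per 20 agents
        // for each report/sync/update category, with the out-of-box values as floors.
        public static Map<String, Integer> forAgentCount(int agents) {
            Map<String, Integer> p = new LinkedHashMap<>();
            p.put("rhq.server.startup.web.max-connections", Math.max(1000, agents));
            p.put("rhq.communications.global-concurrency-limit", Math.max(200, agents / 4));
            int perCategory = Math.max(25, agents / 20);
            for (String name : new String[] {
                    "inventory-report", "availability-report", "inventory-sync",
                    "content-report", "content-download", "measurement-report",
                    "measurement-schedule-request", "configuration-update" }) {
                p.put("rhq.server.concurrency-limit." + name, perCategory);
            }
            return p;
        }

        public static void main(String[] args) {
            forAgentCount(2000).forEach((k, v) -> System.out.println(k + "=" + v));
        }
    }

For 2000 agents this reproduces the values listed above.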
One problem is that the database can be overwhelmed when 2000 agents connect at once; with all the transactions coming in, you quickly overload the connection pool. There needs to be a way to throttle this without causing the database to blow up as well. One solution would be to have the server reply to the agent and tell it when to wake up again: for example, to the first 100 or so agents, send nothing (no delay); the next 100 wait one minute, the next 100 two minutes, and so on.
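For illustration, a minimal sketch of that staggered wake-up scheme; this is a hypothetical helper, not existing RHQ code, and the batch size and step are just the example numbers above:

    import java.util.concurrent.atomic.AtomicInteger;

    public class AgentReconnectThrottle {
        // Staggered wake-up: the first BATCH_SIZE agents get no delay, the next
        // batch waits one minute, the batch after that two minutes, and so on.
        private static final int BATCH_SIZE = 100;
        private static final long STEP_MILLIS = 60_000L;

        private final AtomicInteger arrivals = new AtomicInteger();

        // Delay (in milliseconds) the server would hand back to the connecting agent.
        public long delayForNextAgent() {
            int batch = arrivals.getAndIncrement() / BATCH_SIZE; // 0, 1, 2, ...
            return batch * STEP_MILLIS;
        }
    }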