Bug 1025844 - RFE: Ship with RHQ server tuned for higher capacity
Status: NEW
Product: RHQ Project
Classification: Other
Component: Installer
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Assigned To: RHQ Project Maintainer
QA Contact: Mike Foley
Keywords: Documentation
Blocks: 1026428
Reported: 2013-11-01 13:52 EDT by Elias Ross
Modified: 2014-01-15 20:17 EST (History)
CC: 2 users

Doc Type: Bug Fix
Cloned to: 1026428
Type: Bug


Description Elias Ross 2013-11-01 13:52:00 EDT
Description of problem:

With more than a hundred (or a thousand) agents, the RHQ server cannot, out of the box, handle enough load to cleanly complete an upgrade.

This scale may not be typical for customers, but adjusting these settings seems unlikely to do harm even on smaller installations.

Although I cannot say which change matters most, the following need to be tuned:

1) Increase the default storage node memory. For about 1000 agents, around 5 GB of heap for Cassandra is a good starting point, though I think the installer should simply pick a reasonable value based on the host's free memory.
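A sketch of what such an installer heuristic could look like (the quarter-of-free-memory rule and the bounds are illustrative assumptions, not existing RHQ logic):

```java
// Sketch of a heap-sizing heuristic for the installer: give Cassandra
// a quarter of free physical memory, bounded to a sane range.
// The bounds (512 MB floor, 8 GB ceiling) are illustrative assumptions.
public class StorageHeapHeuristic {
    static final long MIN_HEAP_MB = 512;
    static final long MAX_HEAP_MB = 8192;

    /** @param freeMemoryMb free physical memory on the host, in MB */
    static long recommendedHeapMb(long freeMemoryMb) {
        long quarter = freeMemoryMb / 4;
        return Math.max(MIN_HEAP_MB, Math.min(MAX_HEAP_MB, quarter));
    }

    public static void main(String[] args) {
        // A host with 20 GB free lands close to the 5 GB suggested above.
        System.out.println(recommendedHeapMb(20480)); // prints 5120
    }
}
```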

Example error:

01:44:14,249 ERROR [org.jboss.as.ejb3.invocation] (http-/0.0.0.0:7080-64) JBAS014134: EJB Invocation failed on component ResourceManagerBean for method public abstract void org.rhq.enterprise.server.resource.ResourceManagerLocal.addResourceError(org.rhq.core.domain.resource.ResourceError): javax.ejb.EJBException: JBAS014516: Failed to acquire a permit within 5 MINUTES
        at org.jboss.as.ejb3.pool.strictmax.StrictMaxPool.get(StrictMaxPool.java:109) [jboss-as-ejb3-7.2.0.Alpha1-redhat-4.jar:7.2.0.Alpha1-redhat-4]
        at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:47) [jboss-as-ejb3-7.2.0.Alpha1-redhat-4.jar:7.2.0.Alpha1-redhat-4]


2) Increase the size of the EJB pool. During the 4.5.1 -> 4.9 upgrade, the number of inventory requests rose substantially in a short time, which caused many, many timeouts.

    <strict-max-pool name="slsb-strict-max-pool" max-pool-size="2000" instance-acquisition-timeout="1" instance-acquisition-timeout-unit="MINUTES"/>

3) Increase the out-of-box communication limits:

rhq.server.startup.web.max-connections=1000
rhq.server.agent-downloads-limit=45
rhq.server.client-downloads-limit=5
rhq.communications.global-concurrency-limit=200
rhq.server.concurrency-limit.inventory-report=25
rhq.server.concurrency-limit.availability-report=25
rhq.server.concurrency-limit.inventory-sync=25
rhq.server.concurrency-limit.content-report=25
rhq.server.concurrency-limit.content-download=25
rhq.server.concurrency-limit.measurement-report=25
rhq.server.concurrency-limit.measurement-schedule-request=25
rhq.server.concurrency-limit.configuration-update=25


Version-Release number of selected component (if applicable): 4.9 (from 4.5.1)
Comment 1 Mike Foley 2013-11-01 14:00:14 EDT
Minimally, this should be considered as documentation for JON 3.2.
Comment 2 Mike Foley 2013-11-04 10:53:29 EST
I cloned this into a JON 3.2 BZ (removing the blocker on this bug and adding it to the new JON BZ).
Comment 3 Elias Ross 2014-01-14 16:52:05 EST
For 2000 agents, the numbers need to be increased further:

rhq.server.startup.web.max-connections=2000
rhq.server.agent-downloads-limit=45
rhq.server.client-downloads-limit=5
rhq.communications.global-concurrency-limit=500
rhq.server.concurrency-limit.inventory-report=100
rhq.server.concurrency-limit.availability-report=100
rhq.server.concurrency-limit.inventory-sync=100
rhq.server.concurrency-limit.content-report=100
rhq.server.concurrency-limit.content-download=100
rhq.server.concurrency-limit.measurement-report=100
rhq.server.concurrency-limit.measurement-schedule-request=100
rhq.server.concurrency-limit.configuration-update=100

I think the biggest problem is that you don't know when you have hit a limit; nothing obvious appears in the logs.
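To illustrate the kind of instrumentation that would help, a concurrency limiter could warn when it saturates instead of failing silently (a sketch only; the class and log message are invented, not RHQ's actual code):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Sketch: a concurrency limiter that logs loudly when it saturates,
// so an operator can tell from the logs which limit was hit.
public class LoggingConcurrencyLimit {
    private final String name;
    private final Semaphore permits;

    public LoggingConcurrencyLimit(String name, int limit) {
        this.name = name;
        this.permits = new Semaphore(limit);
    }

    /** Returns true if a permit was obtained within the timeout. */
    public boolean acquire(long timeout, TimeUnit unit) throws InterruptedException {
        if (permits.tryAcquire()) {
            return true; // fast path: limit not yet reached
        }
        // This warning is the point: today nothing like it appears in the logs.
        System.err.println("WARN: concurrency limit '" + name
                + "' exhausted; waiting up to " + timeout + " " + unit);
        return permits.tryAcquire(timeout, unit);
    }

    public void release() {
        permits.release();
    }
}
```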
Comment 4 Elias Ross 2014-01-15 20:17:45 EST
One problem is that the database can be overwhelmed when 2000 agents connect at once; with all the incoming transactions, you quickly overload the connection pool.

There needs to be a way to throttle this without causing the database to blow up as well.

One solution would be for the server to reply to each agent and tell it when to wake up again. For example: the first 100 or so agents get no delay, the next 100 one minute, the next 100 two minutes, and so on.
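A minimal sketch of that staggered wake-up scheme (the batch size of 100 and one-minute step are just the illustrative numbers from above):

```java
// Sketch of the staggered wake-up idea: the server tells each connecting
// agent how long to sleep before syncing, stepping up one minute per
// batch of a hundred agents.
public class AgentBackoff {
    static final int BATCH_SIZE = 100;
    static final long STEP_MILLIS = 60_000L; // one minute per batch

    /** Delay the Nth connecting agent (0-based) should wait before syncing. */
    static long wakeUpDelayMillis(int agentIndex) {
        return (agentIndex / BATCH_SIZE) * STEP_MILLIS;
    }
}
```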
