Description of problem:
The RHQ server, if used with over a hundred (or a thousand) agents, cannot handle enough load out of the box to cleanly do an upgrade. This may not be customer-typical, but it also doesn't seem likely to do much harm to adjust these settings even on smaller installations. Although I cannot identify which change is most important, the following need to be tuned:

1) Increase the default size of the storage node memory usage. I would say that for about 1000 nodes, around 5 GB of heap memory for Cassandra is good, though I think the installer should simply pick a good number based on the local free memory size (a sketch of such a heuristic follows at the end of this report).

Example error:

01:44:14,249 ERROR [org.jboss.as.ejb3.invocation] (http-/0.0.0.0:7080-64) JBAS014134: EJB Invocation failed on component ResourceManagerBean for method public abstract void org.rhq.enterprise.server.resource.ResourceManagerLocal.addResourceError(org.rhq.core.domain.resource.ResourceError): javax.ejb.EJBException: JBAS014516: Failed to acquire a permit within 5 MINUTES
    at org.jboss.as.ejb3.pool.strictmax.StrictMaxPool.get(StrictMaxPool.java:109) [jboss-as-ejb3-7.2.0.Alpha1-redhat-4.jar:7.2.0.Alpha1-redhat-4]
    at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:47) [jboss-as-ejb3-7.2.0.Alpha1-redhat-4.jar:7.2.0.Alpha1-redhat-4]

2) Increase the size of the EJB pool. What happened with the 4.5.1 -> 4.9 upgrade was that the number of inventory requests went up substantially in a short time, which caused many, many timeouts.

<strict-max-pool name="slsb-strict-max-pool" max-pool-size="2000" instance-acquisition-timeout="1" instance-acquisition-timeout-unit="MINUTES"/>

3) Increase the out-of-box communication limits:

rhq.server.startup.web.max-connections=1000
rhq.server.agent-downloads-limit=45
rhq.server.client-downloads-limit=5
rhq.communications.global-concurrency-limit=200
rhq.server.concurrency-limit.inventory-report=25
rhq.server.concurrency-limit.availability-report=25
rhq.server.concurrency-limit.inventory-sync=25
rhq.server.concurrency-limit.content-report=25
rhq.server.concurrency-limit.content-download=25
rhq.server.concurrency-limit.measurement-report=25
rhq.server.concurrency-limit.measurement-schedule-request=25
rhq.server.concurrency-limit.configuration-update=25

Version-Release number of selected component (if applicable):
4.9 (upgraded from 4.5.1)
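For illustration of the "pick a heap size from local memory" idea in item 1, here is a minimal sketch of such a heuristic. The class name, the one-quarter-of-RAM ratio, the 512 MB / 8 GB clamps, and the use of total physical memory (rather than free memory) are my assumptions, not anything the RHQ installer actually does; it also assumes a HotSpot/OpenJDK JVM for the com.sun.management cast:

    import java.lang.management.ManagementFactory;

    public class StorageHeapSizer {
        // Recommend a Cassandra heap: one quarter of physical RAM,
        // clamped to the range [512 MB, 8 GB].
        public static long recommendedHeapMb() {
            com.sun.management.OperatingSystemMXBean os =
                    (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
            long totalMb = os.getTotalPhysicalMemorySize() / (1024 * 1024);
            return Math.max(512, Math.min(totalMb / 4, 8192));
        }

        public static void main(String[] args) {
            System.out.println("-Xmx" + recommendedHeapMb() + "m");
        }
    }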
Minimally, this should be considered as documentation for JON 3.2.
I cloned this into a JON 3.2 BZ (removing the blocker on this, and adding it to the new JON BZ).
For 2000 agents, the numbers need to be increased:

rhq.server.startup.web.max-connections=2000
rhq.server.agent-downloads-limit=45
rhq.server.client-downloads-limit=5
rhq.communications.global-concurrency-limit=500
rhq.server.concurrency-limit.inventory-report=100
rhq.server.concurrency-limit.availability-report=100
rhq.server.concurrency-limit.inventory-sync=100
rhq.server.concurrency-limit.content-report=100
rhq.server.concurrency-limit.content-download=100
rhq.server.concurrency-limit.measurement-report=100
rhq.server.concurrency-limit.measurement-schedule-request=100
rhq.server.concurrency-limit.configuration-update=100

I think the biggest problem is that you don't know when you have hit one of these limits; nothing obvious shows up in the logs.
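The two sets of numbers above suggest the limits should scale roughly linearly with agent count. A rough sketch of that as a helper; the class is hypothetical and the ratios are interpolated from the 1000- and 2000-agent examples, not tested values (the download-limit properties are left out since they are the same in both lists):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ConcurrencyLimits {
        // Scale limits with agent count: roughly one web connection per agent,
        // one global-concurrency permit per 4 agents, and one permit per 20 agents
        // for each report/sync/update category, with the out-of-box values as floors.
        public static Map<String, Integer> forAgentCount(int agents) {
            Map<String, Integer> p = new LinkedHashMap<>();
            p.put("rhq.server.startup.web.max-connections", Math.max(1000, agents));
            p.put("rhq.communications.global-concurrency-limit", Math.max(200, agents / 4));
            int perCategory = Math.max(25, agents / 20);
            for (String name : new String[] {
                    "inventory-report", "availability-report", "inventory-sync",
                    "content-report", "content-download", "measurement-report",
                    "measurement-schedule-request", "configuration-update" }) {
                p.put("rhq.server.concurrency-limit." + name, perCategory);
            }
            return p;
        }

        public static void main(String[] args) {
            forAgentCount(2000).forEach((k, v) -> System.out.println(k + "=" + v));
        }
    }

For 2000 agents this reproduces the values listed above.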
One problem is that the database can be overwhelmed when 2000 agents connect at once; with all the transactions coming in, you quickly overload the connection pool. There needs to be a way to throttle this without causing the database to blow up as well. One solution would be to have the server reply to the agent and tell it when to wake up again: for example, to the first 100 or so agents, send nothing (no delay); the next 100 wait one minute, the next 100 two minutes, and so on.
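For illustration, a minimal sketch of that staggered wake-up scheme; this is a hypothetical helper, not existing RHQ code, and the batch size and step are just the example numbers above:

    import java.util.concurrent.atomic.AtomicInteger;

    public class AgentReconnectThrottle {
        // Staggered wake-up: the first BATCH_SIZE agents get no delay, the next
        // batch waits one minute, the batch after that two minutes, and so on.
        private static final int BATCH_SIZE = 100;
        private static final long STEP_MILLIS = 60_000L;

        private final AtomicInteger arrivals = new AtomicInteger();

        // Delay (in milliseconds) the server would hand back to the connecting agent.
        public long delayForNextAgent() {
            int batch = arrivals.getAndIncrement() / BATCH_SIZE; // 0, 1, 2, ...
            return batch * STEP_MILLIS;
        }
    }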