Description of problem:

If the RHQ server is busy, the RHQ Agent can stop reporting metrics. This can happen if the database server is overloaded with too many clients. Consequently, no more metrics are reported and alerting cannot occur.

This is a request that the agent plugin do one of the following:

* Generate an event saying this condition (or others like it) occurred. Unfortunately, this may generate a ton of events across hundreds of agents.
* Mark itself as unavailable. In effect, the agent has stopped working. This may be simpler to diagnose.
* Have a sub-resource mark itself as unavailable. This is probably ideal.

How reproducible:

You can probably induce this by reducing the number of server threads to 1, running a large number of agents, and creating a table lock. Or maybe just a table lock is enough.

You should see:

2013-01-18 19:04:34,075 ERROR [ClientCommandSenderTask Thread #4] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=st11p01ad-ad001.apple.com, rhq.externalizable-strategy=AGENT, rhq.security-token=YWI LXUV/JelWuPeWz4KN03nqFKNaBO2ZRvHb4lcuW2a4KmLXxuuH9Daa3CKaO0f0jJo=, rhq.guaranteed-delivery=true, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[mergeMeasurementReport], targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService}]]. Cause: java.util.concurrent.TimeoutException:null. Cause: java.util.concurrent.TimeoutException

This should be detectable.
I was thinking the following:

diff --git a/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml b/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml
index 01c0a41..adccdf3 100644
--- a/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml
+++ b/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml
@@ -55,6 +55,12 @@
             <c:simple-property name="snapshotLogEnabled" type="boolean" default="true" description="If true, take a snapshot of the log files"/>
             <c:simple-property name="snapshotDataEnabled" type="boolean" default="true" description="If true, take a snapshot of the data files"/>
          </c:group>
+         <c:group name="errorDetection" displayName="Error Detection" hiddenByDefault="true">
+            <c:simple-property name="maxCommandsQueued"
+                               displayName="Max commands queued."
+                               default="100" readOnly="false" required="true"
+                               description="Maximum number of messages currently queued waiting to be sent to the RHQ Server. This could happen if the server is overloaded and d
+         </c:group>
          <c:group name="advanced" displayName="Advanced" hiddenByDefault="true">
             <c:simple-property name="childJmxServerName" displayName="JVM Name"
                                default="JVM" readOnly="true" required="false"

Can I work on a patch for this? Would it be accepted?
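For illustration only, the check implied by the proposed maxCommandsQueued property might look roughly like the sketch below. It is not taken from the RHQ code base: the class name, the Availability enum, and the way the queue depth is obtained (a plain LongSupplier) are all hypothetical, and the real plugin would need to read the count from whatever the agent actually exposes.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch: reports DOWN when more commands are queued for the
 * server than the configured maxCommandsQueued threshold allows.
 * None of these names are part of the RHQ API.
 */
final class QueueDepthCheck {

    /** Stand-in for an availability result; an assumption, not an RHQ type. */
    enum Availability { UP, DOWN }

    private final LongSupplier queuedCommands; // assumed source of the current queue depth
    private final long maxCommandsQueued;      // threshold from the plugin configuration

    QueueDepthCheck(LongSupplier queuedCommands, long maxCommandsQueued) {
        if (queuedCommands == null) {
            throw new IllegalArgumentException("queuedCommands must not be null");
        }
        if (maxCommandsQueued < 0) {
            throw new IllegalArgumentException("maxCommandsQueued must not be negative");
        }
        this.queuedCommands = queuedCommands;
        this.maxCommandsQueued = maxCommandsQueued;
    }

    /** DOWN only when the queue depth strictly exceeds the threshold. */
    Availability check() {
        return queuedCommands.getAsLong() > maxCommandsQueued
                ? Availability.DOWN
                : Availability.UP;
    }
}
```

With the default of 100 from the diff, a queue of exactly 100 commands would still count as UP and 101 as DOWN. Whether the comparison should be strict is a design choice left open here.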