Bug 901672

Summary: If the agent stops reporting metrics, then it should be marked as unavailable and/or send an event saying it can't send metrics
Product: [Other] RHQ Project
Reporter: Elias Ross <genman>
Component: Agent
Assignee: Nobody <nobody>
Status: NEW
Severity: unspecified
Priority: unspecified
Version: unspecified
CC: hrupp
Flags: genman: needinfo?
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug

Description Elias Ross 2013-01-18 19:12:13 UTC
Description of problem:

If the RHQ server is busy, the RHQ Agent can stop reporting metrics. This can happen if the database server is overloaded with too many clients. Consequently, no more metrics are reported and alerting cannot occur.

This is a request that the agent plugin do one of the following:
* Generate an event reporting this condition (and others like it). Unfortunately, this may generate a ton of events across hundreds of agents.
* Mark the agent itself as unavailable. In effect, the agent has stopped working, and this may be simpler to diagnose.
* Have a sub-resource mark itself as unavailable. This is probably ideal.

How reproducible:

You can probably induce this by reducing the number of server threads to 1, running a large number of agents, and creating a table lock. A table lock alone may be enough.

You should see:

2013-01-18 19:04:34,075 ERROR [ClientCommandSenderTask Thread #4] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=st11p01ad-ad001.apple.com, rhq.externalizable-strategy=AGENT, rhq.security-token=YWILXUV/JelWuPeWz4KN03nqFKNaBO2ZRvHb4lcuW2a4KmLXxuuH9Daa3CKaO0f0jJo=, rhq.guaranteed-delivery=true, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[mergeMeasurementReport], targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService}]]. Cause: java.util.concurrent.TimeoutException:null. Cause: java.util.concurrent.TimeoutException

This should be detectable.
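
As a rough illustration of "detectable", the send-failed entry above could be recognized by its message key and the invocation name it carries. This is only a sketch: SendFailureDetector and its methods are hypothetical names, not part of RHQ, and it assumes each log entry arrives as a single line of text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of RHQ: recognizes the send-failed log entry.
final class SendFailureDetector {
    // Matches the message key seen in the log and captures the invocation name.
    private static final Pattern SEND_FAILED = Pattern.compile(
        "\\{ClientCommandSenderTask\\.send-failed\\}.*?NameBasedInvocation\\[(\\w+)\\]");

    /** Returns the failed invocation name, or null if the line is not a send failure. */
    static String failedInvocation(String logLine) {
        Matcher m = SEND_FAILED.matcher(logLine);
        return m.find() ? m.group(1) : null;
    }

    /** True if the line is a send failure whose cause mentions a TimeoutException. */
    static boolean isTimeout(String logLine) {
        return failedInvocation(logLine) != null
            && logLine.contains("java.util.concurrent.TimeoutException");
    }
}
```

For the entry above, failedInvocation would yield mergeMeasurementReport, which is the case that matters for metrics. Scraping the log is a stopgap; a check inside the agent itself would be more robust.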

Comment 1 Elias Ross 2013-01-18 20:09:19 UTC
I was thinking the following:
diff --git a/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml b/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml
index 01c0a41..adccdf3 100644
--- a/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml
+++ b/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml
@@ -55,6 +55,12 @@
             <c:simple-property name="snapshotLogEnabled" type="boolean" default="true" description="If true, take a snapshot of the log files"/>
             <c:simple-property name="snapshotDataEnabled" type="boolean" default="true" description="If true, take a snapshot of the data files"/>
          </c:group>
+         <c:group name="errorDetection" displayName="Error Detection" hiddenByDefault="true">
+            <c:simple-property name="maxCommandsQueued"
+                               displayName="Max commands queued."
+                               default="100" readOnly="false" required="true"
+                               description="Maximum number of messages currently queued waiting to be sent to the RHQ Server. This could happen if the server is overloaded and d
+         </c:group>
          <c:group name="advanced" displayName="Advanced" hiddenByDefault="true">
             <c:simple-property name="childJmxServerName" displayName="JVM Name" default="JVM"
                                readOnly="true" required="false"
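
A minimal sketch of how the maxCommandsQueued property from the diff above might feed an availability check. Everything here is an assumption for illustration: QueueAvailabilityCheck, its Availability enum, and availabilityFor are hypothetical names rather than RHQ APIs, and how the agent exposes its current queue depth is left open.

```java
// Hypothetical sketch; the names below are illustrative, not RHQ APIs.
final class QueueAvailabilityCheck {
    enum Availability { UP, DOWN }

    private final int maxCommandsQueued;

    QueueAvailabilityCheck(int maxCommandsQueued) {
        if (maxCommandsQueued < 0) {
            throw new IllegalArgumentException("maxCommandsQueued must be >= 0");
        }
        this.maxCommandsQueued = maxCommandsQueued;
    }

    /** Reports DOWN once the send backlog exceeds the configured maximum. */
    Availability availabilityFor(long commandsCurrentlyQueued) {
        return commandsCurrentlyQueued > maxCommandsQueued
            ? Availability.DOWN
            : Availability.UP;
    }
}
```

With the default of 100 from the diff, a backlog of exactly 100 queued commands would still report UP and 101 would report DOWN. Whether the comparison should be strict is a design choice this sketch does not settle.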


Can I work on a patch for this? Would it be accepted?