Bug 901672 - If the agent stops reporting metrics, then it should be marked as unavailable and/or send an event saying it can't send metrics [NEEDINFO]
Status: NEW
Alias: None
Product: RHQ Project
Classification: Other
Component: Agent
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Assignee: Nobody
Reported: 2013-01-18 19:12 UTC by Elias Ross
Modified: 2022-03-31 04:28 UTC (History)

genman: needinfo?



Description Elias Ross 2013-01-18 19:12:13 UTC
Description of problem:

If the RHQ server is busy, the RHQ Agent can stop reporting metrics. This can happen if the database server is overloaded with too many clients. Consequently, no more metrics are reported and alerting cannot occur.

This is a request that the agent plugin do one of the following:
* Generate an event reporting this condition (and others like it). Unfortunately, this may generate a ton of events across hundreds of agents.
* Mark the agent itself as unavailable. In effect, the agent has stopped working, so this may be simpler to diagnose.
* Have a sub-resource mark itself as unavailable. This is probably ideal.

How reproducible:

You can probably induce this by reducing the number of server threads to 1, running a large number of agents, and creating a table lock. A table lock alone may be enough.

You should see:

2013-01-18 19:04:34,075 ERROR [ClientCommandSenderTask Thread #4] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=st11p01ad-ad001.apple.com, rhq.externalizable-strategy=AGENT, rhq.security-token=YWILXUV/JelWuPeWz4KN03nqFKNaBO2ZRvHb4lcuW2a4KmLXxuuH9Daa3CKaO0f0jJo=, rhq.guaranteed-delivery=true, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[mergeMeasurementReport], targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService}]]. Cause: java.util.concurrent.TimeoutException:null. Cause: java.util.concurrent.TimeoutException

This should be detectable.
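
As a rough illustration of what "detectable" could mean, here is a minimal sketch that recognizes the failure from a log line. It assumes the log entry has been read as a single line; the class and method names (SendFailureDetector, timedOutInvocation) are hypothetical and not part of RHQ. Only the message key, the NameBasedInvocation[...] text, and the TimeoutException cause are taken from the log entry above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an RHQ class: spots timed-out command sends in agent log lines.
final class SendFailureDetector {
    // Matches the send-failed message key, captures the invoked method name,
    // and requires a TimeoutException later on the same line.
    private static final Pattern SEND_FAILED = Pattern.compile(
        "\\{ClientCommandSenderTask\\.send-failed\\}.*NameBasedInvocation\\[(\\w+)\\].*TimeoutException");

    private SendFailureDetector() {
    }

    /** Returns the invoked method name if the line is a timed-out send failure, otherwise null. */
    static String timedOutInvocation(String logLine) {
        Matcher m = SEND_FAILED.matcher(logLine);
        return m.find() ? m.group(1) : null;
    }
}
```

A non-null result of "mergeMeasurementReport" would indicate that measurement reports specifically are not reaching the server. Scanning logs is only one option; counting queued commands inside the agent would avoid the parsing entirely.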

Comment 1 Elias Ross 2013-01-18 20:09:19 UTC
I was thinking the following:
diff --git a/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml b/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml
index 01c0a41..adccdf3 100644
--- a/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml
+++ b/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml
@@ -55,6 +55,12 @@
             <c:simple-property name="snapshotLogEnabled" type="boolean" default="true" description="If true, take a snapshot of the log files"/>
             <c:simple-property name="snapshotDataEnabled" type="boolean" default="true" description="If true, take a snapshot of the data files"/>
          </c:group>
+         <c:group name="errorDetection" displayName="Error Detection" hiddenByDefault="true">
+            <c:simple-property name="maxCommandsQueued"
+                               displayName="Max commands queued."
+                               default="100" readOnly="false" required="true"
+                               description="Maximum number of messages currently queued waiting to be sent to the RHQ Server. This could happen if the server is overloaded and d
+         </c:group>
          <c:group name="advanced" displayName="Advanced" hiddenByDefault="true">
             <c:simple-property name="childJmxServerName" displayName="JVM Name" default="JVM"
                                readOnly="true" required="false"
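
A minimal sketch of how a maxCommandsQueued threshold like the one in the diff above might be evaluated. Everything here is an assumption for illustration: the Availability enum is a stand-in for whatever availability type the plugin reports, QueueBacklogCheck is a hypothetical class, and how the current queue size is obtained from the agent is left to the caller.

```java
// Hypothetical stand-in for the availability value the plugin would report.
enum Availability { UP, DOWN }

// Hypothetical check: reports DOWN once more commands are queued than the configured maximum.
final class QueueBacklogCheck {
    private final int maxCommandsQueued;

    QueueBacklogCheck(int maxCommandsQueued) {
        if (maxCommandsQueued < 0) {
            throw new IllegalArgumentException("maxCommandsQueued must be >= 0");
        }
        this.maxCommandsQueued = maxCommandsQueued;
    }

    /** currentlyQueued is the number of commands waiting to be sent to the server. */
    Availability evaluate(long currentlyQueued) {
        return currentlyQueued > maxCommandsQueued ? Availability.DOWN : Availability.UP;
    }
}
```

With the proposed default of 100, new QueueBacklogCheck(100).evaluate(queued) would stay UP until the backlog exceeds 100 commands. Whether this belongs on the agent resource or on a sub-resource is the open question from the description.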


Can I work on a patch for this? Would it be accepted?

