Bug 901672 - If the agent stops reporting metrics, then it should be marked as unavailable and/or send an event saying it can't send metrics [NEEDINFO]
Status: NEW
Product: RHQ Project
Classification: Other
Component: Agent
Assigned To: RHQ Project Maintainer
Mike Foley
Reported: 2013-01-18 14:12 EST by Elias Ross
Modified: 2013-01-18 15:09 EST (History)

Doc Type: Bug Fix
Type: Bug
genman: needinfo?

Description Elias Ross 2013-01-18 14:12:13 EST
Description of problem:

If the RHQ server is busy, the RHQ Agent can stop reporting metrics. This can happen if the database server is overloaded with too many clients. Consequently, no more metrics are reported and alerting cannot occur.

This is a request that the agent plugin do one of the following:
* Generate an event reporting this condition (and others like it). Unfortunately, this may generate a large number of events across hundreds of agents.
* Mark the agent itself as unavailable. In effect, the agent has stopped working, and this may be simpler to diagnose.
* Have a sub-resource mark itself as unavailable. This is probably ideal.
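
As a rough illustration of the last option above, availability could be derived from how long it has been since a measurement report last reached the server. This is only a sketch under assumptions, not RHQ plugin code: the class MetricReportHealth, its methods, and the silence timeout are hypothetical names, and a real implementation would have to hook into the agent's command sender and the plugin's availability check.

```java
/** Hypothetical helper, not part of RHQ: remembers when a report last reached the server. */
final class MetricReportHealth {
    private final long maxSilenceMillis;    // assumed tunable: how long without a report is tolerated
    private volatile long lastSuccessMillis;

    MetricReportHealth(long maxSilenceMillis, long nowMillis) {
        this.maxSilenceMillis = maxSilenceMillis;
        this.lastSuccessMillis = nowMillis;
    }

    /** Call whenever a measurement report is sent successfully. */
    void recordSuccess(long nowMillis) {
        lastSuccessMillis = nowMillis;
    }

    /** The sub-resource would report itself as unavailable when this returns false. */
    boolean isUp(long nowMillis) {
        return nowMillis - lastSuccessMillis <= maxSilenceMillis;
    }
}
```

Passing the clock value in explicitly keeps the check easy to exercise without waiting for real time to elapse.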

How reproducible:

You can probably induce this by reducing the number of server threads to 1, running a large number of agents, and creating a table lock. A table lock alone may be enough.

You should see:

2013-01-18 19:04:34,075 ERROR [ClientCommandSenderTask Thread #4] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=st11p01ad-ad001.apple.com, rhq.externalizable-strategy=AGENT, rhq.security-token=YWILXUV/JelWuPeWz4KN03nqFKNaBO2ZRvHb4lcuW2a4KmLXxuuH9Daa3CKaO0f0jJo=, rhq.guaranteed-delivery=true, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[mergeMeasurementReport], targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService}]]. Cause: java.util.concurrent.TimeoutException:null. Cause: java.util.concurren

This should be detectable.
Comment 1 Elias Ross 2013-01-18 15:09:19 EST
I was thinking the following:
diff --git a/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml b/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml
index 01c0a41..adccdf3 100644
--- a/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml
+++ b/modules/plugins/rhq-agent/src/main/resources/META-INF/rhq-plugin.xml
@@ -55,6 +55,12 @@
             <c:simple-property name="snapshotLogEnabled" type="boolean" default="true" description="If true, take a snapshot of the log files"/>
             <c:simple-property name="snapshotDataEnabled" type="boolean" default="true" description="If true, take a snapshot of the data files"/>
+         <c:group name="errorDetection" displayName="Error Detection" hiddenByDefault="true">
+            <c:simple-property name="maxCommandsQueued"
+                               displayName="Max commands queued."
+                               default="100" readOnly="false" required="true"
+                               description="Maximum number of messages currently queued waiting to be sent to the RHQ Server. This could happen if the server is overloaded and d
+         </c:group>
          <c:group name="advanced" displayName="Advanced" hiddenByDefault="true">
             <c:simple-property name="childJmxServerName" displayName="JVM Name" default="JVM"
                                readOnly="true" required="false"

Can I work on a patch for this? Would it be accepted?
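
The check behind the proposed maxCommandsQueued property might look something like the sketch below. It is an assumption-laden illustration rather than the actual patch: CommandQueueCheck and isWithinLimit are hypothetical names, and how the current queue depth is obtained from the agent is left out entirely.

```java
/** Hypothetical sketch, not RHQ code: compares queue depth with the proposed maxCommandsQueued setting. */
final class CommandQueueCheck {
    private final int maxCommandsQueued;

    CommandQueueCheck(int maxCommandsQueued) {
        if (maxCommandsQueued < 0) {
            throw new IllegalArgumentException("maxCommandsQueued must not be negative");
        }
        this.maxCommandsQueued = maxCommandsQueued;
    }

    /** Returns true while the number of commands waiting to be sent is within the configured limit. */
    boolean isWithinLimit(int commandsQueued) {
        return commandsQueued <= maxCommandsQueued;
    }
}
```

With the default of 100 from the descriptor, a queue of 100 commands would still pass and a queue of 101 would be flagged as an error condition.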
