Bug 536173 (RHQ-551) - need way to configure availability timeout
Summary: need way to configure availability timeout
Keywords:
Status: CLOSED NEXTRELEASE
Alias: RHQ-551
Product: RHQ Project
Classification: Other
Component: Monitoring
Version: 1.0
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: John Mazzitelli
QA Contact:
URL: http://jira.rhq-project.org/browse/RH...
Whiteboard:
Depends On:
Blocks: jon-sprint10-bugs 594787 741450
 
Reported: 2008-06-09 18:45 UTC by John Mazzitelli
Modified: 2011-09-26 20:47 UTC (History)

Fixed In Version: 1.3
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:


Attachments
AvailabilityCollector.java (10.38 KB, text/x-java)
2009-06-08 21:09 UTC, John Mazzitelli


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 741450 | 0 | medium | CLOSED | RFE: Improve Availability Handling (Tracker) | 2021-02-22 00:41:40 UTC

Internal Links: 741450

Description John Mazzitelli 2008-06-09 18:45:00 UTC
I wrote a plugin that hit this:

2008-06-09 14:05:41,019 DEBUG [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Failed to collect availability on resource Resource[id=500051, type=Ringside Server, key=/api/restserver.php@82, name=Ringside Server (mazzlap:82), version=Developer Version]
org.rhq.core.pc.inventory.TimeoutException: Call to org.ringside.plugin.RingsideServerComponent.getAvailability() with args [] timed out.
	at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invokeInNewThreadWithLock(ResourceContainer.java:398)

Eventually, later availability checks did come in under the 5s limit and I went green. I don't know what caused the delay; perhaps my box was slow at the time for some reason. It would be nice if this were configurable. Right now, it's hardcoded at 5s: AvailabilityExecutor.GET_AVAILABILITY_TIMEOUT.
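
For readers unfamiliar with how a hard time limit on getAvailability() plays out, here is a minimal sketch of a timed call. It is not the plugin container's actual code: the class name TimedCall, the method callWithTimeout, and the idea of returning a fallback value on timeout are all assumptions made for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not the plugin container's real implementation.
class TimedCall {
    // Runs the call on a worker thread and waits at most timeoutMillis for it.
    // Returns fallback if the call times out, fails, or the wait is interrupted.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMillis, T fallback) {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-call");
            t.setDaemon(true); // a stuck check must never keep the JVM alive
            return t;
        });
        Future<T> result = worker.submit(call);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the slow check
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the check itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return fallback;
        } finally {
            worker.shutdownNow();
        }
    }
}
```

With a fixed limit, a check that is merely slow is indistinguishable from one that failed, which matches what the log above shows.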

Add a configuration item to the agent preferences so I can set this timeout value to something that I want, rather than what you want :-p

Here's docs on how to add a new agent pref:

http://support.rhq-project.org/display/RHQ/Design-Adding+Agent+Configuration+Preference


Comment 1 John Mazzitelli 2009-02-13 16:41:38 UTC
After some deliberation (with myself) I conclude we do not want to implement this.

Here's why:

If we configure this avail timeout to be larger, it would not only affect the resource in question, but it would affect the entire plugin container (if multiple resources take too long, our availability reporting will be too slow).

The question is - what if I have a resource that takes longer than 5 seconds to determine its status? We need a way to work with this one resource WITHOUT affecting or delaying the availability reporting of the plugin container as a whole.

The solution is simple - for a resource that may take longer than the timeout, that resource's component needs to spawn its own thread and within THAT thread make the necessary availability checks (taking as long as it wants to complete this).

The component's getAvailability method should simply return a value stored in a cached data member which holds the LAST known state of the resource. The other thread will simply update that last known state variable whenever it finishes its avail check. This thread is to be long-lived: it should be created when the component is initialized and killed when the component is shut down. It should periodically check the state, probably every minute or less since that's how often the PC wants to check availabilities. Obviously, the thread and component object must synchronize access to this last known state variable.
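
The approach described above could be sketched roughly as follows. This is only an illustration under assumptions: CachedAvailability, its nested Avail enum, and refresh() are hypothetical names, not RHQ API, and the initial state of DOWN (before the first check completes) is a choice made for this sketch.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical helper: polls a slow check on its own thread and caches the result.
class CachedAvailability {
    // Stand-in for the plugin API's availability type (assumption).
    enum Avail { UP, DOWN }

    private final Supplier<Avail> slowCheck;
    private final long intervalMillis;
    // Last known state; AtomicReference makes cross-thread reads and writes safe.
    private final AtomicReference<Avail> lastKnown = new AtomicReference<>(Avail.DOWN);
    private ScheduledExecutorService scheduler;

    CachedAvailability(Supplier<Avail> slowCheck, long intervalMillis) {
        this.slowCheck = slowCheck;
        this.intervalMillis = intervalMillis;
    }

    // Call when the component starts: begins the long-lived polling thread.
    synchronized void start() {
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "avail-poller");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleWithFixedDelay(this::refresh, 0L, intervalMillis, TimeUnit.MILLISECONDS);
    }

    // Call when the component is shut down: kills the polling thread.
    synchronized void stop() {
        if (scheduler != null) {
            scheduler.shutdownNow();
            scheduler = null;
        }
    }

    // Runs the slow check (may take as long as it needs) and stores the outcome.
    void refresh() {
        try {
            lastKnown.set(slowCheck.get());
        } catch (RuntimeException e) {
            lastKnown.set(Avail.DOWN); // assumption: a failed check counts as DOWN
        }
    }

    // Fast path: returns the cached value without touching the managed resource.
    Avail getAvailability() {
        return lastKnown.get();
    }
}
```

Catching the exception inside refresh() matters here, because a scheduled task that throws is not run again by the executor.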

Comment 2 John Mazzitelli 2009-03-06 03:38:07 UTC
another thought:

(10:26:33 PM) mazz: I think what we can do is maybe write some kind of utility framework or API that individual plugin writers can use
(10:27:09 PM) mazz: "if your managed resource takes a long time to tell us if its up or down, use this API to collect the status asynchronously while not affecting the performance of the plugin container itself"
(10:27:34 PM) mazz: perhaps we roll that up into the PC itself
(10:27:51 PM) mazz: have the PC build the  availability report asynchronously
(10:28:13 PM) mazz: and under the covers just let the resources take as long as it can

So perhaps we have an availability report (one global report) that gets filled in asynchronously. When it's time to send up a report, we send that global report; whatever is filled in with whatever status is available gets sent up.
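
One way to picture the global report idea is a shared map that check threads write into whenever they finish, with the sender taking a copy on its own schedule. This is a sketch only; GlobalAvailReport, record() and snapshot() are hypothetical names, not anything in the plugin container.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a report that is filled in asynchronously.
class GlobalAvailReport {
    // Stand-in for the plugin API's availability type (assumption).
    enum Status { UP, DOWN }

    // Resource id -> most recently recorded status.
    private final Map<Integer, Status> latest = new ConcurrentHashMap<>();

    // Called by whichever thread finishes a resource's check, however long it took.
    void record(int resourceId, Status status) {
        latest.put(resourceId, status);
    }

    // Called when it is time to send a report: copies whatever has been filled in so far.
    Map<Integer, Status> snapshot() {
        return new HashMap<>(latest);
    }
}
```

Resources whose checks have not finished yet are simply absent from the snapshot, so a slow resource never delays the report for the others.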

Comment 3 John Mazzitelli 2009-05-18 13:55:09 UTC
Can we consider adding a piece of metadata to <platform>, <server> and <service>: an attrib "availabilityTimeout". It would default to 5000 ms and would be optional.

We'd have to change the UI to allow a user to change this timeout value if they needed to. This would solve the problem across the inventory of resources and would not require us to change any plugin code.

<server availabilityTimeout="15000"...>

This defaults the avail timeout to 15 seconds, but the user would be able to change it.

I'm not sure how easy it would be to implement this. It's another piece of data that needs to be attached to the Resource entity and, of course, the ResourceType entity.
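
To make the proposed precedence concrete, here is a rough sketch of how an effective timeout might be resolved: a per-resource user override wins, then the descriptor's availabilityTimeout attribute, then the 5000 ms default. Nothing here exists in RHQ; AvailTimeouts and effectiveTimeout are hypothetical, and ignoring non-positive or unparseable values is an assumption.

```java
// Hypothetical resolver for the proposed "availabilityTimeout" metadata.
class AvailTimeouts {
    static final long DEFAULT_MILLIS = 5000L;

    // typeAttr: raw attribute value from the plugin descriptor, or null if absent.
    // userOverride: per-resource value set by the user, or null if not set.
    static long effectiveTimeout(String typeAttr, Long userOverride) {
        if (userOverride != null && userOverride > 0) {
            return userOverride; // the user's setting takes precedence
        }
        if (typeAttr != null) {
            try {
                long parsed = Long.parseLong(typeAttr.trim());
                if (parsed > 0) {
                    return parsed; // resource type default from the descriptor
                }
            } catch (NumberFormatException ignored) {
                // fall through to the global default
            }
        }
        return DEFAULT_MILLIS;
    }
}
```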

Comment 4 John Mazzitelli 2009-06-08 21:09:52 UTC
attaching AvailabilityCollector - a utility class plugins can use to perform this asynchronous avail checking while maintaining a "last known avail" for quick getAvailability checking.

To use, resource component's start() method would instantiate AvailabilityCollector and call its start() method.

The resource component's stop() method would call the collector's stop() method.

The resource component's getAvailability method would call the collector's getLastKnownAvailability().

I won't be checking this in because I want to infuse this further into the plugin container/plugin API. Perhaps provide all components with an instance of the availability collector in their resource context for them to use. The issue is that the thread pool needs management (i.e. instantiating it and shutting it down). The attached class never shuts down the thread pool and instantiates it in a static block. Those two are probably OK because all threads are daemon threads and won't hang the agent, and this class is in the top parent agent classloader, so the static block should be OK. But there could be a better way to inject this stuff into the plugins. I'll eventually do that.

Comment 5 John Mazzitelli 2009-06-09 08:12:43 UTC
svn rev4050 adds some code to the pc and plugin api (AvailabilityCollectorThreadPool/Runnable and AvailabilityFacet). This is essentially some helper code that plugin devs can use to do async avail checking

Comment 6 John Mazzitelli 2009-06-09 08:16:43 UTC
svn rev4050 adds code but introduces little to no additional overhead from previous versions. The only thing it adds at runtime (assuming no plugins need this async avail collection) is a single threadpool with 0 threads (so it is very lightweight).

If a plugin component needs this feature, this will add a single thread per started component. That single thread will live as long as the component is in the started state.
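
To illustrate why an idle pool costs almost nothing, here is a sketch of a thread pool that holds zero threads until work is submitted and uses daemon threads only. It is not the AvailabilityCollectorThreadPool source; LazyDaemonPool and the 60-second idle keep-alive are assumptions for this example.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical factory for a pool that starts empty and grows one thread per running task.
class LazyDaemonPool {
    static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
            0,                        // core size 0: no threads exist until needed
            Integer.MAX_VALUE,        // grow as tasks arrive
            60L, TimeUnit.SECONDS,    // idle threads are reclaimed (assumed value)
            new SynchronousQueue<Runnable>(),
            r -> {
                Thread t = new Thread(r, "avail-collector");
                t.setDaemon(true);    // daemon threads will not hang JVM shutdown
                return t;
            });
    }
}
```

Calling getPoolSize() on a freshly created pool returns 0, and it rises by one for each long-running collector task handed to execute().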

Comment 7 John Mazzitelli 2009-06-09 16:25:38 UTC
svn 4053 fixes something a unit test uncovered.

Comment 8 John Mazzitelli 2009-06-09 19:48:22 UTC
svn rev 4058 is the last checkin for this code. Here's how to use it in your component code (I tested this and it works):


    // your component needs this data member
    private AvailabilityCollectorRunnable availCollector;

    public void start(ResourceContext context) {
        // as part of your start method, create the avail collector runnable
        // this is the thing that really talks to your managed resource for availability check and it
        // can take as long as you need
        availCollector = context.createAvailabilityCollectorRunnable(new AvailabilityFacet() {
            public AvailabilityType getAvailability() {
                // perform the actual check to see if the managed resource is up or not
                return ...AvailabilityType...;
            }
        }, 60000L);
        availCollector.start();
        // ...
        // ... the rest of your component's start method ...
    }

    public void stop() {
        // as part of your stop method, you need to stop your collector
        availCollector.stop();
        // ...
        // ... the rest of your component's stop method ...
    }

    public AvailabilityType getAvailability() {
        // this is what we are going for: this call is very fast, so your getAvailability no longer takes long to complete,
        // which virtually guarantees you will never time out in this method anymore.
        return availCollector.getLastKnownAvailability();
    }


Comment 9 John Mazzitelli 2009-06-09 21:24:07 UTC
documented how to use this framework here:

http://jopr.org/confluence/display/RHQ/Design-Asynchronous+Availability+Collector

Comment 10 Red Hat Bugzilla 2009-11-10 21:11:44 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-551
Imported an attachment (id=368855)


Comment 11 wes hayutin 2010-02-16 21:07:53 UTC
Mass move to component= Monitoring

Comment 12 John Mazzitelli 2010-05-21 13:07:45 UTC
I am going to apply this patch to AvailabilityExecutor:

 public class AvailabilityExecutor implements Runnable, Callable<AvailabilityReport> {
-    private static final int GET_AVAILABILITY_TIMEOUT = 5 * 1000; // 5 seconds
+    // the get-availability-timeout will rarely, if ever, want to be overridden. It will default to be 5 seconds
+    // and that's what it probably should always be. However, there may be a rare instance where someone wants
+    // to give this availability executor a bit more time to wait for the resource's availability response
+    // and is willing to live with the possible consequences (that being, delayed avail reports and possibly
+    // false-down alerts getting triggered). Rather than changing this timeout, people should be using
+    // the asynchronous-availability-check capabilities that are exposed to the plugins. Because we do not
+    // want to encourage people to change this, we do not expose this "backdoor" system property as a
+    // standard plugin configuration setting/agent preference - if someone wants to do this, they must
+    // explicitly pass in -D to the JVM running the plugin container.
+    private static final int GET_AVAILABILITY_TIMEOUT;
+    static {
+        int timeout;
+        try {
+            timeout = Integer.parseInt(System.getProperty("rhq.agent.plugins.availability-scan.timeout", "5000"));
+        } catch (Throwable t) {
+            timeout = 5000;
+        }
+        GET_AVAILABILITY_TIMEOUT = timeout;
+    }

Comment 13 John Mazzitelli 2010-05-21 13:33:18 UTC
master git commit : 395aa1d971576a1ca526a71da6a58c3c58253256

"rhq.agent.plugins.availability-scan.timeout" is now a system property you can set (via -D to rhq-agent.sh or within RHQ_AGENT_ADDITIONAL_JAVA_OPTS) if you really, really want something other than 5s. Since this will rarely, if ever, need to be done and we don't really want people to override it (see previous comments for why 5s is the default and probably should not be changed), this is not an exposed preference setting in agent-configuration.xml. If you want this, you have to adjust your agent startup script (e.g. rhq-agent-env.sh, which is where RHQ_AGENT_ADDITIONAL_JAVA_OPTS is set) or pass the -D option to rhq-agent.sh.

