Bug 783603

Summary: Support for asynchronous reporting of measurements
Product: [Other] RHQ Project Reporter: Elias Ross <genman>
Component: AgentAssignee: John Mazzitelli <mazz>
Status: ASSIGNED --- QA Contact: Mike Foley <mfoley>
Severity: unspecified Docs Contact:
Priority: medium    
Version: 4.3CC: hrupp, loleary
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachments:
Description Flags
rebased on 98149c5c59ee5cb6932240fc291f8db69c8b2580
none
patch with conflicts resolved - apply to a5ef337f5921c5cc52fd6a96c084925c46b8f4a1
none
zip file with the two patches to apply to master
none
patch that eventually expires obsolete or disabled metrics none

Description Elias Ross 2012-01-20 19:09:15 EST
Bug tracking changes discussed here: https://community.jboss.org/thread/177654

Create a class similar to AvailabilityCollectorRunnable.java that allows for measurements to be collected as well asynchronously.

Have ResourceContainer include ScheduledThreadExecutor instead of a plain Executor (better for async calls), and change the pool to work for any Runnable.


Version-Release number of selected component (if applicable):

4.3 - master
Comment 1 Elias Ross 2012-01-20 19:09:47 EST
Note: Changes pending approval from my company. This is a placeholder.
Comment 2 Mike Foley 2012-01-23 11:17:24 EST
per scrum 1/23/2012 crouch, loleary, mfoley
Comment 3 Elias Ross 2012-03-01 18:57:43 EST
Created attachment 566971 [details]
rebased on 98149c5c59ee5cb6932240fc291f8db69c8b2580
Comment 4 Heiko W. Rupp 2012-06-21 05:43:01 EDT
See also Bug 834019
Comment 5 John Mazzitelli 2012-06-21 11:04:39 EDT
There are lots of conflicts trying to merge in the latest master. I'll see if I can resolve these some way.
Comment 6 John Mazzitelli 2012-06-21 11:45:56 EDT
Created attachment 593489 [details]
patch with conflicts resolved - apply to a5ef337f5921c5cc52fd6a96c084925c46b8f4a1

i think i resolved all the conflicts properly - this needs LOTS of testing. Since that original patch was made, the AvailabilityContext object was introduced which was the bulk of the conflicts (but there were other conflicts unrelated as well).
Comment 7 John Mazzitelli 2012-06-21 12:21:02 EDT
warning - that last patch I attached isn't complete. there are some more compile errors that were missed.
Comment 8 John Mazzitelli 2012-06-21 14:44:22 EDT
Created attachment 593545 [details]
zip file with the two patches to apply to master

I fixed some compile issues and attached a zip with two patches - the original merged patch and my compile fixes (along with some comment/javadoc tweeks).

I have some questions about this new stuff.

First, the unit test fails on my machine:

Running org.rhq.core.pc.CollectorThreadPoolTest
Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.623 sec <<< FAILURE!
Failed tests:   testMeasurement(org.rhq.core.pc.CollectorThreadPoolTest)
java.lang.AssertionError at org.rhq.core.pc.CollectorThreadPoolTest.testMeasurement(CollectorThreadPoolTest.java:102)

So that will need to be corrected.

Second, the new MeasurementCollectorRunnable has this in getLastValues:

    public void getLastValues(MeasurementReport report, Set<MeasurementScheduleRequest> metrics) throws Exception {
        this.requestedMetrics.addAll(metrics);
        report.add(this.lastReport);
    }

How I read this code is that, over a short period of time, eventually this.requestedMetrics is going to have all enabled metrics in it. Thus, in the run() method, it will eventually always ask the resource component for the entire set of enabled metrics every time the run() method is executed. This could potentially be expensive. Even though this will be performed asynchronously in its own separate thread, this still has the potential to cause spikes (in CPU or I/O depending on the types of metrics being collected) on either the agent process itself or in the managed resource itself. This is because if the resource has N enabled metrics, whether or not the metric schedules are staggered and to be collected over different intervals, the run() method will end up collecting all N metrics at the same time. And it will do this for every interval of time as configured in the MeasurementCollectorRunnable.interval (whose minimum value is 60 seconds).

Lastly, that impl of getLastValues will put every metric in the report, whether or not it was requested:

        report.add(this.lastReport);

This means if the plugin container determines its time to only collect metric "foo", it will pass the one MeasurementScheduleRequest for foo into the MeasurementFacet.getValues() call; but when the component delegates to this getLastValues(), what will get returned back to the plugin container is the entire set of all metrics that the async runnable class has collected in its lifetime (and over a short period of time, that will include all N metrics enabled on the resource). This means even though the PC asked for 1 metric, it potentially will be returned N metrics (I haven't checked if this causes problems in the PC, I don't know what happens if you return more metrics than what the PC asked for, but even if the PC allows this, it would mean you would be sending the entire set of metrics up to the server rather than the 1 that was scheduled. This will then cause the server to insert all N metrics every time, even if less than N were scheduled to be collected at that time). This has the potential of clobbering the server/database with much more metric data than should be expected.

We need to discuss these issues further before committing to master.
Comment 9 John Mazzitelli 2012-06-25 16:30:35 EDT
FYI: I pushed a remote branch (based on today's master branch) with the code changes related to this issue. 

http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=shortlog;h=refs/heads/bug/783603
Comment 10 John Mazzitelli 2012-06-26 15:43:58 EDT
git commit to branch bug/783603 : 5f142aeb194d9c2f4c7a3a7313a01aa6105d9876

This fixes the unit test that was failing. It also adds an additional configurable "initialDelay" setting so you can tell the runnable if it should delay performing the initial collection and if so how long that delay should be.
Comment 11 John Mazzitelli 2012-06-26 15:49:26 EDT
> that impl of getLastValues will put every metric in the report,
> whether or not it was requested:
> 
>         report.add(this.lastReport);
> 
> This means if the plugin container determines its time to only collect
> metric "foo", it will pass the one MeasurementScheduleRequest for foo into
> the MeasurementFacet.getValues() call; but when the component delegates to
> this getLastValues(), what will get returned back to the plugin container is
> the entire set of all metrics that the async runnable class has collected in
> its lifetime (and over a short period of time, that will include all N
> metrics enabled on the resource). 
> ...
> it would mean you would be sending the entire set of metrics up to the server rather
> than the 1 that was scheduled.

I think I can fix this by changing the implementation to not add ALL of the data found in this.lastReport but only those data that is being requested in the parameter to the call to getLastValues().
Comment 12 John Mazzitelli 2012-06-26 17:39:06 EDT
git commit to branch bug/783603 : a1c3d0e458a74756a4ec6954bf39cd01fa406560

this allows the getLastValues to be given a set of metric schedule requests and will only return the data asked for. added some unit tests to check this.
Comment 13 John Mazzitelli 2012-06-27 15:28:59 EDT
I'm still working on and thinking about this. One other problem I just noticed that I don't know a way around yet - what happens when metrics get disabled?

Suppose the plugin container is chugging along, collector runnable is collecting metrics fine (say, one of the metrics it was asked to collect is metric "foo"). But now the user goes to the GUI and disables the "foo" metric for my resource. This request eventually makes it down to the plugin container who will then turn off the metric schedule (so it will never ask the measurement facet for that metric again unless and until the user re-enables the metric). However, the MeasurementCollectorRunnable is never told. So it will continually be probing the managed resource collecting that metric when that data will never be used or stored in the server. We should not be doing this. We'll need a way for the collector to eventually (even if it isn't immediately) know that it should stop collecting that metric.
Comment 14 John Mazzitelli 2012-06-27 16:16:04 EDT
More problems with this that need to be solved.

this.lastReport grows unbounded. As we collect more metrics, the old metrics remain. So getLastValues() continually reports the same old metrics as before, with any new ones collected. This grows and is never purged of old data, even if reported already.

We need a thread-safe mechanism to clean out the old metrics once retrieved via getLastValues(). Otherwise, the measurement reports will simply grow and grow and will be reporting the same values over and over again.
Comment 15 John Mazzitelli 2012-06-27 17:26:02 EDT
git commit to branch bug/783603 : 7c9cbda06b11ac9740310190598d333f2e45d434

this makes sure the cached report doesn't grow unbounded. When getLastValues() is called, we make sure we remove the old data so it isn't cached anymore and so it isn't reported again.

Still need to fix the issue of turning off metric collection in the runnable when metrics get disabled.

Also, we need to worry about the impact this async collection will have since it will mean within a short time every collection period will collect every enabled metric on the resource (and potentially even metrics that are currently disabled, if they happened to have been enabled earlier).
Comment 16 Elias Ross 2012-08-22 19:51:44 EDT
Sorry for this late feedback. We are close. I appreciate your help and review.

> this.lastReport grows unbounded. As we collect more metrics, the old metrics remain. So getLastValues() continually reports the same old metrics as before, with any new ones collected. This grows and is never purged of old data, even if reported already.

No, because the measurements are kept in a Set<> and the measurements are 'Object.equal' based on the name of the measurement.

I don't really like the approach of removing old measurements as they are reported since if measurements are queried more frequently than supplied, NaN/null will be returned. I sort of see the code in MeasurementReport.add() as superfluous.

> Also, we need to worry about the impact this async collection will have since it will mean within a short time every collection period will collect every enabled metric on the resource

The use case was running an expensive database query, like querying a cluster. The results of one query were expected to be extracted into different measurement values.

I guess this will just have be a weakness of the solution.

>  (and potentially even metrics that are currently disabled, if they happened to have been enabled earlier).

    private Set<MeasurementScheduleRequest> requestedMetrics = new CopyOnWriteArraySet<MeasurementScheduleRequest>();

probably should change to something like:

    private Map<MeasurementScheduleRequest, Date> requestedMetrics = new ConcurrentHashMap<MeasurementScheduleRequest, Date>();

...

Then simply store the last time the measurement was queried. When the date exceeds a threshold (like MeasurementScheduleRequest.getInterval() * 2), then simply throw away the value.

Here's the suggested changes:

diff --git a/modules/core/plugin-api/src/main/java/org/rhq/core/pluginapi/measurement/MeasurementCollectorRunnable.java b/modules/core/plugin-api/src/main/java/org/rhq/core/pluginapi/measurement/MeasurementCollectorRunnable.java
index bf303b9..d9f42e2 100644
--- a/modules/core/plugin-api/src/main/java/org/rhq/core/pluginapi/measurement/MeasurementCollectorRunnable.java
+++ b/modules/core/plugin-api/src/main/java/org/rhq/core/pluginapi/measurement/MeasurementCollectorRunnable.java
@@ -1,7 +1,10 @@
 package org.rhq.core.pluginapi.measurement;
 
+import java.util.Date;
+import java.util.Iterator;
 import java.util.Set;
-import java.util.concurrent.CopyOnWriteArraySet;
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.Future;
 import java.util.concurrent.FutureTask;
 import java.util.concurrent.ScheduledExecutorService;
@@ -91,7 +94,7 @@ public class MeasurementCollectorRunnable implements Runnable {
     /**
      * Accumulated report.
      */
-    private Set<MeasurementScheduleRequest> requestedMetrics = new CopyOnWriteArraySet<MeasurementScheduleRequest>();
+    private Map<MeasurementScheduleRequest, Date> requestedMetrics = new ConcurrentHashMap<MeasurementScheduleRequest, Date>();
 
     /**
      * Just a cache of the facet toString used in log messages. We don't want to keep calling toString on the
@@ -147,7 +150,8 @@ public class MeasurementCollectorRunnable implements Runnable {
      * their {@link MeasurementFacet#getValues()} method should simply be calling this method.
      */
     public void getLastValues(MeasurementReport report, Set<MeasurementScheduleRequest> metrics) throws Exception {
-        requestedMetrics.addAll(metrics);
+        for (MeasurementScheduleRequest metric : metrics)
+            requestedMetrics.put(metric, new Date());
         cachedReportLock.lock();
         try {
             // For all metrics being requested, take their cached values last collected and transfer them to the given report.
@@ -208,7 +212,14 @@ public class MeasurementCollectorRunnable implements Runnable {
         try {
             // collect the new data for all metrics previous requested in the past
             MeasurementReport newData = new MeasurementReport();
-            measured.getValues(newData, requestedMetrics);
+            long now = System.currentTimeMillis();
+            for (Iterator<MeasurementScheduleRequest, Date> i = requestedMetrics.entrySet().iterator(); i.hasNext(); ) {
+                Map.Entry<MeasurementScheduleRequest, Date> me = i.next();
+                long until = me.getValue().getTime() + me.getKey().getInterval() * 2;
+                if (now > until)
+                    i.remove();
+            }
+            measured.getValues(newData, requestedMetrics.keySet());
             if (log.isDebugEnabled()) {
                 log.debug("measurement collector latest data: " + newData);
             }
Comment 17 Elias Ross 2012-09-06 15:33:58 EDT
> > this.lastReport grows unbounded. As we collect more metrics ...

> No, because the measurements

Sorry, you're right. I did check the code. Although the report uses a Set, they are not identical instances.

A better solution is to simply flip the reports when run() is called.

    private volatile MeasurementReport report = new MeasurementReport();

    @Override
    public void run() {
        try {
            MeasurementReport report2 = new MeasurementReport();
            ((MeasurementFacet)delegate).getValues(report2, requests);
            report = report2;
        } catch (Exception e) {
            log.error(e, e);
        }
    }

This is less complicated than doing removal and fixes any memory leak issue in the report object.
Comment 18 John Mazzitelli 2012-09-06 18:59:52 EDT
(In reply to comment #17)
> A better solution is to simply flip the reports when run() is called.

I haven't had time to look at this in depth in a while. But for some reason, I'm thinking just switching these will lose data before it can be reported (I'm not sure - I'm just pointing out this is something to watch out for). For example, what happens if run() is called before a piece of data in the current report hasn't been reported up to the server yet (if this can happen, that measurement data will be lost)?

So, I'm asking if there is any way data being collected can be lost doing things this way?

We want to prevent two things: 1) data getting lost and 2) the same data getting reported to the server multiple times. If we can ensure that, then we are good.

What you should do is pull the bug/783603 branch and apply your changes to it and try it out (that branch is where I put all of my patches). Attach your patch to this BZ so we can apply it to that branch if appropriate. Also, make sure the unit test class passes (CollectorThreadPoolTest).

Note that I took some time just now to merge in the latest master to branch bug/783603 so that should be fully up to date with the latest master branch, but also has my changes as per previous comments. You can simply apply your changes to that branch and post your patch here.
Comment 19 Elias Ross 2012-09-07 00:58:43 EDT
There are these possible orderings:

getValues() -- no data returned
run() -- report 1 generated
run() -- report 2 generated [*] report 1 lost
getValues() -- report 2 returned
getValues() -- report 2 returned; old data

[*] Yes, you end up throwing out the old data here.

The improvement would be:

getValues() -- no data returned
run() -- report 1 generated
run() -- [^] NO report generated until getValues() called
getValues() -- report 1 returned
getValues() -- report 1 returned; old data

[^] This could be fixed by simply setting a boolean (volatile) flag to 'true' when the report is generated and 'false' when the values are returned. However, if (some) measurements are gathered at different times...

Still, if you know this resource utilizes the 'async' feature, then probably there's no point in setting a measurement schedule that is a different period (frequency) than the async frequency. Further, even if they are configured with a different period, still maybe there's not much harm in simply throwing away this data?

I also though of another (probably better) way to solve the same problem: Allow the measurement gathering container to simply gather metrics using a (somewhat bounded) thread pool rather than expect measurements to always be gathered under a certain limit. I mean, something like a method-level annotation or maybe an interface could enable certain resources to exceed the 20-30 second time limit they have now.

The same approach could be used for availability checking as well.

Something like:

public class SlowAvailResourceComponent ... {
    @Asynchronous(frequency=5,timeout=10,units=MINUTES)
    public AvailabilityType getAvailability() { ... }
    @Asynchronous(frequency=10,timeout=5) // MINUTES default
    public void getValues(...) {
}

Then the container would:
1) Not allow availability checks/measurement checks more than every 5/10 minutes
2) Do the scheduling using a thread pool and the component author doesn't need to be involved in hooking up the code

... just an idea ...
Comment 20 Elias Ross 2012-10-10 14:58:47 EDT
Created attachment 625107 [details]
patch that eventually expires obsolete or disabled metrics

patch apply to commit 8d36d210e04 , bug/783603

The patch satisfies two problems:
1) Scheduled metrics that are no longer being requested are removed after 2 times the request interval.
2) The metric report cannot grow unbounded. The old report is replaced every time run() is called.
Comment 21 John Mazzitelli 2012-10-10 15:44:24 EDT
pushed the last patch posted to bug/783603 branch: commit 8fb5e94aca43de8d64f7cd44ca64c629c50351f3

I then merged master into this bug branch to keep it up to date: commit f923990ee04f4e8a635437b5321aefe6f71a1873
Comment 22 John Mazzitelli 2012-10-15 08:54:16 EDT
setting target release so we triage this and get it into the release schedule. I think this is ready to go into the next RHQ release at least.

Elias - what is your opinion on the readiness for this to go into master and an RHQ release?
Comment 23 John Mazzitelli 2012-10-16 10:07:28 EDT
related email was sent to rhq-devel mailing list:

https://lists.fedorahosted.org/pipermail/rhq-devel/2012-October/002228.html

The content is:

---
There is a community contribution to the agent plugin container allowing for async measurement collection:

https://bugzilla.redhat.com/show_bug.cgi?id=783603

I created a branch off of master and put the patches for this enhancement in it:

http://git.fedorahosted.org/cgit/rhq/rhq.git/log/?h=bug/783603

In order to be merged into master and get into the next RHQ release, we need someone to peer review this and test this. Is anyone available?

Looking at the new code, it appears we do have unit tests for this.
Comment 24 Elias Ross 2012-11-27 16:16:28 EST
(In reply to comment #22)

> Elias - what is your opinion on the readiness for this to go into master and
> an RHQ release?

To be honest, I haven't tested this much myself. I have wrote something similar at the plugin level and it works fine in production.

There are some (internal facing) API changes (moved classes, ResourceContext constructor changes) so I think the concern there is breaking people's plugins if they use those APIs.

Something that comes to mind: There is no 'MeasurementContext' -- should there be one like AvailabilityContext? There is sort of a lack of symmetry here.