Red Hat Bugzilla – Bug 876665
Measurements not collected for agents with a large number of scheduled measurements
Last modified: 2013-09-11 07:02:12 EDT
+++ This bug was initially created as a clone of upstream Bug #834019 +++
Created attachment 593253
Proposed patch for RHQ 3.0 and 4.4
We have a somewhat particular RHQ setup where we monitor a large number of resources remotely from a single agent. Per agent, we have roughly 25000 scheduled measurements, with roughly 1500 measurements collected per minute. Since most of the metrics are collected with the same interval (10 minutes), this causes the following problem: when the agent is started (t=0), it will schedule all these metrics in the same interval [0s,30s]. However, because of the large number of measurements, the agent is not able to collect all of them in that 30s interval and will reschedule the remaining ones to the next interval in the original schedule, i.e. to [10m,10m+30s]. The same thing happens again in the interval [10m,10m+30s], and most of the measurements are rescheduled to the next interval [20m,20m+30s], and so forth. This means that some metrics are never collected (and are reported as "late" in the metrics of the RHQ agent).
Note that the issue only occurs after restarting the agent. When the resources are originally added to the inventory, the corresponding measurement schedules are spread more or less randomly and the agent is able to collect all of them.
To solve that issue with RHQ 3.0, I applied the patch attached to this issue. The patch also works with RHQ 4.4. The idea is that instead of rescheduling the measurement according to the original schedule (e.g. from [0s,30s] to [10m,10m+30s]), it should simply be rescheduled to the next interval (from [0s,30s] to [30s,60s]).
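To illustrate the difference between the two rescheduling policies, here is a minimal simulation sketch. It is not the RHQ agent code: the class name, the `simulate` method, the fixed per-window capacity, and the tie-breaking by measurement id are all assumptions made for the example. It only models "collect what fits in the 30s window, then reschedule the rest".

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical model of the scheduling problem; not RHQ's actual classes.
final class LateRescheduleSimulation {
    static final long WINDOW_MS = 30_000L;    // 30s collection window from the report
    static final long INTERVAL_MS = 600_000L; // 10 minute collection interval

    /** Returns how many distinct measurements were collected at least once. */
    static int simulate(int measurements, int capacityPerWindow, long durationMs, boolean patched) {
        // Each entry is {nextCollectionTimeMs, measurementId}.
        // Assumption: ties on time are broken by id, so the queue order is stable.
        PriorityQueue<long[]> queue = new PriorityQueue<>(
            Comparator.<long[]>comparingLong(e -> e[0]).thenComparingLong(e -> e[1]));
        for (int id = 0; id < measurements; id++) {
            queue.add(new long[] {0L, id}); // agent restart: everything is due at t=0
        }
        BitSet collected = new BitSet(measurements);
        for (long t = 0; t < durationMs; t += WINDOW_MS) {
            // Pull everything that is due within the current window.
            List<long[]> due = new ArrayList<>();
            while (!queue.isEmpty() && queue.peek()[0] < t + WINDOW_MS) {
                due.add(queue.poll());
            }
            for (int i = 0; i < due.size(); i++) {
                long[] entry = due.get(i);
                if (i < capacityPerWindow) {
                    collected.set((int) entry[1]);
                    entry[0] += INTERVAL_MS; // collected on time: next regular slot
                } else {
                    // Late: original policy skips a whole interval,
                    // patched policy moves to the next 30s window.
                    entry[0] += patched ? WINDOW_MS : INTERVAL_MS;
                }
                queue.add(entry);
            }
        }
        return collected.cardinality();
    }

    public static void main(String[] args) {
        long oneHour = 3_600_000L;
        System.out.println("original policy: " + simulate(100, 10, oneHour, false) + " of 100 collected");
        System.out.println("patched policy: " + simulate(100, 10, oneHour, true) + " of 100 collected");
    }
}
```

Under these assumptions the original policy keeps collecting the same 10 measurements every 10 minutes, while the patched policy drains the backlog over the first ten windows and the schedules then stay spread out.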
--- Additional comment from email@example.com on 2012-06-20 11:44:06 EDT ---
Jay S. did something in the availability scheduling that was similar - at least the end goal was similar and that is to spread out the initial schedule so things aren't scheduled to be collected at roughly the same time. Need to look at the code Jay did for that and perhaps see if the same method can be used here.
--- Additional comment from firstname.lastname@example.org on 2012-06-20 14:07:00 EDT ---
Jay noted that for this to be efficient, we need to group metric requests for the same resource. This is indeed a requirement for us because we have a custom plugin that fetches the measurements for all metrics of a given resource in a single remote call. This requirement should be easy to satisfy by calculating the shift based on a hash function (e.g. modulo) of the resource ID.
However, there is another problem: in our case, all resources are highly available, which means that they are part of compatible groups. In that case it is not desirable that the measurement schedules for the different members of a given resource group are shifted randomly with respect to each other. Instead, measurements should be taken at the same time. Therefore it is probably better to use a hash function of the resource key instead of the resource ID.
Finally, I think that the first scheduled measurement should not be calculated relative to the agent start time, but relative to an absolute point in time (e.g. the UNIX epoch). This makes it possible to satisfy the last requirement even if the measurements for the resources in a group are taken by different agents (which will likely be the case in most RHQ setups). It also helps satisfy the first requirement when metrics are enabled at runtime.
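A rough sketch of what this comment proposes: derive the shift from a hash of the resource key and anchor the schedule to the UNIX epoch instead of the agent start time. The class and method names are hypothetical, and using `String.hashCode()` modulo the interval as the hash is an assumption for illustration, not something taken from RHQ.

```java
// Hypothetical sketch; names and the choice of hash are assumptions.
final class EpochAlignedSchedule {

    /**
     * Offset in [0, intervalMs) derived from the resource key. String.hashCode()
     * is specified by the Java API, so every agent computes the same offset
     * for the same key.
     */
    static long offsetFor(String resourceKey, long intervalMs) {
        return Math.floorMod((long) resourceKey.hashCode(), intervalMs);
    }

    /**
     * First collection time at or after nowMs that lies on the grid
     * offset, offset + interval, offset + 2 * interval, ... counted from the epoch.
     */
    static long firstCollection(String resourceKey, long intervalMs, long nowMs) {
        long offset = offsetFor(resourceKey, intervalMs);
        long sinceLastSlot = Math.floorMod(nowMs - offset, intervalMs);
        return sinceLastSlot == 0 ? nowMs : nowMs + (intervalMs - sinceLastSlot);
    }

    public static void main(String[] args) {
        long interval = 600_000L; // 10 minutes
        // Two agents started at different times still agree on the slot for a key.
        long agentA = firstCollection("example-resource-key", interval, 1_000L);
        long agentB = firstCollection("example-resource-key", interval, 5_000_000L);
        System.out.println(agentA % interval == agentB % interval); // true
    }
}
```

Because all metrics of a resource share the same key, they land in the same slot, which keeps requests for one resource grouped. Members of a compatible group that share a resource key also end up on the same grid regardless of which agent collects them.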
--- Additional comment from email@example.com on 2012-06-21 05:44:29 EDT ---
See also Bug 783603
--- Additional comment from firstname.lastname@example.org on 2012-06-28 14:01:52 EDT ---
Interesting, so the attached patch is small and relatively non-intrusive. If I read this right, all this is doing is taking a measurement schedule that was late and simply punting it a little bit further into the future - essentially delaying the next collection 30s into the future. The difference here is the old code was more conservative and punted until the next scheduled collection interval (which could be 10m in the future or some such number), giving the agent more time to catch up doing things (the lateness may or may not be due to measurement collection if the agent is CPU starved for other reasons). Punting only for another 30s, in general, may not be enough time to let the agent settle down into a steady state.
What if, rather than punt 30s in the future, we punt (next collection time + 30s) - essentially spreading out the next collection further out in the future than it would otherwise by 30 seconds? Thus, the earlier description of the attached patch can be slightly modified to say:
> The idea is that instead of rescheduling the measurement according to the
> original schedule (e.g. from [0s,30s] to [10m,10m+30s]), it should simply
> be rescheduled to the next interval (from [0s,30s] to ***[10m+30s,10m+60s]***).
The *** *** part is what I propose we change from the attached patch.
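For comparison, the three reschedule rules discussed so far can be written down side by side. This is only a sketch with hypothetical names; it is not the code that was committed, and the commit notes further down mention additional randomization that is not modeled here.

```java
// Hypothetical helper comparing the three rules from this thread; not RHQ code.
final class RescheduleRules {

    /** Old behaviour: wait for the next regular collection interval. */
    static long original(long scheduledMs, long intervalMs) {
        return scheduledMs + intervalMs;
    }

    /** Attached patch: move the late measurement to the next 30s window. */
    static long attachedPatch(long scheduledMs, long windowMs) {
        return scheduledMs + windowMs;
    }

    /** Proposal in this comment: next collection interval plus one window. */
    static long nextIntervalPlusWindow(long scheduledMs, long intervalMs, long windowMs) {
        return scheduledMs + intervalMs + windowMs;
    }

    public static void main(String[] args) {
        long interval = 600_000L; // 10 minutes
        long window = 30_000L;    // 30 seconds
        System.out.println(original(0L, interval));                       // 600000
        System.out.println(attachedPatch(0L, window));                    // 30000
        System.out.println(nextIntervalPlusWindow(0L, interval, window)); // 630000
    }
}
```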
> Finally, I think that the first scheduled measurement should not be calculated
> relative to the agent start time, but relative to an absolute point in time
> (e.g. the UNIX epoch).
This part scares me, and before we do anything like this we'd need some testing and prototyping, because this potentially means that many collections across all agents won't necessarily be "randomly spread out" but will more likely overlap. How does this affect the amount of data from all agents that gets into the server(s) - will the servers be bombarded by a synchronicity of cyclical agent measurement reports with a period of relative inactivity in between? I'm just not sure I fully understand the ramifications of normalizing all collection start times to epoch millis, and until we do understand the ramifications on system behavior, we should err on the side of caution.
--- Additional comment from email@example.com on 2012-06-29 17:49:45 EDT ---
I committed to master this: cd6c2fa4ce4f128f80145438499dbfc1223fac42
A later commit added some code to our base arquillian framework and I committed a stubbed out unit test method that is supposed to test the fix for this BZ. I will work on fleshing out that test next. But I needed the framework code first. See commit: 5e823b969b8cc830432cfc2e055ed3292464c4d8
--- Additional comment from firstname.lastname@example.org on 2012-07-03 13:09:57 EDT ---
finished the unit test - committed to master: 00db03b19d5a70e9798405b33f91ed150e38382e
I'm not sure how QE can test this. This was hard to replicate in unit test code! So I would say QE should just rely on the results of the unit test class LateMeasurementRescheduleTest.
Cherry picked the following commits (in reverse order, not without conflict).
release commit 46c80515654c9159c760298c913250c991e8ddaf
master commit bcce31290b99816d9a82ad1b6df70c187c563ec2
trivial - enhance logging slightly
release commit c09cabe89511d56c296a8903f61c7ae6371f5d93
master commit e09fb3b66648a2617825657e8ed153d98b0cae78
[Bug 834019 -Measurements not collected for agents with a large number of scheduled measurements]
Revisit the solution for this BZ and improve by adding more
randomization to the reschedule times.
- update test accordingly
- add some debug and trace level debugging around late
release commit b82f8694ffb12721513f6df0f452edff0103bcb1
master commit 5e823b969b8cc830432cfc2e055ed3292464c4d8
[BZ 834019] enhancement to the arquillian stuff so we can have code that can templatize reusable plugin descriptors.
release commit df6a63ddc08f052a58a007545114eb8c8a5c363c
master commit cd6c2fa4ce4f128f80145438499dbfc1223fac42
[BZ 834019] the actual code fix to push out the new rescheduled metric - will later be checking in test code to help
More release/jon3.1.x commits to deal with merge issues:
Moving to ON_QA as available for test in 3.1.2.ER4 or greater: https://brewweb.devel.redhat.com//buildinfo?buildID=246861
To test this feature using the pattern generator will require that it be cherry-picked from master, and possibly further enhanced.
There is a robust unit test for this fix. I think it is sufficient and recommend this be marked verified.
marking the bug as verified according to Jay's last comment (was not able to verify manually).