Red Hat Bugzilla – Bug 834019
Measurements not collected for agents with a large number of scheduled measurements
Last modified: 2013-09-01 06:13:20 EDT
Created attachment 593253
Proposed patch for RHQ 3.0 and 4.4
We have a somewhat unusual RHQ setup in which we monitor a large number of resources remotely from a single agent. Per agent, we have roughly 25000 scheduled measurements, with roughly 1500 measurements collected per minute. Since most of the metrics are collected with the same interval (10 minutes), this causes the following problem: when the agent is started (t=0), it schedules all these metrics in the same interval [0s,30s]. However, because of the large number of measurements, the agent is not able to collect all of them in that 30s interval and reschedules the remaining ones to the next interval in the original schedule, i.e. to [10m,10m+30s]. The same thing happens again in the interval [10m,10m+30s], and most of the measurements are rescheduled to the next interval [20m,20m+30s], and so forth. This means that some metrics are never collected (and are reported as "late" in the metrics of the RHQ agent).
Note that the issue only occurs after restarting the agent. When the resources are originally added to the inventory, the corresponding measurement schedules are spread more or less randomly and the agent is able to collect all of them.
To solve that issue with RHQ 3.0, I applied the patch attached to this issue. The patch also works with RHQ 4.4. The idea is that instead of rescheduling the measurement according to the original schedule (e.g. from [0s,30s] to [10m,10m+30s]), it should simply be rescheduled to the next interval (from [0s,30s] to [30s,60s]).
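For illustration, the difference between the original rescheduling and the patched rescheduling can be sketched as follows. This is only a minimal sketch and not the actual agent code or the attached patch: the class name, method names, and the use of milliseconds relative to agent start are assumptions made for the example.

```java
/** Hypothetical sketch of the two rescheduling policies; not the RHQ agent code. */
final class LateRescheduleSketch {
    /** Length of one collection window (the 30s window described in the report). */
    static final long WINDOW_MILLIS = 30_000L;

    /**
     * Original behavior: a late request skips to the same window of its next
     * regular period, e.g. from [0s,30s] to [10m,10m+30s].
     */
    static long rescheduleOriginal(long scheduledMillis, long intervalMillis) {
        return scheduledMillis + intervalMillis;
    }

    /**
     * Patched behavior: a late request moves to the window that immediately
     * follows, e.g. from [0s,30s] to [30s,60s].
     */
    static long reschedulePatched(long scheduledMillis) {
        return scheduledMillis + WINDOW_MILLIS;
    }
}
```

With a 10 minute interval, `rescheduleOriginal(0, 600_000)` lands back in the crowded window at 10m, while `reschedulePatched(0)` lands at 30s, so the backlog drains over consecutive windows instead of piling up in the same one.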
Jay S. did something in the availability scheduling that was similar - at least the end goal was similar and that is to spread out the initial schedule so things aren't scheduled to be collected at roughly the same time. Need to look at the code Jay did for that and perhaps see if the same method can be used here.
Jay noted that for this to be efficient, we need to group metric requests for the same resource. This is indeed a requirement for us because we have a custom plugin that fetches the measurements for all metrics of a given resource in a single remote call. This requirement should be easy to satisfy by calculating the shift based on a hash function (e.g. modulo) of the resource ID.
However, there is another problem: in our case, all resources are highly available, which means that they are part of compatible groups. In that case it is not desirable that the measurement schedules for the different members of a given resource group are shifted randomly with respect to each other. Instead, measurements should be taken at the same time. Therefore it is probably better to use a hash function of the resource key instead of the resource ID.
Finally, I think that the first scheduled measurement should not be calculated relative to the agent start time, but relative to an absolute point in time (e.g. the UNIX epoch). This makes it possible to satisfy the last requirement even if the measurements for the resources in a group are taken by different agents (which will likely be the case in most RHQ setups). It also helps satisfy the first requirement when metrics are enabled at runtime.
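To make the three requirements above concrete (same shift for all metrics of a resource, same shift for group members with the same resource key, and a schedule anchored at the UNIX epoch), here is a rough sketch. Everything in it is an assumption for the sake of the example: the class and method names are hypothetical, the resource key is treated as a plain string, and String.hashCode() is used as the hash only because its value is defined by the Java specification and is therefore the same on every agent.

```java
/** Hypothetical sketch of an epoch-aligned, key-based schedule; not RHQ code. */
final class EpochAlignedScheduleSketch {
    /**
     * Shift within the collection interval, derived only from the resource key.
     * All metrics of one resource, and all resources sharing a key, get the same shift.
     */
    static long offsetMillis(String resourceKey, long intervalMillis) {
        // floorMod keeps the result in [0, intervalMillis) even for negative hash codes.
        return Math.floorMod((long) resourceKey.hashCode(), intervalMillis);
    }

    /**
     * First collection time at or after nowMillis on the grid
     * {offset + k * interval}, measured from the UNIX epoch.
     */
    static long firstCollection(String resourceKey, long intervalMillis, long nowMillis) {
        long offset = offsetMillis(resourceKey, intervalMillis);
        // Distance from now to the next grid point; zero if now is already on the grid.
        long wait = Math.floorMod(offset - nowMillis, intervalMillis);
        return nowMillis + wait;
    }
}
```

Because the result depends only on the key, the interval, and the wall clock, two agents started at different times would compute collection times on the same grid for resources with the same key, provided their clocks are reasonably synchronized.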
See also Bug 783603
Interesting, so the attached patch is small and relatively non-intrusive. If I read this right, all this is doing is taking a measurement schedule that was late and simply punting a little bit more into the future - essentially delaying the next collection 30s into the future. The difference here is the old code was more conservative and punted until the next schedule collection interval (which could be 10m in the future or some such number) giving the agent more time to catch up doing things (which may or may not be due to measurement collection if the agent is CPU starved for other reasons). Punting only for another 30s, in general, may not be enough time to let the agent try to settle down in a steady state.
What if, rather than punt 30s in the future, we punt (next collection time + 30s) - essentially spreading out the next collection further out in the future than it would otherwise by 30 seconds? Thus, the earlier description of the attached patch can be slightly modified to say:
> The idea is that instead of rescheduling the measurement according to the
> original schedule (e.g. from [0s,30s] to [10m,10m+30s]), it should simply
> be rescheduled to the next interval (from [0s,30s] to ***[10m+30s,10m+60s]***).
The *** *** part is what I propose we change from the attached patch.
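As a rough sketch of this proposal (hypothetical names, times assumed to be in milliseconds, not the code that was eventually committed), the late request is pushed to its next regular collection time plus an extra 30s:

```java
/** Hypothetical sketch of "next collection time + 30s"; not the RHQ agent code. */
final class DelayedNextIntervalSketch {
    /** Extra delay added on top of the regular collection interval. */
    static final long EXTRA_DELAY_MILLIS = 30_000L;

    /** A late request goes to its next regular collection time plus 30s. */
    static long reschedule(long scheduledMillis, long intervalMillis) {
        return scheduledMillis + intervalMillis + EXTRA_DELAY_MILLIS;
    }

    /** Scheduled time after the same request has been late lateCount times in a row. */
    static long afterRepeatedLateness(long firstScheduledMillis, long intervalMillis, int lateCount) {
        long scheduled = firstScheduledMillis;
        for (int i = 0; i < lateCount; i++) {
            scheduled = reschedule(scheduled, intervalMillis);
        }
        return scheduled;
    }
}
```

With a 10 minute interval, a request first scheduled at 0s moves to 10m+30s after one late cycle and to 21m after two, so requests that keep being late drift away from the original window by 30s per cycle.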
> Finally, I think that the first scheduled measurement should not be calculated
> relative to the agent start time, but relative to an absolute point in time
> (e.g. the UNIX epoch).
This part scares me and before we do anything like this, we'd need some testing and prototyping because now this potentially means that many collections across all agents won't necessarily be "randomly spread out" but will more likely overlap. How does this affect the amount of data from all agents that gets into the server(s) - will the servers be bombarded by a synchronicity of cyclical agent measurement reports with a period of relative inactivity in between? I'm just not sure I fully understand the ramifications of normalizing all collection start times to epoch millis and until we do understand the ramifications on system behavior, we should err on the side of caution.
I committed to master this: cd6c2fa4ce4f128f80145438499dbfc1223fac42
A later commit added some code to our base arquillian framework and I committed a stubbed out unit test method that is supposed to test the fix for this BZ. I will work on fleshing out that test next. But I needed the framework code first. See commit: 5e823b969b8cc830432cfc2e055ed3292464c4d8
Finished the unit test - committed to master: 00db03b19d5a70e9798405b33f91ed150e38382e
I'm not sure how QE can test this. This was hard to replicate in unit test code! So I would say QE should just rely on the results of the unit test class LateMeasurementRescheduleTest.
Setting this back to ON_DEV.
The fix today is OK but after discussing with Mazz we think it can be improved. The issue with the fix today is that when metric collection falls behind, there may be a large number of schedule requests still remaining for that time bucket. And all of them are then rescheduled for the same time in the future (collectionInterval + 31s). It is good that they are scheduled away from the collectionInterval, but more randomization would be better, as would rescheduling for a time that is potentially nearer to the current time.
The change will be to reschedule to 31s + randomSeconds(0..collectionInterval).
The fixed 31s ensures that we push out at least to some degree, allowing the agent some time to catch up and, I believe, this plays into our internal eval periods. But then introduce more randomization to better spread out collection requests, and also don't, in general, wait so long to retry, because this could cause the initial collection to still take an unacceptably long time.
Also, add more debug and trace level logging for better diagnostics in this area.
A little more revision after getting into this in detail. Importantly, the rescheduling of late collection requests is:
*Now* + 30s + [1..collectionInterval]
Using "Now" (i.e. current system time) as opposed to request.getNextCollection() (i.e. the originally scheduled time) gives us more predictability about the range in which the request will be scheduled.
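A minimal sketch of that formula follows. It is not the committed code: the class and method names are hypothetical, all values are assumed to be in milliseconds, and the random source is passed in only so the example can be exercised with a fixed seed.

```java
import java.util.Random;

/** Hypothetical sketch of "Now + 30s + [1..collectionInterval]"; not the RHQ agent code. */
final class RandomizedLateRescheduleSketch {
    /** Fixed push-out that gives the agent some time to catch up. */
    static final long FIXED_DELAY_MILLIS = 30_000L;

    /**
     * New collection time for a late request, computed from the current time
     * rather than from the originally scheduled time.
     */
    static long reschedule(long nowMillis, long intervalMillis, Random random) {
        // Jitter in [1, intervalMillis]; the modulo bias is negligible for this purpose.
        long jitter = 1 + Math.floorMod(random.nextLong(), intervalMillis);
        return nowMillis + FIXED_DELAY_MILLIS + jitter;
    }
}
```

Under these assumptions the result always falls in the range (now + 30s, now + 30s + collectionInterval], which is the predictability argument made above: the range depends on the current time only, not on how late the request already is.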
Author: Jay Shaughnessy <firstname.lastname@example.org>
Date: Wed Nov 14 16:39:05 2012 -0500
Revisit the solution for this BZ and improve by adding more
randomization to the reschedule times.
- update test accordingly
- add some debug and trace level debugging around late
Author: Jay Shaughnessy <email@example.com>
Date: Wed Nov 14 17:14:31 2012 -0500
trivial - enhance logging slightly
Bulk closing of items that are on_qa and in old RHQ releases, which are out for a long time and where the issue has not been re-opened since.