Red Hat Bugzilla – Bug 971556
Better design for availability checking
Last modified: 2014-04-23 08:30:45 EDT
Availability checking is limited to about 5 seconds, which works for most resources, but not all. For a busy JBoss server, the 'solution' for such resources is to create an async processing thread inside the plugin code.
Since the checking work is done largely sequentially, the limitation is understandable.
It would work better to:
* Create a scheduled thread pool executor for availability checking
* Schedule availability checks as configured for the resource (every 30 seconds, 1 minute, 5 minutes)
* These checks 'append' to the availability report, which is sent every 30 seconds or so to the server. (Thread safety hazard here.)
* Create timeouts for these checks on a separate thread pool
Consider as well this class, since there may be hundreds of resources: http://docs.jboss.org/netty/3.1/api/org/jboss/netty/util/HashedWheelTimer.html
Note that if there is queuing, i.e. running out of threads, then the plugin container should:
* Complain loudly
* Possibly increase the availability check period (double it?)
* Disable or stop checking the availability of resources taking the longest to check
Probably 8 threads or so would be sufficient.
A similar solution to measurement gathering would be good.
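The scheduled-executor design above could look roughly like the sketch below. This is only an illustration of the proposal, not RHQ code; all names here (AvailCheckSketch, Avail, drainReport) are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AvailCheckSketch {
    enum Avail { UP, DOWN }

    // 8 checker threads, as suggested above
    private final ScheduledExecutorService checkers = Executors.newScheduledThreadPool(8);
    // separate pool so a hung check can be timed out
    private final ExecutorService timeoutPool = Executors.newCachedThreadPool();
    // checks 'append' here; access is synchronized (the thread-safety hazard noted above)
    private final List<String> report = new ArrayList<>();

    void schedule(final String resource, long periodSeconds, final Callable<Avail> check) {
        checkers.scheduleAtFixedRate(() -> {
            Future<Avail> f = timeoutPool.submit(check);
            Avail result;
            try {
                result = f.get(5, TimeUnit.SECONDS); // per-check timeout
            } catch (TimeoutException e) {
                f.cancel(true); // interrupt the hung check
                result = Avail.DOWN;
            } catch (Exception e) {
                result = Avail.DOWN;
            }
            synchronized (report) {
                report.add(resource + "=" + result);
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
    }

    // drained every 30 seconds or so and sent to the server
    List<String> drainReport() {
        synchronized (report) {
            List<String> copy = new ArrayList<>(report);
            report.clear();
            return copy;
        }
    }

    void shutdown() {
        checkers.shutdownNow();
        timeoutPool.shutdownNow();
    }
}
```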
^^^ Ignore the design above.
I have a better solution using a special proxy that fits in more seamlessly. I will submit a patch for this.
Created attachment 781695 [details]
Patch for master (69656f42348a)
I deployed this patch on 4.5 and it works great for a variety of services.
1) Would be nice to generalize this approach for measurement gathering/scheduling
2) What should AvailabilityCollectorThreadPool be used for? I'm guessing I would deprecate it and eventually remove it.
3) ... which means removing it from the existing JBoss plugin.
4) Is 1 second enough for 'sync' querying? More stats would be good.
You may need to apply this patch using the '--ignore-whitespace' option to 'git am'.
Jay, can you please have a look at this? I think this may be beneficial for 4.9 as well.
Created attachment 798914 [details]
Patch against RHQ 4.9 release
I have this patch worked into code locally and am evaluating...
Looking at 4.9, one thing I did not take into consideration was that through the UI, there are calls made to getLiveResourceAvailability -> getCurrentAvailability, and this may mess up the timing of the proxy system. For example if a user clicks on a resource, then clicks back on the same resource, the Future may be canceled and DOWN returned.
I haven't seen any problem in practice. But it may make sense to add a record of when the scheduled check was made.
(In reply to Elias Ross from comment #8)
> Looking at 4.9, one thing I did not take into consideration was that through
> the UI, there are calls made to getLiveResourceAvailability ->
> getCurrentAvailability, and this may mess up the timing of the proxy system.
> For example if a user clicks on a resource, then clicks back on the same
> resource, the Future may be canceled and DOWN returned.
> I haven't seen any problem in practice. But it may make sense to add a
> record of when the scheduled check was made.
Yeah, this is a problem. I have a unit test in which I can see this happen: I spawn a few threads, call getAvailability concurrently, and can clearly see the problem. It makes sense when I look at the code. I'll see if I can somehow work around this without requiring additional data fields in the proxy - we need to keep the memory footprint of this proxy very lean. There is the potential for tens of thousands of these proxies in large agent inventories, so the smaller the better.
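A toy model of the race (not the actual RHQ proxy; the class and its timings are invented for illustration): a second concurrent caller cancels the first caller's in-flight Future, so the first caller reports DOWN even though the check would have returned UP.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class RacyAvailProxy {
    private final ExecutorService executor = Executors.newCachedThreadPool();
    private volatile Future<String> current;

    public String getAvailability() {
        Future<String> prior = current;
        if (prior != null && !prior.isDone()) {
            prior.cancel(true); // cancels the other caller's in-flight check...
        }
        Future<String> mine = executor.submit(() -> {
            Thread.sleep(100); // simulate a slow resource check
            return "UP";
        });
        current = mine;
        try {
            return mine.get(1, TimeUnit.SECONDS);
        } catch (Exception e) {
            return "DOWN"; // ...so the canceled caller sees DOWN
        }
    }

    public void shutdown() {
        executor.shutdownNow();
    }
}
```

Two staggered callers reproduce it deterministically: the first call's Future is canceled by the second, so one caller gets DOWN and the other UP.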
Side note: just to get a rough estimate of the size of this proxy, I made the proxy serializable (I had to make some fields transient, like the classloader and executor references), serialized it, and measured the size in bytes of the serialized data: 500 bytes. So figure each object's in-memory footprint isn't going to be larger than 1,000 bytes, which means each 10,000 resources in inventory could add at least 5MB, and probably closer to 10MB, to the agent's footprint. This was after I took out the logger data field (the proxy now creates the logger only when it needs to log a message which, in this proxy's case, is only when unusual conditions happen, like timing out more than 5 times).
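The measurement technique described here is simple to reproduce; the sketch below runs it on a stand-in object (the real proxy isn't shown): serialize to a byte array and count the bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializedSize {
    public static int sizeOf(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(obj); // transient fields (classloader, executor) are skipped
        out.close();
        return bytes.size();
    }
}
```

Serialized size only approximates in-memory size (hence the 2x fudge factor above), but it's a quick lower-bound check.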
The origin/jay-avail branch has the latest code with tests passing. I think this is it. After the US holidays, we'll merge the latest master into it, make sure it's still working, then merge it into master so others can beat on it.
(In reply to John Mazzitelli from comment #9)
> Yeah, this is a problem. I have a unit test that I can see this happen. I
> spawn a few threads and concurrently call getAvailability and I can clearly
> see the problem. It makes sense when I look at the code. I'll see if I can
> somehow work around this without requiring additional data fields in the
> proxy - we need to keep the memory footprint of this proxy very lean - there
> is a potential of having tens of thousands of these for those with large
> agent inventories so the smaller the better.
I think the solution is to have getLiveAvailability behave differently: only use the proxy for scheduled availability checks. I'm not sure how workable this is without changing a lot of code, but I'm throwing that out there.
It does make some sense for live availability checks to work synchronously anyway and let the client handle the cancellation as appropriate.
Created attachment 831568 [details]
patch for avail async in plugin container
attaching patch against 4.10 master branch (applied on top of dc4622217d50fd9864b1fe64a424c5a7de507908)
this has been pushed to master : bbb18b7
AvailTest was failing. It appears that, due to the singleton nature of PluginContainer and ResourceContainer, we need the new avail proxy to call into ResourceContainer to submit tasks (rather than the proxy storing the instance of the executor thread pool), or we need to refresh its thread pool reference with the latest one. This is because during testing we shut down and reinitialize the PC over and over, and in some cases the proxy's thread pool has been shut down. (ResourceContainer.initialize creates a new one after its shutdown() shuts down the thread pool, but the old thread pool references remain in the proxy.) Thanks to Lukas for finding this out.
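A hypothetical illustration of the look-up-instead-of-capture fix described here (all class names are invented, not the actual RHQ classes): the proxy asks the container for its current thread pool on every submission rather than holding the executor reference from construction time, so a restart that replaces the pool doesn't leave the proxy submitting to a shut-down executor.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ContainerSketch {
    private volatile ExecutorService pool = Executors.newSingleThreadExecutor();

    public ExecutorService currentPool() {
        return pool;
    }

    public void restart() { // models the PC shutdown + reinitialize cycle in tests
        pool.shutdown();
        pool = Executors.newSingleThreadExecutor();
    }

    public static class AvailProxySketch {
        private final ContainerSketch container;

        public AvailProxySketch(ContainerSketch container) {
            this.container = container;
        }

        public Future<String> check() {
            // look up the live pool each time; a reference captured at
            // construction would throw RejectedExecutionException after restart()
            return container.currentPool().submit(() -> "UP");
        }
    }
}
```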
I'll try to work around this problem in the testing framework somehow. Putting this back to ON_DEV.
git commit to master related to this issue:
Bulk closing of 4.10 issues.
If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.