Bug 971556 - Better design for availability checking
Status: CLOSED CURRENTRELEASE
Product: RHQ Project
Classification: Other
Component: Agent
Version: 4.5
Hardware/OS: Unspecified
Severity: unspecified
Target Milestone: GA
Target Release: RHQ 4.10
Assigned To: Jay Shaughnessy
QA Contact: Mike Foley
Depends On:
Blocks: 1037855
 
Reported: 2013-06-06 15:44 EDT by Elias Ross
Modified: 2014-04-23 08:30 EDT (History)
CC: 4 users

Doc Type: Bug Fix
Last Closed: 2014-04-23 08:30:45 EDT
Type: Bug


Attachments
* Patch for master (69656f42348a) (33.33 KB, patch), 2013-08-01 14:03 EDT, Elias Ross
* Patch against RHQ 4.9 release (32.71 KB, patch), 2013-09-17 13:36 EDT, Elias Ross
* patch for avail async in plugin container (81.13 KB, patch), 2013-12-02 08:04 EST, John Mazzitelli

Description Elias Ross 2013-06-06 15:44:33 EDT
Availability checking is limited to about 5 seconds, which works for most resources, but not all. For busy JBoss servers, the current 'solution' for these resources is to create an async processing thread inside the plugin code.

Given that the checking work is done essentially serially, the limitation is understandable.

It would work better to:
* Create a scheduled thread pool executor for availability checking
* Schedule availability checks as configured for the resource (every 30 seconds, 1 minute, 5 minutes)
* These checks 'append' to the availability report, which is sent every 30 seconds or so to the server. (Thread safety hazard here.)
* Create timeouts for these checks on a separate thread pool

Consider as well this class, since there may be hundreds of resources: http://docs.jboss.org/netty/3.1/api/org/jboss/netty/util/HashedWheelTimer.html

Note that if there is queuing, i.e. running out of threads, then the plugin container should:
* Complain loudly
* Possibly increase the availability check period (double it?)
* Disable or stop checking the availability of resources taking the longest to check

Probably 8 threads or so would be sufficient.

A similar solution to measurement gathering would be good.
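The initial proposal above (later superseded by the proxy-based approach in comment 1) could be sketched roughly as follows. This is a hypothetical illustration, not RHQ code: a scheduler fires each resource's check at its configured interval, a small worker pool runs the check under a timeout, and results append to a thread-safe report that is drained periodically for the server.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of the proposed design; all names are invented.
class AvailScheduler {
    enum Avail { UP, DOWN, UNKNOWN }

    // synchronizedList guards the "append to the report" thread-safety hazard noted above
    private final List<Avail> report = Collections.synchronizedList(new ArrayList<>());
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
    private final ExecutorService checkers = Executors.newFixedThreadPool(8); // "8 threads or so"

    void schedule(Callable<Avail> check, long periodSeconds, long timeoutSeconds) {
        scheduler.scheduleAtFixedRate(() -> {
            // run the check on a separate pool so a slow resource can be timed out
            Future<Avail> f = checkers.submit(check);
            try {
                report.add(f.get(timeoutSeconds, TimeUnit.SECONDS));
            } catch (TimeoutException e) {
                f.cancel(true);            // free the worker thread
                report.add(Avail.UNKNOWN); // a real impl would also complain loudly / back off
            } catch (Exception e) {
                report.add(Avail.DOWN);
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
    }

    // called every ~30 seconds to ship accumulated results to the server
    List<Avail> drainReport() {
        synchronized (report) {
            List<Avail> copy = new ArrayList<>(report);
            report.clear();
            return copy;
        }
    }

    void shutdown() {
        scheduler.shutdownNow();
        checkers.shutdownNow();
    }
}
```

A HashedWheelTimer, as suggested, would replace the scheduler pool with a cheaper timing wheel when hundreds of resources are scheduled.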
Comment 1 Elias Ross 2013-06-07 16:30:06 EDT
^^^ Ignore the design above.

I have a better solution using a special proxy that fits in more seamlessly. I will submit a patch for this.
Comment 2 Elias Ross 2013-08-01 14:03:49 EDT
Created attachment 781695 [details]
Patch for master (69656f42348a)
Comment 3 Elias Ross 2013-08-01 14:12:28 EDT
I deployed this patch on 4.5 and it works great for a variety of services.

Couple thoughts:
1) Would be nice to generalize this approach for measurement gathering/scheduling
2) What should AvailabilityCollectorThreadPool be used for? I'm guessing I would deprecate it and eventually remove it.
3) ... which means removing it from the existing JBoss plugin.
4) Is 1 second enough for 'sync' querying? More stats would be good.
Comment 4 Elias Ross 2013-08-14 14:31:08 EDT
You may need to apply this change using the '--ignore-whitespace' option to 'git am'.
Comment 5 Heiko W. Rupp 2013-08-19 07:42:50 EDT
Jay, can you please have a look at this? I think this may be beneficial for 4.9 as well.
Comment 6 Elias Ross 2013-09-17 13:36:55 EDT
Created attachment 798914 [details]
Patch against RHQ 4.9 release
Comment 7 Jay Shaughnessy 2013-11-22 13:47:17 EST
I have this patch worked into code locally and am evaluating...
Comment 8 Elias Ross 2013-11-22 14:55:20 EST
Looking at 4.9, one thing I did not take into consideration was that through the UI, there are calls made to getLiveResourceAvailability -> getCurrentAvailability, and this may mess up the timing of the proxy system. For example if a user clicks on a resource, then clicks back on the same resource, the Future may be canceled and DOWN returned.

I haven't seen any problem in practice. But it may make sense to add a record of when the scheduled check was made.
Comment 9 John Mazzitelli 2013-11-26 18:02:12 EST
(In reply to Elias Ross from comment #8)
> Looking at 4.9, one thing I did not take into consideration was that through
> the UI, there are calls made to getLiveResourceAvailability ->
> getCurrentAvailability, and this may mess up the timing of the proxy system.
> For example if a user clicks on a resource, then clicks back on the same
> resource, the Future may be canceled and DOWN returned.
> 
> I haven't seen any problem in practice. But it may make sense to add a
> record of when the scheduled check was made.

Yeah, this is a problem. I have a unit test that I can see this happen. I spawn a few threads and concurrently call getAvailability and I can clearly see the problem. It makes sense when I look at the code. I'll see if I can somehow work around this without requiring additional data fields in the proxy - we need to keep the memory footprint of this proxy very lean - there is a potential of having tens of thousands of these for those with large agent inventories so the smaller the better.
Comment 10 John Mazzitelli 2013-11-27 10:11:19 EST
side note: just to get a rough estimate of the size of this proxy, I made the proxy serializable (had to make some things transient, like the classloader and executor references), serialized it, and got the size in bytes of the serialized data. The size was 500 bytes. So, figure each object's memory footprint isn't going to be larger than 1,000 bytes. So for each 10,000 resources in inventory, this could mean adding at least 5MB and probably closer to 10MB to the agent's footprint. This was after I took out the logger data field (the proxy now only creates the logger when it needs to log a message which, in this proxy's case, is only when unusual conditions happen, like we've timed out over 5 times or things like that).
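The measurement technique described above can be reproduced with a few lines of standard Java serialization. This is a generic sketch (not the RHQ proxy itself); note that serialized size is only a rough lower-bound proxy for heap footprint, which is presumably why the comment doubles it as a safety margin.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Estimate an object's size by serializing it and counting the bytes.
class SerializedSize {
    static int sizeOf(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj); // transient fields (classloader, executor) are skipped
        }
        return bytes.size();
    }
}
```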
Comment 11 John Mazzitelli 2013-11-27 13:20:08 EST
The latest code, with tests passing, is in the origin/jay-avail branch. I think this is it. After the US holidays, we'll merge the latest master into it, make sure it's still working, then merge it into master so others can beat on it.
Comment 12 Elias Ross 2013-11-27 17:39:12 EST
(In reply to John Mazzitelli from comment #9)
> 
> Yeah, this is a problem. I have a unit test that I can see this happen. I
> spawn a few threads and concurrently call getAvailability and I can clearly
> see the problem. It makes sense when I look at the code. I'll see if I can
> somehow work around this without requiring additional data fields in the
> proxy - we need to keep the memory footprint of this proxy very lean - there
> is a potential of having tens of thousands of these for those with large
> agent inventories so the smaller the better.

I think the solution is to have getLiveAvailability do a different thing. Basically only use the proxy for scheduled availability checks. I'm not sure how workable this is without changing a lot of code, but I'm throwing that out there.

It does make some sense for live availability checks to work synchronously anyway and let the client handle the cancelation as appropriate.
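The suggestion above (live checks synchronous and isolated from the proxy) can be illustrated with a hypothetical sketch; none of these names are from the RHQ codebase. Each live request gets its own Future, so cancelling it when the user navigates away cannot disturb the scheduled check's Future and cause a later read to report DOWN.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration: on-demand (live) availability runs on a
// per-request Future owned by the caller, separate from scheduled checks.
class LiveAvailCheck {
    enum Avail { UP, DOWN, UNKNOWN }

    private final ExecutorService pool = Executors.newCachedThreadPool();

    Avail getLiveAvailability(Callable<Avail> facetCall, long timeoutMs) {
        Future<Avail> f = pool.submit(facetCall);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);        // cancels only this request, not the scheduled check
            return Avail.UNKNOWN;  // caller decides how to render a timeout
        } catch (Exception e) {
            return Avail.DOWN;
        }
    }

    void shutdown() { pool.shutdownNow(); }
}
```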
Comment 13 John Mazzitelli 2013-12-02 08:04:40 EST
Created attachment 831568 [details]
patch for avail async in plugin container

attaching patch against 4.10 master branch (applied on top of dc4622217d50fd9864b1fe64a424c5a7de507908)
Comment 14 John Mazzitelli 2013-12-03 17:35:58 EST
this has been pushed to master : bbb18b7
Comment 15 John Mazzitelli 2013-12-06 12:55:40 EST
AvailTest was failing. It appears that, due to the singleton nature of PluginContainer and ResourceContainer, we need the new avail proxy to call into ResourceContainer to submit tasks (rather than the proxy storing the instance of the executor thread pool), or we need to refresh its thread pool reference with the latest one. This is because during testing we shut down and reinitialize the PC over and over, and in some cases the proxy's thread pool has been shut down. (ResourceContainer.initialize creates a new pool after its shutdown() shuts down the old one, but the stale thread pool references remain in the proxy.) Thanks to Lukas for finding this out.

I'll try to work around this problem in the testing framework somehow. Putting this back to ON_DEV.
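The fix described above amounts to one level of indirection. A minimal sketch with invented names (the real classes are PluginContainer/ResourceContainer): instead of the proxy caching an ExecutorService that may be shut down when the container restarts, it asks a container-level holder for the current pool on every submit, so a reinitialized container transparently swaps pools under the proxies.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical container-level holder: proxies submit through it
// rather than holding a direct reference to a pool instance.
class PoolHolder {
    private final AtomicReference<ExecutorService> current =
            new AtomicReference<>(Executors.newFixedThreadPool(2));

    // Analogue of ResourceContainer.initialize(): install a fresh pool
    // and shut down the old one; existing proxies are unaffected.
    void reinitialize() {
        ExecutorService old = current.getAndSet(Executors.newFixedThreadPool(2));
        old.shutdownNow();
    }

    // With a cached stale pool this would throw RejectedExecutionException
    // after a restart; going through the holder always hits the live pool.
    void submit(Runnable task) {
        current.get().submit(task);
    }

    void shutdown() { current.get().shutdownNow(); }
}
```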
Comment 16 John Mazzitelli 2013-12-09 12:51:09 EST
git commits to master related to this issue:

f89a8b4ce8c647c93255db0645758eded2423239
273c9a9593fe488f5f1ac85dfcf66253d1f19d25
988034f36abce65ed3bf4a6fa1846a0d540398ba
Comment 17 Heiko W. Rupp 2014-04-23 08:30:45 EDT
Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.
