Bug 971556

Summary: Better design for availability checking
Product: [Other] RHQ Project
Reporter: Elias Ross <genman>
Component: Agent
Assignee: Jay Shaughnessy <jshaughn>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: unspecified
Priority: unspecified    
Version: 4.5
CC: asantos, genman, hrupp, mazz
Target Milestone: GA   
Target Release: RHQ 4.10   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2014-04-23 12:30:45 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1037855    
Attachments:
* Patch for master (69656f42348a)
* Patch against RHQ 4.9 release
* patch for avail async in plugin container

Description Elias Ross 2013-06-06 19:44:33 UTC
Availability checking is limited to about 5 seconds, which works for most resources, but not all. For busy JBoss instances, the current 'solution' for these resources is to create an async processing thread inside the plugin code.

Since the checking work is done more or less serially, that time limit makes sense for the current design.

It would work better to:
* Create a scheduled thread pool executor for availability checking
* Schedule availability checks as configured for the resource (every 30 seconds, 1 minute, 5 minutes)
* These checks 'append' to the availability report, which is sent every 30 seconds or so to the server. (Thread safety hazard here.)
* Create timeouts for these checks on a separate thread pool

Also consider this class, since there may be hundreds of resources: http://docs.jboss.org/netty/3.1/api/org/jboss/netty/util/HashedWheelTimer.html
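
For illustration only, a minimal sketch of the scheduled-executor idea (AvailCheck, AvailType and the report list are made-up names for this sketch, not RHQ classes):

    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class AvailSchedulerSketch {

        enum AvailType { UP, DOWN, UNKNOWN }

        interface AvailCheck {
            AvailType check() throws Exception; // the per-resource availability probe
            long intervalSeconds();             // 30s / 1m / 5m, as configured for the resource
        }

        private final ScheduledExecutorService scheduler =
                Executors.newScheduledThreadPool(8);  // "probably 8 threads or so"
        private final ExecutorService timeoutPool =
                Executors.newCachedThreadPool();      // enforces the per-check time limit
        private final List<AvailType> report =
                new CopyOnWriteArrayList<>();         // thread-safe "append to the report"

        void schedule(final AvailCheck check) {
            scheduler.scheduleAtFixedRate(() -> {
                Callable<AvailType> probe = check::check;
                Future<AvailType> result = timeoutPool.submit(probe);
                try {
                    report.add(result.get(5, TimeUnit.SECONDS)); // the existing ~5 second limit
                } catch (TimeoutException te) {
                    result.cancel(true);
                    report.add(AvailType.UNKNOWN); // the place to "complain loudly" or back off
                } catch (Exception e) {
                    report.add(AvailType.DOWN);
                }
            }, 0, check.intervalSeconds(), TimeUnit.SECONDS);
        }
    }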

Note that if there is queuing, i.e. running out of threads, then the plugin container should:
* Complain loudly
* Possibly increase the availability check period (double it?)
* Disable or stop checking the availability of resources taking the longest to check

Probably 8 threads or so would be sufficient.

A similar solution for measurement gathering would be good.

Comment 1 Elias Ross 2013-06-07 20:30:06 UTC
^^^ Ignore the design above.

I have a better solution using a special proxy that fits in more seamlessly. I will submit a patch for it.
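
For context, a rough sketch of what such a proxy could look like using java.lang.reflect.Proxy. AvailabilityFacet mirrors the plugin API's availability interface; everything else here is illustrative and not taken from the patch:

    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.Method;
    import java.lang.reflect.Proxy;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class AvailProxySketch {

        enum AvailabilityType { UP, DOWN }

        interface AvailabilityFacet {
            AvailabilityType getAvailability();
        }

        static AvailabilityFacet wrap(final AvailabilityFacet component,
                                      final ExecutorService executor) {
            return (AvailabilityFacet) Proxy.newProxyInstance(
                    AvailabilityFacet.class.getClassLoader(),
                    new Class<?>[] { AvailabilityFacet.class },
                    new InvocationHandler() {
                        private volatile AvailabilityType last = AvailabilityType.DOWN;
                        private volatile Future<AvailabilityType> inFlight;

                        public Object invoke(Object proxy, Method m, Object[] args) throws Throwable {
                            if (!"getAvailability".equals(m.getName())) {
                                return m.invoke(component, args); // pass everything else through
                            }
                            if (inFlight == null || inFlight.isDone()) {
                                // start the real (possibly slow) check in the background
                                Callable<AvailabilityType> probe = component::getAvailability;
                                inFlight = executor.submit(probe);
                            }
                            try {
                                // give the check a short 'sync' window before falling back
                                last = inFlight.get(1, TimeUnit.SECONDS);
                            } catch (TimeoutException te) {
                                // still running: answer with the last known result
                            } catch (Exception e) {
                                last = AvailabilityType.DOWN; // the check itself failed
                            }
                            return last;
                        }
                    });
        }
    }

A real version would also harvest the result of a check that finishes after the wait window and track how often checks time out; this sketch only shows the shape of the approach.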

Comment 2 Elias Ross 2013-08-01 18:03:49 UTC
Created attachment 781695 [details]
Patch for master (69656f42348a)

Comment 3 Elias Ross 2013-08-01 18:12:28 UTC
I deployed this patch on 4.5 and it works great for a variety of services.

Couple thoughts:
1) Would be nice to generalize this approach for measurement gathering/scheduling
2) What should AvailabilityCollectorThreadPool be used for? I'm guessing I would deprecate it and eventually remove it.
3) ... which means removing it from the existing JBoss plugin.
4) Is 1 second enough for 'sync' querying? More stats would be good.

Comment 4 Elias Ross 2013-08-14 18:31:08 UTC
You may need to apply this change using the '--ignore-whitespace' option to 'git am'.

Comment 5 Heiko W. Rupp 2013-08-19 11:42:50 UTC
Jay, can you please have a look at this? I think this may be beneficial for 4.9 as well.

Comment 6 Elias Ross 2013-09-17 17:36:55 UTC
Created attachment 798914 [details]
Patch against RHQ 4.9 release

Comment 7 Jay Shaughnessy 2013-11-22 18:47:17 UTC
I have this patch worked into code locally and am evaluating...

Comment 8 Elias Ross 2013-11-22 19:55:20 UTC
Looking at 4.9, one thing I did not take into consideration was that through the UI, there are calls made to getLiveResourceAvailability -> getCurrentAvailability, and this may mess up the timing of the proxy system. For example if a user clicks on a resource, then clicks back on the same resource, the Future may be canceled and DOWN returned.

I haven't seen any problem in practice. But it may make sense to add a record of when the scheduled check was made.
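
A standalone toy (not code from the patch) showing the scenario: a slow check is shared through a single in-flight Future, and each live request times out and cancels it, so DOWN comes back even though the resource would eventually report UP:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class LiveRequestRaceSketch {

        private static volatile Future<String> inFlight;

        static String getAvailability(ExecutorService checker) {
            if (inFlight == null || inFlight.isDone()) {
                inFlight = checker.submit(() -> {
                    Thread.sleep(3_000);   // a resource whose availability check takes ~3 seconds
                    return "UP";
                });
            }
            try {
                return inFlight.get(1, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                inFlight.cancel(true);     // the live request gives up and cancels the shared Future
                return "DOWN";
            } catch (Exception e) {
                return "DOWN";             // e.g. the check failed or was already cancelled elsewhere
            }
        }

        public static void main(String[] args) {
            ExecutorService checker = Executors.newSingleThreadExecutor();
            // "click on a resource, then click back on the same resource"
            System.out.println(getAvailability(checker));   // DOWN: timed out, Future cancelled
            System.out.println(getAvailability(checker));   // DOWN again, although the resource is UP
            checker.shutdownNow();
        }
    }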

Comment 9 John Mazzitelli 2013-11-26 23:02:12 UTC
(In reply to Elias Ross from comment #8)
> Looking at 4.9, one thing I did not take into consideration was that through
> the UI, there are calls made to getLiveResourceAvailability ->
> getCurrentAvailability, and this may mess up the timing of the proxy system.
> For example if a user clicks on a resource, then clicks back on the same
> resource, the Future may be canceled and DOWN returned.
> 
> I haven't seen any problem in practice. But it may make sense to add a
> record of when the scheduled check was made.

Yeah, this is a problem. I have a unit test that I can see this happen. I spawn a few threads and concurrently call getAvailability and I can clearly see the problem. It makes sense when I look at the code. I'll see if I can somehow work around this without requiring additional data fields in the proxy - we need to keep the memory footprint of this proxy very lean - there is a potential of having tens of thousands of these for those with large agent inventories so the smaller the better.

Comment 10 John Mazzitelli 2013-11-27 15:11:19 UTC
side note: just to get a rough estimate of the size of this proxy, I made the proxy serializable (had to make some things transient, like the classloader and executor references), serialized it, and got the size in bytes of the serialized data. The size was 500 bytes. So, figure each object's memory footprint isn't going to be larger than 1,000 bytes. So for each 10,000 resources in inventory, this could mean adding at least 5MB and probably closer to 10MB to the agent's footprint. This was after I took out the logger data field (the proxy now only creates the logger when it needs to log a message which, in this proxy's case, is only when unusual conditions happen, like we've timed out over 5 times or things like that).
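
The sizing trick described here amounts to something like the following sketch (ProxyLikeObject is a stand-in, not the real proxy class, and the serialized size is only a ballpark for the live heap footprint):

    import java.io.ByteArrayOutputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    public class FootprintEstimate {

        // A stand-in for the proxy: non-serializable collaborators (classloader, executor)
        // are marked transient, as described above.
        static class ProxyLikeObject implements Serializable {
            transient Object executor;
            long lastCheckTime;
            byte timeoutCount;
        }

        public static void main(String[] args) throws Exception {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(new ProxyLikeObject());
            }
            int perObject = bytes.size();
            // Comment 10's arithmetic: ~500 bytes serialized, assume <= 1,000 bytes live,
            // so 10,000 resources works out to roughly 5-10 MB of extra agent footprint.
            System.out.printf("serialized size: %d bytes; 10,000 objects ~ %.1f MB%n",
                    perObject, perObject * 10_000 / (1024.0 * 1024.0));
        }
    }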

Comment 11 John Mazzitelli 2013-11-27 18:20:08 UTC
The origin/jay-avail branch has the latest code, with tests passing. I think this is it. After the US holidays, we'll merge the latest master into it, make sure it's still working, then merge it into master so others can beat on it.

Comment 12 Elias Ross 2013-11-27 22:39:12 UTC
(In reply to John Mazzitelli from comment #9)
> 
> Yeah, this is a problem. I have a unit test that I can see this happen. I
> spawn a few threads and concurrently call getAvailability and I can clearly
> see the problem. It makes sense when I look at the code. I'll see if I can
> somehow work around this without requiring additional data fields in the
> proxy - we need to keep the memory footprint of this proxy very lean - there
> is a potential of having tens of thousands of these for those with large
> agent inventories so the smaller the better.

I think the solution is to have getLiveAvailability do something different: basically, only use the proxy for scheduled availability checks. I'm not sure how workable this is without changing a lot of code, but I'm throwing it out there.

It does make some sense for live availability checks to work synchronously anyway, letting the client handle cancellation as appropriate.
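
A sketch of that split, using stand-in names rather than the real RHQ API: scheduled reports keep going through the async proxy, while a live request bypasses it and hands the caller a Future to wait on, time out, or cancel as it sees fit:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Future;

    public class LiveAvailSketch {

        enum AvailabilityType { UP, DOWN }

        interface AvailabilityFacet {
            AvailabilityType getAvailability();
        }

        // A live request bypasses the async proxy: run the real check on the caller's behalf
        // and let the caller decide how long to wait and whether to cancel.
        static Future<AvailabilityType> liveAvailability(final AvailabilityFacet component,
                                                         ExecutorService executor) {
            Callable<AvailabilityType> probe = component::getAvailability; // the unproxied check
            return executor.submit(probe);
        }
    }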

Comment 13 John Mazzitelli 2013-12-02 13:04:40 UTC
Created attachment 831568 [details]
patch for avail async in plugin container

attaching patch against 4.10 master branch (applied on top of dc4622217d50fd9864b1fe64a424c5a7de507908)

Comment 14 John Mazzitelli 2013-12-03 22:35:58 UTC
This has been pushed to master: bbb18b7

Comment 15 John Mazzitelli 2013-12-06 17:55:40 UTC
AvailTest was failing - it appears that, due to the singleton nature of PluginContainer and ResourceContainer, we need the new avail proxy to call into ResourceContainer to submit tasks (rather than the proxy storing the instance of the executor thread pool), or we need to refresh its thread pool reference with the latest one. This is because during testing we shut down and reinitialize the PC over and over, and in some cases the proxy's thread pool has been shut down. (ResourceContainer.initialize creates a new pool after its .shutdown() shuts down the old one - but the stale thread pool references remain in the proxy.) Thanks to Lukas for finding this out.
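
A tiny sketch of that resolve-at-submit-time idea (ContainerHolder stands in for ResourceContainer; this is illustrative, not the actual fix):

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class CurrentPoolSketch {

        // Stands in for the container that owns (and recreates) the availability thread pool.
        static class ContainerHolder {
            private volatile ExecutorService availExecutor = Executors.newFixedThreadPool(8);

            ExecutorService currentAvailExecutor() {
                return availExecutor;
            }

            // What the tests do over and over: shut down and reinitialize.
            void restart() {
                availExecutor.shutdownNow();
                availExecutor = Executors.newFixedThreadPool(8);
            }
        }

        // The proxy resolves the pool at submit time instead of caching a reference
        // that may point at an already-shutdown executor.
        static <T> Future<T> submitCheck(ContainerHolder container, Callable<T> check) {
            return container.currentAvailExecutor().submit(check);
        }
    }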

I'll try to work around this problem in the testing framework somehow. Putting this back to ON_DEV.

Comment 16 John Mazzitelli 2013-12-09 17:51:09 UTC
Git commits to master related to this issue:

f89a8b4ce8c647c93255db0645758eded2423239
273c9a9593fe488f5f1ac85dfcf66253d1f19d25
988034f36abce65ed3bf4a6fa1846a0d540398ba

Comment 17 Heiko W. Rupp 2014-04-23 12:30:45 UTC
Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.