Bug 826047

Summary: Slow availability check can hang discovery
Product: [Other] RHQ Project Reporter: Jay Shaughnessy <jshaughn>
Component: Agent, Plugin ContainerAssignee: Jay Shaughnessy <jshaughn>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.4CC: hrupp
Target Milestone: ---   
Target Release: RHQ 4.5.0   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-09-01 06:03:04 EDT Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:

Description Jay Shaughnessy 2012-05-29 09:22:25 EDT
Availability checks are performed during a service discovery.  These checks are currently directly calling the the resource component.  That means the discovery executor waits until the avail check returns.  The avail checks are already a lot of overhead, but if the avail check is slow it can unacceptably hold up discovery.

The discovery executor should call the component in the same way that the availability executor does, using a component proxy that applies the configured avail check timeout.
Comment 1 Jay Shaughnessy 2012-05-29 09:30:06 EDT
In addition to the fix for the long avail check, I've looked further as to whether all of the avail checking is really required.

After various IRC discussion it's agreed that it is not worth performing the avail check for resources with current avail = UP.  This will eliminate most avail checking during service discovery, as in production most resources are UP and stay UP.

Supporting IRC chat:

(4:16:50 PM) jshaughn: I sort of wonder whether it's worth doing an avail check at all during service discovery.
(4:17:22 PM) jshaughn: it certainly can slow things down, and it's old school because it's not in any way staggered.
(4:17:27 PM) jshaughn: so it could spike things
(4:18:06 PM) jshaughn: We need to do an initial getAvailability call for *newly discovered* resources
(4:18:41 PM) jshaughn: (i.e. resources with a null current avail) but for existing resources, I'm not sure we shouldn't just take the current avail
(4:18:51 PM) jshaughn: I'm toying with that
(4:19:08 PM) jshaughn: if it's not UP we wouldn't do a discovery
(4:19:35 PM) jshaughn: but the chances that the current avail is DOWN and in reality it's up, is small.
(4:20:10 PM) jshaughn: conversely, if the current avail is UP but it's really down, the discovery may be invalid/hang/whatever
(4:20:48 PM) jshaughn: mazz, ips, jsanda, do you guys have an opinion on this rant ^
(4:21:39 PM) mazz: I think the reason why we did that
(4:21:54 PM) mazz: was if a resource isn't up - no sense trying to connect to it to see if it has children
(4:22:02 PM) mazz: that said, if we look at the last known state
(4:22:04 PM) mazz: that may be good enough
(4:22:10 PM) mazz: and just don't do a live avail check there
(4:23:01 PM) jshaughn: that's what I'm thinking, interactively a user could always do an avail --force followed by discovery -f if that's what they needed
(4:23:19 PM) jshaughn: and this way we speed up discovery with, I think, very little downside
(4:43:34 PM) ips: jshaughn: i'm on the fence about live avail check or not
(4:45:18 PM) jshaughn: I'm just not sure it's worth the added footprint
(4:45:24 PM) ips: you could possibly trust the cached avail if it's UP
(4:45:38 PM) ips: but if it's DOWN, do a live check, just to make sure it's not back up
(4:45:46 PM) jshaughn: we could do that
(4:46:05 PM) jshaughn: to ensure that we don't miss an opportunity, I suppose
(4:46:10 PM) ips: right
(4:46:55 PM) jshaughn: hmmm, I don't love that but I guess I could live with it
(4:47:14 PM) jshaughn: because I feel like the current avail will likely always be valid
(5:09:01 PM) jshaughn: ok, I made the change locally but haven't run it yet.  It defers to ips and will still do an avail check for all non-UP resources. This should be low overhead since most resources are UP and stay UP, but this way our intermittent discovery will not bypass something that just came up.
Comment 2 Jay Shaughnessy 2012-05-29 13:25:52 EDT
After some testing, a tweak to the above. And another IRC chat:

(11:36:01 AM) jshaughn: hmmm, so I see the downside now of not performing an avail check prior to discovery.  In at least one case the discovery code is super slow if invoked when the resource is actually down
(11:36:41 AM) jshaughn: The AS7 standalone component goes through a lot of work even though it is down.  Over 30s of failed discovery.
(11:36:53 AM) jshaughn: perhaps I should always do an avail check for server resources
(11:37:46 AM) ips: interesting..
(11:38:32 AM) ips: may make sense to keep it then, just adding the facet timeout
12:00:36 PM) jshaughn: I think just doing a check on servers is sufficient.  Even at that, I'm not sure it's worth it.  It slows things down and the window for server avail check is only a minute by default.  But it will guarantee that we don't start a really slow server discovery for a DOWN server.  With that change I now see decent times for that recent test.
Comment 3 Jay Shaughnessy 2012-05-29 13:32:26 EDT
master commit b2226a282fbaa992775f5fa11cb138254ba6f23f

- Now uses proxy and applies configured avail timeout to
  getAvailability() calls.
- Also, now only perform an avail check on server resources or
  if the current avail is NOT UP. (i.e. assume UP services are still
  UP, which eliminates a lot of avail checking for the standard case)



Test Notes:
You can use agent prompt discovery -f in various ways and make sure times
are reasonable.  Example:

> Import only RHQ agent/server
> discovery -f
> Import a running EAP7
> discovery -f
> uninventory EAP7 service, like Infinispan
> discovery -f
  - should rediscover and reimport
> shut down EAP7
> discovery -f
  - should still be fast, maybe slightly slower due to avail checking on the EAP7
> uninventory EAP7 service, like Infinispan
> discovery -f
  - should be fast and not rediscover since EAP is down
> start EAP7
> discovery -f
  - should rediscover and reimport
Comment 4 Heiko W. Rupp 2013-09-01 06:03:04 EDT
Bulk closing of items that are on_qa and in old RHQ releases, which are out for a long time and where the issue has not been re-opened since.