Bug 826047
Summary: | Slow availability check can hang discovery | ||
---|---|---|---|
Product: | [Other] RHQ Project | Reporter: | Jay Shaughnessy <jshaughn> |
Component: | Agent, Plugin Container | Assignee: | Jay Shaughnessy <jshaughn> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Foley <mfoley> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.4 | CC: | hrupp |
Target Milestone: | --- | ||
Target Release: | RHQ 4.5.0 | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2013-09-01 10:03:04 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Jay Shaughnessy
2012-05-29 13:22:25 UTC
In addition to the fix for the long avail check, I've looked further as to whether all of the avail checking is really required. After various IRC discussion it's agreed that it is not worth performing the avail check for resources with current avail = UP. This will eliminate most avail checking during service discovery, as in production most resources are UP and stay UP. Supporting IRC chat: (4:16:50 PM) jshaughn: I sort of wonder whether it's worth doing an avail check at all during service discovery. (4:17:22 PM) jshaughn: it certainly can slow things down, and it's old school because it's not in any way staggered. (4:17:27 PM) jshaughn: so it could spike things (4:18:06 PM) jshaughn: We need to do an initial getAvailability call for *newly discovered* resources (4:18:41 PM) jshaughn: (i.e. resources with a null current avail) but for existing resources, I'm not sure we shouldn't just take the current avail (4:18:51 PM) jshaughn: I'm toying with that (4:19:08 PM) jshaughn: if it's not UP we wouldn't do a discovery (4:19:35 PM) jshaughn: but the chances that the current avail is DOWN and in reality it's up, is small. (4:20:10 PM) jshaughn: conversely, if the current avail is UP but it's really down, the discovery may be invalid/hang/whatever (4:20:48 PM) jshaughn: mazz, ips, jsanda, do you guys have an opinion on this rant ^ (4:21:39 PM) mazz: I think the reason why we did that (4:21:54 PM) mazz: was if a resource isn't up - no sense trying to connect to it to see if it has children (4:22:02 PM) mazz: that said, if we look at the last known state (4:22:04 PM) mazz: that may be good enough (4:22:10 PM) mazz: and just don't do a live avail check there (4:23:01 PM) jshaughn: that's what I'm thinking, interactively a user could always do an avail --force followed by discovery -f if that's what they needed (4:23:19 PM) jshaughn: and this way we speed up discovery with, I think, very little downside (4:43:34 PM) ips: jshaughn: i'm on the fence about live avail check or not (4:45:18 PM) jshaughn: I'm just not sure it's worth the added footprint (4:45:24 PM) ips: you could possibly trust the cached avail if it's UP (4:45:38 PM) ips: but if it's DOWN, do a live check, just to make sure it's not back up (4:45:46 PM) jshaughn: we could do that (4:46:05 PM) jshaughn: to ensure that we don't miss an opportunity, I suppose (4:46:10 PM) ips: right (4:46:55 PM) jshaughn: hmmm, I don't love that but I guess I could live with it (4:47:14 PM) jshaughn: because I feel like the current avail will likely always be valid (5:09:01 PM) jshaughn: ok, I made the change locally but haven't run it yet. It defers to ips and will still do an avail check for all non-UP resources. This should be low overhead since most resources are UP and stay UP, but this way our intermittent discovery will not bypass something that just came up. After some testing, a tweak to the above. And another IRC chat: (11:36:01 AM) jshaughn: hmmm, so I see the downside now of not performing an avail check prior to discovery. In at least one case the discovery code is super slow if invoked when the resource is actually down (11:36:41 AM) jshaughn: The AS7 standalone component goes through a lot of work even though it is down. Over 30s of failed discovery. (11:36:53 AM) jshaughn: perhaps I should always do an avail check for server resources (11:37:46 AM) ips: interesting.. (11:38:32 AM) ips: may make sense to keep it then, just adding the facet timeout 12:00:36 PM) jshaughn: I think just doing a check on servers is sufficient. Even at that, I'm not sure it's worth it. It slows things down and the window for server avail check is only a minute by default. But it will guarantee that we don't start a really slow server discovery for a DOWN server. With that change I now see decent times for that recent test. master commit b2226a282fbaa992775f5fa11cb138254ba6f23f - Now uses proxy and applies configured avail timeout to getAvailability() calls. - Also, now only perform an avail check on server resources or if the current avail is NOT UP. (i.e. assume UP services are still UP, which eliminates a lot of avail checking for the standard case) Test Notes: You can use agent prompt discovery -f in various ways and make sure times are reasonable. Example: > Import only RHQ agent/server > discovery -f > Import a running EAP7 > discovery -f > uninventory EAP7 service, like Infinispan > discovery -f - should rediscover and reimport > shut down EAP7 > discovery -f - should still be fast, maybe slightly slower due to avail checking on the EAP7 > uninventory EAP7 service, like Infinispan > discovery -f - should be fast and not rediscover since EAP is down > start EAP7 > discovery -f - should rediscover and reimport Bulk closing of items that are on_qa and in old RHQ releases, which are out for a long time and where the issue has not been re-opened since. |