Bug 820709
Summary: | [plugin-container] with two or more AS7 host controllers in inventory, a full discovery scan takes an unacceptably long time | |||
---|---|---|---|---|
Product: | [Other] RHQ Project | Reporter: | Libor Zoubek <lzoubek> | |
Component: | Plugin Container | Assignee: | Jay Shaughnessy <jshaughn> | |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Foley <mfoley> | |
Severity: | high | Docs Contact: | ||
Priority: | urgent | |||
Version: | 4.4 | CC: | ahovsepy, ccrouch, hrupp, jshaughn, mazz, theute | |
Target Milestone: | --- | |||
Target Release: | RHQ 4.10 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 823564 (view as bug list) | Environment: | ||
Last Closed: | 2014-02-18 21:36:59 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 707223, 823564 |
Description
Libor Zoubek
2012-05-10 18:07:04 UTC
I was able to easily reproduce this. For me, "discovery -f" took almost 7 minutes to complete, which is obviously way too long. I sent a QUIT signal to the Agent JVM to trigger a thread dump, and I noticed that in the discovery scan thread, a call to StreamUtil.copy() via ASConnection.executeRaw() was blocking on a call to inputStream.read(). I waited a couple minutes then triggered another thread dump and saw the discovery thread was blocked at the same spot. We set the read timeout on the HttpURLConnection to 10 seconds, so each call to ASConnection.executeRaw() that ends up hanging while reading the response, will take 10 seconds to timeout (assuming the read timeout is working). If there are 100's of such calls, that would explain why the discovery scan is taking so long to complete. It turns out HttpURLConnection.getInputStream().read() does not always return -1 when the entire response body has been read. Therefore, it's necessary for us to detect when the number of bytes we've read is equal to the Content-Length provided by the AS7 server, and to close the input stream and return the response at that point. http://git.fedorahosted.org/git?p=rhq/rhq.git;a=commitdiff;h=95ff81d implements this. I can tell it works, because "discovery -f" now completes in 9 seconds, rather than 7 minutes. An additional commit that fixes a regression caused by 95ff81d: http://git.fedorahosted.org/git?p=rhq/rhq.git;a=commitdiff;h=96533ac It turns out the read() calls blocking was not the issue. AS7 actually doesn't even set the Content-Length header on management responses, so the Content-Length checking code I added is essentially of no use. Considering how many resource types the AS7 plugin defines, AS7 service/runtime discovery is expected to take a while. However, it is taking about twice as long as it should due to a regression introduced by the fix for https://bugzilla.redhat.com/show_bug.cgi?id=534186, which results in discoverResources() being called twice per child ResourceType during a given runtime scan. The bug is in RuntimeDiscoveryExecutor and appears to be some redundant recursion on child resources. I've added an integration test that checks the bug: http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=de82e5b The test currently fails but should pass once this is fixed. I should also see "discovery -f" take about 3.5 minutes, rather than 7 minutes, with two AS7 domain controllers running and inventoried. Jay S. committed a potential fix for this yesterday: two git commits to master: c08b19ffc73d646766263a673c6e96f0f1024b03 25da84131727cc76c3ebae200f1db4b9e550c3af We also need to make sure any changes in this area of the discovery code does not regress back to reintroduce bug 534186 - for the record, I tested to see if jay's fix broke that again and it does not. The test replication procedures in bug 534186 pass successfully for me. Which makes me happy. note: I checked out git commit de82e5b9603bf9cd2ec8261b8cb996edbd5a838b (which was just before jay's fixes), I cherry picked the new unit tests but did NOT cherry pick jay's fixes. I deleted InventoryManagerTest from the plugin-container-itests module (because it had merge conflicts - but it didn't have the test I'm interested in anyway, so its no big deal) and I ran DiscoveryTest (which DOES have the unit test that we want to see pass) and it failed (as expected). It shows discovery components getting called multiple times. So, I see the new test failing prior to jay's fix and passing after jay's fix. git commit to master: 0f1f93cdd17d6041120907a150079868996651d3 this is the new test that shows this fix working. run this test prior to the fix, and it fails. reopen under RHQ 4.5.0 (last master build) please get attached screen-shots for different situations: 1. two AS servers started (domain and standalone) - discovery takes ~6 mins 2. no AS server started - discovery takes ~1sec 3. one AS server started - discovery takes ~1min As far as I know the amount of time currently taken by discovery is simply the time it takes to run through AS-7 servers, and there is no regression or bug here. Unless this is an issue generated outside of our internal groups I'd suggest we close this. Asking for triage... I'd say the assumption that this is the time to go through the as7 tree is right. We may in the future apply a more clever way of discovery where we do less round tripping, but this would be a plugin-internal thing. So yes, I think this can be closed. Closing, no more work planned currently. |