Red Hat Bugzilla – Bug 820709
[plugin-container] with two or more AS7 host controllers in inventory, a full discovery scan takes an unacceptably long time
Last modified: 2015-11-01 19:42:50 EST
Description of problem: I am facing a performance issue. The default timeout for manual autodiscovery is 10 minutes. When I have 2 EAPs on the same host (and one of them runs in domain mode), detailed autodiscovery mostly fails because of the timeout. If it does not fail, it finishes close to 10 minutes.
Version-Release number of selected component (if applicable):
How reproducible: almost always
Steps to Reproduce:
1. stand up at least 2 EAP6 instances on the same host (more is better); I am working with 1 standalone and 1 domain instance
2. stand up agent, import whole platform with all children
3. wait until everything is imported (this can take quite a long time, 5-10 minutes)
4. run manual autodiscovery on platform with 2 EAPs
Actual results: autodiscovery takes almost 10 minutes, and sometimes it fails because of the default timeout. This is too long. I was observing the machine while discovery was running and there were no symptoms of overloading; average CPU usage was at 4%.
Expected results: Autodiscovery must be much faster. I suspect the as7plugin: without imported AS7 servers, discovery runs pretty fast. One solution would be to run discovery of each subsystem in parallel, but this might be very complicated and might introduce new issues.
Additional info: I was tcpdumping requests sent by the agent to EAP6, and the plugin really was sending requests to both EAPs for the whole 10 minutes. But it is hard to determine whether the requests came from the discovery scan or from the usual availability checks; even when discovery is not running, the plugin sends requests all the time. This became an issue as more AS7 subsystems were implemented in the as7plugin.
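For illustration only, the parallel idea mentioned under "Expected results" might look roughly like the sketch below. It is not how the plugin container works; the class and method names are hypothetical, and each subsystem scan is modeled as a plain Callable that returns a string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelDiscoverySketch {
    /**
     * Runs every subsystem scan on a fixed-size pool and returns the results
     * in the same order as the submitted scans. Hypothetical helper, not RHQ API.
     */
    static List<String> discoverAll(List<Callable<String>> scans, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> results = new ArrayList<>();
            // invokeAll blocks until all scans have completed (or failed).
            for (Future<String> done : pool.invokeAll(scans)) {
                results.add(done.get()); // rethrows a scan failure as ExecutionException
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

This only helps if the scans are independent of each other, and it would put more concurrent load on the management interface of each server, which is part of why the reporter expects it to be complicated.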
I was able to easily reproduce this. For me, "discovery -f" took almost 7 minutes to complete, which is obviously way too long.
I sent a QUIT signal to the Agent JVM to trigger a thread dump, and I noticed that in the discovery scan thread, a call to StreamUtil.copy() via ASConnection.executeRaw() was blocking on a call to inputStream.read(). I waited a couple of minutes, then triggered another thread dump and saw the discovery thread was blocked at the same spot. We set the read timeout on the HttpURLConnection to 10 seconds, so each call to ASConnection.executeRaw() that ends up hanging while reading the response will take 10 seconds to time out (assuming the read timeout is working). If there are hundreds of such calls, that would explain why the discovery scan is taking so long to complete.
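As a rough sketch of the timeout behavior described above (not the actual ASConnection code): the class name, the connect timeout value, and the URL used in the usage note are assumptions; only the 10-second read timeout comes from this comment.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class ReadTimeoutSketch {
    // Assumed value for illustration.
    static final int CONNECT_TIMEOUT_MS = 10_000;
    // The 10-second read timeout mentioned in the comment above.
    static final int READ_TIMEOUT_MS = 10_000;

    static HttpURLConnection open(String url) throws IOException {
        // openConnection() only creates the connection object; no network I/O happens yet.
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(CONNECT_TIMEOUT_MS);
        // A read() that blocks longer than this throws java.net.SocketTimeoutException.
        conn.setReadTimeout(READ_TIMEOUT_MS);
        return conn;
    }
}
```

For example, `ReadTimeoutSketch.open("http://localhost:9990/management")` returns a connection whose reads give up after 10 seconds. If each hung read really did cost the full timeout, about 42 of them would already add up to a 7-minute scan.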
It turns out HttpURLConnection.getInputStream().read() does not always return -1 when the entire response body has been read. Therefore, it's necessary for us to detect when the number of bytes we've read is equal to the Content-Length provided by the AS7 server, and to close the input stream and return the response at that point.
http://git.fedorahosted.org/git?p=rhq/rhq.git;a=commitdiff;h=95ff81d implements this.
I can tell it works, because "discovery -f" now completes in 9 seconds, rather than 7 minutes.
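A minimal sketch of the Content-Length-bounded read described above. This is not the code from commit 95ff81d; the class and method names are hypothetical, and the caller is assumed to pass the value of the Content-Length header (for example from HttpURLConnection.getContentLengthLong(), which returns -1 when the header is absent).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BoundedReadSketch {
    /**
     * Reads at most contentLength bytes from the stream and stops as soon as
     * that many bytes have arrived, without waiting for read() to return -1.
     * A negative contentLength means "unknown": read until end of stream.
     */
    static byte[] readBody(InputStream in, long contentLength) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        long total = 0;
        while (contentLength < 0 || total < contentLength) {
            int want = buf.length;
            if (contentLength >= 0) {
                // Never ask for more than the bytes still owed by the server.
                want = (int) Math.min(buf.length, contentLength - total);
            }
            int n = in.read(buf, 0, want);
            if (n == -1) {
                break; // server closed the stream early or length was unknown
            }
            out.write(buf, 0, n);
            total += n;
        }
        return out.toByteArray();
    }
}
```

Note that when the server sends no Content-Length header, the loop behaves exactly like a plain copy and the bound gives no benefit.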
An additional commit that fixes a regression caused by 95ff81d:
It turns out the blocking read() calls were not the issue. AS7 actually doesn't even set the Content-Length header on management responses, so the Content-Length checking code I added is essentially of no use.
Considering how many resource types the AS7 plugin defines, AS7 service/runtime discovery is expected to take a while. However, it is taking about twice as long as it should due to a regression introduced by the fix for https://bugzilla.redhat.com/show_bug.cgi?id=534186, which results in discoverResources() being called twice per child ResourceType during a given runtime scan. The bug is in RuntimeDiscoveryExecutor and appears to be some redundant recursion on child resources.
I've added an integration test that checks the bug:
The test currently fails but should pass once this is fixed.
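To illustrate the kind of check such a test can make, here is a small sketch that counts discovery calls per resource type during one scan and reports any type that was hit more than once. It is not the actual integration test or the RuntimeDiscoveryExecutor logic; TypeNode, scan() and calledMoreThanOnce() are hypothetical stand-ins, and type names are assumed to be unique in the tree.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DiscoveryCallCountSketch {
    /** Hypothetical stand-in for a resource type and its child types. */
    static final class TypeNode {
        final String name;
        final List<TypeNode> children = new ArrayList<>();
        TypeNode(String name) { this.name = name; }
        TypeNode add(TypeNode child) { children.add(child); return this; }
    }

    /** Walks the type tree, recording exactly one discovery call per child type. */
    static void scan(TypeNode parent, Map<String, Integer> calls) {
        for (TypeNode child : parent.children) {
            calls.merge(child.name, 1, Integer::sum); // stands in for discoverResources()
            scan(child, calls);
        }
    }

    /** Returns the names of all types whose discovery was invoked more than once. */
    static List<String> calledMoreThanOnce(Map<String, Integer> calls) {
        List<String> offenders = new ArrayList<>();
        for (Map.Entry<String, Integer> e : calls.entrySet()) {
            if (e.getValue() > 1) {
                offenders.add(e.getKey());
            }
        }
        return offenders;
    }

    /** Convenience: runs one scan from the root and returns the per-type call counts. */
    static Map<String, Integer> countCalls(TypeNode root) {
        Map<String, Integer> calls = new LinkedHashMap<>();
        scan(root, calls);
        return calls;
    }
}
```

A test built this way would assert that calledMoreThanOnce() returns an empty list after a single runtime scan; with the double invocation described above, every child type would show up in it.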
I should also see "discovery -f" take about 3.5 minutes, rather than 7 minutes, with two AS7 domain controllers running and inventoried.
Jay S. committed a potential fix for this yesterday:
two git commits to master:
We also need to make sure any changes in this area of the discovery code do not reintroduce bug 534186. For the record, I tested to see if Jay's fix broke that again, and it does not. The test replication procedures in bug 534186 pass successfully for me. Which makes me happy.
Note: I checked out git commit de82e5b9603bf9cd2ec8261b8cb996edbd5a838b (which was just before Jay's fixes), and I cherry-picked the new unit tests but did NOT cherry-pick Jay's fixes. I deleted InventoryManagerTest from the plugin-container-itests module (because it had merge conflicts, but it didn't have the test I'm interested in anyway, so it's no big deal) and I ran DiscoveryTest (which DOES have the unit test that we want to see pass), and it failed (as expected). It shows discovery components getting called multiple times.
So, I see the new test failing prior to Jay's fix and passing after Jay's fix.
git commit to master: 0f1f93cdd17d6041120907a150079868996651d3
This is the new test that shows this fix working. Run this test prior to the fix, and it fails.
Reopening under RHQ 4.5.0 (last master build).
Please see the attached screenshots for different situations:
1. two AS servers started (domain and standalone) - discovery takes ~6 mins
2. no AS server started - discovery takes ~1sec
3. one AS server started - discovery takes ~1min
As far as I know the amount of time currently taken by discovery is simply the time it takes to run through AS-7 servers, and there is no regression or bug here.
Unless this is an issue generated outside of our internal groups I'd suggest we close this. Asking for triage...
I'd say the assumption that this is the time to go through the as7 tree is right.
We may in the future apply a more clever way of discovery where we do less round tripping, but this would be a plugin-internal thing.
So yes, I think this can be closed.
Closing, no more work planned currently.