Bug 820709

Summary: [plugin-container] with two or more AS7 host controllers in inventory, a full discovery scan takes an unacceptably long time
Product: [Other] RHQ Project Reporter: Libor Zoubek <lzoubek>
Component: Plugin ContainerAssignee: Jay Shaughnessy <jshaughn>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: high Docs Contact:
Priority: urgent    
Version: 4.4CC: ahovsepy, ccrouch, hrupp, jshaughn, mazz, theute
Target Milestone: ---   
Target Release: RHQ 4.10   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 823564 (view as bug list) Environment:
Last Closed: 2014-02-18 21:36:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 707223, 823564    

Description Libor Zoubek 2012-05-10 18:07:04 UTC
Description of problem: I am facing performance issue. Default timeout for Manual Autodiscovery is 10minutes. When I have 2 EAPs on same host (and one of them runs in Domain mode) detailed autodiscovery mostly fails because of timeout. If it does not fail it finishes close to 10 minutes. 


Version-Release number of selected component (if applicable):
JON 3.1.CR1
EAP6 ER6

How reproducible:almost always


Steps to Reproduce:
1. stand up at least 2 EAP6 instances on same host (more is better) I am working with 1 standalone and 1 domain instance
2. stand up agent, import whole platform with all children
3. wait 'till everyting is imported (this can take quite a long time 5-10minutes)
4. run manual autodiscovery on platform with 2 EAPs
  
Actual results: autodiscovery takes almost 10 minutes, sometimes it fails because of default timeout. This is too much. I was observing machine when discovery was running and there are no symptoms of overloading, average CPU usage was at 4%.


Expected results: Autodiscovery must be much faster. I suspect as7plugin. Without imported AS7 servers discovery runs pretty fast. Solution would be to implement discovery of each subsystem to run in parallel - but this might be very complicated and might introduce new issues.


Additional info: I was tcpdumping requests sent by agent to EAP6 and really, plugin was requesting both EAPs for all 10 minutes. But it is hard to determine whether requests came from discovery scan or usual availability. When observing requests when discovery is not running, plugin is sending them all the time anyway. This has become an issue when more AS7 subsystems have been implemented to as7plugin.

Comment 1 Ian Springer 2012-05-15 14:41:30 UTC
I was able to easily reproduce this. For me, "discovery -f" took almost 7 minutes to complete, which is obviously way too long. 

I sent a QUIT signal to the Agent JVM to trigger a thread dump, and I noticed that in the discovery scan thread, a call to StreamUtil.copy() via ASConnection.executeRaw() was blocking on a call to inputStream.read(). I waited a couple minutes then triggered another thread dump and saw the discovery thread was blocked at the same spot. We set the read timeout on the HttpURLConnection to 10 seconds, so each call to ASConnection.executeRaw() that ends up hanging while reading the response, will take 10 seconds to timeout (assuming the read timeout is working). If there are 100's of such calls, that would explain why the discovery scan is taking so long to complete.

Comment 2 Ian Springer 2012-05-15 16:00:09 UTC
It turns out HttpURLConnection.getInputStream().read() does not always return -1 when the entire response body has been read. Therefore, it's necessary for us to detect when the number of bytes we've read is equal to the Content-Length provided by the AS7 server, and to close the input stream and return the response at that point.

http://git.fedorahosted.org/git?p=rhq/rhq.git;a=commitdiff;h=95ff81d implements this.

I can tell it works, because "discovery -f" now completes in 9 seconds, rather than 7 minutes.

Comment 3 Ian Springer 2012-05-15 22:40:08 UTC
An additional commit that fixes a regression caused by 95ff81d:

http://git.fedorahosted.org/git?p=rhq/rhq.git;a=commitdiff;h=96533ac

Comment 4 Ian Springer 2012-05-16 17:10:39 UTC
It turns out the read() calls blocking was not the issue. AS7 actually doesn't even set the Content-Length header on management responses, so the Content-Length checking code I added is essentially of no use.

Considering how many resource types the AS7 plugin defines, AS7 service/runtime discovery is expected to take a while. However, it is taking about twice as long as it should due to a regression introduced by the fix for https://bugzilla.redhat.com/show_bug.cgi?id=534186, which results in discoverResources() being called twice per child ResourceType during a given runtime scan. The bug is in RuntimeDiscoveryExecutor and appears to be some redundant recursion on child resources.

I've added an integration test that checks the bug:

http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=de82e5b

The test currently fails but should pass once this is fixed.

I should also see "discovery -f" take about 3.5 minutes, rather than 7 minutes, with two AS7 domain controllers running and inventoried.

Comment 5 John Mazzitelli 2012-05-18 13:57:27 UTC
Jay S. committed a potential fix for this yesterday:

two git commits to master: 

c08b19ffc73d646766263a673c6e96f0f1024b03
25da84131727cc76c3ebae200f1db4b9e550c3af

We also need to make sure any changes in this area of the discovery code does not regress back to reintroduce bug 534186 - for the record, I tested to see if jay's fix broke that again and it does not. The test replication procedures in bug 534186 pass successfully for me. Which makes me happy.

Comment 6 John Mazzitelli 2012-05-18 21:13:36 UTC
note: I checked out git commit de82e5b9603bf9cd2ec8261b8cb996edbd5a838b (which was just before jay's fixes), I cherry picked the new unit tests but did NOT cherry pick jay's fixes. I deleted InventoryManagerTest from the plugin-container-itests module (because it had merge conflicts - but it didn't have the test I'm interested in anyway, so its no big deal) and I ran DiscoveryTest (which DOES have the unit test that we want to see pass) and it failed (as expected). It shows discovery components getting called multiple times.

So, I see the new test failing prior to jay's fix and passing after jay's fix.

Comment 7 John Mazzitelli 2012-05-18 21:14:24 UTC
git commit to master: 0f1f93cdd17d6041120907a150079868996651d3

this is the new test that shows this fix working. run this test prior to the fix, and it fails.

Comment 8 Armine Hovsepyan 2012-09-05 16:48:34 UTC
reopen under RHQ 4.5.0 (last master build)


please get attached screen-shots for different situations:
1. two AS servers started (domain and standalone) - discovery takes ~6 mins
2. no AS server started - discovery takes ~1sec
3. one AS server started - discovery takes ~1min

Comment 9 Jay Shaughnessy 2014-02-06 14:39:05 UTC
As far as I know the amount of time currently taken by discovery is simply the time it takes to run through AS-7 servers, and there is no regression or bug here.

Unless this is an issue generated outside of our internal groups I'd suggest we close this.  Asking for triage...

Comment 10 Heiko W. Rupp 2014-02-18 19:51:58 UTC
I'd say the assumption that this is the time to go through the as7 tree is right.
We may in the future apply a more clever way of discovery where we do less round tripping, but this would be a plugin-internal thing.

So yes, I think this can be closed.

Comment 11 Jay Shaughnessy 2014-02-18 21:36:59 UTC
Closing, no more work planned currently.