Bug 536662 (RHQ-98)

Summary: can inventory report merging be more tolerant of bogged down server?
Product: [Other] RHQ Project
Reporter: John Mazzitelli <mazz>
Component: Inventory
Assignee: John Mazzitelli <mazz>
Severity: medium
Priority: high
Version: unspecified
Keywords: Improvement
Target Milestone: ---
Target Release: ---
Hardware: All
OS: All
URL: http://jira.rhq-project.org/browse/RHQ-98
Fixed In Version: 1.2
Doc Type: Enhancement

Description John Mazzitelli 2008-03-14 09:52:00 EDT
While stress testing with 90 agents running simultaneously, I found that inventory merging sometimes times out because the server gets clobbered with messages.  I raised the @Timeout on DiscoveryServerService.mergeInventoryReport to 30 minutes, but this still wasn't enough: some agents still couldn't get their inventories merged within 30 minutes.  The cause is the server's inventory report concurrency limit of 5, which is required because even at that low limit, inventory reports take 3-4 minutes to complete in some cases when they contain 1,000 resources.
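
A rough back-of-the-envelope check of why 30 minutes falls short. It assumes all 90 agents send their reports at about the same time, the server merges at most 5 at once, and every merge takes the worst-case 3 to 4 minutes mentioned above. The class and method names are illustrative only, not RHQ code.

```java
public class MergeQueueEstimate {
    /** Minutes until the last of `agents` reports finishes if at most `limit` merge at once. */
    static int worstCaseMinutes(int agents, int limit, int minutesPerMerge) {
        int waves = (agents + limit - 1) / limit; // ceiling division: number of batches
        return waves * minutesPerMerge;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseMinutes(90, 5, 3)); // 18 waves * 3 min = 54
        System.out.println(worstCaseMinutes(90, 5, 4)); // 18 waves * 4 min = 72
    }
}
```

Under these assumptions the last agents in line would wait roughly 54 to 72 minutes, well past a 30 minute @Timeout.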

Should we do something in InventoryManager.handleReport() to perform its own retry if it gets a timeout?  Since this is a special case and we should expect and deal with timeouts, perhaps we can do something like this (the proposed new code is aligned on the left-most column so you can see what is added to existing code):

            if (configuration.isInsideAgent() && (report.getAddedRoots().size() > 0)) {
                log.info("Sending inventory report to server");
InventoryReportResponse response = null;
boolean tryToMerge = true;
int retry = 3;
while (tryToMerge) {
   try {
                response = configuration.getServerServices().getDiscoveryServerService().mergeInventoryReport(report);
      tryToMerge = false;
   } catch (Exception e) {
      // the stack trace below shows the timeout as the cause of the thrown exception
      boolean timedOut = e.getCause() instanceof java.util.concurrent.TimeoutException;
      if (timedOut && (retry-- > 0)) {
         log.error("Timed out merging inventory report, will retry", e);
      } else {
         throw e;
      }
   }
}

// should we retry syncIds too?
// what happens if we merge but later timeout on this sync? will bad things happen?
                syncIds(report, response);

Here's the exception you get when the merge times out:

00:06:34,301 ERROR [InventoryManager.discovery-1] (rhq.core.pc.inventory.RuntimeDiscoveryExecutor) - Error running runtime report
        at $Proxy154.mergeInventoryReport(Unknown Source)
        at org.rhq.core.pc.inventory.InventoryManager.handleReport(InventoryManager.java:573)
        at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.call(RuntimeDiscoveryExecutor.java:106)
        at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.call(RuntimeDiscoveryExecutor.java:49)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
        at java.util.concurrent.FutureTask.run(FutureTask.java:123)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:595)
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:211)
        at java.util.concurrent.FutureTask.get(FutureTask.java:85)
        at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.run(ClientCommandSenderTask.java:153)
        at org.rhq.enterprise.communications.command.client.ClientCommandSender.sendSynch(ClientCommandSender.java:615)
        at org.rhq.enterprise.communications.command.client.ClientRemotePojoFactory$RemotePojoProxyHandler.invoke(ClientRemotePojoFactory.java:392)
        ... 11 more
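
Since the TimeoutException in this trace appears as a "Caused by" rather than as the top-level exception, any retry check probably has to look through the cause chain. A minimal sketch of such a check follows; the class and method names are hypothetical and not part of RHQ.

```java
import java.util.concurrent.TimeoutException;

public class TimeoutCauses {
    /** Returns true if t, or anything in its cause chain, is a TimeoutException. */
    static boolean isCausedByTimeout(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            if (t instanceof TimeoutException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException("merge failed", new TimeoutException());
        System.out.println(isCausedByTimeout(wrapped));
        System.out.println(isCausedByTimeout(new IllegalStateException("other")));
    }
}
```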
Comment 1 John Mazzitelli 2008-03-14 09:59:19 EDT
maybe do that retry looping not in InventoryManager but in:


it calls handleReport() and if we wrap that in the retry logic, we can ensure both the merge and syncIds are successful together.  i.e. it will help ensure syncIds is done if the merge report is done (it won't be 100% but close)
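
A minimal sketch of that idea, retrying the whole merge-then-sync step as one unit so a timeout anywhere in it re-runs both. RetryOnTimeout and callWithRetry are hypothetical names, and the lambda is only a stand-in for the real handleReport() call; none of this is existing RHQ code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

public class RetryOnTimeout {
    /**
     * Runs the task; if it fails with a TimeoutException (directly or as its
     * immediate cause), retries it up to maxRetries more times.
     */
    static <T> T callWithRetry(Callable<T> task, int maxRetries) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                boolean timedOut = e instanceof TimeoutException
                    || e.getCause() instanceof TimeoutException;
                if (!timedOut || attempt >= maxRetries) {
                    throw e; // not a timeout, or out of retries
                }
                System.err.println("Timed out on attempt " + (attempt + 1) + ", will retry");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Hypothetical stand-in for handleReport(): merge and sync as one unit.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException(new TimeoutException()); // simulated merge timeout
            }
            return "merged+synced";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Retrying the unit means a merge that actually succeeded on the server but timed out on the agent gets sent again, so this only makes sense if merging the same report twice is harmless.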
Comment 2 Charles Crouch 2008-03-14 13:47:34 EDT
We need to triage how important this is:

"Here's the exception you get when the merge times out:

00:06:34,301 ERROR [InventoryManager.discovery-1] (rhq.core.pc.inventory.RuntimeDiscoveryExecutor) - Error running runtime report"

So do we recover after this timeout?
Comment 3 John Mazzitelli 2008-03-14 13:58:04 EDT
we recover when one of these three things happens:

1) you manually ask for a discovery by executing the platform's "manual discovery" operation

2) the agent performs its next service scan auto-discovery - which has a default of 24 hours. agent-configuration.xml can define this:

<entry key="rhq.agent.plugins.service-discovery.period-secs" value="86400"/>

3) restart the agent, which runs a new discovery soon after startup (the delay is defined by rhq.agent.plugins.service-discovery.initial-delay-secs)
Comment 4 Joseph Marques 2008-07-07 20:12:38 EDT
pushing to 1.2 - with three available workarounds, it would be difficult to justify trying to squeeze this into 1.1
Comment 5 John Mazzitelli 2008-10-10 16:59:59 EDT
We tested on a cluster of 2 servers and 300 agents during our 1.1 GA testing and I can't say I ever remember seeing this problem.  We had 100K servers and 2K servers - so the inventory was fairly large.

Whatever we did, I think we fixed this.  We reduced the concurrency limits and who knows what else as part of inventory merging.

I'm closing this as "fixed indirectly".
Comment 6 Red Hat Bugzilla 2009-11-10 16:21:07 EST
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-98