Bug 536662 (RHQ-98) - can inventory report merging be more tolerant of bogged down server?
Summary: can inventory report merging be more tolerant of bogged down server?
Keywords:
Status: CLOSED NEXTRELEASE
Alias: RHQ-98
Product: RHQ Project
Classification: Other
Component: Inventory
Version: unspecified
Hardware: All
OS: All
Priority: high
Severity: medium
Target Milestone: ---
: ---
Assignee: John Mazzitelli
QA Contact:
URL: http://jira.rhq-project.org/browse/RH...
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-03-14 13:52 UTC by John Mazzitelli
Modified: 2008-10-10 20:59 UTC (History)
0 users

Fixed In Version: 1.2
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:



Description John Mazzitelli 2008-03-14 13:52:00 UTC
While stress testing with 90 agents running simultaneously, I found that inventory merging sometimes times out because the server gets clobbered with messages.  I raised the @Timeout of DiscoveryServerService.mergeInventoryReport to 30 minutes.  However, this still wasn't enough: some agents still weren't able to get their inventories merged within 30 minutes.  The reason is that the server needs an inventory report concurrency limit of 5, and even at that low limit, inventory reports take 3-4 minutes to complete in some cases when they contain 1,000 resources.
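
A rough back-of-the-envelope check shows why 30 minutes can fall short. The sketch below is only an estimate built on assumptions that are not in the report: every agent sends exactly one report at the same moment, the server works through them in full batches of 5, and every report takes the worst-case 3-4 minutes. The class and method names are made up for illustration.

```java
// Rough worst-case estimate; assumes one report per agent and full batches.
final class MergeBacklogEstimate {
    /** Minutes until the last agent's report finishes merging. */
    static int worstCaseMinutes(int agents, int concurrencyLimit, int minutesPerReport) {
        int batches = (agents + concurrencyLimit - 1) / concurrencyLimit; // ceiling division
        return batches * minutesPerReport;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseMinutes(90, 5, 3)); // 18 batches * 3 minutes
        System.out.println(worstCaseMinutes(90, 5, 4)); // 18 batches * 4 minutes
    }
}
```

Under those assumptions the last agent would wait 54 to 72 minutes, well past the 30 minute @Timeout. Real reports are a mix of sizes, which fits the observation that only some agents timed out.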

Should we do something in InventoryManager.handleReport() to perform its own retry if it gets a timeout?  Since this is a special case where we should expect and deal with timeouts, perhaps we can do something like this (the proposed new code is aligned on the left-most column so you can see what is added to the existing code):

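            if (configuration.isInsideAgent() && (report.getAddedRoots().size() > 0)) {
                log.info("Sending inventory report to server");
InventoryReportResponse response = null; // declared outside the try so syncIds can see it
boolean tryToMerge = true;
int retry = 3;
while (tryToMerge) {
try {
                response = configuration.getServerServices().getDiscoveryServerService().mergeInventoryReport(report);
tryToMerge = false;
} catch (Exception e) {
// the remote proxy wraps the timeout, so check the cause as well
boolean timedOut = (e instanceof java.util.concurrent.TimeoutException)
      || (e.getCause() instanceof java.util.concurrent.TimeoutException);
if (timedOut && (retry-- > 0)) {
   log.error("Timed out merging inventory report, will retry", e);
} else {
   throw e; // not a timeout, or no retries left
}
}
}

// should we retry syncIds too?
// what happens if we merge but later timeout on this sync? will bad things happen?
                syncIds(report, response);
            }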


Here's the exception you get when the merge times out:

00:06:34,301 ERROR [InventoryManager.discovery-1] (rhq.core.pc.inventory.RuntimeDiscoveryExecutor) - Error running runtime report
java.lang.reflect.UndeclaredThrowableException
        at $Proxy154.mergeInventoryReport(Unknown Source)
        at org.rhq.core.pc.inventory.InventoryManager.handleReport(InventoryManager.java:573)
        at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.call(RuntimeDiscoveryExecutor.java:106)
        at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.call(RuntimeDiscoveryExecutor.java:49)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
        at java.util.concurrent.FutureTask.run(FutureTask.java:123)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:595)
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:211)
        at java.util.concurrent.FutureTask.get(FutureTask.java:85)
        at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.run(ClientCommandSenderTask.java:153)
        at org.rhq.enterprise.communications.command.client.ClientCommandSender.sendSynch(ClientCommandSender.java:615)
        at org.rhq.enterprise.communications.command.client.ClientRemotePojoFactory$RemotePojoProxyHandler.invoke(ClientRemotePojoFactory.java:392)
        ... 11 more
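
Note that in the trace above the TimeoutException is not thrown directly: the proxy wraps it in an UndeclaredThrowableException, so a plain "catch TimeoutException" would miss it. A minimal sketch of a check that walks the cause chain follows. The class and method names are hypothetical, and the depth bound of 20 is an arbitrary guard, not something taken from the RHQ code.

```java
import java.util.concurrent.TimeoutException;

final class TimeoutCheck {
    /** Returns true if t, or any exception in its cause chain, is a TimeoutException. */
    static boolean isTimeout(Throwable t) {
        int depth = 0;
        while (t != null && depth++ < 20) { // bound guards against cyclic cause chains
            if (t instanceof TimeoutException) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }
}
```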


Comment 1 John Mazzitelli 2008-03-14 13:59:19 UTC
Maybe do the retry looping not in InventoryManager but in:

RuntimeDiscoveryExecutor.call()

It calls handleReport(), and if we wrap that call in the retry logic, the merge and syncIds are retried together.  That helps ensure syncIds is done whenever the merge report is done (it won't be 100% but close).
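
One way to picture this is a small generic wrapper that re-runs a whole task when it times out. This is only a sketch: TimeoutRetry, callWithRetry and causedByTimeout are hypothetical names, not RHQ APIs, and the task passed in stands for the existing handleReport() call, which would run the merge and syncIds as one unit.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

final class TimeoutRetry {
    /** Runs task, re-running the whole task up to maxRetries more times when it times out. */
    static <T> T callWithRetry(Callable<T> task, int maxRetries) throws Exception {
        int retriesLeft = maxRetries;
        while (true) {
            try {
                return task.call();
            } catch (Exception e) {
                if (!causedByTimeout(e) || retriesLeft-- <= 0) {
                    throw e; // not a timeout, or out of retries
                }
                // a real agent would log the timeout here before looping again
            }
        }
    }

    /** True if t, or anything in its cause chain, is a TimeoutException. */
    static boolean causedByTimeout(Throwable t) {
        for (int depth = 0; t != null && depth < 20; depth++, t = t.getCause()) {
            if (t instanceof TimeoutException) {
                return true;
            }
        }
        return false;
    }
}
```

With maxRetries set to 3 the task runs at most four times, matching the retry count in the description. Because the whole task is repeated, a retry re-sends the report even if the server had already merged it, so the question from the description about what happens in that case still applies.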

Comment 2 Charles Crouch 2008-03-14 17:47:34 UTC
We need to triage how important this is:

"Here's the exception you get when the merge times out:

00:06:34,301 ERROR [InventoryManager.discovery-1] (rhq.core.pc.inventory.RuntimeDiscoveryExecutor) - Error running runtime report"

So do we recover after this timeout?

Comment 3 John Mazzitelli 2008-03-14 17:58:04 UTC
We recover when one of these three things happens:

1) you manually ask for a discovery by executing the platform's "manual discovery" operation

2) the agent performs its next service scan auto-discovery, which by default runs every 24 hours. agent-configuration.xml can define this period:

<entry key="rhq.agent.plugins.service-discovery.period-secs" value="86400"/>

3) you restart the agent, which runs a new discovery soon after startup (the delay is defined by rhq.agent.plugins.service-discovery.initial-delay-secs)

Comment 4 Joseph Marques 2008-07-08 00:12:38 UTC
pushing to 1.2 - with three available workarounds, it would be difficult to justify trying to squeeze this into 1.1

Comment 5 John Mazzitelli 2008-10-10 20:59:59 UTC
We tested on a cluster of 2 servers and 300 agents during our 1.1 GA testing and I can't say I ever remember seeing this problem.  We had 100K servers and 2K servers - so the inventory was fairly large.

Whatever we did, I think we fixed this.  We reduced the concurrency limits and who knows what else as part of inventory merging.

I'm closing this as "fixed indirectly".

Comment 6 Red Hat Bugzilla 2009-11-10 21:21:07 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-98


