Running a stress test with 90 agents simultaneously, I found that inventory merging sometimes times out because the server gets clobbered with messages. I raised the @Timeout of DiscoveryServerService.mergeInventoryReport to 30 minutes, but even that wasn't enough: some agents still couldn't get their inventories merged within 30 minutes. This is because the server caps inventory report concurrency at 5 - a limit we need, since even at that low level, reports containing 1,000 resources are taking 3-4 minutes to merge in some cases.

Should we do something in InventoryManager.handleReport() to perform its own retry when it gets a timeout? Since this is a special case where we should expect and deal with timeouts, perhaps something like this (the retry loop is the new part; the rest is existing code):

    if (configuration.isInsideAgent() && (report.getAddedRoots().size() > 0)) {
        log.info("Sending inventory report to server");
        InventoryReportResponse response = null;
        int retriesLeft = 3;            // NEW: retry on timeout
        boolean tryToMerge = true;
        while (tryToMerge) {
            try {
                response = configuration.getServerServices()
                               .getDiscoveryServerService().mergeInventoryReport(report);
                tryToMerge = false;
            } catch (Exception e) {
                // walk the cause chain looking for a timeout
                boolean isTimeout = false;
                for (Throwable c = e; c != null; c = c.getCause()) {
                    if (c instanceof java.util.concurrent.TimeoutException) {
                        isTimeout = true;
                        break;
                    }
                }
                if (isTimeout && retriesLeft-- > 0) {
                    log.error("Inventory merge timed out; will retry", e);
                } else {
                    throw e;
                }
            }
        }
        // Should we retry syncIds too?
        // If the merge succeeds but we later time out on this sync, will bad things happen?
        syncIds(report, response);
    }

Here's the exception you get when the merge times out:

    00:06:34,301 ERROR [InventoryManager.discovery-1] (rhq.core.pc.inventory.RuntimeDiscoveryExecutor) - Error running runtime report
    java.lang.reflect.UndeclaredThrowableException
        at $Proxy154.mergeInventoryReport(Unknown Source)
        at org.rhq.core.pc.inventory.InventoryManager.handleReport(InventoryManager.java:573)
        at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.call(RuntimeDiscoveryExecutor.java:106)
        at org.rhq.core.pc.inventory.RuntimeDiscoveryExecutor.call(RuntimeDiscoveryExecutor.java:49)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
        at java.util.concurrent.FutureTask.run(FutureTask.java:123)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:595)
    Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:211)
        at java.util.concurrent.FutureTask.get(FutureTask.java:85)
        at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.run(ClientCommandSenderTask.java:153)
        at org.rhq.enterprise.communications.command.client.ClientCommandSender.sendSynch(ClientCommandSender.java:615)
        at org.rhq.enterprise.communications.command.client.ClientRemotePojoFactory$RemotePojoProxyHandler.invoke(ClientRemotePojoFactory.java:392)
        ... 11 more
Maybe do that retry looping not in InventoryManager but in RuntimeDiscoveryExecutor.call(): it calls handleReport(), and if we wrap that call in the retry logic, we can ensure the merge and syncIds succeed together, i.e. it helps ensure syncIds runs whenever the merge report succeeds (not a 100% guarantee, but close).
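A generic sketch of that wrapping idea, assuming we retry only when a TimeoutException appears somewhere in the cause chain (including one wrapped in the UndeclaredThrowableException thrown by the remote proxy). The class and method names here (RetryOnTimeout, callWithRetry, the retry count) are hypothetical, not existing RHQ code:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

public class RetryOnTimeout {

    /** True if t, or any cause in its chain, is a TimeoutException. */
    static boolean isTimeout(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof TimeoutException) {
                return true;
            }
        }
        return false;
    }

    /**
     * Runs the task; on a timeout, retries up to maxRetries more times.
     * Non-timeout failures (or exhausted retries) are rethrown as-is.
     */
    static <T> T callWithRetry(Callable<T> task, int maxRetries) throws Exception {
        int retriesLeft = maxRetries;
        while (true) {
            try {
                return task.call();
            } catch (Exception e) {
                if (isTimeout(e) && retriesLeft-- > 0) {
                    // In the real executor we would log the error here before retrying.
                    continue;
                }
                throw e;
            }
        }
    }
}
```

In RuntimeDiscoveryExecutor.call(), wrapping the entire handleReport(report) invocation this way would retry the merge and syncIds as a unit, which is the point of doing it at this level rather than inside handleReport().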
We need to triage how important this is. Quoting the above: "Here's the exception you get when the merge times out: 00:06:34,301 ERROR [InventoryManager.discovery-1] (rhq.core.pc.inventory.RuntimeDiscoveryExecutor) - Error running runtime report". So do we recover after this timeout?
We recover when one of these three things happens:

1) You manually ask for a discovery by executing the platform's "manual discovery" operation.

2) The agent performs its next service scan auto-discovery, which defaults to every 24 hours. agent-configuration.xml can define this:

    <entry key="rhq.agent.plugins.service-discovery.period-secs" value="86400"/>

3) You restart the agent, which runs a new discovery soon after startup (the delay is defined by rhq.agent.plugins.service-discovery.initial-delay-secs).
Pushing to 1.2 - with three available workarounds, it would be hard to justify squeezing this into 1.1.
We tested on a cluster of 2 servers and 300 agents during our 1.1 GA testing and I can't say I ever remember seeing this problem. We had 100K servers and 2K servers - so the inventory was fairly large. Whatever we did, I think we fixed this. We reduced the concurrency limits and who knows what else as part of inventory merging. I'm closing this as "fixed indirectly".
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-98