Bug 1126208

Summary: Purge job can hang indefinitely
Product: [Other] RHQ Project
Reporter: Elias Ross <genman>
Component: Core Server, Storage Node
Assignee: Nobody <nobody>
Status: NEW
Severity: unspecified
Priority: unspecified
Version: 4.12
CC: hrupp
Target Milestone: ---   
Target Release: RHQ 4.13   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Type: Bug
Bug Blocks: 1133605    

Description Elias Ross 2014-08-03 15:42:42 UTC
Description of problem:

The purge process can get stuck waiting for a semaphore. If something goes wrong during processing, the job will effectively "run" (hang) forever.

(Unfortunately I lost the stack trace.)

04:00:00,008 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-1) Data Purge Job STARTING
04:00:00,014 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-1) Measurement data compression starting at Sun Aug 03 04:00:00 UTC 2014
04:00:00,014 INFO  [org.rhq.server.metrics.aggregation.AggregationManager] (RHQScheduler_Worker-1) Starting aggregation for time slice 2014-08-03T03:00:00.000Z
... then nothing is logged

There must be a leaky release().
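To illustrate the kind of leak suspected here, the following is a minimal sketch in plain Java, not RHQ code. The class PermitGuard and its methods are hypothetical names introduced only for this example. It shows two defensive habits: acquiring with a timeout so a lost permit cannot block a caller forever, and releasing in a finally block so a failing task still returns its permit.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; the real aggregation code is structured differently.
class PermitGuard {
    private final Semaphore permits;

    PermitGuard(int count) {
        this.permits = new Semaphore(count);
    }

    /**
     * Runs the task while holding one permit.
     * Returns false if no permit became available within the timeout.
     */
    boolean runWithPermit(Runnable task, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (!permits.tryAcquire(timeout, unit)) {
            return false; // give up instead of blocking forever
        }
        try {
            task.run();
            return true;
        } finally {
            permits.release(); // executed even when task.run() throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) throws InterruptedException {
        PermitGuard guard = new PermitGuard(1);
        try {
            guard.runWithPermit(() -> {
                throw new IllegalStateException("simulated failure");
            }, 1, TimeUnit.SECONDS);
        } catch (IllegalStateException expected) {
            // The task failed, but its permit has already been returned.
        }
        System.out.println("permits after failure: " + guard.availablePermits());
        // prints "permits after failure: 1"
    }
}
```

Without the finally block, the simulated failure would leave zero permits available, and the next plain acquire() would wait indefinitely, which matches the symptom of the log going silent.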

Version-Release number of selected component (if applicable): 4.12

Comment 1 John Sanda 2014-08-06 01:31:17 UTC
I have done some initial investigation based on errors provided by Elias. In that case the problem was due to a lack of exception handling in Guava's Futures.transform(ListenableFuture, Function) method. The function call is wrapped in an AsyncFunction, which lacks the exception handling that we have in Futures.transform(ListenableFuture, AsyncFunction). I made some changes [1] to add the necessary exception handling, but there are probably other areas that need to be addressed as well. I think the best thing to do is to set an uncaught exception handler so that we can terminate aggregation when any unexpected error occurs.

[1] https://github.com/jsanda/rhq/commit/b2775e5d0621f45df40737b73a9e88ac594fa287
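The uncaught exception handler idea from the comment above could look roughly like the sketch below. It uses only java.util.concurrent rather than Guava, and it is not taken from the linked commit. The class AbortOnUncaught, the method runBatches, and the latch-based bookkeeping are assumptions made for illustration. The point is that an unexpected error on a worker thread records the failure and unblocks the waiting caller instead of leaving it parked.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration; names and structure are not from RHQ.
class AbortOnUncaught {

    /** Returns the first error that ended the run, or null if every batch finished. */
    static Throwable runBatches(Runnable[] batches, long timeoutSeconds)
            throws InterruptedException {
        CountDownLatch remaining = new CountDownLatch(batches.length);
        AtomicReference<Throwable> failure = new AtomicReference<>();

        ExecutorService workers = Executors.newFixedThreadPool(2, r -> {
            Thread t = new Thread(r, "aggregation-worker");
            t.setUncaughtExceptionHandler((thread, error) -> {
                failure.compareAndSet(null, error); // remember the first error
                while (remaining.getCount() > 0) {
                    remaining.countDown();          // wake the waiting caller
                }
            });
            return t;
        });
        try {
            for (Runnable batch : batches) {
                // execute(), not submit(): submit() would capture the exception
                // in a Future and the handler would never be invoked.
                workers.execute(() -> {
                    batch.run();
                    remaining.countDown();
                });
            }
            if (!remaining.await(timeoutSeconds, TimeUnit.SECONDS)) {
                failure.compareAndSet(null,
                        new TimeoutException("aggregation did not finish"));
            }
            return failure.get();
        } finally {
            workers.shutdownNow(); // stop any batches still queued or running
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable ok = () -> { };
        Runnable broken = () -> { throw new IllegalStateException("simulated failure"); };
        Throwable result = runBatches(new Runnable[] { ok, broken, ok }, 5);
        System.out.println(result == null ? "completed" : "aborted: " + result);
    }
}
```

The timeout on await is a second safety net: even if some failure path bypasses the handler, the caller returns with an error instead of hanging.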