Description of problem:
The purge process can get stuck waiting for a semaphore. If anything goes wrong during the purge, the job effectively "runs" (hangs) forever. (Unfortunately I lost the stack trace.)

04:00:00,008 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-1) Data Purge Job STARTING
04:00:00,014 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-1) Measurement data compression starting at Sun Aug 03 04:00:00 UTC 2014
04:00:00,014 INFO [org.rhq.server.metrics.aggregation.AggregationManager] (RHQScheduler_Worker-1) Starting aggregation for time slice 2014-08-03T03:00:00.000Z

... then nothing is logged.

There must be a leaky release().

Version-Release number of selected component (if applicable): 4.12
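To illustrate what a "leaky release()" would look like and how it is usually avoided, here is a minimal sketch using only the JDK. It is not the RHQ code: the class PermitGuard and its methods are hypothetical names made up for this example. The point is only that the permit is returned in a finally block, so a task that throws cannot leave the next caller waiting forever.

```java
import java.util.concurrent.Semaphore;

// Hypothetical helper, not part of RHQ: runs a task while holding one permit.
public class PermitGuard {
    private final Semaphore permits;

    public PermitGuard(int permitCount) {
        this.permits = new Semaphore(permitCount);
    }

    // Acquire a permit, run the task, and always give the permit back.
    public void runGuarded(Runnable task) {
        permits.acquireUninterruptibly();
        try {
            task.run();
        } finally {
            // Without this finally block, an exception thrown by task.run()
            // would skip release() and the permit would be lost for good.
            permits.release();
        }
    }

    public int available() {
        return permits.availablePermits();
    }
}
```

With a single permit, a failing task followed by a second call to runGuarded still completes, because the first call returned its permit on the way out.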
I have done some initial investigation based on some errors provided by Elias. In that case the problem was caused by missing exception handling around Guava's Futures.transform(ListenableFuture, Function) method. The function call is wrapped in an AsyncFunction that lacks the exception handling we have for Futures.transform(ListenableFuture, AsyncFunction).

I made some changes[1] to add the necessary exception handling, but there are probably other areas that need to be addressed as well. I think the best thing to do is to set an uncaught exception handler so that we can terminate aggregation when any unexpected error occurs.

[1] https://github.com/jsanda/rhq/commit/b2775e5d0621f45df40737b73a9e88ac594fa287
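As a rough sketch of the uncaught exception handler idea, the following uses only JDK classes and is not taken from the linked commit. The class AbortOnUncaught, the thread name, and the method names are all hypothetical. It assumes the work runs on a thread we create ourselves; a real fix would have to hook into whatever executor the aggregation code actually uses. The handler records the failure and releases the waiter, so the caller wakes up instead of blocking forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration, not RHQ code: a waiter that is always released,
// whether the worker finishes normally or dies with an unexpected error.
public class AbortOnUncaught {
    private final CountDownLatch done = new CountDownLatch(1);
    private final AtomicReference<Throwable> failure = new AtomicReference<>();

    // Build a worker thread whose uncaught errors terminate the wait.
    public Thread newWorker(Runnable task) {
        Thread worker = new Thread(() -> {
            task.run();
            done.countDown(); // normal completion
        }, "aggregation-worker");
        worker.setUncaughtExceptionHandler((thread, error) -> {
            failure.compareAndSet(null, error); // remember what went wrong
            done.countDown();                   // wake the waiter anyway
        });
        return worker;
    }

    // Block until the worker finishes; returns null on success,
    // otherwise the error that ended the work.
    public Throwable await() {
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        }
        return failure.get();
    }
}
```

The caller can then log the returned error and end the job, rather than leaving the scheduler thread parked.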