Description of problem:

From the server.log file, the read repair storage cluster job is timing out and the job is cancelled. This will, among other things, cancel creation of storage node snapshots. The excerpt from the server.log file:

*************************************
00:30:00,064 INFO  [org.rhq.enterprise.server.scheduler.jobs.StorageClusterReadRepairJob] (RHQScheduler_Worker-2) Preparing to run read repair on storage cluster
00:40:00,068 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc05e in state RUN
00:40:00,087 WARN  [org.hibernate.engine.transaction.synchronization.internal.SynchronizationCallbackCoordinatorImpl] (Transaction Reaper Worker 10) Transaction afterCompletion called by a background thread! Delaying action until the original thread can handle it. [status=4]
00:40:00,093 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 10) ARJUNA012121: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 10,5,main] successfully canceled TX 0:ffff8290a334:36553337:53918159:3cc05e
00:40:01,191 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc065 in state RUN
00:40:01,200 WARN  [org.hibernate.engine.transaction.synchronization.internal.SynchronizationCallbackCoordinatorImpl] (Transaction Reaper Worker 10) Transaction afterCompletion called by a background thread! Delaying action until the original thread can handle it. [status=4]
00:40:01,202 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 10) ARJUNA012121: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 10,5,main] successfully canceled TX 0:ffff8290a334:36553337:53918159:3cc065
00:40:01,907 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc06d in state RUN
00:40:01,912 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 10) ARJUNA012095: Abort of action id 0:ffff8290a334:36553337:53918159:3cc06d invoked while multiple threads active within it.
00:40:01,922 INFO  [org.rhq.enterprise.server.common.TransactionInterruptInterceptor] (Transaction Reaper Worker 10) Transaction [0:ffff8290a334:36553337:53918159:3cc06d] is aborting with active thread [RHQScheduler_Worker-2]. interrupting=[false]: java.lang.Throwable: STACK TRACE OF ACTIVE THREAD IN TERMINATING TRANSACTION ...
00:40:02,410 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc06d in state CANCEL
00:40:02,589 WARN  [org.hornetq.core.server] (Thread-0) HQ222103: transaction with xid XidImpl (698756754 bq:0.0.0.0.0.0.0.0.0.0.-1.-1.-126.-112.-93.52.54.85.51.55.83.-111.-127.89.0.60.-64.112.0.0.0.2.0.0.0.0 formatID:131077 gtxid:0.0.0.0.0.0.0.0.0.0.-1.-1.-126.-112.-93.52.54.85.51.55.83.-111.-127.89.0.60.-64.109.49 base64:AAAAAAAAAAAAAP__gpCjNDZVMzdTkYFZADzAcAAAAAIAAAAAAAAAAAAAAAAAAP__gpCjNDZVMzdTkYFZADzAbTEHAgIA timed out
00:40:02,913 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc06d in state CANCEL_INTERRUPTED
00:40:02,914 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012120: TransactionReaper::check worker Thread[Transaction Reaper Worker 10,5,main] not responding to interrupt when cancelling TX 0:ffff8290a334:36553337:53918159:3cc06d -- worker marked as zombie and TX scheduled for mark-as-rollback
00:40:02,914 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012110: TransactionReaper::check successfuly marked TX 0:ffff8290a334:36553337:53918159:3cc06d as rollback only
01:00:00,074 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Data Purge Job STARTING
*************************************

Version-Release number of selected component (if applicable):
RHQ 4.9

How reproducible:
Sometimes.

Steps to Reproduce:
1. I don't have steps to reproduce this issue - just an excerpt from Sunday's log file as above.

Actual results:
Snapshots were not generated, and there is no other indication of why this was not done.

Expected results:
Transactions do not time out and snapshots are generated.

Additional info:
These are master commits:

commit cec1e931ffdfbf0705b6b17358a25578a1462f7d
Merge: 61a3359 c74195a
Author: jshaughn <jshaughn>
Date:   Thu Jul 17 11:36:48 2014 -0400

    Merge pull request #90 from rhq-project/jshaughn/1111064

    [1111064] Some code changes to avoid a locking issue.

commit c74195a129bd2fb9ec6ef9860b4c9b4c9895ca86
Author: Jay Shaughnessy <jshaughn>
Date:   Thu Jul 17 11:34:16 2014 -0400

    Revert testing change to StorageClusterReadRepairJob, put it back to the original schedule.

commit d1be24fa474505640c396e0968dabd22dcaf2e62
Author: Jay Shaughnessy <jshaughn>
Date:   Wed Jul 16 17:54:37 2014 -0400

    [1111064] Some code changes to avoid a locking issue.

    Presumably the issue has to do with passing an attached entity to an
    asynchronous SLSB method. A few things:

    - make sure we pass a detached entity
    - notify the alert condition cache prior to calling the storage operation hook
    - only call the storage operation hook for updates to existing history
      (not at operation schedule time)
    - remove an unnecessary outer Tx on runClusterMaintenance
    - add some logging

    Note: TODO: revert the hyped-up scheduling of the cluster read repair job
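The first bullet of the fix commit, handing the asynchronous SLSB method a detached entity rather than one still attached to the caller's persistence context, can be sketched roughly as below. This is a simplified illustration, not RHQ's actual code: the class names, the `detachedCopy()` helper, and the use of a plain `ExecutorService` to stand in for an `@Asynchronous` EJB call are all assumptions made for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative stand-in for a JPA entity (names are hypothetical).
class OperationHistory {
    long id;
    String status;

    OperationHistory(long id, String status) {
        this.id = id;
        this.status = status;
    }

    // Detach by copying: the async worker gets its own instance, so the
    // caller's persistence context (and any row locks it holds) is never
    // shared across transactions.
    OperationHistory detachedCopy() {
        return new OperationHistory(id, status);
    }
}

public class DetachedHandOff {
    // Stand-in for the container's @Asynchronous invocation machinery.
    final ExecutorService async = Executors.newSingleThreadExecutor();

    // Models the asynchronous storage operation hook: it must only ever
    // receive a detached entity, never a managed one.
    Future<String> storageOperationHook(OperationHistory detached) {
        return async.submit(() -> "processed " + detached.id);
    }

    public static void main(String[] args) throws Exception {
        DetachedHandOff h = new DetachedHandOff();
        OperationHistory attached = new OperationHistory(42L, "SUCCESS");
        // Hand off a copy, not the (notionally) managed instance.
        String result = h.storageOperationHook(attached.detachedCopy()).get();
        System.out.println(result); // prints "processed 42"
        h.async.shutdown();
    }
}
```

In a real EJB/JPA deployment the same effect is achieved by calling `EntityManager.detach(entity)` (or copying the needed fields) before invoking the `@Asynchronous` method, so the background transaction never operates on an instance managed by the scheduler worker's transaction.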