+++ This bug was initially created as a clone of Bug #1111064 +++

Description of problem:

From the server.log file, the read repair storage cluster job is timing out and the job is cancelled. This will, among other things, cancel creation of storage node snapshots.

The excerpt from the server.log file:

*************************************
00:30:00,064 INFO [org.rhq.enterprise.server.scheduler.jobs.StorageClusterReadRepairJob] (RHQScheduler_Worker-2) Preparing to run read repair on storage cluster
00:40:00,068 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc05e in state RUN
00:40:00,087 WARN [org.hibernate.engine.transaction.synchronization.internal.SynchronizationCallbackCoordinatorImpl] (Transaction Reaper Worker 10) Transaction afterCompletion called by a background thread! Delaying action until the original thread can handle it. [status=4]
00:40:00,093 WARN [com.arjuna.ats.arjuna] (Transaction Reaper Worker 10) ARJUNA012121: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 10,5,main] successfully canceled TX 0:ffff8290a334:36553337:53918159:3cc05e
00:40:01,191 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc065 in state RUN
00:40:01,200 WARN [org.hibernate.engine.transaction.synchronization.internal.SynchronizationCallbackCoordinatorImpl] (Transaction Reaper Worker 10) Transaction afterCompletion called by a background thread! Delaying action until the original thread can handle it. [status=4]
00:40:01,202 WARN [com.arjuna.ats.arjuna] (Transaction Reaper Worker 10) ARJUNA012121: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 10,5,main] successfully canceled TX 0:ffff8290a334:36553337:53918159:3cc065
00:40:01,907 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc06d in state RUN
00:40:01,912 WARN [com.arjuna.ats.arjuna] (Transaction Reaper Worker 10) ARJUNA012095: Abort of action id 0:ffff8290a334:36553337:53918159:3cc06d invoked while multiple threads active within it.
00:40:01,922 INFO [org.rhq.enterprise.server.common.TransactionInterruptInterceptor] (Transaction Reaper Worker 10) Transaction [0:ffff8290a334:36553337:53918159:3cc06d] is aborting with active thread [RHQScheduler_Worker-2]. interrupting=[false]: java.lang.Throwable: STACK TRACE OF ACTIVE THREAD IN TERMINATING TRANSACTION
...
00:40:02,410 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc06d in state CANCEL
00:40:02,589 WARN [org.hornetq.core.server] (Thread-0) HQ222103: transaction with xid XidImpl (698756754 bq:0.0.0.0.0.0.0.0.0.0.-1.-1.-126.-112.-93.52.54.85.51.55.83.-111.-127.89.0.60.-64.112.0.0.0.2.0.0.0.0 formatID:131077 gtxid:0.0.0.0.0.0.0.0.0.0.-1.-1.-126.-112.-93.52.54.85.51.55.83.-111.-127.89.0.60.-64.109.49 base64:AAAAAAAAAAAAAP__gpCjNDZVMzdTkYFZADzAcAAAAAIAAAAAAAAAAAAAAAAAAP__gpCjNDZVMzdTkYFZADzAbTEHAgIA timed out
00:40:02,913 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc06d in state CANCEL_INTERRUPTED
00:40:02,914 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012120: TransactionReaper::check worker Thread[Transaction Reaper Worker 10,5,main] not responding to interrupt when cancelling TX 0:ffff8290a334:36553337:53918159:3cc06d -- worker marked as zombie and TX scheduled for mark-as-rollback
00:40:02,914 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012110: TransactionReaper::check successfuly marked TX 0:ffff8290a334:36553337:53918159:3cc06d as rollback only
01:00:00,074 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Data Purge Job STARTING
*************************************

Version-Release number of selected component (if applicable):
JBoss ON 3.2.0, 3.2.1

How reproducible:
Sometimes.

Steps to Reproduce:
1. I don't have steps to reproduce this issue - just an excerpt from Sunday's log file as above.

Actual results:
Snapshots were not generated, and there is no other indication of why this did not happen.

Expected results:
Transactions do not time out and snapshots are generated.

Additional info:

--- Additional comment from Jay Shaughnessy on 2014-07-17 11:46:18 EDT ---

I've been working with John on this for the last few days. Although we haven't been able to reproduce the problem, we think we roughly know why the issue occurs, and the following changes should resolve it if our analysis is correct.

In short, we believe the issue has to do with passing an attached entity to an Asynchronous SLSB method (thus having that entity held in another thread, in a different transaction context), and/or with interactions between sending a JMS message and the asynchronous call.

Because we couldn't reproduce the problem we're not positive of the fix, but we've made three different changes, any one of which should theoretically resolve the issue. These are master commits, but they should backport easily since the affected code hasn't changed recently.

Setting to POST state for any comments/review before backport. Assigning to me to reduce jsanda's load...

commit cec1e931ffdfbf0705b6b17358a25578a1462f7d
Merge: 61a3359 c74195a
Author: jshaughn <jshaughn>
Date:   Thu Jul 17 11:36:48 2014 -0400

    Merge pull request #90 from rhq-project/jshaughn/1111064

    [1111064] Some code changes to avoid a locking issue.

commit c74195a129bd2fb9ec6ef9860b4c9b4c9895ca86
Author: Jay Shaughnessy <jshaughn>
Date:   Thu Jul 17 11:34:16 2014 -0400

    revert testing change to StorageClusterReadRepairJob, put it back to the original schedule.

commit d1be24fa474505640c396e0968dabd22dcaf2e62
Author: Jay Shaughnessy <jshaughn>
Date:   Wed Jul 16 17:54:37 2014 -0400

    [1111064] Some code changes to avoid a locking issue. Presumably the issue has to do with passing an attached entity to an asynchronous SLSB method.
    A few things:
    - make sure we pass a detached entity
    - notify the alert condition cache prior to calling the storage operation hook
    - only call the storage operation hook for updates to existing history (not at operation schedule-time)
    - remove an unnecessary outer Tx on runClusterMaintenance
    - add some logging

    Note: TODO: revert the hyped-up scheduling of the cluster read repair job
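For illustration only, here is a minimal sketch of the "pass a detached entity" idea described above, assuming hypothetical bean and entity names (HistoryEntity, StorageOperationHookBean, OperationHistoryBean) rather than the actual RHQ classes: the caller flushes its pending change and detaches the JPA entity before invoking the @Asynchronous SLSB method, so the instance handed to the container-managed async thread is no longer bound to the caller's persistence context or transaction.

// Hypothetical sketch only, not the actual RHQ code: it illustrates the
// "pass a detached entity to an @Asynchronous SLSB method" pattern described
// above. All class and method names are made up; each top-level class would
// live in its own source file.

import java.util.concurrent.Future;

import javax.ejb.AsyncResult;
import javax.ejb.Asynchronous;
import javax.ejb.EJB;
import javax.ejb.Stateless;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;
import javax.persistence.PersistenceContext;

@Entity
public class HistoryEntity {
    @Id
    private int id;
    private String status;

    public void setStatus(String status) { this.status = status; }
    public String getStatus() { return status; }
}

@Stateless
public class StorageOperationHookBean {

    // Runs on a container-managed thread in its own transaction; it should only
    // ever be handed a detached entity, never one still attached to the
    // caller's persistence context.
    @Asynchronous
    public Future<Void> handleHistoryUpdate(HistoryEntity detachedHistory) {
        // ... do the storage-cluster-side work with the detached copy ...
        return new AsyncResult<Void>(null);
    }
}

@Stateless
public class OperationHistoryBean {

    @PersistenceContext
    private EntityManager entityManager;

    @EJB
    private StorageOperationHookBean storageHook;

    public void updateHistory(int historyId, String newStatus) {
        // 'history' is managed by this transaction's persistence context.
        HistoryEntity history = entityManager.find(HistoryEntity.class, historyId);
        history.setStatus(newStatus);

        // Flush the pending change, then detach, so the instance handed across
        // the thread boundary is no longer tied to this transaction.
        entityManager.flush();
        entityManager.detach(history);
        storageHook.handleHistoryUpdate(history);
    }
}

If the analysis in the comment above is right, this keeps the async worker from touching an entity still managed by the scheduler thread's transaction, which is presumably what left that transaction held open long enough for the reaper to time it out and cancel the read repair job.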
This was in master when the release/jon3.3x branch was created, so the commits are already there.