Description of problem:

From the server.log file, the read repair storage cluster job is timing out and the job is cancelled. This will, among other things, cancel creation of storage node snapshots. The excerpt from the server.log file:

*************************************
00:30:00,064 INFO  [org.rhq.enterprise.server.scheduler.jobs.StorageClusterReadRepairJob] (RHQScheduler_Worker-2) Preparing to run read repair on storage cluster
00:40:00,068 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc05e in state RUN
00:40:00,087 WARN  [org.hibernate.engine.transaction.synchronization.internal.SynchronizationCallbackCoordinatorImpl] (Transaction Reaper Worker 10) Transaction afterCompletion called by a background thread! Delaying action until the original thread can handle it. [status=4]
00:40:00,093 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 10) ARJUNA012121: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 10,5,main] successfully canceled TX 0:ffff8290a334:36553337:53918159:3cc05e
00:40:01,191 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc065 in state RUN
00:40:01,200 WARN  [org.hibernate.engine.transaction.synchronization.internal.SynchronizationCallbackCoordinatorImpl] (Transaction Reaper Worker 10) Transaction afterCompletion called by a background thread! Delaying action until the original thread can handle it. [status=4]
00:40:01,202 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 10) ARJUNA012121: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 10,5,main] successfully canceled TX 0:ffff8290a334:36553337:53918159:3cc065
00:40:01,907 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc06d in state RUN
00:40:01,912 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 10) ARJUNA012095: Abort of action id 0:ffff8290a334:36553337:53918159:3cc06d invoked while multiple threads active within it.
00:40:01,922 INFO  [org.rhq.enterprise.server.common.TransactionInterruptInterceptor] (Transaction Reaper Worker 10) Transaction [0:ffff8290a334:36553337:53918159:3cc06d] is aborting with active thread [RHQScheduler_Worker-2]. interrupting=[false]: java.lang.Throwable: STACK TRACE OF ACTIVE THREAD IN TERMINATING TRANSACTION ...
00:40:02,410 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc06d in state CANCEL
00:40:02,589 WARN  [org.hornetq.core.server] (Thread-0) HQ222103: transaction with xid XidImpl (698756754 bq:0.0.0.0.0.0.0.0.0.0.-1.-1.-126.-112.-93.52.54.85.51.55.83.-111.-127.89.0.60.-64.112.0.0.0.2.0.0.0.0 formatID:131077 gtxid:0.0.0.0.0.0.0.0.0.0.-1.-1.-126.-112.-93.52.54.85.51.55.83.-111.-127.89.0.60.-64.109.49 base64:AAAAAAAAAAAAAP__gpCjNDZVMzdTkYFZADzAcAAAAAIAAAAAAAAAAAAAAAAAAP__gpCjNDZVMzdTkYFZADzAbTEHAgIA timed out
00:40:02,913 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc06d in state CANCEL_INTERRUPTED
00:40:02,914 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012120: TransactionReaper::check worker Thread[Transaction Reaper Worker 10,5,main] not responding to interrupt when cancelling TX 0:ffff8290a334:36553337:53918159:3cc06d -- worker marked as zombie and TX scheduled for mark-as-rollback
00:40:02,914 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012110: TransactionReaper::check successfuly marked TX 0:ffff8290a334:36553337:53918159:3cc06d as rollback only
01:00:00,074 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Data Purge Job STARTING
*************************************

Version-Release number of selected component (if applicable):
RHQ 4.9

How reproducible:
Sometimes.

Steps to Reproduce:
1. I don't have steps to reproduce this issue - just an excerpt from Sunday's log file as above.

Actual results:
Snapshots were not generated, and there is no other indication of why this was not done.

Expected results:
Transactions do not time out and snapshots are generated.

Additional info:
These are master commits:

commit cec1e931ffdfbf0705b6b17358a25578a1462f7d
Merge: 61a3359 c74195a
Author: jshaughn <jshaughn>
Date:   Thu Jul 17 11:36:48 2014 -0400

    Merge pull request #90 from rhq-project/jshaughn/1111064

    [1111064] Some code changes to avoid a locking issue.

commit c74195a129bd2fb9ec6ef9860b4c9b4c9895ca86
Author: Jay Shaughnessy <jshaughn>
Date:   Thu Jul 17 11:34:16 2014 -0400

    Revert testing change to StorageClusterReadRepairJob, put it back to the original schedule.

commit d1be24fa474505640c396e0968dabd22dcaf2e62
Author: Jay Shaughnessy <jshaughn>
Date:   Wed Jul 16 17:54:37 2014 -0400

    [1111064] Some code changes to avoid a locking issue.

    Presumably the issue has to do with passing an attached entity to an
    asynchronous SLSB method. A few things:

    - make sure we pass a detached entity
    - notify the alert condition cache prior to calling the storage operation hook
    - only call the storage operation hook for updates to existing history
      (not at operation schedule time)
    - remove an unnecessary outer Tx on runClusterMaintenance
    - add some logging

    Note: TODO: revert the hyped-up scheduling of the cluster read repair job
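The first bullet of the fix commit, handing the asynchronous SLSB method a detached entity rather than one still attached to the caller's persistence context, can be sketched roughly as below. This is a simplified illustration, not RHQ's actual code: the class names, the `detachedCopy()` helper, and the use of a plain `ExecutorService` to stand in for an `@Asynchronous` EJB call are all assumptions made for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative stand-in for a JPA entity (names are hypothetical).
class OperationHistory {
    long id;
    String status;

    OperationHistory(long id, String status) {
        this.id = id;
        this.status = status;
    }

    // Detach by copying: the async worker gets its own instance, so the
    // caller's persistence context (and any row locks it holds) is never
    // shared across transactions.
    OperationHistory detachedCopy() {
        return new OperationHistory(id, status);
    }
}

public class DetachedHandOff {
    // Stand-in for the container's @Asynchronous invocation machinery.
    final ExecutorService async = Executors.newSingleThreadExecutor();

    // Models the asynchronous storage operation hook: it must only ever
    // receive a detached entity, never a managed one.
    Future<String> storageOperationHook(OperationHistory detached) {
        return async.submit(() -> "processed " + detached.id);
    }

    public static void main(String[] args) throws Exception {
        DetachedHandOff h = new DetachedHandOff();
        OperationHistory attached = new OperationHistory(42L, "SUCCESS");
        // Hand off a copy, not the (notionally) managed instance.
        String result = h.storageOperationHook(attached.detachedCopy()).get();
        System.out.println(result); // prints "processed 42"
        h.async.shutdown();
    }
}
```

In a real EJB/JPA deployment the same effect is achieved by calling `EntityManager.detach(entity)` (or copying the needed fields) before invoking the `@Asynchronous` method, so the background transaction never operates on an instance managed by the scheduler worker's transaction.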