Bug 1127868 - Storage node snapshots are not generated when read repair storage cluster job times out [NEEDINFO]
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Storage Node
Version: JON 3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ER01
Target Release: JON 3.3.0
Assignee: John Sanda
QA Contact: Mike Foley
Keywords: Triaged
Depends On: 1111064
Blocks: 1111065
 
Reported: 2014-08-07 17:53 UTC by John Mazzitelli
Modified: 2018-12-06 17:40 UTC

A transaction timeout in the read repair storage cluster job caused the job to be cancelled, which in turn prevented storage node snapshots from being generated. The fix changes the code to avoid a locking issue, correcting the problem.
Clone Of: 1111064
Last Closed: 2014-12-11 14:03:37 UTC
jmorgan: needinfo? (jsanda)



Description John Mazzitelli 2014-08-07 17:53:49 UTC
+++ This bug was initially created as a clone of Bug #1111064 +++

Description of problem:
From the server.log file, the read repair storage cluster job is timing out and the job is being cancelled. This will, among other things, cancel creation of storage node snapshots.

The excerpt from the server.log file:

*************************************
00:30:00,064 INFO  [org.rhq.enterprise.server.scheduler.jobs.StorageClusterReadRepairJob] (RHQScheduler_Worker-2) Preparing to run read repair on storage cluster
00:40:00,068 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc05e in state  RUN
00:40:00,087 WARN  [org.hibernate.engine.transaction.synchronization.internal.SynchronizationCallbackCoordinatorImpl] (Transaction Reaper Worker 10) Transaction afterCompletion called by a background thread! Delaying action until the original thread can handle it. [status=4]
00:40:00,093 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 10) ARJUNA012121: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 10,5,main] successfully canceled TX 0:ffff8290a334:36553337:53918159:3cc05e
00:40:01,191 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc065 in state  RUN
00:40:01,200 WARN  [org.hibernate.engine.transaction.synchronization.internal.SynchronizationCallbackCoordinatorImpl] (Transaction Reaper Worker 10) Transaction afterCompletion called by a background thread! Delaying action until the original thread can handle it. [status=4]
00:40:01,202 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 10) ARJUNA012121: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 10,5,main] successfully canceled TX 0:ffff8290a334:36553337:53918159:3cc065
00:40:01,907 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc06d in state  RUN
00:40:01,912 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 10) ARJUNA012095: Abort of action id 0:ffff8290a334:36553337:53918159:3cc06d invoked while multiple threads active within it.
00:40:01,922 INFO  [org.rhq.enterprise.server.common.TransactionInterruptInterceptor] (Transaction Reaper Worker 10) Transaction [0:ffff8290a334:36553337:53918159:3cc06d] is aborting with active thread [RHQScheduler_Worker-2]. interrupting=[false]: java.lang.Throwable: STACK TRACE OF ACTIVE THREAD IN TERMINATING TRANSACTION
...
00:40:02,410 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc06d in state  CANCEL
00:40:02,589 WARN  [org.hornetq.core.server] (Thread-0) HQ222103: transaction with xid XidImpl (698756754 bq:0.0.0.0.0.0.0.0.0.0.-1.-1.-126.-112.-93.52.54.85.51.55.83.-111.-127.89.0.60.-64.112.0.0.0.2.0.0.0.0 formatID:131077 gtxid:0.0.0.0.0.0.0.0.0.0.-1.-1.-126.-112.-93.52.54.85.51.55.83.-111.-127.89.0.60.-64.109.49 base64:AAAAAAAAAAAAAP__gpCjNDZVMzdTkYFZADzAcAAAAAIAAAAAAAAAAAAAAAAAAP__gpCjNDZVMzdTkYFZADzAbTEHAgIA timed out
00:40:02,913 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff8290a334:36553337:53918159:3cc06d in state  CANCEL_INTERRUPTED
00:40:02,914 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012120: TransactionReaper::check worker Thread[Transaction Reaper Worker 10,5,main] not responding to interrupt when cancelling TX 0:ffff8290a334:36553337:53918159:3cc06d -- worker marked as zombie and TX scheduled for mark-as-rollback
00:40:02,914 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012110: TransactionReaper::check successfuly marked TX 0:ffff8290a334:36553337:53918159:3cc06d as rollback only
01:00:00,074 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-3) Data Purge Job STARTING
*************************************

Version-Release number of selected component (if applicable):
JBoss ON 3.2.0, 3.2.1

How reproducible:
Sometimes.

Steps to Reproduce:
1. I don't have steps to reproduce this issue - just an excerpt from Sunday's log file as above.


Actual results:
Snapshots were not generated and there is no other indication that would show why this was not done.

Expected results:
Transactions are not timing out and snapshots are generated.

Additional info:

--- Additional comment from Jay Shaughnessy on 2014-07-17 11:46:18 EDT ---


I've been working with John on this for the last few days. Although we haven't been able to reproduce it, we think we roughly know why the issue occurs, and the following changes should resolve it if our analysis is correct. In short, we believe the issue has to do with passing an attached entity to an @Asynchronous SLSB method (thus having that entity held in another thread, in a different transaction context), and/or with interactions between sending a JMS message and the asynchronous call.

Because we couldn't reproduce it, we're not positive of the fix, but we've made three different changes, any of which should theoretically resolve the issue.

These are master commits, but should backport easily since the affected code hasn't changed recently.

Setting to POST state for any comments/review before backport.  Assigning to me to reduce jsanda's load...


commit cec1e931ffdfbf0705b6b17358a25578a1462f7d
Merge: 61a3359 c74195a
Author: jshaughn <jshaughn@redhat.com>
Date:   Thu Jul 17 11:36:48 2014 -0400

    Merge pull request #90 from rhq-project/jshaughn/1111064

    [1111064] Some code changes to avoid a locking issue.

commit c74195a129bd2fb9ec6ef9860b4c9b4c9895ca86
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Thu Jul 17 11:34:16 2014 -0400

    revert testing change to StorageClusterReadRepairJob, put it back to the
    original schedule.

commit d1be24fa474505640c396e0968dabd22dcaf2e62
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Wed Jul 16 17:54:37 2014 -0400

    [1111064] Some code changes to avoid a locking issue.  Presumably the
    issue has to do with passing an attached entity to an asynchronous slsb
    method.  A few things:
    - make sure we pass a detached entity
    - notify the alert condition-cache prior to calling the storage operation hook
    - only call the storage operation hook for updates to existing history (not
      at operation schedule-time)
    - remove an unnecessary outer Tx on runClusterMaintenance
    - add some logging

    Note: TODO: revert the hyped up scheduling of the cluster readrepair job

Comment 1 Jay Shaughnessy 2014-08-08 18:29:49 UTC
This was in master when the release/jon3.3x branch was created so the commits are already there.

