Bug 1073691

Summary: Table deadlock issue when processing content report due to missing index and very large transactions in ContentServerServiceImpl.mergeDiscoveredPackages
Product: [JBoss] JBoss Operations Network Reporter: Larry O'Leary <loleary>
Component: Content, PerformanceAssignee: Jay Shaughnessy <jshaughn>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: high Docs Contact:
Priority: unspecified    
Version: JON 3.1.2CC: jshaughn, myarboro, tsegismo
Target Milestone: ER04Keywords: Triaged
Target Release: JON 3.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Legacy issues with ContentServerServiceImpl.mergeDiscoveredPackages caused Content reports to occasionally fail to complete due to a concurrency/deadlock issue. The fix introduces various enhancements to ensure that the process is thread safe, and fixes overall performance issues.
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-12-11 14:00:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Larry O'Leary 2014-03-06 23:55:36 UTC
Description of problem:
Content reports occasionally fail to be processed due to what appears to be a concurrency/deadlock issue. This is evident in the database by it reporting that 5 database sessions are blocking due to a TM lock and the JBoss ON server lock containing many of the following exceptions:

        WARN  [org.hibernate.util.JDBCExceptionReporter] SQL Error: 60, SQLState: 61000
        ERROR [org.hibernate.util.JDBCExceptionReporter] ORA-00060: deadlock detected while waiting for resource
        ERROR [org.hibernate.event.def.AbstractFlushingEventListener] Could not synchronize database state with session
        org.hibernate.exception.LockAcquisitionException: Could not execute JDBC batch update
            at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:87)
            at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43)
            ...
            at $Proxy435.mergeDiscoveredPackages(Unknown Source)
            at org.rhq.enterprise.server.content.ContentServerServiceImpl.mergeDiscoveredPackages(ContentServerServiceImpl.java:56)
            ...
        Caused by: java.sql.BatchUpdateException: ORA-00060: deadlock detected while waiting for resource
            at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:10345)
            at oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:230)
            ...
        WARN  [com.arjuna.ats.arjuna.logging.arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator_2] TwoPhaseCoordinator.beforeCompletion - failed for com.arjuna.ats.internal.jta.resources.arjunacore.SynchronizationImple@3a573ecb
        javax.persistence.PersistenceException: org.hibernate.exception.LockAcquisitionException: Could not execute JDBC batch update
            at org.hibernate.ejb.AbstractEntityManagerImpl.throwPersistenceException(AbstractEntityManagerImpl.java:629)
            at org.hibernate.ejb.AbstractEntityManagerImpl$1.beforeCompletion(AbstractEntityManagerImpl.java:524)
            ...
            at $Proxy435.mergeDiscoveredPackages(Unknown Source)
            at org.rhq.enterprise.server.content.ContentServerServiceImpl.mergeDiscoveredPackages(ContentServerServiceImpl.java:56)
            ...


Version-Release number of selected component (if applicable):
4.4.0.JON312GA

How reproducible:
Occasionally -- between 4AM and 5AM every day or two

Steps to Reproduce:
Uncertain

Additional info:
From the original case that this issue was reported, the problem appears for several minutes between 4AM and 5AM over a 4 day period.

It was also observed that at the time of the deadlocks, the sessions that were in a locked state were also executing the following queries many times:

    delete from RHQ_INSTALLED_PACKAGE where ID=:1
    insert into RHQ_INSTALLED_PACKAGE (RESOURCE_ID, PACKAGE_VERSION_ID, INSTALLATION_TIME, SUBJECT_ID, ID) values (:1 , :2 , :3 , :4 , :5 )

These queries were performing very slow due to a full table scan on RHQ_CREATE_RES_HIST. This table scan is due to the foreign key RHQ_CREATE_RES_HIST.INSTALLED_PACKAGE_ID not having any index defined.

However, it is important to note that even if defining an index on this column, it may not prevent the deadlock. As evident in the Oracle exception, two or more sessions were attempting to lock each others locked tables.

Comment 4 Jay Shaughnessy 2014-09-05 21:43:28 UTC
Setting to ER04 for review of relevant perf/locking.

Comment 5 Jay Shaughnessy 2014-09-17 21:44:26 UTC
This code certainly wasn't scalable due to the umbrella Tx, and I can see how it would be susceptible to locking, due to the same umbrella Tx, and also the order in which it executed various things.

This commit should make things better, maybe significantly. I have yet to actually add any indexing.  I'm not sure it's required but an FK index on rhq_installed_package.resource_id would likely be the first one to add.  As for the suggested FK on rhq_create_res_hist.installed_package_id, I'm not sure that issue was directly influenced by this problem.  But again, FK indexes are certainly useful and I'm not against adding them when the need arises.  But since indexes can themselves cause issues, it's not always a no-brainer.

Committing to master to let it run through automated regression testing and get some soak time.

master commit 9aa0c46edb8a74d7e02557b153b60ae1757e6a32
Author: Jay Shaughnessy <jshaughn>
Date:   Wed Sep 17 17:32:33 2014 -0400

 This is yet another legacy code spot subject to scalability and locking
 issues. I've made a pass at making it more scalable and less likely to lock:
 - No longer process an entire report under an umbrella Tx. Break up the
   work into more, smaller, Txs
 - Reduce # of DB round trips
 - Remove several calls to unindexed queries
   - Note, no new indexes have been added at this point. We are still likely
     doing one full-table scan of rhq_installed_package when getting the
     InstalledPackages for a Resource.  We may want to add an index on that FK.
   - Altered queries to use indexes
   - Removed unused queries from InstalledPackage and PackageVersion
 - Added some inline doco to help future genrations
 - Removed duplicated remote method decls from ContentManagerLocal
 - Fixed an issue in ContentManagerBeanTest and re-enabled a bunch of tests

Comment 6 Jay Shaughnessy 2014-09-18 15:34:08 UTC
master commit 8e86e7b6a5ab36e34e7d47faba892444eca131f8
Author: Jay Shaughnessy <jshaughn>
Date:   Thu Sep 18 11:29:02 2014 -0400

    add a perf-oriented unit test

Comment 7 Jay Shaughnessy 2014-09-18 15:35:55 UTC
Setting to POST as work is currently done.  Waiting for Oracle Jenkins run with the latest tests.

Comment 8 Thomas Segismont 2014-09-23 15:20:06 UTC
Cherry-picked over to release/jon3.3.x

commit fcae8e7e2e200ca7c0acee03396af59e64add6ee
Author: Jay Shaughnessy <jshaughn>
Date:   Wed Sep 17 17:32:33 2014 -0400

    [1073691] Table deadlock issue when processing content report due to missing index and very large transactions in ContentServerServiceImpl.mergeDiscoveredPackage
    This is yet another legacy code spot subject to scalability and locking
    issues. I've made a pass at making it more scalable and less likely to lock:
    - No longer process an entire report under an umbrella Tx. Break up the
      work into more, smaller, Txs
    - Reduce # of DB round trips
    - Remove several calls to unindexed queries
      - Note, no new indexes have been added at this point. We are still likely
        doing one full-table scan of rhq_installed_package when getting the
        InstalledPackages for a Resource.  We may want to add an index on that FK.
      - Altered queries to use indexes
      - Removed unused queries from InstalledPackage and PackageVersion
    - Added some inline doco to help future genrations
    - Removed duplicated remote method decls from ContentManagerLocal
    - Fixed an issue in ContentManagerBeanTest and re-enabled a bunch of tests
    
    (cherry picked from commit 9aa0c46edb8a74d7e02557b153b60ae1757e6a32)
    Signed-off-by: Thomas Segismont <tsegismo>
    
    Conflicts:
        modules/core/domain/intentional-api-changes-since-4.12.0.xml


commit 577b2fc74d554439f21560319ad08f066392de2a
Author: Jay Shaughnessy <jshaughn>
Date:   Thu Sep 18 11:29:02 2014 -0400

    [1073691] Table deadlock issue when processing content report...
    add a perf-oriented unit test
    
    (cherry picked from commit 8e86e7b6a5ab36e34e7d47faba892444eca131f8)
    Signed-off-by: Thomas Segismont <tsegismo>

Comment 9 Simeon Pinder 2014-10-01 21:33:34 UTC
Moving to ON_QA as available for test with build:
https://brewweb.devel.redhat.com/buildinfo?buildID=388959

Comment 11 Jay Shaughnessy 2014-10-29 13:08:43 UTC
Jared, that comment is fine. Thanks.