Description of problem: Content reports occasionally fail to be processed due to what appears to be a concurrency/deadlock issue. This is evident in the database by it reporting that 5 database sessions are blocking due to a TM lock and the JBoss ON server lock containing many of the following exceptions: WARN [org.hibernate.util.JDBCExceptionReporter] SQL Error: 60, SQLState: 61000 ERROR [org.hibernate.util.JDBCExceptionReporter] ORA-00060: deadlock detected while waiting for resource ERROR [org.hibernate.event.def.AbstractFlushingEventListener] Could not synchronize database state with session org.hibernate.exception.LockAcquisitionException: Could not execute JDBC batch update at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:87) at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43) ... at $Proxy435.mergeDiscoveredPackages(Unknown Source) at org.rhq.enterprise.server.content.ContentServerServiceImpl.mergeDiscoveredPackages(ContentServerServiceImpl.java:56) ... Caused by: java.sql.BatchUpdateException: ORA-00060: deadlock detected while waiting for resource at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:10345) at oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:230) ... WARN [com.arjuna.ats.arjuna.logging.arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator_2] TwoPhaseCoordinator.beforeCompletion - failed for com.arjuna.ats.internal.jta.resources.arjunacore.SynchronizationImple@3a573ecb javax.persistence.PersistenceException: org.hibernate.exception.LockAcquisitionException: Could not execute JDBC batch update at org.hibernate.ejb.AbstractEntityManagerImpl.throwPersistenceException(AbstractEntityManagerImpl.java:629) at org.hibernate.ejb.AbstractEntityManagerImpl$1.beforeCompletion(AbstractEntityManagerImpl.java:524) ... at $Proxy435.mergeDiscoveredPackages(Unknown Source) at org.rhq.enterprise.server.content.ContentServerServiceImpl.mergeDiscoveredPackages(ContentServerServiceImpl.java:56) ... Version-Release number of selected component (if applicable): 4.4.0.JON312GA How reproducible: Occasionally -- between 4AM and 5AM every day or two Steps to Reproduce: Uncertain Additional info: From the original case that this issue was reported, the problem appears for several minutes between 4AM and 5AM over a 4 day period. It was also observed that at the time of the deadlocks, the sessions that were in a locked state were also executing the following queries many times: delete from RHQ_INSTALLED_PACKAGE where ID=:1 insert into RHQ_INSTALLED_PACKAGE (RESOURCE_ID, PACKAGE_VERSION_ID, INSTALLATION_TIME, SUBJECT_ID, ID) values (:1 , :2 , :3 , :4 , :5 ) These queries were performing very slow due to a full table scan on RHQ_CREATE_RES_HIST. This table scan is due to the foreign key RHQ_CREATE_RES_HIST.INSTALLED_PACKAGE_ID not having any index defined. However, it is important to note that even if defining an index on this column, it may not prevent the deadlock. As evident in the Oracle exception, two or more sessions were attempting to lock each others locked tables.
Setting to ER04 for review of relevant perf/locking.
This code certainly wasn't scalable due to the umbrella Tx, and I can see how it would be susceptible to locking, due to the same umbrella Tx, and also the order in which it executed various things. This commit should make things better, maybe significantly. I have yet to actually add any indexing. I'm not sure it's required but an FK index on rhq_installed_package.resource_id would likely be the first one to add. As for the suggested FK on rhq_create_res_hist.installed_package_id, I'm not sure that issue was directly influenced by this problem. But again, FK indexes are certainly useful and I'm not against adding them when the need arises. But since indexes can themselves cause issues, it's not always a no-brainer. Committing to master to let it run through automated regression testing and get some soak time. master commit 9aa0c46edb8a74d7e02557b153b60ae1757e6a32 Author: Jay Shaughnessy <jshaughn> Date: Wed Sep 17 17:32:33 2014 -0400 This is yet another legacy code spot subject to scalability and locking issues. I've made a pass at making it more scalable and less likely to lock: - No longer process an entire report under an umbrella Tx. Break up the work into more, smaller, Txs - Reduce # of DB round trips - Remove several calls to unindexed queries - Note, no new indexes have been added at this point. We are still likely doing one full-table scan of rhq_installed_package when getting the InstalledPackages for a Resource. We may want to add an index on that FK. - Altered queries to use indexes - Removed unused queries from InstalledPackage and PackageVersion - Added some inline doco to help future genrations - Removed duplicated remote method decls from ContentManagerLocal - Fixed an issue in ContentManagerBeanTest and re-enabled a bunch of tests
master commit 8e86e7b6a5ab36e34e7d47faba892444eca131f8 Author: Jay Shaughnessy <jshaughn> Date: Thu Sep 18 11:29:02 2014 -0400 add a perf-oriented unit test
Setting to POST as work is currently done. Waiting for Oracle Jenkins run with the latest tests.
Cherry-picked over to release/jon3.3.x commit fcae8e7e2e200ca7c0acee03396af59e64add6ee Author: Jay Shaughnessy <jshaughn> Date: Wed Sep 17 17:32:33 2014 -0400 [1073691] Table deadlock issue when processing content report due to missing index and very large transactions in ContentServerServiceImpl.mergeDiscoveredPackage This is yet another legacy code spot subject to scalability and locking issues. I've made a pass at making it more scalable and less likely to lock: - No longer process an entire report under an umbrella Tx. Break up the work into more, smaller, Txs - Reduce # of DB round trips - Remove several calls to unindexed queries - Note, no new indexes have been added at this point. We are still likely doing one full-table scan of rhq_installed_package when getting the InstalledPackages for a Resource. We may want to add an index on that FK. - Altered queries to use indexes - Removed unused queries from InstalledPackage and PackageVersion - Added some inline doco to help future genrations - Removed duplicated remote method decls from ContentManagerLocal - Fixed an issue in ContentManagerBeanTest and re-enabled a bunch of tests (cherry picked from commit 9aa0c46edb8a74d7e02557b153b60ae1757e6a32) Signed-off-by: Thomas Segismont <tsegismo> Conflicts: modules/core/domain/intentional-api-changes-since-4.12.0.xml commit 577b2fc74d554439f21560319ad08f066392de2a Author: Jay Shaughnessy <jshaughn> Date: Thu Sep 18 11:29:02 2014 -0400 [1073691] Table deadlock issue when processing content report... add a perf-oriented unit test (cherry picked from commit 8e86e7b6a5ab36e34e7d47faba892444eca131f8) Signed-off-by: Thomas Segismont <tsegismo>
Moving to ON_QA as available for test with build: https://brewweb.devel.redhat.com/buildinfo?buildID=388959
Jared, that comment is fine. Thanks.