Created attachment 857860
log files that show full exceptions

Description of problem:
Bundle deployment in JBoss ON 3.2.0 with Oracle as the backend fails with an ARJUNA016061 exception. After this, the whole JBoss ON installation is left in an erroneous state, and the only way to recover is to restore the database and re-install JBoss ON.

Version-Release number of selected component (if applicable):
JBoss ON 3.2.0
Oracle as a database

How reproducible:
Always

Steps to Reproduce:
1. perform a fresh installation of JBoss ON 3.2.0 configured to use an Oracle database;
2. confirm that the platform is imported (it should be, by default);
3. create a new resource group that will contain the platform resource;
4. navigate to bundles;
5. create bundle group - myGroup;
6. create bundle - testBundle;
7. assign testBundle to the bundle group - myGroup;
8. select the bundle (testBundle) and press the "Deploy" button;
9. in the Bundle Deployment window enter:
   Destination Name: testDestination
   Destination Description
   Resource Group: myGroup
   Base Location: Root File System
   Deployment Directory: /tmp/test
10. press the Next button;
11. press the Next button to confirm the bundle version;
12. press the Next button in "Deployment Configuration";
13. leave "Clean Deployment?" unselected (in Provide Deployment Information) and press the Next button;
14. press the Finish button to finish the deployment.

Actual results:
The deployment fails and the message "Failed to load bundle deployment" is shown at the top of the screen (in red).

The server.log file shows:

WARN [com.arjuna.ats.jta] (http-/0.0.0.0:7080-1) ARJUNA016061: TransactionImple.enlistResource - XAResource.start returned: XAException.XAER_PROTO for < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a213fee:5aadaf24:52eba757:3fe, node_name=1, branch_uid=0:ffff0a213fee:5aadaf24:52eba757:402, subordinatenodename=null, eis_name=java:jboss/datasources/RHQDS >: oracle.jdbc.xa.OracleXAException

(I will attach the full log file.)

From this moment on, every other action in the JBoss ON UI fails. Shutting down the JBoss ON Server fails as well, and the only way to stop it is to kill the process.

Expected results:
Bundle deployment works properly.

Additional info:
This does not happen when Postgres is used as the backend.
> The shutdown of JBoss ON Server will fail as well and the only way to stop it is to kill the process.

It appears that this is what causes the issue to persist. During testing I found that if you wait for the shutdown to complete on its own (around 10 minutes), the server will come back up as normal. By default the shutdown command only waits 5 minutes before reporting that the server has not yet shut down. In my case I just issued the command again and waited, and then a third time if the second attempt didn't do the job. After that, the server comes up just fine.

I also found that if I do a kill -9, the server will not start correctly or completely. I can then do a rhqctl stop as above and things go back to normal.

So:

rhqctl stop
... timeout
rhqctl stop
... timeout
rhqctl stop
RHQ Server has stopped.
...

Or:

rhqctl stop
... timeout
kill -9 <PID>
rhqctl start
rhqctl stop
... timeout
rhqctl stop
... timeout
rhqctl stop
RHQ Server has stopped.
...

What appears to happen is that after the transaction manager is shut down, the bad transaction eventually times out and the transaction reaper cancels it.
As for the restart, I think you could also clean out the tx data before bringing up the server. Delete: ..\jbossas\standalone\data\tx-object-store and then restart.
(In reply to Jay Shaughnessy from comment #3)
> As for the restart, I think you could also clean out the tx data before
> bringing up the server. Delete:
>
> ..\jbossas\standalone\data\tx-object-store
>
> and then restart.

I will try this again, but when the user did this it left the database in an inconsistent/corrupted state. Database locks were left behind, preventing anything schedule-based from running.
master commit a77c743c29de219ea5ad5e25afce49f355935345
Author: Jay Shaughnessy <jshaughn>
Date: Thu Feb 6 16:23:11 2014 -0500

Oracle does not like XA connections getting used both inside and outside a JTA transaction. To get around the problem you can create separate sub-pools for the different contexts using:

<no-tx-separate-pools>true</no-tx-separate-pools>

in the datasource definition. This was not getting set properly due to a subtle DMR issue in the installer.
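For orientation, a fragment of a JBoss AS 7 / EAP 6 style XA datasource definition carrying this setting might look like the sketch below. Only the JNDI name (taken from the server.log excerpt in this report) and the no-tx-separate-pools element come from this bug. The pool-name value and the placement inside xa-pool are assumptions that should be checked against the installed standalone configuration, and all other required elements are left out.

```xml
<!-- Sketch only: pool-name is an assumed value; driver, URL, and security settings are omitted -->
<xa-datasource jndi-name="java:jboss/datasources/RHQDS" pool-name="RHQDS">
    <xa-pool>
        <!-- Assumed placement: use a separate sub-pool for connections obtained outside a JTA transaction -->
        <no-tx-separate-pools>true</no-tx-separate-pools>
    </xa-pool>
</xa-datasource>
```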
Jay, can you get this into release/jon3.2.x? Thanks.
(In reply to Jay Shaughnessy from comment #5)
> master commit a77c743c29de219ea5ad5e25afce49f355935345
> Author: Jay Shaughnessy <jshaughn>
> Date: Thu Feb 6 16:23:11 2014 -0500
>
> Oracle does not like XA connections getting used both inside and outside a
> JTA transaction. To get around the problem you can create separate sub-pools
> for the different contexts using:
>
> <no-tx-separate-pools>true</no-tx-separate-pools>
>
> In the datasource definition. This was not getting set properly due to a
> subtle DMR issue in the installer.

Peer reviewed and cherry-picked (commit=baf8f4c) to the release/jon3.2.x branch.
After further testing it appears that bundle deployment still breaks the database when using Oracle. It isn't clear what is going wrong here. Specifically, after applying the fix:

<no-tx-separate-pools>true</no-tx-separate-pools>

bundle deployment and viewing destinations seem to work normally. However, if you attempt to execute a resource operation (such as the platform's discovery operation or view process list operation), the operation never actually gets scheduled and eventually results in the following error:

ERROR [org.jboss.as.ejb3.invocation] (http-/0.0.0.0:7080-4) JBAS014134: EJB Invocation failed on component OperationManagerBean for method public abstract int org.rhq.enterprise.server.operation.OperationManagerLocal.scheduleResourceOperation(org.rhq.core.domain.auth.Subject,org.rhq.core.domain.operation.bean.ResourceOperationSchedule): javax.ejb.EJBException: org.rhq.enterprise.server.exception.ScheduleException: org.quartz.impl.jdbcjobstore.LockException: Failure obtaining db row lock: ORA-02049: timeout: distributed transaction waiting for lock [See nested exception: java.sql.SQLSyntaxErrorException: ORA-02049: timeout: distributed transaction waiting for lock ]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleExceptionInOurTx(CMTTxInterceptor.java:165) [jboss-as-ejb3-7.2.1.Final-redhat-10.jar:7.2.1.Final-redhat-10]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInOurTx(CMTTxInterceptor.java:250) [jboss-as-ejb3-7.2.1.Final-redhat-10.jar:7.2.1.Final-redhat-10]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:315) [jboss-as-ejb3-7.2.1.Final-redhat-10.jar:7.2.1.Final-redhat-10]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:214) [jboss-as-ejb3-7.2.1.Final-redhat-10.jar:7.2.1.Final-redhat-10]
    ...
    at org.rhq.enterprise.server.operation.OperationManagerLocal$$$view148.scheduleResourceOperation(Unknown Source) [rhq-server.jar:4.9.0.JON320GA]
    at org.rhq.coregui.server.gwt.OperationGWTServiceImpl.scheduleResourceOperation(OperationGWTServiceImpl.java:125)
    ...
Caused by: org.rhq.enterprise.server.exception.ScheduleException: org.quartz.impl.jdbcjobstore.LockException: Failure obtaining db row lock: ORA-02049: timeout: distributed transaction waiting for lock [See nested exception: java.sql.SQLSyntaxErrorException: ORA-02049: timeout: distributed transaction waiting for lock ]
    at org.rhq.enterprise.server.operation.OperationManagerBean.scheduleResourceOperation(OperationManagerBean.java:201) [rhq-server.jar:4.9.0.JON320GA]
    ...
Caused by: org.quartz.impl.jdbcjobstore.LockException: Failure obtaining db row lock: ORA-02049: timeout: distributed transaction waiting for lock [See nested exception: java.sql.SQLSyntaxErrorException: ORA-02049: timeout: distributed transaction waiting for lock ]
    at org.quartz.impl.jdbcjobstore.StdRowLockSemaphore.executeSQL(StdRowLockSemaphore.java:112) [quartz-1.6.5.jar:1.6.5]
    at org.quartz.impl.jdbcjobstore.DBSemaphore.obtainLock(DBSemaphore.java:112) [quartz-1.6.5.jar:1.6.5]
    at org.quartz.impl.jdbcjobstore.JobStoreCMT.executeInLock(JobStoreCMT.java:237) [quartz-1.6.5.jar:1.6.5]
    at org.quartz.impl.jdbcjobstore.JobStoreSupport.executeInLock(JobStoreSupport.java:3684) [quartz-1.6.5.jar:1.6.5]
    at org.quartz.impl.jdbcjobstore.JobStoreSupport.storeJobAndTrigger(JobStoreSupport.java:1035) [quartz-1.6.5.jar:1.6.5]
    at org.quartz.core.QuartzScheduler.scheduleJob(QuartzScheduler.java:732) [quartz-1.6.5.jar:1.6.5]
    at org.quartz.impl.StdScheduler.scheduleJob(StdScheduler.java:265) [quartz-1.6.5.jar:1.6.5]
    at org.rhq.enterprise.server.scheduler.SchedulerService.scheduleJob(SchedulerService.java:220) [rhq-server.jar:4.9.0.JON320GA]
    ...
    at org.rhq.enterprise.server.scheduler.SchedulerBean.scheduleJob(SchedulerBean.java:206) [rhq-server.jar:4.9.0.JON320GA]
    ...
    at org.rhq.enterprise.server.scheduler.SchedulerLocal$$$view10.scheduleJob(Unknown Source) [rhq-server.jar:4.9.0.JON320GA]
    at org.rhq.enterprise.server.operation.OperationManagerBean.scheduleResourceOperation(OperationManagerBean.java:281) [rhq-server.jar:4.9.0.JON320GA]
    at org.rhq.enterprise.server.operation.OperationManagerBean.scheduleResourceOperation(OperationManagerBean.java:196) [rhq-server.jar:4.9.0.JON320GA]
    ... 136 more
Caused by: java.sql.SQLSyntaxErrorException: ORA-02049: timeout: distributed transaction waiting for lock
    at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:445) [ojdbc6-11.2.0.3.0.jar:11.2.0.3.0]
    at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396) [ojdbc6-11.2.0.3.0.jar:11.2.0.3.0]
    at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:879) [ojdbc6-11.2.0.3.0.jar:11.2.0.3.0]
    at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:450) [ojdbc6-11.2.0.3.0.jar:11.2.0.3.0]
    at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:192) [ojdbc6-11.2.0.3.0.jar:11.2.0.3.0]
    at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531) [ojdbc6-11.2.0.3.0.jar:11.2.0.3.0]
    at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:207) [ojdbc6-11.2.0.3.0.jar:11.2.0.3.0]
    at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:884) [ojdbc6-11.2.0.3.0.jar:11.2.0.3.0]
    at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1167) [ojdbc6-11.2.0.3.0.jar:11.2.0.3.0]
    at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1289) [ojdbc6-11.2.0.3.0.jar:11.2.0.3.0]
    at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3584) [ojdbc6-11.2.0.3.0.jar:11.2.0.3.0]
    at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3628) [ojdbc6-11.2.0.3.0.jar:11.2.0.3.0]
    at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1493) [ojdbc6-11.2.0.3.0.jar:11.2.0.3.0]
    at org.jboss.jca.adapters.jdbc.CachedPreparedStatement.executeQuery(CachedPreparedStatement.java:107)
    at org.jboss.jca.adapters.jdbc.WrappedPreparedStatement.executeQuery(WrappedPreparedStatement.java:462)
    at org.quartz.impl.jdbcjobstore.StdRowLockSemaphore.executeSQL(StdRowLockSemaphore.java:92) [quartz-1.6.5.jar:1.6.5]
    ... 233 more
I'm back to looking at it... I see it locally.
I reproduced this issue even with DISTRIBUTED_LOCK_TIMEOUT increased to 180s (default is 60s)
I bet this has to do with a recent change where the transaction handling for scheduling operations was changed so that the operation appears in the UI faster. Heiko knows the details. I would bet money that if you revert that change, this bug goes away.
I don't think that the operations change is the primary issue, although it could contribute. I've found what seems to be the problem area of code although I'm not 100% sure why it's a problem. I got things working at the expense of a previous fix. Now looking at getting it working while maintaining the previous fix...
master commit 4fec3d53f49f4fc6e941983484ae80c4a9e1a271
Author: Jay Shaughnessy <jshaughn>
Date: Fri Feb 14 11:38:47 2014 -0500

This is a completely different problem from the "no-tx-separate-pool" issue already fixed for this BZ. The addition of that DS attribute led to this issue involving the XA DS and our [quartz] job scheduling and triggering.

The issue had to do with interactions between our scheduling of resource bundle deployments, our scheduling of the umbrella bundle deployment status checking job (which came into existence as a fix for Bug 1003679), and possible fast reporting of a bundle resource deployment status from an agent.

The solution is to ensure that the bundle deployment status checking job is scheduled after the bundle deployment scheduling has been fully committed and the individual bundle resource deployments have been scheduled.

Furthermore:
- don't use a repeating quartz trigger, as this seems to exacerbate any problem that may occur.
- when scheduling an operation, minimize any access time on the quartz tables.
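The ordering described in this commit (schedule the status check only once the deployment scheduling has committed) can be illustrated with a small self-contained sketch. This is not the RHQ code: SketchTransaction, deployBundle, and the event strings are hypothetical stand-ins, and a real server would presumably rely on its transaction manager and Quartz rather than a hand-rolled class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of "schedule the status check after commit"; not RHQ code. */
public class AfterCommitScheduling {

    /** Minimal stand-in for a transaction that runs callbacks only after it commits. */
    static final class SketchTransaction {
        private final List<String> log;
        private final List<Runnable> afterCommit = new ArrayList<>();

        SketchTransaction(List<String> log) {
            this.log = log;
        }

        /** Defers work until the transaction has committed. */
        void registerAfterCommit(Runnable callback) {
            afterCommit.add(callback);
        }

        /** Records the commit, then runs the deferred callbacks in registration order. */
        void commit() {
            log.add("commit");
            for (Runnable callback : afterCommit) {
                callback.run();
            }
        }
    }

    /** Returns the ordered list of events produced by one simulated bundle deployment. */
    public static List<String> deployBundle(List<String> resources) {
        List<String> events = new ArrayList<>();
        SketchTransaction tx = new SketchTransaction(events);

        // Inside the transaction: persist the deployment and schedule each resource deployment.
        events.add("persist deployment");
        for (String resource : resources) {
            events.add("schedule resource deployment: " + resource);
        }

        // The umbrella status check is deferred so it cannot run before the commit.
        tx.registerAfterCommit(() -> events.add("schedule status check (one-shot)"));
        tx.commit();
        return events;
    }

    public static void main(String[] args) {
        for (String event : deployBundle(Arrays.asList("resource-1", "resource-2"))) {
            System.out.println(event);
        }
    }
}
```

Because the status check is registered as an after-commit callback, it is always the last event in the list, regardless of how many resource deployments are scheduled inside the transaction.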
Bundle automation passed on master for both postgresql and oracle.
release/jon3.2.x commit 54c0774176fa8465586b3bd6105ada4f816423dc
Author: Jay Shaughnessy <jshaughn>
Date: Mon Feb 17 10:51:52 2014 -0500

Cherry-Pick of Master 4fec3d53f49f4fc6e941983484ae80c4a9e1a271

Signed-off-by: John Mazzitelli <mazz> (review)
Signed-off-by: Jay Shaughnessy <jshaughn> (cherry-pick)
Updating the title of this BZ as this issue is not Oracle specific. It was only reported when using Oracle, but after further testing we can see that it also occurs on PostgreSQL, as reported in bug 1038597.

In summary, even after adjusting the XA datasource configuration, bundle deployment results in Quartz jobs continuously failing because of a deadlock that can be introduced during bundle deployment. This deadlock then becomes evident in any operation or process that schedules jobs to be executed using Quartz.
*** Bug 1038597 has been marked as a duplicate of this bug. ***
Moving to ON_QA as available for testing in the following brew build: https://brewweb.devel.redhat.com//buildinfo?buildID=340294 Note: the installed version is still JON 3.2.0.GA by design and this represents part of the payload for JON 3.2.1 also known as cumulative patch 1 for 3.2.0.GA. How this will be delivered to customers is still being discussed.
Verified on: Version : 3.2.0.GA Build Number : d18651a:f535707
JON 3.2.1 released week of 5/5/2014