We've found issue for CMR where data inconsistency occurs. This happens when CMR db resource successfully commits but the connection is broken before TM receives confirmation about such state. As connection got broken XAException.XA_RBROLLBACK is thrown and TM decides to rollback whole XA transaction. That ends in situation where db resource was committed and other XA participants are rollbacked. This is test flow: - prepare DB xa resource - prepare test xa resource - commit DB xa resource - DB commits - before confirmation is received by TM the connection crashes - jdbc driver returns XAException.XA_RBROLLBACK as connection is down - doAbort is called for the transaction - test xa resource is rollbacked - doForget is called and no information about the inconsistent transaction remains in object store - recovery manager does nothing
Created attachment 979160 [details] server.log with the failure
Based on the description this sounds like a blocker to me. Proposing as blocker pending a more comprehensive review of the nature of the bug.
After discussions with Jesper on the forum and IRC we have identified that the cause of the issue is that the XAResource implementation used by CMR is returning the RBROLLBACK error code for scenarios where it does not know the outcome of the commit, rather than RMFAIL which would be the appropriate code to return in this scenario.
Jesper Pedersen <jpederse> updated the status of jira WFLY-4272 to Closed
Ondra, assign back to me once https://github.com/jesperpedersen/ironjacamar/tree/jbjca1244_10 is verified
I've tried to run the testing over all our cmr test cases and most of databases that we support. And I've hit a problem here. When the patched ironjacamar is used then the following scenario fails: 1) Enlist Test XA 2) Enlist DB XA 3) Prepare CMR resource 4) Stopping connection to database(at start of prepare method of Test XA resource) 5) Prepare Test XA 6) Commit CMR resource - connection is down -> XAException.XAER_RMFAIL 7) Commit Test XA 8) Starting connection to database 9) Recovery occurs Currently the CMR seems not to be committed but rollbacked [1]. That bears an inconsistency as Test XA resource is committed because after RMFAIL transaction continues but DB CMR resource is rollbacked (left in state as it was before start of transaction). What could be done for this? Thanks [1] What I can see in log about CMR resource 17:09:58,861 DEBUG [com.arjuna.ats.arjuna] (Periodic Recovery) RecoverAtomicAction.replayPhase2 recovering 0:ffff0a000006:-166944c2:54d3958c:33 ActionStatus is ActionStatus.COMMITTED 17:09:58,861 TRACE [com.arjuna.ats.arjuna] (Periodic Recovery) BasicAction::phase2Commit() for action-id 0:ffff0a000006:-166944c2:54d3958c:33 17:09:58,861 TRACE [com.arjuna.ats.arjuna] (Periodic Recovery) BasicAction::doCommit (com.arjuna.ats.internal.jta.resources.arjunacore.CommitMarkableResourceRecord@2dac7d) 17:09:58,862 TRACE [com.arjuna.ats.arjuna] (Periodic Recovery) CommitMarkableResourceRecord.topLevelCommit for com.arjuna.ats.internal.jta.resources.arjunacore.CommitMarkableResourceRecord@2dac7d, record id=-80000000:-80000000:-80000000:-80000000:-80000000 17:09:58,862 TRACE [com.arjuna.ats.arjuna] (Periodic Recovery) BasicAction.doCommit for 0:ffff0a000006:-166944c2:54d3958c:33 received TwoPhaseOutcome.HEURISTIC_ROLLBACK from class com.arjuna.ats.internal.jta.resources.arjunacore.CommitMarkableResourceRecord
Hi Ondra, This is working as expected. You could get the same situation with XA. Our transaction manager operates in an optimistic manner. 1. Prepare XAR1 2. Prepare XAR2 3. Commit XAR1 -> RMFAIL 4. TM will still commit XAR2 in order to hopefully achieve an atomic outcome 5. Recovery manager tries to recover XAR1 -> it returns a heuristic rollback. The reason an XAR can send a heuristic rollback in commit are numerous. Here is a link to one example: http://www-01.ibm.com/support/knowledgecenter/SSGU8G_12.1.0/com.ibm.admin.doc/ids_admin_1069.htm I would say that the if the patch is returning RMFAIL for all the DBs then the system is now operating as designed. You should be able to find a transaction in a heuristic state after this situation. I have a unit test to show that: https://github.com/jbosstm/narayana/blob/master/ArjunaJTA/jta/tests/classes/com/hp/mwtests/ts/jta/commitmarkable/TestCommitMarkableResourceReturnUnknownOutcomeFrom1PCCommit.java#L275 Basically you need to scan() a couple of times (with the DB back online) so the AA is properly marked as a heuristic. Then the only option with a CMR heuristic is to delete the participant (but at least the admin knows the situation happened - unlike LRCO) and re-run recovery to clean it up. Hope that clarifies why you get the outcome you have discovered. CMR is not full XA so we can't commit the data after a failure - hence the (valid) heuristic outcome. Tom
Hi Tom, thanks, I've got it. I didn't realize that TM does not do "rollback" but finishes with "heuristic". Ok, what I can see it works fine. Ondra
Hi Jesper, from what I can see there is no issues with the fix. I'm assigning this back to you for the new release could be part of the next ER. Thanks Ondra
Jesper Pedersen <jpederse> updated the status of jira JBJCA-1244 to Resolved
Jesper Pedersen <jpederse> updated the status of jira JBJCA-1244 to Closed
Doc change for Microsoft SQLServer: https://access.redhat.com/documentation/en-US/JBoss_Enterprise_Application_Platform/6.3/html/Administration_and_Configuration_Guide/Example_Microsoft_SQLServer_Datasource1.html https://access.redhat.com/documentation/en-US/JBoss_Enterprise_Application_Platform/6.3/html/Administration_and_Configuration_Guide/Example_Microsoft_SQLServer_XA_Datasource.html needs <exception-sorter class-name="org.jboss.jca.adapters.jdbc.extensions.mssql.MSSQLExceptionSorter"></exception-sorter> inside the <validation> element. See other datasource samples for an example.
6.4.0.ER3 verified
Hi Tom We are hitting exactly this case with CMR on Oracle Datasource, where we are getting RMFAIL when Oracle Database is shutdown with "shutdown immediate". So Transaction State is at the end CMR DB Record Rollbacked on Oracle but Tx committed on IBM MQ JMS Queue... Which end up with HEURISTIC Transactions in Phase2commit() on EAP 6.4.17 which needs manual recovery. So basically it works, like provided Patch. You have written: "CMR is not full XA so we can't commit the data after a failure - hence the (valid) heuristic outcome" My Question is, if we would switch from CMR to XA on Oracle Datasource, if in this Case Periodic Recovery will be able to recover such TX in Phase2commit().. ? P.S: I found your commit, where RMFAIL is thrown but only on //Oracle: 08000 (No more data to read from socket) :-) https://github.com/ironjacamar/ironjacamar/pull/278/commits/ff62b8b23f59f9fbb9c15be40fef38efb872c436
Hi Radek, For this case yes you should get an atomic outcome (assuming the database was not in the process of returning a heuristic rollback of its own during commit of course!). If you have more error codes we need to check for please do raise another issue and we will get them added to IJ! Thanks for the report, Tom
Hi Tom For now I don't have more codes. Thanks for the response, Radek
Hi Tom Switching to full XA Datasource helped for this Case. Outcome is atomic inclusive Recovery. Thanks for confirmation
Hi Tom You asked for more codes with CMR. When I have put oracle.jdbc.ReadTimeout Property on CMR Datasource, I saw this Exceptions, when ReadTimeout has closed JDBC Connection: What do you think, is this expected behaviour: <datasource jta="true" jndi-name="java:/jdbc/DS" pool-name="DS" enabled="true" use-java-context="true" connectable="true" statistics-enabled="true"> <connection-url>jdbc:oracle:thin:@Test14:1521:TEST1</connection-url> <datasource-class>oracle.jdbc.OracleDriver</datasource-class> <connection-property name="oracle.jdbc.ReadTimeout"> 5000 </connection-property> <connection-property name="oracle.net.CONNECT_TIMEOUT"> 5000 </connection-property> <driver>OracleJDBCDriver</driver> <pool> <min-pool-size>2</min-pool-size> <max-pool-size>10</max-pool-size> <prefill>true</prefill> <use-strict-min>true</use-strict-min> <flush-strategy>FailingConnectionOnly</flush-strategy> </pool> <security> <security-domain>EncryptedPasswordDS</security-domain> </security> <validation> <valid-connection-checker class-name="org.jboss.jca.adapters.jdbc.extensions.oracle.OracleValidConnectionChecker"/> <validate-on-match>false</validate-on-match> <background-validation>true</background-validation> <background-validation-millis>900000</background-validation-millis> <use-fast-fail>false</use-fast-fail> <stale-connection-checker class-name="org.jboss.jca.adapters.jdbc.extensions.oracle.OracleStaleConnectionChecker"/> <exception-sorter class-name="org.jboss.jca.adapters.jdbc.extensions.oracle.OracleExceptionSorter"/> </validation> <timeout> <query-timeout>180</query-timeout> </timeout> <statement> <prepared-statement-cache-size>10</prepared-statement-cache-size> <share-prepared-statements>true</share-prepared-statements> </statement> </datasource> By forcing ReadTimeout after 5s I saw this Exceptions : 2017-11-20 21:16:35,047 WARN [org.jboss.jca.core.connectionmanager.listener.TxConnectionListener] (Thread 448_639930413) IJ000305: Connection error occured: org.jboss.jca.core.connectionmanager.listener.TxConnectionListener@324201c8[state=NORMAL managed connection=org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@37ea6d52 connection handles=1 lastUse=1511208182337 trackByTx=true pool=org.jboss.jca.core.connectionmanager.pool.strategy.PoolBySubject@5282118e pool internal context=SemaphoreArrayListManagedConnectionPool@7b06cd7[pool=DS] xaResource=LocalXAResourceImpl@8271568[connectionListener=324201c8 connectionManager=387ff920 warned=false currentXid=< formatId=131077, gtrid_length=36, bqual_length=36, tx_uid=0:ffff0aa207ec:-237e2b6:5a12e7e9:cfce2d, node_name=nodeApp1, branch_uid=0:ffff0aa207ec:-237e2b6:5a12e7e9:cfce2f, subordinatenodename=null, eis_name=java:/jdbc/DS > productName=Oracle productVersion=Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options jndiName=java:/jdbc/DS] txSync=null]: java.sql.SQLRecoverableException: IO Error: Socket read timed out at oracle.jdbc.driver.T4CPreparedStatement.executeForRows(T4CPreparedStatement.java:967) at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1150) at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4798) at oracle.jdbc.driver.OraclePreparedStatement.executeUpdate(OraclePreparedStatement.java:4875) at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeUpdate(OraclePreparedStatementWrapper.java:1361) at org.jboss.jca.adapters.jdbc.CachedPreparedStatement.executeUpdate(CachedPreparedStatement.java:114) at org.jboss.jca.adapters.jdbc.WrappedPreparedStatement.executeUpdate(WrappedPreparedStatement.java:493) 2017-11-20 21:16:35,059 WARN [com.arjuna.ats.jta] (Thread 448_639930413) ARJUNA016045: attempted rollback of < formatId=131077, gtrid_length=36, bqual_length=36, tx_uid=0:ffff0aa207ec:-237e2b6:5a12e7e9:cfce2d, node_name=nodeApp1, branch_uid=0:ffff0aa207ec:-237e2b6:5a12e7e9:cfce2f, subordinatenodename=null, eis_name=java:/jdbc/DS > (LocalXAResourceImpl@8271568[connectionListener=324201c8 connectionManager=387ff920 warned=false currentXid=null productName=Oracle productVersion=Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options jndiName=java:/jdbc/DS]) failed with exception code XAException.XAER_RMFAIL: org.jboss.jca.core.spi.transaction.local.LocalXAException: IJ001160: Could not rollback local transaction at org.jboss.jca.core.tx.jbossts.LocalConnectableXAResourceImpl.rollback(LocalConnectableXAResourceImpl.java:111) at com.arjuna.ats.internal.jta.resources.arjunacore.CommitMarkableResourceRecord.topLevelAbort(CommitMarkableResourceRecord.java:453) at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2901) at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2880) at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1645) at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:118) at com.arjuna.ats.arjuna.AtomicAction.abort(AtomicAction.java:186) at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.rollbackAndDisassociate(TransactionImple.java:1313) at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.rollback(BaseTransaction.java:143) at com.arjuna.ats.jbossatx.BaseTransactionManagerDelegate.rollback(BaseTransactionManagerDelegate.java:114) at org.jboss.tm.usertx.client.ServerVMClientUserTransaction.rollback(ServerVMClientUserTransaction.java:202) 2017-11-20 21:16:35,067 WARN [com.arjuna.ats.arjuna] (Thread 448_639930413) ARJUNA012091: Top-level abort of action 0:ffff0aa207ec:-237e2b6:5a12e7e9:cfce2d received TwoPhaseOutcome.FINISH_ERROR from com.arjuna.ats.internal.jta.resources.arjunacore.CommitMarkableResourceRecord
Hi Radek, That does look like an issue because the DB may have rolled back or not and IronJacamar could not infer the outcome. Please can you assign to JCA and report the issue against all affected products. Thanks, Tom
However, I don't think your simulated error would be in the right place actually. Because you need that timeout to happen when the LocalConnectableXAResourceImpl is issuing the commit call - which I don't think it is in your java.sql.SQLRecoverableException: IO Error: Socket read timed out
I agree with you. In this case it will fail before 2-end Phase. But I think Periodic Recovery might have problem with recovering, what is left in object_store...
Please can you raise a new BZ with JCA component to get the additional error code handled by IronJacamar?
Hi Tom sure, here new BZ for JCA: https://bugzilla.redhat.com/show_bug.cgi?id=1516950