Bug 1181132 - Data inconsistency for CMR when connection is broken after database resource commits
Summary: Data inconsistency for CMR when connection is broken after database resource ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: JCA
Version: 6.4.0
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ER3
: EAP 6.4.0
Assignee: Jesper Pedersen
QA Contact: Ondrej Chaloupka
URL:
Whiteboard:
Depends On:
Blocks: 1184128
TreeView+ depends on / blocked
 
Reported: 2015-01-12 13:21 UTC by Ondrej Chaloupka
Modified: 2019-08-19 12:41 UTC (History)
5 users (show)

Fixed In Version:
Clone Of:
: 1516950 (view as bug list)
Environment:
Last Closed: 2019-08-19 12:41:20 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
server.log with the failure (363.24 KB, text/plain)
2015-01-12 13:40 UTC, Ondrej Chaloupka
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker JBJCA-1244 0 Major Closed LocalXAResourceImpl returns RBROLLBACK in certain scenarios which violate the XA specification 2017-12-14 15:13:52 UTC
Red Hat Issue Tracker WFLY-4272 0 Major Closed Data inconsistency for CMR when connection is broken after database resource commits 2017-12-14 15:13:52 UTC

Description Ondrej Chaloupka 2015-01-12 13:21:45 UTC
We've found issue for CMR where data inconsistency occurs. This happens when CMR db resource successfully commits but the connection is broken before TM receives confirmation about such state. As connection got broken XAException.XA_RBROLLBACK is thrown and TM decides to rollback whole XA transaction.
That ends in situation where db resource was committed and other XA participants are rollbacked.

This is test flow:
- prepare DB xa resource
- prepare test xa resource
- commit DB xa resource
- DB commits
- before confirmation is received by TM the connection crashes
- jdbc driver returns XAException.XA_RBROLLBACK as connection is down
- doAbort is called for the transaction
- test xa resource is rollbacked
- doForget is called and no information about the inconsistent transaction remains in object store
- recovery manager does nothing

Comment 3 Ondrej Chaloupka 2015-01-12 13:40:11 UTC
Created attachment 979160 [details]
server.log with the failure

Comment 5 tom.jenkinson 2015-01-12 15:36:04 UTC
Based on the description this sounds like a blocker to me. Proposing as blocker pending a more comprehensive review of the nature of the bug.

Comment 6 tom.jenkinson 2015-01-15 17:30:15 UTC
After discussions with Jesper on the forum and IRC we have identified that the cause of the issue is that the XAResource implementation used by CMR is returning the RBROLLBACK error code for scenarios where it does not know the outcome of the commit, rather than RMFAIL which would be the appropriate code to return in this scenario.

Comment 10 JBoss JIRA Server 2015-01-20 14:58:45 UTC
Jesper Pedersen <jpederse> updated the status of jira WFLY-4272 to Closed

Comment 15 Jesper Pedersen 2015-02-05 13:45:18 UTC
Ondra, assign back to me once https://github.com/jesperpedersen/ironjacamar/tree/jbjca1244_10 is verified

Comment 16 Ondrej Chaloupka 2015-02-08 23:04:35 UTC
I've tried to run the testing over all our cmr test cases and most of databases that we support. And I've hit a problem here.

When the patched ironjacamar is used then the following scenario fails:
1) Enlist Test XA
2) Enlist DB XA
3) Prepare CMR resource
4) Stopping connection to database(at start of prepare method of Test XA resource)
5) Prepare Test XA
6) Commit CMR resource
   - connection is down -> XAException.XAER_RMFAIL
7) Commit Test XA
8) Starting connection to database
9) Recovery occurs

Currently the CMR seems not to be committed but rollbacked [1].
That bears an inconsistency as Test XA resource is committed because after RMFAIL transaction continues but DB CMR resource is rollbacked (left in state as it was before start of transaction).

What could be done for this?
Thanks


[1] What I can see in log about CMR resource
17:09:58,861 DEBUG [com.arjuna.ats.arjuna] (Periodic Recovery) RecoverAtomicAction.replayPhase2 recovering 0:ffff0a000006:-166944c2:54d3958c:33 ActionStatus is ActionStatus.COMMITTED
17:09:58,861 TRACE [com.arjuna.ats.arjuna] (Periodic Recovery) BasicAction::phase2Commit() for action-id 0:ffff0a000006:-166944c2:54d3958c:33
17:09:58,861 TRACE [com.arjuna.ats.arjuna] (Periodic Recovery) BasicAction::doCommit (com.arjuna.ats.internal.jta.resources.arjunacore.CommitMarkableResourceRecord@2dac7d)
17:09:58,862 TRACE [com.arjuna.ats.arjuna] (Periodic Recovery) CommitMarkableResourceRecord.topLevelCommit for com.arjuna.ats.internal.jta.resources.arjunacore.CommitMarkableResourceRecord@2dac7d, record id=-80000000:-80000000:-80000000:-80000000:-80000000
17:09:58,862 TRACE [com.arjuna.ats.arjuna] (Periodic Recovery) BasicAction.doCommit for 0:ffff0a000006:-166944c2:54d3958c:33 received TwoPhaseOutcome.HEURISTIC_ROLLBACK from class com.arjuna.ats.internal.jta.resources.arjunacore.CommitMarkableResourceRecord

Comment 17 tom.jenkinson 2015-02-09 08:47:29 UTC
Hi Ondra,

This is working as expected. You could get the same situation with XA. Our transaction manager operates in an optimistic manner.

1. Prepare XAR1
2. Prepare XAR2
3. Commit XAR1 -> RMFAIL
4. TM will still commit XAR2 in order to hopefully achieve an atomic outcome
5. Recovery manager tries to recover XAR1 -> it returns a heuristic rollback.

The reason an XAR can send a heuristic rollback in commit are numerous. Here is a link to one example:
http://www-01.ibm.com/support/knowledgecenter/SSGU8G_12.1.0/com.ibm.admin.doc/ids_admin_1069.htm

I would say that the if the patch is returning RMFAIL for all the DBs then the system is now operating as designed.

You should be able to find a transaction in a heuristic state after this situation. I have a unit test to show that:
https://github.com/jbosstm/narayana/blob/master/ArjunaJTA/jta/tests/classes/com/hp/mwtests/ts/jta/commitmarkable/TestCommitMarkableResourceReturnUnknownOutcomeFrom1PCCommit.java#L275

Basically you need to scan() a couple of times (with the DB back online) so the AA is properly marked as a heuristic. Then the only option with a CMR heuristic is to delete the participant (but at least the admin knows the situation happened - unlike LRCO) and re-run recovery to clean it up.

Hope that clarifies why you get the outcome you have discovered. CMR is not full XA so we can't commit the data after a failure - hence the (valid) heuristic outcome.

Tom

Comment 18 Ondrej Chaloupka 2015-02-09 10:54:41 UTC
Hi Tom,

thanks, I've got it. I didn't realize that TM does not do "rollback" but finishes with "heuristic". Ok, what I can see it works fine.

Ondra

Comment 19 Ondrej Chaloupka 2015-02-09 10:56:10 UTC
Hi Jesper,

from what I can see there is no issues with the fix.

I'm assigning this back to you for the new release could be part of the next ER.

Thanks
Ondra

Comment 20 JBoss JIRA Server 2015-02-13 18:55:38 UTC
Jesper Pedersen <jpederse> updated the status of jira JBJCA-1244 to Resolved

Comment 21 JBoss JIRA Server 2015-02-16 14:07:17 UTC
Jesper Pedersen <jpederse> updated the status of jira JBJCA-1244 to Closed

Comment 22 Jesper Pedersen 2015-02-16 17:41:20 UTC
Doc change for Microsoft SQLServer:

https://access.redhat.com/documentation/en-US/JBoss_Enterprise_Application_Platform/6.3/html/Administration_and_Configuration_Guide/Example_Microsoft_SQLServer_Datasource1.html

https://access.redhat.com/documentation/en-US/JBoss_Enterprise_Application_Platform/6.3/html/Administration_and_Configuration_Guide/Example_Microsoft_SQLServer_XA_Datasource.html

needs

<exception-sorter class-name="org.jboss.jca.adapters.jdbc.extensions.mssql.MSSQLExceptionSorter"></exception-sorter>

inside the <validation> element.

See other datasource samples for an example.

Comment 23 Ondrej Chaloupka 2015-03-03 12:02:04 UTC
6.4.0.ER3 verified

Comment 24 Radek Rodak 2017-11-06 19:31:55 UTC
Hi Tom

We are hitting exactly this case with CMR on Oracle Datasource, where we are getting RMFAIL when Oracle Database is shutdown with "shutdown immediate".

So Transaction State is at the end CMR DB Record Rollbacked on Oracle but Tx committed on IBM MQ JMS Queue...

Which end up with HEURISTIC Transactions in Phase2commit() on EAP 6.4.17 which needs manual recovery.

So basically it works, like provided Patch.

You have written:

"CMR is not full XA so we can't commit the data after a failure - hence the (valid) heuristic outcome"

My Question is, if we would switch from CMR to XA on Oracle Datasource, if in this Case 
Periodic Recovery will be able to recover such TX in Phase2commit().. ?

P.S:

I found your commit, where RMFAIL is thrown but only on //Oracle: 08000 (No more data to read from socket)  :-)

https://github.com/ironjacamar/ironjacamar/pull/278/commits/ff62b8b23f59f9fbb9c15be40fef38efb872c436

Comment 25 tom.jenkinson 2017-11-06 20:14:25 UTC
Hi Radek,

For this case yes you should get an atomic outcome (assuming the database was not in the process of returning a heuristic rollback of its own during commit of course!).

If you have more error codes we need to check for please do raise another issue and we will get them added to IJ!

Thanks for the report,
Tom

Comment 26 Radek Rodak 2017-11-07 21:25:30 UTC
Hi Tom

For now I don't have more codes.

Thanks for the response,
Radek

Comment 27 Radek Rodak 2017-11-21 12:32:03 UTC
Hi Tom

Switching to full XA Datasource helped for this Case.
Outcome is atomic inclusive Recovery.

Thanks for confirmation

Comment 28 Radek Rodak 2017-11-21 12:35:33 UTC
Hi Tom

You asked for more codes with CMR.

When I have put oracle.jdbc.ReadTimeout Property on CMR Datasource, I saw this Exceptions, when ReadTimeout has closed JDBC Connection:

What do you think, is this expected behaviour:

<datasource jta="true" jndi-name="java:/jdbc/DS" pool-name="DS" enabled="true" use-java-context="true" connectable="true" statistics-enabled="true">
                    <connection-url>jdbc:oracle:thin:@Test14:1521:TEST1</connection-url>
                    <datasource-class>oracle.jdbc.OracleDriver</datasource-class>
                    <connection-property name="oracle.jdbc.ReadTimeout">
                        5000
                    </connection-property>
                    <connection-property name="oracle.net.CONNECT_TIMEOUT">
                        5000
                    </connection-property>
                    <driver>OracleJDBCDriver</driver>
                    <pool>
                        <min-pool-size>2</min-pool-size>
                        <max-pool-size>10</max-pool-size>
                        <prefill>true</prefill>
                        <use-strict-min>true</use-strict-min>
                        <flush-strategy>FailingConnectionOnly</flush-strategy>
                    </pool>
                    <security>
                        <security-domain>EncryptedPasswordDS</security-domain>
                    </security>
                    <validation>
                        <valid-connection-checker class-name="org.jboss.jca.adapters.jdbc.extensions.oracle.OracleValidConnectionChecker"/>
                        <validate-on-match>false</validate-on-match>
                        <background-validation>true</background-validation>
                        <background-validation-millis>900000</background-validation-millis>
                        <use-fast-fail>false</use-fast-fail>
                        <stale-connection-checker class-name="org.jboss.jca.adapters.jdbc.extensions.oracle.OracleStaleConnectionChecker"/>
                        <exception-sorter class-name="org.jboss.jca.adapters.jdbc.extensions.oracle.OracleExceptionSorter"/>
                    </validation>
                    <timeout>
                        <query-timeout>180</query-timeout>
                    </timeout>
                    <statement>
                        <prepared-statement-cache-size>10</prepared-statement-cache-size>
                        <share-prepared-statements>true</share-prepared-statements>
                    </statement>
                </datasource>
	
By forcing ReadTimeout after 5s I saw this Exceptions : 	
		    

2017-11-20 21:16:35,047 WARN  [org.jboss.jca.core.connectionmanager.listener.TxConnectionListener] (Thread 448_639930413) IJ000305: Connection error occured: org.jboss.jca.core.connectionmanager.listener.TxConnectionListener@324201c8[state=NORMAL managed connection=org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@37ea6d52 connection handles=1 lastUse=1511208182337 trackByTx=true pool=org.jboss.jca.core.connectionmanager.pool.strategy.PoolBySubject@5282118e pool internal context=SemaphoreArrayListManagedConnectionPool@7b06cd7[pool=DS] xaResource=LocalXAResourceImpl@8271568[connectionListener=324201c8 connectionManager=387ff920 warned=false currentXid=< formatId=131077, gtrid_length=36, bqual_length=36, tx_uid=0:ffff0aa207ec:-237e2b6:5a12e7e9:cfce2d, node_name=nodeApp1, branch_uid=0:ffff0aa207ec:-237e2b6:5a12e7e9:cfce2f, subordinatenodename=null, eis_name=java:/jdbc/DS > productName=Oracle productVersion=Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options jndiName=java:/jdbc/DS] txSync=null]: java.sql.SQLRecoverableException: IO Error: Socket read timed out
        at oracle.jdbc.driver.T4CPreparedStatement.executeForRows(T4CPreparedStatement.java:967)
        at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1150)
        at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4798)
        at oracle.jdbc.driver.OraclePreparedStatement.executeUpdate(OraclePreparedStatement.java:4875)
        at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeUpdate(OraclePreparedStatementWrapper.java:1361)
        at org.jboss.jca.adapters.jdbc.CachedPreparedStatement.executeUpdate(CachedPreparedStatement.java:114)
        at org.jboss.jca.adapters.jdbc.WrappedPreparedStatement.executeUpdate(WrappedPreparedStatement.java:493)
	
2017-11-20 21:16:35,059 WARN  [com.arjuna.ats.jta] (Thread 448_639930413) ARJUNA016045: attempted rollback of < formatId=131077, gtrid_length=36, bqual_length=36, tx_uid=0:ffff0aa207ec:-237e2b6:5a12e7e9:cfce2d, node_name=nodeApp1, branch_uid=0:ffff0aa207ec:-237e2b6:5a12e7e9:cfce2f, subordinatenodename=null, eis_name=java:/jdbc/DS > (LocalXAResourceImpl@8271568[connectionListener=324201c8 connectionManager=387ff920 warned=false currentXid=null productName=Oracle productVersion=Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options jndiName=java:/jdbc/DS]) failed with exception code XAException.XAER_RMFAIL: org.jboss.jca.core.spi.transaction.local.LocalXAException: IJ001160: Could not rollback local transaction
        at org.jboss.jca.core.tx.jbossts.LocalConnectableXAResourceImpl.rollback(LocalConnectableXAResourceImpl.java:111)
        at com.arjuna.ats.internal.jta.resources.arjunacore.CommitMarkableResourceRecord.topLevelAbort(CommitMarkableResourceRecord.java:453)
        at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2901)
        at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2880)
        at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1645)
        at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:118)
        at com.arjuna.ats.arjuna.AtomicAction.abort(AtomicAction.java:186)
        at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.rollbackAndDisassociate(TransactionImple.java:1313)
        at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.rollback(BaseTransaction.java:143)
        at com.arjuna.ats.jbossatx.BaseTransactionManagerDelegate.rollback(BaseTransactionManagerDelegate.java:114)
        at org.jboss.tm.usertx.client.ServerVMClientUserTransaction.rollback(ServerVMClientUserTransaction.java:202)
	
2017-11-20 21:16:35,067 WARN  [com.arjuna.ats.arjuna] (Thread 448_639930413) ARJUNA012091: Top-level abort of action 0:ffff0aa207ec:-237e2b6:5a12e7e9:cfce2d received TwoPhaseOutcome.FINISH_ERROR from com.arjuna.ats.internal.jta.resources.arjunacore.CommitMarkableResourceRecord

Comment 29 tom.jenkinson 2017-11-23 15:28:33 UTC
Hi Radek,

That does look like an issue because the DB may have rolled back or not and IronJacamar could not infer the outcome. 

Please can you assign to JCA and report the issue against all affected products.

Thanks,
Tom

Comment 30 tom.jenkinson 2017-11-23 15:31:44 UTC
However, I don't think your simulated error would be in the right place actually. Because you need that timeout to happen when the LocalConnectableXAResourceImpl is issuing the commit call - which I don't think it is in your java.sql.SQLRecoverableException: IO Error: Socket read timed out

Comment 31 Radek Rodak 2017-11-23 15:57:48 UTC
I agree with you. In this case it will fail before 2-end Phase. But I think Periodic Recovery might have problem with recovering, what is left in object_store...

Comment 32 tom.jenkinson 2017-11-23 23:15:12 UTC
Please can you raise a new BZ with JCA component to get the additional error code handled by IronJacamar?

Comment 33 Radek Rodak 2017-11-23 23:18:43 UTC
Hi Tom

sure, here new BZ for JCA:
https://bugzilla.redhat.com/show_bug.cgi?id=1516950


Note You need to log in before you can comment on or make changes to this bug.