534381 – (RHQ-1183) configure transaction manager recovery

Summary: configure transaction manager recovery

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	RHQ-1183
Product:	RHQ Project
Classification:	Other
Component:	Core Server
Sub Component:
Version:	1.1
Hardware:	All
OS:	All
Priority:	urgent
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	John Mazzitelli
QA Contact:
Docs Contact:
URL:	http://jira.rhq-project.org/browse/RH...
Whiteboard:
Depends On:	RHQ-1184 RHQ-938
Blocks:

Reported:	2008-11-25 20:29 UTC by John Mazzitelli
Modified:	2009-11-10 21:22 UTC (History)
CC List:	0 users
Fixed In Version:	1.2
Clone Of:
Environment:
Last Closed:
Embargoed:

Attachments
rhq-1183.patch (73.23 KB, text/x-patch) 2008-11-29 05:18 UTC, John Mazzitelli
jms-1pc-trace.log (23.95 KB, text/x-log) 2008-12-23 06:38 UTC, John Mazzitelli
jms-producer-1pc.log (36.04 KB, text/x-log) 2008-12-23 07:37 UTC, John Mazzitelli
jms-producer-2pc.log (44.96 KB, text/x-log) 2008-12-23 07:38 UTC, John Mazzitelli

Description John Mazzitelli 2008-11-25 20:29:00 UTC

We need to do this:

http://www.jboss.com/index.html?module=bb&op=viewtopic&t=146138

to avoid the ugly XAResource recovery problems.

Comment 1 John Mazzitelli 2008-11-25 23:15:31 UTC

I'm almost there, but I can't seem to get some properties set.

2008-11-25 17:45:27,593 ERROR [STDERR] java.lang.NullPointerException
2008-11-25 17:45:27,593 ERROR [STDERR] 	at javax.naming.InitialContext.getURLScheme(InitialContext.java:269)
2008-11-25 17:45:27,593 ERROR [STDERR] 	at javax.naming.InitialContext.getURLOrDefaultInitCtx(InitialContext.java:318)
2008-11-25 17:45:27,593 ERROR [STDERR] 	at javax.naming.InitialContext.lookup(InitialContext.java:392)
2008-11-25 17:45:27,593 ERROR [STDERR] 	at com.arjuna.ats.internal.jdbc.recovery.JDBCXARecovery.createDataSource(JDBCXARecovery.java:174)
2008-11-25 17:45:27,593 ERROR [STDERR] 	at com.arjuna.ats.internal.jdbc.recovery.JDBCXARecovery.hasMoreResources(JDBCXARecovery.java:141)
2008-11-25 17:45:27,593 ERROR [STDERR] 	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.resourceInitiatedRecovery(XARecoveryModule.java:679)
2008-11-25 17:45:27,593 ERROR [STDERR] 	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.periodicWorkSecondPass(XARecoveryModule.java:179)
2008-11-25 17:45:27,593 ERROR [STDERR] 	at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWork(PeriodicRecovery.java:237)
2008-11-25 17:45:27,593 ERROR [STDERR] 	at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run(PeriodicRecovery.java:163)

But I have the JNDI name set in jbossjta-properties.xml:

    <properties depends="arjuna" name="jta">
...
        <!-- RHQ - add the ability to recover our transactions -->
        <property name="com.arjuna.ats.jta.recovery.XAResourceRecoveryJDBC" value="com.arjuna.ats.internal.jdbc.recovery.JDBCXARecovery"/>
        <!-- <property name="DatabaseURL" value="${rhq.server.database.connection-url}"/> -->
        <property name="DatabaseJNDIName" value="java:/RHQDS"/>
        <property name="UserName" value="${rhq.server.database.user-name}"/>
        <property name="Password" value="${rhq.server.database.password}"/>

I can't figure out why the TM code can't see these properties. I stepped through the code and it is definitely getting null for the value of the property.

Comment 2 John Mazzitelli 2008-11-26 08:25:45 UTC

There is a bug in JBossTM that causes the NPE mentioned in the previous comment.

Read: http://www.jboss.com/index.html?module=bb&op=viewtopic&t=146138

In addition, the following two things have to be done:

1) our java:/RHQDS is currently a non-XA local-tx datasource.  We need to configure it as an <xa-datasource> in rhq-ds.xml

2) this doesn't work as expected inside jbossjta-properties.xml:

        <property name="UserName" value="${rhq.server.database.user-name}"/>
        <property name="Password" value="${rhq.server.database.password}"/>

The ${} tokens aren't replaced; they are used literally as the values of UserName and Password.  We are going to need to write our own extension to com.arjuna.ats.internal.jdbc.recovery.JDBCXARecovery that takes the properties and replaces them.  This is going to suck because we can't put that class in our ear - it has to go into server/lib so it can be found at server startup. We would have to configure it like this:

        <!-- RHQ - add the ability to recover our transactions -->
        <property name="com.arjuna.ats.jta.recovery.XAResourceRecoveryJDBC" value="org.rhq.JDBCXARecovery;jbossjta-properties.xml"/>   
        <!-- <property name="DatabaseURL" value="${rhq.server.database.connection-url}"/> -->
        <property name="DatabaseJNDIName" value="java:/RHQDS"/>
        <property name="UserName" value="${rhq.server.database.user-name}"/>
        <property name="Password" value="${rhq.server.database.password}"/>

Another alternative would be to have the installer replace these ${} tokens at deploy time, effectively hardcoding the values. Of course, if the user ever changes the DB user/pass, we no longer have all our config in a single file (rhq-server.properties) - we are once again spreading our configuration into the JBossAS internal deployment files, and this is not what we want to start doing - ALL configuration must be adjusted from within rhq-server.properties. We could add this to rhq-server.properties:

UserName=the.user.name
Password=the.password

right below the rhq.server.database.X settings - we'd be duplicating the configuration, but at least it would all be in the same file (and right next to each other).
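For illustration, the ${} replacement that a custom recovery class would have to perform could look roughly like the sketch below. This is not the code from the attached patch; the class name PropertyExpander and the choice to leave unknown variables untouched are assumptions made for the example.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of RHQ or JBossTM. */
final class PropertyExpander {
    // Matches ${some.property.name}
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /**
     * Replaces each ${name} in value with the matching entry from props.
     * A variable with no matching entry is left in place unchanged.
     */
    static String expand(String value, Properties props) {
        if (value == null) {
            return null;
        }
        Matcher m = VAR.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Fall back to the original ${...} text when the property is unknown.
            String replacement = props.getProperty(m.group(1), m.group(0));
            // quoteReplacement keeps '$' and '\' in the value from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling PropertyExpander.expand("${rhq.server.database.user-name}", System.getProperties()) would then resolve the value at runtime, assuming the rhq-server.properties settings are visible as system properties, so the credentials stay in one file.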

Comment 3 John Mazzitelli 2008-11-26 08:27:35 UTC

The best thing would be to write our own JDBCXARecovery class - maybe we can put the source in enterprise/server/container - any classes in there would be bundled up in a single .jar and the container build can deploy that .jar in the server/lib directory.

Comment 4 John Mazzitelli 2008-11-26 22:47:25 UTC

see attached patch for what we can use to make sure XA recoverability works.

This is the XARecovery implementation which is copied from the Arjuna example, except we allow for ${} variables in the prop values and we allow for the case when the data source isn't deployed yet (like when a newly installed server is started but the installer hasn't been told to deploy the ear).

Last thing on the plate is to get our datasource to be an XA datasource, as opposed to a <local-tx-datasource>.
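A minimal sketch of the "data source isn't deployed yet" case mentioned above, assuming the recoverer resolves the datasource through JNDI: treat a missing name or a failed lookup as "nothing to recover right now" instead of letting the exception escape the recovery pass. The class and method names are hypothetical and this is not the patch code.

```java
import javax.naming.InitialContext;
import javax.naming.NamingException;

/** Hypothetical helper, not part of RHQ or JBossTM. */
final class LazyJndiLookup {
    /**
     * Returns the object bound at jndiName if it exists and has the expected
     * type, or null when the name is unset, nothing is bound yet, or the
     * bound object is of a different type.
     */
    static <T> T lookupOrNull(String jndiName, Class<T> type) {
        if (jndiName == null) {
            // Passing a null name to InitialContext.lookup fails with a
            // NullPointerException, so guard against an unset property.
            return null;
        }
        try {
            Object bound = new InitialContext().lookup(jndiName);
            return type.isInstance(bound) ? type.cast(bound) : null;
        } catch (NamingException e) {
            // Not bound (yet), or no naming provider available: try again
            // on the next periodic recovery pass.
            return null;
        }
    }
}
```

The recovery object could call lookupOrNull("java:/RHQDS", javax.sql.DataSource.class) on each pass and simply report no resources while the result is null.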

Comment 5 John Mazzitelli 2008-11-26 22:50:06 UTC

the example in the following link (section 8.3.2) seems to indicate the xa-datasource has different names for some of its required elements: http://docs.jboss.org/jbossas/getting_started/v4/html/db.html

Once we get this fully implemented, we need to figure out how to force a failure that causes the recovery to happen.  Not sure how to do this, perhaps we can come up with a clever way to do this that our admin/test/control.jsp can trigger.

Comment 6 John Mazzitelli 2008-11-27 01:26:05 UTC

read this: http://www.jboss.org/community/docs/DOC-9328

lots of "use this property to fix oracle problems"

There will need to be changes to the installer and container build scripts now that rhq-ds.xml is going to be database specific (right now, it's generic, with only ${var} being able to make it behave differently).

Comment 7 John Mazzitelli 2008-11-27 01:30:39 UTC

typical oracle XA data source config: http://www.jboss.org/community/docs/DOC-12246
typical postgres XA data source config: http://www.jboss.org/community/docs/DOC-12248

Comment 8 John Mazzitelli 2008-11-27 19:11:08 UTC

if we do this (almost complete) we should invalidate RHQ-1017 - the tx-object-store will actually be important to keep around for recovery purposes.

Comment 9 John Mazzitelli 2008-11-27 19:11:49 UTC

After this is complete, we need to invalidate RHQ-938 - JMS data store can remain (in fact, should remain) XA compliant.

Comment 10 John Mazzitelli 2008-11-27 19:13:56 UTC

The JBossTM integration with JBossAS 4.2.1 does not seem fully complete.

Read:

https://jira.jboss.org/jira/browse/JBTM-319

and its associated forum thread.  I may take the work done for that JIRA and put it in our custom recovery object (as recommended by someone on the JBossTM team)

Comment 11 John Mazzitelli 2008-11-27 20:02:32 UTC

I will take this code and refactor my recoverer to be like this one:

http://anonsvn.labs.jboss.com/labs/jbosstm/branches/JBOSSTS_4_2_3_GA_SP/atsintegration/classes/com/arjuna/ats/internal/jbossatx/jta/AppServerJDBCXARecovery.java

Right now, I'm basing my recoverer code off of JDBCXARecovery - but that class is broken and doesn't work. I will give the above class a try to see if it works any better.

Comment 12 John Mazzitelli 2008-11-27 20:03:29 UTC

I think this wiki:

http://www.jboss.org/community/docs/DOC-10789

has examples on how we can force a tx error to occur that triggers a recovery.

Comment 13 John Mazzitelli 2008-11-29 05:18:33 UTC

attached is rhq-1183.patch - it is a patch to svn rev 2136.

It adds full XA enablement/configuration to the RHQ server.

Was able to start from installer to create schema.  Imported agent and created an alert. I confirmed alert definitions get inserted (so JMS is working).  I ran db maintenance (vacuum, analyze, reindex) and the purge data job.

I did the above tests on both Postgres 8.3 and Oracle10g.

No errors seen.

Would like to now try to figure out how we can trigger XA recovery to prove this really works now :)

Comment 14 John Mazzitelli 2008-11-29 05:20:32 UTC

that old patch that was attached is now deleted - turns out that arjuna class I copied and refactored doesn't work when deployed in JBossAS.  the new grand patch that is attached has a new recoverer object that now works.

Comment 15 John Mazzitelli 2008-11-29 05:52:40 UTC

ran unit tests, all pass.  all smoke tests on postgres and oracle pass.

checked into svn - rev 2137

Comment 16 John Mazzitelli 2008-11-29 06:02:42 UTC

before closing this, I will want to come up with some way to actually see that we no longer get that "XAResource not serializable" error, and that we actually do recover from a tx failure.

Comment 17 John Mazzitelli 2008-11-29 06:28:56 UTC

did a really quick test to see what would happen.

running the server, agent sending metrics.  I shutdown postgres.

get tons of errors in the logs because the conn pool cannot get db connections (it cannot connect to the database).

But towards the end of the logs, I see this:

2008-11-29 01:17:53,625 WARN  [com.arjuna.ats.jta.logging.loggerI18N] [com.arjuna.ats.internal.jta.recovery.xarecovery1] Local XARecoveryModule.xaRecovery  got XA exception org.postgresql.xa.PGXAException: Error during recover, XAException.XAER_RMERR

So, it looks like the recovery DID get attempted! I assume I hadn't turned the DB back on at the time of the attempt so it still failed.

Will have to play with this some more.

Comment 18 John Mazzitelli 2008-11-29 20:37:10 UTC

I tried the same test on Oracle, and got the same thing.  I thought maybe the problem is the database needs to be configured for XA support (perhaps they aren't configured for it out of box?). But it seems odd that both Oracle and Postgres show the same error. 

One positive - after I shutdown the server and restarted, I didn't get these errors anymore.  From what I recall before (in the old setup w/o XA), restarting the server just resumed the old XAResource is not serializable errors - it would never go away unless you purged the tx-object-store directory (or you waited 12 hours which I think is how long JBossTM will wait before expiring the txs). So I got that going for me :-/

Comment 19 John Mazzitelli 2008-12-01 15:10:02 UTC

http://anonsvn.labs.jboss.com/labs/jbosstm/workspace/adinn/orchestration/README

how to trigger failures to test recovery

Comment 20 John Mazzitelli 2008-12-23 06:38:01 UTC

svn rev 2527 now tries to make sure our JMS consumer bean never enlists more than one datasource/XA resource in a tx.

I am attaching "jms-1pc-trace.log" that shows the JBossTM debug logging which illustrates one JMS call performing 1PC. It demonstrates that it does not engage in 2PC semantics and it never writes tx logs to the object store.

That attached log file has "*** [mazz]" comments threaded throughout to explain what the log messages are telling us.

A good forum post to read that shows what our logs DID look like when this svn rev commit was not in trunk (thus this shows what our logs looked like when performing 2PC from our JMS consumer bean):

http://www.jboss.com/index.html?module=bb&op=viewtopic&t=147697

Comment 21 John Mazzitelli 2008-12-23 07:18:10 UTC

I just noticed that we are still performing 2PC that is triggered in a tx running an incoming agent command. Turns out, I forgot to wrap the JMS producer in a new tx to again make sure the tx does not enlist both the JMS XA resource and the DB XA resource.

I will attach a log file (jms-producer-2pc.log) that illustrates what JBossTM logs when performing 2PC. I added two "fatal" level log messages to CachedConditionProducerBean.sendActivateAlertConditionMessage... the first says "!!! BEFORE" at the first line of that method and the second says "!!! AFTER" at the last line of the method.

You can easily see the JMS XA resource get enlisted with the DB XA resource (well, you can just see 2 generic XA resources enlisted, but one has to be the JMS resource and the other the DB resource) in the same tx.  Notice the calls to phase2Commit and the object store API.

I will try to fix this so we send the message to the JMS queue in a REQUIRES_NEW method

Comment 22 John Mazzitelli 2008-12-23 07:27:47 UTC

svn rev 2528 puts REQUIRES_NEW on the producer SLSB so we no longer enlist the JMS XA resource with our DB XA resource. I verified that we are no longer performing 2PC when putting messages onto the queue. I'll attach a log file to show this.

Comment 23 John Mazzitelli 2008-12-23 07:35:48 UTC

Notice the difference between the 1pc log and the 2pc log (this is after my latest svn checkin).

Notice that you no longer see phase2Commit, nor do you see the tx logs get written to the object store. You only see the one-phase commit methods getting called. This verifies that we no longer use 2PC when putting messages on the JMS queue.
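To make the difference between the two logs concrete, here is a toy model of the decision a transaction coordinator makes at commit time. It is only a sketch: ToyCoordinator and its Resource interface are made up for this example and are not JBossTM classes, and the trace strings only loosely echo what appears in the attached logs.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a transaction coordinator; NOT the JBossTM implementation. */
final class ToyCoordinator {
    /** Hypothetical stand-in for an XA resource (a DB or JMS connection). */
    interface Resource {
        void prepare();
        void commit(boolean onePhase);
    }

    private final List<Resource> enlisted = new ArrayList<>();
    /** Records the steps taken at commit time. */
    final List<String> trace = new ArrayList<>();

    void enlist(Resource r) {
        enlisted.add(r);
    }

    void commit() {
        if (enlisted.isEmpty()) {
            return; // nothing to do
        }
        if (enlisted.size() == 1) {
            // Single resource: one-phase commit, no prepare, no tx log.
            trace.add("onePhaseCommit");
            enlisted.get(0).commit(true);
            return;
        }
        // Two or more resources: full two-phase commit.
        for (Resource r : enlisted) {
            r.prepare();
        }
        trace.add("prepare");
        // The outcome must be logged so it can be recovered after a crash.
        trace.add("objectStoreWrite");
        for (Resource r : enlisted) {
            r.commit(false);
        }
        trace.add("phase2Commit");
    }

    /** Runs one transaction with the given number of no-op resources. */
    static List<String> commitWith(int resourceCount) {
        ToyCoordinator tx = new ToyCoordinator();
        for (int i = 0; i < resourceCount; i++) {
            tx.enlist(new Resource() {
                @Override public void prepare() { }
                @Override public void commit(boolean onePhase) { }
            });
        }
        tx.commit();
        return tx.trace;
    }
}
```

In this model, commitWith(2) corresponds to the old producer path (DB and JMS resources in the same tx), while two separate commitWith(1) calls correspond to sending the JMS message from a REQUIRES_NEW method: each tx has a single resource and takes the one-phase path.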

Comment 24 John Mazzitelli 2008-12-23 08:17:58 UTC

I've documented this in the JBoss Transactions forum here:

http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4198192#4198192

I confirmed after running the server for a while and producing a lot of alerts (and looking in the logs for these JBossTM messages) that we are no longer using 2PC. We are still using XA resources, but there is trivial overhead involved since we aren't using 2PC and therefore never perform any additional prepare phases or write tx logs to the object store.

Because we do not use 2PC in any tx's that involve JMS, we do NOT have to fix/implement a JBossMQ recovery object (which also involves fixing a bug in JBossMQ)... see https://jira.jboss.org/jira/browse/JBAS-5502 and https://jira.jboss.org/jira/browse/JBTM-279 - this is a good thing because without those fixes, we could not recover from JBossMQ/JMS tx failures and thus using 2PC would have been pointless anyway. But we now remove 2PC semantics from our JMS usage so this point is now moot.

I am finally going to close this issue. We have XA resources deployed in our server and we have recovery objects implemented and installed - in case we ever need 2PC in the future. But right now, we never use 2PC so we do not use XA features. This helps fix the problem where our tx-object-store fills up during times of db failures and which eventually kills the server. If we ever do use 2PC (for our DB resources), our recovery implementation should be able to handle it (but we should NOT enable 2PC for JMS resources unless we upgrade our JBossAS and use the new JBoss Messaging - it has JBossTM recovery objects built in that we just have to configure/enable).

Comment 25 Charles Crouch 2009-01-15 23:25:44 UTC

I need a re-open workflow here :-(

I'm seeing the following fire every 2m10s after starting a JON server, (/trunk on jan14th against Oracle)

 [com.arjuna.ats.jta.logging.loggerI18N] [com.arjuna.ats.internal.jta.recovery.xarecovery1] Local XARecoveryModule.xaRecovery got XA exception javax.transaction.xa.XAException, XAException.XAER_RMERR

The only thing in the tx-object-store is

./HashedActionStore/defaultStore/Recovery/TransactionStatusManager/#33#:
a1058dc_845a_496f8084_0

Stopping and starting the server doesn't help, the message keeps appearing. Oracle is up and I can browse around the tables in dbvisualizer. The full server log is at:

/home/test_jon/jon03/perf/server/trunk/dev-container/logs/rhq-server-log4j.log

Comment 26 Jeff Weiss 2009-01-16 00:14:45 UTC

Rejected at request of ccrouch:

<ccrouch> jweiss do you have the jira magic to reopen http://jira.rhq-project.org/browse/RHQ-1183 ?

Comment 27 John Mazzitelli 2009-01-16 00:26:58 UTC

Read this thread, starting here:

http://www.jboss.com/index.html?module=bb&op=viewtopic&t=146138&postdays=0&postorder=asc&start=20

I specifically saw this XAER_RMERR, but according to Mark L, this is a generic "something went wrong, but that's about all I can tell you" error code.

Last time I hit it, it was because the db connection used by the AppServerJDBCXARecovery object was invalid - but I fixed that. See https://jira.jboss.org/jira/browse/JBTM-441

Did you happen to do anything to the database? Remove the user/schema? Clean the database?

There are permissions that must be assigned to the DB user in order for it to try to check what tx's need recovery. If you do not, you will get errors (unknown if it will be this specific error or not).

See:

http://management-platform.blogspot.com/2008/11/transaction-recovery-in-jbossas.html

in the text "Special Note To Oracle Users" for the permissions.

Comment 28 John Mazzitelli 2009-01-16 00:28:44 UTC

FYI: it was Jonathan, not Mark, that told me about that _RMERR code. Here's what he said (from that forum thread I linked to earlier):

"RMERR is a generic XA error code that covers a wide range of errors that can be collectively described as 'the resource manager is sulking'. It may be because the resource instance you have is on a connection that died when the db went down and did not reconnect. Or it may be because the user you are connecting as does not have the right permissions on the db to do recovery. Or maybe it's just in a bad mood. Try bouncing the app server too, that should help narrow down the possibilities."
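Since XAER_RMERR says so little on its own, it can help to log the symbolic name of the XA error code together with any underlying cause when debugging a recovery object. A small sketch using the standard javax.transaction.xa.XAException constants follows; the XaErrorCodes class itself is hypothetical and covers only a handful of codes.

```java
import javax.transaction.xa.XAException;

/** Hypothetical logging helper, not part of RHQ or JBossTM. */
final class XaErrorCodes {
    /** Maps a few common XAException error codes to their symbolic names. */
    static String name(int errorCode) {
        switch (errorCode) {
            case XAException.XAER_RMERR:  return "XAER_RMERR";  // resource manager error
            case XAException.XAER_RMFAIL: return "XAER_RMFAIL"; // resource manager unavailable
            case XAException.XAER_NOTA:   return "XAER_NOTA";   // XID not valid
            case XAException.XAER_INVAL:  return "XAER_INVAL";  // invalid arguments
            case XAException.XAER_PROTO:  return "XAER_PROTO";  // called in improper context
            default:                      return "code " + errorCode;
        }
    }

    /** Builds a log-friendly description, including the cause when present. */
    static String describe(XAException e) {
        String text = name(e.errorCode);
        if (e.getCause() != null) {
            // Drivers sometimes attach the underlying SQL error as the cause.
            text += " caused by " + e.getCause();
        }
        return text;
    }
}
```

Whether a cause is attached depends on the JDBC driver, so this may or may not reveal more than the bare code.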

Comment 29 Charles Crouch 2009-01-16 16:31:03 UTC

This turned out to be a doco update: RHQ-1368

Comment 30 Red Hat Bugzilla 2009-11-10 20:27:31 UTC

This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1183
Imported an attachment (id=368507)
Imported an attachment (id=368508)
Imported an attachment (id=368509)
Imported an attachment (id=368510)
This bug relates to RHQ-1017
