We need to do this: http://www.jboss.com/index.html?module=bb&op=viewtopic&t=146138 to avoid the ugly XAResource recovery problems.
I'm almost there, but I can't seem to get some properties set.

2008-11-25 17:45:27,593 ERROR [STDERR] java.lang.NullPointerException
2008-11-25 17:45:27,593 ERROR [STDERR] at javax.naming.InitialContext.getURLScheme(InitialContext.java:269)
2008-11-25 17:45:27,593 ERROR [STDERR] at javax.naming.InitialContext.getURLOrDefaultInitCtx(InitialContext.java:318)
2008-11-25 17:45:27,593 ERROR [STDERR] at javax.naming.InitialContext.lookup(InitialContext.java:392)
2008-11-25 17:45:27,593 ERROR [STDERR] at com.arjuna.ats.internal.jdbc.recovery.JDBCXARecovery.createDataSource(JDBCXARecovery.java:174)
2008-11-25 17:45:27,593 ERROR [STDERR] at com.arjuna.ats.internal.jdbc.recovery.JDBCXARecovery.hasMoreResources(JDBCXARecovery.java:141)
2008-11-25 17:45:27,593 ERROR [STDERR] at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.resourceInitiatedRecovery(XARecoveryModule.java:679)
2008-11-25 17:45:27,593 ERROR [STDERR] at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.periodicWorkSecondPass(XARecoveryModule.java:179)
2008-11-25 17:45:27,593 ERROR [STDERR] at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWork(PeriodicRecovery.java:237)
2008-11-25 17:45:27,593 ERROR [STDERR] at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run(PeriodicRecovery.java:163)

But I have the JNDI name set in jbossjta-properties.xml:

<properties depends="arjuna" name="jta">
...
<!-- RHQ - add the ability to recover our transactions -->
<property name="com.arjuna.ats.jta.recovery.XAResourceRecoveryJDBC" value="com.arjuna.ats.internal.jdbc.recovery.JDBCXARecovery"/>
<!-- <property name="DatabaseURL" value="${rhq.server.database.connection-url}"/> -->
<property name="DatabaseJNDIName" value="java:/RHQDS"/>
<property name="UserName" value="${rhq.server.database.user-name}"/>
<property name="Password" value="${rhq.server.database.password}"/>

I can't figure out why the TM code can't see these properties.
I stepped through the code, and it is definitely getting null for the value of the property.
There is a bug in JBossTM that causes the NPE mentioned in the previous comment. Read: http://www.jboss.com/index.html?module=bb&op=viewtopic&t=146138

In addition, the following two things have to be done:

1) our java:/RHQDS is currently a non-XA local-tx datasource. We need to configure it as an <xa-datasource> in rhq-ds.xml

2) this doesn't work as expected inside jbossjta-properties.xml:

<property name="UserName" value="${rhq.server.database.user-name}"/>
<property name="Password" value="${rhq.server.database.password}"/>

The ${} tokens aren't replaced; they are literally the values of UserName and Password. We are going to need to write our own extension to com.arjuna.ats.internal.jdbc.recovery.JDBCXARecovery that takes the properties and replaces them. This is going to suck because we can't put that class in our ear - it has to go into server/lib so it can be found at server startup. We would have to configure it like this:

<!-- RHQ - add the ability to recover our transactions -->
<property name="com.arjuna.ats.jta.recovery.XAResourceRecoveryJDBC" value="org.rhq.JDBCXARecovery;jbossjta-properties.xml"/>
<!-- <property name="DatabaseURL" value="${rhq.server.database.connection-url}"/> -->
<property name="DatabaseJNDIName" value="java:/RHQDS"/>
<property name="UserName" value="${rhq.server.database.user-name}"/>
<property name="Password" value="${rhq.server.database.password}"/>

Another alternative would be to have the installer replace these ${} at deploy time, effectively hardcoding the values. Of course, if the user ever changes the DB user/pass, we no longer have all our config in a single file (rhq-server.properties) - we are once again spreading our configuration into the JBossAS internal deployment files, and this is not what we want: ALL configuration must be adjusted from within rhq-server.properties.
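For illustration only, here is a minimal sketch of the kind of ${} replacement a custom recovery class would need to do on its property values. It assumes the rhq.server.database.* settings are available as a java.util.Properties object (for example, system properties). The class name PlaceholderResolver and its method are hypothetical and are not part of JBossTM or RHQ.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JBossTM or RHQ.
public class PlaceholderResolver {
    // Matches tokens of the form ${some.property.name}
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /**
     * Replaces each ${name} in value with props.getProperty(name).
     * Unknown names are left untouched so a missing setting is easy to spot.
     */
    public static String resolve(String value, Properties props) {
        if (value == null) {
            return null;
        }
        Matcher m = PLACEHOLDER.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Fall back to the original token when the property is not defined.
            String replacement = props.getProperty(m.group(1), m.group(0));
            // quoteReplacement keeps '$' and '\' in passwords from being misread.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A recovery class could call something like PlaceholderResolver.resolve(rawValue, System.getProperties()) on the UserName and Password values before using them.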
We could add this to rhq-server.properties:

UserName=the.user.name
Password=the.password

right below the rhq.server.database.X settings - we'd be duplicating the configuration, but at least it would all be in the same file (and right next to each other).
The best thing would be to write our own JDBCXARecovery class - maybe we can put the source in enterprise/server/container - any classes in there would be bundled up in a single .jar and the container build can deploy that .jar in the server/lib directory.
see attached patch for what we can use to make sure XA recoverability works. This is the XARecovery implementation which is copied from the Arjuna example, except we allow for ${} variables in the prop values and we allow for the case when the data source isn't deployed yet (like when a newly installed server is started but the installer hasn't been told to deploy the ear). Last thing on the plate is to get our datasource to be an XA datasource, as opposed to a <local-tx-datasource>.
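A rough sketch of the "data source isn't deployed yet" handling, using only the standard javax.naming and javax.sql APIs. This is not the code from the attached patch; the class name TolerantDataSourceLookup is hypothetical, and it only shows the idea of returning nothing instead of failing when the JNDI name is missing or not bound yet.

```java
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

// Hypothetical helper, not the class from the attached patch.
public class TolerantDataSourceLookup {
    /**
     * Returns the DataSource bound at jndiName, or null if it cannot be
     * looked up right now (for example, the ear has not been deployed yet).
     */
    public static DataSource lookupOrNull(String jndiName) {
        if (jndiName == null) {
            // A null name is what produced the NPE inside InitialContext.lookup.
            return null;
        }
        try {
            Object bound = new InitialContext().lookup(jndiName);
            return (bound instanceof DataSource) ? (DataSource) bound : null;
        } catch (NamingException e) {
            // Not bound (yet) or no naming provider available; try again next pass.
            return null;
        }
    }
}
```

The recovery object could then report that it has no resources for this pass when lookupOrNull("java:/RHQDS") returns null, and try again on the next periodic recovery run.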
the example in the following link (section 8.3.2) seems to indicate the xa-datasource has different names for some of its required elements: http://docs.jboss.org/jbossas/getting_started/v4/html/db.html Once we get this fully implemented, we need to figure out how to force a failure that causes the recovery to happen. Not sure how to do this, perhaps we can come up with a clever way to do this that our admin/test/control.jsp can trigger.
read this: http://www.jboss.org/community/docs/DOC-9328 lots of "use this property to fix oracle problems". There will need to be changes to the installer and container build scripts now that rhq-ds.xml is going to be database specific (right now, it's generic, with only ${var} being able to make it behave differently).
typical oracle XA data source config: http://www.jboss.org/community/docs/DOC-12246 typical postgres XA data source config: http://www.jboss.org/community/docs/DOC-12248
if we do this (almost complete) we should invalidate RHQ-1017 - the tx-object-store will actually be important to keep around for recovery purposes.
After this is complete, we need to invalidate RHQ-938 - JMS data store can remain (in fact, should remain) XA compliant.
The JBossTM integration with JBossAS 4.2.1 does not seem fully complete. Read: https://jira.jboss.org/jira/browse/JBTM-319 and its associated forum thread. I may take the work done for that JIRA and put it in our custom recovery object (as recommended by someone on the JBossTM team)
I will take this code and refactor my recoverer to be like this one: http://anonsvn.labs.jboss.com/labs/jbosstm/branches/JBOSSTS_4_2_3_GA_SP/atsintegration/classes/com/arjuna/ats/internal/jbossatx/jta/AppServerJDBCXARecovery.java Right now, I'm basing my recoverer code off of JDBCXARecovery - but that class is broken and doesn't work. I will give the above class a try to see if it works any better.
I think this wiki: http://www.jboss.org/community/docs/DOC-10789 has examples on how we can force a tx error to occur that triggers a recovery.
attached is rhq-1183.patch - it is a patch to svn rev 2136. It adds full XA enablement/configuration to the RHQ server. Was able to start from installer to create schema. Imported agent and created an alert. I confirmed alert definitions get inserted (so JMS is working). I ran db maintenance (vacuum, analyze, reindex) and the purge data job. I did the above tests on both Postgres 8.3 and Oracle10g. No errors seen. Would like to now try to figure out how we can trigger XA recovery to prove this really works now :)
that old patch that was attached is now deleted - turns out that arjuna class I copied and refactored doesn't work when deployed in JBossAS. the new grand patch that is attached has a new recoverer object that now works.
ran unit tests, all pass. all smoke tests on postgres and oracle pass. checked into svn - rev 2137
before closing this, I will want to come up with some way to actually see that we no longer get that "XAResource not serializable" error, and that we actually do recover from a tx failure.
did a really quick test to see what would happen. Running the server, agent sending metrics. I shut down postgres. I get tons of errors in the logs because the conn pool cannot get db connections (it cannot connect). But towards the end of the logs, I see this:

2008-11-29 01:17:53,625 WARN [com.arjuna.ats.jta.logging.loggerI18N] [com.arjuna.ats.internal.jta.recovery.xarecovery1] Local XARecoveryModule.xaRecovery got XA exception org.postgresql.xa.PGXAException: Error during recover, XAException.XAER_RMERR

So, it looks like the recovery DID get attempted! I assume I hadn't turned the DB back on at the time of the attempt so it still failed. Will have to play with this some more.
I tried the same test on Oracle, and got the same thing. I thought maybe the problem is the database needs to be configured for XA support (perhaps they aren't configured for it out of box?). But it seems odd that both Oracle and Postgres show the same error. One positive - after I shut down the server and restarted, I didn't get these errors anymore. From what I recall before (in the old setup w/o XA), restarting the server just resumed the old "XAResource is not serializable" errors - they would never go away unless you purged the tx-object-store directory (or you waited 12 hours, which I think is how long JBossTM will wait before expiring the txs). So I got that going for me :-/
http://anonsvn.labs.jboss.com/labs/jbosstm/workspace/adinn/orchestration/README how to trigger failures to test recovery
svn rev 2527 now tries to make sure our JMS consumer bean never enlists more than one datasource/XA resource in a tx. I am attaching "jms-1pc-trace.log" that shows the JBossTM debug logging which illustrates one JMS call performing 1PC. It demonstrates that it does not engage in 2PC semantics and it never writes tx logs to the object store. That attached log file has "*** [mazz]" comments threaded throughout to explain what the log messages are telling us. A good forum post to read that shows what our logs DID look like when this svn rev commit was not in trunk (thus this shows what our logs looked like when performing 2PC from our JMS consumer bean): http://www.jboss.com/index.html?module=bb&op=viewtopic&t=147697
I just noticed that we are still performing 2PC that is triggered in a tx running an incoming agent command. Turns out, I forgot to wrap the JMS producer in a new tx to again make sure the tx does not enlist both the JMS XA resource and the DB XA resource. I will attach a log file (jms-producer-2pc.log) that illustrates what JBossTM logs when performing 2PC. I added two "fatal" level log messages to CachedConditionProducerBean.sendActivateAlertConditionMessage... the first says "!!! BEFORE" at the first line of that method and the second says "!!! AFTER" at the last line of the method. You can easily see the JMS XA resource get enlisted with the DB XA resource (well, you can just see 2 generic XA resources enlisted, but one has to be the JMS resource and the other the DB resource) in the same tx. Notice the calls to phase2Commit and the object store API. I will try to fix this so we send the message to the JMS queue in a REQUIRES_NEW method.
svn rev 2528 puts REQUIRES_NEW on the producer SLSB so we no longer enlist the JMS XA resource with our DB XA resource. I verified that we are no longer performing 2PC when putting messages onto the queue. I'll attach a log file to show this.
Notice the difference between the 1pc log and the 2pc log (this is after my latest svn checkin). Notice that you no longer see phase2Commit nor do you see the object store get the tx logs written to it. You only see one-phase commit methods getting called. This verifies that we no longer use 2PC when putting messages on the JMS queue.
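If eyeballing the logs gets tedious, the same check can be roughly automated. A small sketch, assuming the JBossTM debug output has already been read into a list of lines; the class name TwoPhaseLogCheck is hypothetical, and "phase2Commit" is simply the marker text mentioned above.

```java
import java.util.List;

// Hypothetical helper for scanning JBossTM debug logging.
public class TwoPhaseLogCheck {
    /** Counts log lines that mention the given marker, e.g. "phase2Commit". */
    public static long countLinesContaining(List<String> logLines, String marker) {
        long count = 0;
        for (String line : logLines) {
            if (line.contains(marker)) {
                count++;
            }
        }
        return count;
    }
}
```

A count of zero for "phase2Commit" over a log captured while alerts are firing would be consistent with 1PC only, though it is no substitute for reading the attached annotated logs.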
I've documented this in the JBoss Transactions forum here: http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4198192#4198192

I confirmed after running the server for a while and producing a lot of alerts (and looking in the logs for these JBossTM messages) that we are no longer using 2PC. We are still using XA resources, but there is trivial overhead involved since we aren't using 2PC and therefore never perform any additional prepare phases or write tx logs to the object store.

Because we do not use 2PC in any tx's that involve JMS, we do NOT have to fix/implement a JBossMQ recovery object (which also involves fixing a bug in JBossMQ)... see https://jira.jboss.org/jira/browse/JBAS-5502 and https://jira.jboss.org/jira/browse/JBTM-279 - this is a good thing because without those fixes, we could not recover from JBossMQ/JMS tx failures and thus using 2PC would have been pointless anyway. But we now remove 2PC semantics from our JMS usage so this point is now moot.

I am finally going to close this issue. We have XA resources deployed in our server and we have recovery objects implemented and installed - in case we ever need 2PC in the future. But right now, we never use 2PC so we do not use XA features. This helps fix the problem where our tx-object-store fills up during times of db failures, which eventually kills the server. If we ever do use 2PC (for our DB resources), our recovery implementation should be able to handle it (but we should NOT enable 2PC for JMS resources unless we upgrade our JBossAS and use the new JBoss Messaging - it has JBossTM recovery objects built in that we just have to configure/enable).
I need a re-open workflow here :-( I'm seeing the following fire every 2m10s after starting a JON server (/trunk on jan14th against Oracle):

[com.arjuna.ats.jta.logging.loggerI18N] [com.arjuna.ats.internal.jta.recovery.xarecovery1] Local XARecoveryModule.xaRecovery got XA exception javax.transaction.xa.XAException, XAException.XAER_RMERR

The only thing in the tx-object-store is:

./HashedActionStore/defaultStore/Recovery/TransactionStatusManager/#33#: a1058dc_845a_496f8084_0

Stopping and starting the server doesn't help; the message keeps appearing. Oracle is up and I can browse around the tables in dbvisualizer. The full server log is at: /home/test_jon/jon03/perf/server/trunk/dev-container/logs/rhq-server-log4j.log
Rejected at request of ccrouch: <ccrouch> jweiss do you have the jira magic to reopen http://jira.rhq-project.org/browse/RHQ-1183 ?
Read this thread, starting here: http://www.jboss.com/index.html?module=bb&op=viewtopic&t=146138&postdays=0&postorder=asc&start=20 I specifically saw this XAER_RMERR, but according to Mark L, this is a generic "something went wrong, but that's about all I can tell you" error code. Last time I hit it, it was because the db connection used by the AppServerJDBCXARecovery object was invalid - but I fixed that. See https://jira.jboss.org/jira/browse/JBTM-441 Did you happen to do anything to the database? Remove the user/schema? Clean the database? There are permissions that must be assigned to the DB user in order for it to try to check what tx's need recovery. If you do not, you will get errors (unknown if it will be this specific error or not). See: http://management-platform.blogspot.com/2008/11/transaction-recovery-in-jbossas.html in the text "Special Note To Oracle Users" for the permissions.
FYI: it was Jonathan, not Mark, that told me about that _RMERR code. Here's what he said (from that forum thread I linked to earlier): "RMERR is a generic XA error code that covers a wide range of errors that can be collectively described as 'the resource manager is sulking'. It may be because the resource instance you have is on a connection that died when the db went down and did not reconnect. Or it may be because the user you are connecting as does not have the right permissions on the db to do recovery. Or maybe it's just in a bad mood. Try bouncing the app server too, that should help narrow down the possibilities."
This turned out to be a doco update: RHQ-1368
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1183 Imported an attachment (id=368507) Imported an attachment (id=368508) Imported an attachment (id=368509) Imported an attachment (id=368510) This bug relates to RHQ-1017