Red Hat Bugzilla – Bug 534354
do not use CMT when we do not need it
Last modified: 2014-05-02 16:33:24 EDT
MeasurementCompressionManagerBean has a few methods that compress/purge data - called from our quartz job.
Some have tx timeouts of 10 mins, others 30 mins.
However, sometimes the calculation and purges take longer than that (over an hour in some rare cases).
I've seen cases where this actually works - the database data is purged/compressed and committed fine (these methods use straight JDBC, going around the JPA entity manager).
But then when the SLSB method returns, the TxManager interceptors notice the timeout was exceeded and log messages and throw exceptions like "tx is not active!".
Obviously, we completely circumvented the container's transaction management, since we can see that the data was committed to the db.
Therefore, I think we need to use NOT_SUPPORTED or NEVER for some of these methods - we should look at all the places that use JDBC (not the entity manager) and decide if we can remove CMT.
I'm setting this to critical since this may cause the tx manager to think it needs to recover the transaction when it does not have to. We don't need to see these errors in the log nor do we want these exceptions thrown when there really isn't a problem.
I checked all usages of the MeasurementCompressionManager SLSB and they all originate from the DataPurgeJob. Since it's happening asynchronously, away from everything else, we should at least give it a very large timeout (on the order of hours - it's possible this job can take 2 hours, perhaps more as the data gets larger).
Or we just don't have it use CMT and rely on the database to handle the transaction itself - which is really all we need in this case.
These three are called from the compression manager bean:
int deleted = eventManager.purgeEventData(deleteUpToTime);
int alertsDeleted = alertManager.deleteAlerts(0, now - this.purgeAlert);
we need to watch out for tx timeouts here too
"Multiple threads active in the same transaction"
"it could also be due to the transaction timeout going off and JBossTS terminating the transaction automatically while the application thread is still associated with it."
"In situations where a timeout occurs, the business logic thread may not notice the transaction has been rolled back until it tries a commit. In such cases you may see something like:
[com.arjuna.ats.internal.jta.transaction.arjunacore.inactive] The transaction is not active!"
This is clearly the situation we are hitting. The JDBC driver call and the rest of our SLSB method don't know the tx timed out - but because we went around the entity manager and straight through JDBC, our stuff really did commit. After the method returns, the tx manager attempts the rollback, but we've already committed, which is why the data still shows in the database.
The compression manager bean and the data purge job have been refactored - all purging is now done from within the data purge job; the compression bean does not purge calltime, alert, or event data anymore.
rev2105 checks in the new transaction timeouts (see RHQ-1170). That checkin also changed the purge methods' timeouts - they are all now 6 hours.
After some testing, I realize this has nothing to do with us using JDBC vs. the JPA entity manager. When a transaction times out, the Arjuna tx manager handles it asynchronously - it rolls back that tx as soon as the timeout occurs. The thread running our method never gets an interrupt and is free to continue; however, when it returns, it will also attempt to roll back (you'll get an error in the logs from Arjuna saying something like "trying to abort an already aborted tx"). It's the same whether we use JDBC or the entity manager: both get rolled back. Unfortunately, there is no way to know that the tx has timed out (that I am aware of), and if we are stuck executing a SQL statement via JDBC statement.execute[Update], that must finish before the method returns (again, no thread interrupt happens)... so it could be seconds, minutes, or hours before that method returns and realizes that all its work was for naught and everything was rolled back.
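To make the behavior described in this comment concrete, here is a small plain-Java simulation. It is only a sketch: FakeTx, TimeoutSimulation and runWithTimeout are made-up names, and nothing here uses the real JTA or Arjuna APIs. A background "reaper" marks the transaction as rolled back when the timeout fires, the working thread is never interrupted, and it only finds out when it tries to commit at the end.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class TimeoutSimulation {

    /** Hypothetical stand-in for a transaction; NOT the JTA or Arjuna API. */
    static final class FakeTx {
        private final AtomicBoolean active = new AtomicBoolean(true);

        /** Called by the reaper thread when the timeout expires. */
        void rollbackFromReaper() {
            active.set(false);
        }

        /** Returns true only if the tx was still active at commit time. */
        boolean commit() {
            return active.compareAndSet(true, false);
        }
    }

    /**
     * Runs simulated work under a simulated tx timeout.
     * Returns true if the commit succeeded, false if the tx had already
     * been rolled back by the reaper while the work was still running.
     */
    static boolean runWithTimeout(long timeoutMillis, long workMillis) {
        ScheduledExecutorService reaper = Executors.newSingleThreadScheduledExecutor();
        FakeTx tx = new FakeTx();
        try {
            Runnable reap = tx::rollbackFromReaper;
            reaper.schedule(reap, timeoutMillis, TimeUnit.MILLISECONDS);
            simulateLongStatement(workMillis); // nobody interrupts this thread
            return tx.commit();                // the timeout is only noticed here
        } finally {
            reaper.shutdownNow();
        }
    }

    /** Stands in for a long-running SQL statement. */
    private static void simulateLongStatement(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With a short timeout and longer work (for example runWithTimeout(50, 400)), the work runs to completion and only then does the commit report failure, which mirrors the "all its work was for naught" case above.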
Now that RHQ-1195 provides a timeout feature, we should go through our stuff and see if we can benefit from interrupting the SLSB methods when the tx times out (like the data purge stuff)
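How RHQ-1195 implements its timeout is not described in this bug, so the following is only a generic plain-Java sketch of the idea of interrupting a method's thread when a deadline passes. InterruptOnTimeout and runWithInterrupt are hypothetical names. Note the assumption baked into it: the work has to be interruptible (here Thread.sleep stands in for it). A thread blocked inside a JDBC driver call may not react to an interrupt at all; java.sql.Statement.setQueryTimeout is the standard JDBC way to bound a single statement, and how well it is honored depends on the driver.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

final class InterruptOnTimeout {

    /**
     * Runs simulated work on the calling thread and has a watchdog
     * interrupt that thread if the timeout expires first.
     * Returns "completed" or "interrupted".
     */
    static String runWithInterrupt(long timeoutMillis, long workMillis) {
        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
        Thread methodThread = Thread.currentThread();
        Runnable interrupter = methodThread::interrupt;
        ScheduledFuture<?> pending =
                watchdog.schedule(interrupter, timeoutMillis, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(workMillis); // stands in for interruptible purge work
            return "completed";
        } catch (InterruptedException e) {
            // The watchdog fired: stop early instead of working on a dead tx.
            return "interrupted";
        } finally {
            pending.cancel(false);
            watchdog.shutdownNow();
            // Best effort: clear a late interrupt so it does not leak to the caller.
            Thread.interrupted();
        }
    }
}
```

For example, runWithInterrupt(50, 2000) comes back after roughly the timeout rather than after the full two seconds of simulated work.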
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1159
This bug is related to RHQ-1195
This bug relates to RHQ-1170