Bug 534198 (RHQ-1017)

Summary: consider clearing the tx-object-store at startup
Product: [Other] RHQ Project
Reporter: John Mazzitelli <mazz>
Component: Core Server
Assignee: John Mazzitelli <mazz>
Status: CLOSED NOTABUG
QA Contact:
Severity: medium
Docs Contact:
Priority: urgent
Version: unspecified
Keywords: Improvement
Target Milestone: ---
Target Release: ---
Hardware: All
OS: All
URL: http://jira.rhq-project.org/browse/RHQ-1017
Whiteboard:
Fixed In Version: 1.2
Doc Type: Enhancement

Description John Mazzitelli 2008-10-22 21:38:00 UTC
I've seen this several times. The last time, I shut down the DB, at which point the servers (although in maintenance mode) wigged out and spit out bunches of DB errors... to be expected. So I shut down the servers. After a few hours, the DB came back online and I restarted the server.

At this time, the servers logged tons of these kinds of messages:

(11:02:33 AM) mazz: 2008-10-22 10:01:45,018 WARN  [com.arjuna.ats.jta.logging.loggerI18N] [com.arjuna.ats.internal.jta.resources.arjunacore.norecoveryxa] [com.arjuna.ats.internal.jta.resources.arjunacore.norecoveryxa] Could not find new XAResource to use for recovering non-serializable XAResource < 131075, 29, 27, 1-a102368:b942:48fe2b53:cbd97a102368:b942:48fe2b53:cbda2^@^@^@^@^@^>

We are not using any 2PC-recoverable resources. After a restart, we should start clean and not have the transaction manager attempt to recover from past failures.

Would it be wise to have rhq-server.sh/bat completely purge the $RHQ_SERVER_HOME/jbossas/server/default/data/tx-object-store directory? This is where the Arjuna tx manager stores the data it needs when trying to recover failed tx's. I end up hand-deleting this directory anyway whenever I get these kinds of errors, so it's not like something horribly wrong will always happen. I don't know what the ramifications are of deleting this stuff, but it always seems to do the trick. Whether or not I'm losing data that I otherwise would not have lost is unknown, but I doubt I am.

Comment 1 John Mazzitelli 2008-10-30 13:53:53 UTC
I'm marking this as "fix for 1.2" and "blocker" to at least force us to look into this problem. I tried restarting servers after the system bogged down due to the Hibernate cache timing out. Had to hard kill everything and was getting DB errors for days. After I restarted, I saw tons of these Arjuna messages repeat.

Comment 2 John Mazzitelli 2008-11-03 13:30:08 UTC
Here's a patch to implement this in the Unix script rhq-server.sh. For the Windows script, it's a little more difficult because we first have to check whether the Windows service is already running, and if so we should not remove the directory.

Index: modules/enterprise/server/container/src/main/bin-resources/bin/rhq-server.sh
===================================================================
--- modules/enterprise/server/container/src/main/bin-resources/bin/rhq-server.sh	(revision 1864)
+++ modules/enterprise/server/container/src/main/bin-resources/bin/rhq-server.sh	(working copy)
@@ -130,6 +130,16 @@
 }
 
 # ----------------------------------------------------------------------
+# Performs some things that must be done just prior to starting server
+# ----------------------------------------------------------------------
+
+prepare_to_start ()
+{
+   # remove the Transactions Object Store directory (see RHQ-1017)
+   rm -rf ${RHQ_SERVER_HOME}/jbossas/server/default/data/tx-object-store
+}
+
+# ----------------------------------------------------------------------
 # Determine what specific platform we are running on.
 # Set some platform-specific variables.
 # ----------------------------------------------------------------------
@@ -285,7 +295,10 @@
         echo Starting RHQ Server in console...
 
         echo "$$" > $PIDFILE
-        
+
+        # before we start, prepare the server
+        prepare_to_start
+
         # start the server, making sure its working directory is the JBossAS bin directory 
         cd ${RHQ_SERVER_HOME}/jbossas/bin
         $_JBOSS_RUN_SCRIPT $RHQ_SERVER_CMDLINE_OPTS
@@ -307,7 +320,10 @@
 
         LAUNCH_JBOSS_IN_BACKGROUND=true
         export LAUNCH_JBOSS_IN_BACKGROUND
-        
+
+        # before we start, prepare the server
+        prepare_to_start
+
         # start the server, making sure its working directory is the JBossAS bin directory 
         cd ${RHQ_SERVER_HOME}/jbossas/bin
         if [ -z "$RHQ_SERVER_DEBUG" ]; then
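
A slightly more defensive variant of prepare_to_start could look like the sketch below. It is only an illustration and not part of the patch above: it assumes RHQ_SERVER_HOME is exported by the caller, and the _tx_store variable name is made up for this example. It refuses to run when RHQ_SERVER_HOME is unset or empty and quotes the path so that directories containing spaces are handled.

```shell
#!/bin/sh
# Sketch only: a guarded version of prepare_to_start (see RHQ-1017).
# Assumption: RHQ_SERVER_HOME is set by the caller; _tx_store is a
# hypothetical helper variable used only in this example.

prepare_to_start ()
{
   # ${VAR:?msg} makes the shell stop with an error if RHQ_SERVER_HOME is
   # unset or empty, so we never run rm -rf on a path rooted at "/"
   _tx_store="${RHQ_SERVER_HOME:?RHQ_SERVER_HOME is not set}/jbossas/server/default/data/tx-object-store"

   # remove the Transactions Object Store directory only if it exists
   if [ -d "$_tx_store" ]; then
      rm -rf "$_tx_store"
   fi
}
```

As in the patch, the function would be called just before the JBossAS run script is launched.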


Comment 3 John Mazzitelli 2008-11-03 15:18:08 UTC
The source code of the Java Service Wrapper (for our particular version) will be useful here. Look at the exit codes returned when you pass in the -q or -qs options.

http://wrapper.svn.sourceforge.net/viewvc/wrapper/trunk/wrapper/src/c/wrapper_win.c?revision=1148&view=markup

Specifically, the exit code is really a bitmask of the following:

If Windows Service is DISABLED, exit code is ORed with 32
If Windows Service is MANUAL started, exit code is ORed with 16
If Windows Service is AUTO started, exit code is ORed with 8
(the above three are mutually exclusive, only one of the bit values 32, 16 and 8 will ever be set)

If the Windows Service is Interactive with the desktop, exit code is ORed with 4.

If the Windows Service is RUNNING (this is the important one), exit code is ORed with 2.

If the Windows Service is installed, exit code is ORed with 1 (this bit is always set if any of the other bits are set - in other words, if the windows service is not installed, the exit code is always 0 - if the windows service is installed, the exit code will always have its bit value 1 set.).
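
To make the bit layout concrete, here is a minimal shell sketch that decodes such an exit code into words. The function name describe_wrapper_status is hypothetical (it is not part of RHQ or the Java Service Wrapper), and the bit values are simply the ones listed above.

```shell
#!/bin/sh
# Sketch only: decode the wrapper's -q/-qs exit code bitmask.
# describe_wrapper_status is a made-up helper name for this example.

describe_wrapper_status ()
{
   code=$1

   # bit value 1: service installed; when it is clear nothing else applies
   if [ $(( code & 1 )) -eq 0 ]; then
      echo "not installed"
      return 0
   fi
   status="installed"

   # bit values 32, 16 and 8: start type (mutually exclusive)
   if [ $(( code & 32 )) -ne 0 ]; then status="$status disabled"; fi
   if [ $(( code & 16 )) -ne 0 ]; then status="$status manual"; fi
   if [ $(( code & 8 )) -ne 0 ]; then status="$status auto"; fi

   # bit value 4: interactive with the desktop
   if [ $(( code & 4 )) -ne 0 ]; then status="$status interactive"; fi

   # bit value 2: running (the important one)
   if [ $(( code & 2 )) -ne 0 ]; then
      status="$status running"
   else
      status="$status stopped"
   fi

   echo "$status"
}
```

For example, describe_wrapper_status 9 prints "installed auto stopped" and describe_wrapper_status 11 prints "installed auto running".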

Because of the weird way Windows bat scripts handle exit codes ("if ERRORLEVEL n" is true when the exit code is n or greater, so an exact match needs a pair of tests), do something like this:

   if ERRORLEVEL 37 if not ERRORLEVEL 38 ... purge tx-object-store
   if ERRORLEVEL 33 if not ERRORLEVEL 34 ... purge tx-object-store
   if ERRORLEVEL 21 if not ERRORLEVEL 22 ... purge tx-object-store
   if ERRORLEVEL 17 if not ERRORLEVEL 18 ... purge tx-object-store
   if ERRORLEVEL 13 if not ERRORLEVEL 14 ... purge tx-object-store
   if ERRORLEVEL 9 if not ERRORLEVEL 10 ... purge tx-object-store

where exit codes 37, 33, 21, 17, 13 and 9 are all the valid bitmask values that the Java Service Wrapper will return when the service is installed but not running (e.g. 9 == 1+8 == installed service, set to start AUTOMATICALLY, not running).
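
The list of "not running" codes can be double-checked mechanically. The sketch below is an illustration only, with hypothetical function names: it enumerates every 6-bit code and keeps those with the installed bit set, the running bit clear, and exactly one start type. It also mimics the paired ERRORLEVEL test to show that the pair amounts to an equality check.

```shell
#!/bin/sh
# Sketch only: both function names are made up for this example.

# Print every exit code that means "installed but not running".
stopped_service_codes ()
{
   result=""
   code=0
   while [ "$code" -le 63 ]; do
      start_type=$(( code & 56 ))   # 56 == 32 + 16 + 8
      if [ $(( code & 1 )) -ne 0 ] && [ $(( code & 2 )) -eq 0 ]; then
         case $start_type in
            # exactly one of the three start-type bits is set
            8|16|32) result="$result $code" ;;
         esac
      fi
      code=$(( code + 1 ))
   done
   # unquoted on purpose: word splitting drops the leading space
   echo $result
}

# Mimic "if ERRORLEVEL n if not ERRORLEVEL n+1": each ERRORLEVEL test
# is a greater-or-equal comparison, so the pair matches exactly n.
errorlevel_is ()
{
   actual=$1
   wanted=$2
   [ "$actual" -ge "$wanted" ] && [ ! "$actual" -ge $(( wanted + 1 )) ]
}
```

Running stopped_service_codes prints "9 13 17 21 33 37", which matches the values used in the bat snippet above.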

Comment 4 John Mazzitelli 2008-11-03 15:44:34 UTC
Here is the exact code that we need to run in order to remove the tx-object-store directory just prior to starting the Windows service. Note that this code will only remove the tx-object-store if the Windows service is NOT running:

   rem query the service status; the exit code is the bitmask described above
   "%RHQ_SERVER_WRAPPER_EXE_FILE_PATH%" -qs "%RHQ_SERVER_WRAPPER_CONF_FILE_PATH%"
   rem each "if ERRORLEVEL n / if not ERRORLEVEL n+1" pair matches exactly n
   rem (the path is quoted so a server home containing spaces still works)
   if ERRORLEVEL 37 (
      if not ERRORLEVEL 38 (
         rmdir /q /s "%RHQ_SERVER_HOME%\jbossas\server\default\data\tx-object-store" > NUL 2>&1
      )
   ) else if ERRORLEVEL 33 (
      if not ERRORLEVEL 34 (
         rmdir /q /s "%RHQ_SERVER_HOME%\jbossas\server\default\data\tx-object-store" > NUL 2>&1
      )
   ) else if ERRORLEVEL 21 (
      if not ERRORLEVEL 22 (
         rmdir /q /s "%RHQ_SERVER_HOME%\jbossas\server\default\data\tx-object-store" > NUL 2>&1
      )
   ) else if ERRORLEVEL 17 (
      if not ERRORLEVEL 18 (
         rmdir /q /s "%RHQ_SERVER_HOME%\jbossas\server\default\data\tx-object-store" > NUL 2>&1
      )
   ) else if ERRORLEVEL 13 (
      if not ERRORLEVEL 14 (
         rmdir /q /s "%RHQ_SERVER_HOME%\jbossas\server\default\data\tx-object-store" > NUL 2>&1
      )
   ) else if ERRORLEVEL 9 (
      if not ERRORLEVEL 10 (
         rmdir /q /s "%RHQ_SERVER_HOME%\jbossas\server\default\data\tx-object-store" > NUL 2>&1
      )
   ) else if ERRORLEVEL 0 (
      if not ERRORLEVEL 1 (
         rmdir /q /s "%RHQ_SERVER_HOME%\jbossas\server\default\data\tx-object-store" > NUL 2>&1
      )
   )

Comment 5 John Mazzitelli 2008-11-22 22:29:07 UTC
The document below confirms that completely purging the tx-object-store is OK. It does say, though, that these messages indicate a config issue:

http://anonsvn.labs.jboss.com/labs/jbosstm/trunk/atsintegration/docs/IntegrationGuide.odt

"Further, even when the transaction is recorded, that record may not contain all the required information. This is particularly the case when using drivers that have non-serializable XAResource implementations, which unfortunately is most of them.

To address these situations, JBossTS uses configurable recovery modules. By providing a suitable plugin for any resource manager that is used, it can be ensured that all transactions will be recovered successfully. Without such configuration, the system will keep retrying failed transactions repeatedly without success. This leads to three consequences: resource managers may remain locked, denying service to other applications until such time as an administrator intervenes; entries for the failed transactions may remain in the ObjectStore indefinitely; the recovery manager will log failed recovery attempts on each pass, leading to larger than necessary log files. These situations commonly manifest themselves by repeated log entries of the form "Could not find new XAResource to use for recovering non-serializable XAResource < id string >"  Such errors can be ignored in development environments and eliminated by shutting down JBossTS and removing the contents of the ObjectStore before restarting. However, it is important to realize such log entries are indicative of misconfiguration and should be a serious concern in a production environment."

Comment 6 John Mazzitelli 2008-11-22 22:38:43 UTC
https://www.redhat.com/docs/manuals/jboss/jboss-eap-4.3/doc/jbossts/TX_Core_Failure_Recovery_Guide.pdf

In case we actually have to write a recovery module, see above. I highly doubt we do, but as a last resort we could write a dummy recovery module that just throws out the tx that is trying to be recovered.

Comment 7 John Mazzitelli 2008-11-25 20:28:09 UTC
if we implement this properly:

http://www.jboss.com/index.html?module=bb&op=viewtopic&t=146138

we won't need to clean the tx-object-store - we will recover properly and actually use it for what it was intended.

Comment 8 John Mazzitelli 2008-11-29 20:34:05 UTC
not implementing this.

Comment 9 Red Hat Bugzilla 2009-11-10 20:21:57 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1017
This bug is related to RHQ-1183
This bug relates to RHQ-938
This bug relates to RHQ-1032