Created attachment 969530 [details] Server Logs Description of problem: In 4.13, setting the option "Delete resource configuration history older than..." to 180 days, the purge operation is not actually purging the data. Configuration History still contains data from May 2012, with 169487 rows. The table RHQ_CONFIG_PROPERTY contains 30677141 rows. The following is logged but it is not purging anything: 19:30:00,836 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Purging resource configuration history data older than Wed Jun 18 20:30:00 CEST 2014 19:30:01,983 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Resource configuration history purged [0] - completed in [1147]ms Also, this might be related since it comes in the log at about the same interval: UnsupportedOperationException on mergeInventoryReport() https://developer.jboss.org/thread/250979 Version-Release number of selected component (if applicable): RHQ 4.13 on x86_64 RHEL 6.5 Database Product Version : Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 Driver Version : 11.2.0.3.0 How reproducible: Every time, well on six installations we have the same occurs. Steps to Reproduce: 1. 2. 3. Actual results: Data is not purged from database. Expected results: Would expect data to be purged. Additional info: Attaching logs.
I also find stuff like this in the logs: 11:30:05,682 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Resource configuration history purge starting at Thu Dec 18 11:30:05 CET 2014 11:30:05,682 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Purging resource configuration history data older than Tue Dec 18 11:30:05 CET 2012 11:54:13,709 WARN [com.arjuna.ats.arjuna] (RHQScheduler_Worker-2) ARJUNA012077: Abort called on already aborted atomic action 0:ffff9b374821:4479387a:54917281:13f67b 11:54:13,710 WARN [com.arjuna.ats.arjuna] (Transaction Reaper Worker 22) ARJUNA012113: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 22,5,main] missed interrupt when cancelling TX 0:ffff9b374821:4479387a:54917281:13f67b -- exiting as zombie (zombie count decremented to 0) 11:54:13,711 ERROR [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-2) ResourceConfigurationUpdate: could not fully process the batched purge: javax.transaction.RollbackException: ARJUNA016102: The transaction is not active! Uid is 0:ffff9b374821:4479387a:54917281:13f67b at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1179) [jbossjts-jacorb-4.17.17.Final.jar:4.17.17.Final] at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:126) [jbossjts-jacorb-4.17.17.Final.jar:4.17.17.Final] at com.arjuna.ats.jbossatx.BaseTransactionManagerDelegate.commit(BaseTransactionManagerDelegate.java:75) at org.jboss.tm.usertx.client.ServerVMClientUserTransaction.commit(ServerVMClientUserTransaction.java:173) at org.rhq.enterprise.server.purge.PurgeTemplate.deleteRows(PurgeTemplate.java:207) [rhq-server.jar:4.13.0] at org.rhq.enterprise.server.purge.PurgeTemplate.execute(PurgeTemplate.java:111) [rhq-server.jar:4.13.0] at org.rhq.enterprise.server.purge.ResourceConfigurationUpdatePurge.execute(ResourceConfigurationUpdatePurge.java:37) [rhq-server.jar:4.13.0] at org.rhq.enterprise.server.purge.PurgeManagerBean.purgeResourceConfigHistory(PurgeManagerBean.java:177) [rhq-server.jar:4.13.0] at sun.reflect.GeneratedMethodAccessor389.invoke(Unknown Source) [:1.7.0_55] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_55] at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_55]
I am not an Oracle admin and I don't really dare run massive 'delete from' SQLs in production without being sure what I'm doing. I really wish there was a simple CLI 'oneliner' to just purge the history and start fresh. I am attaching a screenshot from DBVis which shows the row count of the RHQ_CONFIG tables. Also attaching latest logs from RHQ server.
Created attachment 970505 [details] RHQ server logs as of 18/12
Created attachment 970506 [details] RHQ table sizes.
I have set DEBUG on org.rhq.enterprise.server.purge.PurgeTemplate. Not that it gives us much more... 14:30:07,452 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Resource configuration history purge starting at Thu Dec 18 14:30:07 CET 2014 14:30:07,452 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Purging resource configuration history data older than Tue Dec 18 14:30:07 CET 2012 14:31:38,288 DEBUG [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-2) Loaded 101274 key(s) of ResourceConfigurationUpdate 14:31:38,307 DEBUG [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-2) Deleting 30000 row(s) of ResourceConfigurationUpdate ... 14:52:25,201 WARN [com.arjuna.ats.arjuna] (RHQScheduler_Worker-2) ARJUNA012077: Abort called on already aborted atomic action 0:ffff9b374821:-70503208:5492d1ed:4d75 14:52:25,203 WARN [com.arjuna.ats.arjuna] (Transaction Reaper Worker 0) ARJUNA012113: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 0,5,main] missed interrupt when cancelling TX 0:ffff9b374821:-70503208:5492d1ed:4d75 -- exiting as zombie (zombie count decremented to 0) 14:52:25,204 ERROR [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-2) ResourceConfigurationUpdate: could not fully process the batched purge: javax.transaction.RollbackException: ARJUNA016102: The transaction is not active! Uid is 0:ffff9b374821:-70503208:5492d1ed:4d75 at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1179) [jbossjts-jacorb-4.17.17.Final.jar:4.17.17.Final] at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:126) [jbossjts-jacorb-4.17.17.Final.jar:4.17.17.Final] at com.arjuna.ats.jbossatx.BaseTransactionManagerDelegate.commit(BaseTransactionManagerDelegate.java:75) at org.jboss.tm.usertx.client.ServerVMClientUserTransaction.commit(ServerVMClientUserTransaction.java:173) ... 14:52:25,361 DEBUG [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-2) Loaded 0 key(s) of ResourceConfigurationUpdateFromGroup 14:52:25,379 DEBUG [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-2) Loaded 0 key(s) of GroupResourceConfigurationUpdate 14:52:25,379 INFO [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Resource configuration history purged [0] - completed in [1337927]ms Maybe an idea to set DEBUG on com.arjuna.ats ? Attaching full server.log
Created attachment 970639 [details] Server logs with DEBUG on.
Sorry for the late reply. It truly tells me more: the problem is not in the process of loading the keys, but in the process of deletion (as indicated by the "Loaded 101274 key(s) of ResourceConfigurationUpdate" message). Can you start your RHQ servers with the following system property (must be set on all servers in an HA deployment): "-Dorg.rhq.enterprise.server.purge.PurgeTemplate.BATCH_SIZE=1000" The default is 30000. I think 30000 is fine for simple deletions but deleting lines in RHQ_CONFIG implies the deletion of various lines in RHQ_CONFIG_PROPERTY.
Thanks for the update Thomas, I have set the property: RHQ_SERVER_ADDITIONAL_JAVA_OPTS="-Dorg.rhq.enterprise.server.purge.PurgeTemplate.BATCH_SIZE=1000" I assume this means a smaller number of rows scheduled for deletion on each run, so since it runs every hour I would guess it must take a couple of days to delete most of the 180k entries in config history. I think the default should be smaller than 30k anyway - assuming it runs once an hour there really should be no need to remove 30000 entries from the RHQ_CONFIG table every time it runs. Either that, or the transaction timeout should be increased to maybe to 60 minutes to give it a chance to complete in time.
(In reply to Stian Lund from comment #8) > Thanks for the update Thomas, I have set the property: > > RHQ_SERVER_ADDITIONAL_JAVA_OPTS="-Dorg.rhq.enterprise.server.purge. > PurgeTemplate.BATCH_SIZE=1000" > > I assume this means a smaller number of rows scheduled for deletion on each > run, so since it runs every hour I would guess it must take a couple of days > to delete most of the 180k entries in config history. > No. When the data purge job runs, it loads *all* the keys and store them in a temp file. Then it deletes rows in batches which size is determined by the system property. In other words, no matter the size of batches, when the purge job completes, the same amount of data should have been deleted. > I think the default should be smaller than 30k anyway - assuming it runs > once an hour there really should be no need to remove 30000 entries from the > RHQ_CONFIG table every time it runs. Experience in the community showed that 30k is pretty fine for other data purged by the job. The problem here is that deleting a row in RHQ_CONFIG also means deleting many rows in RHQ_CONFIG_PROP, as well as in RHQ_RESOURCE_CONFIG_UPDATE. > > Either that, or the transaction timeout should be increased to maybe to 60 > minutes to give it a chance to complete in time. I disagree. Look at your logs: it's already 20 minutes when the transaction is rollbacked. The purpose of batched deletion is precisely to make transactions short lived.
Ok thanks for the explanation :) This seems to have helped a lot; it has now deleted a lot of those old entries. 11:32:37,641 DEBUG [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-1) Loaded 82010 key(s) of ResourceConfigurationUpdate 11:32:37,641 DEBUG [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-1) Deleting 1000 row(s) of ResourceConfigurationUpdate (RHQScheduler_Worker-1) Deleting 1000 row(s) of ResourceConfigurationUpdate ... 11:52:29,520 DEBUG [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-1) Deleting 1000 row(s) of ResourceConfigurationUpdate 11:52:47,017 ERROR [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-1) ResourceConfigurationUpdate: could not fully process the batched purge: java.sql.BatchUpdateException: ORA-00060: deadlock detected while waiting for resource It did exit with an ERROR but I think this is something that will be ok for next run. It seems to be unable to delete some of the old 2012 entries but I will let it run a couple more times before maybe deleting some manually. > I disagree. Look at your logs: it's already 20 minutes when the transaction is rollbacked. Well, if the transaction runs for 20 minutes and gets rolled back then stuff should be done in smaller batches or timeouts increased. Smaller batches is always better anyway I think. Anyway this shouldn't be something that affects a lot of users and since RHQ now has the ability to purge this data it should be ok for most. Interesting to note is also that the BZ I reported on UnsupportedOperationException seems to be gone now that the purge is working? https://bugzilla.redhat.com/show_bug.cgi?id=1174841
Closed NOTABUG : batch size configuration needed
(In reply to Stian Lund from comment #10) > Interesting to note is also that the BZ I reported on > UnsupportedOperationException seems to be gone now that the purge is working? > https://bugzilla.redhat.com/show_bug.cgi?id=1174841 I doubt purge has anything to do with the problem described in Bug 1174841 (there's no connection with inventory report process)
Just updating a bit; I had ~68k Recent Operations in one Test-installation since I have been trying to repro a bug with RHQ Agent restarts. The same timeout occurred there when trying to delete, so I still suggest the batch should be set smaller by default in RHQ/JON. Otherwise users have to set this value for every RHQ server installation.