1174747 – 4.13 Resource Configuration History, not being purged

Bug 1174747 - 4.13 Resource Configuration History, not being purged

Summary: 4.13 Resource Configuration History, not being purged

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Configuration, Core Server, Database
Sub Component:
Version:	4.13
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	RHQ Project Maintainer
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2014-12-16 12:21 UTC by Stian Lund
Modified:	2015-01-14 14:50 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-01-05 13:08:24 UTC
Embargoed:

Attachments	(Terms of Use)
Server Logs (815.38 KB, application/zip) 2014-12-16 12:21 UTC, Stian Lund	no flags	Details
RHQ server logs as of 18/12 (848.49 KB, application/zip) 2014-12-18 11:53 UTC, Stian Lund	no flags	Details
RHQ table sizes. (20.77 KB, image/png) 2014-12-18 11:54 UTC, Stian Lund	no flags	Details
Server logs with DEBUG on. (81.35 KB, application/zip) 2014-12-18 16:19 UTC, Stian Lund	no flags	Details
View All

Description Stian Lund 2014-12-16 12:21:31 UTC

Created attachment 969530 [details]
Server Logs

Description of problem:

In 4.13, setting the option "Delete resource configuration history older than..." to 180 days, the purge operation is not actually purging the data.

Configuration History still contains data from May 2012, with 169487 rows.

The table RHQ_CONFIG_PROPERTY contains 30677141 rows.

The following is logged but it is not purging anything:

19:30:00,836 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Purging resource configuration history data older than Wed Jun 18 20:30:00 CEST 2014
19:30:01,983 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Resource configuration history purged [0] - completed in [1147]ms

Also, this might be related since it comes in the log at about the same interval:
UnsupportedOperationException on mergeInventoryReport()
https://developer.jboss.org/thread/250979

Version-Release number of selected component (if applicable):

RHQ 4.13 on x86_64 RHEL 6.5
Database Product Version : Oracle Database 11g Enterprise Edition Release 11.2.0.3.0
Driver Version :	11.2.0.3.0


How reproducible:

Every time, well on six installations we have the same occurs.

Steps to Reproduce:
1. 
2.
3.

Actual results:

Data is not purged from database.

Expected results:

Would expect data to be purged.

Additional info:

Attaching logs.

Comment 1 Stian Lund 2014-12-18 11:28:24 UTC

I also find stuff like this in the logs:

11:30:05,682 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Resource configuration history purge starting at Thu Dec 18 11:30:05 CET 2014
11:30:05,682 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Purging resource configuration history data older than Tue Dec 18 11:30:05 CET 2012

11:54:13,709 WARN  [com.arjuna.ats.arjuna] (RHQScheduler_Worker-2) ARJUNA012077: Abort called on already aborted atomic action 0:ffff9b374821:4479387a:54917281:13f67b
11:54:13,710 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 22) ARJUNA012113: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 22,5,main] missed interrupt when cancelling TX 0:ffff9b374821:4479387a:54917281:13f67b -- exiting as zombie (zombie count decremented to 0)
11:54:13,711 ERROR [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-2) ResourceConfigurationUpdate: could not fully process the batched purge: javax.transaction.RollbackException: ARJUNA016102: The transaction is not active! Uid is 0:ffff9b374821:4479387a:54917281:13f67b
	at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1179) [jbossjts-jacorb-4.17.17.Final.jar:4.17.17.Final]
	at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:126) [jbossjts-jacorb-4.17.17.Final.jar:4.17.17.Final]
	at com.arjuna.ats.jbossatx.BaseTransactionManagerDelegate.commit(BaseTransactionManagerDelegate.java:75)
	at org.jboss.tm.usertx.client.ServerVMClientUserTransaction.commit(ServerVMClientUserTransaction.java:173)
	at org.rhq.enterprise.server.purge.PurgeTemplate.deleteRows(PurgeTemplate.java:207) [rhq-server.jar:4.13.0]
	at org.rhq.enterprise.server.purge.PurgeTemplate.execute(PurgeTemplate.java:111) [rhq-server.jar:4.13.0]
	at org.rhq.enterprise.server.purge.ResourceConfigurationUpdatePurge.execute(ResourceConfigurationUpdatePurge.java:37) [rhq-server.jar:4.13.0]
	at org.rhq.enterprise.server.purge.PurgeManagerBean.purgeResourceConfigHistory(PurgeManagerBean.java:177) [rhq-server.jar:4.13.0]
	at sun.reflect.GeneratedMethodAccessor389.invoke(Unknown Source) [:1.7.0_55]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_55]
	at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_55]

Comment 2 Stian Lund 2014-12-18 11:53:15 UTC

I am not an Oracle admin and I don't really dare run massive 'delete from' SQLs in production without being sure what I'm doing.

I really wish there was a simple CLI 'oneliner' to just purge the history and start fresh.

I am attaching a screenshot from DBVis which shows the row count of the RHQ_CONFIG tables. Also attaching latest logs from RHQ server.

Comment 3 Stian Lund 2014-12-18 11:53:49 UTC

Created attachment 970505 [details]
RHQ server logs as of 18/12

Comment 4 Stian Lund 2014-12-18 11:54:15 UTC

Created attachment 970506 [details]
RHQ table sizes.

Comment 5 Stian Lund 2014-12-18 16:18:38 UTC

I have set DEBUG on org.rhq.enterprise.server.purge.PurgeTemplate.

Not that it gives us much more...

14:30:07,452 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Resource configuration history purge starting at Thu Dec 18 14:30:07 CET 2014
14:30:07,452 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Purging resource configuration history data older than Tue Dec 18 14:30:07 CET 2012
14:31:38,288 DEBUG [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-2) Loaded 101274 key(s) of ResourceConfigurationUpdate
14:31:38,307 DEBUG [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-2) Deleting 30000 row(s) of ResourceConfigurationUpdate

...

14:52:25,201 WARN  [com.arjuna.ats.arjuna] (RHQScheduler_Worker-2) ARJUNA012077: Abort called on already aborted atomic action 0:ffff9b374821:-70503208:5492d1ed:4d75
14:52:25,203 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 0) ARJUNA012113: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 0,5,main] missed interrupt when cancelling TX 0:ffff9b374821:-70503208:5492d1ed:4d75 -- exiting as zombie (zombie count decremented to 0)
14:52:25,204 ERROR [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-2) ResourceConfigurationUpdate: could not fully process the batched purge: javax.transaction.RollbackException: ARJUNA016102: The transaction is not active! Uid is 0:ffff9b374821:-70503208:5492d1ed:4d75
	at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1179) [jbossjts-jacorb-4.17.17.Final.jar:4.17.17.Final]
	at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:126) [jbossjts-jacorb-4.17.17.Final.jar:4.17.17.Final]
	at com.arjuna.ats.jbossatx.BaseTransactionManagerDelegate.commit(BaseTransactionManagerDelegate.java:75)
	at org.jboss.tm.usertx.client.ServerVMClientUserTransaction.commit(ServerVMClientUserTransaction.java:173)

...
14:52:25,361 DEBUG [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-2) Loaded 0 key(s) of ResourceConfigurationUpdateFromGroup
14:52:25,379 DEBUG [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-2) Loaded 0 key(s) of GroupResourceConfigurationUpdate
14:52:25,379 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-2) Resource configuration history purged [0] - completed in [1337927]ms

Maybe an idea to set DEBUG on com.arjuna.ats ?

Attaching full server.log

Comment 6 Stian Lund 2014-12-18 16:19:17 UTC

Created attachment 970639 [details]
Server logs with DEBUG on.

Comment 7 Thomas Segismont 2015-01-02 15:53:58 UTC

Sorry for the late reply. It truly tells me more: the problem is not in the process of loading the keys, but in the process of deletion (as indicated by the "Loaded 101274 key(s) of ResourceConfigurationUpdate" message).

Can you start your RHQ servers with the following system property (must be set on all servers in an HA deployment):
"-Dorg.rhq.enterprise.server.purge.PurgeTemplate.BATCH_SIZE=1000"

The default is 30000. I think 30000 is fine for simple deletions but deleting lines in RHQ_CONFIG implies the deletion of various lines in RHQ_CONFIG_PROPERTY.

Comment 8 Stian Lund 2015-01-05 10:05:13 UTC

Thanks for the update Thomas, I have set the property:

RHQ_SERVER_ADDITIONAL_JAVA_OPTS="-Dorg.rhq.enterprise.server.purge.PurgeTemplate.BATCH_SIZE=1000"

I assume this means a smaller number of rows scheduled for deletion on each run, so since it runs every hour I would guess it must take a couple of days to delete most of the 180k entries in config history.

I think the default should be smaller than 30k anyway - assuming it runs once an hour there really should be no need to remove 30000 entries from the RHQ_CONFIG table every time it runs.

Either that, or the transaction timeout should be increased to maybe to 60 minutes to give it a chance to complete in time.

Comment 9 Thomas Segismont 2015-01-05 10:32:11 UTC

(In reply to Stian Lund from comment #8)
> Thanks for the update Thomas, I have set the property:
> 
> RHQ_SERVER_ADDITIONAL_JAVA_OPTS="-Dorg.rhq.enterprise.server.purge.
> PurgeTemplate.BATCH_SIZE=1000"
> 
> I assume this means a smaller number of rows scheduled for deletion on each
> run, so since it runs every hour I would guess it must take a couple of days
> to delete most of the 180k entries in config history.
> 

No. When the data purge job runs, it loads *all* the keys and store them in a temp file. Then it deletes rows in batches which size is determined by the system property.

In other words, no matter the size of batches, when the purge job completes, the same amount of data should have been deleted.

> I think the default should be smaller than 30k anyway - assuming it runs
> once an hour there really should be no need to remove 30000 entries from the
> RHQ_CONFIG table every time it runs.

Experience in the community showed that 30k is pretty fine for other data purged by the job. The problem here is that deleting a row in RHQ_CONFIG also means deleting many rows in RHQ_CONFIG_PROP, as well as in RHQ_RESOURCE_CONFIG_UPDATE.



> 
> Either that, or the transaction timeout should be increased to maybe to 60
> minutes to give it a chance to complete in time.

I disagree. Look at your logs: it's already 20 minutes when the transaction is rollbacked. The purpose of batched deletion is precisely to make transactions short lived.

Comment 10 Stian Lund 2015-01-05 12:26:42 UTC

Ok thanks for the explanation :)

This seems to have helped a lot; it has now deleted a lot of those old entries.

11:32:37,641 DEBUG [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-1) Loaded 82010 key(s) of ResourceConfigurationUpdate
11:32:37,641 DEBUG [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-1) Deleting 1000 row(s) of ResourceConfigurationUpdate
(RHQScheduler_Worker-1) Deleting 1000 row(s) of ResourceConfigurationUpdate
...
11:52:29,520 DEBUG [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-1) Deleting 1000 row(s) of ResourceConfigurationUpdate

11:52:47,017 ERROR [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-1) ResourceConfigurationUpdate: could not fully process the batched purge: java.sql.BatchUpdateException: ORA-00060: deadlock detected while waiting for resource

It did exit with an ERROR but I think this is something that will be ok for next run.

It seems to be unable to delete some of the old 2012 entries but I will let it run a couple more times before maybe deleting some manually.

> I disagree. Look at your logs: it's already 20 minutes when the transaction is rollbacked.

Well, if the transaction runs for 20 minutes and gets rolled back then stuff should be done in smaller batches or timeouts increased. Smaller batches is always better anyway I think.

Anyway this shouldn't be something that affects a lot of users and since RHQ now has the ability to purge this data it should be ok for most.

Interesting to note is also that the BZ I reported on UnsupportedOperationException seems to be gone now that the purge is working?
https://bugzilla.redhat.com/show_bug.cgi?id=1174841

Comment 11 Thomas Segismont 2015-01-05 13:08:24 UTC

Closed NOTABUG : batch size configuration needed

Comment 12 Thomas Segismont 2015-01-05 13:10:12 UTC

(In reply to Stian Lund from comment #10)
> Interesting to note is also that the BZ I reported on
> UnsupportedOperationException seems to be gone now that the purge is working?
> https://bugzilla.redhat.com/show_bug.cgi?id=1174841

I doubt purge has anything to do with the problem described in Bug 1174841 (there's no connection with inventory report process)

Comment 13 Stian Lund 2015-01-14 14:50:51 UTC

Just updating a bit; I had ~68k Recent Operations in one Test-installation since I have been trying to repro a bug with RHQ Agent restarts.

The same timeout occurred there when trying to delete, so I still suggest the batch should be set smaller by default in RHQ/JON. Otherwise users have to set this value for every RHQ server installation.

Note You need to log in before you can comment on or make changes to this bug.