Bug 1255196 - Event data purge job results in OutOfMemoryError when there are over 10 million events to be purged
Event data purge job results in OutOfMemoryError when there are over 10 milli...
Status: CLOSED ERRATA
Product: JBoss Operations Network
Classification: JBoss
Component: Database (Show other bugs)
JON 3.3.0
Unspecified Unspecified
medium Severity high
: ER01
: JON 3.3.5
Assigned To: Michael Burman
Sunil Kondkar
: Triaged
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2015-08-19 18:52 EDT by Larry O'Leary
Modified: 2016-02-03 10:02 EST (History)
2 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-03 10:02:53 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 1583243 None None None Never

  None (edit)
Description Larry O'Leary 2015-08-19 18:52:25 EDT
Description of problem:
The JBoss ON server becomes unstable and will stop responding after executing a data purge job that should have purged over 10 million event entries from the RHQ_EVENT table.

Version-Release number of selected component (if applicable):
3.3.0

How reproducible:
Always

Steps to Reproduce:
1.  Install, configure and start JBoss ON 3.3.0 system.
2.  Insert 10 million (10,000,000) events into the rhq_event table.

    This can be done using the following BASH commands. You will need to replace the values assigned to pgDbName, pgDbHost, pgDbPort and pgDbUser with appropriate values for your test environment.

    pgDbName=rhq
    pgDbHost=172.17.0.2
    pgDbPort=5432
    pgDbUser=rhqadmin
 
    startingEventId=99999
    eventSourceId=99999
    startingEventTimestamp=1435712400000
    resourceId=10001
    resourceTypeName='JBossAS7 Standalone Server'
    numOfEvents=10000000
    counter=0
 
    echo "DELETE FROM rhq_event WHERE event_source_id = ${eventSourceId};
    DELETE FROM rhq_event_source WHERE id = ${eventSourceId};
    INSERT INTO rhq_event_source (id,event_def_id,resource_id,location) VALUES (${eventSourceId},(SELECT id FROM rhq_event_def WHERE resource_type_id = (SELECT id FROM rhq_resource_type WHERE name = '${resourceTypeName}')),${resourceId},'/tmp/non/existent/event/source/no.log');
     
    COPY rhq_event (id, event_source_id, \"timestamp\", severity, detail) FROM stdin;" > static-events.sql
     
    until [ $counter -ge $numOfEvents ]; do
        echo "$(($startingEventId + $counter))"$'\t'"$eventSourceId"$'\t'"$(($startingEventTimestamp + $counter))"$'\t'"INFO"$'\t'"[static event message] A static event detail" >> static-events.sql
        let counter+=1
    done
    echo '\.' >> static-events.sql
     
    psql -d $pgDbName -f static-events.sql -h $pgDbHost -p $pgDbPort -U $pgDbUser

3.  Force the JBoss ON server to run its data purge jobs using the URL http://<JON_SERVER>:7080/portal/admin/test/control.jsp?mode=dataPurgeJob

    After the URL has been invoked, the JBoss ON server log should include the following output:

    Event data purge log messages found in server.log
    22:32:08,612 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-5) Event data purge starting at Wed Aug 12 22:32:08 UTC 2015
    22:32:08,612 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-5) Purging event data older than Wed Jul 29 22:32:08 UTC 2015

Actual results:
The event data purge job will fail due to OutOfMemoryError: Java heap space:

22:32:20,669 ERROR [org.jboss.as.ejb3.invocation] (RHQScheduler_Worker-5) JBAS014134: EJB Invocation failed on component PurgeManagerBean for method public abstract int org.rhq.enterprise.server.purge.PurgeManagerLocal.purgeEventData(long): javax.ejb.EJBException: java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
	at org.jboss.as.ejb3.tx.BMTInterceptor.handleException(BMTInterceptor.java:80) [jboss-as-ejb3-7.4.0.Final-redhat-19.jar:7.4.0.Final-redhat-19]
	at org.jboss.as.ejb3.tx.EjbBMTInterceptor.checkStatelessDone(EjbBMTInterceptor.java:92) [jboss-as-ejb3-7.4.0.Final-redhat-19.jar:7.4.0.Final-redhat-19]
    ...
	at org.rhq.enterprise.server.purge.PurgeManagerLocal$$$view157.purgeEventData(Unknown Source) [rhq-server.jar:4.12.0.JON330GA]
	at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.purgeEventData(DataPurgeJob.java:210) [rhq-server.jar:4.12.0.JON330GA]
	at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.purgeEverything(DataPurgeJob.java:97) [rhq-server.jar:4.12.0.JON330GA]
	at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.executeJobCode(DataPurgeJob.java:84) [rhq-server.jar:4.12.0.JON330GA]
	at org.rhq.enterprise.server.scheduler.jobs.AbstractStatefulJob.execute(AbstractStatefulJob.java:48) [rhq-server.jar:4.12.0.JON330GA]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [quartz-1.6.5.jar:1.6.5]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525) [quartz-1.6.5.jar:1.6.5]
Caused by: java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
	... 37 more
Caused by: java.lang.OutOfMemoryError: Java heap space
	at java.io.ObjectOutputStream$HandleTable.growEntries(ObjectOutputStream.java:2350) [rt.jar:1.7.0_65]
	at java.io.ObjectOutputStream$HandleTable.assign(ObjectOutputStream.java:2275) [rt.jar:1.7.0_65]
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1427) [rt.jar:1.7.0_65]
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) [rt.jar:1.7.0_65]
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) [rt.jar:1.7.0_65]
	at org.rhq.enterprise.server.purge.PurgeTemplate.loadKeys(PurgeTemplate.java:152) [rhq-server.jar:4.12.0.JON330GA]
	at org.rhq.enterprise.server.purge.PurgeTemplate.execute(PurgeTemplate.java:93) [rhq-server.jar:4.12.0.JON330GA]
	at org.rhq.enterprise.server.purge.PurgeManagerBean.purgeEventData(PurgeManagerBean.java:79) [rhq-server.jar:4.12.0.JON330GA]
    ...


Expected results:
No error and data should have been purged.

Additional info:
After the OOME occurs, the server is unstable and usually a kill -9 must be performed to get it to shutdown. 

Also, the OOME is not always a Java heap space error. In some cases it is a GC overhead limit exceeded error. However, the offending thread and stack, as seen in a heap dump, is:

RHQScheduler_Worker-2
  at org.postgresql.core.PGStream.ReceiveTupleV3()[[B (PGStream.java:359)
  at org.postgresql.core.v3.QueryExecutorImpl.processResults(Lorg/postgresql/core/ResultHandler;I)V (QueryExecutorImpl.java:1854)
  at org.postgresql.core.v3.QueryExecutorImpl.execute(Lorg/postgresql/core/Query;Lorg/postgresql/core/ParameterList;Lorg/postgresql/core/ResultHandler;III)V (QueryExecutorImpl.java:255)
  at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(Lorg/postgresql/core/Query;Lorg/postgresql/core/ParameterList;I)V (AbstractJdbc2Statement.java:555)
  at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(I)Z (AbstractJdbc2Statement.java:417)
  at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery()Ljava/sql/ResultSet; (AbstractJdbc2Statement.java:302)
  ...
  at org.jboss.jca.adapters.jdbc.WrappedPreparedStatement.executeQuery()Ljava/sql/ResultSet; (WrappedPreparedStatement.java:462)
  at org.rhq.enterprise.server.purge.PurgeTemplate.loadKeys()Lorg/rhq/enterprise/server/purge/PurgeTemplate$KeysInfo; (PurgeTemplate.java:149)
  at org.rhq.enterprise.server.purge.PurgeTemplate.execute()I (PurgeTemplate.java:93)
  at org.rhq.enterprise.server.purge.PurgeManagerBean.purgeEventData(J)I (PurgeManagerBean.java:79)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Native Method)
  ...
  at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object; (ProxyInvocationHandler.java:73)
  at org.rhq.enterprise.server.purge.PurgeManagerLocal$$$view157.purgeEventData(J)I (Unknown Source)
  at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.purgeEventData(Lorg/rhq/enterprise/server/purge/PurgeManagerLocal;Ljava/util/Properties;)V (DataPurgeJob.java:210)
  at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.purgeEverything(Ljava/util/Properties;)V (DataPurgeJob.java:97)
  at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.executeJobCode(Lorg/quartz/JobExecutionContext;)V (DataPurgeJob.java:84)
  at org.rhq.enterprise.server.scheduler.jobs.AbstractStatefulJob.execute(Lorg/quartz/JobExecutionContext;)V (AbstractStatefulJob.java:48)
  at org.quartz.core.JobRunShell.run()V (JobRunShell.java:202)
  at org.quartz.simpl.SimpleThreadPool$WorkerThread.run()V (SimpleThreadPool.java:525)
Comment 2 Michael Burman 2015-11-03 11:01:12 EST
This happens because PurgeTemplate.loadKeys() calls writeObject(..) which stores a reference in the heap. After a while, the heap is consumed.

This can be fixed by calling ObjectOutputStream.reset() method, but there's a performance penalty (a large one). A solution could be to run the reset() on every X-invocations, we'll just have to find a suitable number here that still gives acceptable performance.
Comment 3 Larry O'Leary 2015-11-03 11:43:47 EST
That's what I figured.

Would it be possible and safe to have the reset call performed based on object size and available heap? For example, when oos reaches 30% of available heap?
Comment 4 Michael Burman 2015-11-03 14:13:21 EST
I wouldn't go that far as the purge job could take a long time and meanwhile we might do something else that wants heap as well -> we'd be back to the OOM. And following java heap is a chasing game with the GC.

I think I'll try and see if using the configured batch_size is suitable as by default it would be every 30 000 events. It's configurable value for those who need more performance and it's used in the actual execution phase also (30 000 to the DB in one transaction).
Comment 5 Michael Burman 2015-11-04 06:56:06 EST
The control.jsp does not work if JON / RHQ is run with Java 8.. so hopefully this isn't something we recommend to customers.
Comment 6 Michael Burman 2015-11-04 07:22:35 EST
Fixed in the master:

commit 338b11ad458808557b1c39a9f89cd003790a194b
Author: Michael Burman <miburman@redhat.com>
Date:   Wed Nov 4 14:20:12 2015 +0200

    [BZ 1255196] reset objectStreamOutput every BATCH_SIZE object to prevent heap overload
Comment 7 Michael Burman 2015-11-18 10:48:48 EST
Merged to release/jon3.3.x:

commit 51414e9d23b97cf73ea9482ce7a0e6b0ff75a641
Author: Michael Burman <miburman@redhat.com>
Date:   Wed Nov 4 14:20:12 2015 +0200

    [BZ 1255196] reset objectStreamOutput every BATCH_SIZE object to prevent heap overload
    
    (cherry picked from commit 338b11ad458808557b1c39a9f89cd003790a194b)
Comment 8 Simeon Pinder 2015-12-09 01:29:13 EST
Moving to ON_QA as available to test with the following brew build:

JON Cumulative patch build: https://brewweb.devel.redhat.com/buildinfo?buildID=469635
  *Note: jon-server-patch-3.3.0.GA.zip maps to DR01 build of jon-server-3.3.0.GA-update-05.zip.
Comment 9 Sunil Kondkar 2015-12-11 09:08:35 EST
Verified on Version :	
3.3.0.GA Update 05
Build Number :	
bb9d2ba:95ae59f

Followed the steps and verified that the event data(10 million events) is purged without error.

Below is in server log:

07:30:00,099 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-5) Purging event data older than Fri Nov 27 07:30:00 EST 2015
07:37:41,266 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-5) Event data purged [10000000] - completed in [461168]ms
Comment 11 errata-xmlrpc 2016-02-03 10:02:53 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-0118.html

Note You need to log in before you can comment on or make changes to this bug.