Description of problem:

The JBoss ON server becomes unstable and stops responding after executing a data purge job that should have purged over 10 million event entries from the RHQ_EVENT table.

Version-Release number of selected component (if applicable): 3.3.0

How reproducible: Always

Steps to Reproduce:

1. Install, configure, and start a JBoss ON 3.3.0 system.

2. Insert 10 million (10,000,000) events into the rhq_event table. This can be done using the following bash commands. Replace the values assigned to pgDbName, pgDbHost, pgDbPort and pgDbUser with values appropriate for your test environment.

pgDbName=rhq
pgDbHost=172.17.0.2
pgDbPort=5432
pgDbUser=rhqadmin
startingEventId=99999
eventSourceId=99999
startingEventTimestamp=1435712400000
resourceId=10001
resourceTypeName='JBossAS7 Standalone Server'
numOfEvents=10000000
counter=0

echo "DELETE FROM rhq_event WHERE event_source_id = ${eventSourceId};
DELETE FROM rhq_event_source WHERE id = ${eventSourceId};
INSERT INTO rhq_event_source (id,event_def_id,resource_id,location) VALUES (${eventSourceId},(SELECT id FROM rhq_event_def WHERE resource_type_id = (SELECT id FROM rhq_resource_type WHERE name = '${resourceTypeName}')),${resourceId},'/tmp/non/existent/event/source/no.log');
COPY rhq_event (id, event_source_id, \"timestamp\", severity, detail) FROM stdin;" > static-events.sql

until [ $counter -ge $numOfEvents ]; do
    echo "$(($startingEventId + $counter))"$'\t'"$eventSourceId"$'\t'"$(($startingEventTimestamp + $counter))"$'\t'"INFO"$'\t'"[static event message] A static event detail" >> static-events.sql
    let counter+=1
done

echo '\.' >> static-events.sql

psql -d $pgDbName -f static-events.sql -h $pgDbHost -p $pgDbPort -U $pgDbUser

3. Force the JBoss ON server to run its data purge jobs using the URL:

http://<JON_SERVER>:7080/portal/admin/test/control.jsp?mode=dataPurgeJob

After the URL has been invoked, the JBoss ON server log should include the following output:

Event data purge log messages found in server.log:

22:32:08,612 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-5) Event data purge starting at Wed Aug 12 22:32:08 UTC 2015
22:32:08,612 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-5) Purging event data older than Wed Jul 29 22:32:08 UTC 2015

Actual results:

The event data purge job fails with an OutOfMemoryError (Java heap space):

22:32:20,669 ERROR [org.jboss.as.ejb3.invocation] (RHQScheduler_Worker-5) JBAS014134: EJB Invocation failed on component PurgeManagerBean for method public abstract int org.rhq.enterprise.server.purge.PurgeManagerLocal.purgeEventData(long): javax.ejb.EJBException: java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    at org.jboss.as.ejb3.tx.BMTInterceptor.handleException(BMTInterceptor.java:80) [jboss-as-ejb3-7.4.0.Final-redhat-19.jar:7.4.0.Final-redhat-19]
    at org.jboss.as.ejb3.tx.EjbBMTInterceptor.checkStatelessDone(EjbBMTInterceptor.java:92) [jboss-as-ejb3-7.4.0.Final-redhat-19.jar:7.4.0.Final-redhat-19]
    ...
    at org.rhq.enterprise.server.purge.PurgeManagerLocal$$$view157.purgeEventData(Unknown Source) [rhq-server.jar:4.12.0.JON330GA]
    at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.purgeEventData(DataPurgeJob.java:210) [rhq-server.jar:4.12.0.JON330GA]
    at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.purgeEverything(DataPurgeJob.java:97) [rhq-server.jar:4.12.0.JON330GA]
    at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.executeJobCode(DataPurgeJob.java:84) [rhq-server.jar:4.12.0.JON330GA]
    at org.rhq.enterprise.server.scheduler.jobs.AbstractStatefulJob.execute(AbstractStatefulJob.java:48) [rhq-server.jar:4.12.0.JON330GA]
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [quartz-1.6.5.jar:1.6.5]
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525) [quartz-1.6.5.jar:1.6.5]
Caused by: java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    ... 37 more
Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.io.ObjectOutputStream$HandleTable.growEntries(ObjectOutputStream.java:2350) [rt.jar:1.7.0_65]
    at java.io.ObjectOutputStream$HandleTable.assign(ObjectOutputStream.java:2275) [rt.jar:1.7.0_65]
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1427) [rt.jar:1.7.0_65]
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) [rt.jar:1.7.0_65]
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) [rt.jar:1.7.0_65]
    at org.rhq.enterprise.server.purge.PurgeTemplate.loadKeys(PurgeTemplate.java:152) [rhq-server.jar:4.12.0.JON330GA]
    at org.rhq.enterprise.server.purge.PurgeTemplate.execute(PurgeTemplate.java:93) [rhq-server.jar:4.12.0.JON330GA]
    at org.rhq.enterprise.server.purge.PurgeManagerBean.purgeEventData(PurgeManagerBean.java:79) [rhq-server.jar:4.12.0.JON330GA]
    ...

Expected results:

No error; the data should have been purged.

Additional info:

After the OOME occurs, the server is unstable and usually a kill -9 must be performed to get it to shut down.

Also, the OOME is not always a Java heap space error; in some cases it is a GC overhead limit exceeded error. However, the offending thread and stack, as seen in a heap dump, is:

RHQScheduler_Worker-2
    at org.postgresql.core.PGStream.ReceiveTupleV3()[[B (PGStream.java:359)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(Lorg/postgresql/core/ResultHandler;I)V (QueryExecutorImpl.java:1854)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(Lorg/postgresql/core/Query;Lorg/postgresql/core/ParameterList;Lorg/postgresql/core/ResultHandler;III)V (QueryExecutorImpl.java:255)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(Lorg/postgresql/core/Query;Lorg/postgresql/core/ParameterList;I)V (AbstractJdbc2Statement.java:555)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(I)Z (AbstractJdbc2Statement.java:417)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery()Ljava/sql/ResultSet; (AbstractJdbc2Statement.java:302)
    ...
    at org.jboss.jca.adapters.jdbc.WrappedPreparedStatement.executeQuery()Ljava/sql/ResultSet; (WrappedPreparedStatement.java:462)
    at org.rhq.enterprise.server.purge.PurgeTemplate.loadKeys()Lorg/rhq/enterprise/server/purge/PurgeTemplate$KeysInfo; (PurgeTemplate.java:149)
    at org.rhq.enterprise.server.purge.PurgeTemplate.execute()I (PurgeTemplate.java:93)
    at org.rhq.enterprise.server.purge.PurgeManagerBean.purgeEventData(J)I (PurgeManagerBean.java:79)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Native Method)
    ...
    at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object; (ProxyInvocationHandler.java:73)
    at org.rhq.enterprise.server.purge.PurgeManagerLocal$$$view157.purgeEventData(J)I (Unknown Source)
    at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.purgeEventData(Lorg/rhq/enterprise/server/purge/PurgeManagerLocal;Ljava/util/Properties;)V (DataPurgeJob.java:210)
    at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.purgeEverything(Ljava/util/Properties;)V (DataPurgeJob.java:97)
    at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.executeJobCode(Lorg/quartz/JobExecutionContext;)V (DataPurgeJob.java:84)
    at org.rhq.enterprise.server.scheduler.jobs.AbstractStatefulJob.execute(Lorg/quartz/JobExecutionContext;)V (AbstractStatefulJob.java:48)
    at org.quartz.core.JobRunShell.run()V (JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run()V (SimpleThreadPool.java:525)
This happens because PurgeTemplate.loadKeys() serializes every key with ObjectOutputStream.writeObject(..), and the stream keeps a strong reference to each written object in its internal handle table (so it can later encode back-references to shared objects). With millions of keys, that table eventually consumes the entire heap. This can be fixed by calling ObjectOutputStream.reset(), which clears the handle table, but doing so on every write carries a large performance penalty. A solution could be to run reset() every X invocations; we just have to find a value that still gives acceptable performance.
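For illustration, here is a minimal standalone reproducer of the leak (hypothetical code, not from JBoss ON). Run it with a small heap such as -Xmx64m and it fails with the same ObjectOutputStream$HandleTable frames at the top of the stack; uncomment the reset() call and it completes:

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;

public class HandleTableLeak {
    public static void main(String[] args) throws IOException {
        // Discard the serialized bytes; only the stream's internal state matters here.
        OutputStream sink = new OutputStream() {
            @Override public void write(int b) {}
        };
        try (ObjectOutputStream oos = new ObjectOutputStream(sink)) {
            for (long i = 0; i < 10_000_000L; i++) {
                // Each writeObject() call adds an entry to the stream's handle
                // table, and those entries are never released until reset().
                oos.writeObject(Long.valueOf(i));
                // Uncommenting this keeps the handle table (and heap) bounded:
                // if (i % 30_000 == 0) { oos.reset(); }
            }
        }
    }
}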
That's what I figured. Would it be possible and safe to have the reset call performed based on object size and available heap? For example, when oos reaches 30% of available heap?
I wouldn't go that far: the purge job could take a long time, and meanwhile something else might want heap as well, so we'd be back at the OOM. Besides, tracking the Java heap is a chasing game with the GC. I think I'll try whether using the configured batch_size is suitable; by default that would mean a reset() every 30,000 events. It's a configurable value for those who need more performance, and it's also used in the actual execution phase (30,000 rows to the DB in one transaction).
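As a rough sketch of that batch_size-based approach (illustrative only, not the actual PurgeTemplate patch; the method name and the idea of spooling keys straight from a ResultSet are assumptions based on the stack traces above):

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BatchedKeySpooler {

    // Mirrors the purge job's configurable batch size (30,000 by default).
    private static final int BATCH_SIZE = 30_000;

    /**
     * Streams primary keys from the query result to the given stream,
     * calling reset() once per BATCH_SIZE writes so the stream's handle
     * table, and therefore heap usage, stays bounded.
     */
    static int spoolKeys(ResultSet resultSet, ObjectOutputStream keysOut)
            throws SQLException, IOException {
        int count = 0;
        while (resultSet.next()) {
            keysOut.writeObject(resultSet.getLong(1));
            if (++count % BATCH_SIZE == 0) {
                keysOut.reset();
            }
        }
        return count;
    }
}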
The control.jsp does not work if JON / RHQ is run with Java 8, so hopefully this isn't something we recommend to customers.
Fixed in master:

commit 338b11ad458808557b1c39a9f89cd003790a194b
Author: Michael Burman <miburman>
Date:   Wed Nov 4 14:20:12 2015 +0200

    [BZ 1255196] reset objectStreamOutput every BATCH_SIZE object to prevent heap overload
Merged to release/jon3.3.x:

commit 51414e9d23b97cf73ea9482ce7a0e6b0ff75a641
Author: Michael Burman <miburman>
Date:   Wed Nov 4 14:20:12 2015 +0200

    [BZ 1255196] reset objectStreamOutput every BATCH_SIZE object to prevent heap overload

    (cherry picked from commit 338b11ad458808557b1c39a9f89cd003790a194b)
Moving to ON_QA as available to test with the following brew build:

JON cumulative patch build: https://brewweb.devel.redhat.com/buildinfo?buildID=469635

*Note: jon-server-patch-3.3.0.GA.zip maps to the DR01 build of jon-server-3.3.0.GA-update-05.zip.
Verified on:

Version: 3.3.0.GA Update 05
Build Number: bb9d2ba:95ae59f

Followed the steps and verified that the event data (10 million events) is purged without error. The server log shows:

07:30:00,099 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-5) Purging event data older than Fri Nov 27 07:30:00 EST 2015
07:37:41,266 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-5) Event data purged [10000000] - completed in [461168]ms
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-0118.html