Description of problem:
A large number of unused alert definitions (8000+) cannot be purged by the Data Purge Job that runs every hour. Initially, a large number of alerts were triggered frequently, so the transaction used to remove alert definitions would time out (this was the case even after the transaction timeout was raised to 5 hours). So, in order to remove the unused alert definitions, the following steps were performed:

1. Disabled all currently enabled alerts.
2. Started the JBoss ON server in maintenance mode (in order to prevent JBoss ON agents from connecting).
3. Emptied the HornetQ messaging folders ($JON_SERVER_HOME/jbossas/standalone/data/messagingbindings...).
4. Removed the leftover transaction data by emptying $JON_SERVER_HOME/jbossas/standalone/data/tx-object-store.
5. Changed the transaction timeout to 10 hours.
6. Started the JBoss ON server.

The server.log file showed that, 3 hours after the "Alert definition unused purge" method started, the server began reporting timeout issues: management plug-in requests were timing out, along with many EJB timer timeouts and many Quartz timer timeouts. Finally, the server.log file shows java.lang.OutOfMemoryError: GC overhead limit exceeded errors.

Version-Release number of selected component (if applicable):
JBoss ON 3.3.10

How reproducible:
Sometimes

Steps to Reproduce:
The following are the steps that I performed in order to get a transaction timeout when trying to remove unused alert definitions. However, I could not reproduce the GC overhead limit exceeded errors that appear in the customer's log file.

1. I created a large number of alert definitions in my JBoss ON server using the REST API:
****************************************************
while true; do
  curl -s -L -H 'Content-Type: application/json' -X POST \
    'http://bkramer.usersys.redhat.com:7080/rest/alert/definitions?resourceId=18306' \
    -d '{"name":"restPlatformMeas","enabled":true,"priority":"MEDIUM","conditionMode":"ANY","conditions":[{"category":"CHANGE","id":37824,"threshold":null,"option":null,"triggerId":null,"comparator":null,"measurementDefinition":10019,"name":"Actual Free Memory"}],"notifications":[{"id":10041,"senderName":"Direct Emails","config":{"emailAddress":"bkramer"},"extraConfig":{}}],"dampeningCategory":"NONE"}' \
    -u rhqadmin:rhqadmin
  sleep 1s
done
****************************************************
2. The metric above, "Actual Free Memory", is enabled to be collected every 30 seconds (very aggressive).
3. Once the number of created alert definitions is greater than 8000 and the number of messages in the AlertConditionQueue is > 400k, continue to create new alert definitions, but delete all alert definitions that were previously created (using the JBoss ON CLI [1]) and also remove all previously triggered alerts (from the alert history page in the JBoss ON UI).
4. Execute the following SQL to make sure that the number of unused alert definitions is > 8000:
****************************************************
select count(alertdefin0_.ID) as col_0_0_
from RHQ_ALERT_DEFINITION alertdefin0_
where alertdefin0_.DELETED = true
  and (alertdefin0_.ID not in
       (select alertdefin2_.ID
        from RHQ_ALERT alert1_
        inner join RHQ_ALERT_DEFINITION alertdefin2_
          on alert1_.ALERT_DEFINITION_ID = alertdefin2_.ID))
****************************************************
5. Wait for the DataPurgeJob and the "Alert definition unused purge" method to start.

Actual results:
In my environment, removal of the alert definitions failed due to the transaction timeout. In the customer's environment, deletion of the alert definitions fails with java.lang.OutOfMemoryError: GC overhead limit exceeded errors, although their JBoss ON server had 10 GB allocated.

Expected results:
Unused alert definitions are properly deleted and no exceptions are triggered.

Additional info:
[1]
****************************************************
var criteria = new AlertDefinitionCriteria();
// Increase the number of entries returned by the search from the
// default 200 to 10000 by using the setPaging method
criteria.setPaging(0, 10000);
criteria.addFilterName("restPlatformMeas");
var alertdefs = AlertDefinitionManager.findAlertDefinitionsByCriteria(criteria);
if (alertdefs != null && alertdefs.size() > 0) {
    for (i = 0; i < alertdefs.size(); ++i) {
        var alert = alertdefs.get(i);
        println("alert: " + alert);
        AlertDefinitionManager.removeAlertDefinitions([alert.id]);
    }
}
****************************************************
One more thing - the OOME started to happen 3.5 hours after the alert definition purge method was started (and that was with the transaction timeout set to 10 hours). After this, the memory allocated to the JBoss ON server was increased to 20 GB, but the purge still failed, again due to the transaction timeout.
This is because the alert definition purge does not actually use the batched purge mechanism we already have in place, so there is no protection against large deletes. The same applies to all alert* purge processes (except purgeAlertData), orphaned drift removals, and removeStaleAvailabilityResourceErrors + removeResourceErrorDuplicates. They need to be migrated (well, rebuilt) to use the purge jobs.
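For illustration, the batched approach replaces one long-running delete with many small transactions. A minimal sketch of the idea in Postgres SQL (the batch size and the query shape are my assumptions, not the actual PurgeTemplate code):
****************************************************
-- Hypothetical sketch of a batched purge: delete at most 1000 orphaned,
-- deleted definitions per transaction, commit, and repeat until the
-- DELETE affects 0 rows. No single transaction then runs for hours.
DELETE FROM rhq_alert_definition
WHERE id IN (
    SELECT d.id
    FROM rhq_alert_definition d
    WHERE d.deleted = true
      AND NOT EXISTS (SELECT 1 FROM rhq_alert a
                      WHERE a.alert_definition_id = d.id)
    LIMIT 1000
);
-- COMMIT here, then re-run the statement until no rows are deleted.
****************************************************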
As a workaround, indexes could be added to rhq_alert_definition + rhq_alert, since the delete query currently does two full table scans in Postgres. But that would only help with the transaction timeout. The OOM is a weird one, as this code does not really do that many allocations (and most of them should be short-lived, allowing the GC to clean them up easily), which makes me wonder whether that is an unrelated bug.

In the master (I did not have Oracle to test this with - it requires different SQL syntax):

commit f8e0daea1f9d7ebb7aedd7075ea361aa1ccffd50 (HEAD -> master)
Author: Michael Burman <miburman>
Date:   Fri May 25 23:33:06 2018 +0300

    [BZ 1579733] Purge of orphaned alert definitions should use the PurgeTemplate framework to reduce the possibility of transaction timeouts
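A hedged sketch of such workaround indexes (the index names are made up, and the column choices are my assumption based on the delete query shown in the reproduction steps; verify against your schema before applying):
****************************************************
-- Assumed workaround indexes, not a tested recommendation.
-- Lets Postgres find deleted definitions without a full table scan:
CREATE INDEX idx_rhq_alert_def_deleted
    ON rhq_alert_definition (deleted);
-- Lets the NOT IN anti-join probe rhq_alert by definition id:
CREATE INDEX idx_rhq_alert_by_def_id
    ON rhq_alert (alert_definition_id);
****************************************************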
The following problem was found on JON 3.3.11.ER01. The purge job failed with:

05:30:00,056 ERROR [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-1) AlertDefinition: could not fully process the batched purge: java.sql.BatchUpdateException: Batch entry 0 DELETE FROM rhq_alert_definition WHERE id = 10041 was aborted.

The following was visible in the Postgres log:

ERROR: update or delete on table "rhq_alert_definition" violates foreign key constraint "rhq_alert_condition_alert_definition_id_fkey" on table "rhq_alert_condition"
DETAIL: Key (id)=(10041) is still referenced from table "rhq_alert_condition".

Attaching the content of the rhq_alert_condition and rhq_alert_definition tables and the complete server.log.
Created attachment 1484655 [details] complete server log
Created attachment 1484656 [details] content of rhq_alert_condition table
Created attachment 1484657 [details] content of rhq_alert_definition table
There is an additional missing feature / bug in this case: the alert conditions are not unlinked, so they keep the deleted alert definitions locked -> we can't purge them. In the master, this fixes that behavior:

commit dcff72ae8395a016b8f155ebd23f289cdc4b0063 (HEAD -> master)
Author: Michael Burman <miburman>
Date:   Fri Sep 21 11:11:28 2018 +0300

    [BZ 1579733] If AlertDefinition is set to deleted, remove links to all notifications and conditions also
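The unlink step amounts to removing the rows that still reference a deleted definition. A rough SQL sketch of the idea (this assumes the conditions reference the definition via an alert_definition_id column; it is not the actual code from the commit):
****************************************************
-- Hypothetical sketch: detach conditions from definitions that are
-- flagged deleted, so the definitions can later be purged without
-- hitting the rhq_alert_condition FK constraint.
DELETE FROM rhq_alert_condition
WHERE alert_definition_id IN
      (SELECT id FROM rhq_alert_definition WHERE deleted = true);
****************************************************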
This commit will ignore the broken rows to prevent the FK violation:

commit 5785ffcf35ebaa5ada7c3eaf453f2f5c8a0da90e (HEAD -> master)
Author: Michael Burman <miburman>
Date:   Fri Sep 21 15:15:43 2018 +0300

    [BZ 1579733] Ignore rows from alert_definition table that still have existing links in alert_condition

We still need to remove the broken rows left over from older versions, so one more commit is coming (another feature).
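The "ignore broken rows" filter boils down to selecting only definitions with no remaining condition links, so the batched DELETE cannot violate the FK constraint. A hedged sketch of such a selection (my reconstruction, not the actual query from the commit):
****************************************************
-- Hypothetical sketch: pick only deleted definitions that no condition
-- row still points at; rows that still have links are skipped.
SELECT d.id
FROM rhq_alert_definition d
WHERE d.deleted = true
  AND NOT EXISTS (SELECT 1 FROM rhq_alert_condition c
                  WHERE c.alert_definition_id = d.id);
****************************************************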
Final part in the master:

commit dbd695fb44217a671546a6bfb3056e5f257fe08c (HEAD -> master)
Author: Michael Burman <miburman>
Date:   Mon Sep 24 23:25:31 2018 +0300

    [BZ 159733] Unlink deleted definitions from the conditions before purging conditions and definitions
Moving to ON_QA. Available to test with http://download.eng.bos.redhat.com/brewroot/packages/org.jboss.on-jboss-on-parent/3.3.0.GA/186/maven/org/jboss/on/jon-server-patch/3.3.0.GA/jon-server-patch-3.3.0.GA.zip. This maps to the CR02 build of 3.3.11.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2930