Bug 1579733

Summary: JBoss ON server unable to purge unused alert definitions due to the transaction timeout
Product: [JBoss] JBoss Operations Network
Reporter: bkramer <bkramer>
Component: Database, Core Server
Assignee: Michael Burman <miburman>
Status: CLOSED ERRATA
QA Contact: Filip Brychta <fbrychta>
Severity: urgent
Priority: high
Version: JON 3.3.10
CC: fbrychta, loleary, miburman, spinder
Target Milestone: CR02
Keywords: Triaged
Target Release: JON 3.3.11
Hardware: Unspecified
OS: Unspecified
Last Closed: 2018-10-16 17:07:04 UTC
Type: Bug
Attachments:
- complete server log
- content of rhq_alert_condition table
- content of rhq_alert_definition table

Description bkramer 2018-05-18 08:27:33 UTC
Description of problem:
A large number of unused alert definitions (8000+) could not be purged by the Data Purge Job that runs every hour. Initially, a large number of alerts were triggered frequently, so the transaction used to remove alert definitions would time out (this was the case even after the transaction timeout was increased to 5 hours). So, in order to remove the unused alert definitions, the following steps were performed:

1. Disabled all currently enabled alerts;
2. Configured the JBoss ON server to start in maintenance mode (in order to prevent JBoss ON agents from connecting);
3. Emptied the HornetQ messaging folders ($JON_SERVER_HOME/jbossas/standalone/data/messagingbindings...);
4. Removed the leftover transaction data by emptying $JON_SERVER_HOME/jbossas/standalone/data/tx-object-store;
5. Changed the transaction timeout to 10 hours;
6. Started JBoss ON server;

The server.log file showed that 3 hours after the Alert definition unused purge method was started, the server began to show timeout issues: management plug-in requests were timing out, along with many EJB timer timeouts and many Quartz timer timeouts. Finally, the server.log file showed java.lang.OutOfMemoryError: GC overhead limit exceeded errors.


Version-Release number of selected component (if applicable):
JBoss ON 3.3.10

How reproducible:
Sometimes

Steps to Reproduce:

The following are the steps that I performed in order to get the transaction timeout when trying to remove unused alert definitions. However, I couldn't reproduce the GC overhead limit exceeded errors that appear in the customer's log file.

1. I created a large number of alert definitions in my JBoss ON server using REST API:

****************************************************
while true; do
  curl -s -L -H 'Content-Type: application/json' -X POST \
    'http://bkramer.usersys.redhat.com:7080/rest/alert/definitions?resourceId=18306' \
    -d '{"name":"restPlatformMeas","enabled":true,"priority":"MEDIUM","conditionMode":"ANY","conditions":[{"category":"CHANGE","id":37824,"threshold":null,"option":null,"triggerId":null,"comparator":null,"measurementDefinition":10019,"name":"Actual Free Memory"}],"notifications":[{"id":10041,"senderName":"Direct Emails","config":{"emailAddress":"bkramer"},"extraConfig":{}}],"dampeningCategory":"NONE"}' \
    -u rhqadmin:rhqadmin
  sleep 1s
done
****************************************************

2. The metric "Actual Free Memory" used above is set to be collected every 30 seconds (very aggressive);

3. Once the number of created alert definitions is greater than 8000 and the number of messages in the AlertConditionQueue is > 400k, continue to create new alert definitions, but delete all alert definitions that were previously created (using the JBoss ON CLI [1]) and also remove all previously triggered alerts (from the alert history page in the JBoss ON UI).

4. Execute the following SQL to make sure that the number of unused alert definitions is greater than 8000:

****************************************************
select count(alertdefin0_.ID) as col_0_0_
from RHQ_ALERT_DEFINITION alertdefin0_
where alertdefin0_.DELETED = true
  and alertdefin0_.ID not in (
    select alertdefin2_.ID
    from RHQ_ALERT alert1_
    inner join RHQ_ALERT_DEFINITION alertdefin2_
      on alert1_.ALERT_DEFINITION_ID = alertdefin2_.ID
  )
****************************************************

5. Wait for the DataPurgeJob and the Alert definition unused purge method to start;


Actual results:
In my environment, removal of the alert definitions failed due to the transaction timeout. In the customer's environment, deletion of the alert definitions fails with java.lang.OutOfMemoryError: GC overhead limit exceeded errors, even though their JBoss ON server had 10 GB allocated.


Expected results:
Unused alert definitions are properly deleted and no exceptions are triggered. 


Additional info:

[1] 
****************************************************
var criteria = new AlertDefinitionCriteria();

//Change the number of entries returned by the search command to 10000 instead of the default 200 - by using the setPaging method
criteria.setPaging(0,10000);
criteria.addFilterName("restPlatformMeas");

var alertdefs = AlertDefinitionManager.findAlertDefinitionsByCriteria(criteria);

if ( alertdefs != null ) {
  if ( alertdefs.size() > 0 ) {
     for ( var i = 0; i < alertdefs.size(); ++i ) {
         var alert = alertdefs.get(i);
         println("alert: " + alert);
         AlertDefinitionManager.removeAlertDefinitions([alert.id]);
     }
  }
}
****************************************************

Comment 1 bkramer 2018-05-18 15:22:57 UTC
One more thing - the OOME started to happen 3.5 hours after the alert definition purge method was started (and that was with the transaction timeout set to 10 hours). After this, the memory allocated to the JBoss ON server was increased to 20 GB, but the purge still failed because the transaction timeout returned.

Comment 2 Michael Burman 2018-05-24 18:28:55 UTC
This is because the alert definition purge does not actually use the purge mechanism we already have in place. That's why there's no protection against large deletes.

This applies to all alert* processes (except purgeAlertData), orphaned drift removals, and removeStaleAvailabilityResourceErrors + removeResourceErrorDuplicates.

They need to be migrated (well, rebuilt) to use the purgeJobs.
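
The PurgeTemplate framework mentioned here deletes rows in bounded batches, each batch in its own transaction, so no single transaction has to cover the whole table. As a rough sketch of that idea in Postgres SQL (the batch size of 1000 is an arbitrary illustration, not the framework's actual value):

****************************************************
-- One bounded batch; repeat until no rows are deleted.
DELETE FROM rhq_alert_definition
WHERE id IN (
    SELECT d.id
    FROM rhq_alert_definition d
    WHERE d.deleted = true
      AND NOT EXISTS (SELECT 1 FROM rhq_alert a
                      WHERE a.alert_definition_id = d.id)
    LIMIT 1000
);
****************************************************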

Comment 3 Michael Burman 2018-05-25 20:42:14 UTC
As a workaround, indexes could be added to rhq_alert_definition and rhq_alert, as the delete query currently does two full table scans in Postgres. But that would only help with the transaction timeout. The OOM is a weird one, as this code does not really do that many allocations (and most of them should be short-lived to allow GC to clean them up easily), making me wonder if that's an unrelated bug.
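
For illustration, the workaround indexes could look like the following (the index names here are made up; the columns are the ones the purge query filters and joins on):

****************************************************
-- Hypothetical index names, for illustration only.
CREATE INDEX rhq_alert_alert_def_id_idx ON rhq_alert (alert_definition_id);
CREATE INDEX rhq_alert_def_deleted_idx ON rhq_alert_definition (deleted);
****************************************************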

In the master (I did not have Oracle to test this with - it requires different SQL syntax):

commit f8e0daea1f9d7ebb7aedd7075ea361aa1ccffd50 (HEAD -> master)
Author: Michael Burman <miburman>
Date:   Fri May 25 23:33:06 2018 +0300

    [BZ 1579733] Purge of orphaned alert definitions should use the PurgeTemplate framework to reduce the possibility of transaction timeouts

Comment 8 Filip Brychta 2018-09-19 11:02:35 UTC
The following problem was found on JON 3.3.11.ER01:
The purge job failed with:
05:30:00,056 ERROR [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-1) AlertDefinition: could not fully process the batched purge: java.sql.BatchUpdateException: Batch entry 0 DELETE FROM rhq_alert_definition WHERE id = 10041 was aborted.

The following was visible in the postgres log:

ERROR:  update or delete on table "rhq_alert_definition" violates foreign key constraint "rhq_alert_condition_alert_definition_id_fkey" on table "rhq_alert_condition"
DETAIL:  Key (id)=(10041) is still referenced from table "rhq_alert_condition".
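
Rows in this state can be listed with a query along these lines (a sketch, using the column names shown in the constraint above):

****************************************************
-- Deleted alert definitions that still have linked conditions
-- and therefore cannot be purged without violating the FK.
SELECT d.id, count(c.id) AS condition_count
FROM rhq_alert_definition d
JOIN rhq_alert_condition c ON c.alert_definition_id = d.id
WHERE d.deleted = true
GROUP BY d.id;
****************************************************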

Attaching the content of the rhq_alert_condition and rhq_alert_definition tables and the complete server.log.

Comment 9 Filip Brychta 2018-09-19 11:03:14 UTC
Created attachment 1484655 [details]
complete server log

Comment 10 Filip Brychta 2018-09-19 11:03:42 UTC
Created attachment 1484656 [details]
content of rhq_alert_condition table

Comment 11 Filip Brychta 2018-09-19 11:04:14 UTC
Created attachment 1484657 [details]
content of rhq_alert_definition table

Comment 13 Michael Burman 2018-09-21 08:12:28 UTC
There's an additional missing feature / bug in this case. The alert conditions are not unlinked, so they keep the deleted alert definitions locked -> we can't purge them.
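
In SQL terms, the effect of the fix is roughly the following (a sketch of the intent, not the actual implementation; notification links would need the same treatment):

****************************************************
-- Remove condition links for definitions already flagged as deleted,
-- so the definitions themselves become purgeable.
DELETE FROM rhq_alert_condition
WHERE alert_definition_id IN (
    SELECT id FROM rhq_alert_definition WHERE deleted = true
);
****************************************************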

In the master, the commit that fixes this behavior:

commit dcff72ae8395a016b8f155ebd23f289cdc4b0063 (HEAD -> master)
Author: Michael Burman <miburman>
Date:   Fri Sep 21 11:11:28 2018 +0300

    [BZ 1579733] If AlertDefinition is set to deleted, remove links to all notifications and conditions also

Comment 14 Michael Burman 2018-09-21 12:17:01 UTC
This commit will ignore the broken rows to prevent the FK violation:

commit 5785ffcf35ebaa5ada7c3eaf453f2f5c8a0da90e (HEAD -> master)
Author: Michael Burman <miburman>
Date:   Fri Sep 21 15:15:43 2018 +0300

    [BZ 1579733] Ignore rows from alert_definition table that still have existing links in alert_condition

We still need to remove those broken rows left behind by older versions... so one more commit is coming (another feature).

Comment 15 Michael Burman 2018-09-24 20:26:43 UTC
Final part in the master:

commit dbd695fb44217a671546a6bfb3056e5f257fe08c (HEAD -> master)
Author: Michael Burman <miburman>
Date:   Mon Sep 24 23:25:31 2018 +0300

    [BZ 159733] Unlink deleted definitions from the conditions before purging conditions and definitions

Comment 29 errata-xmlrpc 2018-10-16 17:07:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2930