Bug 1579733
Summary: | JBoss ON server unable to purge unused alert definitions due to the transaction timeout
---|---
Product: | [JBoss] JBoss Operations Network
Reporter: | bkramer <bkramer>
Component: | Database, Core Server
Assignee: | Michael Burman <miburman>
Status: | CLOSED ERRATA
QA Contact: | Filip Brychta <fbrychta>
Severity: | urgent
Priority: | high
Version: | JON 3.3.10
CC: | fbrychta, loleary, miburman, spinder
Target Milestone: | CR02
Keywords: | Triaged
Target Release: | JON 3.3.11
Hardware: | Unspecified
OS: | Unspecified
Last Closed: | 2018-10-16 17:07:04 UTC
Type: | Bug
Description
bkramer
2018-05-18 08:27:33 UTC
One more thing - the OOME started to happen 3.5 hours after the alert definitions purge method was started (and that's with the transaction timeout set to 10 hours). After this, the memory allocated for the JBoss ON server was increased to 20GB, but the purge still failed because the transaction timeout returned.

This is because the alert definition purge does not actually use the purge mechanism we have in use. That's why there's no protection against large deletes. The same applies to all alert* processes (except purgeAlertData), orphaned drift removals, and removeStaleAvailabilityResourceErrors + removeResourceErrorDuplicates. They need to be migrated (well, rebuilt) to the purgeJobs.

As a workaround, indexes could be added to rhq_alert_definition + rhq_alert, as currently the delete query does two full table scans in Postgres. But that would only help with the transaction timeout. The OOM is a weird one, as this does not really do that many allocations (and most of them should be short-lived, allowing GC to clean them up easily), making me wonder if that's an unrelated bug.

In the master (I did not have Oracle to test this with - it requires different SQL syntax):

commit f8e0daea1f9d7ebb7aedd7075ea361aa1ccffd50 (HEAD -> master)
Author: Michael Burman <miburman>
Date: Fri May 25 23:33:06 2018 +0300

[BZ 1579733] Purge of orphaned alert definitions should use the PurgeTemplate framework to reduce the possibility of transaction timeouts

Following problem found on JON 3.3.11.ER01. Purge job failed with:

05:30:00,056 ERROR [org.rhq.enterprise.server.purge.PurgeTemplate] (RHQScheduler_Worker-1) AlertDefinition: could not fully process the batched purge: java.sql.BatchUpdateException: Batch entry 0 DELETE FROM rhq_alert_definition WHERE id = 10041 was aborted.

The following was visible in the postgres log:

ERROR: update or delete on table "rhq_alert_definition" violates foreign key constraint "rhq_alert_condition_alert_definition_id_fkey" on table "rhq_alert_condition"
DETAIL: Key (id)=(10041) is still referenced from table "rhq_alert_condition".

Attaching content of rhq_alert_condition and rhq_alert_definition tables and complete server.log

Created attachment 1484655 [details]
complete server log
Created attachment 1484656 [details]
content of rhq_alert_condition table
Created attachment 1484657 [details]
content of rhq_alert_definition table
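The batched purge that the first commit aims for can be illustrated with a small, self-contained sketch. This is not the PurgeTemplate code: it uses Python with an in-memory SQLite database, and the table layout, the `deleted` flag column, the batch size, and the function names are all assumptions made for illustration. It only shows the general idea of deleting in small batches and committing after each one, so that no single transaction has to cover the whole delete.

```python
import sqlite3


def make_demo_db(row_count):
    """Build a hypothetical, simplified alert_definition table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE alert_definition ("
        "id INTEGER PRIMARY KEY, deleted INTEGER NOT NULL)"
    )
    # Flag every odd id as deleted so that half of the rows are purgeable.
    conn.executemany(
        "INSERT INTO alert_definition (id, deleted) VALUES (?, ?)",
        [(i, i % 2) for i in range(1, row_count + 1)],
    )
    conn.commit()
    return conn


def purge_in_batches(conn, batch_size=100):
    """Delete flagged rows in small batches, one short transaction per batch."""
    total = 0
    while True:
        # Select the next batch of candidate ids.
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM alert_definition WHERE deleted = 1 LIMIT ?",
                (batch_size,),
            )
        ]
        if not ids:
            return total
        # Delete the batch by primary key, then commit to end the transaction.
        conn.executemany(
            "DELETE FROM alert_definition WHERE id = ?", [(i,) for i in ids]
        )
        conn.commit()
        total += len(ids)
```

With 1000 rows and half of them flagged, `purge_in_batches(make_demo_db(1000))` runs five batches of 100 and leaves the unflagged rows in place.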
There's an additional missing feature / bug in this case. The alert conditions are not unlinked, so they will keep the deleted alert definitions locked -> we can't purge them.

In the master, this commit fixes that behavior:

commit dcff72ae8395a016b8f155ebd23f289cdc4b0063 (HEAD -> master)
Author: Michael Burman <miburman>
Date: Fri Sep 21 11:11:28 2018 +0300

[BZ 1579733] If AlertDefinition is set to deleted, remove links to all notifications and conditions also

This commit will ignore the broken rows to prevent the FK violation:

commit 5785ffcf35ebaa5ada7c3eaf453f2f5c8a0da90e (HEAD -> master)
Author: Michael Burman <miburman>
Date: Fri Sep 21 15:15:43 2018 +0300

[BZ 1579733] Ignore rows from alert_definition table that still have existing links in alert_condition

We still need to remove those broken ones from older versions, so one more commit is coming (another feature).

Final part in the master:

commit dbd695fb44217a671546a6bfb3056e5f257fe08c (HEAD -> master)
Author: Michael Burman <miburman>
Date: Mon Sep 24 23:25:31 2018 +0300

[BZ 159733] Unlink deleted definitions from the conditions before purging conditions and definitions

Moving to ON_QA. Available to test with http://download.eng.bos.redhat.com/brewroot/packages/org.jboss.on-jboss-on-parent/3.3.0.GA/186/maven/org/jboss/on/jon-server-patch/3.3.0.GA/jon-server-patch-3.3.0.GA.zip.

* This maps to CR02 build of 3.3.11.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2930
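The foreign key failure and the unlink-before-purge ordering named in the final commit can be reproduced in miniature. The sketch below is an illustration only, not the RHQ implementation: it uses Python with in-memory SQLite, and the simplified schema, the nullable `alert_definition_id` column, the `deleted` flag, and both function names are assumptions. It shows a direct delete being rejected while a condition still references the definition, and the same delete succeeding once the link is cleared first.

```python
import sqlite3


def make_db():
    """Hypothetical two-table schema with an enforced foreign key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript(
        """
        CREATE TABLE alert_definition (
            id INTEGER PRIMARY KEY,
            deleted INTEGER NOT NULL
        );
        CREATE TABLE alert_condition (
            id INTEGER PRIMARY KEY,
            alert_definition_id INTEGER REFERENCES alert_definition(id)
        );
        INSERT INTO alert_definition VALUES (10041, 1), (10042, 0);
        INSERT INTO alert_condition VALUES (1, 10041), (2, 10042);
        """
    )
    return conn


def naive_purge(conn):
    """Delete flagged definitions directly; fails while conditions still link to them."""
    try:
        conn.execute("DELETE FROM alert_definition WHERE deleted = 1")
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()
        return False


def unlink_then_purge(conn):
    """Clear the links, purge orphaned conditions, then purge the definitions."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "UPDATE alert_condition SET alert_definition_id = NULL "
            "WHERE alert_definition_id IN "
            "(SELECT id FROM alert_definition WHERE deleted = 1)"
        )
        conn.execute("DELETE FROM alert_condition WHERE alert_definition_id IS NULL")
        cur = conn.execute("DELETE FROM alert_definition WHERE deleted = 1")
    return cur.rowcount
```

In this sketch `naive_purge` returns `False` for definition 10041 because condition 1 still points at it, while `unlink_then_purge` removes both the orphaned condition and the definition and leaves the live definition 10042 untouched.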