Bug 534562 (RHQ-1347)

Summary: Investigate high volume of alerts on JON Server reliability
Product: [Other] RHQ Project
Reporter: Charles Crouch <ccrouch>
Component: Alerts
Assignee: RHQ Project Maintainer <rhq-maint>
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Version: unspecified
CC: cwelton, hbrock
Keywords: FutureFeature, Task
Hardware: All
OS: All
URL: http://jira.rhq-project.org/browse/RHQ-1347
Doc Type: Enhancement
Last Closed: 2010-08-18 15:40:22 UTC

Description Charles Crouch 2009-01-12 21:38:00 UTC
<snip>
there could be problems with alerts if you have a lot of
MATCHING conditions.  The alerts cache is fast.  So, as long as most
pieces of data do NOT trigger conditions to become true we'll stay
standing no problem.  Issues arise when users create conditions that
always fire each time the metric is collected.  Realistically speaking,
conditions like that have no meaning.  Alerts SHOULD be used for
exceptional cases, not as a new form of logging.  ; )

That said, I have tested high volume alerts on postgres 8.3 when the db
and server were colocated and with NO other appreciable load being
applied to the server machine at that time.  I saw that the system was
handling about 1500 matches / second.  This will certainly decrease
when the server is remote to the db, and will certainly go down further
when the system is loaded, but what do we think is an acceptable /
reasonable figure to shoot for?
<snip>

From the above, 1500 matches/second sounds pretty fast. With 30-second metric collection intervals, a single condition per alert definition, and a single alert definition per resource, this is equivalent to firing an alert every 30 seconds on every resource in a 45k-resource inventory (1500 x 30 = 45,000). The other side of this equation is how well it's possible to select and purge from the alert tables once this volume of data has been inserted.
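
The arithmetic behind the 45k figure, and a rough idea of what the same rate means for the alert tables, can be sketched as follows. This is only a back-of-the-envelope calculation: the function names are hypothetical, and the daily row count assumes every match inserts exactly one alert row, which the report above does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def resources_supported(matches_per_second, collection_interval_s,
                        conditions_per_definition=1,
                        definitions_per_resource=1):
    """Number of resources that can fire on every collection at this rate."""
    # Total matches the server can absorb during one collection interval.
    matches_per_interval = matches_per_second * collection_interval_s
    # Matches a single resource produces per interval if everything fires.
    matches_per_resource = conditions_per_definition * definitions_per_resource
    return matches_per_interval // matches_per_resource


def alert_rows_per_day(matches_per_second):
    """Rows inserted per day, ASSUMING one alert row per match."""
    return matches_per_second * SECONDS_PER_DAY


if __name__ == "__main__":
    print(resources_supported(1500, 30))   # 45000
    print(alert_rows_per_day(1500))        # 129600000
```

Under those assumptions a sustained 1500 matches/second would add roughly 130 million rows per day, which is why select and purge performance matters as much as the insert rate.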



Comment 1 Charles Crouch 2009-01-12 21:43:36 UTC
It seems generally unlikely that users will set up their alert systems in such a way as to generate this many alerts from measurements alone, but it becomes more feasible when you consider how rapidly Events can be inserted into the system. In this case it may be helpful to require users to specify a regex filter in the alert definition (they can use * if they want all alerts of a given priority). Making this field mandatory, and adding a warning about the possibly large number of alerts that unfiltered definitions could generate, may help reduce the likelihood of users running into problems here.
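
A minimal sketch of the mandatory-filter idea, purely for illustration. This is not RHQ code: `validate_event_filter` and `MATCH_ALL_PATTERNS` are hypothetical names, and the match-everything filter is written as `.*` here because a bare `*` is not a valid pattern in Python's `re` module.

```python
import re

# Patterns treated as "effectively unfiltered" (an assumption for this sketch).
MATCH_ALL_PATTERNS = {".*", "^.*$", ".+"}


def validate_event_filter(pattern):
    """Validate a required regex filter for an event-based alert definition.

    Returns (compiled_pattern, warnings). Raises ValueError if the filter
    is missing or is not a valid regular expression.
    """
    if pattern is None or not pattern.strip():
        raise ValueError("a regex filter is required for event alerts")
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"invalid regex filter {pattern!r}: {exc}") from exc

    warnings = []
    if pattern.strip() in MATCH_ALL_PATTERNS:
        # The user explicitly asked for everything: allow it, but warn.
        warnings.append(
            "filter matches every event; this may generate a very large "
            "number of alerts"
        )
    return compiled, warnings


if __name__ == "__main__":
    _, warns = validate_event_filter(".*")
    print(warns)
    _, warns = validate_event_filter("OutOfMemoryError")
    print(warns)
```

The point of the design is that an unfiltered definition stays possible but becomes a deliberate choice: the user has to type the match-all pattern and sees the warning when doing so.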

Comment 2 Red Hat Bugzilla 2009-11-10 20:31:04 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1347


Comment 3 wes hayutin 2010-02-16 15:44:48 UTC
mass move off the qa triage list.  These are tasks for dev.

Comment 4 Corey Welton 2010-08-18 15:40:22 UTC
Closing per triage

*** This bug has been marked as a duplicate of bug 536155 ***