Bug 534314 (RHQ-1122) - throttle events
Summary: throttle events
Keywords:
Status: CLOSED NEXTRELEASE
Alias: RHQ-1122
Product: RHQ Project
Classification: Other
Component: Performance
Version: unspecified
Hardware: All
OS: All
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: John Mazzitelli
QA Contact: Jeff Weiss
URL: http://jira.rhq-project.org/browse/RH...
Whiteboard:
Depends On: RHQ-1064
Blocks:
 
Reported: 2008-11-14 21:11 UTC by John Mazzitelli
Modified: 2014-11-09 22:48 UTC (History)
CC List: 1 user

Fixed In Version: 1.2
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:



Description John Mazzitelli 2008-11-14 21:11:00 UTC
We need a way to throttle the number of events we store in the database.

If, for example, a JBossAS server resource is capturing events from its log4j at the WARN level, and something goes horribly wrong in that managed resource that causes WARN messages to be emitted infinitely, we could blow up our server by asking it to insert an abnormally large number of events (see the linked issue for an example of this happening).

We should have a threshold (perhaps configurable on a per-resource basis, or on a whole event-subsystem global basis) that says, "if we get more than X events in an event report, only insert X-N events", or maybe something time-based in the plugin container like "if we get X events in Y seconds, only report back X-N events".

Perhaps we could do some kind of filtering: if we get similar events within X seconds, only send up one of them.

In short, we need a throttling mechanism to avoid inserting too many events in the database.
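A rough sketch of what such a time-window throttle might look like (illustrative only; WindowThrottle is a hypothetical helper and the limits are assumptions, not RHQ code):

// Keeps at most maxEventsPerWindow events in each fixed time window;
// anything beyond that is dropped rather than inserted into the database.
public class WindowThrottle {
    private final int maxEventsPerWindow;
    private final long windowMillis;
    private long windowStart = System.currentTimeMillis();
    private int count;

    public WindowThrottle(int maxEventsPerWindow, long windowMillis) {
        this.maxEventsPerWindow = maxEventsPerWindow;
        this.windowMillis = windowMillis;
    }

    // Returns true if the event should be kept, false if it should be dropped.
    public synchronized boolean allow() {
        long now = System.currentTimeMillis();
        if (now - windowStart >= windowMillis) {
            windowStart = now; // start a new window
            count = 0;
        }
        return ++count <= maxEventsPerWindow;
    }
}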


Comment 1 John Mazzitelli 2008-11-19 17:20:35 UTC
This is critical - something has to be done. After perf testing, purging almost 8M rows of event data took a long time. We need to make sure we do not put too many events in the database.

For one, I suggest we limit the events in the database to 7 days' worth. See RHQ-1064.

Comment 2 Joseph Marques 2008-11-19 17:24:02 UTC
almost sounds like we need rhq_event_r01, rhq_event_r02, .... ; )

Comment 3 Charles Crouch 2008-11-19 17:54:31 UTC
Simply limiting the amount of data in the events table to 7 days' worth isn't going to help by itself; e.g. we generated over 7M rows in roughly 4 days. In fact this is going to reduce the usability of this feature for people who have a low volume of events but want to keep more history.

As mentioned in the main description, we need to limit the rate at which events are coming in. The main difference I see between this and metrics is that we have *no* control over the rate at which events are added. At least with metrics we have an idea based on the default metric collection intervals and the environment size. So one option would be to put an upper limit on "event density", e.g. make sure the event table won't contain more than 1000/10000/... events in any given hour time slice. Then we can tune our purging policy with this "event density" so that we know any one run of the purge job won't ever be asked to delete more than 100k/1M rows at once.

Ensuring a maximum "event density" is going to be tough across multiple event sources; maybe we assume most people won't have more than 20/100/... event sources and just put a limit of 1/20th or 1/100th of the max insertion rate on any one event source.
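A back-of-the-envelope sketch of that arithmetic (all numbers are illustrative assumptions, not RHQ defaults):

// Divide a global per-hour insertion budget across an assumed number of
// event sources to get a per-source cap for each 1-minute poll/report.
public class EventDensityBudget {
    public static void main(String[] args) {
        int maxEventsPerHour = 10000;   // upper bound on rows added in any hour time slice
        int assumedEventSources = 100;  // assumed upper bound on event sources

        int perSourcePerHour = maxEventsPerHour / assumedEventSources;
        int perSourcePerReport = Math.max(1, perSourcePerHour / 60); // 60-second polling

        System.out.println("per-source/hour limit: " + perSourcePerHour);
        System.out.println("per-source/report limit: " + perSourcePerReport);
    }
}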

Comment 4 John Mazzitelli 2009-01-13 15:45:42 UTC
As a poor man's optimization, we should make the maximum number of events the agent sends to the server at a time configurable. It is currently hardcoded to 1000 (I think). If we allow that to be configurable, we can mitigate major problems (e.g. setting it to 50 or 100 will help limit the # of events).

Comment 5 Charles Crouch 2009-01-13 21:31:07 UTC
Further to the suggestion we discussed this morning, which mazz doc'd above...

EventReport limits the number of events sent back in a single report for each EventSource...
private static final int MAX_EVENTS_PER_SOURCE = 1000;

We could lower this to, say, 60, given EventContext.MINIMUM_POLLING_INTERVAL = 60 (1 minute), and then add another API to EventReport to allow it to be overridden:
public void addEvent(@NotNull Event event, @NotNull EventSource eventSource, int maxEventsPerSource) {

maxEventsPerSource (which is really maxEventsPerSourcePerReport) could then be populated by passing through the value set by the plugin via a new API on EventContext:

void registerEventPoller(EventPoller poller, int pollingInterval, String sourceLocation, int maxEventsPerSource);
The existing registerEventPoller methods would just use a default value for maxEventsPerSource.

The EventPollerRunner created in registerEventPoller() would need another constructor argument to take maxEventsPerSource and pass it through to eventManager.publishEvents(), which would finally call the new addEvent() method on EventReport.
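A rough sketch of how the per-source cap in addEvent() might behave (illustrative only; String stands in for the real Event and EventSource domain classes, and this is not the actual RHQ implementation):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EventReportSketch {
    // matches the 1-minute minimum polling interval suggested above
    private static final int DEFAULT_MAX_EVENTS_PER_SOURCE = 60;

    private final Map<String, List<String>> eventsBySource = new HashMap<String, List<String>>();

    public void addEvent(String event, String eventSource) {
        addEvent(event, eventSource, DEFAULT_MAX_EVENTS_PER_SOURCE);
    }

    // maxEventsPerSource is really "max events per source per report"
    public void addEvent(String event, String eventSource, int maxEventsPerSource) {
        List<String> events = eventsBySource.get(eventSource);
        if (events == null) {
            events = new ArrayList<String>();
            eventsBySource.put(eventSource, events);
        }
        if (events.size() < maxEventsPerSource) {
            events.add(event);
        }
        // else: drop the event; later comments add a WARN marker event instead
    }
}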

Comment 6 Charles Crouch 2009-01-13 22:00:17 UTC
We could also add a more global maximum to EventReport, e.g.
private static final int MAX_EVENTS_PER_REPORT = 180;

Whenever a call is made to EventReport.addEvent() we could increment a counter, ignoring the EventSource, and compare it to MAX_EVENTS_PER_REPORT.
We can't make this value configurable in a plugin configuration because it applies across plugins. Maybe this is ripe for a special Mazz agent configuration setting :-)
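A sketch of what the global per-report counter could look like, including the "discarded events" marker suggested in later comments (class and field names are assumptions, not the actual EventReport code):

// Tracks a report-wide cap independent of EventSource, and records how many
// events were dropped so a WARN marker event can report the count.
public class GlobalReportLimit {
    private final int maxEventsPerReport;
    private int added;
    private int dropped;

    public GlobalReportLimit(int maxEventsPerReport) {
        this.maxEventsPerReport = maxEventsPerReport;
    }

    public synchronized boolean tryAdd() {
        if (added >= maxEventsPerReport) {
            dropped++;
            return false;
        }
        added++;
        return true;
    }

    // Text for the WARN event added to a full report, including the drop count.
    public synchronized String warnMessage() {
        return "Event Report Limit Reached: reached the maximum allowed events ["
            + maxEventsPerReport + "] for this report - dropped [" + dropped + "] events";
    }
}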

Comment 7 Jay Shaughnessy 2009-01-15 16:24:21 UTC
It certainly makes sense to throttle the number of events coming in. With severity-based log events the chances of flooding are high. I do think we should attempt to filter repeating events with a simple filter: the same event text (stripping out the log message delimiter) seen within some period of time or within some number of events.

As mentioned above, unless the reported number of events is manageable the feature is not useful. Moving forward it may make sense to move some of the Alert-on-Event features from alerting to events, meaning that instead of having the interesting patterns defined in the alerts, we move them to the event definition. This would reduce event collection, as opposed to having events discarded at alert matching time. Note this would require that events also be identifiable by either their pattern(s) or a name, for reference in the alert definition.
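A quick sketch of the repeat-filter idea (a hypothetical helper, not part of the RHQ event subsystem; in practice the log timestamp/delimiter would be stripped from the text before comparing):

import java.util.HashMap;
import java.util.Map;

public class RepeatEventFilter {
    private final long windowMillis;
    private final Map<String, Long> lastSeen = new HashMap<String, Long>();

    public RepeatEventFilter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Returns true if this event text has not been seen within the window.
    public synchronized boolean accept(String eventText) {
        long now = System.currentTimeMillis();
        Long previous = lastSeen.put(eventText, now);
        return previous == null || now - previous.longValue() >= windowMillis;
    }
}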


Comment 8 John Mazzitelli 2009-01-16 16:40:47 UTC
FYI: doc page on how to add a new agent preference:

http://support.rhq-project.org/display/RHQ/Design-Adding+Agent+Configuration+Preference

Comment 9 John Mazzitelli 2009-01-27 18:03:07 UTC
(12:59:18 PM) joseph: so mazz, does this rate limit for events work by discarding events once the report is full, or does it queue them up and spool, or what?
(1:00:01 PM) mazz: the idea is that we throw them away
(1:00:33 PM) mazz: the idea being that if we go over the limit, it means we can't support handling that many events
(1:00:40 PM) joseph: can we add a special event in there, "Threw away [discardedCount] events due to throttling"
(1:00:42 PM) mazz: so queueing won't help
(1:00:55 PM) mazz: we might be able to do that
(1:00:59 PM) joseph: so from a monitoring perspective we know whether it's happening and how often and how many are being dropped
(1:01:15 PM) joseph: we can then alert on that with an event-based alert with filtering
(1:02:02 PM) mazz: yeah... I think that's what I'll do - I won't mess with the internals any more than I have to - I'll avoid touching the descriptor and doing things per-event-"type|definition|whatever"

Comment 10 John Mazzitelli 2009-01-27 18:33:59 UTC
svn rev 2769 added a quick change to the event limit per source (200), and I added a limit of 400 events total in the report. I still have some more work to do for this issue (the configurability and the "discarded event" that we want to put in the event report if it gets full).

Comment 11 John Mazzitelli 2009-01-27 21:15:47 UTC
svn rev 2779 makes the event sender's initial delay and period configurable. The event report's max total # of events and max # of events per source are also configurable.

Comment 12 John Mazzitelli 2009-01-27 23:18:20 UTC
Two limits are now in effect - max total and max per source. If one of the limits is breached, a WARN event is added to the report to indicate events are being dropped.

To test, set the two limits (they are configurable in agent-configuration.xml) to something really low (like 1), import a resource that emits a lot of events, and see that the events are limited (you know this by seeing the WARN event).

Comment 13 John Mazzitelli 2009-01-27 23:33:57 UTC
Grrr... I should have put the number of events that were dropped in the WARN message. Reopening so I can do this.

Comment 14 John Mazzitelli 2009-01-28 08:03:30 UTC
If a limit is breached, the WARN message added to the report now contains the number of events that were dropped (this is useful for finding out how many messages/events were lost).

Comment 15 Charles Crouch 2009-01-29 05:16:00 UTC
Starting the RHQ server blows past the event source limit:

WARN 	/home/test_jon/jon03/perf/serv 	Event Report Limit Reached: reached the maximum allowed events [200] for this event source - dropped [65] events

Comment 16 Jeff Weiss 2009-02-09 18:55:47 UTC
Started the server as Charles suggested; got this in the agent that is monitoring the RHQ server:

2009-02-09 13:52:53,247 WARN  [EventManager.sender-1] (core.domain.event.transfer.EventReport)- Event Report Limit Reached: reached the maximum allowed events [200] for this event source - dropped [58] events: source=[EventSource[id=0, eventDefinition.name=logEntry, resource.name=witte.usersys.redhat.com RHQ Server,  JBossAS 4.2.1.GA default (0.0.0.0:2099), location=/home/jonqa/jon/jon-server-2.2.0-SNAPSHOT/logs/rhq-server-log4j.log]]


Comment 17 Red Hat Bugzilla 2009-11-10 20:24:25 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1122
This bug is duplicated by RHQ-1335
This bug relates to RHQ-114


