1058187 – Alert definitions recovery mechanism is unreliable

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Alerts
Sub Component:
Version:	4.9
Hardware:	All
OS:	All
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	RHQ 4.10
Assignee:	Libor Zoubek
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-01-27 07:55 UTC by Ilya Maleev
Modified:	2015-11-02 00:43 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-04-23 12:31:29 UTC
Embargoed:

Attachments
simple log message generator (271 bytes, application/x-shellscript) 2014-02-06 12:22 UTC, Libor Zoubek, no flags

Description Ilya Maleev 2014-01-27 07:55:04 UTC

Steps to Reproduce:
1. Create 2 alert definitions:
   - definition1. Condition is Event detection with regexp1. Auto-disable is enabled.
   - definition2. Condition is Event detection with regexp2. The first definition is selected in Recovery settings.
2. Generate events that trigger these alerts in turn, at 1 sec intervals in a loop (event1, 1 sec, event2, 1 sec, event1, etc.). I did this by generating log records with log parsing enabled.
3. Check events in Events tab.
4. Check alerts in Alerts tab.
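
For anyone repeating step 2, a minimal sketch of such a generator follows. It is not the script used for this report: the "event1"/"event2" message texts, the log4j-style line layout, and the function names are assumptions, and regexp1/regexp2 would have to be chosen to match them.

```python
import time
from datetime import datetime


def alternating_messages(iterations):
    """Return event1/event2 messages in turn, one pair per iteration."""
    messages = []
    for i in range(iterations):
        messages.append(f"event1 iteration={i}")
        messages.append(f"event2 iteration={i}")
    return messages


def write_log(path, iterations, interval_secs=1.0):
    """Append one log line per message, pausing interval_secs after each."""
    with open(path, "a", encoding="utf-8") as log:
        for message in alternating_messages(iterations):
            # Millisecond timestamp; the layout is an assumed log4j-like format.
            stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
            log.write(f"{stamp} ERROR [generator] {message}\n")
            log.flush()  # make the line visible to the log parser right away
            time.sleep(interval_secs)
```

Calling write_log("/tmp/rhq-test.log", iterations=10) (the path is only an example) would write 20 lines, one per second, to a file that the resource's log event source is assumed to be watching.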

Expected results:
2. event1 triggers alert1 and disables definition1. Then event2 triggers alert2 and re-enables definition1. This repeats on every iteration.
3. All generated events are present (2 * iteration count).
4. All generated alerts are present (2 * iteration count).

Actual results:
3. OK
2, 4. NOK. Some of the alerts are missing. There are several instances of alert1 and several instances of alert2, and their counts are not always equal.

Comment 1 Ilya Maleev 2014-01-27 08:01:34 UTC

Also, the order of alerts is sometimes wrong: an alert from, e.g., the 4th iteration may appear earlier than the alert from the 3rd one.

Comment 2 Heiko W. Rupp 2014-01-27 08:56:42 UTC

Hey Ilya,
thanks for the report. It may be that a lot (all?) of this is fixed in the upcoming 4.10 release. If you have a chance to check against 4.10, we would appreciate your findings.

Comment 3 Libor Zoubek 2014-01-28 09:05:24 UTC

Hello Ilya, 

I'm able to reproduce your bug. I've been playing with the interval between generated events. When I set the wait interval to 30s, I got the expected behavior. This is because the agent sends events once every 30 seconds by default. There is an agent preference called rhq.agent.plugins.event-sender.period-secs. When I set it to 2 and generated log events with a 2s interval, I got the expected alert count as well.

I am not sure whether this behavior is a bug. Consider the situation where the agent gets disconnected for some time: when it reconnects, it would start pushing lots of events that would generate lots of alerts, which no longer make sense.
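
To make the batching effect concrete, here is a rough sketch of how many events end up in one agent report for a given sender period. It assumes the agent flushes at fixed multiples of the period, which is a simplification; group_into_reports is a hypothetical helper and not part of RHQ.

```python
from collections import defaultdict


def group_into_reports(event_times_secs, period_secs):
    """Group event timestamps by the sender period they fall into.

    Assumption: the agent sends one report per period, containing every
    event collected since the previous report.
    """
    reports = defaultdict(list)
    for t in event_times_secs:
        reports[int(t // period_secs)].append(t)
    return [reports[key] for key in sorted(reports)]


# One event per second for a minute, default 30s sender period:
default_reports = group_into_reports(range(60), period_secs=30)

# One event every 2 seconds, sender period lowered to 2s:
tuned_reports = group_into_reports(range(0, 60, 2), period_secs=2)
```

Under this assumption the first case delivers 2 reports of 30 events each, so the server sees the problem and recovery events back to back, while the second case delivers 30 reports of 1 event each.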

Comment 4 Libor Zoubek 2014-02-06 10:40:53 UTC

Ilya, the issue has been partially fixed. In general, the recovery alert comes very quickly after the problem alert, so one might ask whether it was really a problem (since it recovered within seconds). The fix will not take effect unless you activate it.

In your rhq-server.properties, add this line:

rhq.server.alerted.event.process.delay=500

This means that when the server is processing events and an event triggers an alert, processing sleeps for 500ms so that the alert itself has time to fire. Then, when the next event is processed (in our test case, the one that triggers the recovery alert), the server already knows there is a problem alert to recover.
As you can see, this slows down event processing (whenever an event triggers an alert). It was not suitable to hardcode the delay (it may vary, and you may need to set it higher if your server is under load) or to enable it by default, which is why this system property was introduced.
  
In master https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?id=07c6074ce424f91931224677dd15190a5fb44176
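
The effect of the delay can be illustrated with a toy model. This is only a sketch under stated assumptions and not RHQ's implementation: it uses a virtual clock instead of real sleeps, assumes a fixed fire_latency_ms between an event matching and alert1 being recorded, and simply skips a recovery event when no problem alert has been recorded yet. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyEventProcessor:
    delay_ms: int          # plays the role of rhq.server.alerted.event.process.delay
    fire_latency_ms: int   # assumed lag before a triggered alert1 is recorded
    now_ms: int = 0        # virtual clock, so nothing really sleeps
    definition1_enabled: bool = True
    alert1_due_ms: Optional[int] = None
    alerts: List[str] = field(default_factory=list)

    def _settle(self) -> None:
        # Record the in-flight alert1 once its latency has elapsed,
        # then auto-disable definition1.
        if self.alert1_due_ms is not None and self.now_ms >= self.alert1_due_ms:
            self.alerts.append("alert1")
            self.definition1_enabled = False
            self.alert1_due_ms = None

    def process(self, event: str) -> None:
        self._settle()
        if event == "event1" and self.definition1_enabled and self.alert1_due_ms is None:
            self.alert1_due_ms = self.now_ms + self.fire_latency_ms
            self.now_ms += self.delay_ms  # pause after an alert-triggering event
        elif event == "event2" and not self.definition1_enabled:
            self.alerts.append("alert2")  # recovery alert re-enables definition1
            self.definition1_enabled = True
            self.now_ms += self.delay_ms
        # Otherwise the model has nothing to fire or recover and skips the event.

    def finish(self) -> None:
        # Let any in-flight alert1 land once the batch is done.
        if self.alert1_due_ms is not None:
            self.now_ms = max(self.now_ms, self.alert1_due_ms)
            self._settle()


def run_batch(events, delay_ms, fire_latency_ms):
    """Process one batch of events and return the alerts the model recorded."""
    processor = ToyEventProcessor(delay_ms=delay_ms, fire_latency_ms=fire_latency_ms)
    for event in events:
        processor.process(event)
    processor.finish()
    return processor.alerts


batch = ["event1", "event2"] * 3
no_delay = run_batch(batch, delay_ms=0, fire_latency_ms=100)
with_delay = run_batch(batch, delay_ms=500, fire_latency_ms=100)
under_load = run_batch(batch, delay_ms=500, fire_latency_ms=800)
```

In this model, a batch of three event1/event2 pairs yields a single alert1 with no delay and all six alerts with a 500ms delay and 100ms latency. If the latency grows to 800ms, the 500ms delay is no longer enough, which corresponds to the note above about raising the value on a loaded server.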

Comment 5 Libor Zoubek 2014-02-06 12:22:48 UTC

Created attachment 860122
simple log message generator

Attaching a simple log message generator script.

Comment 6 Jay Shaughnessy 2014-02-06 22:30:52 UTC

Libor, the fix looks good. We may want to consider enabling this by default: instead of a 0ms default, perhaps go back to the 500ms default. That is a short delay that likely won't negatively affect event processing while still possibly solving the problem without user interaction. In the unanticipated case that the 500ms delay is a problem, the user can always set the delay to 0 via the property.

I'll leave it up to you, but I think it would be good if the default configuration prevented the problem out of the box.

Comment 7 Libor Zoubek 2014-02-07 17:45:18 UTC

You're right, Jay: being disabled by default does not bring much value.

in master 58b032a7b0065b53ef8bd60ae3a01bd3af9d35e3

Comment 8 Heiko W. Rupp 2014-04-23 12:31:29 UTC

Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.
