Steps to Reproduce:
1. Create 2 alert definitions:
   - definition1. Condition is Event detection with regexp1. Auto-disable is enabled.
   - definition2. Condition is Event detection with regexp2. The first definition is selected in Recovery settings.
2. Generate events triggering these alerts in turn, with a 1 sec interval, in a cycle (event1, 1 sec, event2, 1 sec, event1, etc.). I did it via log record generation with log parsing enabled.
3. Check events in the Events tab.
4. Check alerts in the Alerts tab.

Expected results:
2. event1 triggers alert1 and disables definition1. Then event2 triggers alert2 and re-enables definition1. This repeats in a cycle.
3. All generated events are present (2 * iterations count).
4. All generated alerts are present (2 * iterations count).

Actual results:
3. OK
2, 4. NOK. Some of the alerts are missing. There are several instances of alert1 and several instances of alert2. Their counts are not always equal.
Also, the order of alerts is sometimes wrong. I mean that the alert from, e.g., the 4th iteration may appear earlier than the alert from the 3rd one.
Hey Ilya, thanks for the report. It may be that a lot (all?) of this is fixed in the upcoming 4.10 release. If you have a chance to check against 4.10, we would appreciate your findings.
Hello Ilya, I'm able to reproduce your bug. I've been playing with the interval between generated events. When I set the wait interval to 30s, I got the expected behavior. This is because the agent sends events once every 30 seconds by default. There is an agent preference called rhq.agent.plugins.event-sender.period-secs. When I set it to 2 and generated log events with a 2s interval, I got the expected alert count as well. I am not sure this behavior is a bug. Consider the situation where the agent gets disconnected for some time: when it reconnects, it would push lots of events at once, which would generate lots of alerts that no longer make sense.
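To illustrate the arithmetic behind those observations, here is a minimal sketch that groups event timestamps by the send cycle that would deliver them. It assumes the agent simply flushes everything collected in each period; the function names batch_events and max_batch_size are hypothetical and not part of RHQ.

```python
def batch_events(event_times, send_period_secs):
    """Group event timestamps (seconds) by the send cycle that delivers them.

    Assumption: the agent flushes everything collected during one period
    in a single batch. This is a simplification, not RHQ code.
    """
    batches = {}
    for t in event_times:
        cycle = int(t // send_period_secs)  # index of the send period
        batches.setdefault(cycle, []).append(t)
    return [batches[c] for c in sorted(batches)]


def max_batch_size(interval_secs, send_period_secs, total_events=60):
    """Largest number of events delivered to the server in one batch."""
    times = [i * interval_secs for i in range(total_events)]
    return max(len(b) for b in batch_events(times, send_period_secs))


if __name__ == "__main__":
    # 1 s between events, default 30 s send period: many events per batch.
    print(max_batch_size(1, 30))
    # 30 s between events, 30 s send period: one event per batch.
    print(max_batch_size(30, 30))
    # 2 s between events, send period lowered to 2 s: one event per batch.
    print(max_batch_size(2, 2))
```

Under this model, the two configurations that gave the expected alert count are exactly the ones where each batch carries a single event, so a problem event and its recovery event never reach the server together.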
Ilya, the issue has been partially fixed. In general, the recovery alert comes very soon after the problem alert, so one might ask whether it was really a problem at all (since it recovered within seconds). The fix does not take effect unless you activate it. In your rhq-server.properties add this line: rhq.server.alerted.event.process.delay=500 This means that when the server is processing events and reaches an event that triggers an alert, processing sleeps for 500ms so that the alert itself has time to fire. Then, when the next event is processed (in our test case this event triggers the recovery alert), we already know there is a problem alert to be recovered. As you can see, this slows down event processing (when an event triggers an alert). It was not suitable to hardcode the delay (it may vary, and you may need to set it higher if your server is under load) or to enable it by default, which is why this system property was introduced. In master https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?id=07c6074ce424f91931224677dd15190a5fb44176
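The effect of the delay can be sketched with a toy model that uses a virtual clock instead of real sleeping. Everything here is an assumption made for illustration: the function process_batch, the fire_latency_ms parameter (time until a triggered alert1 becomes visible), and the rule that definition1 is only disabled once alert1 has actually fired. It does not reproduce RHQ's internals.

```python
def process_batch(events, delay_ms=0, fire_latency_ms=100):
    """Toy model of processing one batch of events back to back.

    Assumptions (hypothetical, not RHQ code):
    - alert1 becomes visible fire_latency_ms after event1 is processed;
    - definition1 is auto-disabled only once alert1 is visible;
    - alert2 re-enables definition1 only if it is disabled at that moment;
    - after an event that triggers an alert, processing pauses delay_ms.
    """
    now = 0                # virtual clock, milliseconds
    pending_at = None      # time at which the in-flight alert1 becomes visible
    def1_enabled = True
    alerts = []

    for ev in events:
        # Make the in-flight alert1 visible if its time has come.
        if pending_at is not None and now >= pending_at:
            alerts.append("alert1")
            def1_enabled = False   # auto-disable
            pending_at = None

        if ev == "event1":
            if def1_enabled and pending_at is None:
                pending_at = now + fire_latency_ms
                now += delay_ms    # configured pause after an alerting event
        elif ev == "event2":
            alerts.append("alert2")
            def1_enabled = True    # recovery re-enables definition1
            now += delay_ms

    # Anything still in flight fires after the batch is done.
    if pending_at is not None:
        alerts.append("alert1")
    return alerts


if __name__ == "__main__":
    batch = ["event1", "event2"] * 3
    print(process_batch(batch, delay_ms=0))
    print(process_batch(batch, delay_ms=500))
```

In this model, a zero delay yields three alert2 entries followed by a single late alert1, while a 500ms delay yields alternating alert1 and alert2 for every iteration. The delay only helps when it is at least as long as the time an alert needs to fire, which is consistent with the advice to raise it on a loaded server.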
Created attachment 860122: simple log message generator. Attaching a simple log message generator script.
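The attachment itself is not reproduced in this thread. For readers who want to try the scenario, a minimal sketch of such a generator might look like the following. The function name generate_log, the log line format, and the message texts are all hypothetical; the lines would have to match regexp1 and regexp2 and whatever format the log parser is configured for.

```python
import time


def generate_log(path, iterations, interval_secs=1.0,
                 messages=("ERROR event1 occurred", "INFO event2 recovered")):
    """Append alternating log lines to `path`, pausing between writes.

    The message texts and line layout are placeholders (assumptions);
    adjust them to match the regular expressions in the alert definitions.
    Returns the number of lines written.
    """
    written = 0
    with open(path, "a", encoding="utf-8") as log:
        for i in range(iterations):
            for msg in messages:
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                log.write(f"{stamp} iteration={i + 1} {msg}\n")
                log.flush()              # make the line visible to the parser now
                written += 1
                time.sleep(interval_secs)
    return written


if __name__ == "__main__":
    # Ten iterations of event1/event2 with a 1 s pause between lines.
    generate_log("generated.log", iterations=10, interval_secs=1.0)
```

Including the iteration number in each line makes it easier to spot missing or out-of-order alerts when comparing the Events and Alerts tabs.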
Libor, the fix looks good. We may want to consider enabling this by default. So, instead of a 0ms default, perhaps go back to the 500ms default. That is a short delay that likely won't negatively affect event processing while still possibly solving the problem without user interaction. In the unanticipated case that the 500ms delay is a problem, the user could always set the delay to 0 via the property. I'll leave it up to you, but I think it would be good if the default configuration prevented the problem out of the box.
You're right, Jay: being disabled by default does not bring much value. In master: 58b032a7b0065b53ef8bd60ae3a01bd3af9d35e3
Bulk closing of 4.10 issues. If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.