1099112 – AlertAvailabilityDurationJob interrogates a wrong duration interval

Summary: AlertAvailabilityDurationJob interrogates a wrong duration interval

Keywords:
Status:	CLOSED DUPLICATE of bug 1099114
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Alerts
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	RHQ Project Maintainer
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-05-19 13:56 UTC by Costel C
Modified:	2014-05-19 15:21 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-05-19 15:21:00 UTC
Embargoed:

Description Costel C 2014-05-19 13:56:33 UTC

Description of problem:
The AlertAvailabilityDurationJob sometimes fails to trigger alerts because it queries the wrong duration interval.

The AlertAvailabilityDurationJob fetches from the database the list of availabilities that match the duration interval:
[durationStart = (currentTime - duration_sec * 1000), durationEnd = currentTime]

The problem I see is that the availabilities are stored in the database with the
startTime that comes from the availability report (rhq-agent time), but the AlertAvailabilityDurationJob fetches the list of availabilities using the
rhq-server current time.
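
The mismatch can be illustrated with a minimal sketch. It is not RHQ code: the Avail class, the overlapping helper and the timestamps are hypothetical, and it assumes that the interval filter returns every availability record overlapping the interval and that the agent clock runs 10 seconds ahead of the server clock.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DurationWindowSketch {
    /** Hypothetical stand-in for a stored availability record; endTime == null means still open. */
    static final class Avail {
        final String type;
        final long startTime; // ms, stamped by the agent clock
        final Long endTime;   // ms, stamped by the agent clock
        Avail(String type, long startTime, Long endTime) {
            this.type = type;
            this.startTime = startTime;
            this.endTime = endTime;
        }
    }

    /** Returns the types of all records that overlap the exclusive interval (start, end). */
    static List<String> overlapping(List<Avail> records, long start, long end) {
        List<String> types = new ArrayList<>();
        for (Avail a : records) {
            long recordEnd = (a.endTime == null) ? Long.MAX_VALUE : a.endTime;
            if (a.startTime < end && recordEnd > start) {
                types.add(a.type);
            }
        }
        return types;
    }

    public static void main(String[] args) {
        long durationMs = 5 * 60 * 1000L;
        // Agent time: UP until 70_000, DOWN from 70_000 onwards.
        List<Avail> records = Arrays.asList(
            new Avail("UP", 0L, 70_000L),
            new Avail("DOWN", 70_000L, null));

        // The server clock reads 60_000 when the DOWN report arrives (assumed 10 s skew),
        // so the job later runs at server time 360_000 and looks back 5 minutes.
        long serverNow = 60_000L + durationMs;
        System.out.println(overlapping(records, serverNow - durationMs, serverNow)); // [UP, DOWN]

        // Window anchored at the agent-side start time of the DOWN record instead.
        long agentStart = 70_000L;
        System.out.println(overlapping(records, agentStart, agentStart + durationMs)); // [DOWN]
    }
}
```

With the server-clock window the previous UP record still overlaps the interval, which would match the "has fluctuated" message shown under Actual results.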

How reproducible: Most of the time, in some environments

Use cases in which the duration interval can be wrong:
 - the rhq-agent time is not synchronized with the rhq-server time
 - the availability reported by the rhq-agent reaches the server with a delay,
 and therefore the job is started with a delay
 - other cases?


Steps to Reproduce:
I don't know how to reproduce it reliably; it happens in our environment (RHQ Server with 3 nodes).
1. Set the logger to DEBUG mode
2. For a resource (in my case an HTTPService), set the availability measurement schedule interval to 1 minute
3. Create an alert definition with the condition "Availability Duration Stays Not UP" for 5 minutes
4. Make the resource availability UP and wait 1 minute
5. Make the resource availability DOWN and wait 5 minutes

Actual results:
No alert was fired, and a debug message appears in the log file ("AlertAvailabilityDurationJob: No alert. Resource avail for ... has fluctuated....") along with a list of two availabilities {UP, DOWN}.

So the AlertAvailabilityDurationJob sometimes finds more than one availability in the interval, usually including the previous "Availability UP".

Expected results:
An alert "Availability Duration Stays NOT UP for 5 min" should be fired.

Additional info:

Another problem I see is that when the duration interval contains more than one availability, the "Availability Stays Not UP" alert is not fired.
But as long as the interval contains only availabilities other than "Availability UP" {DOWN, UNKNOWN, DISABLED},
the "Availability Stays Not UP" alert should fire.
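
The rule described above could be sketched as follows. This is only an illustration of the expected behaviour, not the existing job logic: the Avail enum is a local stand-in for the four states named in this report, staysNotUp is a hypothetical helper, and treating an empty interval as "no alert" is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class StaysNotUpSketch {
    /** Local stand-in for the availability states named in this report. */
    enum Avail { UP, DOWN, UNKNOWN, DISABLED }

    /**
     * Returns true when the interval holds at least one record and none of them is UP,
     * regardless of how many different non-UP states were seen.
     */
    static boolean staysNotUp(List<Avail> typesInInterval) {
        if (typesInInterval.isEmpty()) {
            return false; // assumption: no data means no basis for an alert
        }
        for (Avail a : typesInInterval) {
            if (a == Avail.UP) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(staysNotUp(Arrays.asList(Avail.DOWN, Avail.UNKNOWN))); // true
        System.out.println(staysNotUp(Arrays.asList(Avail.UP, Avail.DOWN)));      // false
        System.out.println(staysNotUp(Collections.<Avail>emptyList()));           // false
    }
}
```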

==========================================================================
The fix I propose is to pass the availability startTime to the
scheduler that checks the availability duration, and to have the AlertAvailabilityDurationJob use it as durationStart.

The following modifications to the source code are needed:

1. GlobalConditionCache.java:

 - provide the availability start time

 AvailabilityDurationCacheElement.checkCacheElements(durationCacheElements, resource, availabilityType, availability.getStartTime());

2. AvailabilityDurationCacheElement.java:

 - add a new parameter startTime and pass its value to the AvailabilityManager

 public static void checkCacheElements(List<AvailabilityDurationCacheElement> cacheElements, Resource resource, AvailabilityType providedValue, long startTime)

 LookupUtil.getAvailabilityManager().scheduleAvailabilityDurationCheck(cacheElement, resource, startTime);

3. AvailabilityManagerLocal.java, AvailabilityManagerBean.java:

 - add a new parameter durationStartTime and pass its value to the job in the infoMap

 public void scheduleAvailabilityDurationCheck(AvailabilityDurationCacheElement cacheElement, Resource resource, long durationStartTime);

 infoMap.put(AlertAvailabilityDurationJob.DATAMAP_DURATION_START_TIME, String.valueOf(durationStartTime)); // in ms

4. AlertAvailabilityDurationJob.java:

 - use the provided duration start time

 long durationStart = Long.valueOf(infoMap.get(DATAMAP_DURATION_START_TIME)); // in ms
 long durationEnd = durationStart + duration * 1000;

 // shrink by 1 ms at each end to emulate an exclusive interval filter
 criteria.addFilterInterval(durationStart + 1, durationEnd - 1);
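
The interval arithmetic of step 4 can be checked in isolation with a small sketch. It is a standalone approximation, not the actual job: the infoMap is modelled as a plain Map<String, String>, the key strings are hypothetical, and the DATAMAP_DURATION key (duration in seconds) is assumed to be available alongside the proposed DATAMAP_DURATION_START_TIME.

```java
import java.util.HashMap;
import java.util.Map;

final class DurationIntervalSketch {
    // Key names follow the proposal above; the string values are hypothetical.
    static final String DATAMAP_DURATION_START_TIME = "durationStartTime"; // in ms
    static final String DATAMAP_DURATION = "duration";                     // in seconds (assumed)

    /** Returns {from, to} in ms, shrunk by 1 ms at each end to emulate an exclusive filter. */
    static long[] queryInterval(Map<String, String> infoMap) {
        long durationStart = Long.parseLong(infoMap.get(DATAMAP_DURATION_START_TIME));
        long durationSec = Long.parseLong(infoMap.get(DATAMAP_DURATION));
        long durationEnd = durationStart + durationSec * 1000L;
        return new long[] { durationStart + 1, durationEnd - 1 };
    }

    public static void main(String[] args) {
        Map<String, String> infoMap = new HashMap<>();
        infoMap.put(DATAMAP_DURATION_START_TIME, String.valueOf(70_000L)); // agent-side start time
        infoMap.put(DATAMAP_DURATION, "300");                              // 5 minutes

        long[] interval = queryInterval(infoMap);
        System.out.println(interval[0] + " .. " + interval[1]); // 70001 .. 369999
    }
}
```

Because the window is derived from the availability start time rather than from the server clock at execution time, it no longer depends on clock skew between agent and server or on when the job actually runs.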

Comment 1 Costel C 2014-05-19 15:21:00 UTC


*** This bug has been marked as a duplicate of bug 1099114 ***
