Bug 1099112

Summary: AlertAvailabilityDurationJob interrogates a wrong duration interval
Product: [Other] RHQ Project
Reporter: Costel C <mulderika>
Component: Alerts
Assignee: RHQ Project Maintainer <rhq-maint>
Status: CLOSED DUPLICATE
QA Contact: Mike Foley <mfoley>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.10
CC: hrupp
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-05-19 15:21:00 UTC Type: Bug

Description Costel C 2014-05-19 13:56:33 UTC
Description of problem:
The AlertAvailabilityDurationJob sometimes fails to trigger alerts because it queries the wrong duration interval.

The AlertAvailabilityDurationJob fetches from the database the list of availabilities that match the duration interval:
[durationStart = (currentTime - duration_sec * 1000), durationEnd = currentTime]

The problem I see is that the availabilities are stored in the database with the
startTime that comes from the availability report (rhq-agent time), but the AlertAvailabilityDurationJob queries the list of availabilities using the
rhq-server current time.
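
To illustrate the mismatch, here is a minimal sketch, not RHQ code. The Avail class and the typesInWindow helper are hypothetical stand-ins, and the overlap rule (a record matches when it starts before the window ends and ends after the window starts) is an assumption about how the interval filter behaves. The scenario assumes the agent clock runs 10 seconds ahead of the server.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DurationWindowSketch {

    /** Simplified stand-in for an availability record; endTime == null means "still current". */
    static final class Avail {
        final String type;
        final long startTime;
        final Long endTime;

        Avail(String type, long startTime, Long endTime) {
            this.type = type;
            this.startTime = startTime;
            this.endTime = endTime;
        }
    }

    /** Returns the types of all records overlapping [windowStart, windowEnd] (assumed semantics). */
    static List<String> typesInWindow(List<Avail> history, long windowStart, long windowEnd) {
        List<String> types = new ArrayList<>();
        for (Avail a : history) {
            boolean startsBeforeWindowEnds = a.startTime <= windowEnd;
            boolean endsAfterWindowStarts = a.endTime == null || a.endTime >= windowStart;
            if (startsBeforeWindowEnds && endsAfterWindowStarts) {
                types.add(a.type);
            }
        }
        return types;
    }

    public static void main(String[] args) {
        long durationSec = 300;             // "Stays Not UP for 5 minutes"
        long serverReceiveTime = 1_000_000; // server clock when the DOWN report arrives (ms)
        long agentSkew = 10_000;            // assumption: agent clock is 10 s ahead of the server

        // The DOWN record is stamped with agent time, not server time.
        long downStart = serverReceiveTime + agentSkew;
        List<Avail> history = Arrays.asList(
            new Avail("UP", 0L, downStart),
            new Avail("DOWN", downStart, null));

        // The job fires durationSec later and builds its window from the server clock.
        long currentTime = serverReceiveTime + durationSec * 1000;
        long durationStart = currentTime - durationSec * 1000;
        long durationEnd = currentTime;

        System.out.println(typesInWindow(history, durationStart, durationEnd));
    }
}
```

With these numbers the server-time window starts 10 seconds before the DOWN record does, so the tail of the previous UP record falls inside it and the list contains both UP and DOWN.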

How reproducible: Most of the time, in some environments

Use-cases when the duration interval can be wrong:
 - the rhq-agent time is not synchronized with the rhq-server time
 - the availability reported by the rhq-agent reaches the server with a delay,
 and therefore the Job is started with a delay
 - other cases?


Steps to Reproduce:
I don't know how to reproduce it reliably; it happens in our environment (RHQ Server with 3 nodes).
1. Set the logger to DEBUG mode
2. For a resource (in my case an HTTPService), set the availability measurement schedule interval to 1 minute
3. Create an alert definition with Availability Duration "Stays Not UP" for 5 minutes
4. Make the resource availability UP and wait 1 minute
5. Make the resource availability DOWN and wait 5 minutes

Actual results:
No alert was fired, and the log file shows a debug message ("AlertAvailabilityDurationJob: No alert. Resource avail for ... has fluctuated....") along with a list of two availabilities {UP, DOWN}.

So the AlertAvailabilityDurationJob sometimes finds more than one availability in the interval, usually including the previous "Availability UP".

Expected results:
An alert "Availability Duration Stays NOT UP for 5 min" should be fired.

Additional info:

Another problem I see is that having more than one availability in the duration interval prevents the "Availability Stays Not UP" alert from firing.
But as long as every availability in the interval is anything other than UP {DOWN, UNKNOWN, DISABLED},
the "Availability Stays Not UP" alert should fire.
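
The rule described above could be expressed roughly as follows. This is only a sketch: AvailType is a local stand-in listing the types named in this report, and staysNotUp is a hypothetical helper, not an existing RHQ method.

```java
import java.util.Arrays;
import java.util.List;

final class StaysNotUpSketch {

    /** Local stand-in listing only the availability types named in this report. */
    enum AvailType { UP, DOWN, UNKNOWN, DISABLED }

    /** True when the interval holds at least one record and none of them is UP. */
    static boolean staysNotUp(List<AvailType> typesInInterval) {
        return !typesInInterval.isEmpty() && !typesInInterval.contains(AvailType.UP);
    }

    public static void main(String[] args) {
        // Several non-UP records in the interval: the alert should still fire.
        System.out.println(staysNotUp(Arrays.asList(AvailType.DOWN, AvailType.UNKNOWN)));
        // An UP record in the interval: the resource did not stay "not UP".
        System.out.println(staysNotUp(Arrays.asList(AvailType.UP, AvailType.DOWN)));
    }
}
```

The point is that the check looks at which types appear in the interval, not at how many records there are.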

==========================================================================
The fix I propose is to pass the availability startTime to the
scheduler that checks the availability duration, and to have the AlertAvailabilityDurationJob use it as durationStart.

The following modifications are needed in the source code:

1. GlobalConditionCache.java :

 - provide the availability start time
 AvailabilityDurationCacheElement.checkCacheElements(durationCacheElements,
 resource, availabilityType, availability.getStartTime());

 2. AvailabilityDurationCacheElement.java

 - add a new parameter startTime and provide the value to the AvailabilityManager

 public static void checkCacheElements(List<AvailabilityDurationCacheElement> cacheElements,
     Resource resource, AvailabilityType providedValue, long startTime)

 LookupUtil.getAvailabilityManager().scheduleAvailabilityDurationCheck(cacheElement, resource, startTime);

 3. AvailabilityManagerLocal.java, AvailabilityManagerBean.java:
 - add a new parameter durationStartTime and provide the value to the Job in
 the infoMap

 public void scheduleAvailabilityDurationCheck(AvailabilityDurationCacheElement cacheElement,
     Resource resource, long durationStartTime);

 infoMap.put(AlertAvailabilityDurationJob.DATAMAP_DURATION_START_TIME,
     String.valueOf(durationStartTime)); // in ms

 4. AlertAvailabilityDurationJob.java:

 - use the provided duration start time

 long durationStart = Long.valueOf(infoMap.get(DATAMAP_DURATION_START_TIME)); // in ms
 long durationEnd = durationStart + duration * 1000;

 // narrow by 1 ms at each end to fake an exclusive interval filter
 criteria.addFilterInterval(durationStart + 1, durationEnd - 1);
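
Taken together, steps 3 and 4 amount to passing the start time through the job's info map as a string and deriving the filter bounds from it. Below is a self-contained sketch of that round trip, under stated assumptions: the key name DATAMAP_DURATION_START_TIME comes from the proposal, but its string value, the DATAMAP_DURATION key, and the helpers buildInfoMap and filterBounds are hypothetical, and a plain Map stands in for the scheduler's data map.

```java
import java.util.HashMap;
import java.util.Map;

final class DurationStartPlumbingSketch {

    // Key name taken from the proposal; the string value is a hypothetical placeholder.
    static final String DATAMAP_DURATION_START_TIME = "durationStartTime";
    // Hypothetical key for the configured duration, in seconds.
    static final String DATAMAP_DURATION = "duration";

    /** Scheduler side: store both values as strings. */
    static Map<String, String> buildInfoMap(long durationStartMillis, long durationSec) {
        Map<String, String> infoMap = new HashMap<>();
        infoMap.put(DATAMAP_DURATION_START_TIME, String.valueOf(durationStartMillis)); // in ms
        infoMap.put(DATAMAP_DURATION, String.valueOf(durationSec));                    // in s
        return infoMap;
    }

    /** Job side: returns {filterStart, filterEnd} in ms, narrowed by 1 ms at each end. */
    static long[] filterBounds(Map<String, String> infoMap) {
        long durationStart = Long.parseLong(infoMap.get(DATAMAP_DURATION_START_TIME));
        long durationSec = Long.parseLong(infoMap.get(DATAMAP_DURATION));
        long durationEnd = durationStart + durationSec * 1000L;
        return new long[] { durationStart + 1, durationEnd - 1 };
    }

    public static void main(String[] args) {
        // Availability start time as stamped by the agent, with a 5 minute duration.
        Map<String, String> infoMap = buildInfoMap(1_010_000L, 300L);
        long[] bounds = filterBounds(infoMap);
        System.out.println(bounds[0] + " .. " + bounds[1]);
    }
}
```

Because both bounds derive from the availability's own start time, the window no longer depends on how the server clock relates to the agent clock.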

Comment 1 Costel C 2014-05-19 15:21:00 UTC

*** This bug has been marked as a duplicate of bug 1099114 ***