Description of problem:
The AlertAvailabilityDurationJob sometimes fails to trigger alerts because it does not look at the right duration interval.

The AlertAvailabilityDurationJob reads from the database the list of availabilities that match the duration interval:
  [durationStart = (currentTime - duration_sec * 1000), durationEnd = currentTime]

The problem I see is that availabilities are stored in the database with the measurement startTime that comes from the availability report (rhq-agent time), but the AlertAvailabilityDurationJob queries the list of availabilities using the rhq-server current time.

How reproducible:
Most of the time, in some environments.

Use cases in which the duration interval can be wrong:
- the rhq-agent time is not synchronized with the rhq-server time
- the availability reported by the rhq-agent reaches the server with a delay and therefore the Job is started with a delay
- other cases?

Steps to Reproduce:
I don't know how to reproduce it reliably; it happens in our environment (RHQ Server with 3 nodes).
1. Set the logger to DEBUG mode
2. Define for a resource (in my case an HTTPService) the availability measurement schedule interval as 1 minute
3. Create an alert definition with Availability Duration Stays Not UP for 5 minutes
4. Make the resource availability UP and wait 1 minute
5. Make the resource availability DOWN and wait 5 minutes

Actual results:
No alert was fired, and the log file contains a debug message ("AlertAvailabilityDurationJob: No alert. Resource avail for ... has fluctuated....") along with a list of two availabilities {UP, DOWN}. So the AlertAvailabilityDurationJob sometimes finds more availabilities than expected, usually including the previous "Availability UP".

Expected results:
An alert "Availability Duration Stays NOT UP for 5 min" should be fired.
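
To make the suspected mismatch concrete, here is a minimal sketch of how a window computed from server time can pick up the previous UP record when the agent clock runs ahead of the server. Everything in it is an assumption made for illustration: the class and method names are hypothetical, the records are simplified stand-ins for the stored availabilities, the 30-second skew is invented, and the query is modelled as "return every record that overlaps the interval", which may not match the real criteria filter exactly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; not RHQ code.
final class DurationWindowSketch {
    enum Avail { UP, DOWN }

    // Stand-in for a stored availability; endTime == null means "still current".
    static final class AvailRecord {
        final Avail type;
        final long startTime; // ms, agent clock
        final Long endTime;   // ms, agent clock, or null

        AvailRecord(Avail type, long startTime, Long endTime) {
            this.type = type;
            this.startTime = startTime;
            this.endTime = endTime;
        }
    }

    // Assumed query semantics: return the types of all records overlapping [start, end].
    static List<Avail> inWindow(List<AvailRecord> records, long start, long end) {
        List<Avail> hits = new ArrayList<>();
        for (AvailRecord r : records) {
            boolean startsBeforeWindowEnds = r.startTime <= end;
            boolean endsAfterWindowStarts = r.endTime == null || r.endTime >= start;
            if (startsBeforeWindowEnds && endsAfterWindowStarts) {
                hits.add(r.type);
            }
        }
        return hits;
    }
}
```

With UP stored as [0, 60000] and DOWN starting at 60000 (agent time), an agent clock 30 seconds ahead means the server sees the DOWN report at server time 30000 and the job fires at 330000. The server-time window [30000, 330000] overlaps both records, which matches the {UP, DOWN} list seen in the log. A window anchored at the DOWN start time, [60001, 359999], overlaps only the DOWN record.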
Additional info:
Another problem I see is that having more than one availability in the duration interval prevents the "Availability Stays Not UP" alert from firing. But as long as every availability in the interval is anything other than "Availability UP" {DOWN, UNKNOWN, DISABLED}, the "Availability Stays Not UP" alert should fire.

==========================================================================

The fix I propose is to pass the availability startTime to the scheduler that checks the availability duration, and to have the AlertAvailabilityDurationJob use it as durationStart.

The following modifications to the source code are needed:

1. GlobalConditionCache.java:
   - provide the availability start time

     AvailabilityDurationCacheElement.checkCacheElements(durationCacheElements, resource, availabilityType, availability.getStartTime());

2. AvailabilityDurationCacheElement.java:
   - add a new parameter startTime and pass the value to the AvailabilityManager

     public static void checkCacheElements(List<AvailabilityDurationCacheElement> cacheElements, Resource resource, AvailabilityType providedValue, long startTime)

     LookupUtil.getAvailabilityManager().scheduleAvailabilityDurationCheck(cacheElement, resource, startTime);

3. AvailabilityManagerLocal.java, AvailabilityManagerBean.java:
   - add a new parameter durationStartTime and pass the value to the Job in the infoMap

     public void scheduleAvailabilityDurationCheck(AvailabilityDurationCacheElement cacheElement, Resource resource, long durationStartTime);

     infoMap.put(AlertAvailabilityDurationJob.DATAMAP_DURATION_START_TIME, String.valueOf(durationStartTime)); // in ms

4. AlertAvailabilityDurationJob.java:
   - use the provided duration start time

     long durationStart = Long.valueOf(infoMap.get(DATAMAP_DURATION_START_TIME)); // in ms
     long durationEnd = durationStart + duration * 1000;
     criteria.addFilterInterval(durationStart + 1, durationEnd - 1); // shrink by 1 ms on each side to fake an exclusive interval filter
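
As a rough illustration of the two ideas above (anchoring the window at the availability start time, and treating any run of non-UP availabilities as "stays not UP"), the logic could be sketched as follows. This is only a sketch under stated assumptions: the class, the enum, and both method names are hypothetical stand-ins and are not part of the RHQ code base.

```java
import java.util.List;

// Hypothetical sketch; names are illustrative and not taken from RHQ.
final class StaysNotUpSketch {
    // Local stand-in for the availability types listed in this report.
    enum Avail { UP, DOWN, UNKNOWN, DISABLED }

    // Window anchored at the availability start time (ms) instead of server "now".
    // Both bounds are shrunk by 1 ms to approximate an exclusive interval.
    static long[] window(long availStartMillis, long durationSec) {
        long durationStart = availStartMillis;
        long durationEnd = durationStart + durationSec * 1000L;
        return new long[] { durationStart + 1, durationEnd - 1 };
    }

    // Proposed rule: fire when the window holds at least one availability
    // and none of them is UP, regardless of how many records there are.
    static boolean staysNotUp(List<Avail> availsInWindow) {
        if (availsInWindow.isEmpty()) {
            return false;
        }
        for (Avail a : availsInWindow) {
            if (a == Avail.UP) {
                return false;
            }
        }
        return true;
    }
}
```

Under this rule a window containing {DOWN, UNKNOWN} would fire the alert, while {UP, DOWN} would not.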
*** This bug has been marked as a duplicate of bug 1099114 ***