1099114 – AlertAvailabilityDurationJob interrogates a wrong duration interval

Summary: AlertAvailabilityDurationJob interrogates a wrong duration interval

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Alerts
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	RHQ 4.12
Assignee:	Jay Shaughnessy
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1099112
Depends On:
Blocks:

Reported:	2014-05-19 13:57 UTC by Costel C
Modified:	2014-12-15 11:36 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-12-15 11:36:41 UTC
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1028473	0	high	CLOSED	Availability duration alerts ignore availability changes during the specified interval	2021-02-22 00:41:40 UTC

Description Costel C 2014-05-19 13:57:53 UTC

Description of problem:
The AlertAvailabilityDurationJob sometimes fails to trigger alerts because it queries the wrong duration interval.

The AlertAvailabilityDurationJob fetches from the database the list of availabilities that match the duration interval:
[durationStart = (currentTime - duration_sec * 1000), durationEnd = currentTime]

The problem I see is that the availabilities are stored in the database with the
startTime that comes from the availability report (rhq-agent time), but the AlertAvailabilityDurationJob fetches the list of availabilities using the
rhq-server current time.
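
A minimal sketch of how the mismatch can pull the previous UP into the job's window. Everything here is a hypothetical stand-in rather than RHQ code: the Avail class, the helper names, and in particular the assumption that the query matches any row whose [startTime, endTime] span overlaps the window. The scenario assumes an agent clock 30 seconds ahead of the server clock and no report delay.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DurationWindowSkewSketch {

    /** Hypothetical stand-in for a stored availability row; times are agent-clock ms. */
    static final class Avail {
        final String type;
        final long startTime;
        final long endTime;

        Avail(String type, long startTime, long endTime) {
            this.type = type;
            this.startTime = startTime;
            this.endTime = endTime;
        }
    }

    /** Assumed filter semantics: a row matches when its span strictly overlaps (start, end). */
    static List<String> typesInWindow(List<Avail> rows, long start, long end) {
        List<String> types = new ArrayList<String>();
        for (Avail a : rows) {
            if (a.startTime < end && a.endTime > start) {
                types.add(a.type);
            }
        }
        return types;
    }

    /** The window from the description: [currentTime - duration_sec * 1000, currentTime]. */
    static List<String> typesSeenByJob(List<Avail> rows, long serverNowMs, long durationSec) {
        long durationStart = serverNowMs - durationSec * 1000L;
        return typesInWindow(rows, durationStart, serverNowMs);
    }

    /** UP for one minute, then DOWN from downStartAgentMs onward (agent clock). */
    static List<Avail> upThenDown(long downStartAgentMs) {
        return Arrays.asList(
            new Avail("UP", downStartAgentMs - 60000L, downStartAgentMs),
            new Avail("DOWN", downStartAgentMs, Long.MAX_VALUE));
    }

    public static void main(String[] args) {
        List<Avail> rows = upThenDown(1000000L);

        // Clocks in sync: the job fires 300 s after the DOWN start time.
        System.out.println(typesSeenByJob(rows, 1300000L, 300L)); // [DOWN]

        // Agent 30 s ahead: on the server clock the job fires at 1270000,
        // so the window reaches back into the earlier UP row.
        System.out.println(typesSeenByJob(rows, 1270000L, 300L)); // [UP, DOWN]
    }
}
```

Under these assumptions the second call returns two availabilities, which matches the "has fluctuated" debug output reported in the actual results.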

How reproducible: Most of the time in some environments

Use cases where the duration interval can be wrong:
 - the rhq-agent time is not synchronized with the rhq-server time
 - the availability reported by the rhq-agent reaches the server with a delay
 and therefore the job is started late
 - other cases?


Steps to Reproduce:
I don't know how to reproduce it reliably; it happens in our environment (RHQ Server with 3 nodes).
1. Set the logger to DEBUG level
2. Define for a resource (in my case an HTTPService) the availability measurement schedule interval to 1 minute
3. Create an alert definition with Availability Duration Stays Not UP for 5 minutes
4. Make the resource availability UP and wait 1 minute
5. Make the resource availability DOWN and wait 5 minutes

Actual results:
No alert was fired, and the log file contains a debug message ("AlertAvailabilityDurationJob: No alert. Resource avail for ... has fluctuated....") along with a list of two availabilities {UP,DOWN}.

So the AlertAvailabilityDurationJob sometimes finds more availabilities than expected; usually the list includes the previous "Availability UP".

Expected results:
An alert "Availability Duration Stays NOT UP for 5 min" should be fired.

Additional info

Another problem I see is that the "Availability Stays Not UP" alert does not fire when the duration interval contains more than one availability.
But as long as every availability in the interval is anything other than "Availability UP" {DOWN, UNKNOWN, DISABLED},
the "Availability Stays Not UP" alert should fire.

==========================================================================
The fix that I propose is to pass the availability startTime to the
scheduler that checks the availability duration, so that the AlertAvailabilityDurationJob can use it as durationStart.

The following modifications to the source code are needed:

1. GlobalConditionCache.java:

 - provide the availability start time

 AvailabilityDurationCacheElement.checkCacheElements(durationCacheElements, resource, availabilityType, availability.getStartTime());

2. AvailabilityDurationCacheElement.java:

 - add a new parameter startTime and pass the value to the AvailabilityManager

 public static void checkCacheElements(List<AvailabilityDurationCacheElement> cacheElements, Resource resource, AvailabilityType providedValue, long startTime)

 LookupUtil.getAvailabilityManager().scheduleAvailabilityDurationCheck(cacheElement, resource, startTime);

3. AvailabilityManagerLocal.java, AvailabilityManagerBean.java:

 - add a new parameter durationStartTime and pass the value to the Job in the infoMap

 public void scheduleAvailabilityDurationCheck(AvailabilityDurationCacheElement cacheElement, Resource resource, long durationStartTime);

 infoMap.put(AlertAvailabilityDurationJob.DATAMAP_DURATION_START_TIME, String.valueOf(durationStartTime)); // in ms

4. AlertAvailabilityDurationJob.java:

 - use the provided duration start time

 long durationStart = Long.valueOf(infoMap.get(DATAMAP_DURATION_START_TIME)); // in ms
 long durationEnd = durationStart + duration * 1000;

 // move each bound inward by 1 ms to emulate an exclusive interval filter
 criteria.addFilterInterval(durationStart + 1, durationEnd - 1);
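
The window arithmetic from step 4 can be tried in isolation. This is only a sketch under assumptions: the infoMap is modelled as a plain Map<String, String>, the key strings and the DATAMAP_DURATION constant are hypothetical (the proposal names only DATAMAP_DURATION_START_TIME), and no RHQ classes are involved.

```java
import java.util.HashMap;
import java.util.Map;

final class AgentAnchoredWindowSketch {

    // Hypothetical key values; the proposal names only the DATAMAP_DURATION_START_TIME constant.
    static final String DATAMAP_DURATION = "duration";
    static final String DATAMAP_DURATION_START_TIME = "durationStartTime";

    /** What the scheduling side would store: the agent-reported start time, in ms. */
    static Map<String, String> buildInfoMap(long durationSec, long availStartTimeMs) {
        Map<String, String> infoMap = new HashMap<String, String>();
        infoMap.put(DATAMAP_DURATION, String.valueOf(durationSec));
        infoMap.put(DATAMAP_DURATION_START_TIME, String.valueOf(availStartTimeMs));
        return infoMap;
    }

    /** Returns {from, to} in ms, each bound moved inward by 1 ms. */
    static long[] queryBounds(Map<String, String> infoMap) {
        long durationSec = Long.parseLong(infoMap.get(DATAMAP_DURATION));
        long durationStart = Long.parseLong(infoMap.get(DATAMAP_DURATION_START_TIME));
        long durationEnd = durationStart + durationSec * 1000L;
        return new long[] { durationStart + 1, durationEnd - 1 };
    }

    public static void main(String[] args) {
        // DOWN reported with agent start time 1000000 ms, alert duration 300 s.
        long[] bounds = queryBounds(buildInfoMap(300L, 1000000L));
        System.out.println(bounds[0] + " .. " + bounds[1]); // 1000001 .. 1299999
    }
}
```

Because both bounds derive from the agent-reported start time, the server clock no longer enters the calculation, so clock skew and report delay do not shift the window.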

Comment 1 Costel C 2014-05-19 15:21:00 UTC

*** Bug 1099112 has been marked as a duplicate of this bug. ***

Comment 2 Jay Shaughnessy 2014-05-20 16:18:46 UTC

I agree with the analysis.  Although we require clock-sync between agents and servers, the sync is not expected to be perfect, just relatively close.  So, the typical agent, which is not co-located with the server, would likely have some clock delta relative to the server.

I also agree with the proposed solution, which is the same as what I came up with independently.  Although I didn't initially consider the +1/-1 boundary changes on the search, I think that's also good.  

So, excellent analysis Costel. Thanks!

Although I understand your thinking regarding the STAYS NOT UP semantics, I actually think it's working as expected.  NOT UP is intended to represent *an avail type* in the set of {DOWN, UNKNOWN, DISABLED} and the "STAYS" is intended to mean that that avail type does not change during the given interval.  It's not the intent that avail changes between types in the set are covered in an umbrella fashion.  Actually, a change from say, DOWN to UNKNOWN should invalidate the first duration check (DOWN) from firing, and initiate a second duration check for the UNKNOWN change (I think).
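
To make the two readings of STAYS NOT UP concrete, here is a small sketch. The enum and both method names are hypothetical and do not come from the RHQ code base; the first predicate models the single-type reading described in this comment, the second models the umbrella reading raised in the description.

```java
import java.util.Arrays;
import java.util.List;

final class StaysNotUpSemanticsSketch {

    /** Hypothetical enum mirroring the availability types named in this report. */
    enum Avail { UP, DOWN, UNKNOWN, DISABLED }

    /** Single-type reading: one non-UP type, unchanged for the whole interval. */
    static boolean staysNotUpSingleType(List<Avail> typesInInterval) {
        return typesInInterval.size() == 1 && typesInInterval.get(0) != Avail.UP;
    }

    /** Umbrella reading: any mix of non-UP types in the interval still counts. */
    static boolean staysNotUpUmbrella(List<Avail> typesInInterval) {
        return !typesInInterval.isEmpty() && !typesInInterval.contains(Avail.UP);
    }

    public static void main(String[] args) {
        List<Avail> downThenUnknown = Arrays.asList(Avail.DOWN, Avail.UNKNOWN);
        System.out.println(staysNotUpSingleType(downThenUnknown)); // false
        System.out.println(staysNotUpUmbrella(downThenUnknown));   // true
    }
}
```

The two predicates agree whenever the interval holds a single availability and differ only when the resource moves between non-UP types, such as DOWN to UNKNOWN.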

Does the current behavior actually affect you negatively or is this an observation?

Working the fix now...

Comment 3 Jay Shaughnessy 2014-05-21 17:56:50 UTC

Applying changes to master, setting author to Costel since the implemented solution was his suggestion:

commit 2a1ec4b4c201367c62e6bd305251a4d2d1ef032a
Author: Costel Cosman <costelcsmn>
Date:   Wed May 21 13:45:27 2014 -0400

Make sure the query for availability changes uses a duration adjusted for
the agent time, not the server time.  This is done by now storing the agent
avail change startTime in the timer's jobInfo for the duration check job.

notes
- these duration check jobs do not survive server restarts, so we can assume
  the job's infoMaps will always have the new startTime set.

Comment 4 Jay Shaughnessy 2014-05-23 19:01:34 UTC

Sorry, that was a local commit hash above, master commit should be:

commit 46e40a32a4ea2101559d7398109564fff1fc3db1
Author: Costel Cosman <costelcsmn>
Date:   Wed May 21 13:45:27 2014 -0400

Comment 5 Costel C 2014-06-30 16:52:25 UTC

Hi Jay, 

Sorry for my delayed answer.

Regarding STAYS NOT UP, it was an observation, but it could affect us negatively in the future.

From my understanding:
   STAYS DOWN means it remains in the DOWN state
   STAYS NOT UP means it remains in the NOT UP state.

Otherwise I don't see any reason to keep both STAYS DOWN and STAYS NOT UP.
(You explained that an UNKNOWN invalidates a previous DOWN, which is exactly the behavior of STAYS DOWN)

A use case: I want to be alerted when the resource remains DOWN or UNKNOWN for a defined duration interval.
(If the agent is DOWN, the resource is reported as UNKNOWN)


Regards,
Costel

Comment 6 Heiko W. Rupp 2014-12-15 11:36:41 UTC

Bulk close of items fixed in RHQ 4.12

If you think this is not solved, then please open a *new* BZ and link to this one.
