Bug 1099114

Summary:	AlertAvailabilityDurationJob interrogates a wrong duration interval
Product:	[Other] RHQ Project	Reporter:	Costel C <mulderika>
Component:	Alerts	Assignee:	Jay Shaughnessy <jshaughn>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Mike Foley <mfoley>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.10	CC:	hrupp, jshaughn
Target Milestone:	GA
Target Release:	RHQ 4.12
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2014-12-15 11:36:41 UTC	Type:	Bug

Description Costel C 2014-05-19 13:57:53 UTC

Description of problem:
The AlertAvailabilityDurationJob sometimes fails to trigger alerts because it queries the wrong duration interval.

The AlertAvailabilityDurationJob fetches from the database the list of availabilities that match the duration interval:
[durationStart = (currentTime - duration_sec * 1000), durationEnd = currentTime]

The problem I see is that the availabilities are stored in the database with the
startTime that comes from the availability report (rhq-agent time), but the AlertAvailabilityDurationJob queries the list of availabilities using the
rhq-server current time.
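
To illustrate the mismatch, here is a minimal sketch of a window computed from the server clock against rows stamped with agent time. Everything in it is hypothetical: AvailRow and ServerClockWindow are stand-ins, not RHQ classes, and the sketch assumes the interval filter returns every row that overlaps the window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ServerClockWindow {
    /** Hypothetical stand-in for a stored availability row; not an RHQ class. */
    static final class AvailRow {
        final String type;  // "UP", "DOWN", ...
        final long start;   // agent-side start time, in ms
        final long end;     // agent-side end time, in ms (Long.MAX_VALUE = still current)

        AvailRow(String type, long start, long end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Window as described in the report: derived from the server clock only. */
    static long[] serverWindow(long serverNowMs, long durationSec) {
        return new long[] { serverNowMs - durationSec * 1000, serverNowMs };
    }

    /** Assumed filter semantics: return the type of every row overlapping [from, to]. */
    static List<String> typesOverlapping(List<AvailRow> rows, long from, long to) {
        List<String> types = new ArrayList<>();
        for (AvailRow row : rows) {
            if (row.start <= to && row.end >= from) {
                types.add(row.type);
            }
        }
        return types;
    }

    public static void main(String[] args) {
        long durationSec = 300;    // "stays NOT UP for 5 minutes"
        long downAt = 1_000_000L;  // agent timestamp of the UP -> DOWN change
        List<AvailRow> rows = Arrays.asList(
            new AvailRow("UP", downAt - 60_000, downAt),
            new AvailRow("DOWN", downAt, Long.MAX_VALUE));

        // Assume the server clock runs 2 s behind the agent clock, so the
        // job fires at this server time, 5 minutes after the report arrived.
        long serverNow = downAt - 2_000 + durationSec * 1000;
        long[] window = serverWindow(serverNow, durationSec);

        // The window starts 2 s before the DOWN row, so the old UP row is included.
        System.out.println(typesOverlapping(rows, window[0], window[1]));  // prints [UP, DOWN]
    }
}
```

With both rows in the result, the job would see a fluctuation and not fire, which matches the debug output described under "Actual results".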

How reproducible: Most of the time, in some environments

Use-cases when the duration interval can be wrong:
 - the rhq-agent time is not synchronized with the rhq-server time
 - the availability reported by the rhq-agent reaches the server with a delay,
 and therefore the job is started late
 - other cases?


Steps to Reproduce:
I don't know how to reproduce it reliably; it happens in our environment (RHQ Server with 3 nodes).
1. Set the logger to DEBUG mode
2. For a resource (in my case an HTTPService), set the availability measurement schedule interval to 1 minute
3. Create an alert definition with Availability Duration Stays Not UP for 5 minutes
4. Make the resource availability UP and wait 1 minute
5. Make the resource availability DOWN and wait 5 minutes

Actual results:
No alert was fired, and the log file contains a debug message ("AlertAvailabilityDurationJob: No alert. Resource avail for ... has fluctuated....") along with a list of two availabilities {UP, DOWN}.

So the AlertAvailabilityDurationJob sometimes detects more availabilities than expected; the extra one is usually the previous "Availability UP".

Expected results:
An alert "Availability Duration Stays NOT UP for 5 min" should be fired.

Additional info

Another problem I see is that having more than one availability in the duration interval prevents the "Availability Stays Not UP" alert from firing.
But as long as we have anything but "Availability UP", i.e. {DOWN, UNKNOWN, DISABLED},
the "Availability Stays Not UP" alert should fire.

==========================================================================
The fix that I propose is to pass the availability startTime to the
scheduler that checks the availability duration, and to have the AlertAvailabilityDurationJob use it as durationStart.

The following modifications to the source code are needed:

1. GlobalConditionCache.java :

 - provide the availability start time
 AvailabilityDurationCacheElement.checkCacheElements(durationCacheElements,
 resource, availabilityType, availability.getStartTime());

 2. AvailabilityDurationCacheElement.java

 - add a new parameter startTime and provide the value to the AvailabilityManager

 public static void checkCacheElements(List<AvailabilityDurationCacheElement> cacheElements,
         Resource resource, AvailabilityType providedValue, long startTime)

 LookupUtil.getAvailabilityManager().scheduleAvailabilityDurationCheck(cacheElement, resource, startTime);

 3. AvailabilityManagerLocal.java, AvailabilityManagerBean.java:
 - add a new parameter durationStartTime and provide the value to the Job in
 the infoMap

 public void scheduleAvailabilityDurationCheck(AvailabilityDurationCacheElement
 cacheElement, Resource resource, long durationStartTime );

 infoMap.put(AlertAvailabilityDurationJob.DATAMAP_DURATION_START_TIME,
 String.valueOf(durationStartTime)); // in ms

 4. AlertAvailabilityDurationJob.java:

 - use the provided duration start time

 // in ms
 long durationStart = Long.valueOf(infoMap.get(DATAMAP_DURATION_START_TIME));
 long durationEnd = durationStart + duration * 1000;

 // reduce by 1 ms on each side to fake an exclusive interval filter
 criteria.addFilterInterval(durationStart + 1, durationEnd - 1);
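
As a rough check of the arithmetic in step 4, the sketch below anchors the window at the agent-reported start time and shrinks it by 1 ms on each side. AgentAnchoredWindow and its methods are hypothetical helpers, not RHQ code, and the overlap test assumes the interval filter treats its bounds as inclusive.

```java
final class AgentAnchoredWindow {
    /** Returns {from, to} for the query, anchored at the agent-reported start time. */
    static long[] queryBounds(long agentStartMs, long durationSec) {
        long durationStart = agentStartMs;
        long durationEnd = durationStart + durationSec * 1000;
        // Shrink by 1 ms on each side to emulate an exclusive interval.
        return new long[] { durationStart + 1, durationEnd - 1 };
    }

    /** Assumed filter semantics: a row matches when it overlaps [from, to] inclusively. */
    static boolean overlaps(long rowStart, long rowEnd, long from, long to) {
        return rowStart <= to && rowEnd >= from;
    }

    public static void main(String[] args) {
        long downAt = 1_000_000L;  // agent timestamp of the UP -> DOWN change
        long[] bounds = queryBounds(downAt, 300);

        // The previous UP row ends exactly at downAt, so the +1 ms keeps it out.
        System.out.println("UP overlaps: "
            + overlaps(downAt - 60_000, downAt, bounds[0], bounds[1]));
        // The DOWN row starts at downAt and is still current, so it matches.
        System.out.println("DOWN overlaps: "
            + overlaps(downAt, Long.MAX_VALUE, bounds[0], bounds[1]));
    }
}
```

Because the bounds no longer depend on the server clock, a clock delta between agent and server does not move the window in this model.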

Comment 1 Costel C 2014-05-19 15:21:00 UTC

*** Bug 1099112 has been marked as a duplicate of this bug. ***

Comment 2 Jay Shaughnessy 2014-05-20 16:18:46 UTC

I agree with the analysis.  Although we require clock-sync between agents and servers, the sync is not expected to be perfect, just relatively close.  So, the typical agent, which is not co-located with the server, would likely have some clock delta relative to the server.

I also agree with the proposed solution, which is the same as what I came up with independently.  Although I didn't initially consider the +1/-1 boundary changes on the search, I think that's also good.  

So, excellent analysis Costel. Thanks!

Although I understand your thinking regarding the STAYS NOT UP semantics, I actually think it's working as expected.  NOT UP is intended to represent *an avail type* in the set of {DOWN, UNKNOWN, DISABLED} and the "STAYS" is intended to mean that that avail type does not change during the given interval.  It's not the intent that avail changes between types in the set are covered in an umbrella fashion.  Actually, a change from say, DOWN to UNKNOWN should invalidate the first duration check (DOWN) from firing, and initiate a second duration check for the UNKNOWN change (I think).
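
The two readings of STAYS NOT UP discussed here can be contrasted with a small sketch. It is only an illustration: the enum and both predicates are hypothetical and operate on the list of avail types seen in the interval, not on RHQ's actual condition cache.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class StaysNotUpSemantics {
    /** Local illustrative enum; mirrors the avail types named in this discussion. */
    enum Avail { UP, DOWN, UNKNOWN, DISABLED }

    static final Set<Avail> NOT_UP = EnumSet.of(Avail.DOWN, Avail.UNKNOWN, Avail.DISABLED);

    /** "Single type" reading: one avail type from NOT_UP, unchanged over the interval. */
    static boolean staysOneNotUpType(List<Avail> seen) {
        if (seen.isEmpty() || !NOT_UP.contains(seen.get(0))) {
            return false;
        }
        for (Avail a : seen) {
            if (a != seen.get(0)) {
                return false;  // any change of type invalidates the check
            }
        }
        return true;
    }

    /** "Umbrella" reading: anything but UP for the whole interval. */
    static boolean staysNotUpUmbrella(List<Avail> seen) {
        return !seen.isEmpty() && NOT_UP.containsAll(seen);
    }

    public static void main(String[] args) {
        List<Avail> seen = Arrays.asList(Avail.DOWN, Avail.UNKNOWN);
        System.out.println("single type: " + staysOneNotUpType(seen));   // false
        System.out.println("umbrella:    " + staysNotUpUmbrella(seen));  // true
    }
}
```

The two predicates agree whenever only one avail type is seen; they differ only when the type changes within the NOT UP set.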

Does the current behavior actually affect you negatively or is this an observation?

Working the fix now...

Comment 3 Jay Shaughnessy 2014-05-21 17:56:50 UTC

Applying changes to master, setting author to Costel since the implemented solution was his suggestion:

commit 2a1ec4b4c201367c62e6bd305251a4d2d1ef032a
Author: Costel Cosman <costelcsmn>
Date:   Wed May 21 13:45:27 2014 -0400

Make sure the query for availability changes uses a duration adjusted for
the agent time, not the server time.  This is done by now storing the agent
avail change startTime in the timer's jobInfo for the duration check job.

notes
- these duration check jobs do not survive server restarts, so we can assume
  the job's infoMaps will always have the new startTime set.

Comment 4 Jay Shaughnessy 2014-05-23 19:01:34 UTC

Sorry, that was a local commit hash above, master commit should be:

commit 46e40a32a4ea2101559d7398109564fff1fc3db1
Author: Costel Cosman <costelcsmn>
Date:   Wed May 21 13:45:27 2014 -0400

Comment 5 Costel C 2014-06-30 16:52:25 UTC

Hi Jay, 

Sorry for my delayed answer.

Regarding STAYS NOT UP, it was an observation, but it could affect us negatively in the future.

From my understanding:
   STAYS DOWN means it remains in the DOWN state
   STAYS NOT UP means it remains in the NOT UP state.

Otherwise I don't see any reason to keep both STAYS DOWN and STAYS NOT UP.
(You explained that an UNKNOWN invalidates a previous DOWN, which is exactly the behavior of STAYS DOWN)

A use case: I want to be alerted when the resource remains DOWN or UNKNOWN for a defined duration interval.
(If the agent is DOWN the resource is reported as UNKNOWN)


Regards,
Costel

Comment 6 Heiko W. Rupp 2014-12-15 11:36:41 UTC

Bulk close of items fixed in RHQ 4.12

If you think this is not solved, then please open a *new* BZ and link to this one.