Bug 1099114
Summary: | AlertAvailabilityDurationJob interrogates a wrong duration interval | ||
---|---|---|---|
Product: | [Other] RHQ Project | Reporter: | Costel C <mulderika> |
Component: | Alerts | Assignee: | Jay Shaughnessy <jshaughn> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Foley <mfoley> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 4.10 | CC: | hrupp, jshaughn |
Target Milestone: | GA | ||
Target Release: | RHQ 4.12 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2014-12-15 11:36:41 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Costel C
2014-05-19 13:57:53 UTC
*** Bug 1099112 has been marked as a duplicate of this bug. ***

I agree with the analysis. Although we require clock sync between agents and servers, the sync is not expected to be perfect, just relatively close. So the typical agent, which is not co-located with the server, would likely have some clock delta relative to the server.

I also agree with the proposed solution, which is the same as what I came up with independently. Although I didn't initially consider the +1/-1 boundary changes on the search, I think that's also good. So, excellent analysis, Costel. Thanks!

Although I understand your thinking regarding the STAYS NOT UP semantics, I actually think it's working as expected. NOT UP is intended to represent *an avail type* in the set of {DOWN, UNKNOWN, DISABLED}, and the "STAYS" is intended to mean that that avail type does not change during the given interval. It's not the intent that avail changes between types in the set are covered in an umbrella fashion. Actually, a change from, say, DOWN to UNKNOWN should invalidate the first duration check (DOWN), so that it does not fire, and initiate a second duration check for the UNKNOWN change (I think). Does the current behavior actually affect you negatively, or is this an observation?

Working the fix now...

Applying changes to master, setting author to Costel since the implemented solution was his suggestion:

commit 2a1ec4b4c201367c62e6bd305251a4d2d1ef032a
Author: Costel Cosman <costelcsmn>
Date:   Wed May 21 13:45:27 2014 -0400

    Make sure the query for availability changes uses a duration adjusted for
    the agent time, not the server time. This is done by now storing the
    agent avail change startTime in the timer's jobInfo for the duration
    check job.

    notes
    - these duration check jobs do not survive server restarts, so we can
      assume the job's infoMaps will always have the new startTime set.
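
To illustrate why the search window matters, here is a minimal Java sketch. It is not the RHQ code: the class and method names (`AgentTimeWindowSketch`, `serverWindow`, `agentWindow`, `changesInside`) are hypothetical, the clock delta and timestamps are made-up numbers, and the exclusive bounds only approximate the +1/-1 boundary handling mentioned in the comments. It assumes availability changes are stamped with agent time, and compares a window anchored on the server clock with one anchored on the stored agent startTime.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not RHQ code: all names and numbers are invented for illustration.
final class AgentTimeWindowSketch {

    // Window anchored on the server clock at the moment the duration job fires.
    static long[] serverWindow(long serverNowMillis, long durationMillis) {
        return new long[] { serverNowMillis - durationMillis, serverNowMillis };
    }

    // Window anchored on the agent-reported start time of the triggering change.
    static long[] agentWindow(long agentStartTimeMillis, long durationMillis) {
        return new long[] { agentStartTimeMillis, agentStartTimeMillis + durationMillis };
    }

    // Counts changes strictly inside the window (assumed boundary handling),
    // so the triggering change at the window start is not counted.
    static long changesInside(List<Long> agentChangeTimes, long[] window) {
        return agentChangeTimes.stream()
                .filter(t -> t > window[0] && t < window[1])
                .count();
    }

    public static void main(String[] args) {
        long duration = 60_000L;      // condition: stays DOWN for 60 s
        long clockDelta = 30_000L;    // assumed: agent clock is 30 s behind the server
        long wentDown = 1_000_000L;   // agent time of the triggering DOWN
        long cameBackUp = 1_020_000L; // agent time of the recovery, 20 s later
        long jobFires = wentDown + clockDelta + duration; // server time when the job runs

        List<Long> changes = Arrays.asList(wentDown, cameBackUp);

        System.out.println("server-clock window: "
                + changesInside(changes, serverWindow(jobFires, duration)) + " change(s)");
        System.out.println("agent-time window:   "
                + changesInside(changes, agentWindow(wentDown, duration)) + " change(s)");
    }
}
```

With these assumed numbers the server-clock window finds no change, so a "stays DOWN" alert would fire even though the resource recovered after 20 seconds. The agent-time window sees the recovery, so the alert would be suppressed.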
Sorry, that was a local commit hash above; the master commit should be:

commit 46e40a32a4ea2101559d7398109564fff1fc3db1
Author: Costel Cosman <costelcsmn>
Date:   Wed May 21 13:45:27 2014 -0400

Hi Jay,

Sorry for my delayed answer. Regarding STAYS NOT UP, it was an observation, but it could have a negative effect in the future.

From my understanding:
STAYS DOWN means it remains in the DOWN state.
STAYS NOT UP means it remains in the NOT UP state.

Otherwise I don't see any reason to keep both STAYS DOWN and STAYS NOT UP. (You explained that an UNKNOWN invalidates a previous DOWN, which is exactly the behavior of STAYS DOWN.)

A use case: I want to be alerted when the resource remains DOWN or UNKNOWN for a defined duration interval. (If the agent is DOWN, the resource is reported as UNKNOWN.)

Regards,
Costel

Bulk close of items fixed in RHQ 4.12. If you think this is not solved, then please open a *new* BZ and link to this one.
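
The two readings of STAYS NOT UP discussed in the comments can be contrasted with a small Java sketch. It is only an illustration under stated assumptions: the `Avail` enum is a local stand-in rather than RHQ's own type, and `staysSameType` and `staysWithinNotUp` are hypothetical names for the "same avail type throughout" reading and the "umbrella" reading respectively.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not RHQ code: it contrasts two possible meanings of STAYS NOT UP.
final class StaysSemanticsSketch {

    // Local stand-in for the availability types named in the discussion.
    enum Avail { UP, DOWN, UNKNOWN, DISABLED }

    static final Set<Avail> NOT_UP = EnumSet.of(Avail.DOWN, Avail.UNKNOWN, Avail.DISABLED);

    // "Same type" reading: the first state must be a NOT UP type and
    // every later state in the interval must be that same type.
    static boolean staysSameType(List<Avail> statesDuringInterval) {
        if (statesDuringInterval.isEmpty()) {
            return false;
        }
        Avail first = statesDuringInterval.get(0);
        return NOT_UP.contains(first)
                && statesDuringInterval.stream().allMatch(s -> s == first);
    }

    // "Umbrella" reading: any mix of NOT UP types counts as staying NOT UP.
    static boolean staysWithinNotUp(List<Avail> statesDuringInterval) {
        return !statesDuringInterval.isEmpty() && NOT_UP.containsAll(statesDuringInterval);
    }

    public static void main(String[] args) {
        // The use case from the thread: DOWN followed by UNKNOWN within the interval.
        List<Avail> downThenUnknown = Arrays.asList(Avail.DOWN, Avail.UNKNOWN);
        System.out.println("same type: " + staysSameType(downThenUnknown));
        System.out.println("umbrella:  " + staysWithinNotUp(downThenUnknown));
    }
}
```

For a DOWN to UNKNOWN transition the first predicate is false and the second is true, which is the difference the use case above depends on.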