Bug 1029553
Summary: | Recovery alerts involving availability may not fire in HA environment | |||
---|---|---|---|---|
Product: | [Other] RHQ Project | Reporter: | Jay Shaughnessy <jshaughn> | |
Component: | Alerts, Core Server | Assignee: | Jay Shaughnessy <jshaughn> | |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Foley <mfoley> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 4.9 | CC: | hrupp | |
Target Milestone: | GA | |||
Target Release: | RHQ 4.10 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1030108 | Environment: | |
Last Closed: | 2014-04-23 12:30:09 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1030108 |
Description
Jay Shaughnessy
2013-11-12 15:24:27 UTC
I think the only solution here is to convert the avail duration checking from a clustered quartz trigger to an EJB Timer. The free version of Quartz does not provide the ability to designate a specific server for executing a trigger in a clustered scheduler (this is the TC Where feature that exists in the paid version). So, the only obvious way to ensure that the avail duration check is executed on the scheduling server, S1 in the example above, is an EJB Timer. The assumption here is that the relevant agent will still be talking to S1, and therefore that is the server needing the immediate global cache reload.

There are two weaknesses. First, in a failover situation we will still be subject to the <= 30s cache refresh window. But this is no different than before, and we've done things to make this a very unlikely occurrence, including the standard delayed failover and now deferred notification handling (which means faster setting of the dirty cache flag). Second, if S1 goes down we'll lose the avail duration alert completely because the EJB Timer will be lost. This is unfortunate but unlikely, and I think in general some loss of monitoring when losing a monitoring server may be expected.

The benefit of proper alerting in this scenario, nearly all the time, I think outweighs the downside.

master commit 14dca980d0b5794b5c13aebe5add5b61fd068a06
Author: Jay Shaughnessy <jshaughn>
Date: Tue Nov 12 13:33:32 2013 -0500

    Convert avail duration condition checking from quartz job to EJB Timer to
    ensure that the job executes on the HA server initiating the job. This
    server presumably still serves the relevant agent and therefore is the
    server which should have its global cache updated immediately upon firing
    of the alert, in order to quickly get recovery alert conditions into the
    global cache. The only downside is that loss of the server (in an HA env)
    will mean loss of the avail duration check, and potential alert.
    In a non-HA env the issue is moot because there would be nowhere to
    execute the quartz job either.

Bulk closing of 4.10 issues. If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.
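For anyone trying to picture the trade-off described in the comment above, here is a minimal plain-Java sketch. It is not the RHQ code and does not use the EJB Timer API. It uses `ScheduledExecutorService` as a stand-in for a server-local, in-memory timer, and `ServerNode`, `scheduleAvailDurationCheck`, `crash` and `drain` are hypothetical names made up for the illustration. It shows only the two properties discussed: the check runs on the node that scheduled it, and it is lost if that node goes away first.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustration only: these names are hypothetical, not RHQ or EJB APIs.
final class LocalAvailDurationTimer {

    /** Stand-in for one server in an HA cloud, owning its own in-memory timer. */
    static final class ServerNode {
        final String name;
        // Resources whose avail duration check actually ran on this node.
        final List<String> executedChecks = new CopyOnWriteArrayList<>();
        private final ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();

        ServerNode(String name) {
            this.name = name;
        }

        /** Schedule the check on THIS node; no other node can pick it up. */
        void scheduleAvailDurationCheck(String resource, long delayMillis) {
            timer.schedule(() -> { executedChecks.add(resource); },
                    delayMillis, TimeUnit.MILLISECONDS);
        }

        /** Simulate losing the server: pending timers are discarded, never run. */
        void crash() {
            timer.shutdownNow();
        }

        /** Orderly stop: let already-scheduled checks fire, then wait for them. */
        boolean drain(long timeoutMillis) {
            timer.shutdown();
            try {
                return timer.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        ServerNode s1 = new ServerNode("S1");
        ServerNode s2 = new ServerNode("S2");

        // The check is scheduled on S1, so only S1 can execute it.
        s1.scheduleAvailDurationCheck("resource-1", 50);
        s1.drain(2000);
        s2.drain(2000);
        System.out.println(s1.name + " ran " + s1.executedChecks);
        System.out.println(s2.name + " ran " + s2.executedChecks);

        // If the scheduling server is lost before the timer fires, so is the check.
        ServerNode s3 = new ServerNode("S3");
        s3.scheduleAvailDurationCheck("resource-2", 60_000);
        s3.crash();
        System.out.println(s3.name + " ran " + s3.executedChecks);
    }
}
```

A clustered scheduler behaves the opposite way on both counts: any node may run the job, and the job survives the loss of one node. That is why the quartz trigger could fire on a server other than the one the agent was talking to.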