Bug 1003132
Summary: | Recovery alerts based on availability may not fire if AlertConditionCache isn't reloaded before next availability report
---|---
Product: | [JBoss] JBoss Operations Network
Reporter: | Larry O'Leary <loleary>
Component: | Monitoring - Alerts
Assignee: | Jay Shaughnessy <jshaughn>
Status: | CLOSED CURRENTRELEASE
QA Contact: | Mike Foley <mfoley>
Severity: | urgent
Priority: | unspecified
Version: | JON 3.1.2
CC: | ahovsepy, jshaughn, tsegismo
Target Milestone: | ER01
Target Release: | JON 3.2.0
Hardware: | Unspecified
OS: | Unspecified
Doc Type: | Bug Fix
Bug Blocks: | 1004545 (view as bug list)
Type: | Bug
Last Closed: | 2014-01-02 20:35:59 UTC
Description
Larry O'Leary
2013-08-30 23:51:49 UTC
Created attachment 792327 [details]
Agent log from reproducer
Created attachment 792328 [details]
JON server log from reproducer
Created attachment 792329 [details]
EAP server log from reproducer
I'm fairly sure I know why this is happening, although I have not yet tested the theory. It has to do with the way we manage alert condition caches, in this situation the global condition cache in particular. In the description, the two numbers that caught my attention were:

* In the JON server log at 2013-08-30 16:22:46,890, alert 10001 was disabled as per its recovery definition.
* In the JON server log at 2013-08-30 16:22:49,967, you can see it received the UP availability report for EAP.

Note that there is a 3+ second gap between those times. That indicates there was at least a 3 second period between the alert being "declared" disabled and the UP availability being processed. So, if it takes more than 3 seconds to "actually" disable the alert definition, the recovery logic is basically inert. And in fact, it seems that it can take up to 30 seconds for the alert definition to actually be disabled. This is because it is not actually disabled in the global alert condition cache until the "dirty cache" job runs, and that job runs at 30 second intervals.

We use this approach to handle HA environments, where all servers must keep their global condition caches in sync, and availability is a global condition (for reasons beyond this BZ). So servers have a job to check whether they need to update their global condition cache. That is fine in general, because a 0-30s delay is typically OK when dealing with agent failover scenarios. But for a single-server installation, or simply acknowledging that agents don't actually fail over to a different server very often, it is a potentially long gap for a recovery alert definition to wait for activation.

I would propose that we update the global condition cache on the relevant server immediately, and let other HA nodes poll for updates. I'll make this change and test my theory.

As a side note, they are doing themselves no favors with a 1 minute avail duration condition and a 1 minute availability check interval. A longer avail duration period would have avoided the initial alert firing in the first place. As would a shorter avail check period, assuming the avail checking is fast on the agent.

> As a side note, they are doing themselves no favors with a 1 minute avail
> duration condition and a 1 minute availability check interval. A longer
> avail duration period would have avoided the initial alert firing in the
> first place. As would a shorter avail check period assuming the avail
> checking is fast on the agent.
The 1 minute duration is for the purpose of reproducing this. In the actual use case, it is a 10 minute duration.
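To make the proposal above concrete, here is a minimal, hedged sketch of the intended behavior. None of these class or method names come from the actual RHQ code base, and real HA state would live in the shared database rather than an in-memory map; this is only an illustration of "reload the local cache immediately, let the other nodes catch up on the 30 second poll":

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch, not the actual RHQ classes: the server that
 * processed the recovery action reloads its own global condition cache
 * immediately, while the other HA nodes keep discovering the change
 * through the existing 30-second dirty-cache poll.
 */
public class GlobalCacheReloadSketch {

    // server name -> timestamp of that server's last cache reload
    // (in a real HA deployment this would be shared state in the DB)
    private final Map<String, Long> lastReload = new ConcurrentHashMap<>();

    // timestamp of the most recent change that invalidated the cache
    private volatile long cacheDirtiedAt = 0L;

    /** Called on the server that just processed the recovery action. */
    public void onRecoveryActionProcessed(String localServer) {
        cacheDirtiedAt = System.currentTimeMillis();
        reload(localServer); // immediate local reload, no 0-30s wait
    }

    /** Every HA node runs this recurring 30-second dirty-cache check. */
    public void startDirtyCachePoller(String server) {
        ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();
        poller.scheduleAtFixedRate(() -> {
            long reloadedAt = lastReload.getOrDefault(server, 0L);
            if (cacheDirtiedAt > reloadedAt) {
                reload(server); // lagging node catches up within 30s
            }
        }, 30, 30, TimeUnit.SECONDS);
    }

    private void reload(String server) {
        lastReload.put(server, System.currentTimeMillis());
        // Placeholder for the actual global condition cache reload.
    }
}
```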
Jay, just an idea: how about reloading the availability cache conditions when an inventory report is merged (just for resources reporting an availability change)? This could also be done with Metric/Trait conditions when a Measurement report is merged. And so on.

jshaughn@JSHAUGHN-THINK /c/work/rhq (master)

commit 69bfdc8c7f4ef28e58cbd0bdc92018dd2efc1afa
Author: Jay Shaughnessy <jshaughn>
Date: Thu Sep 5 11:43:28 2013 -0400

The issue here is the lag time between processing recovery actions (like disabling the to-be-recovered alert def, enabling the recovery alert def, etc.) and the alerting cache reloads, which can be up to 30s. The 30s [max] time is reasonable for other HA nodes, but for the server serving the agent, and firing the alert that triggers the recovery handling, it can be too long, because the recovery action (if automated, like a restart) may happen before the cache reload, effectively missing the recovery alert.

Note that the current architecture, which executes the alert notifications prior to committing the alert (this should likely change), means that the cache reload must still take place after in-band notification processing. So, even though this fix will execute the reload as soon as possible, it is feasible that a small window of vulnerability will still exist if automated recovery actions are performed very quickly.

The steps used to test this against 3.2 had to be updated a bit due to the auto-availability refresh that occurs when you are sitting on the resource page. Specifically, the synchronized restart loop used in the original steps needs to be changed to this:

```sh
availReportCount=$(grep -cE '.*\[InventoryManager\.availability-.*\] \(rhq\.core\.pc\.inventory\.InventoryManager\)- Availability report content: AV:.*' "${RHQ_AGENT_HOME}/logs/agent.log")
availChangeCount=$(grep -cE '.*\[InventoryManager\.availability-.*\] \(rhq\.core\.pc\.inventory\.AvailabilityExecutor\)- Scan Ended : .* \[startTime=[0-9]+, endTime=[0-9]+, runtime=[0-9]+, isFull=true, isForced=false, numResources=45, numGetAvailabilityCalls=[0-9]+, numScheduledRandomly=0, numPushedByInterval=[0-9]+, numAvailabilityChanges=45, numDeferToParent=44\]$' "${RHQ_AGENT_HOME}/logs/agent.log")

cd "${JBOSS_HOME}/bin"
./shutdown.sh -S -s jnp://localhost:1099 -u admin -p admin

{
    echo -n "Waiting for availability change."
    while [ true ]; do
        newAvailReportCount=$(grep -cE '.*\[InventoryManager\.availability-.*\] \(rhq\.core\.pc\.inventory\.InventoryManager\)- Availability report content: AV:.*' "${RHQ_AGENT_HOME}/logs/agent.log")
        newAvailChangeCount=$(grep -cE '.*\[InventoryManager\.availability-.*\] \(rhq\.core\.pc\.inventory\.AvailabilityExecutor\)- Scan Ended : .* \[startTime=[0-9]+, endTime=[0-9]+, runtime=[0-9]+, isFull=true, isForced=false, numResources=45, numGetAvailabilityCalls=[0-9]+, numScheduledRandomly=0, numPushedByInterval=[0-9]+, numAvailabilityChanges=45, numDeferToParent=44\]$' "${RHQ_AGENT_HOME}/logs/agent.log")
        [ ${availReportCount} -eq ${newAvailReportCount} -a ${availChangeCount} -eq ${newAvailChangeCount} ] && {
            echo -n "."
            sleep 1s
        } || {
            break
        }
    done
    echo ""
}

sleep 40s
./run.sh -c default -b 0.0.0.0 &
```

This assumes that the test environment only has a single EAP 5.2.0 server running the all profile. This yields 45 resources in the availability scan, with 44 being deferred to parent when the availability of the EAP server goes up/down.

Created attachment 815062 [details]
alertRecovery
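The comment above suggests reloading availability conditions when an inventory report is merged. As a rough illustration of that event-driven approach, here is a hedged sketch; every name in it (ReportMergeSketch, AvailabilityEntry, reloadAvailabilityConditions) is hypothetical and not the actual RHQ API:

```java
import java.util.List;

/**
 * Hypothetical sketch of the reload-on-merge idea, not actual RHQ code:
 * reload cached availability conditions for exactly the resources whose
 * availability changed in the report being merged.
 */
public class ReportMergeSketch {

    /** Minimal stand-in for one entry of an availability report. */
    record AvailabilityEntry(int resourceId, String state, boolean changed) {}

    /** Hypothetical cache interface, named after the class in the summary. */
    interface AlertConditionCache {
        void reloadAvailabilityConditions(int resourceId);
    }

    private final AlertConditionCache cache;

    public ReportMergeSketch(AlertConditionCache cache) {
        this.cache = cache;
    }

    /** Called when an agent's availability report is merged. */
    public void mergeAvailabilityReport(List<AvailabilityEntry> report) {
        for (AvailabilityEntry entry : report) {
            persistAvailability(entry);
            if (entry.changed()) {
                // Event-driven reload: no 0-30s wait for the dirty-cache job.
                cache.reloadAvailabilityConditions(entry.resourceId());
            }
        }
    }

    private void persistAvailability(AvailabilityEntry entry) {
        // Placeholder for storing the availability record.
    }
}
```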
Created attachment 815421 [details]
eap_down
Created attachment 815422 [details]
screenshot_15_51_00
Created attachment 815423 [details]
eap_up
Created attachment 815424 [details]
eap-avail
Verified:

* EAP resource was down at 15:51:07 (eap_down screenshot; the hour on the server is 09, it's 15 local).
* The EAP resource availability graph shows EAP was up at 15:51:09 (eap_avail screenshot).
* The recovery alert fired at 15:51:12 (eap_up screenshot).

Checked for both the default and all EAP profiles; the recovery alert is called/fired in <5 secs after the resource is available again.

Follow-up note: This is still more of an issue in an HA environment than previously stated. Here's the thing: consider an HA env with servers S1 and S2. Agent A monitors resource R and is talking to S1. R has an avail duration alert def for StaysDown5Minutes. R goes down. Agent A reports R down to S1, and S1 schedules a Quartz job to fire in 5 minutes. Five minutes later the Quartz job fires and R is still down, so the duration alert fires, the global cache on the server handling the Quartz job is immediately updated, and the other server is updated within 30 seconds.

But which server handled the Quartz job? For some reason the assumption has been that it would be S1. When this was originally designed, it didn't matter which server processed the Quartz job doing the avail duration check, because cache reloads were always handled by the 30s recurring jobs executing on each server. But the work for this issue, where we updated the global cache immediately, I think made a bad assumption that it would be updating, in the case above, S1, the same server that scheduled the job. That is false. It could be S1 or S2, as Quartz gives no guarantee which clustered scheduler will pick up the job. So, even with no failover, the server servicing the relevant agent may not get the fast cache reload. Looking at more ideas, probably for a new BZ...

See Bug 1029553 for the scenario in comment 15.
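To make the scheduling ambiguity concrete, here is a hedged sketch of the avail-duration check as a Quartz job. The job class and helper methods are hypothetical, not the actual RHQ implementation; only the Job interface and JobExecutionContext come from the real Quartz API. With a clustered Quartz job store, whichever node acquires the trigger executes the job, so reloading only the executing node's cache does not guarantee the agent-serving server is updated:

```java
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

/**
 * Hypothetical sketch of the avail-duration check job. With a clustered
 * Quartz job store (org.quartz.jobStore.isClustered=true), any node in
 * the cluster may pick this job up; there is no guarantee it is the node
 * that scheduled it, or the node serving the agent.
 */
public class AvailDurationCheckJobSketch implements Job {

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        int resourceId = context.getMergedJobDataMap().getInt("resourceId");

        if (isStillDown(resourceId)) {
            fireDurationAlertAndProcessRecovery(resourceId);

            // The current fix reloads the cache on whichever node Quartz
            // handed the job to (maybe S1, maybe S2)...
            reloadLocalConditionCache();

            // ...so one follow-up idea would be to push the reload to all
            // nodes, or at least to the node currently serving the agent
            // for this resource (hypothetical helper).
            requestReloadOnAllNodes(resourceId);
        }
    }

    // Placeholder helpers, assumptions for this sketch only.
    private boolean isStillDown(int resourceId) { return true; }
    private void fireDurationAlertAndProcessRecovery(int resourceId) { }
    private void reloadLocalConditionCache() { }
    private void requestReloadOnAllNodes(int resourceId) { }
}
```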