Bug 1003132

Summary: Recovery alerts based on availability may not fire if AlertConditionCache isn't reloaded before next availability report

Product: [JBoss] JBoss Operations Network
Component: Monitoring - Alerts
Version: JON 3.1.2
Target Milestone: ER01
Target Release: JON 3.2.0
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Larry O'Leary <loleary>
Assignee: Jay Shaughnessy <jshaughn>
QA Contact: Mike Foley <mfoley>
CC: ahovsepy, jshaughn, tsegismo
Doc Type: Bug Fix
Type: Bug
Clone Of:
: 1004545 (view as bug list)
Last Closed: 2014-01-02 20:35:59 UTC
Bug Blocks: 1004545    
Attachments:
    Agent log from reproducer
    JON server log from reproducer
    EAP server log from reproducer
    alertRecovery
    eap_down
    screenshot_15_51_00
    eap_up
    eap-avail

Description Larry O'Leary 2013-08-30 23:51:49 UTC
Description of problem:
If a recovery alert is based on an availability change, it may miss that change entirely: the availability change can be reported before the recovery alert definition has been evaluated or loaded into the AlertConditionCache.

This makes alert definitions and recovery very unpredictable.

Version-Release number of selected component (if applicable):
4.4.0.JON312GA

How reproducible:
8 out of 10 tries

Steps to Reproduce:
PLEASE NOTE -- Timing is everything here.

1.  Install and start JBoss EAP 5 default server.
2.  Enable JBoss ON agent debug logging (this is so we can do the restart based on debug messages):

        sed -i 's/^#RHQ_AGENT_DEBUG=.*$/RHQ_AGENT_DEBUG=true/' "${RHQ_AGENT_HOME}/bin/rhq-agent-env.sh"
        
3.  Install and start JBoss ON system.
4.  Import EAP default server into inventory.
5.  Create new _stays down for 1 minute_ alert definition for EAP default resource:

    *Name*: `Alert - Profile Down`
    *Condition*:
        *Fire alert when*:          _ANY_
        *Condition Type*:           _Availability Duration_
        *Availability Duration*:    _Stays Down_
        *Duration*:                 `1` _minutes_
    *Notification*:
        *Notification Sender*:      _Resource Operations_
        *Resource Selection Mode*:  _This Resource_
        *Operation*:                _Start_
    *Recovery*:
        *Disable When Fired*:   _Yes_

6.  Create new _recovery_ alert definition for EAP default resource:

    *Name*: `Recovery - Profile Down`
    *Condition*:
        *Fire alert when*:  _ANY_
        *Condition Type*:   _Availability Change_
        *Availability*:     _Goes up_
    *Recovery*:
        *Recovery Alert*:   _Alert - Profile Down_

7.  Perform a synchronized restart of the EAP server from outside of JBoss ON:

        # Record how many availability reports the agent has sent so far.
        availChangeCount=$(grep -cE '.*\[InventoryManager\.availability-.*\] \(rhq\.core\.pc\.inventory\.InventoryManager\)- Availability report content: AV:.*' "${RHQ_AGENT_HOME}/logs/agent.log")
        cd "${JBOSS_HOME}/bin"
        # Shut down the EAP server from outside of JBoss ON.
        ./shutdown.sh -S -s jnp://localhost:1099 -u admin -p admin
        {
            echo -n "Waiting for availability change."

            # Poll the agent log until a new availability report appears.
            while true; do
                newAvailChangeCount=$(grep -cE '.*\[InventoryManager\.availability-.*\] \(rhq\.core\.pc\.inventory\.InventoryManager\)- Availability report content: AV:.*' "${RHQ_AGENT_HOME}/logs/agent.log")
                [ ${availChangeCount} -eq ${newAvailChangeCount} ] && {
                    echo -n "."
                    sleep 1s
                } || {
                    break
                }
            done
            echo ""
        }
        # Keep the server down a little longer so the 1 minute duration is exceeded, then restart it.
        sleep 40s
        ./run.sh -c default -b 0.0.0.0 &

Actual results:
Alert - Profile Down is fired and disabled.
Recovery - Profile Down is never fired and Alert - Profile Down is never re-enabled even though availability went from DOWN to UP.

Expected results:
Alert - Profile Down is fired and disabled.
Recovery - Profile Down is fired shortly after and Alert - Profile Down is re-enabled.

Additional info:
This appears to be related to the availability change happening seconds before the AlertConditionCache is reloaded. I have reproduced this issue quite a few times and captured the relevant logs. Here is what I found in the logs that seemed relevant:

In EAP server log at 2013-08-30 16:21:35,647 you can see the shutdown was invoked.
In JON agent log at 2013-08-30 16:21:46,392 you can see an availability scan found EAP was DOWN.
This caused EAP to be started externally around 2013-08-30 16:21:46,413.
In JON server log at 2013-08-30 16:21:46,421 you can see it received the DOWN availability report for EAP.
In JON agent log at 2013-08-30 16:22:46,733 you can see another availability scan found EAP was UP (partially).
In EAP server log around 2013-08-30 16:22:47,373 EAP indicated it was fully up.
In JON server log at 2013-08-30 16:22:46,848 you can see the alert being triggered -- Alert - Profile Down[id=10001].
    This was because even though EAP is now UP, it had actually been down for over a minute (timing issue).
In JON server log at 2013-08-30 16:22:46,890 alert 10001 was disabled as per its recovery definition.
In JON server log at 2013-08-30 16:22:49,967 you can see it received the UP availability report for EAP.
In JON server log at 2013-08-30 16:22:50,731 you can see that the operation invoked by alert 10001 failed because EAP was already UP.

Comment 1 Larry O'Leary 2013-08-30 23:53:03 UTC
Created attachment 792327 [details]
Agent log from reproducer

Comment 2 Larry O'Leary 2013-08-30 23:53:45 UTC
Created attachment 792328 [details]
JON server log from reproducer

Comment 3 Larry O'Leary 2013-08-30 23:54:16 UTC
Created attachment 792329 [details]
EAP server log from reproducer

Comment 4 Jay Shaughnessy 2013-09-03 17:19:32 UTC
I'm fairly sure I know why this is happening, although I have not yet tested out the theory.  It has to do with the way we manage the alert condition caches, in this situation the global condition cache in particular.

In the description the two numbers that caught my attention were:

* In JON server log at 2013-08-30 16:22:46,890 alert 10001 was disabled as per its recovery definition.
* In JON server log at 2013-08-30 16:22:49,967 you can see it received the UP availability report for EAP.

Note that there is a 3+ second gap in those times.  That indicates there was at least a 3 second window between the alert being "declared" disabled and the UP availability being processed.  So, if it takes more than 3 seconds to "actually" disable the alert definition, then the recovery logic is basically inert.

And in fact, it seems that it can take up to 30 seconds for the alert definition to actually be disabled.  This is because it is not actually disabled in the global alert condition cache until the "dirty cache" job runs, and this happens at 30s intervals.

We use this approach to handle HA environments, where all servers must keep their global condition cache in sync, and availability is a global condition (for reasons beyond this BZ).  So each server runs a job to see whether it needs to update its global condition cache.  That is fine in general, because a 0-30s interval is typically OK when dealing with agent failover scenarios.

But for a single server installation, or simply acknowledging that agents don't actually fail over to a different server very often, it's a potentially long gap for a recovery alert definition to wait for activation.

I would propose that we update the global condition cache on the relevant server immediately, and let other HA nodes poll for updates.  I'll make this change and test my theory.
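To make the timing concrete, here is a minimal, self-contained sketch of that scheme: each server polls a dirty marker on a recurring 30s job, and the proposed change additionally reloads the cache immediately on the server performing the recovery action. All class and method names are hypothetical illustrations, not the actual RHQ code.

    // Illustrative sketch only: class and method names are hypothetical, not the RHQ API.
    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;

    public class GlobalConditionCacheSketch {

        // All HA nodes; in the real system the per-server "cache is stale" state
        // is tracked in the database rather than in memory.
        private static final List<GlobalConditionCacheSketch> SERVERS = new CopyOnWriteArrayList<>();

        private final String name;
        private final AtomicBoolean cacheDirty = new AtomicBoolean(false);

        public GlobalConditionCacheSketch(String name) {
            this.name = name;
            SERVERS.add(this);
        }

        // Each server runs a recurring job (30s in the real system) that reloads
        // its global condition cache if it has been marked stale.
        public void startDirtyCacheJob(ScheduledExecutorService scheduler) {
            scheduler.scheduleAtFixedRate(() -> {
                if (cacheDirty.compareAndSet(true, false)) {
                    reloadGlobalCache("periodic dirty-cache job");
                }
            }, 30, 30, TimeUnit.SECONDS);
        }

        // A recovery action (disable the recovered def, enable the recovery def)
        // marks every server's cache stale. The proposed change: the server that
        // performs the action also reloads its own cache immediately instead of
        // waiting up to 30s; the other HA nodes still catch up via their job.
        public void processRecoveryAction() {
            SERVERS.forEach(s -> s.cacheDirty.set(true));
            if (cacheDirty.compareAndSet(true, false)) {
                reloadGlobalCache("immediate reload after recovery action");
            }
        }

        private void reloadGlobalCache(String reason) {
            System.out.println(name + ": global condition cache reloaded (" + reason + ")");
        }

        public static void main(String[] args) throws InterruptedException {
            ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
            GlobalConditionCacheSketch s1 = new GlobalConditionCacheSketch("S1");
            GlobalConditionCacheSketch s2 = new GlobalConditionCacheSketch("S2");
            s1.startDirtyCacheJob(scheduler);
            s2.startDirtyCacheJob(scheduler);

            // S1 handles the recovery action: it reloads right away, while S2
            // reloads on its next poll, anywhere from 0 to 30 seconds later.
            s1.processRecoveryAction();

            Thread.sleep(35_000);
            scheduler.shutdownNow();
        }
    }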

As a side note, they are doing themselves no favors with a 1 minute avail duration condition and a 1 minute availability check interval.  A longer avail duration period would have avoided the initial alert firing in the first place.  As would a shorter avail check period assuming the avail checking is fast on the agent.

Comment 5 Larry O'Leary 2013-09-03 20:31:53 UTC
> As a side note, they are doing themselves no favors with a 1 minute avail
> duration condition and a 1 minute availability check interval.  A longer
> avail duration period would have avoided the initial alert firing in the
> first place.  As would a shorter avail check period assuming the avail
> checking is fast on the agent.

The 1 minute duration is for the purpose of reproducing this. In the actual use-case, it is a 10 minute duration.

Comment 6 Thomas Segismont 2013-09-05 12:21:13 UTC
Jay,

Just an idea, how about reloading the availability cache conditions when an inventory report is merged (just for resources reporting an availability change)?

This could also be done with Metric/Trait conditions when a Measurement report is merged.

And so on.
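A rough sketch of that suggestion, with hypothetical types and method names (not the RHQ API): when a report is merged, collect the resources whose availability changed and reload only their availability-based conditions.

    // Hypothetical sketch of the suggestion above; not RHQ code.
    import java.util.List;

    public class MergeReportSketch {

        record ResourceAvailability(int resourceId, boolean availabilityChanged) {}

        // On merge, reload availability conditions only for resources whose
        // availability actually changed, rather than waiting for the periodic
        // full cache reload.
        static void mergeAvailabilityReport(List<ResourceAvailability> report) {
            List<Integer> changed = report.stream()
                    .filter(ResourceAvailability::availabilityChanged)
                    .map(ResourceAvailability::resourceId)
                    .toList();
            if (!changed.isEmpty()) {
                reloadAvailabilityConditions(changed);
            }
        }

        static void reloadAvailabilityConditions(List<Integer> resourceIds) {
            System.out.println("Reloading availability conditions for resources " + resourceIds);
        }

        public static void main(String[] args) {
            mergeAvailabilityReport(List.of(
                    new ResourceAvailability(10001, true),
                    new ResourceAvailability(10002, false)));
        }
    }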

Comment 7 Jay Shaughnessy 2013-09-05 15:51:53 UTC
jshaughn@JSHAUGHN-THINK /c/work/rhq (master)

commit 69bfdc8c7f4ef28e58cbd0bdc92018dd2efc1afa
Author: Jay Shaughnessy <jshaughn>
Date:   Thu Sep 5 11:43:28 2013 -0400

The issue here is the lag time between processing recovery actions (like disable
of the to-be-recovered alert def, enable of the recovery alert def, etc.) and
the alerting cache reloads, which can be up to 30s.  The 30s [max] time
is reasonable for other HA nodes, but for the server serving the agent, and
firing the alert that triggers the recovery handling, it can be too long
because the recovery action (if automated, like a restart) may happen
before the cache reload, effectively missing the recovery alert.

Note that the current architecture, which executes the alert notifications
prior to committing the alert (this should likely change), means that the
cache reload must still take place after in-band notification processing.
So, even though this fix will execute the reload as soon as possible, it is
feasible that a small window of vulnerability will still exist if automated
recovery actions are performed very quickly.
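A bare-bones sketch of the ordering described above (hypothetical method names, not the actual RHQ code path): the reload can be moved to immediately after in-band notification processing, but no earlier, which is where the remaining window comes from.

    // Illustrative ordering sketch only; method names are hypothetical.
    public class AlertFiringOrderSketch {

        static void fireAlert() {
            processInBandNotifications();   // 1. may trigger an automated recovery action (e.g. a restart)
            commitAlert();                  // 2. the alert and the recovery-related def changes are committed
            reloadConditionCacheNow();      // 3. the fix: reload the local condition cache right after
            // Remaining window: if the recovery action started in step 1 brings the resource
            // back UP and that availability is reported before step 3 completes, the recovery
            // alert can still be missed.
        }

        static void processInBandNotifications() { System.out.println("notifications sent"); }
        static void commitAlert()                { System.out.println("alert committed"); }
        static void reloadConditionCacheNow()    { System.out.println("condition cache reloaded"); }

        public static void main(String[] args) {
            fireAlert();
        }
    }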

Comment 8 Larry O'Leary 2013-09-13 15:44:27 UTC
The steps used to test this against 3.2 had to be updated a bit due to the auto-availability refresh that occurs when you are sitting on the resource page. Specifically, the synchronized restart loop used in the original steps needs to be changed to this:

        # Count of availability reports the agent has sent so far.
        availReportCount=$(grep -cE '.*\[InventoryManager\.availability-.*\] \(rhq\.core\.pc\.inventory\.InventoryManager\)- Availability report content: AV:.*' "${RHQ_AGENT_HOME}/logs/agent.log")
        # Count of full availability scans in which all 45 resources changed availability (see note below).
        availChangeCount=$(grep -cE '.*\[InventoryManager\.availability-.*\] \(rhq\.core\.pc\.inventory\.AvailabilityExecutor\)- Scan Ended   : .* \[startTime=[0-9]+, endTime=[0-9]+, runtime=[0-9]+, isFull=true, isForced=false, numResources=45, numGetAvailabilityCalls=[0-9]+, numScheduledRandomly=0, numPushedByInterval=[0-9]+, numAvailabilityChanges=45, numDeferToParent=44\]$' "${RHQ_AGENT_HOME}/logs/agent.log")
        cd "${JBOSS_HOME}/bin"
        # Shut down the EAP server from outside of JBoss ON.
        ./shutdown.sh -S -s jnp://localhost:1099 -u admin -p admin
        {
            echo -n "Waiting for availability change."

            # Poll the agent log until a new availability report or scan entry appears.
            while true; do
                newAvailReportCount=$(grep -cE '.*\[InventoryManager\.availability-.*\] \(rhq\.core\.pc\.inventory\.InventoryManager\)- Availability report content: AV:.*' "${RHQ_AGENT_HOME}/logs/agent.log")
                newAvailChangeCount=$(grep -cE '.*\[InventoryManager\.availability-.*\] \(rhq\.core\.pc\.inventory\.AvailabilityExecutor\)- Scan Ended   : .* \[startTime=[0-9]+, endTime=[0-9]+, runtime=[0-9]+, isFull=true, isForced=false, numResources=45, numGetAvailabilityCalls=[0-9]+, numScheduledRandomly=0, numPushedByInterval=[0-9]+, numAvailabilityChanges=45, numDeferToParent=44\]$' "${RHQ_AGENT_HOME}/logs/agent.log")
                [ ${availReportCount} -eq ${newAvailReportCount} -a ${availChangeCount} -eq ${newAvailChangeCount} ] && {
                    echo -n "."
                    sleep 1s
                } || {
                    break
                }
            done
            echo ""
        }
        # Keep the server down a little longer, then restart it externally.
        sleep 40s
        ./run.sh -c default -b 0.0.0.0 &


This assumes that the test environment only has a single EAP 5.2.0 server running the all profile. This yields 45 resources in the availability scan with 44 being deferred to parent when the availability of the EAP server goes up/down.

Comment 9 Armine Hovsepyan 2013-10-22 15:26:30 UTC
Created attachment 815062 [details]
alertRecovery

Comment 10 Armine Hovsepyan 2013-10-23 13:56:31 UTC
Created attachment 815421 [details]
eap_down

Comment 11 Armine Hovsepyan 2013-10-23 13:57:18 UTC
Created attachment 815422 [details]
screenshot_15_51_00

Comment 12 Armine Hovsepyan 2013-10-23 13:58:36 UTC
Created attachment 815423 [details]
eap_up

Comment 13 Armine Hovsepyan 2013-10-23 13:59:14 UTC
Created attachment 815424 [details]
eap-avail

Comment 14 Armine Hovsepyan 2013-10-23 14:03:01 UTC
Verified

EAP resource was down at 15:51:07 (eap_down screenshot; the hour on the server is 09, it is 15 locally)
EAP resource availability graph shows EAP was up at 15:51:09 (eap-avail screenshot)
Recovery alert fired at 15:51:12 (eap_up screenshot)

Checked with both the default and the "all" EAP profiles; the recovery alert is fired in <5 secs after the resource is available again.

Comment 15 Jay Shaughnessy 2013-11-08 20:27:06 UTC
Follow-Up note:

This is still more of an issue in an HA environment than previously stated. 

Here's the thing: consider an HA env with servers S1 and S2.  Agent A monitors resource R and is talking to S1.  R has an avail duration alert def for StaysDown5Minutes.  R goes down.  Agent A reports R down to S1, and S1 schedules a quartz job to fire in 5 minutes.  5 minutes later the quartz job fires and R is still down.  So the duration alert fires, the global cache on the server handling the quartz job is immediately updated, and the other server is updated within 30 seconds.

But which server handled the quartz job?  For some reason the assumption has been that it would be S1.  When this was originally designed it didn't matter which server processed the quartz job doing the avail duration check, because cache reloads were always handled by the 30s recurring jobs executing on each server.  But the work for this issue, where we update the global cache immediately, I think made a bad assumption that the update would happen on S1 (in the case above), the same server that scheduled the job.

That is false.  It could be S1 or S2 as Quartz gives no guarantee which clustered scheduler will pick up the job.  So, even with no failover, the server servicing the relevant agent may not get the fast cache reload.
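For reference, the avail duration check is a one-shot Quartz job, and with a clustered Quartz job store any node may run it. Below is a minimal sketch of that scheduling pattern; the classes are hypothetical stand-ins rather than the actual RHQ job, though the Quartz calls themselves are standard:

    // Minimal Quartz sketch; the classes here are hypothetical, not the RHQ job. With a
    // clustered JDBC job store (org.quartz.jobStore.isClustered=true) any server in the
    // cluster may execute the trigger, not necessarily the one that scheduled it.
    import static org.quartz.DateBuilder.futureDate;
    import static org.quartz.JobBuilder.newJob;
    import static org.quartz.TriggerBuilder.newTrigger;

    import org.quartz.DateBuilder.IntervalUnit;
    import org.quartz.Job;
    import org.quartz.JobDetail;
    import org.quartz.JobExecutionContext;
    import org.quartz.Scheduler;
    import org.quartz.SchedulerException;
    import org.quartz.Trigger;
    import org.quartz.impl.StdSchedulerFactory;

    public class AvailabilityDurationScheduling {

        // Hypothetical job standing in for the avail duration check; whichever cluster
        // node Quartz hands this to is the node that does the check and, with the
        // current fix, the only node that gets the immediate cache reload.
        public static class AvailabilityDurationCheckJob implements Job {
            @Override
            public void execute(JobExecutionContext context) {
                int resourceId = context.getMergedJobDataMap().getInt("resourceId");
                System.out.println("Re-checking availability duration for resource " + resourceId);
            }
        }

        public static void scheduleDurationCheck(int resourceId) throws SchedulerException {
            Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

            JobDetail job = newJob(AvailabilityDurationCheckJob.class)
                    .withIdentity("availDurationCheck-" + resourceId, "alerts")
                    .usingJobData("resourceId", resourceId)
                    .build();

            // One-shot trigger: fire once, 5 minutes from now (the StaysDown5Minutes window).
            Trigger trigger = newTrigger()
                    .withIdentity("availDurationTrigger-" + resourceId, "alerts")
                    .startAt(futureDate(5, IntervalUnit.MINUTE))
                    .build();

            scheduler.scheduleJob(job, trigger);
        }
    }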

Looking at more ideas, probably for a new BZ...

Comment 16 Jay Shaughnessy 2013-11-12 18:45:01 UTC
See Bug 1029553 for the scenario in comment 15.