Bug 829962

Summary: platform "goes down" alert doesn't fire the first time
Product: [Other] RHQ Project
Reporter: John Mazzitelli <mazz>
Component: Alerts
Assignee: RHQ Project Maintainer <rhq-maint>
Status: CLOSED NOTABUG
QA Contact: Mike Foley <mfoley>
Severity: unspecified
Priority: unspecified
Version: 4.4
CC: hrupp
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2012-06-08 19:12:16 UTC
Type: Bug
Bug Blocks: 830299

Description John Mazzitelli 2012-06-07 21:45:19 UTC
1) Start a server and a new agent.
2) Import the new platform.
3) Create an alert on the platform resource - a "Goes Down" availability alert.
4) At the agent prompt, invoke "shutdown" (or just kill the agent).
5) Notice that no alert is fired. This is the bug.
6) Restart the agent (or type "start" if you are still at the agent prompt).
7) Repeat step 4 (shut down the agent).
8) Notice that an alert IS fired.

Why does the alert fire the second time, but not the first?

Comment 1 Mike Foley 2012-06-07 22:40:20 UTC
Documenting that this works in JON 3.1:

<mfoley_> trying this now
<mfoley_> ok ... it worked for me 1st time in JON 3.1
<mfoley_> but i can retest
<mfoley_> this is working for me in JON 3.1
<viet> it worked for me too first time in CR3

Comment 2 John Mazzitelli 2012-06-08 17:50:45 UTC
I am seeing this, but not 100% of the time. I just tried again, starting with a fresh DB and a newly imported platform. I start the server and, when it's up, I start the agent. I import the RHQ Agent and the platform. On the platform, I create a "Goes Down" alert. I shut down the agent. In the server logs, I see this:

13:45:34,901 INFO  [CoreServerServiceImpl] Agent [mazztower][4.5.0-SNAPSHOT(c96fb05)] would like to connect to this server
13:45:35,018 INFO  [CoreServerServiceImpl] Agent [mazztower] has connected to this server at Fri Jun 08 13:45:35 EDT 2012
13:45:52,170 INFO  [CoreServerServiceImpl] Got agent registration request for existing agent: mazztower[192.168.1.2:16163][4.5.0-SNAPSHOT(c96fb05)] - Will not regenerate a new token
13:46:30,143 INFO  [CacheConsistencyManagerBean] localhost took [49]ms to reload cache for 1 agents
13:46:41,767 INFO  [AgentManagerBean] Agent with name [mazztower] just went down
13:47:00,200 INFO  [CacheConsistencyManagerBean] localhost took [43]ms to reload global cache
13:47:00,258 INFO  [CacheConsistencyManagerBean] localhost took [43]ms to reload cache for 1 agents

I think it might have something to do with the reloading of the caches.

Comment 3 John Mazzitelli 2012-06-08 18:15:57 UTC
I just tried again with a clean DB and a new agent. This time, the alert fired, and the log differs: no alert cache reload appears after the agent went down.

14:11:54,500 INFO  [CoreServerServiceImpl] Got agent registration request for existing agent: mazztower[192.168.1.2:16163][4.5.0-SNAPSHOT(c96fb05)] - Will not regenerate a new token
14:12:38,094 INFO  [CacheConsistencyManagerBean] localhost took [51]ms to reload global cache
14:12:38,158 INFO  [CacheConsistencyManagerBean] localhost took [49]ms to reload cache for 1 agents
14:12:56,487 INFO  [AgentManagerBean] Agent with name [mazztower] just went down

In comment #2, when the alert didn't fire, the two alert caches were reloaded after the agent went down. Here, when the alert did fire, both reloads happened before the agent went down.
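The comparison between the two log excerpts can be checked mechanically. The following is a minimal sketch, assuming the log line layout shown in comments #2 and #3 (time, level, bracketed component, message) and that all lines fall on the same day. The helper names `parse` and `reloads_after_down` are hypothetical, and the sample data is abbreviated from the excerpts above.

```python
import re
from datetime import datetime

# Assumed layout: "HH:MM:SS,mmm LEVEL  [Component] message"
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}),(\d{3})\s+\w+\s+\[(\w+)\]\s+(.*)$")


def parse(line):
    """Split one server log line into (time, component, message), or None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    clock, millis, component, message = m.groups()
    ts = datetime.strptime(clock, "%H:%M:%S").replace(microsecond=int(millis) * 1000)
    return ts, component, message


def reloads_after_down(log_text):
    """Return the cache-reload messages logged after the agent went down."""
    down_at = None
    later_reloads = []
    for line in log_text.splitlines():
        entry = parse(line)
        if entry is None:
            continue
        ts, component, message = entry
        if component == "AgentManagerBean" and "just went down" in message:
            down_at = ts
        elif (component == "CacheConsistencyManagerBean"
              and down_at is not None and ts > down_at):
            later_reloads.append(message)
    return later_reloads


# Abbreviated from comment #2 (alert did NOT fire).
COMMENT_2 = """
13:46:30,143 INFO  [CacheConsistencyManagerBean] localhost took [49]ms to reload cache for 1 agents
13:46:41,767 INFO  [AgentManagerBean] Agent with name [mazztower] just went down
13:47:00,200 INFO  [CacheConsistencyManagerBean] localhost took [43]ms to reload global cache
13:47:00,258 INFO  [CacheConsistencyManagerBean] localhost took [43]ms to reload cache for 1 agents
"""

# Abbreviated from comment #3 (alert DID fire).
COMMENT_3 = """
14:12:38,094 INFO  [CacheConsistencyManagerBean] localhost took [51]ms to reload global cache
14:12:38,158 INFO  [CacheConsistencyManagerBean] localhost took [49]ms to reload cache for 1 agents
14:12:56,487 INFO  [AgentManagerBean] Agent with name [mazztower] just went down
"""

if __name__ == "__main__":
    print(len(reloads_after_down(COMMENT_2)), "reload(s) after down in comment #2")
    print(len(reloads_after_down(COMMENT_3)), "reload(s) after down in comment #3")
```

The script only reports ordering; it does not by itself establish why the alert was missed.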

Comment 4 John Mazzitelli 2012-06-08 19:12:16 UTC
This is to be expected. See the new FAQ I added so I don't forget this again 3 years from now :)

http://rhq-project.org/display/JOPR2/FAQ#FAQ-IcreatedanalertdefinitionandIknowimmediatelythereaftermyagentshouldhavereporteddatathatshouldhavetriggeredthealertbutmyalertdidnotfire.Wheredidmyalertgo%3F
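One reading consistent with the FAQ title and the logs above is that a newly created alert definition only takes effect once the server's alert cache has been reloaded. The following toy model illustrates that timing effect. It is a hypothetical sketch, not RHQ code: `ToyAlertServer` and all of its methods are invented for illustration, and the assumption that definitions become active only on reload is exactly that, an assumption.

```python
class ToyAlertServer:
    """Toy model (NOT RHQ code): definitions take effect only after a cache reload."""

    def __init__(self):
        self.saved_definitions = set()   # stands in for definitions stored in the DB
        self.cached_definitions = set()  # stands in for the in-memory alert cache
        self.fired = []                  # (definition name, agent name) pairs

    def create_alert_definition(self, name):
        # Assumption: saving a definition does not update the cache right away.
        self.saved_definitions.add(name)

    def reload_cache(self):
        # Assumption: a periodic job copies saved definitions into the cache.
        self.cached_definitions = set(self.saved_definitions)

    def agent_went_down(self, agent):
        # Only definitions already in the cache are evaluated.
        for name in sorted(self.cached_definitions):
            self.fired.append((name, agent))


if __name__ == "__main__":
    server = ToyAlertServer()
    server.create_alert_definition("Goes Down")

    server.agent_went_down("mazztower")   # cache not yet reloaded
    print("after first shutdown:", server.fired)

    server.reload_cache()                 # periodic reload happens later
    server.agent_went_down("mazztower")   # second shutdown
    print("after second shutdown:", server.fired)
```

In this model the first shutdown is missed because it arrives before the reload, while the second is caught, which matches the behavior described in steps 4 through 8.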