Bug 829962

Summary:	platform "goes down" alert doesn't fire the first time
Product:	[Other] RHQ Project	Reporter:	John Mazzitelli <mazz>
Component:	Alerts	Assignee:	RHQ Project Maintainer <rhq-maint>
Status:	CLOSED NOTABUG	QA Contact:	Mike Foley <mfoley>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	4.4	CC:	hrupp
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	830299		Environment:
Last Closed:	2012-06-08 19:12:16 UTC	Type:	Bug
Bug Depends On:
Bug Blocks:	830299

Description John Mazzitelli 2012-06-07 21:45:19 UTC

1) start a server and a new agent
2) import the new platform
3) Create an alert on the platform resource - a "Goes Down" availability alert.
4) in the agent prompt, invoke "shutdown" (or just kill the agent)
5) notice no alert is fired - this is the bug
6) restart the agent (or type "start" if you are still at the agent prompt)
7) repeat step 4 (shut down the agent)
8) notice that an alert IS fired.

Why does the alert fire the second time, but not the first?

Comment 1 Mike Foley 2012-06-07 22:40:20 UTC

Documenting that this works OK in JON 3.1:

<mfoley_> trying this now
<mfoley_> ok ... it worked for me 1st time in JON 3.1
<mfoley_> but i can retest
<mfoley_> this is working for me in JON 3.1
<viet> it worked for me too first time in CR3

Comment 2 John Mazzitelli 2012-06-08 17:50:45 UTC

I am seeing this, but not 100% of the time. I just tried again, starting with a fresh DB and a newly imported platform. I start the server; when it's up, I start the agent. I import the RHQ Agent and the platform. On the platform, I create a Going Down alert. I shut down the agent. In the server logs, I see this:

13:45:34,901 INFO  [CoreServerServiceImpl] Agent [mazztower][4.5.0-SNAPSHOT(c96fb05)] would like to connect to this server
13:45:35,018 INFO  [CoreServerServiceImpl] Agent [mazztower] has connected to this server at Fri Jun 08 13:45:35 EDT 2012
13:45:52,170 INFO  [CoreServerServiceImpl] Got agent registration request for existing agent: mazztower[192.168.1.2:16163][4.5.0-SNAPSHOT(c96fb05)] - Will not regenerate a new token
13:46:30,143 INFO  [CacheConsistencyManagerBean] localhost took [49]ms to reload cache for 1 agents
13:46:41,767 INFO  [AgentManagerBean] Agent with name [mazztower] just went down
13:47:00,200 INFO  [CacheConsistencyManagerBean] localhost took [43]ms to reload global cache
13:47:00,258 INFO  [CacheConsistencyManagerBean] localhost took [43]ms to reload cache for 1 agents

I think it might have something to do with the reloading of the caches.

Comment 3 John Mazzitelli 2012-06-08 18:15:57 UTC

I just tried again - clean DB, new agent. This time, the alert fired. But here's something different: I did not see the alert caches get reloaded after the agent went down:

14:11:54,500 INFO  [CoreServerServiceImpl] Got agent registration request for existing agent: mazztower[192.168.1.2:16163][4.5.0-SNAPSHOT(c96fb05)] - Will not regenerate a new token
14:12:38,094 INFO  [CacheConsistencyManagerBean] localhost took [51]ms to reload global cache
14:12:38,158 INFO  [CacheConsistencyManagerBean] localhost took [49]ms to reload cache for 1 agents
14:12:56,487 INFO  [AgentManagerBean] Agent with name [mazztower] just went down

Notice that in comment #2, when the alert didn't fire, the two alert caches reloaded after the agent went down.
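
The comparison between the two runs can be checked mechanically. The following is a minimal sketch, not part of RHQ: the function names and the regular expression are hypothetical and assume only the log line layout seen in the excerpts from comment #2 and comment #3. It reports whether a cache reload is logged after the "just went down" line.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpts: "HH:MM:SS,mmm INFO  [Logger] message"
LINE = re.compile(r"^(\d{2}:\d{2}:\d{2},\d{3})\s+INFO\s+\[(\w+)\]\s+(.*)$")

COMMENT_2_LOG = """\
13:46:30,143 INFO  [CacheConsistencyManagerBean] localhost took [49]ms to reload cache for 1 agents
13:46:41,767 INFO  [AgentManagerBean] Agent with name [mazztower] just went down
13:47:00,200 INFO  [CacheConsistencyManagerBean] localhost took [43]ms to reload global cache
13:47:00,258 INFO  [CacheConsistencyManagerBean] localhost took [43]ms to reload cache for 1 agents
"""

COMMENT_3_LOG = """\
14:12:38,094 INFO  [CacheConsistencyManagerBean] localhost took [51]ms to reload global cache
14:12:38,158 INFO  [CacheConsistencyManagerBean] localhost took [49]ms to reload cache for 1 agents
14:12:56,487 INFO  [AgentManagerBean] Agent with name [mazztower] just went down
"""


def parse(log_text):
    """Return (timestamp, logger, message) tuples for lines matching LINE."""
    events = []
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S,%f")
            events.append((stamp, match.group(2), match.group(3)))
    return events


def reload_after_down(log_text):
    """True if any cache reload is logged after the agent went down.

    Returns None when the log has no "just went down" line.
    """
    events = parse(log_text)
    down = [ts for ts, _, msg in events if "just went down" in msg]
    if not down:
        return None
    reloads = [
        ts for ts, logger, msg in events
        if logger == "CacheConsistencyManagerBean" and "reload" in msg
    ]
    return any(ts > down[0] for ts in reloads)


if __name__ == "__main__":
    print("comment #2:", reload_after_down(COMMENT_2_LOG))  # True
    print("comment #3:", reload_after_down(COMMENT_3_LOG))  # False
```

This only orders the events that were logged; it does not show when the alert definition itself was created relative to the reloads.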

Comment 4 John Mazzitelli 2012-06-08 19:12:16 UTC

This is to be expected. See the new FAQ I added so I don't forget this again 3 years from now :)

http://rhq-project.org/display/JOPR2/FAQ#FAQ-IcreatedanalertdefinitionandIknowimmediatelythereaftermyagentshouldhavereporteddatathatshouldhavetriggeredthealertbutmyalertdidnotfire.Wheredidmyalertgo%3F
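
For readers who cannot reach the FAQ, here is a toy model of the timing issue its title describes. It is a sketch under stated assumptions, not RHQ code: it assumes the server evaluates availability changes against an in-memory cache of alert conditions that is refreshed periodically, so a definition created between refreshes is not yet visible. The class and method names are hypothetical.

```python
class AlertConditionCache:
    """Toy cache: a new definition becomes active only at the next reload."""

    def __init__(self):
        self._saved = []   # definitions stored but not yet loaded into the cache
        self._active = []  # definitions the server actually evaluates

    def create_definition(self, name):
        # Creating a definition does not touch the in-memory cache (assumption).
        self._saved.append(name)

    def reload(self):
        # Stands in for the periodic cache reload seen in the server logs.
        self._active.extend(self._saved)
        self._saved.clear()

    def on_agent_down(self):
        # Only definitions already in the cache can fire.
        return list(self._active)


cache = AlertConditionCache()
cache.create_definition("Goes Down")

first_try = cache.on_agent_down()   # agent shut down before any reload
cache.reload()                      # periodic reload happens later
second_try = cache.on_agent_down()  # agent restarted, then shut down again

print(first_try)   # []
print(second_try)  # ['Goes Down']
```

In this model the first shutdown fires nothing because the reload had not yet picked up the new definition, and the second shutdown fires because it had. That matches steps 5 and 8 of the description.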