Bug 725445

Summary:	connected Agent count is sometimes incorrect with multiple agents even though agent still up and reporting.
Product:	[Other] RHQ Project	Reporter:	Simeon Pinder <spinder>
Component:	Agent	Assignee:	Jay Shaughnessy <jshaughn>
Status:	CLOSED WONTFIX	QA Contact:	Mike Foley <mfoley>
Severity:	medium	Docs Contact:
Priority:	high
Version:	3.0.1	CC:	hrupp, jshaughn
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-07-27 21:38:04 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
Bug Depends On:
Bug Blocks:	678340, 725459

Description Simeon Pinder 2011-07-25 14:11:20 UTC

Description of problem: The connected agent count listed at http://localhost:7080/rhq/ha/listAgents.xhtml can occasionally show fewer agents than are successfully reporting back to the JON server.  With one JON server and three agents connected, one of the remote agents was no longer displayed in the connected agents list, despite showing continuous UP availability when navigating to that same agent via the left nav tree.

This behavior can cause alert and agent synchronization issues.


Version-Release number of selected component (if applicable):
2.4.1

How reproducible:
Occasionally.

Steps to Reproduce:
1. Add three agents to JON server.
2. Monitor usage over time; http://localhost:7080/rhq/ha/listAgents.xhtml will occasionally list fewer agents than are actually available.
  
Actual results:
The agent list misreports the number of agents that are actually reporting back to the JON server.

Expected results:
No discrepancy between listAgents and available alert data.

Additional info:

Comment 1 Jay Shaughnessy 2011-07-26 17:58:58 UTC

So far Simeon and I have not been able to recreate this.

Note that when an agent registers, or connects, the rhq_agent.server_id field is set to the ID of the server handling the agent.  When the agent exits or backfills, this value is not nulled out.  So, in general, the rhq_agent entry will always refer to the most recent server that handled its requests.

This is why, in the server admin screen, the agent count for a server may not reflect the live count of agents: some of the agents may be down.

It also means that the agent.server_id field is typically not null after it is set the first time.  I have yet to see a use case, or code, that sets it null, so if it is null at some point that should be investigated.
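
A rough sketch of how such a check could look, for anyone investigating. It runs against an in-memory SQLite table that is only a simplified stand-in for the real schema: the table name and the server_id column come from the comment above, while the id and name columns, the sample rows, and both helper functions are assumptions made up for illustration.

```python
import sqlite3


def agents_without_server(conn):
    """Return the names of agents whose server_id is NULL, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM rhq_agent WHERE server_id IS NULL ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def agent_count_by_server(conn):
    """Return {server_id: agent count}, counting only agents that reference a server."""
    rows = conn.execute(
        "SELECT server_id, COUNT(*) FROM rhq_agent "
        "WHERE server_id IS NOT NULL GROUP BY server_id"
    ).fetchall()
    return dict(rows)


# Hypothetical, simplified table; only rhq_agent.server_id is taken from the report.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rhq_agent (id INTEGER PRIMARY KEY, name TEXT, server_id INTEGER)"
)
conn.executemany(
    "INSERT INTO rhq_agent (name, server_id) VALUES (?, ?)",
    [("agent-a", 1), ("agent-b", 1), ("agent-c", None)],
)

print(agents_without_server(conn))
print(agent_count_by_server(conn))
```

With the sample rows, the third agent has no server reference, so a per-server count based on server_id would list two agents even if all three are running.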

Comment 2 Jay Shaughnessy 2011-07-26 20:18:17 UTC

OK, there seems to be a plausible explanation for this and it has to do
with a server changing state.  Typically this means a server restart.  If
a server restarts quickly (say, in under a minute) then it is
possible that an agent misses the server restart completely.  This
is bad because the agent will not re-register with the server.

On startup we clear the server references (rhq_agent.server_id) for the
server being started up.  The logic is that if the server was down,
these references must be stale because the server obviously was not
servicing the agents.

But that means that a running agent may no longer have a server
reference in the agent table.  This is bad because it means that the
agent will be passed over when populating the alert condition cache
for the server. And, of course, it also messes up the agent count in
the server listing, as described in this BZ.

There may also be an analogous situation with Maintenance Mode, but the
restart would likely be the more common case, for servers with enough
horsepower to restart quickly.

The workaround is to restart an affected agent.  A preventative
workaround is to ensure a slower server startup, by, say, 300 seconds,
in rhq-server.properties (rhq.server.ensure-down-time-secs=300).
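
To make the race concrete, here is a toy simulation of the sequence described above. It is a sketch under stated assumptions, not the actual server or agent code: the Agent class, the next_contact_in_secs field (how long until an agent next tries to reach its server), and the restart_server function are all hypothetical. The only things taken from this BZ are that startup clears the server references and that an agent re-registers only if it noticed the outage.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Agent:
    name: str
    next_contact_in_secs: int        # assumed: seconds until the agent next contacts its server
    server_id: Optional[int] = None  # mirrors rhq_agent.server_id


def restart_server(agents: List[Agent], server_id: int, down_secs: int) -> List[str]:
    """Simulate one restart; return names of agents still referencing the server."""
    # Agents that try to contact the server while it is down notice the outage.
    noticed = {
        a.name
        for a in agents
        if a.server_id == server_id and a.next_contact_in_secs < down_secs
    }
    # On startup the server clears every reference to itself.
    for a in agents:
        if a.server_id == server_id:
            a.server_id = None
    # Only the agents that noticed the outage re-register afterwards.
    for a in agents:
        if a.name in noticed:
            a.server_id = server_id
    return [a.name for a in agents if a.server_id == server_id]


def make_agents() -> List[Agent]:
    """Three running agents, all attached to server 1 (illustrative values)."""
    return [Agent("agent-a", 5, 1), Agent("agent-b", 30, 1), Agent("agent-c", 50, 1)]


quick = restart_server(make_agents(), server_id=1, down_secs=20)
slow = restart_server(make_agents(), server_id=1, down_secs=300)
print(quick)
print(slow)
```

In this model the 20-second restart leaves only agent-a listed, because the other two never saw the server go away, while the 300-second down time gives every agent a chance to notice and re-register.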

A slow server startup is not really great; I think we should solve
this in a different way going forward.  I'll create a new BZ for this.

So, no code change recommended for this.

Comment 3 Simeon Pinder 2011-07-27 20:25:50 UTC

*** Bug 725320 has been marked as a duplicate of this bug. ***

Comment 4 Charles Crouch 2011-07-27 21:24:36 UTC

Jay added https://bugzilla.redhat.com/show_bug.cgi?id=725881 regarding his comment 2.

Comment 5 Charles Crouch 2011-07-27 21:38:04 UTC

The fix will be done as part of bug 725881.