725445 – connected Agent count is sometimes incorrect with multiple agents even though agent still up and reporting.

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Agent
Sub Component:
Version:	3.0.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Jay Shaughnessy
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	725320
Depends On:
Blocks:	jon3 rhq41beta

Reported:	2011-07-25 14:11 UTC by Simeon Pinder
Modified:	2011-08-11 03:05 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2011-07-27 21:38:04 UTC
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	725320	0	high	CLOSED	alerts will spontaneously stop firing for some of the agents	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	725881	0	high	NEW	Ensure server restarts are safe for running agents	2022-03-31 04:27:39 UTC

Internal Links: 725320 725881

Description Simeon Pinder 2011-07-25 14:11:20 UTC

Description of problem: The connected agent count as listed by http://localhost:7080/rhq/ha/listAgents.xhtml can occasionally show fewer agents than are successfully reporting back to the JON server.  With one JON server and three agents connected, one of the remote agents was no longer displayed in the connected agents list, despite showing continuous UP availability when navigating to that same agent via the left nav tree.

This behavior can cause alert and agent synchronization issues.


Version-Release number of selected component (if applicable):
2.4.1

How reproducible:
Occasionally.

Steps to Reproduce:
1. Add three agents to JON server.
2. Monitor usage over time; http://localhost:7080/rhq/ha/listAgents.xhtml will occasionally list fewer agents than are actually available.
  
Actual results:
The agent list incorrectly reports the number of agents that are actually and successfully reporting back to the JON server.

Expected results:
No discrepancy between listAgents and available alert data.

Additional info:

Comment 1 Jay Shaughnessy 2011-07-26 17:58:58 UTC

So far Simeon and I have not been able to recreate this.

Note that when an agent registers, or connects, the rhq_agent.server_id field is set to the serverid handling the agent.  When the agent exits or backfills this value is not nulled out.  So, in general the rhq_agent entry will always refer to the most recent server that handled its requests.

This is why, in the server admin screen, the agent count for a server may not reflect the live count of agents: some of the agents may be down.

It also means that the agent.server_id field is typically not null after it is set the first time.  I have yet to see a use case, or code, that sets it null, so if it is null at some point that should be investigated.
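
The two points above (the per-server count includes agents that are down, and a null reference on a reporting agent deserves investigation) can be illustrated with a minimal sketch. It is a model only: the `AgentRow` record and its fields `name`, `server_id` and `is_up` are hypothetical stand-ins for an rhq_agent row combined with current availability, not the actual RHQ schema or API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AgentRow:
    """Hypothetical view of one agent; not the real RHQ schema."""
    name: str
    server_id: Optional[int]  # models rhq_agent.server_id; None models SQL NULL
    is_up: bool               # models the agent's current availability


def agents_needing_investigation(rows: List[AgentRow]) -> List[str]:
    """Return names of agents that report UP but have no server reference."""
    return [r.name for r in rows if r.is_up and r.server_id is None]


def counted_agents(rows: List[AgentRow], server_id: int) -> int:
    """Count agents whose most recent server is `server_id`, up or down."""
    return sum(1 for r in rows if r.server_id == server_id)


rows = [
    AgentRow("agent-a", server_id=1, is_up=True),
    AgentRow("agent-b", server_id=1, is_up=False),    # down, but still counted
    AgentRow("agent-c", server_id=None, is_up=True),  # UP yet unreferenced
]
print(agents_needing_investigation(rows))
print(counted_agents(rows, server_id=1))
```

In this model the count for server 1 includes the down agent, while the UP agent with no reference is the case flagged for investigation.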

Comment 2 Jay Shaughnessy 2011-07-26 20:18:17 UTC

OK, there seems to be a plausible explanation for this and it has to do
with a server changing state.  Typically this means a server restart.  If
a server restarts quickly (say, probably less than a minute) then it is 
possible that the server restart is completely missed by an agent.  This
is bad because the agent will not re-register with the server. 

On startup we clear the server references (rhq_agent.server_id) for the
server being started up.  The logic being that if the server was down
these references must be stale because the server obviously was not
servicing the agents.

But that means that a running agent may no longer have a server
reference in the agent table.  This is bad because it means that the
agent will be passed over when populating the alert condition cache
for the server. And, of course, it also messes up the agent count in
the server listing, as described in this BZ.
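
The sequence described above can be sketched as a small model. This is not RHQ code: the dictionary standing in for the agent table, the function names, and the explicit set of agents that noticed the restart are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical model of the agent table: agent name -> server_id (None = NULL).
AgentTable = Dict[str, Optional[int]]


def restart_server(table: AgentTable, server_id: int,
                   noticed: Set[str]) -> AgentTable:
    """Model one server restart and return the resulting agent table.

    `noticed` holds the agents that detected the outage. Only those
    re-register; an agent that missed a quick restart does not.
    """
    # Startup step: clear every reference to the server being started.
    after = {name: (None if sid == server_id else sid)
             for name, sid in table.items()}
    # Re-registration restores the reference for agents that noticed.
    for name in noticed:
        if table.get(name) == server_id:
            after[name] = server_id
    return after


def agents_in_alert_cache(table: AgentTable, server_id: int) -> List[str]:
    """Agents the server would consider when populating its cache (model)."""
    return sorted(n for n, sid in table.items() if sid == server_id)


before = {"agent-1": 1, "agent-2": 1, "agent-3": 1}
# agent-3 misses the quick restart and therefore never re-registers.
after = restart_server(before, server_id=1, noticed={"agent-1", "agent-2"})
print(agents_in_alert_cache(after, 1))
print(after["agent-3"])
```

In the model, agent-3 keeps running but ends up with no server reference, so it drops out of both the listing count and the alert cache for that server.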

There may also be an analogous situation with Maintenance Mode, but the
restart would likely be the more common case, for servers with enough
horsepower to restart quickly.

The workaround is to restart an affected agent.  But a preventative
workaround is to ensure a slower server startup, by, say, 300 seconds,
in rhq-server.properties (rhq.server.ensure-down-time-secs=300).
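
As a rough sketch of how such a setting could be read and applied: the property name is taken from the comment above, but both helper functions are hypothetical, and the interpretation (startup is delayed until the server has been down for at least the configured number of seconds) is an assumption, not confirmed RHQ behavior.

```python
PROPERTY = "rhq.server.ensure-down-time-secs"


def read_ensure_down_secs(properties_text: str, default: int = 0) -> int:
    """Read the setting from properties-style text (hypothetical helper)."""
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() == PROPERTY:
            return int(value.strip())
    return default


def remaining_startup_delay(ensure_down_secs: int,
                            secs_since_shutdown: int) -> int:
    """Seconds startup would still wait, under the assumed semantics."""
    return max(0, ensure_down_secs - secs_since_shutdown)


text = "# preventative workaround\nrhq.server.ensure-down-time-secs=300\n"
secs = read_ensure_down_secs(text)
print(remaining_startup_delay(secs, secs_since_shutdown=20))
```

Under that assumption, a server that was down for only 20 seconds would wait a further 280 seconds before starting, which gives running agents time to notice the outage.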

A slow server startup is not really great; I think we should solve
this in a different way going forward.  I'll create a new BZ for this.

So, no code change recommended for this.

Comment 3 Simeon Pinder 2011-07-27 20:25:50 UTC

*** Bug 725320 has been marked as a duplicate of this bug. ***

Comment 4 Charles Crouch 2011-07-27 21:24:36 UTC

Jay added https://bugzilla.redhat.com/show_bug.cgi?id=725881 regarding his comment 2.

Comment 5 Charles Crouch 2011-07-27 21:38:04 UTC

Fix will be done as part of 725881
