Description of problem: The connected agent count as listed by http://localhost:7080/rhq/ha/listAgents.xhtml can occasionally show fewer agents than are successfully reporting back to the JON server. With one JON server and three agents connected, one of the remote agents was no longer displayed in the connected agents list, despite continuous UP availability when navigating to that same agent via the left nav tree.
This behavior can cause alert and agent synchronization issues.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Add three agents to JON server.
2. Monitor usage over time; http://localhost:7080/rhq/ha/listAgents.xhtml will occasionally list fewer agents than are actually available.
Actual results:
Agent list incorrectly reports the number of agents actually and successfully reporting back to the JON server.
Expected results:
No discrepancy between listAgents and available alert data.
So far Simeon and I have not been able to recreate this.
Note that when an agent registers, or connects, the rhq_agent.server_id field is set to the serverid handling the agent. When the agent exits or backfills this value is not nulled out. So, in general the rhq_agent entry will always refer to the most recent server that handled its requests.
This is why, in the server admin screen, the agent count for a server may not reflect the live count of agents, as some of the agents may be down.
It also means that the agent.server_id field is typically not null after it is set the first time. I have yet to see a use case, or code, that sets it null, so if it is null at some point that should be investigated.
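The bookkeeping described above can be sketched as a toy model. This is only an illustration, not RHQ code: the `Agent` class, the `register`, `shutdown`, `listed_count` and `live_count` helpers, and the in-memory list are hypothetical stand-ins for the rhq_agent table and the admin screen query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    name: str
    server_id: Optional[int] = None  # stands in for rhq_agent.server_id
    running: bool = False


def register(agent: Agent, server_id: int) -> None:
    """Agent registers or connects: record the server handling it."""
    agent.server_id = server_id
    agent.running = True


def shutdown(agent: Agent) -> None:
    """Agent exits or is backfilled: server_id is left untouched."""
    agent.running = False


def listed_count(agents, server_id: int) -> int:
    """Count by server_id only (assumed to mirror the admin screen)."""
    return sum(1 for a in agents if a.server_id == server_id)


def live_count(agents, server_id: int) -> int:
    """Count only agents that are actually running."""
    return sum(1 for a in agents if a.server_id == server_id and a.running)


agents = [Agent("agent1"), Agent("agent2"), Agent("agent3")]
for a in agents:
    register(a, server_id=1)
shutdown(agents[2])

# The listed count still includes the stopped agent; the live count does not.
print(listed_count(agents, 1), live_count(agents, 1))
```

In this model the listed count can only over-report, never under-report, which is why a missing or null server_id for a running agent would be unexpected.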
OK, there seems to be a plausible explanation for this and it has to do
with a server changing state. Typically this means a server restart. If
a server restarts quickly (say, probably less than a minute) then it is
possible that the server restart is completely missed by an agent. This
is bad because the agent will not re-register with the server.
On startup we clear the server references (rhq_agent.server_id) for the
server being started up. The logic being that if the server was down
these references must be stale because the server obviously was not
servicing the agents.
But that means that a running agent may no longer have a server
reference in the agent table. This is bad because it means that the
agent will be passed over when populating the alert condition cache
for the server. And, of course, it also messes up the agent count in
the server listing, as described in this BZ.
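A minimal sketch of the quick-restart scenario, assuming a dictionary in place of the agent table. The function names (`start_server`, `register`, `agents_for_server`) and the agent names are hypothetical and only model the behavior described in this comment; they are not the actual RHQ implementation.

```python
def start_server(agent_table, server_id):
    """On startup, clear every agent reference to the server being started."""
    for row in agent_table.values():
        if row["server_id"] == server_id:
            row["server_id"] = None


def register(agent_table, name, server_id):
    """An agent that noticed the outage (or was restarted) re-registers."""
    agent_table[name]["server_id"] = server_id


def agents_for_server(agent_table, server_id):
    """Agents considered for the agent count and the alert condition cache."""
    return sorted(
        name for name, row in agent_table.items()
        if row["server_id"] == server_id
    )


# Three running agents, all handled by server 1.
agent_table = {name: {"server_id": 1}
               for name in ("agent1", "agent2", "agent3")}

# Quick restart: references are cleared, but only two agents notice
# the outage and re-register. agent3 keeps running and reporting.
start_server(agent_table, 1)
for name in ("agent1", "agent2"):
    register(agent_table, name, 1)

after_restart = agents_for_server(agent_table, 1)
print(after_restart)  # agent3 is missing although it is still running

# Restarting the affected agent makes it register again.
register(agent_table, "agent3", 1)
after_agent_restart = agents_for_server(agent_table, 1)
print(after_agent_restart)
```

In this model the agent that missed the restart stays invisible until something forces it to register again, which matches the symptom reported here.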
There may also be an analogous situation with Maintenance Mode, but the
restart would likely be the more common case, for servers with enough
horsepower to restart quickly.
The workaround is to restart an affected agent. A preventative
workaround is to ensure a slower server startup, by, say, 300 seconds,
in rhq-server.properties (rhq.server.ensure-down-time-secs=300).
A slow server startup is not really great; I think we should solve
this in a different way going forward. I'll create a new BZ for this.
So, no code change recommended for this.
*** Bug 725320 has been marked as a duplicate of this bug. ***
Jay added https://bugzilla.redhat.com/show_bug.cgi?id=725881 regarding his comment 2.
The fix will be done as part of 725881.