Bug 725445
Summary: | connected Agent count is sometimes incorrect with multiple agents even though agent still up and reporting. | ||
---|---|---|---|
Product: | [Other] RHQ Project | Reporter: | Simeon Pinder <spinder> |
Component: | Agent | Assignee: | Jay Shaughnessy <jshaughn> |
Status: | CLOSED WONTFIX | QA Contact: | Mike Foley <mfoley> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 3.0.1 | CC: | hrupp, jshaughn |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2011-07-27 21:38:04 UTC | Type: | --- |
Bug Blocks: | 678340, 725459 |
Description
Simeon Pinder
2011-07-25 14:11:20 UTC
So far Simeon and I have not been able to recreate this. Note that when an agent registers, or connects, the rhq_agent.server_id field is set to the ID of the server handling the agent. When the agent exits or backfills, this value is not nulled out. So, in general, the rhq_agent entry will always refer to the most recent server that handled its requests. This is why, in the server admin screen, the agent count for a server may not reflect the live count of agents: some of the agents may be down. It also means that the agent.server_id field is typically not null after it is set the first time. I have yet to see a use case, or code, that sets it to null, so if it is null at some point, that should be investigated.

OK, there seems to be a plausible explanation for this, and it has to do with a server changing state. Typically this means a server restart. If a server restarts quickly (say, probably in less than a minute), then it is possible that the server restart is completely missed by an agent. This is bad because the agent will not re-register with the server. On startup we clear the server references (rhq_agent.server_id) for the server being started up, the logic being that if the server was down, these references must be stale because the server obviously was not servicing the agents. But that means that a running agent may no longer have a server reference in the agent table. This is bad because it means that the agent will be passed over when populating the alert condition cache for the server. And, of course, it also messes up the agent count in the server listing, as described in this BZ. There may also be an analogous situation with Maintenance Mode, but the restart would likely be the more common case, for servers with enough horsepower to restart quickly.

The workaround is to restart an affected agent. But a preventative workaround is to ensure a slower server startup, by say 300 seconds, in rhq-server.properties (rhq.server.ensure-down-time-secs=300).
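The sequence described above can be illustrated with a small toy model. This is only a sketch, not RHQ code: the class and function names are hypothetical, the server ID 10001 is made up, and the 60-second agent detection window is an assumption based on the "probably less than a minute" estimate. It shows how clearing server references on startup leaves a zero agent count when the agents never notice the restart, and why a longer enforced down time avoids that.

```python
# Toy model of the behavior described in this bug. NOT RHQ code.
# All names here are hypothetical and exist only for illustration.

class AgentTable:
    """Stands in for the rhq_agent table: agent name -> server_id (or None)."""

    def __init__(self):
        self.server_id = {}

    def register(self, agent, server):
        # Registration (or connect) sets server_id to the handling server.
        self.server_id[agent] = server

    def clear_for_server(self, server):
        # On server startup, references to that server are treated as stale.
        for agent, sid in list(self.server_id.items()):
            if sid == server:
                self.server_id[agent] = None

    def count_for_server(self, server):
        # What the server listing would show as the agent count.
        return sum(1 for sid in self.server_id.values() if sid == server)


def simulate_restart(down_time_secs, agent_detect_secs=60):
    """Return the agent count shown for the server after it restarts.

    agent_detect_secs is an assumed window: an agent only notices the
    outage (and re-registers) if the server stays down at least this long.
    """
    server = 10001  # made-up server ID
    agents = ("agent-1", "agent-2")

    table = AgentTable()
    for agent in agents:
        table.register(agent, server)

    # Server comes back up and clears its own references.
    table.clear_for_server(server)

    # Agents re-register only if they noticed the server was down.
    if down_time_secs >= agent_detect_secs:
        for agent in agents:
            table.register(agent, server)

    return table.count_for_server(server)


if __name__ == "__main__":
    print("fast restart (20s):", simulate_restart(20))
    print("slow restart (300s):", simulate_restart(300))
```

In this model a 20-second restart leaves both running agents without a server reference, while a 300-second down time (the value suggested for rhq.server.ensure-down-time-secs) gives them time to notice and re-register.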
A slow server startup is not really great; I think we should solve this in a different way going forward. I'll create a new BZ for this. So, no code change is recommended for this bug.

*** Bug 725320 has been marked as a duplicate of this bug. ***

Jay added https://bugzilla.redhat.com/show_bug.cgi?id=725881 regarding his comment 2.

Fix will be done as part of 725881.