Bug 918205

Summary: Agent appears UNKNOWN in inventory but still active; possibly caused by drop of availability report
Product: [Other] RHQ Project
Component: Agent
Version: 4.5
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Elias Ross <genman>
Assignee: Jay Shaughnessy <jshaughn>
QA Contact: Mike Foley <mfoley>
CC: hrupp, loleary
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-07-18 18:39:31 UTC
Bug Depends On: 1094540

Description Elias Ross 2013-03-05 17:30:30 UTC
Description of problem:

I have an installation with many agents. Some of the RHQ Agents and Platforms appear down but are in fact fully functional and still reporting metrics; they are just no longer reporting availability. It could be that their availability report was either dropped by the server or otherwise lost, and their state was then marked UNKNOWN. I'm not sure how this transition works.


Version-Release number of selected component (if applicable):

4.5.1


How reproducible:

It's fairly hard to reproduce. However, I believe it is caused when the communication subsystem drops the availability report. In any case, it seems to coincide with messages dropped by the server.


Additional info:

The workaround is to manually disable and then re-enable the resource. The agent will then report that it is working again. This can be done using an RHQ CLI script.

Comment 1 Elias Ross 2013-07-01 17:16:20 UTC
What reproduces this fairly reliably is the database going down for a while (10-15 minutes) and then coming back. The server reconnects to the database and recovers, but platforms still appear down. You do still see metrics being inserted.

Comment 2 Elias Ross 2013-11-06 19:26:59 UTC
I've also seen this in 4.9, to a lesser degree. After a fairly long network outage (30 minutes), about 10% of the servers still appeared down, even though they were clearly sending traffic to the server and functioning fine.

The workaround described above (disable, then re-enable the resource) still works.

Comment 3 Jay Shaughnessy 2014-04-04 18:55:09 UTC
I failed in an initial attempt to reproduce this using master (4.10+), but I tried with only one agent. It came up immediately after the server reconnected to the database.

I'd be surprised to see an avail report get dropped. I don't think that is the issue. It may be more along the lines of a full avail report not getting requested on re-connect, or something like that. Since the agent has been running, its avail checks may not have resulted in any changed avail.

There has been a bunch of sync work done in 4.10. I'm wondering whether this is still seen in 4.10.
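
The hypothesis in this comment (no full avail report being requested on re-connect, while the agent's own checks see no change) can be illustrated with a toy model. This is only a sketch under stated assumptions, not RHQ code: the classes ToyAgent and ToyServer, their methods, and the resource names are all hypothetical, and mark_all_unknown() is merely a stand-in for whatever causes the server to flag resources as UNKNOWN during an outage.

```python
from enum import Enum


class Avail(Enum):
    UP = "UP"
    DOWN = "DOWN"
    UNKNOWN = "UNKNOWN"


class ToyAgent:
    """Hypothetical agent that normally reports only availability changes."""

    def __init__(self, resources):
        # Resource name -> availability as currently measured by the agent.
        self.current = dict(resources)
        # What the agent believes it has already told the server.
        self.last_sent = {}

    def changes_only_report(self):
        """Return only the entries that differ from what was last sent."""
        report = {
            name: avail
            for name, avail in self.current.items()
            if self.last_sent.get(name) != avail
        }
        self.last_sent.update(report)
        return report

    def full_report(self):
        """Return every resource's availability, changed or not."""
        self.last_sent = dict(self.current)
        return dict(self.current)


class ToyServer:
    """Hypothetical server-side view of resource availability."""

    def __init__(self):
        self.view = {}

    def merge(self, report):
        self.view.update(report)

    def mark_all_unknown(self):
        # Assumption: an outage makes the server distrust its stored state.
        self.view = {name: Avail.UNKNOWN for name in self.view}


agent = ToyAgent({"platform": Avail.UP, "rhq-agent": Avail.UP})
server = ToyServer()

server.merge(agent.changes_only_report())  # first report carries everything
server.mark_all_unknown()                  # outage: server loses confidence

# The agent saw no change during the outage, so its next report is empty
# and the server's view stays UNKNOWN.
server.merge(agent.changes_only_report())
stuck = dict(server.view)

# Only a full report repairs the server's view.
server.merge(agent.full_report())
recovered = dict(server.view)

print("after changes-only report:", stuck)
print("after full report:", recovered)
```

In this model the resources stay UNKNOWN for as long as the agent sends only changes, which would match the symptom of metrics arriving while availability looks stale. Whether the real agent and server behave this way is exactly the open question in this comment.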

Comment 4 Larry O'Leary 2014-05-23 02:08:05 UTC
As suggested by https://bugzilla.redhat.com/show_bug.cgi?id=1094540#c17 this issue appears to have been resolved in downstream and was committed to master in: https://github.com/rhq-project/rhq/commit/94008542694eef157289e6f9884669480021b565 and https://github.com/rhq-project/rhq/commit/dcc27a2c1f1acbf9fb818c92eb27ce278ef6db99

@Jay, do you concur?

Comment 5 Jay Shaughnessy 2014-05-27 22:53:28 UTC
Larry, +1. I think this is resolved.

Comment 6 Jay Shaughnessy 2014-07-18 18:39:31 UTC
This is resolved in Bug 1094540, marking as duplicate.

*** This bug has been marked as a duplicate of bug 1094540 ***