Bug 918205 - Agent appears UNKNOWN in inventory but still active; possibly caused by drop of availability report
Keywords:
Status: CLOSED DUPLICATE of bug 1094540
Alias: None
Product: RHQ Project
Classification: Other
Component: Agent
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Jay Shaughnessy
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On: 1094540
Blocks:
Reported: 2013-03-05 17:30 UTC by Elias Ross
Modified: 2014-07-18 18:39 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-07-18 18:39:31 UTC


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1056562 None None None Never

Internal Links: 1056562

Description Elias Ross 2013-03-05 17:30:30 UTC
Description of problem:

I have an installation with many agents. Some of the RHQ Agents and Platforms appear down but are in fact perfectly functional and actually reporting metrics. But they are no longer reporting availability. It could be their availability report was either dropped by the server or somehow lost and their state is then marked UNKNOWN. I'm not sure how this transition works.


Version-Release number of selected component (if applicable):

4.5.1


How reproducible:

It's fairly hard to reproduce. However, I believe it is caused when the communication subsystem drops the availability report. In any case, it seems to coincide with messages dropped by the server.


Additional info:

The workaround is to manually disable and then re-enable the resource, after which the agent reports that it is working again. This can be done using an RHQ CLI script.
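
For illustration only, the logic of that workaround might be sketched as follows in Python. This is not RHQ CLI code: the `Resource` record, the injected `disable` and `enable` callables, and the 10-minute freshness window are all assumptions made up for this sketch. The idea is simply to pick resources whose availability is UNKNOWN even though a metric arrived recently, then disable and re-enable each one.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Resource:
    # Hypothetical record for this sketch; not an RHQ domain class.
    id: int
    availability: str          # e.g. "UP", "DOWN", "UNKNOWN"
    last_metric_age_s: float   # seconds since the last metric was stored


def find_stale(resources: Iterable[Resource],
               max_metric_age_s: float = 600.0) -> List[Resource]:
    """Return resources marked UNKNOWN although metrics arrived recently."""
    return [r for r in resources
            if r.availability == "UNKNOWN"
            and r.last_metric_age_s <= max_metric_age_s]


def bounce(resources: Iterable[Resource],
           disable: Callable[[int], None],
           enable: Callable[[int], None]) -> List[int]:
    """Disable, then enable, each stale resource; return the ids touched."""
    touched = []
    for r in find_stale(resources):
        disable(r.id)   # assumed hook: whatever call disables a resource
        enable(r.id)    # assumed hook: whatever call re-enables it
        touched.append(r.id)
    return touched
```

In a real script the two callables would wrap whatever the RHQ CLI provides for disabling and enabling a resource; they are left abstract here on purpose.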

Comment 1 Elias Ross 2013-07-01 17:16:20 UTC
What reproduces this fairly reliably is if the database goes down for a time period (10-15 minutes) and comes back. The server will reconnect to the database and recover but platforms appear down. You do see metrics being inserted still.

Comment 2 Elias Ross 2013-11-06 19:26:59 UTC
I've also seen this in 4.9, to a lesser degree. After a fairly long network outage (30 minutes), about 10% of the servers still appeared down, even though they were clearly sending traffic to the server and functioning fine.

The fix as described above still works.

Comment 3 Jay Shaughnessy 2014-04-04 18:55:09 UTC
I failed in an initial attempt to reproduce this using master (4.10+).  But I tried only with one agent.  It came up immediately after the server reconnected to the database.

I'd be surprised to see an avail report get dropped. I don't think that is the issue. It may be more along the lines of a full avail report not getting requested on re-connect, or something like that. Since the agent has been running, its avail checks may not have resulted in any changed avail.

There has been a bunch of sync work done in 4.10. Wondering if this is still seen in 4.10.
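
The hypothesis in this comment (no full avail report requested on re-connect, and nothing changed on the agent side) can be shown with a toy model. This is a minimal Python sketch under stated assumptions, not RHQ's actual code: the `Agent` and `Server` classes, `mark_all_unknown`, and the resource names are all hypothetical. It assumes the agent normally sends only changed availabilities and that the server marks everything UNKNOWN during the outage.

```python
from typing import Dict


class Agent:
    """Toy agent that normally reports only availability changes."""

    def __init__(self, actual: Dict[str, str]) -> None:
        self.actual = dict(actual)                # true state on the machine
        self.last_reported: Dict[str, str] = {}   # what was sent last time

    def report(self, full: bool = False) -> Dict[str, str]:
        if full:
            payload = dict(self.actual)           # everything, changed or not
        else:
            payload = {name: state for name, state in self.actual.items()
                       if self.last_reported.get(name) != state}
        self.last_reported = dict(self.actual)
        return payload


class Server:
    """Toy server holding its current view of each resource."""

    def __init__(self) -> None:
        self.avail: Dict[str, str] = {}

    def merge(self, report: Dict[str, str]) -> None:
        self.avail.update(report)

    def mark_all_unknown(self) -> None:
        # Assumed outage behavior: the server stops trusting its view.
        self.avail = {name: "UNKNOWN" for name in self.avail}


def simulate(request_full_on_reconnect: bool) -> Dict[str, str]:
    agent = Agent({"platform": "UP", "rhq-agent": "UP"})
    server = Server()
    server.merge(agent.report())     # initial report: everything is new
    server.mark_all_unknown()        # outage on the server side
    server.merge(agent.report(full=request_full_on_reconnect))
    return server.avail
```

With `simulate(False)` the changes-only report after the outage is empty, so the server's view stays UNKNOWN although the agent is healthy. With `simulate(True)` the full report restores both resources to UP.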

Comment 4 Larry O'Leary 2014-05-23 02:08:05 UTC
As suggested by https://bugzilla.redhat.com/show_bug.cgi?id=1094540#c17 this issue appears to have been resolved in downstream and was committed to master in: https://github.com/rhq-project/rhq/commit/94008542694eef157289e6f9884669480021b565 and https://github.com/rhq-project/rhq/commit/dcc27a2c1f1acbf9fb818c92eb27ce278ef6db99

@Jay, do you concur?

Comment 5 Jay Shaughnessy 2014-05-27 22:53:28 UTC
Larry, +1. I think this is resolved.

Comment 6 Jay Shaughnessy 2014-07-18 18:39:31 UTC
This is resolved in Bug 1094540, marking as duplicate.

*** This bug has been marked as a duplicate of bug 1094540 ***

