Description of problem:
I have an installation with many agents. Some of the RHQ Agents and Platforms appear down but are in fact perfectly functional and still reporting metrics. However, they are no longer reporting availability. It could be that their availability report was either dropped by the server or somehow lost, and their state was then marked UNKNOWN. I'm not sure how this transition works.
Version-Release number of selected component (if applicable):
It's fairly hard to reproduce. However, I believe it is caused when the communication subsystem drops the availability report. In any case, it seems to coincide with messages dropped by the server.
The workaround is to manually disable and then re-enable the resource. The agent will then report that it is working again. This can be done using an RHQ CLI script.
What reproduces this fairly reliably is the database going down for a period of time (10-15 minutes) and then coming back. The server reconnects to the database and recovers, but platforms appear down. You still see metrics being inserted.
I've also seen this in 4.9, to a lesser degree. About 10% of the servers after a fairly long network outage (30 minutes) still appeared down, even though they were clearly sending traffic to the server and functioning fine.
The workaround described above still works.
I failed in an initial attempt to reproduce this using master (4.10+). But I tried only with one agent. It came up immediately after the server reconnected to the database.
I'd be surprised to see an avail report get dropped. I don't think that is the issue. It may be more along the lines of a full avail report not getting requested on re-connect, or something like that. Since the agent has been running, its avail checks may not have resulted in any changed avail.
There has been a bunch of sync work done in 4.10. Wondering if this is still seen in 4.10.
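To illustrate the theory in this comment, here is a minimal Python sketch, not RHQ code. It assumes the agent normally sends only availability changes and the server marks a platform DOWN after losing contact. All class, method, and resource names are hypothetical.

```python
from enum import Enum


class Avail(Enum):
    UP = "UP"
    DOWN = "DOWN"


class Agent:
    """Hypothetical agent that remembers what it last reported."""

    def __init__(self, resources):
        self.current = dict(resources)  # resource name -> actual availability
        self.last_reported = {}         # resource name -> last availability sent

    def build_report(self, full=False):
        if full:
            # Full report: every resource, changed or not.
            report = dict(self.current)
        else:
            # Changes-only report: skip resources whose availability
            # matches what was sent last time.
            report = {
                name: avail
                for name, avail in self.current.items()
                if self.last_reported.get(name) != avail
            }
        self.last_reported.update(report)
        return report


class Server:
    """Hypothetical server holding its own view of availability."""

    def __init__(self):
        self.view = {}

    def merge(self, report):
        self.view.update(report)

    def mark_down(self, names):
        # Assumption: after an outage the server marks these resources DOWN.
        for name in names:
            self.view[name] = Avail.DOWN


def simulate():
    agent = Agent({"platform": Avail.UP})
    server = Server()

    server.merge(agent.build_report())           # initial report: UP
    before_outage = server.view["platform"]

    server.mark_down(["platform"])               # outage: server loses contact
    server.merge(agent.build_report())           # nothing changed on the agent,
    after_reconnect = server.view["platform"]    # so the report is empty

    server.merge(agent.build_report(full=True))  # full report corrects the view
    after_full_report = server.view["platform"]

    return before_outage, after_reconnect, after_full_report
```

In this model the platform stays DOWN on the server until a full report arrives, even though the agent is healthy throughout. That matches the symptom described here, but it is only a model of the hypothesis, not a confirmed diagnosis.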
As suggested by https://bugzilla.redhat.com/show_bug.cgi?id=1094540#c17, this issue appears to have been resolved downstream, and the fix was committed to master in https://github.com/rhq-project/rhq/commit/94008542694eef157289e6f9884669480021b565 and https://github.com/rhq-project/rhq/commit/dcc27a2c1f1acbf9fb818c92eb27ce278ef6db99
@Jay, do you concur?
Larry, +1. I think this is resolved.
This is resolved in Bug 1094540, marking as duplicate.
*** This bug has been marked as a duplicate of bug 1094540 ***