this is just my suggestion (and why this jira is a feature request and not a bug report), but i don't think we should be using the availability report mechanism for agent heartbeats.
right now, if the availability doesn't change for any of the resources on a particular agent, an availability report is still sent with size=1. after talking with mazz, i understand that the availability for the platform is always sent because it is used to determine the last time we've heard from the agent (the heartbeat). however, this is against the RLE concept that the availability reports are supposed to represent.
although it's true that we're currently minimizing traffic over the wire by doing it this way, the savings are absolutely negligible in the face of other reporting mechanisms like content, measurements, events. i think if we separate the heartbeat from availability, then the log messages become that much more accurate, and thus that much more relevant. if i see a heartbeat message, i know my agent is still up (and i have the option to disable this log msg independently); if i see an availability report, i know that the availability for some resource has changed (and i can disable this log msg independently too). this implicitly means that availability reports are ONLY sent when some availability changes, instead of all the time with at least one record that has a special overridden meaning.
I'd like to propose a new solution to monitor the health of our agents. First, a little bit of background is in order. Below are some of the points of contention I see with the current design:
1) Availability reporting is doubling as agent heartbeats (see above)
This is bad because an availability report, although small, may be expensive to compute. If you're monitoring dozens or (more likely) hundreds of resources on a box, the agent must contact all of the managed resources on that box to get their current availabilities. This takes time and CPU. Granted, the availability report may be small when it is eventually sent over the line (because it's RLE data), but it could be relatively expensive to compute, and becomes an even larger concern if you wanted a very fast heartbeat.
update: this is mitigated by the more recent support for async availability checks, but that does not completely eliminate the points made above.
2) The agent and server have disconnected countdowns
At the time of this writing, the agent is collecting availabilities for all resources it's managing every 60 seconds, while the server is checking for suspect agents every 30 seconds. Depending on when the agent and server were respectively started, as well as each of their respective loads, the best guarantee a heuristic based on this can make is that some action will be taken within a time range (not earlier than <beginTime>, no later than <endTime>). A stronger guarantee is possible.
update: precisely as a result of contention on tables from availability reports coming across the wire from hundreds of agents, we've had to scale back these numbers, which increases the amount of time resources appear to have the wrong status (UP/DOWN) in the UI.
3) Total configurability of a heuristic is dangerous
The time range spoken of above varies, of course, depending on the heuristic as well as the periodic check interval. We can let the user futz with how the heuristic is configured, but realistically that could be dangerous. First of all, the heuristic timeout is lower-bounded by the check interval. For instance, a timeout of 10 seconds doesn't make sense if the heartbeat is 30 seconds; it needs to be greater than 30 seconds. But if it's too close to 30 seconds, like 31 or 32 seconds, the user may find the number of false positives frustrating.
Furthermore, the agent's heartbeat is less meaningful if we let the user configure the heuristic to trigger only after 5 minutes, because surely some other report would have made it up to the server by then in the case of a healthy agent. So, a value too high almost completely negates the value of having a heartbeat at all, because the beat can instead be derived from the subsystems JON is managing at the layer above the heart.
So, to remedy some of these issues, I've given this topic quite a bit of thought and come up with a relatively complete solution that will either mitigate or completely solve them. I've laid it out in several relatively isolated parts to make it easier to follow, and I've also included sequence diagrams for most (but not all) of the major use cases that this solution will handle.
1) Separate the heartbeat mechanism from the availability reporting
This will allow us to send heartbeats much more frequently. The size won't be much smaller than the common one-element availability report, so we're not saving much in terms of data across the line, but we *will* save a lot in terms of computation time. Since a heartbeat doesn't depend on anything, it can be generated for almost no cost and immediately sent up to the server. This 0-time generation is what will allow us to send heartbeats every 10 seconds without fear of a performance penalty on the agent-side.
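To make the "0-time generation" point concrete, here is a minimal sketch of what a heartbeat message could be. This is illustrative Python (the real agent is Java, and the message name and fields here are hypothetical): the heartbeat carries only an agent identity and a timestamp, with no resource checks behind it.

```python
import json
import time

def make_heartbeat(agent_name):
    # Unlike an availability report, a heartbeat needs no resource
    # checks at all: just an identity and a timestamp, so it can be
    # generated in effectively zero time and sent immediately.
    return json.dumps({"agent": agent_name, "sentAt": time.time()})

# The agent would fire one of these every 10 seconds, independent of
# availability collection.
payload = json.loads(make_heartbeat("agent-01"))
```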
2) Support a per-agent timer
Instead of having one global job on the server side that runs every so often (today it runs every 30 seconds), we have one timer per agent. The presumption is that the number of agents that would ever talk to a particular server is more or less limited by the amount of metrics we can collect; for boxes with a moderate amount of data flowing, I'd say 75 agents would be a nice, high average.
update: some figures coming out of the perf environment have shown we can handle well over 200 agents per server. this might cause us to rethink a per-agent timer and instead use a per-server timer that manages the beat-timers for the agents "connected" to that server (the connected concept was added for more stringent coordination with per-agent alert-caches).
3) Choose in-memory timers
Furthermore, to save the database hits (query and then updates), we can go with an in-memory timer solution (such as EJB3 timers, or something similar). Realize that I'm not looking for a system-wide / global solution here. I don't expect a single server to be responsible for computing the availabilities of all agents in the system. I'm aiming at a solution where each server is responsible for only its own agents.
Here, unlike the periodic check interval we use today, the timers would be correlated with their heartbeat senders. So, when some agent sends a heartbeat message, the appropriate timer is woken up and reset. For example, if the heartbeat is sent every 10 seconds, an appropriate countdown might be 15 seconds. If the heartbeat arrives a little late (perhaps due to heavy load on the agent side, network congestion, etc) it might take a few extra seconds to get there. However, as soon as it does, the server finds the counter that relates to that agent and resets / updates it.
Assuming that the system is chugging along and that all counters are appropriately being reset, everything stays green. However, as *soon* as a timer reaches zero, our first threshold has been met and the availability should immediately go to the UNKNOWN state. Here, a different countdown takes over. If, after another 30 seconds (our second counter), the server still hasn't received a heartbeat from the agent, the agent's UNKNOWN time is backfilled as DOWN.
So, using a system like this, we can assert much stronger guarantees. First, we can tell people that the server will be aware of suspect agents within 15 seconds of the last time the agent contacted it, and that the agent will be marked as DOWN after 45 total seconds of missing heartbeat messages. These numbers, 15 and 45, are neither estimates nor time ranges - they are absolute and precise.
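The two-stage countdown above can be sketched as a small state machine. This is an illustrative Python sketch (the real implementation would be server-side Java timers); the 15- and 45-second thresholds are the example values from the text, and the class and method names are mine, not the proposal's.

```python
import time

SUSPECT_AFTER = 15  # seconds without a beat -> agent is suspect (UNKNOWN)
DOWN_AFTER = 45     # total seconds without a beat -> presumed DOWN

class BeatTimer:
    """Tracks one agent's heartbeat deadlines."""
    def __init__(self, now=None):
        self.last_beat = now if now is not None else time.time()

    def beat(self, now=None):
        # A heartbeat arrived: reset the countdown for this agent.
        self.last_beat = now if now is not None else time.time()

    def state(self, now=None):
        elapsed = (now if now is not None else time.time()) - self.last_beat
        if elapsed >= DOWN_AFTER:
            return "DOWN"       # second counter expired: backfill as DOWN
        if elapsed >= SUSPECT_AFTER:
            return "UNKNOWN"    # first counter expired: suspect agent
        return "UP"

t = BeatTimer(now=0)
t.beat(now=10)        # beat arrives on time
s1 = t.state(now=20)  # 10s since last beat
s2 = t.state(now=26)  # 16s since last beat
s3 = t.state(now=60)  # 50s since last beat
```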
4) Cluster consideration
This can work in a similar manner to how the per-agent alert-caches work. Essentially, when an agent comes online it "connects" to one server in the system. This connection message tells the server that it is now responsible for managing that agent. Consequently, the server has a chance to warm caches that are specific for data coming from that agent. The heartbeat is simply another form of this data.
5) Pseudo-configurability
Now, if you're wondering whether this proposal will be as configurable as the current implementation - well, sort of. See, the major concern in either case is the existence of dependencies between particular elements of the algorithm and, more specifically, how dangerous it would be to let the user have full control over the configurability of the beat interval and heuristic countdown.
To remedy that, I propose we allow the users to choose from a pre-existing set of valid configurations. So, this is pseudo-configurable from the user's perspective because they still have control over changing it to a degree, just not complete control.
I suggest that after we implement this solution we conduct simulations to determine what the reasonable configurations for this algorithm are. I'm thinking that our target heartbeat interval range is somewhere between 10 and 60 seconds. If it were any faster than 10 seconds, it would be difficult for us to come up with a satisfactory set of values for the heuristic; any slower than 60 seconds wouldn't be very helpful because, like I mentioned in the background of this case, some other reporting mechanism *would* have made it to the server by then. And anyway, now that we've separated heartbeats from availability reporting, I contend this solution will easily be able to handle even very large groups of agents (say 150) without issue, which is about double what I expect the relatively high usage case to be.
update: yes, this solution will certainly be able to manage 150 beat timers, but we already know other types of reports have far greater payloads. in short, this would not be a bottleneck.
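The pre-simulated configuration menu might look something like the following sketch. Every number here is a placeholder: the real values would come out of the simulations proposed above. The point is only the shape of the thing - a fixed table the user picks from, which structurally guarantees the suspect timeout stays safely above the beat interval.

```python
# Hypothetical presets: heartbeat interval -> (suspect after, down after),
# all in seconds. These values are placeholders, not simulation results.
PRESETS = {
    10: (15, 45),
    30: (45, 105),
    60: (90, 210),
}

def choose_preset(interval):
    # Users pick from this fixed menu rather than entering free-form
    # values, so an invalid combination simply cannot be configured.
    if interval not in PRESETS:
        raise ValueError("unsupported heartbeat interval: %r" % interval)
    return PRESETS[interval]
```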
6) Handling server failures
If the server crashes, it's not important to have recorded the last time we spoke with the agent in the database. We're not allowing a heartbeat interval greater than 60 seconds (from the Pseudo-configurability section), and JON currently takes close to 60 seconds to start up, so the probability of needing a persisted last-heard-from time is slim. Thus, all we have to do when the server starts up is start the beat timers for the agents that are registered against this server. The agents should still be sending their asynchronous heartbeats, so things should continue working as if the server had never gone down (characterized as the "normal operations" diagram).
It's true that it's possible some agent could have gone down while the server was down, but that's OK because the aforementioned heartbeat algorithm will work on that just fine. The first timer will expire based on the suspect multiplier, which creates the second timer that will expire on the downed multiplier, and that's the server's cue for marking the agent as down.
update: on second thought, if a server starts up and the agent is already down, these beat timers will never be created. so we still need a periodic, global job that runs as a last-ditch catch-all. this means that we *will* need to periodically persist data, but certainly not as fast as we collect beats. for instance, if i'm collecting beats every 10s, i may only want to record the last-heard-from times every 60s. in the vast majority of cases, in an HA environment, and because of agent failover procedures, the steady state will render this global job a no-op most of the time.
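The beat-rate vs. persist-rate split from the update above might be throttled like this. An illustrative Python sketch using the 10s/60s figures from the text; the "write" counter is a stand-in for the real database update of the last-heard-from time.

```python
PERSIST_EVERY = 60  # persist the last-heard-from time at most once a minute
BEAT_EVERY = 10     # while beats keep arriving every 10 seconds

class ThrottledPersister:
    """Accepts every beat but only occasionally writes it through."""
    def __init__(self):
        self.last_persisted = None
        self.writes = 0  # stand-in for database updates performed

    def on_beat(self, now):
        # Persist only if we haven't written recently; all other beats
        # are absorbed in memory.
        if self.last_persisted is None or now - self.last_persisted >= PERSIST_EVERY:
            self.last_persisted = now
            self.writes += 1

p = ThrottledPersister()
for t in range(0, 120, BEAT_EVERY):  # two minutes of on-time beats
    p.on_beat(t)
```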
7) Heartbeats are special priority messages
The human body can sometimes function with impairments, but it can't function if the heart isn't beating. We should, for the sake of this algorithm's robustness, always push the agent-side heartbeat to the front of the sending p-queue, or put it on an isolated sending pipe altogether.
If priority (or isolation) isn't given to the beat over other things such as content reports, measurement reports, event logs, operation results, etc., there is a significantly higher probability that it won't work correctly in all cases of system error. The delays that might be incurred by choosing to send other pieces of data ahead of the heartbeats are unacceptable.
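The front-of-queue behavior could look like this sketch of a sending p-queue (illustrative Python; the class and priority names are mine): heartbeats get a strictly higher priority class, so they always dequeue ahead of bulkier reports, while same-priority messages keep FIFO order.

```python
import heapq
import itertools

HEARTBEAT, NORMAL = 0, 1       # lower number = sent first
_counter = itertools.count()   # tie-breaker preserving FIFO within a priority

class SendQueue:
    """Outbound queue where heartbeats jump ahead of other reports."""
    def __init__(self):
        self._heap = []

    def enqueue(self, message, priority=NORMAL):
        heapq.heappush(self._heap, (priority, next(_counter), message))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

q = SendQueue()
q.enqueue("measurement report")
q.enqueue("content report")
q.enqueue("heartbeat", priority=HEARTBEAT)
first = q.dequeue()   # the heartbeat, despite being enqueued last
second = q.dequeue()  # then FIFO among the normal-priority reports
```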
8) Concurrent considerations
So, with two different types of timers per agent in this solution, and considering this is a multi-threaded environment, we need to consider concurrent access in the design as well. A single use case will illustrate this.
Say that the timer on the server-side just triggered, and so the TimerManager wakes up and starts doing its thing. At this point, while the manager is setting the agent to unknown (and waiting for whatever other consequences would be part of that call to the AvailabilityManager to finish), a heartbeat for this agent comes in. Another thread starts to handle this beat, but at this point any number of things can happen depending on how these threads have their executions woven together. Depending on whether the downed timer was created or not, this newer beat-handler thread may or may not delete it, which would affect how and whether it would interact with the AvailabilityManager. If it does enter a code path where it makes a call to the AvailabilityManager to attempt to notify the backing store of the healthiness of the agent, it might conflict with the other thread, which would be operating on the same resources. Depending on the transactional boundaries and the isolation level of the connection to the database, any number of things can happen to put the availabilities for the resources this agent manages into an inconsistent state.
The simple remedy to this is to analyze the proposed algorithm and figure out what must be done atomically per agent. Then, the TimerManager can act as Locksmith and allow callers to acquire and release agent-specific locks during heartbeat processing (this is depicted on all of the attached sequence diagrams).
update: turns out this analysis was correct. today we have the AvailabilityReportSerializer for precisely this reason, to ensure that only one thread of work is happening at a time per agent.
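The "Locksmith" idea above could be sketched like this. Illustrative Python only: "Locksmith" is the proposal's own name, but the code is mine, not the RHQ implementation. Both the timer-expiry path and the beat-handling path would ask the Locksmith for the same agent's lock before touching availability state, so their work never interleaves.

```python
import threading

class Locksmith:
    """Hands out exactly one lock per agent, created on demand."""
    def __init__(self):
        self._guard = threading.Lock()  # protects the lock table itself
        self._locks = {}

    def lock_for(self, agent_name):
        with self._guard:
            if agent_name not in self._locks:
                self._locks[agent_name] = threading.Lock()
            return self._locks[agent_name]

smith = Locksmith()
with smith.lock_for("agent-01"):
    pass  # e.g. mark the agent UNKNOWN, or process an incoming beat

# Repeated requests for one agent yield the same lock; different
# agents get independent locks, so they never block each other.
same = smith.lock_for("agent-01") is smith.lock_for("agent-01")
diff = smith.lock_for("agent-01") is smith.lock_for("agent-02")
```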
9) Separate agent health from resource availability
This entire solution has thus far been leading up to one relatively radical change of thought: the separation of the health of an agent from the availability of some resource managed by it. This conceptual separation actually helps simplify some of the semantics we want to present to end users.
Now instead of just having UP and DOWN availability states for a resource, we will also have AGENT_UNKNOWN. This is a special construct that might be rendered visually as our classic "?", or it might have a different color altogether. The important thing to realize is that AGENT_UNKNOWN is *not* the same as the DOWN state. Thus, if the user had an alert on some resource that said "if resource goes down", this alert would not trigger. In my mind, a downed resource is a resource that we know is down, which implies that the agent managing that resource is up, healthy, and has sent back an availability report saying the resource was down (according to the plugin writer's availability semantics for the resourceType in question).
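The alert-semantics distinction in the previous paragraph boils down to something like this sketch (illustrative Python; the enum and function names are hypothetical, not RHQ API): a "resource goes down" alert fires only on a genuine DOWN report, never on AGENT_UNKNOWN.

```python
from enum import Enum

class Avail(Enum):
    UP = "up"
    DOWN = "down"                    # the agent reported the resource down
    AGENT_UNKNOWN = "agent_unknown"  # we lost the agent; NOT the same as DOWN

def fires_goes_down_alert(state):
    # DOWN means a healthy agent told us the resource is down, per the
    # plugin writer's availability semantics. AGENT_UNKNOWN means we
    # simply can't say, so the alert must not fire.
    return state is Avail.DOWN

fires_on_down = fires_goes_down_alert(Avail.DOWN)
fires_on_unknown = fires_goes_down_alert(Avail.AGENT_UNKNOWN)
```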
On the other side of the world, agent health is a separate concept from resource availability. Granted, they sound similar, they even look similar, but from all other angles they are distinct. Essentially, what we're doing here is pulling the heartbeat mechanism up, out, and away from the realm of resource availability, and making it into a first-class program construct. This construct would then be visually woven into the server administration part of the UI. The "heartbeat" subsection would allow the user to choose which of the pre-simulated heuristics they want each individual agent to run, and would also allow them to configure alerts on this special construct.
Thus, with this new separation, instead of getting a potential firestorm of availability alerts from all of your resources going down at relatively the same time, *none* of the resources technically go down at all. They are put into the AGENT_UNKNOWN state. The only alerts that should fire in this case are the ones the customer has set up against the agent itself.
When the agent rejoins the server, the sequence diagrams that I've laid out will take over and the agent will be marked as healthy. Depending on whether the agent had crashed or there had been a network partition, the resources managed by this agent may or may not have their availabilities backfilled...but this is the correct logic. If the agent had crashed, then the agent didn't really know what the availabilities were during the time it was down; thus it's appropriate to leave the resource availabilities in the AGENT_UNKNOWN state. On the other hand, if it was just a temporary network blip but the agent was still collecting data during that whole time, then the AvailabilityManager we have today will have the responsibility of backfilling the AGENT_UNKNOWN state with the "real" state of the resource as recorded by the agent during the time it had temporarily gotten severed from the server.
update: this is predicated on turning the availability report into an async, guaranteed-delivery message to the server (otherwise intermediate up/down state changes while the agent is disconnected from some server might be lost)
Created attachment 434004 [details]: sequence diagram for steady state heartbeat operation
Created attachment 434005 [details]: sequence diagram for handling suspect agents
Created attachment 434006 [details]: sequence diagram for handling presumed-to-be-downed agents
ccrouch: would you like to put this in jon 3?
The gist of this has basically been implemented in the jshaughn/avail branch.
Agent availability is no longer tied to avail reporting, a separate ping is
For more see:
This is in master.
Testing for this may simply be implicit. Perf testing will be more
relevant to ensure that we don't see unwanted backfilling due to high
Bulk closing of items that are on_qa and in old RHQ releases, which are out for a long time and where the issue has not been re-opened since.