There are several ways in which avail reporting and handling could be improved. This is a tracker bug for all of it. Below is an etherpad clipping of various thoughts and ideas.

======================================================================
Fixing Availability

Availability is the RHQ way of reporting whether a resource is up, down or in an unknown state. It is a critical component in monitoring and alerting. Unfortunately, RHQ today has serious issues with availability handling:

* Very slow reporting of availability changes.
  * Availability reports are not sent frequently from the agent
  * 5 or more minutes can pass before avail is actually updated
  * Very disconcerting in the UI when we report a status the user knows is incorrect.
  * Perhaps unusable for SLA outages.
  * Resource cycling can actually be missed.
  * Availability reports are waiting behind other long duration transfers
* We alert only on availability change, not availability duration.
  * Only offer avail conditions "goes up" and "goes down", not "is up" and "is down"
  * No applicable dampening, which confuses, and then dismays, users
* No notion of "admin down"
* Poor handling when the RHQ Agent goes down, even gracefully.
  * Slow recognition
  * Avail set to down for all monitored resources on that agent.
    * This can be misleading, unknown probably makes more sense

The main blocker to doing this better is performance. We need to be able to monitor and report availability for a large number of resources, over a large number of agents, with little delay and little overhead. And for alerting purposes, we need to enhance the reporting to support dampening (on avail periods).

There are a few mechanisms in play:

* The agent must ask the resource container for its avail state.
  * the avail checking code is plugin dependent and has no guarantees on efficiency.
  * the avail check may hang for a resource that is truly down.
  * Note that we do support an asynch avail reporting mechanism, which is useful for fast avail checking if implemented by the plugin.
* The agent must report availability to the server.
* The server must respond to availability changes on the resources.
* The server must perform alerting.
* Tables/Domain classes
  * rhq_availability (resource avail history)
  * rhq_resource_avail (resource current avail)
  * AvailabilityReport (the object passed from agent to server)
* Checking avail on all resources: Parents, children, grandchildren...

Ideas:

* Certainly fix https://bugzilla.redhat.com/show_bug.cgi?id=701092
* Different frequencies for different categories
  * More frequent for servers, less frequent avail checking for services
  * Or, even more granularity
* No avail checking for dependent types
  * deferral to the parent's avail
* Detach agent avail from avail reporting
  * agent ping to server to prove it's alive
  * server ping to agent before backfilling
* Only report avail changes (means keeping/comparing against previous report agent-side)
* Store lastUpdateTime on ResourceAvailability to be able to calculate the length of time a resource has been at the current avail type
* Agent side, process avail checks top down because any parent down implies all children are down. In fact, perhaps server side we assume that we won't get down avail below a down parent, and we handle (backfill) the children server-side.
* Make all avail checking async on the agent, meaning the PC maintains the last known avail and checks against that (is that possible, efficient, do we do it already?)
* Divide up server side work among HA servers, each handling its connected agents (in memory timers)
* probability-based approaches?
* Have avail reports be sent out-of-band or as 1st prio message to the server
* Cache ResourceAvail table / last entry in Avail table
* base avail on PID (PC could always do that for processes and only call getAvail() when pid is up)
  * doable where?
* push avail from plugin?
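
To make two of the ideas above concrete ("only report avail changes" and "process avail checks top down"), here is a rough sketch in Python. It is not RHQ code: the `Resource` class, the `scan` function and the `last_known` map are hypothetical names made up for illustration, and the real agent/plugin container may be structured quite differently. In this variant the agent itself marks the children of a down parent as down; the alternative mentioned above would leave that backfill to the server.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

UP, DOWN = "UP", "DOWN"


@dataclass
class Resource:
    """Hypothetical stand-in for a resource in the agent's inventory."""
    id: int
    check: Callable[[], str]  # plugin-supplied avail check (may be slow)
    children: List["Resource"] = field(default_factory=list)


def scan(root: Resource, last_known: Dict[int, str]) -> Dict[int, str]:
    """Walk the tree top down and return only the avail values that changed.

    last_known is the agent-side record of the previous scan; it is updated
    in place so the next scan compares against the newest values.
    """
    changes: Dict[int, str] = {}

    def visit(res: Resource, parent_down: bool) -> None:
        # A down parent implies down children, so the plugin is not called.
        avail = DOWN if parent_down else res.check()
        if last_known.get(res.id) != avail:
            changes[res.id] = avail
            last_known[res.id] = avail
        for child in res.children:
            visit(child, avail == DOWN)

    visit(root, False)
    return changes
```

With a down server and one child service, the first scan reports both as down while only invoking the server's check; a second scan with nothing changed returns an empty report, so nothing would need to be sent.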
Relevant Configuration Properties:

* Availability Report Limit (rhq.server.concurrency-limit.availability-report)
  * Number of availability reports that can be processed concurrently; if zero or less, there is no limit
  * default 25
* Agent Max Quiet Time
  * Number of minutes agent has to provide avail report before being backfilled
  * default 15 minutes
* availability-scan.period-secs
  * the period between availability scans on the agent
  * default 5 minutes
* rhq.agent.plugins.availability-scan.timeout
  * default 5 seconds

Relevant Wikis

* http://rhq-project.org/display/RHQ/Design-Asynchronous+Availability+Collector
* http://rhq-project.org/display/JOPR2/FAQ#FAQ-Myserverlogsareshowingthemessage%22Havenotheardfromagent...Willbebackfilledsincewesuspectitisdown%22.Whatdoesthatmean%3F
* http://rhq-project.org/display/JOPR2/FAQ#FAQ-Explainhowtheagentscansforresources
* http://rhq-project.org/display/JOPR2/FAQ#FAQ-WhenIshutdowntheagent%2CtheRHQServertakesmorethan14minutestodetecttheagentwasdown.CanIconfigureittonottakesolong%3F
* http://rhq-project.org/display/RHQ/Ideas+about+Caching#IdeasaboutCaching-Availability

Relevant BZs

(this section removed, bugs are now linked)

Questions

When do you get unknown availability?
* When a resource is first imported before any data has been reported for the resource

TODOs

* Decide whether communications between servers is required.
* Allow users to denote "important" resources for small checking intervals?

Priorities

High: Agent updates to reduce CPU utilization
Medium/High: Server side avail reporting

2k Resources + 2 AS , 1min checks => 6k availabilities per minute coming in?
and vice-versa - 1 agent with 50 ASes
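
For discussion, the Agent Max Quiet Time setting and the "server ping to agent before backfilling" idea could combine roughly as in the Python sketch below. This is only an illustration under assumptions: `agents_to_backfill`, the `last_report` map and the `ping` callback are hypothetical names, not existing RHQ server APIs, and the 15 minute value is simply the default quoted above.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List

# Default "Agent Max Quiet Time" as quoted in this bug (an assumption here).
MAX_QUIET_TIME = timedelta(minutes=15)


def agents_to_backfill(
    last_report: Dict[str, datetime],
    now: datetime,
    ping: Callable[[str], bool],
    max_quiet: timedelta = MAX_QUIET_TIME,
) -> List[str]:
    """Return agents that have been quiet too long and do not answer a ping.

    last_report maps a (hypothetical) agent name to the time of its last
    availability report; ping(agent) returns True if the agent responds.
    """
    suspects = []
    for agent, seen in last_report.items():
        if now - seen <= max_quiet:
            continue  # reported recently enough, nothing to do
        if ping(agent):
            continue  # agent is alive, it is just slow to report
        suspects.append(agent)
    return sorted(suspects)
```

Only the agents returned here would have their resources backfilled (to down as today, or to unknown as suggested above). Pinging first keeps a busy but healthy agent from having its whole inventory flipped.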
*** Bug 795915 has been marked as a duplicate of this bug. ***
Just documenting the testing that has been done on this feature:

1) TCMS testcase development
   https://engineering.redhat.com/trac/jon/ticket/39
2) Test Day 3/28/2012. Details below.

Hi,

Today was a Test Day for JON QE on the Availability feature. FYI: we do these Test Days to get some targeted ad-hoc coverage on new features or features that may be risky. The approach is effective at raising BZs, and complements the more structured approaches of automation and TCMS.

Just sharing the results ....

Regards,
Michael (on behalf of the JON QE Team)

p.s. Extremely nice job Sunil coordinating things in Pune!!

JBoss Operations Network Test Day - 28-Mar-2012
Feature: Availability
Test Day Date: 28-Mar-2012

Goal:
- Ad hoc testing on availability checking.
- Raising BZs.

Agenda:
10:30 am - 10:45 am - Brief on the Feature
10:45 am - 01:30 pm - Testing on the feature
01:30 pm - 02:30 pm - Lunch
02:30 pm - 05:00 pm - Testing on the feature
05:00 pm - 07:00 pm - Analysing / Reporting bugs

Who's available:
GSS: Sumit, Ravish, Jay, Dashrath
QE: Spandey, Akarol, Sunil, Jeeva

Temp IRC Channel: #JON_QE

Resources available:
1. Spandey
2. Akarol
3. Sunil
4. Jeeva
5. Jay
6. Ravish
7. Sumit
8. Dashrath

TCMS:
Test Plan: ( https://tcms.engineering.redhat.com/plan/5700/ )

Docs to refer:
1. Link for documentation:
   http://rhq-project.org/display/RHQ/Design-Availability+Checking
2. BZs related:
   https://bugzilla.redhat.com/show_bug.cgi?id=787227
   https://bugzilla.redhat.com/show_bug.cgi?id=701092
   https://bugzilla.redhat.com/show_bug.cgi?id=536173
   https://bugzilla.redhat.com/show_bug.cgi?id=534286
   https://bugzilla.redhat.com/show_bug.cgi?id=535678
   https://bugzilla.redhat.com/show_bug.cgi?id=617648
   https://bugzilla.redhat.com/show_bug.cgi?id=536250
   https://bugzilla.redhat.com/show_bug.cgi?id=807803
3. Known Issues Tracker Bug:
   https://bugzilla.redhat.com/show_bug.cgi?id=741450

Test Environment links:
10.65.201.124:7080 : spandey
http://10.65.201.170:7080 : akarol
http://10.65.193.38:7080 : sunil

Admin Login: rhqadmin/rhqadmin

Where to see the Server Log:
ssh 10.65.201.124/redhat
ssh 10.65.201.170/redhat
tailf /install/rhqbuilds/master/build1185/rhq-server-4.4.0-SNAPSHOT/logs/rhq-server-log4j.log

Please add your observations below:

Sunil
1. Duration field accepts hyphen character (Ex: -2 )
   https://bugzilla.redhat.com/show_bug.cgi?id=807660
2. Re-enabling resources displays server as unknown. But even after waiting for more than 10 mins, the status does not go up.
   Existing Bug#701092
3. Compatible group count of number of children and descendants goes wrong if a member resource is disabled.
   https://bugzilla.redhat.com/show_bug.cgi?id=807671

Spandey:

Akarol:
1. Email validation required. An email with only xyz@abc can be added, hence the notifications will fail.
2. UI ambiguity: for editing, double click works only for editing notifications. It should also work for editing alerts etc.
3. Status not displayed correctly for cron

spandey:
1 minute availability check is not working for HTTPd server
1) Created alert for httpd server start, scheduled start from operations; it shows successfully started. Alert notification not received even after 5 min. Displaying wrong server status in UI.
   Existing Bug#701092

Jeeva:
https://bugzilla.redhat.com/show_bug.cgi?id=807629

Mike
Avail icon on resource detail page is not refreshed
https://bugzilla.redhat.com/show_bug.cgi?id=807803

Summary -- Bugs added on test day on 28th March:
https://bugzilla.redhat.com/show_bug.cgi?id=807660
https://bugzilla.redhat.com/show_bug.cgi?id=807671
https://bugzilla.redhat.com/show_bug.cgi?id=807629
https://bugzilla.redhat.com/show_bug.cgi?id=807803

Resources:

TCMS Test Cases
https://tcms.engineering.redhat.com/plan/5700/#reviewcases

Some testcases and ideas for testing are here:
http://jbosson.etherpad.corp.redhat.com/88

Design document:
http://rhq-project.org/display/RHQ/Design-Availability+Checking