Bug 741450

Summary: RFE: Improve Availability Handling (Tracker)
Product: [Other] RHQ Project
Reporter: Jay Shaughnessy <jshaughn>
Component: Core Server, Core UI, Alerts, Plugin Container, Plugins
Assignee: RHQ Project Maintainer <rhq-maint>
Status: CLOSED NOTABUG
QA Contact: Mike Foley <mfoley>
Severity: high
Docs Contact:
Priority: medium
Version: 4.1
CC: hrupp
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Hardware: All
OS: All
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=534286
https://bugzilla.redhat.com/show_bug.cgi?id=534375
https://bugzilla.redhat.com/show_bug.cgi?id=534725
https://bugzilla.redhat.com/show_bug.cgi?id=535352
https://bugzilla.redhat.com/show_bug.cgi?id=535676
https://bugzilla.redhat.com/show_bug.cgi?id=535678
https://bugzilla.redhat.com/show_bug.cgi?id=536173
https://bugzilla.redhat.com/show_bug.cgi?id=617648
https://bugzilla.redhat.com/show_bug.cgi?id=701092
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-05-15 17:58:37 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Bug Depends On: 743727, 534286, 534375, 534721, 534979, 535435, 535676, 535678, 535827, 535924, 536173, 536250, 584442, 617648, 701092, 729296, 811287
Bug Blocks:

Description Jay Shaughnessy 2011-09-26 16:33:21 EDT
There are several ways in which avail reporting and handling could 
possibly be improved.  This is a tracker bug for all of it.  
Below is an etherpad clipping for various thoughts and ideas.

======================================================================


Fixing Availability

Availability is the RHQ way of reporting whether a resource is up, down, or in an unknown state.  It is a critical component in monitoring and alerting.

Unfortunately, RHQ today has serious issues with availability handling:

    * Very slow reporting of availability changes.  

    * availability reports are not sent frequently from the agent

    * 5 or more minutes can pass before avail is actually updated

    * Very disconcerting in the UI when we report a status the user knows is incorrect.

    * Perhaps unusable for SLA outages.

    * Resource cycling can actually be missed.

    * Availability reports are waiting behind other long duration transfers

    * We alert only on availability change, not availability duration.

    * Only offer avail conditions "goes up" and "goes down", not "is up" and "is down"

    * No applicable dampening, which confuses, and then dismays, users

    * No notion of "admin down"

    * Poor handling when the RHQ Agent goes down, even gracefully.

    * Slow recognition

    * Avail set to down for all monitored resources on that agent.

    * This can be misleading, unknown probably makes more sense



The main blocker to doing this better is performance.  We need to be able to monitor and report availability for a large number of resources, over a large number of agents, with little delay and little overhead.  And for alerting purposes, we need to enhance the reporting to support dampening (on avail periods).
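
As a rough illustration of alerting on avail periods rather than on avail changes, the sketch below evaluates an "is down for at least N seconds" condition. It is not RHQ code: CurrentAvail, its "since" field and stays_in_state are hypothetical names, and the sketch assumes the time at which the resource entered its current state is stored somewhere.

```python
from dataclasses import dataclass
from enum import Enum


class Avail(Enum):
    UP = "UP"
    DOWN = "DOWN"
    UNKNOWN = "UNKNOWN"


@dataclass
class CurrentAvail:
    """Hypothetical stand-in for a resource's current availability record."""
    avail: Avail
    since: float  # epoch seconds at which the resource entered this state


def stays_in_state(current: CurrentAvail, wanted: Avail,
                   min_duration_secs: float, now: float) -> bool:
    """True if the resource has been in `wanted` for at least the given duration."""
    return current.avail is wanted and (now - current.since) >= min_duration_secs
```

With a condition like this, a resource that has been down for only 30 seconds would not yet satisfy a 5 minute "is down" condition, which is the kind of dampening that a pure "goes down" condition cannot express.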

There are a few mechanisms in play:

    * The agent must ask the resource container for its avail state.

    * the avail checking code is plugin dependent and has no guarantees on efficiency.

    * the avail check may hang for a resource that is truly down.

    * Note that we do support an asynch avail reporting mechanism, which is useful for fast avail checking if implemented by the plugin.

    * The agent must report availability to the server.

    * The server must respond to availability changes on the resources.

    * The server must perform alerting.

    * Tables/Domain classes

    * rhq_availability (resource avail history)

    * rhq_resource_avail (resource current avail)

    * AvailabilityReport (the object passed from agent to server)

    * Checking avail on all resources: Parents, children, grandchildren...
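
To make the "avail check may hang" point concrete, here is a minimal sketch of bounding a plugin check with a timeout. It is an illustration only, not the plugin container's implementation: check_with_timeout and slow_check are hypothetical, and reporting DOWN on a timeout or an error is an assumption made for this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# One shared pool; a hung check keeps occupying a worker until it returns.
_POOL = ThreadPoolExecutor(max_workers=4)


def check_with_timeout(check, timeout_secs):
    """Run an avail check; report "DOWN" if it fails or does not answer in time.

    `check` is a hypothetical zero-argument callable returning "UP" or "DOWN".
    """
    future = _POOL.submit(check)
    try:
        return future.result(timeout=timeout_secs)
    except FutureTimeout:
        return "DOWN"  # assumption: an unanswered check is treated as down
    except Exception:
        return "DOWN"  # assumption: a failing check is treated as down


def slow_check():
    """Simulates a plugin check that takes longer than the allowed timeout."""
    time.sleep(0.5)
    return "UP"
```

Note that the timeout only bounds how long the caller waits. The hung check itself keeps running in its worker thread, so a plugin with many hanging checks can still exhaust the pool.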


Ideas:

    * Certainly fix https://bugzilla.redhat.com/show_bug.cgi?id=701092

    * Different frequencies for different categories

    * More frequent for servers, less frequent avail checking for services

    * Or, even more granularity

    * No avail checking for dependent types

    * deferral to the parent's avail

    * Detach agent avail from avail reporting

    * agent ping to server to prove it's alive

    * server ping to agent before backfilling

    * Only report avail changes (means keeping/comparing against previous report agent-side)

    * Store lastUpdateTime on ResourceAvailability to be able to calculate the length a resource has been at the current avail type

    * Agent side, process avail checks top down because any parent down implies all children are down. In fact, perhaps server side we assume that we won't get down avail below a down parent, and we handle (backfill) the children server-side.

    * make all avail checking async on the agent, meaning the PC maintains the last known avail and checks against that (is that possible? efficient? do we do it already?)

    * Divide up server side work among HA servers, each handling its connected agents (in memory timers)

    * probability-based approaches?

    * Have avail reports be sent out-of-band or as a first-priority message to the server

    * Cache ResourceAvail table / last entry in Avail table

    * base avail on PID (PC could always do that for processes and only call getAvail() when pid is up)

    * doable where?

    * push avail from plugin?
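
Two of the ideas above (processing avail checks top down, and reporting only changes) can be sketched together as follows. This is an assumption-laden illustration: Resource and scan are hypothetical names, avail values are plain strings, and children of a down parent are marked down agent-side here, whereas the list above also suggests handling that server-side.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Resource:
    """Hypothetical resource node; `check` stands in for the plugin's avail check."""
    id: int
    check: Callable[[], str]
    children: List["Resource"] = field(default_factory=list)


def scan(root: Resource, previous: Dict[int, str]) -> Dict[int, str]:
    """Walk the tree top down and return only the avails that changed.

    `previous` holds the last reported avail per resource id and is updated in place.
    """
    changes: Dict[int, str] = {}

    def visit(res: Resource, parent_down: bool) -> None:
        # A down parent implies down children, so their checks are skipped.
        avail = "DOWN" if parent_down else res.check()
        if previous.get(res.id) != avail:
            changes[res.id] = avail
            previous[res.id] = avail
        for child in res.children:
            visit(child, parent_down or avail == "DOWN")

    visit(root, False)
    return changes
```

The first scan reports every resource because nothing has been reported before. Later scans return an empty report when nothing changed, and a down parent saves one plugin call per descendant.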



Relevant Configuration Properties:

    * Availability Report Limit  (rhq.server.concurrency-limit.availability-report)

    * Number of availability reports that can be processed concurrently; if zero or less, there is no limit

    * default  25

    * Agent Max Quiet Time

    * Number of minutes agent has to provide avail report before being backfilled

    * default 15 minutes

    * availability-scan.period-secs

    * the period between availability scans on the agent

    * default 5 minutes

    * rhq.agent.plugins.availability-scan.timeout

    * default 5 seconds
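
To show how the Agent Max Quiet Time interacts with reporting, the sketch below picks the agents that would be backfilled. The function name and the dictionary of last report times are hypothetical, and the real server-side check may work differently. Only the 15 minute default is taken from the list above.

```python
from typing import Dict, List

AGENT_MAX_QUIET_SECS = 15 * 60  # default from the list above: 15 minutes


def agents_to_backfill(last_report: Dict[str, float], now: float,
                       max_quiet_secs: float = AGENT_MAX_QUIET_SECS) -> List[str]:
    """Return the agents whose last avail report is older than the quiet time."""
    return sorted(name for name, seen in last_report.items()
                  if now - seen > max_quiet_secs)
```

Under these assumptions, an agent that stops right after sending a report is not suspected down until the full quiet time has elapsed, which matches the slow recognition described earlier.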


Relevant Wikis

    * http://rhq-project.org/display/RHQ/Design-Asynchronous+Availability+Collector

    * http://rhq-project.org/display/JOPR2/FAQ#FAQ-Myserverlogsareshowingthemessage%22Havenotheardfromagent...Willbebackfilledsincewesuspectitisdown%22.Whatdoesthatmean%3F

    * http://rhq-project.org/display/JOPR2/FAQ#FAQ-Explainhowtheagentscansforresources

    * http://rhq-project.org/display/JOPR2/FAQ#FAQ-WhenIshutdowntheagent%2CtheRHQServertakesmorethan14minutestodetecttheagentwasdown.CanIconfigureittonottakesolong%3F

    * http://rhq-project.org/display/RHQ/Ideas+about+Caching#IdeasaboutCaching-Availability


Relevant BZs
(this section removed, bugs are now linked)


Questions
When do you get unknown availability?

    * When a resource is first imported before any data has been reported for the resource


TODOs

    * Decide whether communications between servers is required.

    * Allow users to denote "important" resources for small checking intervals?


Priorities
High: Agent updates to reduce CPU utilization 
Medium/High: Server side avail reporting

2k Resources + 2 AS , 1min checks => 6k availabilities per minute coming in?
    and vice-versa - 1 agent with 50 ASes
Comment 1 Jay Shaughnessy 2012-02-21 17:16:10 EST
*** Bug 795915 has been marked as a duplicate of this bug. ***
Comment 2 Mike Foley 2012-03-29 11:44:40 EDT
just documenting the testing that has been done on this feature:

1) TCMS testcase development
https://engineering.redhat.com/trac/jon/ticket/39

2) Test Day 3/28/2012.  details below.

Hi,

Today was a Test Day for  JON QE on the Availability feature.    

FYI: we do these Test Days to get some targeted ad hoc coverage on new features or features that may be risky.  The approach is effective at raising BZs, and complements the more structured approaches of automation and TCMS.

Just sharing the results ....

Regards,

Michael (on behalf of the JON QE Team)

p.s.  Extremely nice job Sunil coordinating things in Pune!!   






                     JBoss Operations Network Test Day  - 28-Mar-2012


Feature: Availability 

Test Day Date : 28-Mar-2012

Goal:  

- Ad hoc testing on availability checking.

- Raising BZs.

Agenda:

10:30 am - 10:45 am  - Brief on the Feature
10:45 am - 01:30 pm - Testing on the feature
01:30 pm - 02:30 pm - Lunch
02:30  pm - 05:00 pm - Testing on the feature
05:00 pm -  07:00 pm - Analysing / Reporting bugs

Who's available:

GSS: Sumit, Ravish, Jay, Dashrath
QE:  Spandey, Akarol, Sunil, Jeeva
 
Temp  IRC Channel: #JON_QE
 
Resources available:
1. Spandey
2. Akarol
3. Sunil
4. Jeeva
5. Jay
6. Ravish
7. Sumit
8. Dashrath

TCMS :

Test Plan : ( https://tcms.engineering.redhat.com/plan/5700/ )
 
Docs  to refer:

1. Link for documentation:

http://rhq-project.org/display/RHQ/Design-Availability+Checking

2. BZs related:

https://bugzilla.redhat.com/show_bug.cgi?id=787227
https://bugzilla.redhat.com/show_bug.cgi?id=701092
https://bugzilla.redhat.com/show_bug.cgi?id=536173
https://bugzilla.redhat.com/show_bug.cgi?id=534286
https://bugzilla.redhat.com/show_bug.cgi?id=535678
https://bugzilla.redhat.com/show_bug.cgi?id=617648
https://bugzilla.redhat.com/show_bug.cgi?id=536250
https://bugzilla.redhat.com/show_bug.cgi?id=807803


3. Known Issues Tracker Bug :     https://bugzilla.redhat.com/show_bug.cgi?id=741450


Test   Environment links:

10.65.201.124:7080           :  spandey

http://10.65.201.170:7080   :  akarol

http://10.65.193.38:7080    :   sunil

Admin Login:  rhqadmin/rhqadmin

Where to see the Server Log:

ssh  10.65.201.124/redhat

ssh 10.65.201.170/redhat


tailf    /install/rhqbuilds/master/build1185/rhq-server-4.4.0-SNAPSHOT/logs/rhq-server-log4j.log

Please  add your observations below:

 
Sunil
1. Duration field accepts hyphen character (Ex: -2 )
https://bugzilla.redhat.com/show_bug.cgi?id=807660

2. Re-enabling resources displays the server as unknown. Even after waiting for more than 10 minutes, the status does not go up. Existing Bug#701092

3. Compatible group count of number of children and descendants goes wrong if a member resource is disabled.
https://bugzilla.redhat.com/show_bug.cgi?id=807671


Spandey:


Akarol:
1. Email validation required. An email address with only xyz@abc can be added,
 hence the notifications will fail.
2. UI ambiguity: double-click to edit works only for notifications. It should also work for editing alerts, etc.
3. Status not displayed correctly for cron




spandey : 
1 minute availability check is not working for HTTPd server

1) Created Alert for httpd server start,
Scheduled start from operations; it shows successfully started.

Alert notification not received even after 5 min.
Displaying wrong server status in UI.

Existing Bug#701092


Jeeva:

https://bugzilla.redhat.com/show_bug.cgi?id=807629

Mike

Avail icon on resource detail page is not refreshed
https://bugzilla.redhat.com/show_bug.cgi?id=807803




Summary -- Bugs Added on test day on 28th March:

https://bugzilla.redhat.com/show_bug.cgi?id=807660
https://bugzilla.redhat.com/show_bug.cgi?id=807671
https://bugzilla.redhat.com/show_bug.cgi?id=807629
https://bugzilla.redhat.com/show_bug.cgi?id=807803



Resources:
TCMS Test Cases  https://tcms.engineering.redhat.com/plan/5700/#reviewcases
Some testcases and ideas for testing are here:  http://jbosson.etherpad.corp.redhat.com/88
Design document:  http://rhq-project.org/display/RHQ/Design-Availability+Checking