741450 – RFE: Improve Availability Handling (Tracker)

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Core Server, Core UI, Alerts, Plugin Container, Plugins
Sub Component:
Version:	4.1
Hardware:	All
OS:	All
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	RHQ Project Maintainer
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	795915
Depends On:	743727 RHQ-1098 RHQ-1178 RHQ-1490 RHQ-1722 RHQ-2130 RHQ-2347 RHQ-2349 RHQ-2483 RHQ-327 RHQ-551 RHQ-620 584442 617648 701092 729296 811287
Blocks:

Reported:	2011-09-26 20:33 UTC by Jay Shaughnessy
Modified:	2014-05-15 21:58 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-05-15 21:58:37 UTC
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	534286	medium	CLOSED	make availability report interval longer	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	534375	high	CLOSED	clean up agent stuff in CoreServerService.agentIsShuttingDown	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	534725	medium	CLOSED	how to handle known/to-be-expected outages?	2023-09-14 01:18:39 UTC
Red Hat Bugzilla	535352	high	CLOSED	runaway avail reporting	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	535676	medium	CLOSED	avail reports should be delayed if nothing changed	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	535678	urgent	CLOSED	get check-suspect-agents job to get the new backfiller data in a single query	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	536173	medium	CLOSED	need way to configure availability timeout	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	617648	medium	CLOSED	RFE: separate heartbeat from availability report / revamp agent health system	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	701092	medium	CLOSED	Turning off agent doesn't cause platform to show as down for 15mins	2021-02-22 00:41:40 UTC

Internal Links: 534286 534375 534725 535352 535676 535678 536173 617648 701092 729296 729298

Description Jay Shaughnessy 2011-09-26 20:33:21 UTC

There are several ways in which avail reporting and handling could 
possibly be improved.  This is a tracker bug for all of it.  
Below is an etherpad clipping for various thoughts and ideas.

======================================================================


Fixing Availability

Availability is the RHQ way of reporting whether a resource is up, down or in an unknown state.  It is a critical component in monitoring and alerting.

Unfortunately, RHQ today has serious issues with availability handling:

    * Very slow reporting of availability changes.
        * Availability reports are not sent frequently from the agent.
        * 5 or more minutes can pass before avail is actually updated.
        * Very disconcerting in the UI when we report a status the user knows is incorrect.
        * Perhaps unusable for SLA outages.
        * Resource cycling can actually be missed.
        * Availability reports are waiting behind other long duration transfers.
    * We alert only on availability change, not availability duration.
        * Only offer avail conditions "goes up" and "goes down", not "is up" and "is down".
        * No applicable dampening, which confuses, and then dismays, users.
    * No notion of "admin down".
    * Poor handling when the RHQ Agent goes down, even gracefully.
        * Slow recognition.
        * Avail set to down for all monitored resources on that agent.
            * This can be misleading; unknown probably makes more sense.



The main blocker to doing this better is performance.  We need to be able to monitor and report availability for a large number of resources, over a large number of agents, with little delay and little overhead.  And for alerting purposes, we need to enhance the reporting to support dampening (on avail periods).
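
To illustrate the difference between alerting on an availability change and alerting on an availability period, here is a minimal sketch. It is not RHQ code: the `AvailState` record and the `stays_in_state` helper are hypothetical names invented for this example, and it assumes the server can see when the resource entered its current state.

```python
from dataclasses import dataclass


@dataclass
class AvailState:
    """Hypothetical record of a resource's current availability."""
    avail: str    # "UP", "DOWN" or "UNKNOWN"
    since: float  # epoch seconds when the resource entered this state


def stays_in_state(state, wanted, min_seconds, now):
    """True if the resource has been in `wanted` for at least `min_seconds`."""
    return state.avail == wanted and (now - state.since) >= min_seconds


# A resource that went DOWN at t=1000 and has not come back.
state = AvailState(avail="DOWN", since=1000.0)

# A "goes down" condition can only fire at the moment of the change.
# An "is down for 5 minutes" condition fires once the outage has lasted.
after_two_minutes = stays_in_state(state, "DOWN", 300, now=1120.0)
after_five_minutes = stays_in_state(state, "DOWN", 300, now=1300.0)
print(after_two_minutes, after_five_minutes)  # False True
```

A check like this ignores short blips, which is one possible reading of "dampening on avail periods".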

There are a few mechanisms in play:

    * The agent must ask the resource container for its avail state.
        * The avail checking code is plugin dependent and has no guarantees on efficiency.
        * The avail check may hang for a resource that is truly down.
        * Note that we do support an asynchronous avail reporting mechanism, which is useful for fast avail checking if implemented by the plugin.
    * The agent must report availability to the server.
    * The server must respond to availability changes on the resources.
    * The server must perform alerting.
    * Tables/Domain classes
        * rhq_availability (resource avail history)
        * rhq_resource_avail (resource current avail)
        * AvailabilityReport (the object passed from agent to server)
    * Checking avail on all resources: Parents, children, grandchildren...
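
The point above about a check that hangs can be sketched as a timeout guard around the plugin call. This is only an illustration under stated assumptions, not the plugin container's actual code: `check_with_timeout`, `fast_check` and `hanging_check` are hypothetical names, and reporting DOWN on timeout is an assumption made here (reporting UNKNOWN would be an equally plausible choice).

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def check_with_timeout(executor, check, timeout_secs):
    """Run an avail check in a worker thread and stop waiting after a timeout."""
    future = executor.submit(check)
    try:
        return "UP" if future.result(timeout=timeout_secs) else "DOWN"
    except TimeoutError:
        # The check keeps running in its worker thread; we only stop waiting.
        # Assumption: a check that does not answer in time is reported as DOWN.
        return "DOWN"
    except Exception:
        # Assumption: a check that raises is also reported as DOWN.
        return "DOWN"


def fast_check():
    return True


def hanging_check():
    time.sleep(0.5)  # stands in for a connection attempt that does not return promptly
    return True


with ThreadPoolExecutor(max_workers=2) as pool:
    results = [
        check_with_timeout(pool, fast_check, 0.1),
        check_with_timeout(pool, hanging_check, 0.1),
    ]
print(results)  # ['UP', 'DOWN']
```

The timeout bounds how long the scan waits, not how long the check runs, so a truly stuck check still occupies a worker thread.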


Ideas:

    * Certainly fix https://bugzilla.redhat.com/show_bug.cgi?id=701092
    * Different frequencies for different categories
        * More frequent for servers, less frequent avail checking for services
        * Or, even more granularity
    * No avail checking for dependent types
        * deferral to the parent's avail
    * Detach agent avail from avail reporting
        * agent ping to server to prove it's alive
        * server ping to agent before backfilling
    * Only report avail changes (means keeping/comparing against previous report agent-side)
    * Store lastUpdateTime on ResourceAvailability to be able to calculate the length of time a resource has been at the current avail type
    * Agent side, process avail checks top down because any parent down implies all children are down. In fact, perhaps server side we assume that we won't get down avail below a down parent, and we handle (backfill) the children server-side.
    * Make all avail checking async on the agent, meaning the PC maintains the last known avail and checks against that (is that possible, efficient, do we do it already?)
    * Divide up server side work among HA servers, each handling its connected agents (in memory timers)
    * Probability-based approaches?
    * Have avail reports be sent out-of-band or as first-priority messages to the server
    * Cache ResourceAvail table / last entry in Avail table
    * Base avail on PID (PC could always do that for processes and only call getAvail() when the PID is up)
        * doable where?
    * Push avail from plugin?
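
Two of the ideas above, top-down checking and reporting only changes, can be sketched together. This is a rough illustration rather than the agent's implementation: `Resource`, `scan`, `changes_only` and `probe` are hypothetical names, and the sketch assumes that children of a down parent are simply marked DOWN without running their own checks.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Resource:
    """Hypothetical node in the resource tree."""
    name: str
    check: Callable[[], bool]  # stand-in for the plugin's avail check
    children: List["Resource"] = field(default_factory=list)


def scan(resource, parent_down=False, out=None):
    """Walk the tree top down; below a down parent no real check is run."""
    out = {} if out is None else out
    # `or` short-circuits, so check() is skipped when the parent is down.
    down = parent_down or not resource.check()
    out[resource.name] = "DOWN" if down else "UP"
    for child in resource.children:
        scan(child, parent_down=down, out=out)
    return out


def changes_only(previous, current):
    """Keep only entries whose avail differs from the previous report."""
    return {n: a for n, a in current.items() if previous.get(n) != a}


calls = []  # records which checks were actually executed


def probe(name, up):
    def check():
        calls.append(name)
        return up
    return check


webapp = Resource("webapp", probe("webapp", True))
server = Resource("server", probe("server", False), [webapp])
platform = Resource("platform", probe("platform", True), [server])

previous = {"platform": "UP", "server": "UP", "webapp": "UP"}
current = scan(platform)
report = changes_only(previous, current)
print(calls)   # ['platform', 'server']
print(report)  # {'server': 'DOWN', 'webapp': 'DOWN'}
```

If the server backfilled children itself, as the list suggests, the agent could drop `webapp` from the report as well and send only the `server` change.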



Relevant Configuration Properties:

    * Availability Report Limit (rhq.server.concurrency-limit.availability-report)
        * Number of availability reports that can be processed concurrently; if zero or less, there is no limit
        * default 25
    * Agent Max Quiet Time
        * Number of minutes agent has to provide avail report before being backfilled
        * default 15 minutes
    * availability-scan.period-secs
        * the period between availability scans on the agent
        * default 5 minutes
    * rhq.agent.plugins.availability-scan.timeout
        * default 5 seconds
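
As a rough illustration of how Agent Max Quiet Time drives backfilling, the sketch below flags agents whose last avail report is older than the quiet time. It is not the server's check-suspect-agents job: `suspect_agents` and the sample data are hypothetical, and only the 15 minute default comes from the list above.

```python
MAX_QUIET_MINUTES = 15  # default Agent Max Quiet Time from the list above


def suspect_agents(last_report_times, now, max_quiet_minutes=MAX_QUIET_MINUTES):
    """Return the agents that have been quiet for longer than the quiet time."""
    limit_secs = max_quiet_minutes * 60
    return sorted(
        agent
        for agent, reported_at in last_report_times.items()
        if now - reported_at > limit_secs
    )


now = 10_000.0
last_seen = {
    "agent-a": now - 60,       # reported a minute ago
    "agent-b": now - 16 * 60,  # quiet for 16 minutes, would be backfilled
    "agent-c": now - 15 * 60,  # exactly at the limit, not yet suspect here
}
suspects = suspect_agents(last_seen, now)
print(suspects)  # ['agent-b']
```

With these defaults an agent that stops reporting is not treated as down for up to 15 minutes, which matches the delay described in bug 701092.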


Relevant Wikis

    * http://rhq-project.org/display/RHQ/Design-Asynchronous+Availability+Collector

    * http://rhq-project.org/display/JOPR2/FAQ#FAQ-Myserverlogsareshowingthemessage%22Havenotheardfromagent...Willbebackfilledsincewesuspectitisdown%22.Whatdoesthatmean%3F

    * http://rhq-project.org/display/JOPR2/FAQ#FAQ-Explainhowtheagentscansforresources

    * http://rhq-project.org/display/JOPR2/FAQ#FAQ-WhenIshutdowntheagent%2CtheRHQServertakesmorethan14minutestodetecttheagentwasdown.CanIconfigureittonottakesolong%3F

    * http://rhq-project.org/display/RHQ/Ideas+about+Caching#IdeasaboutCaching-Availability


Relevant BZs
(this section removed, bugs are now linked)


Questions
When do you get unknown availability?

    * When a resource is first imported before any data has been reported for the resource


TODOs

    * Decide whether communications between servers is required.

    * Allow users to denote "important" resources for small checking intervals?


Priorities
High: Agent updates to reduce CPU utilization 
Medium/High: Server side avail reporting

2k Resources + 2 AS , 1min checks => 6k availabilities per minute coming in?
    and vice-versa - 1 agent with 50 ASes

Comment 1 Jay Shaughnessy 2012-02-21 22:16:10 UTC

*** Bug 795915 has been marked as a duplicate of this bug. ***

Comment 2 Mike Foley 2012-03-29 15:44:40 UTC

just documenting the testing that has been done on this feature:

1) TCMS testcase development
https://engineering.redhat.com/trac/jon/ticket/39

2) Test Day 3/28/2012. details below.

Hi,

Today was a Test Day for JON QE on the Availability feature.

FYI: we do these Test Days to get some targeted ad-hoc coverage on new features or features that may be risky. The approach is effective at raising BZs, and complements the more structured approaches of automation and TCMS.

Just sharing the results ....

Regards,

Michael (on behalf of the JON QE Team)

p.s. Extremely nice job Sunil coordinating things in Pune!!

JBoss Operations Network Test Day - 28-Mar-2012

Feature: Availability

Test Day Date : 28-Mar-2012

Goal:

- Ad hoc testing on availability checking.

- Raising BZs.

Agenda:

10:30 am - 10:45 am - Brief on the Feature
10:45 am - 01:30 pm - Testing on the feature
01:30 pm - 02:30 pm - Lunch
02:30 pm - 05:00 pm - Testing on the feature
05:00 pm - 07:00 pm - Analysing / Reporting bugs

Who's available:

GSS: Sumit, Ravish, jay, Dashrath
QE: Spandey, Akarol, Sunil, Jeeva

Temp IRC Channel: #JON_QE

Resources available:
1. Spandey
2. Akarol
3. Sunil
4. Jeeva
5. Jay
6. Ravish
7. Sumit
8. Dashrath

TCMS :

Test Plan : ( https://tcms.engineering.redhat.com/plan/5700/ )

Docs to refer:

1. Link for documentation:

http://rhq-project.org/display/RHQ/Design-Availability+Checking

2. BZs related:

https://bugzilla.redhat.com/show_bug.cgi?id=787227
https://bugzilla.redhat.com/show_bug.cgi?id=701092
https://bugzilla.redhat.com/show_bug.cgi?id=536173
https://bugzilla.redhat.com/show_bug.cgi?id=534286
https://bugzilla.redhat.com/show_bug.cgi?id=535678
https://bugzilla.redhat.com/show_bug.cgi?id=617648
https://bugzilla.redhat.com/show_bug.cgi?id=536250
https://bugzilla.redhat.com/show_bug.cgi?id=807803

3. Known Issues Tracker Bug : https://bugzilla.redhat.com/show_bug.cgi?id=741450

Test Environment links:

http://10.65.201.124:7080 : spandey

http://10.65.201.170:7080 : akarol

http://10.65.193.38:7080 : sunil

Admin Login: rhqadmin/rhqadmin

Where to see the Server Log:

ssh 10.65.201.124/redhat

ssh 10.65.201.170/redhat

tailf /install/rhqbuilds/master/build1185/rhq-server-4.4.0-SNAPSHOT/logs/rhq-server-log4j.log

Please add your observations below:

Sunil
1. Duration field accepts hyphen character (Ex: -2 )
https://bugzilla.redhat.com/show_bug.cgi?id=807660

2. Re-enabling resources displays the server as unknown, but even after waiting for more than 10 mins the status does not go up. Existing Bug#701092

3. Compatible group count of the number of children and descendants goes wrong if a member resource is disabled.
https://bugzilla.redhat.com/show_bug.cgi?id=807671

Spandey:

Akarol:
1. Email validation required. An email with only xyz@abc can be added,
hence the notifications will fail.
2. UI ambiguity: double click works only for editing notifications. It should also work for editing alerts, etc.
3. Status not displayed correctly for cron.

spandey :
1 minute availability check is not working for HTTPd server

1) Created alert for httpd server start.
Scheduled start from operations; it shows successfully started.

Alert notification not received even after 5 min.
Displaying wrong server status in UI.

Existing Bug#701092

Jeeva:

https://bugzilla.redhat.com/show_bug.cgi?id=807629

Mike

Avail icon on resource detail page is not refreshed
https://bugzilla.redhat.com/show_bug.cgi?id=807803

Summary -- Bugs Added on test day on 28th March:

https://bugzilla.redhat.com/show_bug.cgi?id=807660
https://bugzilla.redhat.com/show_bug.cgi?id=807671
https://bugzilla.redhat.com/show_bug.cgi?id=807629
https://bugzilla.redhat.com/show_bug.cgi?id=807803

Resources:
TCMS Test Cases https://tcms.engineering.redhat.com/plan/5700/#reviewcases
Some testcases and ideas for testing are here: http://jbosson.etherpad.corp.redhat.com/88
Design document: http://rhq-project.org/display/RHQ/Design-Availability+Checking
