Bug 862893 - When availability check fails due to rhq.agent.plugins.availability-scan.timeout being exceeded, the event should be logged at DEBUG
When availability check fails due to rhq.agent.plugins.availability-scan.time...
Status: CLOSED CURRENTRELEASE
Product: JBoss Operations Network
Classification: JBoss
Component: Monitoring -- Other (Show other bugs)
JON 3.1.0
All All
unspecified Severity medium
: ER02
: JON 3.2.0
Assigned To: Thomas Segismont
Mike Foley
: Reopened
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2012-10-03 16:26 EDT by Larry O'Leary
Modified: 2014-01-02 15:43 EST (History)
2 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-02 15:43:25 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
GUI Availability Timeout Message (16.53 KB, image/png)
2013-06-24 05:57 EDT, Thomas Segismont
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 38765 None None None 2012-10-03 16:26:14 EDT

  None (edit)
Description Larry O'Leary 2012-10-03 16:26:14 EDT
Description of problem:
In cases that a plug-in or the availability scan "times out" a DEBUG message needs to be logged to identify how long we waited and to clearly identify that we simply didn't get a specific response in the allowed time frame. This can be used to distinguish the difference between a DOWN availability when a Connection Refused message is produced versus the plug-in or availability check just being aborted because we have waited long enough. 

From a diagnostic and user point of view these messages should be clearly identifiable in the logs and perhaps even as a message in the UI. If my HEAD request to an Apache HTTP Server is taking more then 5 seconds and is therefore being marked DOWN, I want to know that so I can distinguish the difference between "Connection Refused", "No route to host", "Server Error", etc.

Version-Release number of selected component (if applicable):
3.1.0
Comment 1 Thomas Segismont 2013-06-24 05:57:34 EDT
Created attachment 764533 [details]
GUI Availability Timeout Message
Comment 2 Thomas Segismont 2013-06-24 06:02:03 EDT
There's a debug message already from the InventoryManager (see example below). And in the GUI, the resource appears with a warning message. See the screenshot attached for the message content.

The plugin container has no way to determine why an avail check is too long.

----------
2013-06-24 11:52:56,475 DEBUG [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Failed to collect availability on Resource[id=10009, uuid=a7f728b2-c8ec-42d2-a8b3-3c93ed924aff, type={Platforms}CPU, key=2, name=CPU 2, parent=localhost.localdomain, version=Core(TM) i7-3520M CPU @ 2.90GHz] (call timed out)
Comment 3 Larry O'Leary 2013-06-24 06:17:41 EDT
I don't think this takes care of the issue. Perhaps this is a bigger issue then simply a pc thread timeout. Basically, it appears that each plug-in is using its own timeout and the pc tread timeout doesn't happen. In the case of the Apache plug-in it was 5 seconds (hard-coded) and when such a timeout would occur no message or indication that a timeout had occurred was visible. We had already done some work on this for the Apache plug-in (if memory serves me right) so that better logging happens and we are no longer using a hard-coded timeout but instead the availability check timeout set on the agent. But this BZ is to fix this for all of our supported plug-ins to ensure that timeouts can not occur unless:

 1. they result in a log message at debug level
 2. some kind of UI indication that the availability check is timing out
Comment 4 Thomas Segismont 2013-08-28 08:46:53 EDT
My understanding is that you'd like to see in the UI in the logs the same kind of message that we have for Create Resource Requests / Operations ... etc

Is that right?  If so, then we can't do this simply in the 3.2 time-frame.

The problem is that org.rhq.core.pluginapi.availability.AvailabilityFacet#getAvailability only allows plugin developers to return the state of the resource (enum org.rhq.core.domain.measurement.AvailabilityType), nothing else.

org.rhq.core.pluginapi.inventory.CreateChildResourceFacet#createResource and org.rhq.core.pluginapi.operation.OperationFacet#invokeOperation return org.rhq.core.pluginapi.inventory.CreateResourceReport and org.rhq.core.pluginapi.operation.OperationResult which are complex objects that can hold error messages.
Comment 5 Larry O'Leary 2013-09-03 12:15:42 EDT
These are two separate things. At a minimum, this BZ should result in all timeouts for all supported JBoss ON plug-ins to clearly log, at debug level, when a timeout occurs. For instance, the availability check for the Apache plug-in will return DOWN in the event that the resource can not be reached within rhq.agent.plugins.availability-scan.timeout period. However, the log needs to indicate that the reason for the DOWN availability report was due to "timeout" or "took too long" or something that can serve as a clear diagnostic to why availability is down. This should occur for all plug-ins.

The second portion of this is to improve overall usability by also providing this information to the user via the UI. This portion of the request is probably best suited for a separate BZ if we don't currently have the capability to send UI notifications pertaining to the state of a resource's availability.
Comment 6 Thomas Segismont 2013-09-05 09:25:04 EDT
(In reply to Larry O'Leary from comment #5)
> These are two separate things. At a minimum, this BZ should result in all
> timeouts for all supported JBoss ON plug-ins to clearly log, at debug level,
> when a timeout occurs. For instance, the availability check for the Apache
> plug-in will return DOWN in the event that the resource can not be reached
> within rhq.agent.plugins.availability-scan.timeout period. However, the log
> needs to indicate that the reason for the DOWN availability report was due
> to "timeout" or "took too long" or something that can serve as a clear
> diagnostic to why availability is down. This should occur for all plug-ins.
> 

Would you be fine if change I the existing debug log message I quoted in #c2 into a warning message saying:

"Failed to collect availability on Resource[xxxxx] : call timed out and availability will be reported down"


> The second portion of this is to improve overall usability by also providing
> this information to the user via the UI. This portion of the request is
> probably best suited for a separate BZ if we don't currently have the
> capability to send UI notifications pertaining to the state of a resource's
> availability.

I created BZ1004783
Comment 7 Larry O'Leary 2013-09-05 09:47:39 EDT
(In reply to Thomas Segismont from comment #6)
> Would you be fine if change I the existing debug log message I quoted in #c2
> into a warning message saying:
> 
> "Failed to collect availability on Resource[xxxxx] : call timed out and
> availability will be reported down"

This is only logged by the plug-in container itself. The issue here is that the plug-ins are also timing out on their own (either using a hard-coded timeout value as in the past or using the plug-in containers time out). When they timeout on their own, they return DOWN meaning nothing gets logged.

This request is to ensure that this doesn't happen and that when a plug-in returns a DOWN state due to a timeout, it is logged at DEBUG level.
Comment 8 Thomas Segismont 2013-09-05 09:57:46 EDT
(In reply to Larry O'Leary from comment #7)
> This is only logged by the plug-in container itself. The issue here is that
> the plug-ins are also timing out on their own (either using a hard-coded
> timeout value as in the past or using the plug-in containers time out). When
> they timeout on their own, they return DOWN meaning nothing gets logged.

I'll have a look at the Apache server availability code and see what I can do.

> This request is to ensure that this doesn't happen and that when a plug-in
> returns a DOWN state due to a timeout, it is logged at DEBUG level.

When BZ1004783 will be implemented, we'll be able to go deal with that problem case by case.
Comment 9 Thomas Segismont 2013-09-06 11:08:37 EDT
I've made some changes in the Apache plugin and in the plugin container code.

These are the types of messages we'll see in the agent log:

2013-09-06 17:07:07,792 WARN  [ResourceContainer.invoker.daemon-21] (org.rhq.plugins.apache.ApacheServerComponent)- Connection to Apache server Resource[id=10007, uuid=a9f22a0f-c1ab-4af4-86c7-ccbe9e6bef19, type={Apache}Apache HTTP Server, key=/etc/httpd||/etc/httpd/conf/httpd.conf, name=127.0.0.1:80, parent=localhost.localdomain, version=2.2.23] timed out, availability will be reported as DOWN

2013-09-06 16:26:58,256 WARN  [ResourceContainer.invoker.daemon-2] (org.rhq.plugins.apache.ApacheVirtualHostServiceComponent)- Could not connect to Virtual Host Resource[id=10111, uuid=6c2305f7-e98f-42fc-ab7a-84d117fb51f1, type={Apache}Apache Virtual Host, key=MainServer, name=Main, parent=127.0.0.1:80], availability will be reported as DOWN


2013-09-06 16:25:50,510 WARN  [ResourceContainer.invoker.daemon-4] (org.rhq.plugins.apache.ApacheServerComponent)- Could not connect to Apache server Resource[id=10007, uuid=a9f22a0f-c1ab-4af4-86c7-ccbe9e6bef19, type={Apache}Apache HTTP Server, key=/etc/httpd||/etc/httpd/conf/httpd.conf, name=127.0.0.1:80, parent=localhost.localdomain, version=2.2.23], availability will be reported as DOWN

2013-09-06 16:36:51,420 WARN  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Availability collection timed out on Resource[id=10007, uuid=a9f22a0f-c1ab-4af4-86c7-ccbe9e6bef19, type={Apache}Apache HTTP Server, key=/etc/httpd||/etc/httpd/conf/httpd.conf, name=127.0.0.1:80, parent=localhost.localdomain, version=2.2.23], availability will be reported as DOWN


I will push the changes as soon after the end of 4.9 code freeze
Comment 10 Thomas Segismont 2013-09-16 08:26:57 EDT
Fixed in master

commit 937f14fa6e2e83e6f5fb77df38cba7f8bed11049
Author: Thomas Segismont <tsegismo@redhat.com>
Date:   Fri Sep 6 17:13:56 2013 +0200

Note You need to log in before you can comment on or make changes to this bug.