Bug 1088264

Summary: AvailabilityExecutor stops calling getAvailability() on ResourceComponent after it previously failed with exception
Product: [Other] RHQ Project
Reporter: Libor Zoubek <lzoubek>
Component: Agent
Assignee: Libor Zoubek <lzoubek>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.10
CC: hrupp, jshaughn, theute
Target Milestone: ---   
Target Release: RHQ 4.11   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-07-21 10:14:00 UTC
Type: Bug

Description Libor Zoubek 2014-04-16 10:40:27 UTC
Description of problem:

I am facing a weird issue (or feature): AvailabilityExecutor stops executing getAvailability() for a resource once getAvailability() has thrown an exception.

When getAvailability() throws an exception for some reason, the user can see it in the UI, which is good. We expect the user to fix the managed resource or the plugin configuration.

I have not found a way to recover once getAvailability() is no longer being called; only restarting the agent helps. Even after updating the plugin config in the UI, I still see warning messages in the agent log. They should not appear, because I already fixed my pluginConfig and the avail check should now pass (if it were called).


Version-Release number of selected component (if applicable):
RHQ 4.11-master

How reproducible:
Always


Steps to Reproduce:
1. Have a resource that is UP
2. Turn it DOWN and change the managed resource so that the avail check in the plugin throws an exception
3. Turn your managed resource back on

The repro steps applied to a concrete case (from Bug 1015334):
1. have EAP6 in domain mode UP and imported
2. stop the EAP6 domain
3. edit EAP6's host.xml: change <host name="master" to name="master1"
4. start the EAP6 domain again

Actual results:

You can see the avail error in the UI and WARN messages about the failed avail check.
Now, when you stop EAP6, revert the change to the name attribute, and start it again, the EAP6 resource should go back UP, right? But it doesn't. You still get outdated WARN messages and the resource stays DOWN.


Expected results:
After reverting the changes in host.xml, the resource must go back UP. AvailabilityExecutor must keep calling getAvailability() on the ResourceComponent no matter whether it previously failed.

Additional info:

Comment 1 Libor Zoubek 2014-04-17 10:22:40 UTC
in master 
commit 937cb29ee5450da0bcf04d8e9952310de400e90b
Author: Libor Zoubek <lzoubek>
Date:   Thu Apr 17 11:47:43 2014 +0200

    [BZ 1088264] AvailabilityExecutor stops calling getAvailability() on
    ResourceComponent after it previously failed with exception

    The issue was in handling the exception coming from the future. When an
    availability check failed with an exception we caught it, but on the next
    run just calling future.get() raised the very same exception again. We
    forgot to mark the future to be rescheduled next time, i.e. to set it to
    null. This commit also makes the exception message more verbose so we know
    more about what happened in the plugin.
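
The mechanism described in the commit message can be illustrated with a small, self-contained sketch. It is not the RHQ source: the class AvailFutureSketch, the Avail enum, the resetOnFailure flag, and the choice to report a failed check as DOWN are all assumptions made for illustration. It relies only on the standard behavior of java.util.concurrent.Future, where get() on a future that completed with an exception rethrows that exception on every call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not the RHQ source: all names here are made up.
public class AvailFutureSketch {
    enum Avail { UP, DOWN, UNKNOWN }

    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final Callable<Avail> check;  // stands in for ResourceComponent.getAvailability()
    private final boolean resetOnFailure; // false: behavior reported in this bug, true: the fix
    private Future<Avail> future;         // null means "submit a fresh check on the next run"

    AvailFutureSketch(Callable<Avail> check, boolean resetOnFailure) {
        this.check = check;
        this.resetOnFailure = resetOnFailure;
    }

    Avail run() {
        if (future == null) {
            future = executor.submit(check);
        }
        try {
            Avail result = future.get(5, TimeUnit.SECONDS);
            future = null; // finished normally: check again on the next run
            return result;
        } catch (ExecutionException e) {
            // A completed Future keeps its exception; get() rethrows it on every call.
            if (resetOnFailure) {
                future = null; // the fix: forget the failed future so the check is resubmitted
            }
            return Avail.DOWN; // assumption: a failed check is reported as DOWN
        } catch (TimeoutException e) {
            return Avail.UNKNOWN; // check still running: keep the future and retry later
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Avail.UNKNOWN;
        }
    }

    // Fails once with a "bad config", then the config is fixed and the check runs again.
    static Avail[] simulate(boolean resetOnFailure) {
        AtomicBoolean broken = new AtomicBoolean(true);
        AvailFutureSketch sketch = new AvailFutureSketch(() -> {
            if (broken.get()) {
                throw new IllegalStateException("bad plugin configuration");
            }
            return Avail.UP;
        }, resetOnFailure);
        try {
            Avail first = sketch.run();
            broken.set(false); // the user fixes the plugin configuration
            Avail second = sketch.run();
            return new Avail[] { first, second };
        } finally {
            sketch.executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + java.util.Arrays.toString(simulate(false)));
        System.out.println("with reset:    " + java.util.Arrays.toString(simulate(true)));
    }
}
```

Without the reset, the second run never reaches the check again and the stale exception keeps the result at DOWN. With the reset, the check is resubmitted and the second run reports UP.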

Comment 2 Jay Shaughnessy 2014-04-21 16:08:43 UTC
I'm not sure, we may have done this on purpose originally, to prevent repeated failures. The component's getAvailability() method should not, in general, throw exceptions.  It should return DOWN if it can't connect due to poor plugin configuration.  So, I'd say the use case above indicates a bad implementation of getAvailability().

Having said that, this change is probably acceptable. It's more just an implementation decision and perhaps people will prefer it this way.
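
For illustration, a getAvailability() that follows this advice might look roughly like the sketch below. The AvailabilityType enum here is a local stand-in for the RHQ type, and the Connection interface with its ping() method is purely hypothetical; a real plugin would use its own connection logic and logging.

```java
// Hypothetical sketch of a component that reports DOWN instead of throwing.
public class SafeAvailabilitySketch {
    enum AvailabilityType { UP, DOWN } // local stand-in for the RHQ enum

    // Hypothetical connection to the managed resource; not part of any RHQ API.
    interface Connection {
        boolean ping() throws Exception;
    }

    private final Connection connection;

    SafeAvailabilitySketch(Connection connection) {
        this.connection = connection;
    }

    AvailabilityType getAvailability() {
        try {
            return connection.ping() ? AvailabilityType.UP : AvailabilityType.DOWN;
        } catch (Exception e) {
            // A bad plugin configuration or an unreachable resource ends up here.
            // A real plugin would log the cause; the caller only sees DOWN.
            return AvailabilityType.DOWN;
        }
    }

    public static void main(String[] args) {
        SafeAvailabilitySketch healthy = new SafeAvailabilitySketch(() -> true);
        SafeAvailabilitySketch misconfigured = new SafeAvailabilitySketch(() -> {
            throw new java.io.IOException("connection refused");
        });
        System.out.println(healthy.getAvailability());
        System.out.println(misconfigured.getAvailability());
    }
}
```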

Comment 3 Heiko W. Rupp 2014-07-21 10:14:00 UTC
Bulk closing of RHQ 4.11 issues, now that RHQ 4.12 is out.

If you find an issue with those, please open a new BZ, linking to the old one.