Bug 822955

Summary: Resource upgrade never disabled if agent started in offline mode
Product: [Other] RHQ Project
Reporter: Lukas Krejci <lkrejci>
Component: Plugin Container
Assignee: RHQ Project Maintainer <rhq-maint>
Status: CLOSED CURRENTRELEASE
QA Contact: Filip Brychta <fbrychta>
Severity: urgent
Priority: urgent
Version: 4.4
CC: ccrouch, fbrychta, hrupp, jsanda
Target Release: JON 3.1.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Clone Of: 822905
Last Closed: 2013-09-03 15:19:19 UTC
Type: Bug
Bug Depends On: 822905
Bug Blocks: 782579

Description Lukas Krejci 2012-05-18 15:41:50 UTC
+++ This bug was initially created as a clone of Bug #822905 +++

Description of problem:

If the agent is started before any of the servers in its failover list, it starts in offline mode.

During the resource upgrade phase of inventory manager startup, the agent tries to sync with the server to get the latest inventory state. That call fails with an exception, which prevents the resource upgrade mode from disabling itself.

This in turn has several side effects throughout the inventory manager, one of them being that no resource can ever enter a synchronized state. Availability is collected only for synchronized resources, so it is never collected for any resource, even if the availability check is manually invoked using the 'avail' prompt command, until the plugin container is restarted while at least one of the servers is online.
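The failure mode can be sketched roughly as follows (a minimal hypothetical example; the class, field, and method names are invented for illustration and are not the actual RHQ InventoryManager code). The exception thrown by the failed server sync skips the statement that would turn resource upgrade mode off, so it stays stuck on:

```java
// Hypothetical sketch of the pre-fix control flow: the "disable
// upgrade mode" step sits after the server sync, so a sync failure
// at startup leaves upgrade mode permanently active.
public class UpgradeModeBugSketch {
    static boolean upgradeModeActive = true;

    // Simulates the agent starting before any server is reachable.
    static void syncWithServer() {
        throw new IllegalStateException("no server in the failover list is reachable");
    }

    static void runUpgradePhase() {
        try {
            syncWithServer();
            upgradeModeActive = false; // never reached when the sync throws
        } catch (IllegalStateException e) {
            // comms failure is logged and tolerated, but upgrade mode
            // was never switched off, so resources cannot synchronize
        }
    }

    public static void main(String[] args) {
        runUpgradePhase();
        System.out.println("upgradeModeActive=" + upgradeModeActive);
    }
}
```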

Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
1. Start an agent prior to starting any of the servers
2. Start a server
3. Import a resource
  
Actual results:
The newly imported resource remains in an unknown state until the agent (or at least the plugin container) is restarted

Expected results:
The availability collection works normally

Additional info:

--- Additional comment from fbrychta on 2012-05-18 11:10:10 EDT ---

*** Bug 822483 has been marked as a duplicate of this bug. ***

--- Additional comment from lkrejci on 2012-05-18 11:41:14 EDT ---

master http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=43770f14dd308618f47b4d2940366cb86082aef0
Author: Lukas Krejci <lkrejci>
Date:   Fri May 18 13:43:21 2012 +0200

    [BZ 822880] - Make sure to NOT discard any pending DB changes before
    masking the configuration.

Comment 1 Charles Crouch 2012-05-21 02:12:00 UTC
Lukas,
From your description I would generally be +1 on including this. What tests can we put in place to show this doesn't introduce a regression or, in fact, make matters worse?

Comment 2 Charles Crouch 2012-05-22 01:43:16 UTC
Setting Assigned To correctly

Comment 3 Lukas Krejci 2012-05-22 08:54:56 UTC
The agent and plugin container are designed to gracefully handle the loss of comms to the server during normal mode of operation. 

So I think the only test we'd want to add is that the agent enters the normal mode of operation in all circumstances that may occur during startup (which are: primary server up, at least one of the servers up, and no servers up).

Because the communication layer is separate from the plugin container, this basically boils down to ensuring that resource upgrade is switched off regardless of any exceptions thrown while making contact with the server. (Note that exceptions thrown during the upgrade of individual resources are handled on a per-resource basis: a resource that failed to upgrade never starts up during the plugin container's lifetime and reports an upgrade error to the server.)

The fix moves the switching logic into a finally block to make sure that the resource upgrade mode is always switched off regardless of the upgrade result.

Our current test suite quite comprehensively covers failures during the upgrade itself, but it doesn't test the behavior when there is an error in the agent-server communication, which is exactly what this BZ is about. I am going to add a test for that, but I generally trust Java to handle the finally block right ;)

Comment 4 Lukas Krejci 2012-05-22 14:49:01 UTC
I added a test (in master only) that the resource upgrade mode is inactive, into the existing resource upgrade unit tests (which actually include a test for handling server unavailability during upgrade, but did not catch this nuance).

As for "to show this doesn't introduce a regression or in fact make matters worse": I think we just have to trust our current test suite to discover any side effects. As I wrote in the previous comments, I think the risk is low, because the upgrade was always meant to occur only during inventory manager startup, and the fix makes it so.

Comment 5 Charles Crouch 2012-05-22 16:09:27 UTC
Lukas, let's get the test and fix pushed to the release branch.

Comment 6 Lukas Krejci 2012-05-22 17:22:25 UTC
release/jon3.1.x http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=19b728129a36cfaab4dcb7d8c7a9b52ff219a64a
Author: Lukas Krejci <lkrejci>
Date:   Fri May 18 15:30:19 2012 +0200

    [BZ 822905] - Make sure to always disable the resource upgrade after it
    has finished, even after a failure.
    (cherry picked from commit 5ebacc87cb292004186c0b8f6876e00649b18391)

release/jon3.1.x http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=47d0f7a9eb9f2d171948e3f4ecfbc649109ee083
Author: Lukas Krejci <lkrejci>
Date:   Tue May 22 12:07:02 2012 +0200

    [BZ 822905] - modified the unit tests for checking that the resource upgrade
    is actually disabled after resource upgrade.
    (cherry picked from commit 5e5b9d9581e18b389046753b3c5e8d03067da42f)

Comment 7 Charles Crouch 2012-05-22 20:09:24 UTC
Setting this back to MODIFIED since it's not available in a JON build yet; should be in ER5.

Comment 8 Filip Brychta 2012-06-05 09:04:13 UTC
Verified on 3.1.0.CR2

Comment 9 Heiko W. Rupp 2013-09-03 15:19:19 UTC
Bulk closing of old issues in VERIFIED state.