+++ This bug was initially created as a clone of Bug #879655 +++

Description of problem:

Two conditions must both hold for this to happen:

1) There is a resource in the inventory that has not been upgraded yet (i.e. it was inventoried using an older version of RHQ, the agent is starting up with new plugins for the first time, and the plugin code determines that the resource needs to be upgraded).

2) The stop() method on the resource component throws an exception. This can happen if it uses an API that is not initialized during the resource upgrade phase, such as the event subsystem (that is a bug in itself; see bug 863092).

If both conditions are true, the following happens:

1) The component is start()ed during resource upgrade.
2) Once the upgrade is processed, the resource component is restarted to pick up the new state. This tries to stop() the resource component during the InventoryManager.prepareForResourceActivation() call.
3) The stop() call fails with an exception.
4) The exception is handled up the call chain, leaving the resource container in an inconsistent state: ResourceContainer.getResourceComponentState() returns STARTED, while the resource component itself has been torn down using its stop() method (which failed, but still could have left the component in a weird state, neither up nor down).

Version-Release number of selected component (if applicable):
4.6.0-SNAPSHOT

How reproducible:
Always

Steps to Reproduce:
1. Install RHQ 3.0.0 and inventory an Apache server using it.
2. Upgrade to RHQ 4.6.0-SNAPSHOT.
3. Run a plugins update at the agent prompt, or restart the agent.
4. Watch agent.log for errors during resource upgrade.

Actual results:
The Apache server resource is marked down until the agent (or at least the plugin container) is restarted.

Expected results:
Availability reporting works.

Additional info:

--- Additional comment from Lukas Krejci on 2012-11-23 11:28:22 EST ---

I'm quite confident the change below fixes the issue, i.e. a failure in stop() can no longer put the resource container in an invalid state. Jay, can you see anything wrong with it, or does it remind you of other areas that would need similar treatment?

master
http://git.fedorahosted.org/cgit/rhq/rhq.git/diff/?id=844f016ee8b2608496d063c94f38461e997bcabe
Author: Lukas Krejci <lkrejci>
Date:   Fri Nov 23 17:18:35 2012 +0100

    [BZ 879655] - Properly deactivate the resource instead of just calling the component's stop() method when forcing reactivation of a started component. This will ensure the PC can't enter an inconsistent state in the case the stop() method fails.

--- Additional comment from Jay Shaughnessy on 2012-11-26 12:21:45 EST ---

This change looks correct to me. There were some relatively recent changes in this code, made by myself and Ian I think, to better handle a "starting" state in the component lifecycle. This looks like an issue resulting from, or more exposed by, those changes.

--- Additional comment from Lukas Krejci on 2012-11-27 05:43:15 EST ---

Moving to ON_QA.
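[Editor's note] To make the failure mode and the fix concrete, here is a minimal, self-contained Java sketch of the two lifecycle patterns. It is NOT the actual RHQ code: the names (LifecycleSketch, ResourceContainer, ComponentState, forceReactivationBuggy, forceReactivationFixed, deactivate, activate) are illustrative stand-ins for the InventoryManager/ResourceContainer logic described above. The buggy pattern lets a throwing stop() escape before the container's state is updated; the fixed pattern routes through a deactivation step that owns the state transition.

    // Minimal sketch of the lifecycle pattern discussed above.
    // All names are illustrative stand-ins; this is NOT the actual RHQ code.
    public class LifecycleSketch {

        enum ComponentState { STARTED, STOPPED }

        interface ResourceComponent {
            void start();
            void stop(); // may throw, e.g. when a subsystem was never initialized
        }

        static class ResourceContainer {
            private final ResourceComponent component;
            private ComponentState state = ComponentState.STOPPED;

            ResourceContainer(ResourceComponent component) {
                this.component = component;
            }

            ComponentState getResourceComponentState() {
                return state;
            }

            void activate() {
                component.start();
                state = ComponentState.STARTED;
            }

            // BUGGY pattern: stop() is called directly. If it throws, the
            // exception propagates before the state is updated, leaving
            // state == STARTED for a component that is already torn down.
            void forceReactivationBuggy() {
                component.stop();               // throws -> state stays STARTED
                state = ComponentState.STOPPED;
                activate();
            }

            // FIXED pattern: deactivation owns the state transition and records
            // STOPPED even when the component's stop() fails, so the container
            // never reports STARTED for a torn-down component.
            void forceReactivationFixed() {
                deactivate();
                activate();
            }

            private void deactivate() {
                try {
                    component.stop();
                } catch (RuntimeException e) {
                    // Log and continue: the component is considered down either way.
                    System.err.println("stop() failed: " + e.getMessage());
                } finally {
                    state = ComponentState.STOPPED;
                }
            }
        }

        public static void main(String[] args) {
            // A component whose stop() always fails, like one that touches an
            // API (e.g. the event subsystem) not initialized during upgrade.
            ResourceComponent flaky = new ResourceComponent() {
                public void start() { /* picks up the upgraded state */ }
                public void stop() { throw new RuntimeException("event subsystem not initialized"); }
            };

            ResourceContainer buggy = new ResourceContainer(flaky);
            buggy.activate();
            try {
                buggy.forceReactivationBuggy();
            } catch (RuntimeException expected) {
                // Inconsistent: prints STARTED although the component is down.
                System.out.println("buggy path: " + buggy.getResourceComponentState());
            }

            ResourceContainer fixed = new ResourceContainer(flaky);
            fixed.activate();
            fixed.forceReactivationFixed(); // survives the failing stop()
            System.out.println("fixed path: " + fixed.getResourceComponentState()); // STARTED
        }
    }

The design choice mirrors the commit message: by routing forced reactivation through a deactivation step that treats a failed stop() as stopped anyway, the container can never report STARTED for a component that has already been torn down.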
release/jon3.1.x
http://git.fedorahosted.org/cgit/rhq/rhq.git/diff/?id=6e84f520cd1ca11b9e7eb784fc6ded2af557ed37
Author: Lukas Krejci <lkrejci>
Date:   Fri Nov 23 17:18:35 2012 +0100

    [BZ 879655] - Properly deactivate the resource instead of just calling the component's stop() method when forcing reactivation of a started component. This will ensure the PC can't enter an inconsistent state in the case the stop() method fails.

    (cherry picked from commit 844f016ee8b2608496d063c94f38461e997bcabe)
Moving to ON_QA as available for test in 3.1.2.ER4 or greater: https://brewweb.devel.redhat.com//buildinfo?buildID=246861
Verified with an upgrade from JON 2.4.0 to JON 3.1.2 ER6. The Apache server remains up after the upgrade, without restarting the agent.