Bug 720674 - RHQ Agent plugin changed resource key format between RHQ 1.3.1 and RHQ 3.0.0
Status: CLOSED CURRENTRELEASE
Product: RHQ Project
Classification: Other
Component: Plugins
Version: 4.1
Hardware: Unspecified, OS: Unspecified
Priority: urgent, Severity: urgent
Assigned To: RHQ Project Maintainer
QA Contact: Mike Foley
Depends On: 736168
Blocks: jon3 720716 rhq41beta
Reported: 2011-07-12 09:05 EDT by Lukas Krejci
Modified: 2012-02-07 14:23 EST (History)

Doc Type: Bug Fix
Last Closed: 2012-02-07 14:23:33 EST

Attachments
image of jopr 2.3.1 to RHQ 4.1 upgrade failure (68.68 KB, image/png), 2011-09-06 17:23 EDT, Mike Foley
RHQ Server log of upgrade failure (69.99 KB, application/octet-stream), 2011-09-06 17:24 EDT, Mike Foley
rhq-installer-dbupgrade.log (83.90 KB, application/octet-stream), 2011-09-06 17:25 EDT, Mike Foley
image of RHQ-AGENT table showing missing BACKFILLED column (83.36 KB, image/png), 2011-09-29 09:59 EDT, Mike Foley
rhq server log with BACKFILLED error (6.70 KB, text/plain), 2011-09-29 10:04 EDT, Mike Foley
verification image (46.73 KB, image/png), 2011-10-14 13:37 EDT, Mike Foley
verification of upgrade log (45.76 KB, application/octet-stream), 2011-10-14 13:38 EDT, Mike Foley

Description Lukas Krejci 2011-07-12 09:05:35 EDT
Description of problem:

The agent plugin mistakenly changed the resource key format in the RHQ 3.0.0 release. As a result, upgraded inventories end up with 2 RHQ Agent resources, which is an error because this is a singleton resource.

Version-Release number of selected component (if applicable):

We have 2 options:

1) Keep the "new" resource key format as it is in RHQ 3.0.x and add the
resource upgrade facet to convert the resource key for users upgrading from
RHQ 1.3.x.

2) Rollback the resource key format to RHQ 1.3.x version and add the
ResourceUpgradeFacet to handle the upgrades from RHQ 3.0.x.

In either case we need to make sure that if there is a resource upgrade
failure (which WILL happen in inventories that have 2 agent resources) the
user is able to take some action - they have to be able to see both agent
resources in the inv. tree / search results so that they can uninventory one
of them.

Alternatively, we can do nothing in the plugin and have some kind of CLI script (or
dbupgrade task) that will handle this. It has to be quite clever, though, because
it has to guess which of the agent resources the user would like to delete: they
might have alerts defined on either (or both), and silently deleting a resource
using a script or the dbupgrade task wouldn't be too user friendly.


How reproducible:
always

Steps to Reproduce:
1. Install RHQ 1.3.1, inventory an RHQ agent resource
2. Upgrade to the latest RHQ version
  
Actual results:
2 agent resources in the inventory

Expected results:
1 agent resource in the inventory

Additional info:
Comment 1 Charles Crouch 2011-07-25 13:30:02 EDT
So I think we should add a ResourceUpgradeFacet here, i.e. not do nothing, and try to deal with the situation. Assuming we fix this in RHQ4.1 either option 1) or 2) seems fine regarding people who:

i)installed rhq1.3.1 (or earlier) fresh and upgrade to rhq4.1
ii)installed rhq3.0 fresh and upgrade to rhq4.1, similar for fresh rhq4.0 installs presumably

There should be no upgrade failures correct? and never any duplicate agents?

The difficulty is for the situation where someone did a fresh install of rhq1.3.1 (or earlier) and then upgraded to rhq3.0 (or later). There are two scenarios right?:

a) the user deleted the duplicate agents somehow (which is hard given they both presumably still show green?). We've got to assume that they were not able to perfectly differentiate between the "old" agents and the "new" agents, so we could potentially have a mix of resource keys.

b) the user didn't delete any duplicates and they still have multiple agent resources per platform.[I guess they don't show up in the tree view, because the agent is a singleton resource type?]

For people who now upgrade to rhq4.1 is there better option between 1) and 2) when it comes to dealing with a) or b) above?
Comment 2 Lukas Krejci 2011-07-25 14:03:54 EDT
(In reply to comment #1)
> So I think we should add a ResourceUpgradeFacet here, i.e. not do nothing, and
> try to deal with the situation. Assuming we fix this in RHQ4.1 either option 1)
> or 2) seems fine regarding people who:
> 
> i)installed rhq1.3.1 (or earlier) fresh and upgrade to rhq4.1
> ii)installed rhq3.0 fresh and upgrade to rhq4.1, similar for fresh rhq4.0
> installs presumably
> 
> There should be no upgrade failures correct? and never any duplicate agents?
> 

Yes, these scenarios should upgrade cleanly.

> The difficulty is for the situation where someone did a fresh install of
> rhq1.3.1 (or earlier) and then upgraded to rhq3.0 (or later). There are two
> scenarios right?:
> 
> a) the user deleted the duplicate agents somehow (which is hard given they both
> presumably still show green?). We've got to assume that they were not able to
> perfectly differentiate between the "old" agents and the "new" agents, so we
> could potentially have a mix of resources keys. 
>

If the user managed to delete one of the agent resources (so that just 1 remains in the inventory), then the resource will be upgraded cleanly. The plugin code needs to be able to upgrade both kinds of resource keys, so it shouldn't be "surprised" by seeing just one resource of either RK format in the inventory. Basically, by removing one of the agent resources, the problem "degrades" to one of the i), ii) or iii).

I assume that by the mix of resource keys you mean the situation where on some platforms a "new" agent was deleted and on other platforms an "old" agent was deleted. This is not a problem, though, because the upgrade is performed on the agent and therefore the code has no knowledge of other platforms. If there is only a single agent resource per platform, we can never have RK clashes, because the RK only has to be unique under its parent resource (the platform in this case).
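
To illustrate the uniqueness argument, here is a minimal Python sketch of a clash check scoped to the parent resource. It is a toy model, not RHQ code: `find_key_clashes`, the tuple layout, and the host names are hypothetical, and the two key strings follow the "<hostname> RHQ Agent" and "RHQ Agent" forms quoted later in this bug.

```python
# Toy model (not RHQ API): a resource key only has to be unique among
# resources of the same type under the same parent resource.
from collections import Counter


def find_key_clashes(resources):
    """Return the (parent, type, key) triples that occur more than once.

    `resources` is an iterable of (parent, resource_type, resource_key) tuples.
    """
    counts = Counter(resources)
    return sorted(triple for triple, n in counts.items() if n > 1)


# Hypothetical inventory: one agent resource per platform, mixed key formats.
inventory = [
    ("host-a", "RHQ Agent", "host-a RHQ Agent"),  # "<hostname> RHQ Agent" form
    ("host-b", "RHQ Agent", "RHQ Agent"),         # bare "RHQ Agent" form
    ("host-c", "RHQ Agent", "RHQ Agent"),         # same key, different parent
]
clashes = find_key_clashes(inventory)
```

In this model, the same key appearing under host-b and host-c is not a clash, because the parent is part of the identity.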
 
> b) the user didn't delete any duplicates and they still have multiple agent
> resources per platform.[I guess they don't show up in the tree view, because
> the agent is a singleton resource type?]
> 
> For people who now upgrade to rhq4.1 is there better option between 1) and 2)
> when it comes to dealing with a) or b) above?

Yes, this is the problematic scenario. I think the options are pretty much equivalent, even from the user's point of view. No matter the RK format, the end result will be the same. Both of the agent resources will be down until one of them is removed by the user and the agent is restarted (which obviously can't be done from the UI, because the resource would be down at the time we need to restart the agent).

We need to make sure, though, that the UI gives people the possibility to view the duplicate resources even if the resource type says that it is a singleton. The other option would be to document the actions one needs to take in the CLI to delete one of the agent resources.
Comment 3 Charles Crouch 2011-08-03 12:58:29 EDT
(In reply to comment #2)

> 
> If the user managed to delete one of the agent resources (so that there remains
> just 1 in the inventory) then the resource will be upgraded cleanly. The plugin
> code needs to be able to upgrade both kind of resource keys so it shouldn't be
> "surprised" by seeing just one resource of either RK format in the inventory.
> Basically by removing one of the agent resources, the problem "degrades" to one
> of the i), ii) or iii). 

What's option iii)? I don't see it specified.

> > b) the user didn't delete any duplicates and they still have multiple agent
> > resources per platform.[I guess they don't show up in the tree view, because
> > the agent is a singleton resource type?]
> > 
> > For people who now upgrade to rhq4.1 is there better option between 1) and 2)
> > when it comes to dealing with a) or b) above?
> 
> Yes, this is the problematic scenario. I think the options are pretty
> equivalent even from the user point of view though. No matter the RK format,
> the end result will be the same. Both of the agent resources will be down until
> one of them is removed by the user and the agent restarted (which obviously
> can't be done from the UI, because the resource would be down at the time we
> need to restart the agent).
> 
> We need to make sure though that the UI gives people the possibility to view
> the duplicate resources even if the resource type says that it is a singleton.
> Other option would be to document the actions one needs to take in the CLI to
> delete one of the agent resources.

It's only the tree that does anything with the singleton property? You should be able to see multiple agents in the inventory view, right?
Part of our upgrade steps could be to manually remove the duplicate agents (either one, it doesn't matter) from the inventory. They should have the same resource name, right? So they should be pretty easy to spot.

So my vote is for...
"2) Rollback the resource key format to RHQ 1.3.x version and add the
ResourceUpgradeFacet to handle the upgrades from RHQ 3.0.x."
That way we can say RHQ3.0 was the aberration and everything else has a resource key that looks like XYZ.
Comment 4 Lukas Krejci 2011-08-04 08:58:53 EDT
(In reply to comment #3)
> (In reply to comment #2)
> 
> > 
> > If the user managed to delete one of the agent resources (so that there remains
> > just 1 in the inventory) then the resource will be upgraded cleanly. The plugin
> > code needs to be able to upgrade both kind of resource keys so it shouldn't be
> > "surprised" by seeing just one resource of either RK format in the inventory.
> > Basically by removing one of the agent resources, the problem "degrades" to one
> > of the i), ii) or iii). 
> 
> Whats option iii) I don't see it specified? 
> 
> > > b) the user didn't delete any duplicates and they still have multiple agent
> > > resources per platform.[I guess they don't show up in the tree view, because
> > > the agent is a singleton resource type?]
> > > 
> > > For people who now upgrade to rhq4.1 is there better option between 1) and 2)
> > > when it comes to dealing with a) or b) above?
> > 
> > Yes, this is the problematic scenario. I think the options are pretty
> > equivalent even from the user point of view though. No matter the RK format,
> > the end result will be the same. Both of the agent resources will be down until
> > one of them is removed by the user and the agent restarted (which obviously
> > can't be done from the UI, because the resource would be down at the time we
> > need to restart the agent).
> > 
> > We need to make sure though that the UI gives people the possibility to view
> > the duplicate resources even if the resource type says that it is a singleton.
> > Other option would be to document the actions one needs to take in the CLI to
> > delete one of the agent resources.
> 
> Its only the tree that does anything with the singleton property? You should be
> to see multiple agents in the inventory view right?
> Part of our upgrade steps could be to manually remove the duplicate agents
> (either one, it doesn't matter) from the inventory. The should have the same
> resource name right? So they should be pretty easy to spot.
> 

-1. We can't know which one of those resources the user "uses", i.e. has alerts on, has added to some groups, etc. I think we have no other option here but to admit that we screwed up and that manual intervention is needed to get things right.

> So my vote, is for...
> "2) Rollback the resource key format to RHQ 1.3.x version and add the
> ResourceUpgradeFacet to handle the upgrades from RHQ 3.0.x."
> That way we can say RHQ3.0 was the aberration and everything else has a
> resource key that looks like XYZ.

Yep that would be my vote, too.
Comment 5 Charles Crouch 2011-08-04 09:31:00 EDT
(In reply to comment #4)
> > Its only the tree that does anything with the singleton property? You should
> > be to see multiple agents in the inventory view right?
> > Part of our upgrade steps could be to manually remove the duplicate agents
> > (either one, it doesn't matter) from the inventory. The should have the same
> > resource name right? So they should be pretty easy to spot.
> > 
> 
> -1. We can't know which one of those resources the user "uses" - i.e. has
> alerts on, added it to some groups, etc. I think we have no other option here
> but to admit that we screwed up and that a manual intervention is needed to get
> things right.
> 

Sorry, that's what I meant: that our upgrade instructions could ask the user to do this manual step. I was just checking that this manual step would be possible, i.e. that we're not hiding one of the agents from all parts of the UI, only the tree.
Comment 6 Lukas Krejci 2011-08-04 11:26:28 EDT
The resource key format has been reverted to the one used by RHQ 1.3.x and before.

When there is only 1 agent in the inventory (on a given platform), the upgrade works seamlessly regardless of the RK format of the agent in the inventory.

If there are 2 agents in the inventory (on a given platform), which means that 1 has the RHQ 1.3.x RK format and the other has RHQ3 RK format, the agent with the RHQ3 RK format is disabled with an upgrade resource error attached.

The situation can then be resolved in 2 ways depending on what agent resource the user wants to get rid of (see the examples of why the user might pick either of the two in the above discussion):

1) Uninventory the RHQ3 based agent. The other agent resource is still available and working, because it has the required RK format so there is nothing more to do.

2) Uninventory the RHQ1.3.x based agent. This is a more complicated scenario. For the agent to recover from an upgrade failure, its plugin container needs to be restarted. The agent resource that remains in the inventory is stopped at that moment, so operations cannot be invoked on it and it cannot be used to trigger the restart. The user therefore has to restart the agent manually on the agent machine (or, if the user has access to that agent's command line, it is enough to issue the 'pc stop' and 'pc start' commands there).

Note that I created an RFE bug 728288 that would help dealing with the second scenario.
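
The upgrade behavior described in this comment can be sketched roughly as follows. This is a hedged Python illustration of the decision logic only, not the actual plugin code (which is Java and goes through the ResourceUpgradeFacet): `plan_agent_upgrade` and its return values are hypothetical, and the key forms "<hostname> RHQ Agent" (RHQ 1.3.x) and "RHQ Agent" (RHQ 3) are taken from the expected results listed in comment 20.

```python
# Sketch of the per-platform upgrade decision (hypothetical names, not RHQ API).
# Assumption: the RHQ 1.3.x key is exactly "<hostname> RHQ Agent" and the
# RHQ 3 key is the bare string "RHQ Agent".


def plan_agent_upgrade(hostname, inventoried_keys):
    """Map each inventoried agent resource key on one platform to an action.

    "keep"          - key already has the RHQ 1.3.x format
    "rename"        - sole agent resource, key is converted to the 1.3.x format
    "upgrade-error" - converting would clash with an existing 1.3.x-format key
    """
    target = f"{hostname} RHQ Agent"
    keys = list(inventoried_keys)
    plan = {}
    for key in keys:
        if key == target:
            plan[key] = "keep"
        elif target in keys:
            # A sibling already owns the target key, so this one cannot be
            # upgraded; the resource is left with an upgrade error.
            plan[key] = "upgrade-error"
        else:
            plan[key] = "rename"
    return plan


single_new = plan_agent_upgrade("host-a", ["RHQ Agent"])
both = plan_agent_upgrade("host-a", ["host-a RHQ Agent", "RHQ Agent"])
```

With a single agent resource the plan never contains an error; with both formats present, only the RHQ 3 style resource is marked as failed, which matches the two cases described above.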

commit 862baa01eeeb72f884a58d185ac1c78cb03e89f1
Author: Lukas Krejci <lkrejci@redhat.com>
Date:   Thu Aug 4 16:20:39 2011 +0200

    BZ720674 - adding a ResourceUpgradeFacet on the Agent discovery component to deal with the RK format change mistakenly introduced in RHQ3
Comment 7 Mike Foley 2011-09-06 17:22:19 EDT
verification is defined as upgrading from jopr 2.3.1 to RHQ 4.1.

This upgrade failed (not related to the original issue).  

attaching screenshot and server log.
Comment 8 Mike Foley 2011-09-06 17:23:10 EDT
Created attachment 521762 [details]
image of jopr 2.3.1 to RHQ 4.1 upgrade failure
Comment 9 Mike Foley 2011-09-06 17:24:02 EDT
Created attachment 521763 [details]
RHQ Server log of upgrade failure
Comment 10 Mike Foley 2011-09-06 17:25:12 EDT
Created attachment 521764 [details]
rhq-installer-dbupgrade.log
Comment 11 Lukas Krejci 2011-09-07 02:13:58 EDT
Mike, could you please create a new bug for that upgrade failure? It seems to be a database upgrade failure, which is a quite serious problem.

We can revisit this issue once the upgrade of RHQ server works again.
Comment 12 Lukas Krejci 2011-09-07 02:25:12 EDT
As to the upgrade failure: the error message suggests that the upgrade failed when applying the changes defined in database version 2.60, which defines a brand new table and a sequence for it.

Mike, are you sure you used a completely clear database when you originally installed JON 2.3.1?
Comment 13 Mike Foley 2011-09-07 07:38:06 EDT
Upgrade failure is now https://bugzilla.redhat.com/show_bug.cgi?id=736168
We will track the upgrade failure on that BZ (736168).  Marking the verification of this BZ blocked by 736168.

Yes, I am sure I used a completely clear database.
Comment 14 Mike Foley 2011-09-26 14:50:27 EDT
BZ 736168 is no longer blocking this BZ
Comment 15 Mike Foley 2011-09-26 15:00:16 EDT
this is verified, as follows:

1) installed RHQ 1.3.  downloaded agent.  installed agent.  inventoried agent.
2) upgraded to RHQ 3.  1 agent was in inventory.
Comment 16 Mike Foley 2011-09-26 16:15:26 EDT
additionally verified upgrading RHQ 3 to RHQ 4.1 ... with aforementioned rhqagent in inventory.
Comment 17 Mike Foley 2011-09-29 09:58:26 EDT
documenting additional information.

On 9/29/2011 I reported that upgrading (RHQ 1.3 --> RHQ 3 --> RHQ 4 master) produced an error in the server log (attached to this BZ). After discussion I learned that the RHQ-AGENT table is missing a column named BACKFILLED (image attached). In the post-scrum discussion of this information on 9/29, I learned that this upgrade path is not a valid test case.
Comment 18 Mike Foley 2011-09-29 09:59:34 EDT
Created attachment 525576 [details]
image of RHQ-AGENT table showing missing BACKFILLED column
Comment 19 Mike Foley 2011-09-29 10:04:57 EDT
Created attachment 525579 [details]
rhq server log with BACKFILLED error
Comment 20 Lukas Krejci 2011-10-12 05:10:12 EDT
(In reply to comment #15)

I'm putting this back to ON_QA just to make sure we don't miss anything.

> this is verified, as follows:
> 
> 1) installed RHQ 1.3.  downloaded agent.  installed agent.  inventoried agent.
> 2) upgraded to RHQ 3.  1 agent was in inventory.

This is what this whole BZ is about. After step 2, you should have had 2 agents in the inventory.

Let me list the steps I think we need to perform. Also, because RHQ actually contains a DB upgrade discrepancy, you need to use JON 2.4.x instead of RHQ 3.0.x.

1) Install JON 2.3.1
2) Install an agent from JON 2.3.1, run it
3) Inventory the agent in JON 2.3.1 inventory
4) Upgrade to JON 2.4.1
5) Run this SQL command against the JON database:

SELECT COUNT(*) FROM RHQ_RESOURCE r, RHQ_RESOURCE_TYPE rt WHERE r.resource_type_id = rt.id AND rt.name = 'RHQ Agent' and rt.plugin='RHQAgent';

This should return "2".
The UI will still show only a single agent.

6) Upgrade to RHQ 4 (I know this currently fails due to db upgrade errors, but this will get resolved eventually)

7) The UI should show both agent resources.


Expected results:

The agent resource w/ resource key in the form "<hostname> RHQ Agent" should have availability up (if the agent is actually running of course).
The agent resource w/ resource key "RHQ Agent" should be permanently down and an upgrade error should be attached to it.
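
The check from step 5 can be tried out against a toy schema. The sketch below uses Python's `sqlite3` with an in-memory database. Only the table names and the `id`, `resource_type_id`, `name` and `plugin` columns come from the query above; the `resource_key` and `parent_resource_id` columns, the platform type row, and all inserted data are assumptions made for illustration, so this is not the real RHQ schema. The second query is a hypothetical variant that groups by parent, which would be more useful than a global count when more than one platform is inventoried.

```python
import sqlite3

# In-memory toy database; column names marked "assumed" are NOT taken from
# the bug report and may differ from the real RHQ schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RHQ_RESOURCE_TYPE (id INTEGER PRIMARY KEY, name TEXT, plugin TEXT);
CREATE TABLE RHQ_RESOURCE (
    id INTEGER PRIMARY KEY,
    resource_type_id INTEGER REFERENCES RHQ_RESOURCE_TYPE(id),
    resource_key TEXT,           -- assumed
    parent_resource_id INTEGER   -- assumed
);
INSERT INTO RHQ_RESOURCE_TYPE VALUES (1, 'Linux', 'Platforms');
INSERT INTO RHQ_RESOURCE_TYPE VALUES (2, 'RHQ Agent', 'RHQAgent');
INSERT INTO RHQ_RESOURCE VALUES (10, 1, 'host-a', NULL);
INSERT INTO RHQ_RESOURCE VALUES (11, 2, 'host-a RHQ Agent', 10);
INSERT INTO RHQ_RESOURCE VALUES (12, 2, 'RHQ Agent', 10);
""")

# The query from step 5, unchanged apart from whitespace.
COUNT_AGENTS = """
SELECT COUNT(*) FROM RHQ_RESOURCE r, RHQ_RESOURCE_TYPE rt
WHERE r.resource_type_id = rt.id
  AND rt.name = 'RHQ Agent' AND rt.plugin = 'RHQAgent'
"""
agent_count = conn.execute(COUNT_AGENTS).fetchone()[0]

# Hypothetical per-platform variant: parents with more than one agent resource.
DUPLICATES_PER_PLATFORM = """
SELECT r.parent_resource_id, COUNT(*)
FROM RHQ_RESOURCE r
JOIN RHQ_RESOURCE_TYPE rt ON r.resource_type_id = rt.id
WHERE rt.name = 'RHQ Agent' AND rt.plugin = 'RHQAgent'
GROUP BY r.parent_resource_id
HAVING COUNT(*) > 1
"""
duplicates = conn.execute(DUPLICATES_PER_PLATFORM).fetchall()
```

Note that the global count in step 5 equals 2 only when a single platform with a duplicated agent is in the inventory, as in the verification setup described here.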
Comment 21 Mike Foley 2011-10-14 10:47:56 EDT
Verification as follows:

step #1) Install JON 2.3.1  = PASS
step #2) install JON 2.3.1 agent, run it = PASS
step #3) Inventory the agent in JON 2.3.1 = PASS
step #4) upgrade to JON 2.4.1 = PASS
step #5) ran the SQL statement ... the result = "1"  ... not "2"  = FAIL.  

discussing with Lukas
Comment 22 Mike Foley 2011-10-14 10:52:22 EDT
additional steps after step #5)  

step 5.1)  download and install JON 2.4.1 agent ... and run it.
step 5.2)  after that you should see another rhq agent in the discovery queue even though you already have one in the inventory
Comment 23 Mike Foley 2011-10-14 11:03:04 EDT
Verification continues as follows:

Step 5.1) download and install JON 2.4.1 agent and run it = PASS
Step 5.2) did not see another rhq agent in the discovery queue   = FAIL
Comment 24 Mike Foley 2011-10-14 12:03:29 EDT
Verification continues as follows:

Step 5.2) *did* see another rhq agent in the discovery queue = PASS

Step 5)  SELECT COUNT(*) FROM RHQ_RESOURCE r, RHQ_RESOURCE_TYPE rt WHERE
r.resource_type_id = rt.id AND rt.name = 'RHQ Agent' and rt.plugin='RHQAgent';

returns "2"  = PASS
Comment 25 Mike Foley 2011-10-14 12:05:13 EDT
additional steps after step 5.2)

5.3)  import into inventory the 2nd RHQ Agent (the one that is in the auto-discovery queue)   


= PASS
Comment 26 Mike Foley 2011-10-14 13:37:11 EDT
step 6 = PASS
step 7 = PASS


attaching image documenting the verification of step #7.  
attaching log file of dbupgrade to document success.
Comment 27 Mike Foley 2011-10-14 13:37:55 EDT
Created attachment 528247 [details]
verification image
Comment 28 Mike Foley 2011-10-14 13:38:32 EDT
Created attachment 528248 [details]
verification of upgrade log
Comment 29 Mike Foley 2012-02-07 14:23:33 EST
changing status of VERIFIED BZs for JON 2.4.2 and JON 3.0 to CLOSED/CURRENTRELEASE
