Bug 881848

Summary:	Resource availability type remains in state unknown after application of workaround from bug 865166
Product:	[Other] RHQ Project	Reporter:	Filip Brychta <fbrychta>
Component:	Agent, Core Server	Assignee:	Jay Shaughnessy <jshaughn>
Status:	ON_QA ---	QA Contact:
Severity:	high	Docs Contact:
Priority:	high
Version:	4.5	CC:	hrupp, loleary, mazz
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	916373		Environment:
Last Closed:		Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	916373

Description Filip Brychta 2012-11-29 16:54:50 UTC

Description of problem:
See bug 865166. After applying the query provided there, which fixes the inconsistent RHQ_RESOURCE_AVAIL table, the affected resource's availability type remains UNKNOWN. Invoking the 'avail -f' command on the agent does not fix the state.

Version-Release number of selected component (if applicable):
3.1.2.ER2

How reproducible:
Always

Steps to Reproduce:
1. Clean installation of JON; the agent is running and imported into inventory.
2. Remove the RHQ Agent's row from the RHQ_RESOURCE_AVAIL table (you can use step 4 from bug 877176), which brings the database into the inconsistent state described in bug 865166.
3. Apply the workaround from bug 865166 (the INSERT INTO RHQ_RESOURCE_AVAIL... query).
4. Refresh the page showing RHQ Agent availability -> availability is unknown.
5. Invoke the agent command 'avail -f'.
6. Refresh the page showing RHQ Agent availability -> availability is still unknown.

Actual results:
RHQ Agent availability remains unknown

Expected results:
After 'avail -f', the RHQ Agent's availability is UP.

Additional info:
This is an edge case, and the issue can easily be worked around by restarting the agent or by uninventorying and re-inventorying the resource. But is there any chance this could lead to a more general issue in the availability scan?

The same state can be reached by following the steps in bug 877176, with one difference: upgrade to 3.1.2.ER2 instead of JON 3.1.1.

Comment 1 John Mazzitelli 2013-02-25 20:54:46 UTC

this query (followed by running the "Execute Availability Scan" operation with changes-only=false) will fix it too:

INSERT INTO RHQ_AVAILABILITY ( ID, RESOURCE_ID, START_TIME, END_TIME, AVAILABILITY_TYPE )
SELECT RHQ_AVAILABILITY_ID_SEQ.nextval, res.ID, 0, NULL, 2
 FROM RHQ_RESOURCE res
WHERE NOT EXISTS ( SELECT * FROM RHQ_AVAILABILITY WHERE RESOURCE_ID = res.ID )

Comment 2 John Mazzitelli 2013-02-25 21:07:28 UTC

the previous query was for Oracle; this one is for Postgres:

INSERT INTO RHQ_AVAILABILITY ( ID, RESOURCE_ID, START_TIME, END_TIME, AVAILABILITY_TYPE )
   SELECT nextval('RHQ_AVAILABILITY_ID_SEQ'::text), res.ID, 0, NULL, 2
   FROM RHQ_RESOURCE res
  WHERE NOT EXISTS ( SELECT * FROM RHQ_AVAILABILITY WHERE RESOURCE_ID = res.ID )

we need to add a new dbupgrade step for this
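To see which resources are affected before (or after) running either INSERT, the NOT EXISTS condition shared by both queries can be used on its own as a read-only check. This is a sketch that relies only on the RHQ_RESOURCE.ID and RHQ_AVAILABILITY.RESOURCE_ID columns already referenced above; it uses no vendor-specific syntax, so it should run unchanged on both Oracle and Postgres. Once the repair INSERT has been applied, it would be expected to return no rows.

```sql
-- Read-only check: list resources that have no row at all in RHQ_AVAILABILITY.
-- These are exactly the resources the INSERT statements above would backfill.
SELECT res.ID
  FROM RHQ_RESOURCE res
 WHERE NOT EXISTS ( SELECT * FROM RHQ_AVAILABILITY WHERE RESOURCE_ID = res.ID );
```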

Comment 3 John Mazzitelli 2013-02-25 21:29:20 UTC

assigning to jay

Comment 4 Jay Shaughnessy 2013-02-25 21:39:39 UTC

For installations that have already been upgraded, the query above must be executed to repair the situation. For versions that include the forthcoming commit, the db-upgrade will take care of it.

A full avail report (or an agent restart) is required to move the avail from UNKNOWN to the actual availability state.
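To confirm that the full avail report actually moved a resource out of UNKNOWN, its open-ended availability row can be inspected. This is a minimal sketch under two assumptions: that the current availability is the row whose END_TIME is NULL (as in the rows created by the INSERT statements above), and that AVAILABILITY_TYPE 2 means UNKNOWN, which is consistent with the commit message in comment 5. The resource ID 10001 is only an example, taken from the server log in comment 6; substitute the ID of the affected resource.

```sql
-- Show the current (open-ended) availability row for one resource.
-- 10001 is an example ID from the log in comment 6; substitute your own.
-- Assumption: AVAILABILITY_TYPE 2 = UNKNOWN; after a full avail report the
-- value is expected to reflect the resource's actual state instead.
SELECT RESOURCE_ID, START_TIME, END_TIME, AVAILABILITY_TYPE
  FROM RHQ_AVAILABILITY
 WHERE RESOURCE_ID = 10001
   AND END_TIME IS NULL;
```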

Comment 5 Jay Shaughnessy 2013-02-26 01:14:29 UTC

master commit 068f664483a2013fa84123cb6b6ba85b54ee7c5c
Jay Shaughnessy <jshaughn>
Mon Feb 25 16:36:41 2013 -0500

Ensure after upgrade that all resources, including those in the ADQ, have
at least an initial UNKNOWN Availability.


Test Notes:
See above comments.

Comment 6 John Mazzitelli 2013-02-26 14:49:23 UTC

for the record:

I tried this upgrade from JON 3.0.0 -> 3.1.2 and, rather than run the SQL query manually to correct the DB, I simply restarted the agent in the hope that it would send up a full avail report and self-correct the DB.

However, this did not happen. The RHQ_AVAILABILITY table is still missing the rows even after the agent restart. The server also spit out these messages when it got the restarted agent's avail report:

09:38:20,048 INFO  [AvailabilityManagerBean] Skipping mergeAvailabilityReport() for stale resource [Resource[id=10001, uuid=null, type=<null>, key=null, name=null, parent=<null>]]. These messages should go away after the next agent synchronization with the server.

In short, to fix this problem you have to execute that SQL statement manually after you upgrade to 3.1.2. You only need to do this IF, AFTER the 3.1.2 upgrade, you committed resources that were in the discovery queue pre-upgrade (that is, while 3.0.0 was running).