Bug 865167

Summary: Resource with AvailabilityType of NULL is being returned to UI resulting in ResourceDatasource.copyValues throwing org_rhq_core_domain_resource_Resource_currentAvailability is null
Product: [Other] RHQ Project Reporter: Larry O'Leary <loleary>
Component: Core Server, DatabaseAssignee: Jay Shaughnessy <jshaughn>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.4CC: hrupp, jshaughn, mazz
Target Milestone: ---   
Target Release: RHQ 4.6   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 865166 Environment:
Last Closed: 2013-09-03 10:45:31 EDT Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Bug Depends On:    
Bug Blocks: 865166    

Description Larry O'Leary 2012-10-10 18:22:49 EDT
+++ This bug was initially created as a clone of JBoss ON Bug #865166 +++

Description of problem:
When accessing the *Inventory* pages such as *Platforms*, the list of resources is not created and instead the *Message Center* displays an error indicating `Globally uncaught exception` with the following stack:

    com.google.gwt.core.client.JavaScriptException:(TypeError): from.org_rhq_core_domain_resource_Resource_currentAvailability is null
    --- STACK TRACE FOLLOWS ---
    (TypeError): from.org_rhq_core_domain_resource_Resource_currentAvailability is null
    stack: org_rhq_enterprise_gui_coregui_client_inventory_resource_ResourceDatasource_$copyValues__Lorg_rhq_enterprise_gui_coregui_client_inventory_resource_ResourceDatasource_2Lorg_rhq_core_domain_resource_Resource_2Lcom_smartgwt_client_widgets_grid_ListGridRecord_2@https://myjon.server:7443/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/916AC7929C193E9E602839783B9B0857.cache.html:149635
        org_rhq_enterprise_gui_coregui_client_inventory_resource_ResourceDatasource_copyValues__Lorg_rhq_core_domain_resource_Resource_2Lcom_smartgwt_client_widgets_grid_ListGridRecord_2@https://myjon.server:7443/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/916AC7929C193E9E602839783B9B0857.cache.html:149754
        org_rhq_enterprise_gui_coregui_client_inventory_resource_ResourceDatasource_copyValues__Ljava_lang_Object_2Lcom_smartgwt_client_widgets_grid_ListGridRecord_2@https://myjon.server:7443/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/916AC7929C193E9E602839783B9B0857.cache.html:149758
        org_rhq_enterprise_gui_coregui_client_util_RPCDataSource_copyValues__Ljava_lang_Object_2ZLcom_smartgwt_client_widgets_grid_ListGridRecord_2@https://myjon.server:7443/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/916AC7929C193E9E602839783B9B0857.cache.html:64843
        org_rhq_enterprise_gui_coregui_client_util_RPCDataSource_$buildRecords__Lorg_rhq_enterprise_gui_coregui_client_util_RPCDataSource_2Ljava_util_Collection_2Z_3Lcom_smartgwt_client_widgets_grid_ListGridRecord_2@https://myjon.server:7443/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/916AC7929C193E9E602839783B9B0857.cache.html:64557
        org_rhq_enterprise_gui_coregui_client_inventory_resource_ResourceDatasource$2_onTypesLoaded__Ljava_util_Map_2V@https://myjon.server:7443/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/916AC7929C193E9E602839783B9B0857.cache.html:149846
        ...
        @https://myjon.server:7443/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/916AC7929C193E9E602839783B9B0857.cache.html:2078
        fileName: https://myjon.server:7443/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/916AC7929C193E9E602839783B9B0857.cache.html
        lineNumber: 149635
           at Unknown.anonymous(Unknown Source)
           at Unknown.anonymous(Unknown Source)
           ...



This error is due to one or more platform resources in the RHQ_RESOURCE table not having a corresponding row in the RHQ_RESOURCE_AVAIL or RHQ_AVAILABILITY tables.

Version-Release number of selected component (if applicable):
3.1.1

How reproducible:
Always

Steps to Reproduce:
Not certain at this point but the hypothesis is that the impacted resource had been committed to inventory. Then, the resource or its parent resource was removed from inventory using *UNINVENTORY*. Then, the server was shutdown resulting in a partial deletion of the affected resource from the database.

A similar or the same scenario was reported when an upgrade of the JBoss ON server was being performed. First, with the server running (not sure if the agent was running or not) the agent's platform was removed from inventory and the server was immediately shutdown in preparation for an upgrade. The upgrade was performed. After the upgrade, the inventory pages would no longer work.

  
Additional info:
We know that in JBoss ON 3.1 `null` availability was no longer a valid AvailabilityType and a column constraint was added to the database table to prevent `null` from being a valid value. However, what we did not take into consideration is that many of the queries retrieve availability for a resource using a LEFT OUTER JOIN. The result of a LEFT OUTER JOIN if the resource is on the left and the resource's availability is on the right would be a result that contained an availability type value of `null`. So, perhaps this underlying partial delete issue has been present for quite some time but not detected because previously we had expected availability to be `null` and treat it as UNKNOWN.

However, the underlying issue is why the partial delete. How does the RHQ_RESOURCE table contain a resource row that does not have a corresponding RHQ_RESOURCE_AVAIL or RHQ_AVAILABILITY row. Perhaps:

 - Availability has not yet been collected for this resource or an availability report has not yet been processed by the server?

   However, in that case, we would expect the inventory status to be something other then COMMITTED. Perhaps, NEW or UNCOMMITED or DELETED, but not COMMITTED. But maybe this can and does happen? When I first import a resource I would expect that not only do I add a row to RHQ_RESOURCE but also a row to both RHQ_RESOURCE_AVAIL and RHQ_AVAILABILITY with an availability of UNKNOWN to indicate that we don't know what it is yet.

 - When removing/deleting a resource from inventory, the last place we delete is the RHQ_RESOURCE table

   The expectation would be a transaction would prevent data from being removed from dependent tables but we know that the deletion of a resource and its data (metrics, calltime, events, availability, content, etc) happens asynchronously and most likely in chunks to prevent transaction timeouts. The process is very time consuming.
Comment 1 Jay Shaughnessy 2012-10-16 14:17:11 EDT
master commit c2b37fd00701018fd5df3f99879c6cf4c2fddc4c
Jay Shaughnessy <jshaughn@redhat.com>
Tue Oct 16 14:04:41 2012 -0400

I believe this is the cause of the failure.  In certain cases a legacy db
may not have an rhq_resource_avail record for a resource.  It is now eagerly
created but in the past it was possible to have resource without the
record. The old codepath for initializing resource avail for these resources
had an issue.


** NOTE!!! Once the issue occurs, the code fix will will not automatically solve
** the problem.  This workaround code below should be applied.  The fix will
** only help future upgrades, not upgrades that have already taken place.

Workaround SQL that can be applied to databases suffering from this issue:

postgres:
INSERT INTO RHQ_RESOURCE_AVAIL ( ID, RESOURCE_ID, AVAILABILITY_TYPE )
    SELECT nextval('RHQ_RESOURCE_AVAIL_ID_SEQ'::text), res.ID, 2
      FROM RHQ_RESOURCE res
 LEFT JOIN RHQ_RESOURCE_AVAIL avail ON res.ID = avail.RESOURCE_ID
     WHERE avail.ID IS NULL


oracle:
INSERT INTO RHQ_RESOURCE_AVAIL ( ID, RESOURCE_ID, AVAILABILITY_TYPE )
    SELECT RHQ_RESOURCE_AVAIL_ID_SEQ.nextval, res.ID, 2
      FROM RHQ_RESOURCE res
 LEFT JOIN RHQ_RESOURCE_AVAIL avail ON res.ID = avail.RESOURCE_ID
     WHERE avail.ID IS NULL


Test Notes:
Not really easy to test. I'd suggest probably just moving this to VERIFIED if the relevant customer cases are resolved.
Comment 2 Jay Shaughnessy 2012-10-16 14:18:16 EDT
DevNote:  If this resolves the issues, the query above could be moved to
the db-upgrade.xml script and the codepath could be eliminated completely.
Comment 3 Charles Crouch 2012-11-06 16:08:44 EST
As per 865166 db upgrade needs to be updated with the sql below
Comment 4 John Mazzitelli 2012-11-13 11:03:15 EST
(In reply to comment #3)
> As per 865166 db upgrade needs to be updated with the sql below

git commit to master 4b888be16b1ae2ceb876e85cc82001bbdc79e516 adds the sql to the db-upgrade script.
Comment 5 Heiko W. Rupp 2013-09-03 10:45:31 EDT
Bulk closing of issues in old RHQ releases that are in production for a while now.

Please open a new issue when running into an issue.