Bug 815341

Summary: groups display availability as green when member resources are in unknown state
Product: [Other] RHQ Project
Reporter: Sunil Kondkar <skondkar>
Component: Core UI
Assignee: John Mazzitelli <mazz>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: medium
Priority: urgent
Version: 4.4
CC: hrupp, jshaughn, mazz
Target Release: RHQ 4.4.0
Hardware: Unspecified
OS: Unspecified
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=819897
Doc Type: Bug Fix
Last Closed: 2013-08-31 05:55:16 EDT
Type: Bug
Bug Blocks: 782579
Attachments:
Description                                  Flags
Screenshot                                   none
Screenshot_GroupListView                     none
db performance tests before and after fix    none

Description Sunil Kondkar 2012-04-23 08:17:40 EDT
Description of problem:

Auto-groups such as 'File Systems' or 'JBossAS Servers' display their availability as green when their member resources are in the unknown state (i.e., after the agent is shut down, the platform is down and its resources are unknown).

Please refer to the attached screenshot.

Version-Release number of selected component (if applicable):

Build#1386 (Version: 4.4.0-SNAPSHOT Build Number: 28e565c)

How reproducible:

Always

Steps to Reproduce:

1. Shut down the agent through UI (RHQ Agent->Operations->New->Shutdown Agent)
2. The platform goes down and servers and services go unknown immediately.
3. Navigate to the resource tree and click on an auto-group (e.g., CPUs, 'File Systems', 'JBossAS Servers').
4. Auto-groups display availability as green.
  
Actual results:

Auto-groups display availability as green when the platform is down (member resources in unknown state).

Expected results:

Auto-groups should display the correct availability status (unknown).

Additional info:
Comment 1 Sunil Kondkar 2012-04-23 08:18:44 EDT
Created attachment 579518
Screenshot
Comment 2 Sunil Kondkar 2012-04-24 08:16:50 EDT
Additional info:

Observed the same issue with compatible groups.

Created a compatible group of resources and shut down the agent. When the platform is down (member resources in unknown state), availability is displayed as green in the Children and Descendants columns of the compatible group list view.

Please refer to Screenshot_GroupListView.
Comment 3 Sunil Kondkar 2012-04-24 08:17:24 EDT
Created attachment 579836
Screenshot_GroupListView
Comment 4 John Mazzitelli 2012-04-25 14:17:12 EDT
There is clearly a problem in both cases (I verified the behavior on master), obviously due to the changes in the avail subsystem.

The question is: what should it be? Do we show red/down, or do we show a (?) in the UI? That is, the big green check mark is wrong, but should it be the big red circle or a big grey (?)?

Note that we are talking about groups. In this case, ALL members have UNKNOWN avail, which is where the issue is. I think if at least one is down, you'll see a yellow triangle; I need to confirm this. But the fact is that this BZ reports what happens when ALL resources are unknown. A second issue would be to confirm that the behavior is OK if only SOME, but not all, are unknown.
Comment 5 John Mazzitelli 2012-04-25 15:18:21 EDT
I think the issue is in:

org.rhq.core.domain.resource.group.composite.ResourceGroupComposite.getAvailabilityType(boolean)

specifically when count > 0 but down and disabled are both 0, as is the case here:

ResourceGroupComposite[name=all agents, implicit[count/down/disabled=,1/0/0], explicit[count/down/disabled=,1/0/0], facets=[CONFIGURATION, PLUGIN_CONFIGURATION, MEASUREMENT, OPERATION, SUPPORT, EVENT]]

(that's the toString of the composite that I see in the debugger)
Comment 6 John Mazzitelli 2012-04-26 11:23:59 EDT
I created a compatible group of 4 CPUs (my platform is a quad-core machine, so I have 4 CPUs). I shut down my agent, so all 4 CPUs are UNKNOWN. I then use a SQL tool to query and update the RHQ_RESOURCE_AVAIL table: I change the availability of one or two of my CPU resources from 2 (UNKNOWN) to either 0 (DOWN), 1 (UP) or 3 (DISABLED), and I refresh my browser to see how the changes look in the UI. Here's what we see:

* CPU #3 and CPU #4 remain UNKNOWN in all tests - I only change #1 or both #1 and #2.
* GroupAvailIcon is the big avail icon you see in the summary area when viewing the group (go to the group's Inventory tab for example and look at the top right of the page)
* GroupListAvailIcons is the set of icons you see in the table/list when viewing all compatible groups (go to Inventory>CompatibleGroups link)

CPU #1    CPU #2   GroupAvailIcon   GroupListAvailIcons
=======================================================
UNKNOWN   UNKNOWN  GREEN            GREEN (4)
DISABLED  UNKNOWN  DISABLED         GREEN (3) DISABLED (1)
RED       UNKNOWN  YELLOW TRIANGLE  GREEN (3) RED (1)
GREEN     UNKNOWN  GREEN            GREEN (4)
GREEN     RED      YELLOW TRIANGLE  GREEN (3) RED (1)
GREEN     DISABLED DISABLED         GREEN (3) DISABLED (1)
RED       DISABLED YELLOW TRIANGLE  GREEN (2) RED (1) DISABLED (1)
DISABLED  DISABLED DISABLED         GREEN (2) DISABLED (2)

So it's clear that any resource that is UNKNOWN is, for all intents and purposes, treated as GREEN, or at best ignored when determining which icons/state to use.
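Every row of the table above can be reproduced by a small rule set that only looks at the three numbers the composite carries per member set (member count, DOWN count, DISABLED count). The sketch below is a reconstruction inferred from the observed rows, not the actual body of ResourceGroupComposite.getAvailabilityType(boolean); the class, enum and method names are hypothetical.

```java
// Hypothetical reconstruction of the pre-fix group availability rules, inferred
// only from the observed table; this is NOT the RHQ source code.
class PreFixGroupAvailSketch {

    enum GroupIcon { EMPTY, UP, DOWN, WARN, DISABLED }

    // Only three numbers are available: total members, members DOWN, members DISABLED.
    static GroupIcon groupIcon(long count, long down, long disabled) {
        if (count == 0) {
            return GroupIcon.EMPTY;      // no members at all
        }
        if (down == count) {
            return GroupIcon.DOWN;       // every member is DOWN
        }
        if (down > 0) {
            return GroupIcon.WARN;       // some, but not all, members are DOWN (yellow triangle)
        }
        if (disabled > 0) {
            return GroupIcon.DISABLED;   // nothing DOWN, at least one DISABLED
        }
        return GroupIcon.UP;             // everything else, including UNKNOWN, looks UP
    }

    // The list view's green count is whatever is neither DOWN nor DISABLED,
    // so UNKNOWN members are silently counted as GREEN.
    static long greenCount(long count, long down, long disabled) {
        return count - down - disabled;
    }

    public static void main(String[] args) {
        // First table row: all four CPUs UNKNOWN -> count=4, down=0, disabled=0
        System.out.println(groupIcon(4, 0, 0) + " GREEN(" + greenCount(4, 0, 0) + ")"); // prints "UP GREEN(4)"
    }
}
```

Because UNKNOWN members fall into neither the DOWN nor the DISABLED count, they land in the same residual bucket as UP members, which is exactly what the first row of the table shows.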
Comment 7 John Mazzitelli 2012-04-26 11:44:24 EDT
Just wanted to document this - this is the JPQL used to query the resource group composite:

SELECT  new org.rhq.core.domain.resource.group.composite.ResourceGroupComposite(

( SELECT COUNT(avail) FROM resourcegroup.explicitResources res JOIN res.currentAvailability avail ) AS explicitCount,
( SELECT COUNT(avail) FROM resourcegroup.explicitResources res JOIN res.currentAvailability avail WHERE avail.availabilityType = 0 ) AS explicitDown,
( SELECT COUNT(avail) FROM resourcegroup.explicitResources res JOIN res.currentAvailability avail WHERE avail.availabilityType = 3 ) AS explicitDisabled,
( SELECT COUNT(avail) FROM resourcegroup.implicitResources res JOIN res.currentAvailability avail ) AS implicitCount,
( SELECT COUNT(avail) FROM resourcegroup.implicitResources res JOIN res.currentAvailability avail WHERE avail.availabilityType = 0 ) AS implicitDown,
( SELECT COUNT(avail) FROM resourcegroup.implicitResources res JOIN res.currentAvailability avail WHERE avail.availabilityType = 3 ) AS implicitDisabled,
resourcegroup ) 

FROM ResourceGroup resourcegroup
WHERE ( resourcegroup.groupCategory = :groupCategory 
AND resourcegroup.visible = :visible )

(NOTE :visible = true and :groupCategory is the "Compatible Group" entity/enum)
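As a rough model of what one set of three subselects above computes (explicit or implicit members), here is an in-memory equivalent over a list of member availability codes, using the codes from Comment 6 (0 = DOWN, 1 = UP, 2 = UNKNOWN, 3 = DISABLED). It assumes every member has exactly one current availability row (the inner join would drop members without one); the class, record and method names are hypothetical.

```java
import java.util.List;

// Hypothetical in-memory model of the three counts the JPQL computes per member set.
// Availability codes: 0 = DOWN, 1 = UP, 2 = UNKNOWN, 3 = DISABLED.
class GroupCountsSketch {

    record Counts(long count, long down, long disabled) { }

    static Counts countAvail(List<Integer> memberAvailTypes) {
        long count = memberAvailTypes.size();                                  // COUNT(avail)
        long down = memberAvailTypes.stream().filter(t -> t == 0).count();     // ... WHERE availabilityType = 0
        long disabled = memberAvailTypes.stream().filter(t -> t == 3).count(); // ... WHERE availabilityType = 3
        return new Counts(count, down, disabled);
    }

    public static void main(String[] args) {
        // Four CPUs all UNKNOWN vs. four CPUs all UP: the counts are identical.
        System.out.println(countAvail(List.of(2, 2, 2, 2))); // Counts[count=4, down=0, disabled=0]
        System.out.println(countAvail(List.of(1, 1, 1, 1))); // Counts[count=4, down=0, disabled=0]
    }
}
```

Nothing counts availabilityType = 2, so an all-UNKNOWN group produces the same numbers as an all-UP group. A separate count for UNKNOWN is the kind of value a new subselect would have to supply (an assumption; the exact subselects added by the fix are not listed in this bug).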
Comment 8 John Mazzitelli 2012-04-26 12:20:58 EDT
this is the actual SQL (not JPQL):

select

(select count(resourceav3_.ID) from RHQ_RESOURCE_GROUP_RES_EXP_MAP explicitre1_, RHQ_RESOURCE resource2_ inner join RHQ_RESOURCE_AVAIL resourceav3_ on resource2_.ID=resourceav3_.RESOURCE_ID where resourcegr0_.ID=explicitre1_.RESOURCE_GROUP_ID and explicitre1_.RESOURCE_ID=resource2_.ID) as col_0_0_,

(select count(resourceav6_.ID) from RHQ_RESOURCE_GROUP_RES_EXP_MAP explicitre4_, RHQ_RESOURCE resource5_ inner join RHQ_RESOURCE_AVAIL resourceav6_ on resource5_.ID=resourceav6_.RESOURCE_ID where resourcegr0_.ID=explicitre4_.RESOURCE_GROUP_ID and explicitre4_.RESOURCE_ID=resource5_.ID and resourceav6_.AVAILABILITY_TYPE=0) as col_1_0_,

(select count(resourceav9_.ID) from RHQ_RESOURCE_GROUP_RES_EXP_MAP explicitre7_, RHQ_RESOURCE resource8_ inner join RHQ_RESOURCE_AVAIL resourceav9_ on resource8_.ID=resourceav9_.RESOURCE_ID where resourcegr0_.ID=explicitre7_.RESOURCE_GROUP_ID and explicitre7_.RESOURCE_ID=resource8_.ID and resourceav9_.AVAILABILITY_TYPE=3) as col_2_0_,

(select count(resourceav12_.ID) from RHQ_RESOURCE_GROUP_RES_IMP_MAP implicitre10_, RHQ_RESOURCE resource11_ inner join RHQ_RESOURCE_AVAIL resourceav12_ on resource11_.ID=resourceav12_.RESOURCE_ID where resourcegr0_.ID=implicitre10_.RESOURCE_GROUP_ID and implicitre10_.RESOURCE_ID=resource11_.ID) as col_3_0_,

(select count(resourceav15_.ID) from RHQ_RESOURCE_GROUP_RES_IMP_MAP implicitre13_, RHQ_RESOURCE resource14_ inner join RHQ_RESOURCE_AVAIL resourceav15_ on resource14_.ID=resourceav15_.RESOURCE_ID where resourcegr0_.ID=implicitre13_.RESOURCE_GROUP_ID and implicitre13_.RESOURCE_ID=resource14_.ID and resourceav15_.AVAILABILITY_TYPE=0) as col_4_0_,

(select count(resourceav18_.ID) from RHQ_RESOURCE_GROUP_RES_IMP_MAP implicitre16_, RHQ_RESOURCE resource17_ inner join RHQ_RESOURCE_AVAIL resourceav18_ on resource17_.ID=resourceav18_.RESOURCE_ID where resourcegr0_.ID=implicitre16_.RESOURCE_GROUP_ID and implicitre16_.RESOURCE_ID=resource17_.ID and resourceav18_.AVAILABILITY_TYPE=3) as col_5_0_,

resourcegr0_.ID as col_6_0_

from RHQ_RESOURCE_GROUP resourcegr0_
where resourcegr0_.CATEGORY='COMPATIBLE' and resourcegr0_.visible = true
Comment 9 John Mazzitelli 2012-04-27 10:32:10 EDT
To repeat the current behavior:

CPU #1    CPU #2   GroupAvailIcon   GroupListAvailIcons
=======================================================
UNKNOWN   UNKNOWN  GREEN            GREEN (4)
DISABLED  UNKNOWN  DISABLED         GREEN (3) DISABLED (1)
RED       UNKNOWN  YELLOW TRIANGLE  GREEN (3) RED (1)
GREEN     UNKNOWN  GREEN            GREEN (4)
GREEN     RED      YELLOW TRIANGLE  GREEN (3) RED (1)
GREEN     DISABLED DISABLED         GREEN (3) DISABLED (1)
RED       DISABLED YELLOW TRIANGLE  GREEN (2) RED (1) DISABLED (1)
DISABLED  DISABLED DISABLED         GREEN (2) DISABLED (2)

Ignoring implementation details, here's what I will try to get the UI to do (remember, this group has *four* members - in all of my tests, CPU resources #3 and #4 are UNKNOWN)

CPU #1    CPU #2   GroupAvailIcon   GroupListAvailIcons
=======================================================
UNKNOWN   UNKNOWN  YELLOW TRIANGLE* UNKNOWN (4)*
DISABLED  UNKNOWN  YELLOW TRIANGLE* UNKNOWN (3)* DISABLED (1)
RED       UNKNOWN  YELLOW TRIANGLE  UNKNOWN (3)* RED (1)
GREEN     UNKNOWN  YELLOW TRIANGLE* UNKNOWN (3)* GREEN (1)*
GREEN     RED      YELLOW TRIANGLE  UNKNOWN (2)* GREEN (1)* RED (1)
GREEN     DISABLED YELLOW TRIANGLE* UNKNOWN (2)* GREEN (1)* DISABLED (1)
RED       DISABLED YELLOW TRIANGLE  UNKNOWN (2)* RED (1) DISABLED (1)
DISABLED  DISABLED YELLOW TRIANGLE* UNKNOWN (2)* DISABLED (2)

Entries marked with (*) are new behavior, different from the original behavior.
So you can see here that UNKNOWN is no longer ignored or treated as GREEN. The group icon will now show the yellow triangle "Warning" icon, which seems more appropriate. If a resource is in an UNKNOWN state, we should visually warn the user that something seems amiss: presumably the user wants to know the state of all of their resources, and a resource whose state is unknown to the monitoring system is not something the user will consider normal (or green), so the user should be visually alerted to that condition.

If nothing is UNKNOWN, this is the behavior:

IF....                              THEN THE GROUP AVAIL ICON IS....
=============================================================================
All resources DISABLED              ... DISABLED
All resources RED                   ... RED
All resources GREEN                 ... GREEN
Some resources DISABLED, some RED   ... YELLOW TRIANGLE
Some resources GREEN, some RED      ... YELLOW TRIANGLE
Some resources GREEN, some DISABLED ... YELLOW TRIANGLE
Some resources GREEN, RED, DISABLED ... YELLOW TRIANGLE

If all are UNKNOWN, the group avail icon should be UNKNOWN (today it shows as GREEN; this is something we should change).
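The proposed rules can be written as a small decision function over per-type member counts. This is a hedged sketch, not the committed fix: the names are hypothetical, and for the all-UNKNOWN case it follows the table (yellow triangle), which is also what Comment 16 reports, rather than the UNKNOWN icon suggested in the closing sentence above.

```java
// Hypothetical sketch of the proposed group icon rules from the tables in this comment;
// names and signature are illustrative, not RHQ's API.
class ProposedGroupAvailSketch {

    enum GroupIcon { EMPTY, UP, DOWN, WARN, DISABLED }

    static GroupIcon groupIcon(long up, long down, long unknown, long disabled) {
        long count = up + down + unknown + disabled;
        if (count == 0) {
            return GroupIcon.EMPTY;      // no members
        }
        if (unknown > 0) {
            return GroupIcon.WARN;       // any UNKNOWN member triggers the warning
        }
        if (up == count) {
            return GroupIcon.UP;         // all GREEN
        }
        if (down == count) {
            return GroupIcon.DOWN;       // all RED
        }
        if (disabled == count) {
            return GroupIcon.DISABLED;   // all DISABLED
        }
        return GroupIcon.WARN;           // any mix of GREEN, RED and DISABLED
    }

    public static void main(String[] args) {
        System.out.println(groupIcon(0, 0, 4, 0)); // WARN: all four CPUs UNKNOWN
        System.out.println(groupIcon(0, 0, 0, 4)); // DISABLED: all DISABLED, nothing UNKNOWN
        System.out.println(groupIcon(2, 0, 0, 2)); // WARN: GREEN and DISABLED mixed
    }
}
```

Checking UNKNOWN first is what makes every (*) row in the proposed table come out as a yellow triangle; the remaining branches reproduce the "If nothing is UNKNOWN" table.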
Comment 10 John Mazzitelli 2012-04-27 12:25:01 EDT
I want to record this JPQL here. I experimented with GROUP BY queries to see if there is a better way; although I didn't find one, here is a GROUP BY query we can build on later if we want to try for a faster/better set of queries to get group avail:

SELECT g.id, ra.availabilityType, count(ra.availabilityType)
FROM ResourceGroup g LEFT JOIN g.implicitResources res LEFT JOIN res.currentAvailability ra
WHERE g.groupCategory = 'COMPATIBLE' AND g.visible = true
GROUP BY g.id, ra.availabilityType
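If a GROUP BY query like this were used in place of the per-type subselects, its rows would still need to be pivoted into per-group counts. A minimal sketch, assuming each row has already been mapped to (group id, availability type code, count) with the codes from Comment 6, and that a null type stands for the LEFT JOIN placeholder row (a group with no members, or a member without an availability row); the record and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing of the GROUP BY result rows.
// Availability codes: 0 = DOWN, 1 = UP, 2 = UNKNOWN, 3 = DISABLED.
class GroupByPivotSketch {

    record Row(int groupId, Integer availType, long count) { }

    // Returns groupId -> long[4] indexed by availability type code.
    static Map<Integer, long[]> pivot(List<Row> rows) {
        Map<Integer, long[]> countsByGroup = new HashMap<>();
        for (Row row : rows) {
            long[] counts = countsByGroup.computeIfAbsent(row.groupId(), id -> new long[4]);
            if (row.availType() != null) {   // null type: LEFT JOIN placeholder, counted as 0
                counts[row.availType()] += row.count();
            }
        }
        return countsByGroup;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row(10, 0, 1),      // group 10: one DOWN
                new Row(10, 2, 3),      // group 10: three UNKNOWN
                new Row(11, null, 0));  // group 11: no members
        Map<Integer, long[]> result = pivot(rows);
        System.out.println(Arrays.toString(result.get(10))); // [1, 0, 3, 0]
        System.out.println(Arrays.toString(result.get(11))); // [0, 0, 0, 0]
    }
}
```

The per-type arrays carry an explicit UNKNOWN count, which is the extra information the group icon logic needs and the original six subselects did not provide.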
Comment 11 John Mazzitelli 2012-04-30 10:20:07 EDT
If we do change which avail icons are displayed, make sure we update the information about them here:

http://rhq-project.org/display/RHQ/Design-Availability+Checking#Design-AvailabilityChecking-GroupAvailability

http://www.rhq-project.org/display/RHQ/Release+Notes+4.4.0
Comment 12 Charles Crouch 2012-05-01 10:29:45 EDT
Making this block the 4.4 release
Comment 13 John Mazzitelli 2012-05-01 11:11:11 EDT
Created attachment 581415
db performance tests before and after fix

I attached a LibreOffice spreadsheet containing data from a series of database performance tests I ran. These were based on a unit test that will be going in with the fix; I just tweaked the unit test to create different sizes/numbers of groups and of users used to query the data.

There is a lot of data in that spreadsheet, but the sheets that are of interest are those whose names are prefixed by "BOTH_". These contain charts of data for testing both with and without the fix for this BZ. With the fix, it was expected that database performance would suffer (since we are adding new subselects to the queries). This data shows how much of a performance hit there is.

I ran the tests on both Oracle and Postgres, and the database vendor doesn't make a difference (which is good and to be expected). Ignore the absolute values of the Oracle tests: Oracle is not tuned and runs on my laptop over a wireless network, so the absolute numbers will be slow. The important thing is the relative numbers with and without the fix. This applies to the Postgres data as well, although my Postgres ran on a more powerful machine, on the same host as the unit test JVM, so no network latency is involved. Again, just look at the relative numbers to get a feel for the impact of this fix.

In short, the fix introduces some additional slowdown, but nothing that I would consider bad enough for this fix to be rejected.

Spreadsheet with the full data is attached for you to examine yourself.
Comment 14 Jay Shaughnessy 2012-05-01 12:38:13 EDT
Mazz, I agree, I think for the more intelligent behavior the perf hit is
acceptable.  It does slow things down, sometimes a little more than I would
have guessed, but overall performance is still reasonable, I think.
Comment 15 John Mazzitelli 2012-05-01 12:54:33 EDT
master commits:

0c1f0a03e4709b89890bedd85b75af9a13dd9142
2567366ed179b4b896b1b501d8d0d50007000ba8
eef406b0d827e8027a81f2ec20ad5e4eba5dfaca
Comment 16 Sunil Kondkar 2012-05-15 07:31:53 EDT
Verified on Version: 3.1.0.BETA1 Build Number: 95ef567:68f5518

Verified that when member resources are in unknown state, the GroupAvailIcon shows the yellow triangle "Warning" icon and the GroupListAvailIcons in the compatible groups list view displays 'unknown' icon.