Bug 815341 - groups display availability as green when member resources are in unknown state
Summary: groups display availability as green when member resources are in unknown state
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: RHQ Project
Classification: Other
Component: Core UI
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Target Milestone: ---
Target Release: RHQ 4.4.0
Assignee: John Mazzitelli
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On:
Blocks: jon310-sprint11, rhq44-sprint11
Reported: 2012-04-23 12:17 UTC by Sunil Kondkar
Modified: 2013-08-31 09:55 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-08-31 09:55:16 UTC
Embargoed:


Attachments
Screenshot (118.16 KB, image/png)
2012-04-23 12:18 UTC, Sunil Kondkar
Screenshot_GroupListView (92.93 KB, image/png)
2012-04-24 12:17 UTC, Sunil Kondkar
db performance tests before and after fix (211.17 KB, application/vnd.oasis.opendocument.spreadsheet)
2012-05-01 15:11 UTC, John Mazzitelli


Links
System            ID      Private  Priority  Status  Summary                                                                        Last Updated
Red Hat Bugzilla  819897  0        urgent    CLOSED  platform recursive groups include non-committed resources as implicit members  2021-02-22 00:41:40 UTC

Internal Links: 819897

Description Sunil Kondkar 2012-04-23 12:17:40 UTC
Description of problem:

Auto-groups such as 'File Systems' or 'JBossAS Servers' display their availability as green when their member resources are in the unknown state (that is, when the platform is down and its resources have gone unknown after an agent shutdown).

Please refer to the attached screenshot.

Version-Release number of selected component (if applicable):

Build#1386 (Version: 4.4.0-SNAPSHOT Build Number: 28e565c)

How reproducible:

Always

Steps to Reproduce:

1. Shut down the agent through UI (RHQ Agent->Operations->New->Shutdown Agent)
2. The platform goes down and servers and services go unknown immediately.
3. Navigate to the resource tree and click on auto-groups (e.g. 'CPUs', 'File Systems', 'JBossAS Servers').
4. Auto-groups display availability as green.
  
Actual results:

Auto-groups display availability as green when platform is down (member resources in unknown state) 

Expected results:

Auto-groups should display correct availability status (unknown)

Additional info:

Comment 1 Sunil Kondkar 2012-04-23 12:18:44 UTC
Created attachment 579518
Screenshot

Comment 2 Sunil Kondkar 2012-04-24 12:16:50 UTC
Additional info:

Observed the same issue with compatible groups.

Created a compatible group of resources and shut down the agent. When the platform is down (member resources in unknown state), availability is displayed as green in the Children and Descendants columns of the compatible group list view.

Please refer to Screenshot_GroupListView.

Comment 3 Sunil Kondkar 2012-04-24 12:17:24 UTC
Created attachment 579836
Screenshot_GroupListView

Comment 4 John Mazzitelli 2012-04-25 18:17:12 UTC
There is clearly a problem for both auto-groups and compatible groups (I verified the behavior on master). It is obviously due to the changes in the avail subsystem.

The question is - what should it be? Do we show red/down, or do we show (?) in the UI? That is, the big green check mark is wrong, but should it be the big red circle or a big grey "?"?

Note that we are talking about groups. In this case ALL members have unknown avail, which is where the issue is. I think that if at least one member is down you'll see a yellow triangle, but I need to confirm this. The fact is that this BZ reports what happens when ALL resources are unknown. A second issue would be to confirm that the behavior is OK when only SOME, but not all, are unknown.

Comment 5 John Mazzitelli 2012-04-25 19:18:21 UTC
I think the issue is in:

org.rhq.core.domain.resource.group.composite.ResourceGroupComposite.getAvailabilityType(boolean)

when count > 0 but down and disabled are both 0, as is the case here:

ResourceGroupComposite[name=all agents, implicit[count/down/disabled=,1/0/0], explicit[count/down/disabled=,1/0/0], facets=[CONFIGURATION, PLUGIN_CONFIGURATION, MEASUREMENT, OPERATION, SUPPORT, EVENT]]

(that's the toString of the composite that I see in the debugger)
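
To illustrate why a composite that carries only count/down/disabled cannot tell UNKNOWN apart from UP, here is a minimal sketch. It is not the RHQ source: the class name, the method name, the icon enum and the rules themselves are assumptions, inferred from the behavior observed in this bug.

```java
// Hypothetical sketch; names and rules are illustrative, not the real
// ResourceGroupComposite.getAvailabilityType(boolean) implementation.
final class CountsOnlyGroupAvail {
    enum Icon { EMPTY, GREEN, RED, DISABLED, WARNING }

    // Derives an icon from the only three numbers the composite carries.
    static Icon iconFor(long count, long down, long disabled) {
        if (count == 0) {
            return Icon.EMPTY; // assumption: a group without members gets its own state
        }
        if (down > 0) {
            return down == count ? Icon.RED : Icon.WARNING;
        }
        if (disabled > 0) {
            return Icon.DISABLED;
        }
        // Everything that is neither DOWN nor DISABLED lands here,
        // so UP and UNKNOWN members are indistinguishable.
        return Icon.GREEN;
    }

    public static void main(String[] args) {
        // The composite from the debugger: count/down/disabled = 1/0/0
        System.out.println(iconFor(1, 0, 0)); // prints GREEN
    }
}
```

With only these three inputs, the one unknown agent in the "all agents" group falls through to the last branch, which would explain the green icon.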

Comment 6 John Mazzitelli 2012-04-26 15:23:59 UTC
I created a compatible group of 4 CPUs (my platform is a quad-core machine, so I have 4 CPUs). I shut down my agent, so all 4 CPUs are UNKNOWN. I then used a SQL tool to query and update the RHQ_RESOURCE_AVAIL table, changing the availability of one or two of my CPU resources from 2 (UNKNOWN) to either 0 (DOWN), 1 (UP) or 3 (DISABLED), and refreshed my browser to see how the changes look in the UI. Here's what we see:

* CPU #3 and CPU #4 remain UNKNOWN in all tests - I only change #1 or both #1 and #2.
* GroupAvailIcon is the big avail icon you see in the summary area when viewing the group (go to the group's Inventory tab for example and look at the top right of the page)
* GroupListAvailIcons is the set of icons you see in the table/list when viewing all compatible groups (go to Inventory>CompatibleGroups link)

CPU #1    CPU #2   GroupAvailIcon   GroupListAvailIcons
=======================================================
UNKNOWN   UNKNOWN  GREEN            GREEN (4)
DISABLED  UNKNOWN  DISABLED         GREEN (3) DISABLED (1)
RED       UNKNOWN  YELLOW TRIANGLE  GREEN (3) RED (1)
GREEN     UNKNOWN  GREEN            GREEN (4)
GREEN     RED      YELLOW TRIANGLE  GREEN (3) RED (1)
GREEN     DISABLED DISABLED         GREEN (3) DISABLED (1)
RED       DISABLED YELLOW TRIANGLE  GREEN (2) RED (1) DISABLED (1)
DISABLED  DISABLED DISABLED         GREEN (2) DISABLED (2)

So it's clear that any resource that is UNKNOWN is, for all intents and purposes, treated as GREEN, or at best ignored, when determining which icons/state to use.

Comment 7 John Mazzitelli 2012-04-26 15:44:24 UTC
Just wanted to document this - this is the JPQL used to query the resource group composite:

SELECT  new org.rhq.core.domain.resource.group.composite.ResourceGroupComposite(

( SELECT COUNT(avail) FROM resourcegroup.explicitResources res JOIN res.currentAvailability avail ) AS explicitCount,
( SELECT COUNT(avail) FROM resourcegroup.explicitResources res JOIN res.currentAvailability avail WHERE avail.availabilityType = 0 ) AS explicitDown,
( SELECT COUNT(avail) FROM resourcegroup.explicitResources res JOIN res.currentAvailability avail WHERE avail.availabilityType = 3 ) AS explicitDisabled,
( SELECT COUNT(avail) FROM resourcegroup.implicitResources res JOIN res.currentAvailability avail ) AS implicitCount,
( SELECT COUNT(avail) FROM resourcegroup.implicitResources res JOIN res.currentAvailability avail WHERE avail.availabilityType = 0 ) AS implicitDown,
( SELECT COUNT(avail) FROM resourcegroup.implicitResources res JOIN res.currentAvailability avail WHERE avail.availabilityType = 3 ) AS implicitDisabled,
resourcegroup ) 

FROM ResourceGroup resourcegroup
WHERE ( resourcegroup.groupCategory = :groupCategory 
AND resourcegroup.visible = :visible )

(NOTE :visible = true and :groupCategory is the "Compatible Group" entity/enum)

Comment 8 John Mazzitelli 2012-04-26 16:20:58 UTC
this is the actual SQL (not JPQL):

select

(select count(resourceav3_.ID) from RHQ_RESOURCE_GROUP_RES_EXP_MAP explicitre1_, RHQ_RESOURCE resource2_ inner join RHQ_RESOURCE_AVAIL resourceav3_ on resource2_.ID=resourceav3_.RESOURCE_ID where resourcegr0_.ID=explicitre1_.RESOURCE_GROUP_ID and explicitre1_.RESOURCE_ID=resource2_.ID) as col_0_0_,

(select count(resourceav6_.ID) from RHQ_RESOURCE_GROUP_RES_EXP_MAP explicitre4_, RHQ_RESOURCE resource5_ inner join RHQ_RESOURCE_AVAIL resourceav6_ on resource5_.ID=resourceav6_.RESOURCE_ID where resourcegr0_.ID=explicitre4_.RESOURCE_GROUP_ID and explicitre4_.RESOURCE_ID=resource5_.ID and resourceav6_.AVAILABILITY_TYPE=0) as col_1_0_,

(select count(resourceav9_.ID) from RHQ_RESOURCE_GROUP_RES_EXP_MAP explicitre7_, RHQ_RESOURCE resource8_ inner join RHQ_RESOURCE_AVAIL resourceav9_ on resource8_.ID=resourceav9_.RESOURCE_ID where resourcegr0_.ID=explicitre7_.RESOURCE_GROUP_ID and explicitre7_.RESOURCE_ID=resource8_.ID and resourceav9_.AVAILABILITY_TYPE=3) as col_2_0_,

(select count(resourceav12_.ID) from RHQ_RESOURCE_GROUP_RES_IMP_MAP implicitre10_, RHQ_RESOURCE resource11_ inner join RHQ_RESOURCE_AVAIL resourceav12_ on resource11_.ID=resourceav12_.RESOURCE_ID where resourcegr0_.ID=implicitre10_.RESOURCE_GROUP_ID and implicitre10_.RESOURCE_ID=resource11_.ID) as col_3_0_,

(select count(resourceav15_.ID) from RHQ_RESOURCE_GROUP_RES_IMP_MAP implicitre13_, RHQ_RESOURCE resource14_ inner join RHQ_RESOURCE_AVAIL resourceav15_ on resource14_.ID=resourceav15_.RESOURCE_ID where resourcegr0_.ID=implicitre13_.RESOURCE_GROUP_ID and implicitre13_.RESOURCE_ID=resource14_.ID and resourceav15_.AVAILABILITY_TYPE=0) as col_4_0_,

(select count(resourceav18_.ID) from RHQ_RESOURCE_GROUP_RES_IMP_MAP implicitre16_, RHQ_RESOURCE resource17_ inner join RHQ_RESOURCE_AVAIL resourceav18_ on resource17_.ID=resourceav18_.RESOURCE_ID where resourcegr0_.ID=implicitre16_.RESOURCE_GROUP_ID and implicitre16_.RESOURCE_ID=resource17_.ID and resourceav18_.AVAILABILITY_TYPE=3) as col_5_0_,

resourcegr0_.ID as col_6_0_

from RHQ_RESOURCE_GROUP resourcegr0_
where resourcegr0_.CATEGORY='COMPATIBLE' and resourcegr0_.visible = true

Comment 9 John Mazzitelli 2012-04-27 14:32:10 UTC
To repeat the current behavior:

CPU #1    CPU #2   GroupAvailIcon   GroupListAvailIcons
=======================================================
UNKNOWN   UNKNOWN  GREEN            GREEN (4)
DISABLED  UNKNOWN  DISABLED         GREEN (3) DISABLED (1)
RED       UNKNOWN  YELLOW TRIANGLE  GREEN (3) RED (1)
GREEN     UNKNOWN  GREEN            GREEN (4)
GREEN     RED      YELLOW TRIANGLE  GREEN (3) RED (1)
GREEN     DISABLED DISABLED         GREEN (3) DISABLED (1)
RED       DISABLED YELLOW TRIANGLE  GREEN (2) RED (1) DISABLED (1)
DISABLED  DISABLED DISABLED         GREEN (2) DISABLED (2)

Ignoring implementation details, here's what I will try to get the UI to do (remember, this group has *four* members - in all of my tests, CPU resources #3 and #4 are UNKNOWN)

CPU #1    CPU #2   GroupAvailIcon   GroupListAvailIcons
=======================================================
UNKNOWN   UNKNOWN  YELLOW TRIANGLE* UNKNOWN (4)*
DISABLED  UNKNOWN  YELLOW TRIANGLE* UNKNOWN (3)* DISABLED (1)
RED       UNKNOWN  YELLOW TRIANGLE  UNKNOWN (3)* RED (1)
GREEN     UNKNOWN  YELLOW TRIANGLE* UNKNOWN (3)* GREEN (1)*
GREEN     RED      YELLOW TRIANGLE  UNKNOWN (2)* GREEN (1)* RED (1)
GREEN     DISABLED YELLOW TRIANGLE* UNKNOWN (2)* GREEN (1)* DISABLED (1)
RED       DISABLED YELLOW TRIANGLE  UNKNOWN (2)* RED (1) DISABLED (1)
DISABLED  DISABLED YELLOW TRIANGLE* UNKNOWN (2)* DISABLED (2)

Entries marked with (*) show new behavior that differs from the original.
As you can see, UNKNOWN is no longer ignored or treated as GREEN. The group icon will now show the yellow triangle "Warning" icon, which seems more appropriate. If a resource is in an UNKNOWN state, we should visually warn the user that something seems amiss: presumably the user wants to know the state of all resources, so a resource whose state is unknown to the monitoring system is not something the user will consider normal (or green).

If nothing is UNKNOWN, this is the behavior:

IF....                              THEN THE GROUP AVAIL ICON IS....
=============================================================================
All resources DISABLED              ... DISABLED
All resources RED                   ... RED
All resources GREEN                 ... GREEN
Some resources DISABLED, some RED   ... YELLOW TRIANGLE
Some resources GREEN, some RED      ... YELLOW TRIANGLE
Some resources GREEN, some DISABLED ... YELLOW TRIANGLE
Some resources GREEN, RED, DISABLED ... YELLOW TRIANGLE

If all are UNKNOWN, the group avail icon should be UNKNOWN (today it shows as GREEN, this is something we should change).
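
The two tables in this comment can be condensed into a small decision function. The following is only a sketch of the proposed rules, not the committed fix: the class, method and enum names are hypothetical, and the EMPTY state for a group without members is an assumption. Note that the table marks an all-UNKNOWN group with the yellow triangle while the sentence above asks for UNKNOWN; the sketch follows the table, and switching to the other reading would be a one-line change.

```java
// Hypothetical sketch of the proposed rules; not the committed RHQ fix.
final class ProposedGroupAvail {
    enum Icon { EMPTY, GREEN, RED, DISABLED, WARNING }

    static Icon iconFor(long up, long down, long disabled, long unknown) {
        long total = up + down + disabled + unknown;
        if (total == 0) {
            return Icon.EMPTY;      // assumption: no members, no availability
        }
        if (unknown > 0) {
            return Icon.WARNING;    // any UNKNOWN member warns the user
        }
        if (up == total) {
            return Icon.GREEN;
        }
        if (down == total) {
            return Icon.RED;
        }
        if (disabled == total) {
            return Icon.DISABLED;
        }
        return Icon.WARNING;        // any mix of GREEN, RED and DISABLED
    }

    public static void main(String[] args) {
        System.out.println(iconFor(0, 0, 0, 4)); // all four CPUs UNKNOWN
        System.out.println(iconFor(1, 0, 1, 2)); // GREEN, DISABLED, two UNKNOWN
        System.out.println(iconFor(4, 0, 0, 0)); // nothing UNKNOWN, all GREEN
    }
}
```

Unlike a count/down/disabled triple, this takes the UNKNOWN count as its own input, which is why the queries need additional subselects.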

Comment 10 John Mazzitelli 2012-04-27 16:25:01 UTC
I just want to remember this JPQL, so I will put it here. I fiddled with GROUP BY queries to see if there is a better way. I didn't find one, but here's a GROUP BY query that we can build on later if we want to try for a faster/better set of queries to get group avail:

SELECT g.id, ra.availabilityType, count(ra.availabilityType)
FROM ResourceGroup g LEFT JOIN g.implicitResources res LEFT JOIN res.currentAvailability ra
WHERE g.groupCategory = 'COMPATIBLE' AND g.visible = true
GROUP BY g.id, ra.availabilityType
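
If this query were pursued, its rows would still have to be folded into per-group counts. A minimal sketch of that step follows. It is an illustration only: the helper name is hypothetical, and it assumes each row arrives as {group id, availability code, count} with the availability type already converted to the numeric codes used in this bug (0 DOWN, 1 UP, 2 UNKNOWN, 3 DISABLED).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the row layout is an assumption, see the text above.
final class GroupAvailTally {
    // Returns group id -> counts indexed by availability code (0..3).
    static Map<Integer, long[]> tally(List<Object[]> rows) {
        Map<Integer, long[]> byGroup = new LinkedHashMap<>();
        for (Object[] row : rows) {
            int groupId = ((Number) row[0]).intValue();
            long[] counts = byGroup.computeIfAbsent(groupId, id -> new long[4]);
            // The LEFT JOINs yield a null type for a group without members.
            if (row[1] != null) {
                int type = ((Number) row[1]).intValue();
                counts[type] += ((Number) row[2]).longValue();
            }
        }
        return byGroup;
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
            new Object[] {10, 2, 3L},     // group 10: three UNKNOWN
            new Object[] {10, 0, 1L},     // group 10: one DOWN
            new Object[] {11, null, 0L}); // group 11: no members
        tally(rows).forEach((id, c) ->
            System.out.println(id + " -> " + Arrays.toString(c)));
    }
}
```

One query would then return every availability bucket for every group, at the cost of this extra aggregation step in Java.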

Comment 11 John Mazzitelli 2012-04-30 14:20:07 UTC
If we do change which avail icons are displayed, make sure we update the information about them here:

http://rhq-project.org/display/RHQ/Design-Availability+Checking#Design-AvailabilityChecking-GroupAvailability

http://www.rhq-project.org/display/RHQ/Release+Notes+4.4.0

Comment 12 Charles Crouch 2012-05-01 14:29:45 UTC
Making this block the 4.4 release

Comment 13 John Mazzitelli 2012-05-01 15:11:11 UTC
Created attachment 581415
db performance tests before and after fix

I attached a LibreOffice spreadsheet containing data from a series of database performance tests I ran. These were based off of a unit test that will be going in with the fix - I just tweaked the unit test to create different sizes/numbers of groups and the users used to query the data.

There is a lot of data in that spreadsheet, but the sheets that are of interest are those whose names are prefixed by "BOTH_". These contain charts of data for testing both with and without the fix for this BZ. With the fix, it was expected that database performance would suffer (since we are adding new subselects to the queries). This data shows how much of a performance hit there is.

I ran the tests on both Oracle and Postgres, and the database vendor doesn't make a difference (which is good and to be expected). Ignore the absolute values of the Oracle tests - Oracle is not tuned and the setup runs on my laptop over a wireless network, so the absolute numbers will be slow. The important thing is the relative difference between the runs with and without the fix. The same goes for the Postgres data, although my Postgres was running on a more powerful machine, the same machine as the unit test JVM, so no network latency is involved there. Again, just look at the relative numbers to get a feel for the impact of this fix.

In short, the fix introduces some additional slowdown, but nothing that I would consider bad enough for this fix to be rejected.

Spreadsheet with the full data is attached for you to examine yourself.

Comment 14 Jay Shaughnessy 2012-05-01 16:38:13 UTC
Mazz, I agree, I think for the more intelligent behavior the perf hit is
acceptable.  It does slow things down, sometimes a little more than I would
have guessed, but overall performance is still reasonable, I think.

Comment 15 John Mazzitelli 2012-05-01 16:54:33 UTC
master commits:

0c1f0a03e4709b89890bedd85b75af9a13dd9142
2567366ed179b4b896b1b501d8d0d50007000ba8
eef406b0d827e8027a81f2ec20ad5e4eba5dfaca

Comment 16 Sunil Kondkar 2012-05-15 11:31:53 UTC
Verified on Version: 3.1.0.BETA1 Build Number: 95ef567:68f5518

Verified that when member resources are in unknown state, the GroupAvailIcon shows the yellow triangle "Warning" icon and the GroupListAvailIcons in the compatible groups list view displays 'unknown' icon.

