Bug 1295863

Summary: The number of resources in All Groups/Compatible Groups page is not correct all the time
Product: [JBoss] JBoss Operations Network Reporter: bkramer <bkramer>
Component: UI Assignee: Josejulio Martínez <jmartine>
Status: CLOSED ERRATA QA Contact: vsorokin <vsorokin>
Severity: high Docs Contact:
Priority: urgent    
Version: JON 3.3.4 CC: bkramer, fbrychta, jmartine, loleary, mfoley, spinder, stianlund+bugzilla, vsorokin
Target Milestone: DR01 Keywords: Triaged
Target Release: JON 3.3.6   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-07-27 15:32:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1295864    
Bug Blocks:    
Attachments:
Description | Flags
attached screen shots | none
DynaGroup_-_Server_Type__RHQ_Storage_Node_.png | none
ServerName screen shots | none
Sample of 'All Groups' page in JON 3.3.6 GUI | none
Sample of 'All Groups, Compatible' page in JON 3.3.6 GUI | none

Description bkramer 2016-01-05 15:59:38 UTC
Created attachment 1111903 [details]
attached screen shots

Description of problem:
In a JBoss ON environment with a large number of resources, the number of child and/or descendant resources shown is not always correct.

Version-Release number of selected component (if applicable):
JBoss ON 3.3.4

How reproducible:
Always - with a large number of imported resources and a large number of groups.

Steps to Reproduce:
1. Install JBoss ON 3.3.4 and at least 3 agents;

2. Import a large number of resources (I used rhq-perftest-plugin-4.9.0.jar, so in my case I started all agents with "./rhq-agent.sh -Drhq.perftest.scenario=configurable-1 -Drhq.perftest.server-a-count=200 -Drhq.perftest.service-a-count=100");

3. Create two dynagroups:

    3.1.) name: Server Type
          description: Group by Server Type
          expression:
            resource.type.category = SERVER
            groupby resource.type.name
          recursive: yes
          calculate: 5 minutes

    3.2.) name: Server Name
          description: Group by Server Name
          expression: 
            resource.type.category = SERVER
            groupby resource.name
          recursive: yes
          calculate: 5 minutes

4. Navigate to All Groups page and check the content of the page.



Actual results:

Attached screen shots that show the issue. 

Initially, as servers and services were not discovered at the same time, the numbers of servers/services were different. But after a few recalculations of the dynagroups, the All Groups page looked as in incorrectNoResources_1.png (see attached images). In this screen shot, most of the dynagroups show the correct number of children (3) and descendants (303), except for server-a-152, which for some reason shows 3 available and 1 unknown in the children column (this cannot be true).

Then IncorrectNoResources_2.png shows the same server-a-152 server, but also server-a-94 with only 1 unknown child resource and 303 descendants (it should be 3 available child resources, as seen in IncorrectNoResources_6.png).

When the "Refresh" button is pressed for the first time (see IncorrectNoResources_3.png), server-a-115 shows 600 children and 303 descendants (the first number is clearly wrong). After the next refresh (IncorrectNoResources_4.png), server-a-115 shows 3 children and 303 descendants. In the same IncorrectNoResources_4.png, the numbers for server-a-161 look correct: 3 children and 303 descendants. After the Refresh button is pressed again, I got the situation in IncorrectNoResources_5.png, where the number of resources for server-a-115 is correct but for server-a-161 it is not (this server now shows 600 child resources and 303 descendants).
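For reference, the expected values of 3 children and 303 descendants can be reproduced with a small model of the test inventory. This is only a sketch under stated assumptions: it assumes 3 agents, that the perftest plugin creates servers named server-a-N on every agent with 100 services each, and that the descendants column counts the group's member servers plus their services. The helper names buildInventory and expectedCounts are hypothetical and are not part of JBoss ON.

```javascript
// Hypothetical model of the perftest inventory; not JBoss ON code.
// Assumption: every agent reports the same server names (server-a-1 ... server-a-N).
function buildInventory(agentCount, serverCount, serviceCount) {
  const servers = [];
  for (let agent = 1; agent <= agentCount; agent++) {
    for (let s = 1; s <= serverCount; s++) {
      servers.push({ agent: agent, name: "server-a-" + s, serviceCount: serviceCount });
    }
  }
  return servers;
}

// Group servers by name, as "groupby resource.name" would, and count members.
// Assumption: descendants = member servers + the services below them.
function expectedCounts(servers) {
  const groups = new Map();
  for (const server of servers) {
    const group = groups.get(server.name) || { children: 0, descendants: 0 };
    group.children += 1;
    group.descendants += 1 + server.serviceCount;
    groups.set(server.name, group);
  }
  return groups;
}

const counts = expectedCounts(buildInventory(3, 200, 100));
console.log(counts.get("server-a-152")); // { children: 3, descendants: 303 }
```

Under these assumptions every Server Name group should show the same pair of numbers, so a value such as 600 children cannot come from the inventory itself.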



Expected results:
Every dynagroup/compatible group in the All Groups/Compatible Groups page shows the correct number of child/descendant resources.

Additional info:
This may be caused by the work done on Bugzilla https://bugzilla.redhat.com/show_bug.cgi?id=1244941.

Comment 1 bkramer 2016-01-05 16:21:18 UTC
Additionally, see the attached DynaGroup_-_Server_Type__RHQ_Storage_Node_.png - it shows 3 child resources and I only have one storage node. Interestingly, when I press the "Refresh" button, for a few milliseconds I can see 1 child resource, but then it changes to 3. :-(

Comment 2 bkramer 2016-01-05 16:22:06 UTC
Created attachment 1111906 [details]
DynaGroup_-_Server_Type__RHQ_Storage_Node_.png

Comment 4 bkramer 2016-01-07 14:13:53 UTC
One more thing - I used the same Server Name dynagroup but without "groupby resource.name" (see attached ServerName_def.png). When the group was calculated, only one group, DynaGroup - Server Name, was created (as expected; same screen shot, ServerName_def.png). However, there were differences between the values shown on the "Mixed Groups" and "All Groups" pages (see ServerName_1.png and ServerName_2.png).

Comment 5 bkramer 2016-01-07 14:14:51 UTC
Created attachment 1112470 [details]
ServerName screen shots

Comment 6 Filip Brychta 2016-01-08 13:48:38 UTC
Hi Biljana, do you still have agent logs? I was trying to reproduce the issue and agents failed with java.lang.OutOfMemoryError: Java heap space during discovery scan. I'm just guessing but this weird behavior could maybe be caused by overloaded agent and failures during discovery scan.
I tried to use fewer resources (200 servers with 10 services) and groups were working as expected.
The OutOfMemoryError exception on agent is visible in JON 3.3.3 as well.

Comment 7 bkramer 2016-01-08 13:58:49 UTC
(In reply to Filip Brychta from comment #6)
> Hi Biljana, do you still have agent logs? I was trying to reproduce the
> issue and agents failed with java.lang.OutOfMemoryError: Java heap space
> during discovery scan. I'm just guessing but this weird behavior could
> maybe be caused by overloaded agent and failures during discovery scan.
> I tried to use fewer resources (200 servers with 10 services) and groups were
> working as expected.
> The OutOfMemoryError exception on agent is visible in JON 3.3.3 as well.

Yes, you will have to increase Xms and Xmx to get the agent working without OOM errors. I was generous and set the heap of my agent to 2G. The heap for the JON Server should be increased as well, although I didn't change it (the default is 1G, I think) and I didn't have any problems.

Comment 8 Filip Brychta 2016-01-11 12:19:52 UTC
I increased the heap but I'm still not able to reproduce it, even with 200 servers with 100 services. All groups contain the correct count of children and descendants. Targeting 3.3.6 as this will require more investigation.

Comment 9 Filip Brychta 2016-01-12 09:53:24 UTC
After discussion with Biljana I was able to reproduce the issue. The issue shows up after uninventory/inventory of resources (I uninventoried the whole platform). Until then the numbers of children/descendants were correct.

Comment 10 Larry O'Leary 2016-01-12 14:16:58 UTC
The behavior may be expected when an uninventory is performed.

The uninventory operation is done asynchronously meaning that resources are not removed immediately. With a large number of resources it could take several minutes for an uninventory to complete. In some cases, 20 or 30 minutes depending on what history is associated with each resource.

Comment 11 Filip Brychta 2016-01-13 08:09:18 UTC
That could explain the incorrect count of children vs. descendants, but another thing that should be explained is the different count of children shown on the All Groups page vs. the real count of children when you open the group and navigate to the Inventory->Members tab. For example, the All Groups page shows that a given group has 1 child, but when you navigate to the group's Members tab you see 3 resources there.

Comment 12 Stian Lund 2016-01-13 11:36:07 UTC
I don't think it's as simple as a delay in uninventory of resources. The errors are inconsistent: some groups show as having a lot more members than they actually have, and some a lot fewer. Others, like Biljana's examples, show hundreds of members in a group that should contain only a few.

We have a relatively large environment with 450 platforms, each running one or more EAP instances. There are automated jobs to import new resources and, since this is a test environment, a job to remove unavailable platforms after some time. This could have explained some inconsistencies, but not in the amounts we are seeing.

Comment 13 Stian Lund 2016-03-02 09:56:19 UTC
Hello,
has there been any update on this case? Since this affects the Groups view in JON later than 3.3.3 we are unable to go to production with these versions.

If there is a need for more information to be able to reproduce the issue I am very willing to help.

Also as information; for the support case I created a simple CLI script to list groups and the member count, and this always returns the correct number of servers, so the error must lie in how the UI calculates the same value.

Code:
// RHQ CLI script: list every resource group with its explicit member count.
var criteria = new ResourceGroupCriteria();
criteria.fetchExplicitResources(true); // load explicit members with each group
var groupList = ResourceGroupManager.findResourceGroupsByCriteria(criteria);

for (var i = 0; i < groupList.size(); i++) {
    var group = groupList.get(i);
    var groupName = group.name;
    var groupSize = group.explicitResources.size();

    // One line per group: "<name>, <count> members"
    println(groupName + ", " + groupSize + " members");
}

Comment 14 Josejulio Martínez 2016-03-24 01:23:07 UTC
The UI uses a non-public API for this operation. That operation fetches a list of explicit results and a list of implicit results and merges them together. It assumes that the order is the same in both lists.

In my tests, the lists weren't always in the same order, so the resource counts were mixed up (i.e. if I had 2 JVMs and 1 Agent I could end up with a view showing 1 JVM and 2 Agents). Treating the order as unknown fixed the problem in my env.
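The failure mode described above can be illustrated with a short sketch. It is not the actual JBoss ON code or the committed fix: the row shape { groupId, count } and the functions mergeByIndex and mergeByGroupId are hypothetical, and it assumes the explicit list supplies the children count while the implicit list supplies the descendants count. It only shows why pairing two result lists by position breaks when their order differs, and how matching on a key avoids that.

```javascript
// Hypothetical row shape: { groupId, count }. Not the real non-public API.

// Order-dependent merge: assumes row i of both lists describes the same group.
function mergeByIndex(explicitRows, implicitRows) {
  return explicitRows.map((row, i) => ({
    groupId: row.groupId,
    children: row.count,
    descendants: implicitRows[i].count,
  }));
}

// Order-independent merge: look up the implicit count by group id instead.
function mergeByGroupId(explicitRows, implicitRows) {
  const implicitById = new Map(implicitRows.map((row) => [row.groupId, row.count]));
  return explicitRows.map((row) => ({
    groupId: row.groupId,
    children: row.count,
    descendants: implicitById.has(row.groupId) ? implicitById.get(row.groupId) : 0,
  }));
}

// Same data, but the two queries returned their rows in a different order.
const explicitRows = [{ groupId: "JVM", count: 2 }, { groupId: "Agent", count: 1 }];
const implicitRows = [{ groupId: "Agent", count: 1 }, { groupId: "JVM", count: 2 }];

console.log(mergeByIndex(explicitRows, implicitRows));   // counts end up swapped between groups
console.log(mergeByGroupId(explicitRows, implicitRows)); // each group keeps its own counts
```

With the index-based merge, the JVM row picks up the Agent's implicit count and vice versa, which matches the kind of mix-up seen in the UI.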

Comment 15 Josejulio Martínez 2016-05-12 15:18:29 UTC
commit 1bef38230893da8d940f5f545b99eb20ad7fdd19
Merge: 4e8ddcb f39302f
Author: Michael Burman <yak>
Date:   Thu May 12 14:47:22 2016 +0300

    Merge pull request #223 from josejulio/bugs/1295863
    
    Bug 1295863 - The number of resources in All Groups/Compatible Groups...


commit f39302f62c15a0635571b0c778250974b7a9349f
Author: Josejulio Martínez <jmartine>
Date:   Wed Mar 16 18:45:31 2016 -0600

    Bug 1295863 - The number of resources in All Groups/Compatible Groups page is not correct all the time
    
    We can't assume that the order is the same.

Comment 17 Simeon Pinder 2016-06-18 01:12:06 UTC
Moving to ON_QA as available to test with JON 3.3.6 DR01 brew build:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=499890

Comment 18 vsorokin 2016-06-23 18:07:39 UTC
Created attachment 1171627 [details]
Sample of 'All Groups' page in JON 3.3.6 GUI

Sample of 'All Groups' page in JON 3.3.6 GUI - for comparison with CLI script output

Comment 19 vsorokin 2016-06-23 18:10:02 UTC
Created attachment 1171629 [details]
Sample of 'All Groups, Compatible' page in JON 3.3.6 GUI

Sample of 'All Groups, Compatible' page in JON 3.3.6 GUI

Comment 20 vsorokin 2016-06-23 18:29:56 UTC
Created a testing environment with JON and 6 instances of EAP (6/7, domain/standalone).

Created 2 dynamic groups as described in the steps to reproduce.

Output of CLI script (for comparing with attached screenshots):
=====================================================
rhqadmin@localhost:7080$ exec -f dump.cli
DynaGroup - Groups by platform ( Linux ), 7 members
DynaGroup - All resources currently down, 5 members
DynaGroup - All RHQ Agent resources in inventory, 7 members
DynaGroup - Managed EAP7 Servers in server-group ( eap7 domain,main-server-group ), 2 members
DynaGroup - Server Name ( EAP (0.0.0.0:10090) ), 2 members
DynaGroup - Server Name ( RHQ Agent ), 7 members
DynaGroup - Server Name ( EAP 7 (0.0.0.0:10090) ), 3 members
DynaGroup - Server Type ( RHQ Storage Node ), 1 members
DynaGroup - Server Name ( EAP server-two ), 1 members
DynaGroup - Server Name ( RHQ Storage Node(vso-jon-latest.bc.jonqe.lab.eng.bos.redhat.com) ), 1 members
DynaGroup - Server Name ( rhq ), 3 members
DynaGroup - Server Name ( EAP (127.0.0.1:6990) RHQ Server ), 1 members
DynaGroup - Server Type ( RHQ Agent ), 7 members
DynaGroup - Server Type ( RHQ Agent JVM ), 7 members
DynaGroup - Server Type ( EAP7 Standalone Server ), 3 members
DynaGroup - Server Type ( Managed Server ), 3 members
DynaGroup - Server Name ( JVM ), 8 members
DynaGroup - Server Name ( EAP server-one ), 1 members
DynaGroup - Server Name ( EAP server-three ), 1 members
DynaGroup - Server Name ( EAP 7 Domain Controller (master 0.0.0.0:8990) ), 1 members
DynaGroup - Server Type ( JBossAS7 Standalone Server ), 3 members
DynaGroup - Managed EAP7 Servers in domain ( eap7 domain ), 3 members
DynaGroup - Managed EAP7 Servers in server-group ( eap7 domain,other-server-group ), 1 members
DynaGroup - Server Type ( EAP7 Host Controller ), 1 members
DynaGroup - Server Type ( Cassandra Server JVM ), 1 members
DynaGroup - Server Type ( Postgres Server ), 3 members
================================================================
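The comparison between this CLI output and the numbers shown in the UI can also be done mechanically. The following is a minimal sketch, assuming the "name, N members" line format printed by the script in comment 13; the functions parseGroupCounts and findMismatches are hypothetical helpers, and the UI counts would have to be entered by hand from the All Groups page.

```javascript
// Parse lines of the form "<group name>, <N> members" into a Map of name -> count.
// Lines that do not match (prompt, separators) are ignored.
function parseGroupCounts(cliOutput) {
  const counts = new Map();
  for (const line of cliOutput.split("\n")) {
    const match = /^(.*), (\d+) members$/.exec(line.trim());
    if (match) {
      counts.set(match[1], Number(match[2]));
    }
  }
  return counts;
}

// Report every group whose UI count differs from the CLI count (or is missing).
function findMismatches(cliCounts, uiCounts) {
  const mismatches = [];
  for (const [name, expected] of cliCounts) {
    const shown = uiCounts.get(name);
    if (shown !== expected) {
      mismatches.push({ name: name, expected: expected, shown: shown });
    }
  }
  return mismatches;
}

const sampleOutput = [
  "rhqadmin@localhost:7080$ exec -f dump.cli",
  "DynaGroup - Server Name ( RHQ Agent ), 7 members",
  "DynaGroup - Server Type ( RHQ Storage Node ), 1 members",
].join("\n");

// Hypothetical UI values, typed in from the All Groups page.
const uiCounts = new Map([
  ["DynaGroup - Server Name ( RHQ Agent )", 7],
  ["DynaGroup - Server Type ( RHQ Storage Node )", 3],
]);

console.log(findMismatches(parseGroupCounts(sampleOutput), uiCounts));
```

In this made-up example only the RHQ Storage Node group would be reported, since the UI value (3) differs from the CLI value (1).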

Numbers of children were checked by clicking on <item>, then on <item>/Inventory - they look correct.

Screenshots attached.

The number of resources in the All Groups/Compatible Groups page is correct at the time of testing.
BZ -> verified.

Comment 21 errata-xmlrpc 2016-07-27 15:32:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-1519.html