Bug 1027740

Summary: Unable to fetch storage nodes when >1 nodes failed to connect to cluster
Product: [Other] RHQ Project
Reporter: Armine Hovsepyan <ahovsepy>
Component: Core UI, Storage Node
Assignee: John Sanda <jsanda>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: high
Priority: unspecified
Version: 4.10
CC: hrupp, jkremser, jsanda, mfoley
Target Milestone: GA
Target Release: RHQ 4.10
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2014-04-23 12:30:27 UTC
Type: Bug
Bug Blocks: 1027892    
Attachments:
unableToFetchStoragenodes
server.log
storage node UI timeout

Description Armine Hovsepyan 2013-11-07 11:07:43 UTC
Created attachment 821016 [details]
unableToFetchStoragenodes

Description of problem:
Unable to fetch storage nodes when >1 nodes failed to connect to cluster

Version-Release number of selected component (if applicable):
rhq 4.10 d3ea23b

How reproducible:
2 out of 2

Steps to Reproduce:
 1. Install the server with node N1.
 2. Install N2.
 3. Install N3.
 4. Install N4 so that it fails during the add-to-maintenance step.
 5. Stop N2, N3, and N4.
 6. Remove rhq-storage-auth from N1.
 7. Install N5 - it tries to connect to N2 and fails.
 8. Start N2, N3, and N4.
 9. Restore rhq-storage-auth on N1.
 10. Restart N5.
 11. Manually add N1's IP to N5's rhq-storage-auth.
 12. Manually add N5's IP to N1's rhq-storage-auth.
 13. Restart N5.


Actual results:
unable to see nodes in Administration -> Storage Nodes

Expected results:
Storage node administration still functions.

Additional info:
Screenshot and a fragment of server.log attached.

Comment 1 Armine Hovsepyan 2013-11-07 11:08:20 UTC
Created attachment 821017 [details]
server.log

Comment 2 Mike Foley 2013-11-07 12:21:15 UTC
Is there a corresponding JON BZ? If not, please consider cloning this, and consider it a blocker for JON 3.2 GA.

Comment 5 Jirka Kremser 2013-11-07 14:03:01 UTC
Armine, could you try to minimize the repro steps? Is it necessary to install all 5 storage nodes?

Comment 6 Heiko W. Rupp 2013-11-07 14:03:40 UTC
I don't think this is a typical case, and I consider it more of a documentation issue: we should point out that adding new storage nodes requires at least one storage node that is up and running correctly.

Comment 7 Armine Hovsepyan 2013-11-07 14:07:51 UTC
The issue was hit while reproducing bug 1026088, but yes, it can be considered a "bad" scenario.

Comment 8 Jirka Kremser 2013-11-07 16:00:40 UTC
There is an error in the UI because the call that fetches the nodes times out. The server log says: Cassandra timeout during read query at consistency ONE (1 responses were required but only 0 replica responded)

Perhaps it is related to bug 1017372.

Comment 9 John Sanda 2013-11-10 12:25:32 UTC
Created attachment 822064 [details]
storage node UI timeout

Comment 10 John Sanda 2013-11-10 12:26:00 UTC
I do not think that the root cause has anything to do with querying a single node, nor do I think the issue is a Cassandra read timeout. These things certainly can and will make fetching storage nodes in the admin view take longer; however, the problem is much more fundamental in how we query the storage nodes. The method in question, StorageNodeManagerBean.getStorageNodeComposites, performs a lot of work for each storage node.

The issue can be reproduced much more easily than with the steps in the initial description. I did so by setting up a 5 node cluster with virtually no load on either postgres or the storage cluster. I updated getStorageNodeComposites to log its execution time. Here are the results from the last two attempts that timed out in the UI:

07:07:51,516 INFO  [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-jsanda-storage5.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.208:7080-2) Retrieved storage node composites in 26272 ms
07:10:53,341 INFO  [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-jsanda-storage5.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.208:7080-2) Retrieved storage node composites in 28589 ms

My server.log shows that the method completed without error but the results do not make it to the UI because the server simply takes too long to respond. I am attaching a screenshot of the UI.
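The timing instrumentation described above can be sketched with plain JDK calls. This is a hypothetical stand-in, not the real RHQ code: getStorageNodeComposites here just sleeps to simulate the per-node queries the real method performs.

```java
public class TimedFetch {

    // Hypothetical stand-in for StorageNodeManagerBean.getStorageNodeComposites;
    // the sleep simulates the per-node work the real method does.
    static void getStorageNodeComposites() throws InterruptedException {
        Thread.sleep(50);
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        getStorageNodeComposites();
        long elapsed = System.currentTimeMillis() - start;
        // Same style of message as the log lines quoted in these comments
        System.out.println("Retrieved storage node composites in " + elapsed + " ms");
    }
}
```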

Comment 11 John Sanda 2013-11-10 19:16:10 UTC
I deployed another node, increasing the cluster size to 6 nodes. Here is my server.log from several repeated, failed attempts to load the storage nodes view.

14:08:59,849 INFO  [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-jsanda-storage5.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.208:7080-3) Retrieved storage node composites in 25856 ms
14:10:37,558 INFO  [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-jsanda-storage5.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.208:7080-3) Retrieved storage node composites in 25958 ms
14:11:50,880 INFO  [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-jsanda-storage5.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.208:7080-3) Retrieved storage node composites in 25866 ms
14:12:51,662 INFO  [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-jsanda-storage5.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.208:7080-15) Retrieved storage node composites in 25721 ms

The numbers will get worse when the database and/or cluster is under actual load, and as the number of nodes increases.

Comment 12 Jirka Kremser 2013-11-11 13:26:23 UTC
Trying 6 nodes on 1 machine:

14:19:19,549 INFO  [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-/0.0.0.0:7080-3) Retrieved storage node composites in 2492 ms
14:19:23,378 INFO  [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-/0.0.0.0:7080-3) Retrieved storage node composites in 2518 ms
14:19:27,138 INFO  [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-/0.0.0.0:7080-3) Retrieved storage node composites in 2435 ms

There is actually just one node installed in my environment, but six records in the relational DB pointing to the same resource.

What this shows is that the time complexity is linear in the number of nodes (for just one node the time was about 415 ms (= 2492 / 6)), provided there is no inter-node communication.
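The per-node estimate can be checked with a quick calculation over the three timings logged above:

```java
public class ScalingCheck {
    public static void main(String[] args) {
        // The three "Retrieved storage node composites" timings from the log, in ms
        double avg = (2492 + 2518 + 2435) / 3.0;
        int nodes = 6;                 // records in the relational DB
        double perNode = avg / nodes;  // complexity is roughly linear, so divide by node count
        System.out.printf("avg = %.0f ms, per node = %.0f ms%n", avg, perNode);
    }
}
```

Averaged across the three runs this comes out to roughly 414 ms per node, in line with the rough per-node figure quoted in the comment.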

In a real environment, where the storage nodes are interconnected, the situation is obviously different and the complexity is not linear.

Still investigating...

Comment 13 Jirka Kremser 2013-11-11 14:18:46 UTC
The StorageNodeManager.getLoad() method was calling MeasurementManager.getMeasurementAggregate(), which in turn called MetricsServer.getSummaryAggregate().

By removing this intermediate call, I was able to improve the times to ~700 ms.

Comment 14 John Sanda 2013-11-12 04:21:17 UTC
I have pushed a few commits to the branch jsanda/load-storage-composites (https://git.fedorahosted.org/cgit/rhq/rhq.git/log/?h=jsanda/load-storage-composites). Jirka and I discussed some of those changes earlier today. I made 3 changes:

1) Retrieve unacknowledged alerts in one query instead of N (where N is number of storage nodes).

2) Fetch metric data directly from MetricsServer instead of going through MeasurementDataManagerBean.getMeasurementAggregate.

3) Load metric data asynchronously so that data for multiple schedules can be fetched in parallel.

With these changes, the time to load the storage node composites drops from about 25 seconds to about 2.5 seconds. The disparity should be even greater with some actual load on either postgres or the storage cluster.
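Change (3) - fetching metric data for multiple schedules in parallel - can be illustrated with plain java.util.concurrent primitives. This is a simplified sketch under assumed names, not the actual RHQ code: fetchAggregate stands in for the per-schedule metric read, and the 100 ms sleep is an assumed per-read latency.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelMetricFetch {

    // Hypothetical stand-in for a per-schedule metric read against the
    // storage cluster; the sleep simulates an assumed ~100 ms read latency.
    static double fetchAggregate(int scheduleId) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return scheduleId;
    }

    public static void main(String[] args) throws Exception {
        List<Integer> scheduleIds = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);

        // Sequential: total time is roughly 100 ms * N schedules
        long start = System.currentTimeMillis();
        for (int id : scheduleIds) {
            fetchAggregate(id);
        }
        System.out.println("sequential: " + (System.currentTimeMillis() - start) + " ms");

        // Parallel: submit one task per schedule and collect the futures,
        // so total time approaches the latency of a single read
        ExecutorService pool = Executors.newFixedThreadPool(scheduleIds.size());
        start = System.currentTimeMillis();
        List<Future<Double>> futures = new ArrayList<>();
        for (int id : scheduleIds) {
            futures.add(pool.submit(() -> fetchAggregate(id)));
        }
        for (Future<Double> f : futures) {
            f.get();
        }
        System.out.println("parallel:   " + (System.currentTimeMillis() - start) + " ms");
        pool.shutdown();
    }
}
```

With eight schedules the sequential pass takes roughly 800 ms while the parallel pass finishes in a little over 100 ms, which mirrors the order-of-magnitude improvement reported in this comment.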

Comment 15 John Sanda 2013-11-13 01:13:08 UTC
Changes have been pushed to master.

master commit hashes:
c9f38bd4
3a6f0c055
2fbf0fd8
085f9305
11158e9e1

Comment 16 John Sanda 2013-11-14 18:27:21 UTC
Two more commits that went into master for this.

65d30323
ce56e863

I also want to note the dramatic performance improvements resulting from all the changes. I have an 8 node cluster and two servers with which I have been testing. One server has been running the unoptimized code while the other has been running the optimized version. The unoptimized version of StorageNodeManagerBean.getStorageNodeComposites took over 200 seconds to complete. With all of the optimizations in place, it takes less than 200 milliseconds.

Comment 17 Heiko W. Rupp 2014-04-23 12:30:27 UTC
Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.