Bug 1027740
| Field | Value |
|---|---|
| Summary | Unable to fetch storage nodes when >1 nodes failed to connect to cluster |
| Product | [Other] RHQ Project |
| Component | Core UI, Storage Node |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | unspecified |
| Version | 4.10 |
| Target Milestone | GA |
| Target Release | RHQ 4.10 |
| Hardware | x86_64 |
| OS | Linux |
| Reporter | Armine Hovsepyan <ahovsepy> |
| Assignee | John Sanda <jsanda> |
| QA Contact | Mike Foley <mfoley> |
| CC | hrupp, jkremser, jsanda, mfoley |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2014-04-23 12:30:27 UTC |
| Bug Blocks | 1027892 |
Created attachment 821017 [details]
server.log

Is there a corresponding JON BZ? If not, please consider cloning this, and consider it a blocker for JON 3.2 GA.

Armine, could you try to minimize the repro steps? Is it necessary to install all 5 storage nodes? I don't think this is a typical case; I consider it rather a documentation issue, where we should point out that adding new storage nodes requires at least one storage node that is up and running correctly.

The issue was hit during the reproduction of bug 1026088, but yes, it can be considered a "bad" scenario.

There is an error in the UI because the call fetching the nodes times out. The server log says:

Cassandra timeout during read query at consistency ONE (1 responses were required but only 0 replica responded)

Perhaps it is related to bug 1017372.

Created attachment 822064 [details]
storage node UI timeout
I do not think that the root cause has anything to do with querying a single node. Nor do I think the issue is a Cassandra read timeout. These things certainly can and will make fetching storage nodes in the admin view take longer; however, the problem is much more fundamental with respect to how we query the storage nodes. The method in question, StorageNodeManagerBean.getStorageNodeComposites, performs a lot of work for each storage node.

The issue can be reproduced much more easily than with the steps in the initial description. I did so by setting up a 5 node cluster with virtually no load on either postgres or on the storage cluster. I updated getStorageNodeComposites to log the execution time. Here are the results from the past two attempts that timed out in the UI:

07:07:51,516 INFO [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-jsanda-storage5.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.208:7080-2) Retrieved storage node composites in 26272 ms
07:10:53,341 INFO [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-jsanda-storage5.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.208:7080-2) Retrieved storage node composites in 28589 ms

My server.log shows that the method completed without error, but the results do not make it to the UI because the server simply takes too long to respond. I am attaching a screenshot of the UI.

I deployed another node, increasing the cluster size to 6 nodes. Here is my server.log from several repeated, failed attempts to load the storage nodes view.
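The execution-time logging described above can be sketched roughly as follows. This is a minimal illustration, not the actual RHQ code; `loadComposites` is a hypothetical stand-in for the real `getStorageNodeComposites` body.

```java
// Minimal sketch of wrapping a slow call with elapsed-time logging,
// similar in spirit to the log lines quoted above.
public class TimingSketch {
    static long lastElapsedMs = -1;

    // Hypothetical stand-in for the per-node work the real method performs.
    static int loadComposites(int nodeCount) {
        int composites = 0;
        for (int i = 0; i < nodeCount; i++) {
            composites++; // real code would query alerts, metrics, etc. per node
        }
        return composites;
    }

    static int timedLoad(int nodeCount) {
        long start = System.currentTimeMillis();
        int result = loadComposites(nodeCount);
        lastElapsedMs = System.currentTimeMillis() - start;
        System.out.println("Retrieved storage node composites in " + lastElapsedMs + " ms");
        return result;
    }
}
```

Instrumentation like this is what makes the before/after comparison in the later comments possible.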
14:08:59,849 INFO [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-jsanda-storage5.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.208:7080-3) Retrieved storage node composites in 25856 ms
14:10:37,558 INFO [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-jsanda-storage5.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.208:7080-3) Retrieved storage node composites in 25958 ms
14:11:50,880 INFO [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-jsanda-storage5.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.208:7080-3) Retrieved storage node composites in 25866 ms
14:12:51,662 INFO [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-jsanda-storage5.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.208:7080-15) Retrieved storage node composites in 25721 ms

The numbers will get worse when the database and/or cluster has some actual load and as the number of nodes increases.

Trying 6 nodes on 1 machine:

14:19:19,549 INFO [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-/0.0.0.0:7080-3) Retrieved storage node composites in 2492 ms
14:19:23,378 INFO [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-/0.0.0.0:7080-3) Retrieved storage node composites in 2518 ms
14:19:27,138 INFO [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-/0.0.0.0:7080-3) Retrieved storage node composites in 2435 ms

There is actually just 1 node installed in my environment, but 6 records in the relational DB pointing to the same resource. What this shows is that the time complexity is linear in the number of nodes (for just 1 node the time was about 400 ms (= 2492 / 6)), if and only if there is no inter-node communication. In a real environment, where the storage nodes are interconnected, the situation is obviously different and the complexity is not linear. Still investigating.

The StorageNodeManager.getLoad() method was calling MeasurementManager.getMeasurementAggregate(), which was calling MetricsServer.getSummaryAggregate().
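The linear scaling observed above can be stated as a simple model. This is illustrative only; the per-node cost is derived from the 2492 ms / 6 records measurement quoted in the log lines, and the model deliberately ignores inter-node communication.

```java
// Simple linear model of the observed scaling: with no inter-node
// communication, total load time grows proportionally with node count.
public class ScalingModel {
    // ~415 ms per node, from the 2492 ms run against 6 DB records above
    static final long PER_NODE_MS = 2492 / 6;

    static long estimatedTotalMs(int nodeCount) {
        return nodeCount * PER_NODE_MS;
    }
}
```

With interconnected nodes the real cost grows faster than this model predicts, which is why the 5-6 node cluster runs took ~25 seconds rather than ~2.5.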
By removing this intermediate call I was able to improve the times to ~700 ms.

I have pushed a few commits to the branch jsanda/load-storage-composites (https://git.fedorahosted.org/cgit/rhq/rhq.git/log/?h=jsanda/load-storage-composites). Jirka and I discussed some of those changes earlier today. I made 3 changes:

1) Retrieve unacknowledged alerts in one query instead of N (where N is the number of storage nodes).
2) Fetch metric data directly from MetricsServer instead of going through MeasurementDataManagerBean.getMeasurementAggregate.
3) Load metric data asynchronously so that data for multiple schedules can be fetched in parallel.

With these changes, the time to load the storage node composites drops from about 25 seconds to about 2.5 seconds. The disparity should be even greater with some actual load on either postgres or on the storage cluster.

Changes have been pushed to master. master commit hashes:

c9f38bd4
3a6f0c055
2fbf0fd8
085f9305
11158e9e1

Two more commits that went into master for this:

65d30323
ce56e863

I also want to note the dramatic performance improvements as a result of all the changes. I have an 8 node cluster and two servers with which I have been testing. One server has been running the unoptimized code while the other has been running the optimized version. The unoptimized version of StorageNodeManagerBean.getStorageNodeComposites took over 200 seconds to complete. With all of the optimizations in place, it takes less than 200 milliseconds.

Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.
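The idea behind change (3) above, fetching data for multiple schedules in parallel instead of one at a time, can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual RHQ implementation: `fetchAggregate` is a hypothetical stand-in for a round trip to MetricsServer.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelMetricsSketch {

    // Hypothetical stand-in for one metric-aggregate round trip to the
    // storage cluster; the sleep simulates network latency.
    static double fetchAggregate(int scheduleId) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return scheduleId * 1.5;
    }

    // Submit all fetches up front so they run concurrently, then collect
    // the results; total wall time is roughly one round trip, not N.
    static Map<Integer, Double> loadAll(List<Integer> scheduleIds) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, scheduleIds.size()));
        try {
            Map<Integer, Future<Double>> futures = new LinkedHashMap<>();
            for (Integer id : scheduleIds) {
                futures.put(id, pool.submit(() -> fetchAggregate(id)));
            }
            Map<Integer, Double> results = new LinkedHashMap<>();
            for (Map.Entry<Integer, Future<Double>> e : futures.entrySet()) {
                try {
                    results.put(e.getKey(), e.getValue().get());
                } catch (InterruptedException | ExecutionException ex) {
                    throw new RuntimeException(ex);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Sequentially, N schedules would cost roughly N round trips; submitting them all before waiting on any result is what collapses the per-node metric cost described in the comments above.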
Created attachment 821016 [details]
unableToFetchStoragenodes

Description of problem:
Unable to fetch storage nodes when >1 nodes failed to connect to cluster

Version-Release number of selected component (if applicable):
rhq 4.10 d3ea23b

How reproducible:
2 out of 2

Steps to Reproduce:
1. Install the server with N1.
2. Install N2.
3. Install N3.
4. Install N4 so that it fails during the add-to-maintenance step.
5. Stop N2, N3, and N4.
6. Remove rhq-storage-auth from N1.
7. Install N5; it tries to connect to N2 and fails.
8. Start N2, N3, and N4.
9. Restore rhq-storage-auth on N1.
10. Restart N5.
11. Manually add N1's IP to N5's rhq-storage-auth.
12. Manually add N5's IP to N1's rhq-storage-auth.
13. Restart N5.

Actual results:
Unable to see nodes in Administration -> Storage Nodes.

Expected results:
Storage node administration still functions.

Additional info:
Screenshot and a fragment from server.log are attached.