Description of problem:

Deploying multiple storage nodes prior to the server installation involves several manual configuration steps, as described at https://docs.jboss.org/author/display/RHQ/Deploying+Multiple+Storage+Nodes. If a node is not specified in the rhq.storage.nodes property in rhq-server.properties, the server treats it as a new node and starts the deployment process for it. If none of the nodes that were specified in rhq.storage.nodes have yet been imported into inventory, the deployment process for the new node fails with an IndexOutOfBoundsException in StorageNodeOperationsHandlerLocal.announceStorageNode(). The error is not logged against the storage node; consequently, the new node is left with a cluster status of JOINING, making it very difficult to determine that there was a problem. The situation is made worse because the exception propagates up the call stack, rolling back the transaction in which resources are being imported into inventory.

This only happens if the user fails to specify the node in rhq.storage.nodes, but because it can happen, it will happen. We need to provide more robust error handling in this situation so that:

1) the error is logged against the storage node, causing its cluster status to report DOWN (as opposed to JOINING),
2) a detailed error message is provided in the server log, and
3) the exception is handled so that importing resources does not fail.

We originally stumbled onto this issue with bug 1003611.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
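The three requirements above can be sketched as follows. This is a hypothetical, simplified model, not the actual RHQ code: the StorageNode class, ClusterStatus enum, and announceStorageNode() body below are stand-ins for illustration only.

```java
// Hypothetical sketch of the desired error handling. StorageNode,
// ClusterStatus, and announceStorageNode() are simplified stand-ins
// for the real RHQ classes, not the actual API.
import java.util.List;

public class AnnounceSketch {
    enum ClusterStatus { JOINING, DOWN }

    static class StorageNode {
        String address;
        ClusterStatus status = ClusterStatus.JOINING;
        String errorMessage;
        StorageNode(String address) { this.address = address; }
    }

    // Announce a new node to the existing cluster. If no previously
    // deployed node has been imported into inventory yet, record the
    // failure against the node and return normally so the surrounding
    // inventory-import transaction is not rolled back.
    static void announceStorageNode(StorageNode newNode, List<StorageNode> clusterNodes) {
        try {
            // The real code selects an existing cluster node to run the
            // announce operation; with an empty list this throws.
            StorageNode coordinator = clusterNodes.get(0);
            // ... schedule announce operation against coordinator ...
        } catch (IndexOutOfBoundsException e) {
            // 1) status reports DOWN instead of being stuck at JOINING
            newNode.status = ClusterStatus.DOWN;
            // 2) detailed error message recorded for the server log
            newNode.errorMessage = "No deployed storage nodes found. Check that "
                + "rhq.storage.nodes in rhq-server.properties lists all nodes "
                + "deployed prior to server installation.";
            // 3) exception is swallowed so importing resources does not fail
        }
    }

    public static void main(String[] args) {
        StorageNode node = new StorageNode("192.168.1.2");
        announceStorageNode(node, java.util.Collections.<StorageNode>emptyList());
        System.out.println(node.status);               // DOWN
        System.out.println(node.errorMessage != null); // true
    }
}
```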
I have added error handling along with detailed logging to deal with this situation. The IndexOutOfBoundsException is now caught and the following is logged:

15:19:27,686 ERROR [org.rhq.enterprise.server.storage.StorageNodeOperationsHandlerBean] (http-/0.0.0.0:7080-3) If this error occurred with a storage node that was deployed prior to installing the server, then this may indicate that the rhq.storage.nodes property in rhq-server.properties was not set correctly. All nodes deployed prior to server installation should be listed in the rhq.storage.nodes property. Please review the deployment documentation for additional details.

The user can simply redeploy the storage node from the UI (or CLI).

master commit hash: b30d3fe
Reopening. Steps:

1. In jonHome/bin/rhq-storage.properties, set rhq.storage.seeds=IP1,IP2 on both machines; nothing else changed here.
2. jonHome/bin/rhqctl install --storage on IP1
3. jonHome/bin/rhqctl install --storage --agent-preference="rhq.agent.server.bind-address=IP1" on IP2
4. jonHome/bin/rhqctl start on both
5. As soon as the nodes connected: jonHome/bin/rhqctl install --server --start on IP1

Actual result: impossible to log in to the server GUI. Exception in the server log:

10:21:23,753 ERROR [org.rhq.server.metrics.MetricsServer] (New I/O worker #5) An error occurred while inserting raw data MeasurementDataNumeric[name=Calculated.FreeDiskToDataSizeRatio, value=13706.69, scheduleId=10361, timestamp=1382019681146]: com.datastax.driver.core.exceptions.UnauthorizedException: User dwntmhcd has no MODIFY permission on <table rhq.raw_metrics> or any of its parents
A different issue occurred with the error reported in comment 2. Two storage nodes were properly configured and deployed prior to installing the server, but only one storage node was specified in the rhq.storage.nodes property. The second node subsequently went through the deployment process upon being imported into inventory. During the deployment process the node is bootstrapped into the cluster, and its data directories are purged to ensure we can bootstrap it. This explains the UnauthorizedException: we were sending writes to the node with credentials that it no longer knew about, because the node had gone through the deployment process.

The errors could have been prevented if the user had specified both nodes in rhq.storage.nodes. We can do better here and make things more robust by lifting the requirement to specify all already-installed nodes in rhq.storage.nodes. The driver already knows about the nodes, so we can go ahead and create the additional storage node entities that we discover from the driver. I think it is perfectly reasonable to do this. Even though the node was not listed in rhq.storage.nodes, the user has to go through a number of manual steps to get those nodes clustered, which tells me that she knows what she is doing and likely just forgot to update rhq-server.properties.
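The proposed discovery step can be sketched as below. In the real code the driver-known addresses would come from the DataStax driver's cluster metadata; here they are modeled as plain strings so the example is self-contained, and nodesToCreate() is a hypothetical helper, not an RHQ method.

```java
// Hypothetical sketch: merge nodes the driver already knows about with
// the ones listed in rhq.storage.nodes, so that already-clustered nodes
// get storage node entities created for them instead of being redeployed
// (redeployment purges their data directories).
import java.util.*;

public class DiscoverNodes {
    // Returns driver-known addresses missing from the configured set;
    // these should become new storage node entities.
    static Set<String> nodesToCreate(Set<String> configured, Set<String> driverKnown) {
        Set<String> missing = new TreeSet<String>(driverKnown);
        missing.removeAll(configured);
        return missing;
    }

    public static void main(String[] args) {
        // Only one node was listed in rhq.storage.nodes...
        Set<String> configured = new HashSet<String>(Arrays.asList("10.0.0.1"));
        // ...but the driver sees both clustered nodes.
        Set<String> driverKnown = new HashSet<String>(Arrays.asList("10.0.0.1", "10.0.0.2"));
        System.out.println(nodesToCreate(configured, driverKnown)); // [10.0.0.2]
    }
}
```

With this approach, the second node in the scenario above would have been recognized as an existing cluster member rather than pushed through the deployment process.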
Changes have been committed to master. The relevant commit hashes are: 93e856e1, 4012733, de5d069.
Created attachment 815371 [details] storageConnection
Verified. I installed 2 storage nodes prior to the server installation and connected them to each other, then installed the server without providing rhq.storage.nodes, so only one node was specified there. After the server installation I saw both storage nodes. I stopped the storage node on the server box; the server kept working, with no exceptions in the server and/or storage logs. I then restarted the full RHQ stack on the server box, and it correctly re-connected to the separate storage node.
Bulk closing of 4.10 issues. If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.