Bug 1004050 - Provide better error handling for multi-storage node deployment prior to server install
Status: CLOSED CURRENTRELEASE
Alias: None
Product: RHQ Project
Classification: Other
Component: Core Server
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHQ 4.10
Assignee: John Sanda
QA Contact: Mike Foley
Blocks: 951619
 
Reported: 2013-09-03 19:15 UTC by John Sanda
Modified: 2014-04-23 12:31 UTC

Doc Type: Bug Fix
Cloned To: 1021530
Last Closed: 2014-04-23 12:31:41 UTC


Attachments
storageConnection (544.27 KB, image/png), attached 2013-10-23 12:12 UTC by Armine Hovsepyan

Description John Sanda 2013-09-03 19:15:16 UTC
Description of problem:
Deploying multiple storage nodes prior to the server installation involves several manual configuration steps, as described at https://docs.jboss.org/author/display/RHQ/Deploying+Multiple+Storage+Nodes. If a node is not specified in the rhq.storage.nodes property in rhq-server.properties, the server will treat it as a new node and start the deployment process for it. If none of the nodes that were specified in rhq.storage.nodes have yet been imported into inventory, the deployment process for the new node fails with an IndexOutOfBoundsException in StorageNodeOperationsHandlerLocal.announceStorageNode(). The error is not logged against the storage node; consequently, the new node keeps a cluster status of JOINING, making it very difficult to determine that there was a problem. The situation is made worse because the exception propagates up the call stack, rolling back the transaction in which resources are being imported into inventory.

This only happens if the user fails to specify the node in rhq.storage.nodes. Because it can happen, it will happen. We need to provide more robust error handling in this situation so that 1) the error is logged against the storage node causing its cluster status to report DOWN (as opposed to JOINING), 2) a detailed error message is provided in the server log, and 3) the exception is handled so that importing resources does not fail.
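For reference, a minimal rhq-server.properties fragment for a two-node deployment like the one described above might look like the following (the host addresses are placeholders; rhq.storage.nodes is the only property shown):

```properties
# rhq-server.properties (fragment)
# Every storage node deployed before the server install must be listed here;
# otherwise the server treats the unlisted node as new and redeploys it.
rhq.storage.nodes=192.168.1.10,192.168.1.11
```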

We originally stumbled onto this issue with bug 1003611.




Comment 1 John Sanda 2013-09-10 19:45:15 UTC
I have added error handling along with detailed logging to deal with this situation. The IndexOutOfBoundsException is now caught and the following is logged:

15:19:27,686 ERROR [org.rhq.enterprise.server.storage.StorageNodeOperationsHandlerBean] (http-/0.0.0.0:7080-3) If this error occurred with a storage node that was deployed prior to installing the server, then this may indicate that the rhq.storage.nodes property in rhq-server.properties was not set correctly. All nodes deployed prior to server installation should be listed in the rhq.storage.nodes property. Please review the deployment documentation for additional details.

The user can simply redeploy the storage node from the UI (or CLI).

master commit hash: b30d3fe
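The handling described in this comment can be sketched roughly as follows. This is a simplified illustration, not the actual RHQ code: the StorageNode class, the Status enum, and the announcement details are placeholders; only announceStorageNode and the caught exception type come from the bug report.

```java
import java.util.List;

// Sketch of the guard added around cluster announcement so an incomplete
// rhq.storage.nodes list no longer rolls back the inventory-import
// transaction or leaves the node stuck in JOINING.
public class StorageNodeAnnouncer {

    public enum Status { JOINING, NORMAL, DOWN }

    public static class StorageNode {
        final String address;
        Status status = Status.JOINING;
        String errorMessage;
        StorageNode(String address) { this.address = address; }
    }

    /**
     * Announces newNode to the existing cluster members. If the cluster
     * list is empty (e.g. rhq.storage.nodes was incomplete), the former
     * IndexOutOfBoundsException is caught, the error is recorded against
     * the node, and its status becomes DOWN instead of staying JOINING.
     */
    public static void announceStorageNode(StorageNode newNode,
                                           List<StorageNode> clusterNodes) {
        try {
            // The original code indexed into the cluster node list, which
            // threw IndexOutOfBoundsException when the list was empty.
            StorageNode seed = clusterNodes.get(0);
            // ... schedule the announce operation against seed ...
            newNode.status = Status.NORMAL;
        } catch (IndexOutOfBoundsException e) {
            newNode.status = Status.DOWN;
            newNode.errorMessage = "If this error occurred with a storage node "
                + "that was deployed prior to installing the server, verify that "
                + "the rhq.storage.nodes property in rhq-server.properties lists "
                + "all pre-installed nodes.";
            // The exception is swallowed here so that importing resources
            // into inventory does not fail.
        }
    }
}
```

With this guard in place the node's cluster status reports DOWN and the user can simply redeploy it, as noted above.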

Comment 2 Armine Hovsepyan 2013-10-17 14:40:15 UTC
Reopening.

Steps:
1. In jonHome/bin/rhq-storage.properties set rhq.storage.seeds=IP1,IP2 on both boxes (nothing else changed here).
2. jonHome/bin/rhqctl install --storage on IP1
3. jonHome/bin/rhqctl install --storage --agent-preference="rhq.agent.server.bind-address=IP1" on IP2
4. jonHome/bin/rhqctl start on both
5. As soon as the nodes are connected: jonHome/bin/rhqctl install --server --start on IP1

Actual result:
Unable to log in to the server GUI.
Exception in server log:
10:21:23,753 ERROR [org.rhq.server.metrics.MetricsServer] (New I/O worker #5) An error occurred while inserting raw data MeasurementDataNumeric[name=Calculated.FreeDiskToDataSizeRatio, value=13706.69, scheduleId=10361, timestamp=1382019681146]: com.datastax.driver.core.exceptions.UnauthorizedException: User dwntmhcd has no MODIFY permission on <table rhq.raw_metrics> or any of its parents

Comment 4 John Sanda 2013-10-21 13:09:09 UTC
The error reported in comment 2 stems from a different issue. Two storage nodes were properly configured and deployed prior to installing the server, but only one of them was specified in the rhq.storage.nodes property. The second node subsequently went through the deployment process upon being imported into inventory. During deployment the node is bootstrapped into the cluster, and its data directories are purged to ensure we can bootstrap it. This explains the UnauthorizedException: we were sending writes to the node with credentials that it no longer knew about.

This was caused by the node going through the deployment process. The errors could have been prevented if the user had specified both nodes in rhq.storage.nodes. We can do better here and make things more robust by lifting the requirement to specify all already-installed nodes in rhq.storage.nodes. The driver already knows about the nodes, so we can go ahead and create storage node entities for the additional nodes we discover from the driver. I think it is perfectly reasonable to do this. Even though a node was not listed in rhq.storage.nodes, the user has to go through a number of manual steps to get those nodes clustered, which tells me that she knows what she is doing and likely just forgot to update rhq-server.properties.
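The relaxed approach described above amounts to reconciling the configured rhq.storage.nodes list against the hosts the Cassandra driver actually discovered in the ring, and creating entities only for the difference. A minimal sketch of that reconciliation, with the class and method names hypothetical and the driver's host list stubbed as plain address strings:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch: given the addresses from rhq.storage.nodes and the addresses the
// driver discovered in the cluster, return the nodes that still need
// StorageNode entities created, so they are not re-run through deployment.
public class StorageNodeReconciler {

    public static Set<String> findUnlistedClusterNodes(List<String> configuredNodes,
                                                       List<String> driverDiscoveredNodes) {
        // Start from everything the driver sees, then drop what was
        // already listed in rhq-server.properties.
        Set<String> unlisted = new LinkedHashSet<>(driverDiscoveredNodes);
        unlisted.removeAll(configuredNodes);
        return unlisted;
    }
}
```

Any address returned here belongs to a node that is already clustered but missing from the configuration, so an entity can be created for it directly instead of bootstrapping it as a new deployment.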

Comment 5 John Sanda 2013-10-23 02:29:58 UTC
Changes have been committed to master. The relevant commit hashes are,

93e856e1
4012733
de5d069

Comment 6 Armine Hovsepyan 2013-10-23 12:12:27 UTC
Created attachment 815371 (storageConnection)

Comment 7 Armine Hovsepyan 2013-10-23 14:21:27 UTC
Verified.

Installed two storage nodes prior to server installation and connected them to each other, then installed the server without explicitly providing rhq.storage.nodes, so only one node was specified there. After server installation the server saw both storage nodes. Stopped the storage node on the server box; the server kept working with no exceptions in the server or storage logs. Restarted the full RHQ stack on the server box and it correctly re-connected to the separate storage node.

Comment 8 Heiko W. Rupp 2014-04-23 12:31:41 UTC
Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.

