Bug 1079598 - Storage node will get stuck in ANNOUNCE mode if existing storage node is in INSTALLED state
Summary: Storage node will get stuck in ANNOUNCE mode if existing storage node is in INSTALLED state
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Storage Node
Version: JON 3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ER01
Target Release: JON 3.2.3
Assignee: John Sanda
QA Contact: Garik Khachikyan
URL:
Whiteboard:
Depends On: 1101773
Blocks:
 
Reported: 2014-03-22 00:15 UTC by Larry O'Leary
Modified: 2018-12-05 17:49 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-09-05 15:40:16 UTC
Type: Bug


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 762683 None None None Never
Red Hat Bugzilla 1026108 None None None Never
Red Hat Bugzilla 1101773 None None None Never
Red Hat Bugzilla 1103841 None None None Never

Internal Links: 1026108 1101773 1103841

Description Larry O'Leary 2014-03-22 00:15:02 UTC
Description of problem:
It is possible for a storage cluster to become unstable if a storage node is installed but not started, and then another storage node is installed and started.

The result is that the started (second) storage node will enter the ANNOUNCE phase but fail, because the first storage node has no associated resource to make the announcement to.

ERROR [org.rhq.enterprise.server.storage.StorageNodeOperationsHandlerBean] (http-/0.0.0.0:7080-1) Aborting storage node deployment due to unexpected error while announcing storage node at jboss-on.example.com: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0

Version-Release number of selected component (if applicable):
3.2.0 dcb8b6f:734bd56

How reproducible:
Always

Steps to Reproduce:
1.  Configure storage node to use unqualified host name:

        sed -i 's/^[#]*rhq\.storage\.hostname=.*$/rhq.storage.hostname=jboss-on/' "${RHQ_SERVER_HOME}"'/bin/rhq-storage.properties'

2.  Set `jboss.bind.address`:

        sed -i 's/^[#]*jboss\.bind\.address=.*$/jboss.bind.address=0.0.0.0/' "${RHQ_SERVER_HOME}"'/bin/rhq-server.properties'

3.  Run rhqctl install
4.  Delete the rhq-storage installation:

        rm -rf "${RHQ_SERVER_HOME}"'/rhq-storage'

5.  Configure storage node to use fully qualified host name:

        sed -i 's/^[#]*rhq\.storage\.hostname=.*$/rhq.storage.hostname=jboss-on.example.com/' "${RHQ_SERVER_HOME}"'/bin/rhq-storage.properties'

6.  Run rhqctl install --storage
7.  Run rhqctl start

Actual results:
First storage node (jboss-on) has operation mode of INSTALLED
Second storage node (jboss-on.example.com) has operation mode of ANNOUNCE and is reported as failed.

The following warnings and errors are logged in server log:

        WARN  [org.rhq.enterprise.server.storage.StorageClientManagerBean] (EJB default - 3) Storage client subsystem wasn't initialized. The RHQ server will be set to MAINTENANCE mode. Please verify  that the storage cluster is operational.: java.lang.IllegalStateException: There is no storage node metadata stored in the relational database. This may have happened as a result of running dbsetup or deleting rows from rhq_storage_node table. Please re-install the storage node to fix this issue.
            at org.rhq.enterprise.server.storage.StorageClientManagerBean.createSession(StorageClientManagerBean.java:335) [rhq-server.jar:4.9.0.JON320GA]
            at org.rhq.enterprise.server.storage.StorageClientManagerBean.init(StorageClientManagerBean.java:154) [rhq-server.jar:4.9.0.JON320GA]
            at org.rhq.enterprise.server.storage.StorageClientManagerBean.storageSessionMaintenance(StorageClientManagerBean.java:129) [rhq-server.jar:4.9.0.JON320GA]
            ...
        ERROR [org.jboss.as.ejb3.invocation] (http-/0.0.0.0:7080-3) JBAS014134: EJB Invocation failed on component MeasurementDataManagerBean for method public abstract void org.rhq.enterprise.server.measurement.MeasurementDataManagerLocal.addNumericData(java.util.Set): javax.ejb.EJBException: java.lang.NullPointerException
            at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleExceptionInNoTx(CMTTxInterceptor.java:191) [jboss-as-ejb3-7.2.1.Final-redhat-10.jar:7.2.1.Final-redhat-10]
            ...
        Caused by: java.lang.NullPointerException
            at org.rhq.enterprise.server.measurement.MeasurementDataManagerBean.addNumericData(MeasurementDataManagerBean.java:237) [rhq-server.jar:4.9.0.JON320GA]
            ...


Expected results:
First storage node (jboss-on) has operation mode of INSTALLED
Second storage node (jboss-on.example.com) has operation mode of NORMAL and is functioning normally.

Comment 1 John Sanda 2014-05-25 00:32:26 UTC
This is an interesting problem. When the server installer runs the first time, we create a row in the rhq_storage_node table for hostname jboss-on. The installer only inserts rows into rhq_storage_node if none already exist. The reason for this has to do with how we determine whether or not we need to run the deploy process when a storage node resource is committed into inventory. We determine that by simply looking to see if there is a StorageNode entity. If there is not one, we run the deploy process for that node. And we want to run the deploy process for any nodes except the initial nodes created at server installation time. 
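The deploy decision described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual RHQ source: the class and method names are invented, and a map stands in for the rhq_storage_node table.

```java
import java.util.HashMap;
import java.util.Map;

public class DeployDecision {
    // Stand-in for the rhq_storage_node table, keyed by node address.
    private final Map<String, String> storageNodes = new HashMap<>();

    // The installer creates rows only when none exist yet, so the
    // initial nodes are present here from server installation time.
    public DeployDecision(String... installedAddresses) {
        for (String address : installedAddresses) {
            storageNodes.put(address, "INSTALLED");
        }
    }

    // The deploy process runs only for a committed storage node resource
    // that has no existing StorageNode entity, i.e. any node other than
    // those created at server installation time.
    public boolean shouldRunDeploy(String address) {
        return !storageNodes.containsKey(address);
    }
}
```

In Larry's scenario, the installer-created row for jboss-on means a later node committed under that same address would not be deployed, while jboss-on.example.com would be.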

Running rhqctl install the second time does not create a new storage node entity, i.e., a new row in the rhq_storage_node table. I wanted to point this out because it is probably not obvious.

It is important to understand that the Cassandra schema is created when the server installer runs. This includes disabling the default Cassandra super user and creating the rhq user with randomly generated username/password. Schema is not propagated to new nodes until the deploy process runs.

Since the storage client subsystem failed to initialize, the server went into maintenance mode, which means all agent requests are rejected. This made me wonder how the resource for the jboss-on.example.com node wound up getting committed into inventory at all. I think there is about a 30-second window before the server goes into maintenance mode.

One way to resolve this would be to first try to connect to the jboss-on.example.com node as the rhq user. If we are able to create a CQL session, that tells us the node has the schema. If we cannot connect as the rhq user, we could try connecting as the default user and then install the schema. This gets us semi-operational.
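The fallback logic in that proposal can be sketched as below. This is a hypothetical sketch, not RHQ code: the connection check is injected as a predicate so the logic can be exercised without a live Cassandra node; "rhq" and "cassandra" refer to the rhq user and the default super user mentioned above.

```java
import java.util.function.Predicate;

public class SchemaFallback {
    // Decide how to recover schema on a node based on which user can
    // open a CQL session. canConnect is a stand-in for a real driver
    // connection attempt (assumption: injected for testability).
    static String plan(Predicate<String> canConnect) {
        if (canConnect.test("rhq")) {
            // A session as the rhq user implies the schema was installed.
            return "schema already present";
        }
        if (canConnect.test("cassandra")) {
            // Default super user still enabled: schema was never set up.
            return "install schema via default user";
        }
        return "node unreachable";
    }
}
```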

I say semi-operational because if the jboss-on node were actually a different node, we would have other problems. The jboss-on and jboss-on.example.com nodes would not be able to talk to one another without updating the rhq-storage-auth.conf file for each node. I am not sure how comfortable I am with implementing those changes for 3.2.2.

I wonder if there might be an easier way to resolve this. There are two issues in this bug. 

1) The user needs a way to change the hostname of the storage node. 

2) We need to decide what to do when a new storage node is committed into inventory and the deploy process should run but existing cluster nodes have an operation mode of INSTALLED, meaning they are not yet managed by the agent.

cc'ing Stefan to see what he thinks.

Comment 2 John Sanda 2014-05-27 21:58:18 UTC
Part of the problem is due to bug 1101773. The server is going into maintenance mode at start up, but the comm layer is not getting notified; consequently, the server is not rejecting incoming agent requests. 

This explains how the second storage node was imported into inventory. It should never have happened, which also makes solving this bug a bit different and more difficult.

Maybe we make the StorageNode.address property updatable through the CLI/remote API. If the node has an operation mode of INSTALLED, then all we can do is update the column in the rhq_storage_node table, since the node is not managed. It would be up to the user to make sure that the value matches the listen_address property in cassandra.yaml. If the node is managed, then we could also execute a resource operation to update cassandra.yaml and restart the node in order for the change to take effect.
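That two-branch proposal can be summarized in a small sketch. The names here are hypothetical, not the actual RHQ API; the point is only that how far an address change can propagate depends on whether the node is managed.

```java
public class StorageNodeAddressUpdater {
    enum OperationMode { INSTALLED, ANNOUNCE, NORMAL, MAINTENANCE }
    enum UpdateAction { DB_ROW_ONLY, DB_ROW_PLUS_RESOURCE_OPERATION }

    // Unmanaged node (INSTALLED, no linked resource): only the
    // rhq_storage_node row can be updated, and the user must keep
    // listen_address in cassandra.yaml in sync by hand.
    // Managed node: also run a resource operation to rewrite
    // cassandra.yaml and restart the node.
    static UpdateAction planUpdate(OperationMode mode, boolean linkedToResource) {
        if (mode == OperationMode.INSTALLED && !linkedToResource) {
            return UpdateAction.DB_ROW_ONLY;
        }
        return UpdateAction.DB_ROW_PLUS_RESOURCE_OPERATION;
    }
}
```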

Comment 3 John Sanda 2014-06-02 17:36:42 UTC
I have started working on a design document to capture the various scenarios in which a storage node address would be changed. It can be found at https://docs.jboss.org/author/display/RHQ/Changing+Storage+Node+Address. It is too much risk/change to implement everything covered in that document in JON 3.2.x.

I am proposing a solution that limits the scope to the scenario presented by Larry. We give the user the ability either through the storage node admin UI or through the CLI to update the storage node address *only* when the storage node has an operation mode of INSTALLED and when it is not yet linked to a resource.

I have created bug 1103841 to implement support for the other scenarios in a future release.

Comment 4 Jirka Kremser 2014-06-10 18:36:08 UTC
I hit the same issue when installing 2 nodes (n1: server+agent+s.node n2: agent+s.node) simultaneously. In the UI in the storage node details there is a note:

Deployment error: Aborting storage node deployment due to unexpected error while announcing storage node at jk-bz1105743-2.bc.jonqe.lab.eng.bos.redhat.com Check the server log for details. Root cause: Index: 0, Size: 0

It is thrown when evaluating clusterNodes.get(0). Perhaps we could check the size and throw something better.
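A defensive check along those lines could look like the following. This is a hypothetical sketch, not the actual RHQ source; it just replaces the opaque "Index: 0, Size: 0" with a message that names the real problem.

```java
import java.util.List;

public class ClusterNodeCheck {
    // Guard clusterNodes.get(0) so an empty list produces a descriptive
    // error instead of an IndexOutOfBoundsException.
    static String firstClusterNodeAddress(List<String> clusterNodeAddresses) {
        if (clusterNodeAddresses.isEmpty()) {
            throw new IllegalStateException("Cannot announce new storage node: "
                + "no existing cluster node is linked to a resource yet. "
                + "Verify that earlier storage node installs completed.");
        }
        return clusterNodeAddresses.get(0);
    }
}
```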

Comment 5 John Sanda 2014-06-12 16:51:20 UTC
The error in comment 4 was only possible because of bug 1101773. With that fixed, that scenario should no longer be possible.

Comment 6 John Sanda 2014-06-12 16:53:12 UTC
Changes have been made so that the storage node address is editable only when the storage node has an operation mode of INSTALLED and is not linked to a resource. It can be edited through the CLI or through the admin UI.

commit hashes in release/jon3.2.x branch:
a4bf9b242f6
7cdafdfcc5f7
4808832cbc

Comment 7 Simeon Pinder 2014-08-15 03:19:00 UTC
Moving to ON_QA as this is available for test in JON 3.2.3 ER01 build:

http://jon01.mw.lab.eng.bos.redhat.com:8042/dist/release/jon/3.2.3.GA/8-14-14/

Comment 8 Garik Khachikyan 2014-08-27 14:05:22 UTC
# VERIFIED

Doing the following scenario:
1. modify the main storage node (the one installed together with the server) to use an unqualified hostname.
2. after install (don't start it yet), remove the rhq-storage/ directory (breaking the storage node).
3. start the server
4. plug in another storage node (with all normal settings)
Result: on 3.2.0 GA the UI throws red alerts, etc.

=> let's upgrade to 3.2.3 CR01

1. take down the storage node (the standalone one)
2. take server down too and apply the patch to it
3. start server
4. visit UI - looks promising (broken storage node status: http://screencloud.net/v/nnzO; the other node status: http://screencloud.net/v/58pd)
so no exceptions.

