Bug 1002202 - storage node stays in bootstrap mode while joining to cluster if it had already been deployed - undeployed
Product: RHQ Project
Classification: Other
Component: Installer
Hardware: All Linux
Priority: unspecified, Severity: urgent
Target Milestone: ---
Target Release: RHQ 4.9
Assigned To: John Sanda
QA Contact: Mike Foley
Depends On:
Blocks: 951619
Reported: 2013-08-28 11:33 EDT by Armine Hovsepyan
Modified: 2015-09-02 20:01 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2013-09-24 15:09:14 EDT
Type: Bug

Attachments
storage-uninstall.png (590.79 KB, image/png), 2013-09-02 06:24 EDT, Armine Hovsepyan
storage-reinstalled.png (514.64 KB, image/png), 2013-09-02 06:25 EDT, Armine Hovsepyan

Description Armine Hovsepyan 2013-08-28 11:33:24 EDT
Description of problem:
A storage node stays in bootstrap mode indefinitely when it joins the cluster after having previously been deployed and then undeployed.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install and start the RHQ server, storage node, and agent on ip1.
2. Install and start a storage node and agent on ip2.
3. Undeploy the storage node on ip2.
4. Deploy the storage node on ip2 again.

Actual results:
The storage node stays in bootstrap mode "forever".

Expected results:
The storage node goes through the INSTALL -> ANNOUNCE -> BOOTSTRAP -> ADD_MAINTENANCE modes and then reaches normal cluster status.
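
The expected progression can be sketched as a small ordered state list. This is only an illustration in Python, not RHQ code: the four mode names come from this report, while `OperationMode`, `DEPLOY_ORDER`, `next_mode`, and the `NORMAL` label for "normal cluster status" are assumptions made for the sketch.

```python
from enum import Enum
from typing import Optional


class OperationMode(Enum):
    """Deployment modes named in this report; NORMAL is an assumed label."""
    INSTALL = "INSTALL"
    ANNOUNCE = "ANNOUNCE"
    BOOTSTRAP = "BOOTSTRAP"
    ADD_MAINTENANCE = "ADD_MAINTENANCE"
    NORMAL = "NORMAL"


# Expected order of modes for a storage node that is being deployed.
DEPLOY_ORDER = [
    OperationMode.INSTALL,
    OperationMode.ANNOUNCE,
    OperationMode.BOOTSTRAP,
    OperationMode.ADD_MAINTENANCE,
    OperationMode.NORMAL,
]


def next_mode(current: OperationMode) -> Optional[OperationMode]:
    """Return the mode that follows `current`, or None once the node is NORMAL."""
    index = DEPLOY_ORDER.index(current)
    if index + 1 < len(DEPLOY_ORDER):
        return DEPLOY_ORDER[index + 1]
    return None
```

In these terms, the bug is a node for which the step from BOOTSTRAP to ADD_MAINTENANCE is never taken.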

Additional info:
Investigation results from John Sanda: 
Once the C* bootstrap finishes, we wait for an event notification from the driver that the node is up, and then we change its mode from bootstrap to add_maintenance. That event is not firing. This happens with a node that was previously deployed.
Comment 1 John Sanda 2013-08-28 12:51:38 EDT
We are running into https://issues.apache.org/jira/browse/CASSANDRA-5769. The event is not reported over the native CQL protocol. We use this event notification to determine that the node has joined the cluster, at which point we can initiate the necessary cluster maintenance. Since the event has not fired, we are in a perpetual holding pattern.

Even though we are close to releasing 4.9, I think upgrading C* makes sense for a couple of reasons. First, it resolves this issue. Second, it gives us an opportunity to test upgrading our C* bits in the community before a JON release.
Comment 2 John Sanda 2013-08-31 18:20:19 EDT
I have upgraded RHQ to use Cassandra 1.2.9 which includes the fix for CASSANDRA-5769. Even with that fix there are still scenarios in which the server could miss the event notification that advances the deployment beyond the bootstrap phase; consequently, I put some additional logic in place to continue the deployment if the bootstrap is successful and if the new node is part of the cluster.
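
A minimal sketch of that kind of fallback follows, written in Python for brevity rather than taken from the RHQ sources. All names here (`wait_for_node_to_join`, `node_up_event`, `bootstrap_succeeded`, `node_in_cluster`, and the polling parameters) are hypothetical; the two callbacks stand in for whatever checks the server actually performs.

```python
import threading
from typing import Callable


def wait_for_node_to_join(
    node_up_event: threading.Event,
    bootstrap_succeeded: Callable[[], bool],
    node_in_cluster: Callable[[], bool],
    poll_seconds: float = 1.0,
    max_polls: int = 30,
) -> bool:
    """Return True when the deployment may move past the bootstrap phase."""
    for _ in range(max_polls):
        # Preferred path: the driver delivered the "node up" notification.
        if node_up_event.wait(timeout=poll_seconds):
            return True
        # Fallback path: the notification may have been missed, so check
        # the bootstrap result and the cluster membership directly.
        if bootstrap_succeeded() and node_in_cluster():
            return True
    # Neither condition was met within the allotted polls.
    return False
```

The point of the design is that the event is treated as a shortcut rather than the only signal, so a missed notification delays the deployment by at most one polling interval instead of stalling it.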
Comment 3 Armine Hovsepyan 2013-09-02 06:24:45 EDT
Created attachment 792791: storage-uninstall.png
Comment 4 Armine Hovsepyan 2013-09-02 06:25:09 EDT
Created attachment 792792: storage-reinstalled.png
Comment 5 Armine Hovsepyan 2013-09-02 06:25:55 EDT
Please see the attached screenshots.

A new bug, #1003545, has been filed for the exception shown in the re-installation screenshot.
Comment 6 Heiko W. Rupp 2013-09-24 15:09:14 EDT
Bulk closing of RHQ 4.9 verified items
