Bug 1002202

Summary: storage node stays in bootstrap mode while joining the cluster if it had previously been deployed and then undeployed
Product: [Other] RHQ Project
Reporter: Armine Hovsepyan <ahovsepy>
Component: Installer
Assignee: John Sanda <jsanda>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: urgent
Priority: unspecified
Version: 4.9
CC: hrupp, jsanda, mfoley
Target Milestone: ---   
Target Release: RHQ 4.9   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-09-24 19:09:14 UTC
Bug Blocks: 951619    
Attachments:
storage-uninstall.png (flags: none)
storage-reinstalled.png (flags: none)

Description Armine Hovsepyan 2013-08-28 15:33:24 UTC
Description of problem:
A storage node stays in bootstrap mode while joining the cluster if it had previously been deployed and then undeployed.

Version-Release number of selected component (if applicable):
e2a1811

How reproducible:
always

Steps to Reproduce:
1. install and start rhq server, storage and agent on ip1
2. install and start storage and agent on ip2
3. undeploy storage on ip2
4. deploy storage on ip2 again

Actual results:
storage stays in bootstrap mode "forever"

Expected results:
storage goes through INSTALL -> ANNOUNCE -> BOOTSTRAP -> ADD_MAINTENANCE modes and gets normal cluster status.
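
For illustration, the expected progression can be written as a small ordered enum. This is only a sketch: the type name `DeployMode`, the `next()` helper, and the final `NORMAL` value (standing in for "normal cluster status") are assumptions and are not taken from the RHQ code base.

```java
import java.util.Optional;

// Hypothetical sketch: the order of the first four modes comes from the
// expected results above; NORMAL is an assumed name for the end state.
enum DeployMode {
    INSTALL, ANNOUNCE, BOOTSTRAP, ADD_MAINTENANCE, NORMAL;

    /** Returns the mode that follows this one, or empty once the node is NORMAL. */
    Optional<DeployMode> next() {
        DeployMode[] all = values();
        int following = ordinal() + 1;
        return following < all.length ? Optional.of(all[following]) : Optional.empty();
    }
}
```

In these terms, the bug is that a redeployed node never makes the BOOTSTRAP to ADD_MAINTENANCE step.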

Additional info:
Investigation results from John Sanda: 
Once the C* bootstrap finishes, we wait for an event notification from the driver that the node is up, and then we change its mode from bootstrap to add_maintenance. That event is not firing. This happens with a node that was previously deployed.

Comment 1 John Sanda 2013-08-28 16:51:38 UTC
We are running into https://issues.apache.org/jira/browse/CASSANDRA-5769. The event is not reported over the native CQL protocol. We use this event notification to determine that the node has joined the cluster, at which point we can initiate the necessary cluster maintenance. Since the event has not fired, we are in a perpetual holding pattern.

Even though we are close to releasing 4.9, I think upgrading C* makes sense for a couple of reasons. First, it resolves this issue. Second, it gives us an opportunity to test upgrading our C* bits in the community before a JON release.

Comment 2 John Sanda 2013-08-31 22:20:19 UTC
I have upgraded RHQ to use Cassandra 1.2.9 which includes the fix for CASSANDRA-5769. Even with that fix there are still scenarios in which the server could miss the event notification that advances the deployment beyond the bootstrap phase; consequently, I put some additional logic in place to continue the deployment if the bootstrap is successful and if the new node is part of the cluster.
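
The fallback described in this comment could look roughly like the sketch below. It is not the actual RHQ change: the class `BootstrapWatcher`, its method names, and the idea of passing the bootstrap result and the cluster membership in as plain values are all assumptions made for illustration. The sketch waits a bounded time for the node-up event and, if the event never arrives, continues only when the bootstrap succeeded and the new node is already a cluster member.

```java
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names come from the RHQ code base.
class BootstrapWatcher {
    private final CountDownLatch nodeUp = new CountDownLatch(1);

    /** Called by whatever listener receives the node-up event, if it fires. */
    void onNodeUp() {
        nodeUp.countDown();
    }

    /**
     * Decides whether the deployment may move past the bootstrap phase.
     * Prefers the event; falls back to a direct check when it is missed.
     */
    boolean mayAdvance(long timeout, TimeUnit unit, boolean bootstrapSucceeded,
                       Set<String> clusterMembers, String newNodeAddress) {
        try {
            if (nodeUp.await(timeout, unit)) {
                return true; // the event arrived, so the node is known to be up
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt status, then fall through
        }
        // Event missed: continue only if both conditions from the comment hold.
        return bootstrapSucceeded && clusterMembers.contains(newNodeAddress);
    }
}
```

Bounding the wait is the point of the design: a missed event then costs one timeout instead of leaving the node in bootstrap mode indefinitely.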

Comment 3 Armine Hovsepyan 2013-09-02 10:24:45 UTC
Created attachment 792791
storage-uninstall.png

Comment 4 Armine Hovsepyan 2013-09-02 10:25:09 UTC
Created attachment 792792
storage-reinstalled.png

Comment 5 Armine Hovsepyan 2013-09-02 10:25:55 UTC
Verified.
Please see the attached screenshots.

A new bug, #1003545, has been filed for the exception shown in the re-installation screenshot.

Comment 6 Heiko W. Rupp 2013-09-24 19:09:14 UTC
Bulk closing of RHQ 4.9 verified items