1002202 – storage node stays in bootstrap mode while joining to cluster if it had already been deployed - undeployed

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Installer
Sub Component:
Version:	4.9
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	RHQ 4.9
Assignee:	John Sanda
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	951619

Reported:	2013-08-28 15:33 UTC by Armine Hovsepyan
Modified:	2015-09-03 00:01 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-09-24 19:09:14 UTC
Embargoed:

Attachments
storage-uninstall.png (590.79 KB, image/png) 2013-09-02 10:24 UTC, Armine Hovsepyan
storage-reinstalled.png (514.64 KB, image/png) 2013-09-02 10:25 UTC, Armine Hovsepyan

Description Armine Hovsepyan 2013-08-28 15:33:24 UTC

Description of problem:
The storage node stays in bootstrap mode while joining the cluster if it was previously deployed and then undeployed.

Version-Release number of selected component (if applicable):
e2a1811

How reproducible:
always

Steps to Reproduce:
1. Install and start the RHQ server, storage node, and agent on ip1.
2. Install and start a storage node and agent on ip2.
3. Undeploy the storage node on ip2.
4. Deploy the storage node on ip2 again.

Actual results:
The storage node stays in bootstrap mode "forever".

Expected results:
The storage node goes through the INSTALL -> ANNOUNCE -> BOOTSTRAP -> ADD_MAINTENANCE modes and then reaches normal cluster status.
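
The expected sequence can be pictured as a simple linear state machine. The sketch below is only an illustration: the mode names come from this report, while the enum itself, the terminal NORMAL state, and the next() and fullPath() helpers are hypothetical and are not RHQ's actual classes.

```java
// Hypothetical sketch: the mode names are taken from this bug report, but the
// enum, the NORMAL terminal state, and the helpers are illustrative only.
enum DeployMode {
    INSTALL, ANNOUNCE, BOOTSTRAP, ADD_MAINTENANCE, NORMAL;

    // Returns the mode that follows this one; NORMAL is terminal.
    DeployMode next() {
        return this == NORMAL ? NORMAL : values()[ordinal() + 1];
    }
}

class DeployModeDemo {
    // Walks the expected path from INSTALL to NORMAL and renders it as text.
    static String fullPath() {
        StringBuilder path = new StringBuilder();
        DeployMode mode = DeployMode.INSTALL;
        path.append(mode);
        while (mode != DeployMode.NORMAL) {
            mode = mode.next();
            path.append(" -> ").append(mode);
        }
        return path.toString();
    }
}
```

In these terms, the bug is that a redeployed node never leaves BOOTSTRAP, so next() is effectively never invoked for it.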

Additional info:
Investigation results from John Sanda:
Once the C* bootstrap finishes, we wait for an event notification from the driver that the node is up, and then we change its mode from bootstrap to add_maintenance. That event is not firing. This happens with a node that was previously deployed.

Comment 1 John Sanda 2013-08-28 16:51:38 UTC

We are running into https://issues.apache.org/jira/browse/CASSANDRA-5769. The node-up event is not reported over the native CQL protocol. We use this event notification to determine that the node has joined the cluster, at which point we can initiate the necessary cluster maintenance. Since the event has not fired, we are in a perpetual holding pattern.
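
The holding pattern described above can be sketched as a wait on a one-shot signal. This is a minimal illustration, not the RHQ code: NodeUpWaiter, onNodeUp(), and awaitNodeUp() are hypothetical names, and the bounded timeout is an assumption added here to show how a missing event can be detected instead of blocking indefinitely.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of waiting for a "node up" notification.
class NodeUpWaiter {
    private final CountDownLatch nodeUp = new CountDownLatch(1);

    // Would be called from a driver callback when the node is reported up.
    void onNodeUp() {
        nodeUp.countDown();
    }

    // Returns true if the event arrived within the timeout, false otherwise.
    boolean awaitNodeUp(long timeoutMillis) {
        try {
            return nodeUp.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt status and report that no event was seen.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

If onNodeUp() is never called, as happens when the event is not delivered, awaitNodeUp() returns false once the timeout elapses; a wait without a timeout would never return.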

Even though we are close to releasing 4.9, I think upgrading C* makes sense for a couple of reasons. First, it resolves this issue. Second, it gives us an opportunity to test upgrading our C* bits in the community before a JON release.

Comment 2 John Sanda 2013-08-31 22:20:19 UTC

I have upgraded RHQ to use Cassandra 1.2.9 which includes the fix for CASSANDRA-5769. Even with that fix there are still scenarios in which the server could miss the event notification that advances the deployment beyond the bootstrap phase; consequently, I put some additional logic in place to continue the deployment if the bootstrap is successful and if the new node is part of the cluster.
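
The additional logic might look roughly like the check below. This is a sketch under stated assumptions, not the committed change: BootstrapCompletionCheck, canAdvance(), and its parameters are hypothetical names, and how bootstrap success and cluster membership are actually determined is not described in this report.

```java
import java.util.Set;

// Hypothetical sketch of the fallback decision described in this comment.
class BootstrapCompletionCheck {
    // Decides whether the deployment may move past the bootstrap phase.
    static boolean canAdvance(boolean upEventSeen,
                              boolean bootstrapSucceeded,
                              Set<String> clusterMembers,
                              String newNodeAddress) {
        // Normal path: the driver delivered the node-up notification.
        if (upEventSeen) {
            return true;
        }
        // Fallback path: the event was missed, but the bootstrap succeeded
        // and the new node is already listed as a cluster member.
        return bootstrapSucceeded && clusterMembers.contains(newNodeAddress);
    }
}
```

With this shape, a missed notification no longer blocks the deployment, provided both fallback conditions hold.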

Comment 3 Armine Hovsepyan 2013-09-02 10:24:45 UTC

Created attachment 792791
storage-uninstall.png

Comment 4 Armine Hovsepyan 2013-09-02 10:25:09 UTC

Created attachment 792792
storage-reinstalled.png

Comment 5 Armine Hovsepyan 2013-09-02 10:25:55 UTC

Verified.
Please see the attached screenshots.

A new bug, #1003545, has been filed for the exception shown in the re-installation screenshot.

Comment 6 Heiko W. Rupp 2013-09-24 19:09:14 UTC

Bulk closing of RHQ 4.9 verified items
