Bug 1019841 - storage node cannot be started if "prepare for bootstrap" operation canceled
Summary: storage node cannot be started if "prepare for bootstrap" operation canceled
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Operations, Storage Node
Version: JON 3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ER07
Target Release: JON 3.2.0
Assignee: John Sanda
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On:
Blocks: 1002210 1012435
 
Reported: 2013-10-16 13:32 UTC by Armine Hovsepyan
Modified: 2015-09-03 00:02 UTC
CC List: 3 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2014-01-02 20:35:34 UTC
Type: Bug
Embargoed:


Attachments
  server.log (3.79 KB, text/x-log) - 2013-10-16 13:32 UTC, Armine Hovsepyan
  rhqctl_start-storage.log (10.89 KB, text/x-log) - 2013-10-16 13:33 UTC, Armine Hovsepyan
  deployment resource operation warning (19.81 KB, image/png) - 2013-11-13 03:22 UTC, John Sanda
  snapshot (181.91 KB, image/png) - 2013-12-03 16:21 UTC, Armine Hovsepyan
  snapshot2 (213.49 KB, image/png) - 2013-12-03 16:21 UTC, Armine Hovsepyan
  no-snapshot (182.69 KB, image/png) - 2013-12-03 16:22 UTC, Armine Hovsepyan
  no-snapshot2 (195.35 KB, image/png) - 2013-12-03 16:22 UTC, Armine Hovsepyan

Description Armine Hovsepyan 2013-10-16 13:32:38 UTC
Created attachment 812927 [details]
server.log

Description of problem:
storage node cannot be started if the "prepare for bootstrap" operation is canceled

Version-Release number of selected component (if applicable):
jon 3.2 er3

How reproducible:
always

Steps to Reproduce:
1. Install and start the JON server, storage node, and agent on IP1.
2. Install and start a JON storage node and agent on IP2, and connect the agent to the server on IP1 (example commands for steps 1-2 are sketched after this list).
3. Wait for the storage node on IP2 to connect to the one on IP1.
4. Navigate to the resource of the storage node on IP2.
5. Run the "prepare for bootstrap" operation.
6. Cancel the "prepare for bootstrap" operation.
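
A rough sketch of the commands behind steps 1 and 2, assuming the standard JON 3.2 rhqctl tooling; the exact flags and the way the agent on IP2 is pointed at the server on IP1 are assumptions here, so treat this as an illustration rather than the documented procedure:

  # On IP1: install and start everything (server, storage node, agent)
  rhqctl install
  rhqctl start

  # On IP2: install and start a storage node (the agent that manages it is
  # assumed to be installed alongside and configured to talk to the server on IP1)
  rhqctl install --storage
  rhqctl start --storage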

Actual results:
" Deployment has been aborted due to failed operation [Prepare For Bootstrap]  " exception thrown in server.log
storage goes down
unable to start storage node

Expected results:
storage node on IP2 is still functioning (?)

Additional info:
fragment from server.log attached
fragment from storage start attached

Marked the bug as high severity, since starting/canceling any operation should not destabilize services. Feel free to lower it, or cover it with documentation only, if canceling this operation is not supposed to be done.

Comment 1 Armine Hovsepyan 2013-10-16 13:33:00 UTC
Created attachment 812929 [details]
rhqctl_start-storage.log

Comment 2 Mike Foley 2013-11-11 16:38:39 UTC
Targeting for JON 3.2 ER6 and adding GA Blocker per BZ triage (santos, loleary, heute, foley, yaroboro).


The minimum requirement is to get the storage node started after step 6.

Comment 3 John Sanda 2013-11-11 16:50:08 UTC
If you cancel one of the deployment operations, it should be assumed that the storage node is not functioning, insofar as being part of the storage cluster is concerned. Only when the deployment has completed successfully can it safely be assumed that the node is functioning properly, where properly means being a member of the storage cluster. The resolution here is simply to redeploy the node.

Armine, can you please retest with retrying the deployment?

Comment 4 Armine Hovsepyan 2013-11-12 11:01:46 UTC
John, I've tried the same scenario again. The result is:

After step 3 I wait for the storage node to enter the cluster (step 4), then run the "prepare for bootstrap" operation and cancel it. After that it is impossible to get the storage node back: it is installed and inactive, it cannot be re-deployed or undeployed, and neither restart nor start brings the storage node up.

Comment 5 John Sanda 2013-11-12 12:11:33 UTC
Thanks for the clarification. Can you provide a current server.log and rhq-storage.log?

Comment 6 John Sanda 2013-11-12 15:21:36 UTC
I reviewed the logs and got clarification from Armine about the reproduction steps. The prepare for bootstrap operation was invoked manually with bad parameters, essentially resulting in a bad storage node configuration that prevented the storage node from starting up. It logged the following error at startup due to the misconfigured seeds property in cassandra.yaml:

ERROR [main] 2013-11-12 05:40:08,279 CassandraDaemon.java (line 464) Exception encountered during startup
java.lang.IllegalStateException: Unable to contact any seeds!
        at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:947)
        at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:716)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:554)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:451)
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
ERROR [StorageServiceShutdownHook] 2013-11-12 05:40:08,285 CassandraDaemon.java (line 192) Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
        at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
        at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:370)
        at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
        at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:519)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at java.lang.Thread.run(Thread.java:679)
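
The "Unable to contact any seeds!" error is what Cassandra reports when the seeds list it starts with is empty or points at nodes it cannot reach. For illustration only, a generic cassandra.yaml seed_provider section looks roughly like the sketch below; the provider class and values used by the JON storage node are assumptions, and the point is simply which property the bad operation parameters ended up corrupting:

  seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
            # should list the address(es) of existing cluster node(s), e.g. IP1;
            # an empty or unreachable list here produces "Unable to contact any seeds!"
            - seeds: "IP1"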


After the initial failed bootstrap operation, other resource operations that are intended to be run only as part of the deployment process were run against the node. This eventually updated the node's operation mode from BOOTSTRAP to ADD_MAINTENANCE; consequently, any subsequent attempt to deploy the node (using the deploy workflow from the admin UI) would start in the ADD_MAINTENANCE phase, which is also bad because the node was never bootstrapped into the cluster properly.

As it stands now, the second node is part of the cluster, but the other node in the cluster has it flagged as being down. This is because the rhq-storage-auth.conf file of the new node was not updated due to the failed bootstrap.
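
For context, rhq-storage-auth.conf is the per-node whitelist of storage cluster member addresses that internode connections are checked against. Assuming the usual one-address-per-line format (an assumption for illustration, not taken from the attached logs), after a successful deployment the file on the new node would list every cluster member, roughly:

  IP1
  IP2

Because the bootstrap failed before this update happened, the new node presumably still rejects internode connections from the existing node, which is why the existing node flags it as down.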

All of the resource operations that are part of the (un)deployment process have the following description, which appears in the UI:

"This operation is NOT intended for direct usage. It is part of the deployment process. Please see the storage node deployment documentation for more information."

The message alone obviously does not prevent someone from running the (un)deployment resource operations, but I am not sure that I would consider this a valid scenario. Other than making sure we have good docs around the deployment process, there is one other thing I might consider doing: during the deployment process, before the new node is bootstrapped, take snapshots of the existing cluster nodes. Then if, for whatever reason, the cluster does get into a bad state as a result of the deployment, we have a backup from which we can restore.

Comment 7 John Sanda 2013-11-13 01:23:45 UTC
To summarize some of the details from comment 5, I do not think this is a valid scenario, specifically steps 5 and 6. As I explained above, a warning is displayed to the user when she directly schedules a deployment resource operation such as prepare for bootstrap. I am uploading a screenshot that shows the warning message.

The action item I am taking for this bug is to generate snapshots of all cluster nodes before bootstrapping the new node. This way, if something goes wrong that puts the cluster in a dysfunctional state from which we cannot easily recover, we have a known good state to which we can roll back.

Comment 8 John Sanda 2013-11-13 03:22:56 UTC
Created attachment 823240 [details]
deployment resource operation warning

Comment 9 John Sanda 2013-11-14 20:44:46 UTC
The following changes have been made. During the ANNOUNCE phase of deployment, snapshots of the system, system_auth, and rhq keyspaces are taken. The snapshot directory name will be of the form "pre_<new_node_address>_bootstrap_<timestamp>". During the UNANNOUNCE phase of undeployment, snapshots of the aforementioned keyspaces are taken. The snapshot directory name will be of the form "pre_<removed_node_address>_decommission_<timestamp>".
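
These are regular Cassandra snapshots, so they can be inspected and cleaned up with nodetool. A rough sketch, assuming standard Cassandra nodetool syntax (the tag names simply follow the naming pattern above; add -p <jmx-port> if the storage node does not use the default JMX port):

  # roughly what the ANNOUNCE phase now does on each existing cluster node
  nodetool snapshot system system_auth rhq -t pre_<new_node_address>_bootstrap_<timestamp>

  # snapshots end up under the node's data directory:
  #   <data-dir>/<keyspace>/<column-family>/snapshots/pre_<new_node_address>_bootstrap_<timestamp>

  # once the deployment is known to be good, the snapshot can be removed
  nodetool clearsnapshot -t pre_<new_node_address>_bootstrap_<timestamp>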

release/jon3.2.x branch commit: 228ab450

Comment 10 Simeon Pinder 2013-11-19 15:47:51 UTC
Moving to ON_QA as available for testing with new brew build.

Comment 11 Simeon Pinder 2013-11-22 05:13:27 UTC
Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.

Comment 13 Armine Hovsepyan 2013-12-03 16:21:33 UTC
Created attachment 832205 [details]
snapshot

Comment 14 Armine Hovsepyan 2013-12-03 16:21:56 UTC
Created attachment 832206 [details]
snapshot2

Comment 15 Armine Hovsepyan 2013-12-03 16:22:24 UTC
Created attachment 832208 [details]
no-snapshot

Comment 16 Armine Hovsepyan 2013-12-03 16:22:49 UTC
Created attachment 832209 [details]
no-snapshot2

Comment 17 Armine Hovsepyan 2013-12-03 16:24:20 UTC
Verified in ER7.
Snapshots are created for the "first" node in the cluster while other nodes are joining (snapshot.png, snapshot2.png).
No snapshots are created for the "new" nodes (no-snapshot.png, no-snapshot2.png).

