Bug 1019841

Summary: storage node cannot be started if the "prepare for bootstrap" operation is canceled
Product: [JBoss] JBoss Operations Network
Reporter: Armine Hovsepyan <ahovsepy>
Component: Operations, Storage Node
Assignee: John Sanda <jsanda>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: high
Priority: unspecified
Version: JON 3.2
CC: ahovsepy, jsanda, mfoley
Target Milestone: ER07
Target Release: JON 3.2.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2014-01-02 20:35:34 UTC
Type: Bug
Bug Blocks: 1002210, 1012435    
Attachments:
server.log
rhqctl_start-storage.log
deployment resource operation warning
snapshot
snapshot2
no-snapshot
no-snapshot2

Description Armine Hovsepyan 2013-10-16 13:32:38 UTC
Created attachment 812927 [details]
server.log

Description of problem:
storage node cannot be started if the "prepare for bootstrap" operation is canceled

Version-Release number of selected component (if applicable):
jon 3.2 er3

How reproducible:
always

Steps to Reproduce:
1. install and start the JON server, storage node, and agent on IP1
2. install and start a JON storage node and agent on IP2, and connect the agent to IP1
3. wait for the storage node on IP2 to connect to the one on IP1
4. navigate to the resource of the storage node on IP2
5. run the "prepare for bootstrap" operation
6. cancel the "prepare for bootstrap" operation

Actual results:
" Deployment has been aborted due to failed operation [Prepare For Bootstrap]  " exception thrown in server.log
storage goes down
unable to start storage node

Expected results:
storage on IP2 is still functioning (?)

Additional info:
fragment from server.log attached
fragment from the storage start log (rhqctl_start-storage.log) attached

Marked the bug as high severity, since starting or canceling an operation should not destabilize services. Feel free to lower it, or cover this with documentation only, if canceling the operation is not something that should be done.

Comment 1 Armine Hovsepyan 2013-10-16 13:33:00 UTC
Created attachment 812929 [details]
rhqctl_start-storage.log

Comment 2 Mike Foley 2013-11-11 16:38:39 UTC
Targeting JON 3.2 ER6 and adding GA Blocker per BZ triage (santos, loleary, heute, foley, yaroboro).

The minimum requirement is to get the storage node started after step 6.

Comment 3 John Sanda 2013-11-11 16:50:08 UTC
If you cancel one of the deployment operations, it should be assumed that the storage node is not functioning, insofar as being part of the storage cluster is concerned. Only when the deployment has completed successfully can it safely be assumed that the node is functioning properly, where properly means being a member of the storage cluster. The resolution here is simply to redeploy the node.

Armine, can you please retest with retrying the deployment?

Comment 4 Armine Hovsepyan 2013-11-12 11:01:46 UTC
John, I've tried the same scenario again. The result is:

After step 3 I wait for the storage node to enter the cluster (step 4), then run the "prepare for bootstrap" operation and cancel it. After that it is impossible to get the storage node back: it is installed and inactive, it cannot be re-deployed or undeployed, and neither restart nor start brings the storage node up.

Comment 5 John Sanda 2013-11-12 12:11:33 UTC
Thanks for the clarification. Can you provide a current server.log and rhq-storage.log?

Comment 6 John Sanda 2013-11-12 15:21:36 UTC
I reviewed the logs and got clarification from Armine around the reproduction steps. The prepare for bootstrap operation was invoked manually with bad parameters, essentially resulting in a bad storage node configuration that prevented the storage node from starting up. The node logged this error at startup due to the misconfigured seeds property in cassandra.yaml:

ERROR [main] 2013-11-12 05:40:08,279 CassandraDaemon.java (line 464) Exception encountered during startup
java.lang.IllegalStateException: Unable to contact any seeds!
        at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:947)
        at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:716)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:554)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:451)
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
ERROR [StorageServiceShutdownHook] 2013-11-12 05:40:08,285 CassandraDaemon.java (line 192) Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
        at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
        at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:370)
        at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
        at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:519)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at java.lang.Thread.run(Thread.java:679)
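
For reference, "Unable to contact any seeds!" means the seeds value under seed_provider in cassandra.yaml does not list a reachable cluster member. Below is a minimal sketch of how the configured value could be inspected; it assumes SnakeYAML (bundled with Cassandra) on the classpath and uses a placeholder path to the storage node's cassandra.yaml.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;

import org.yaml.snakeyaml.Yaml;

public class PrintSeeds {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws Exception {
        // Placeholder path; point this at the storage node's conf/cassandra.yaml.
        String path = args.length > 0 ? args[0] : "rhq-storage/conf/cassandra.yaml";
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            Map<String, Object> config = (Map<String, Object>) new Yaml().load(in);
            List<Map<String, Object>> providers =
                    (List<Map<String, Object>>) config.get("seed_provider");
            for (Map<String, Object> provider : providers) {
                List<Map<String, Object>> params =
                        (List<Map<String, Object>>) provider.get("parameters");
                for (Map<String, Object> param : params) {
                    // "seeds" is a comma-separated list of addresses; it must name at
                    // least one reachable cluster member for bootstrap to succeed.
                    System.out.println("seeds = " + param.get("seeds"));
                }
            }
        }
    }
}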


After the initial failed bootstrap operation, other resource operations that are intended to be run only as part of the deployment process were run against the node, which eventually updated the node's operation mode from BOOTSTRAP to ADD_MAINTENANCE. Consequently, any subsequent attempt to deploy the node (using the deploy workflow from the admin UI) would start in the ADD_MAINTENANCE phase, which is also bad because the node was never properly bootstrapped into the cluster.
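
For readers unfamiliar with the workflow, the phases named above can be pictured roughly as follows. This enum is purely illustrative and is not the actual RHQ type or its complete set of values.

enum DeploymentPhase {
    ANNOUNCE,        // existing cluster nodes are told about the new node
    BOOTSTRAP,       // "prepare for bootstrap" runs and the new node joins the ring
    ADD_MAINTENANCE  // post-bootstrap maintenance; only meaningful after a successful BOOTSTRAP
}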

As it stands now, the second node is part of the cluster, but the other node in the cluster has it flagged as being down. This is because the rhq-storage-auth.conf file of the new node was not updated due to the failed bootstrap.

All of the resource operations that are part of the (un)deployment process have the following description, which appears in the UI:

"This operation is NOT intended for direct usage. It is part of the deployment process. Please see the storage node deployment documentation for more information."

The message alone obviously does not prevent someone from running the (un)deployment resource operations, but I am not sure that I would consider this a valid scenario. Other than making sure we have good docs around the deployment process, there is one other thing I might consider doing. During the deployment process, before the new node is bootstrapped, we take snapshots of existing cluster nodes. Then if for whatever reason the cluster does get into a bad state as a result of the deployment, we have a backup from which we can restore.

Comment 7 John Sanda 2013-11-13 01:23:45 UTC
To summarize some of the details from comment 6: I do not think this is a valid scenario, specifically steps 5 and 6. As I explained above, a warning is displayed to the user when she directly schedules a deployment resource operation such as prepare for bootstrap. I am uploading a screenshot that shows the warning message.

The action item I am taking for this bug is to generate snapshots of all cluster nodes before bootstrapping the new node. This way, if something goes wrong that puts the cluster into a dysfunctional state from which we cannot easily recover, we have a known good state to which we can roll back.

Comment 8 John Sanda 2013-11-13 03:22:56 UTC
Created attachment 823240 [details]
deployment resource operation warning

Comment 9 John Sanda 2013-11-14 20:44:46 UTC
The following changes have been made. During the ANNOUNCE phase of deployment, snapshots of the system, system_auth, and rhq keyspaces are taken. The snapshot directory name is of the form "pre_<new_node_address>_bootstrap_<timestamp>". During the UNANNOUNCE phase of undeployment, snapshots of the aforementioned keyspaces are taken. The snapshot directory name is of the form "pre_<removed_node_address>_decommission_<timestamp>".
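
As a rough illustration of what such a snapshot amounts to, the sketch below takes a tagged snapshot of the three keyspaces over Cassandra's JMX interface. The host, JMX port, and node address are placeholders, and this is not the actual RHQ implementation; it only assumes the standard StorageServiceMBean.takeSnapshot operation and the Cassandra client classes on the classpath.

import javax.management.JMX;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

import org.apache.cassandra.service.StorageServiceMBean;

public class PreBootstrapSnapshot {
    public static void main(String[] args) throws Exception {
        String host = "127.0.0.1";                // placeholder storage node host
        int jmxPort = 7299;                       // placeholder; use the node's configured JMX port
        String newNodeAddress = "10.16.23.99";    // placeholder address of the node being deployed

        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + jmxPort + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            StorageServiceMBean storageService = JMX.newMBeanProxy(
                    mbsc,
                    new ObjectName("org.apache.cassandra.db:type=StorageService"),
                    StorageServiceMBean.class);

            // Tag follows the naming scheme described in this comment.
            String tag = "pre_" + newNodeAddress + "_bootstrap_" + System.currentTimeMillis();

            // Snapshot only the keyspaces mentioned above.
            storageService.takeSnapshot(tag, "system", "system_auth", "rhq");
        }
    }
}

The UNANNOUNCE case would use a "pre_<removed_node_address>_decommission_<timestamp>" tag in the same way.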

release/jon3.2.x branch commit: 228ab450

Comment 10 Simeon Pinder 2013-11-19 15:47:51 UTC
Moving to ON_QA as this is available for testing with the new brew build.

Comment 11 Simeon Pinder 2013-11-22 05:13:27 UTC
Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.

Comment 13 Armine Hovsepyan 2013-12-03 16:21:33 UTC
Created attachment 832205 [details]
snapshot

Comment 14 Armine Hovsepyan 2013-12-03 16:21:56 UTC
Created attachment 832206 [details]
snapshot2

Comment 15 Armine Hovsepyan 2013-12-03 16:22:24 UTC
Created attachment 832208 [details]
no-snapshot

Comment 16 Armine Hovsepyan 2013-12-03 16:22:49 UTC
Created attachment 832209 [details]
no-snapshot2

Comment 17 Armine Hovsepyan 2013-12-03 16:24:20 UTC
Verified in ER7.
Snapshots are being created for the "first" node in the cluster while other nodes join (snapshot.png, snapshot2.png).
No snapshots are created for the "new" nodes (no-snapshot.png, no-snapshot2.png).