Bug 1234912
| Summary: | Do not authenticate against new storage node when replication_factor of system_auth keyspace is wrong |
|---|---|
| Product: | [JBoss] JBoss Operations Network |
| Component: | Core Server, Storage Node |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| Version: | JON 3.3.0 |
| Target Milestone: | ER02 |
| Target Release: | JON 3.3.4 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | John Sanda <jsanda> |
| Assignee: | Libor Zoubek <lzoubek> |
| QA Contact: | Filip Brychta <fbrychta> |
| CC: | fbrychta, loleary, lzoubek, spinder, theute |
| Keywords: | Triaged |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Last Closed: | 2015-10-28 14:36:50 UTC |
| Bug Blocks: | 1200594 |
Description (John Sanda, 2015-06-23 13:30:43 UTC)
At start up, and maybe periodically as a scheduled job, we should check that the replication_factor is what we expect it to be for the system_auth and rhq keyspaces. Of course, in the original scenario described this won't be possible, since we cannot authenticate against the new node. We store and track the state of cluster maintenance in the rhq_storage_node table in the RDBMS. I think we need an explicit state stored somewhere in the RDBMS that allows us to easily and immediately (at startup) identify the problem. State is tracked using the StorageNode.OperationMode enum. Maybe we could add two additional values like UPDATE_SYSTEM_AUTH_SCHEMA and UPDATE_RHQ_SCHEMA. The one problem with storing state in this way is that if another deploy or undeploy process is started, we essentially lose this state information. This problem is not specific to this situation; it is a problem in general with how we store and track state with respect to cluster maintenance.

branch: master
link: https://github.com/rhq-project/rhq/commit/278fc3a2a
time: 2015-09-30 15:41:46 +0200
commit: 278fc3a2a95c7eb1ce0af7b0ff80f73d0f309b8d
author: Libor Zoubek - lzoubek
message: Bug 1234912 - Do not authenticate against new storage node when replication_factor of system_auth keyspace is wrong

For the system_auth keyspace, set replication_factor=clusterSize so that each node keeps its own copy of the auth data. Created a recurring job which checks the replication_factor of the rhq and system_auth keyspaces; when an invalid replication_factor is detected, the job tries to fix it and then recommends running clusterMaintenance. This commit also changes the "expected" replication factor of the system_auth keyspace to be equal to the number of nodes.
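The commit's approach can be sketched as follows. This is a hypothetical illustration, not the actual RHQ code: the function names are invented, and the cap of 3 on the rhq keyspace's replication factor is an assumption made for the sketch. Only the system_auth rule (replication_factor equal to cluster size, so every node holds its own copy of the auth data) comes from the commit message.

```python
# Hypothetical sketch of the recurring check-and-repair job described in
# the commit -- NOT the actual RHQ implementation. The cap of 3 for the
# rhq keyspace is an illustrative assumption.

def expected_replication_factor(keyspace: str, cluster_size: int) -> int:
    """Return the replication factor the job should enforce."""
    if keyspace == "system_auth":
        # Every node keeps its own copy of the auth data, so a node can
        # authenticate clients even when other replicas are unreachable.
        return cluster_size
    # Data keyspaces do not need full replication (cap is an assumption).
    return min(cluster_size, 3)

def repair_cql(keyspace: str, cluster_size: int) -> str:
    """CQL statement the job would issue when the factor is wrong."""
    rf = expected_replication_factor(keyspace, cluster_size)
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {rf}}}"
    )
```

Note that altering a keyspace's replication factor only updates schema metadata; existing data still has to be streamed to the new replicas, which is why the job recommends running clusterMaintenance (an anti-entropy repair) after fixing the factor.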
branch: release/jon3.3.x
link: https://github.com/rhq-project/rhq/commit/ee4afd78d
time: 2015-09-30 19:33:16 +0200
commit: ee4afd78df30af016539b925de06179827c40773
author: Libor Zoubek - lzoubek
message: Bug 1234912 - Do not authenticate against new storage node when replication_factor of system_auth keyspace is wrong

For the system_auth keyspace, set replication_factor=clusterSize so that each node keeps its own copy of the auth data. Created a recurring job which checks the replication_factor of the rhq and system_auth keyspaces; when an invalid replication_factor is detected, the job tries to fix it and then recommends running clusterMaintenance. This commit also changes the "expected" replication factor of the system_auth keyspace to be equal to the number of nodes.

(cherry picked from commit 278fc3a2a95c7eb1ce0af7b0ff80f73d0f309b8d)
Signed-off-by: Libor Zoubek <lzoubek>

branch: master
link: https://github.com/rhq-project/rhq/commit/7fb9222c8
time: 2015-10-05 15:41:16 +0200
commit: 7fb9222c80981fb876d8a7eea472304761f42555
author: Libor Zoubek - lzoubek
message: Bug 1234912 - Do not authenticate against new storage node when replication_factor of system_auth keyspace is wrong

Correctly close the storage cluster session and fix the scheduling interval of the job.

branch: release/jon3.3.x
link: https://github.com/rhq-project/rhq/commit/3ef061530
time: 2015-10-05 15:42:13 +0200
commit: 3ef06153042b4105a1da6dd678944e3240a25f4f
author: Libor Zoubek - lzoubek
message: Bug 1234912 - Do not authenticate against new storage node when replication_factor of system_auth keyspace is wrong

Correctly close the storage cluster session and fix the scheduling interval of the job.

(cherry picked from commit 7fb9222c80981fb876d8a7eea472304761f42555)
Signed-off-by: Libor Zoubek <lzoubek>

branch: master
link: https://github.com/rhq-project/rhq/commit/e1fa9edbe
time: 2015-10-08 16:34:35 +0200
commit: e1fa9edbe0a53bf39c86312cf7a8848e934ac57b
author: Libor Zoubek - lzoubek
message: Bug 1234912 - Do not authenticate against new storage node when replication_factor of system_auth keyspace is wrong

Fix the "healthy" replication factor definition.

branch: release/jon3.3.x
link: https://github.com/rhq-project/rhq/commit/fa7b1a1f8
time: 2015-10-08 17:00:58 +0200
commit: fa7b1a1f8dc55140e8b9fc900db044bde3892f98
author: Libor Zoubek - lzoubek
message: Bug 1234912 - Do not authenticate against new storage node when replication_factor of system_auth keyspace is wrong

Fix the "healthy" replication factor definition.

(cherry picked from commit e1fa9edbe0a53bf39c86312cf7a8848e934ac57b)
Signed-off-by: Libor Zoubek <lzoubek>

Moving to ON_QA as available to test with the following build: https://brewweb.devel.redhat.com/buildinfo?buildID=460382
*Note: jon-server-patch-3.3.0.GA.zip maps to the ER01 build of jon-server-3.3.0.GA-update-04.zip.

Moving target milestone to ER02 to retest after the latest Cassandra changes.

Moving to ON_QA as available to test with the following build: https://brewweb.devel.redhat.com//buildinfo?buildID=461043
*Note: jon-server-patch-3.3.0.GA.zip maps to the ER02 build of jon-server-3.3.0.GA-update-04.zip.

Verified on:
Version: 3.3.0.GA Update 04
Build Number: e9ed05b:aa79ebd

Verification steps: deploying and removing up to 4 storage nodes, manually changing the replication factor for the rhq and system_auth keyspaces, and checking that those are automatically reset to the correct values.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1947.html