Description of problem:
Consider a default installation with a single storage node. Then we deploy a second node. When the bootstrap phase of the deployment finishes, we apply schema changes on the server side; namely, we change the replication_factor of both the system_auth and rhq keyspaces to two. We then start the add_maintenance phase of the deployment, during which we run repair on both keyspaces on every node. There have been situations in which the deployment fails before the replication_factor of the system_auth keyspace has been updated, which means that authentication data will not be replicated to the new node. If the server is restarted, we restart the Cassandra driver and attempt to connect to both nodes. In the worst case, server start up will fail completely if the first/original node is down. To make things worse, users might run repair via the cluster maintenance operation in the hope of resolving the issue; because the replication_factor is still one, there is nothing to replicate and the problem persists. We do not expose or check the replication_factor anywhere, so tracking this down is typically a long, painful process.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Set up a default installation with a single storage node
2. Deploy a second storage node
3. Shut down the first storage node after the new node has bootstrapped and before the replication_factor has been changed
4. Restart the server

Actual results:
Server start up fails.

Expected results:
Server start up failure is unavoidable if the first node is down, but we need to be aware of the problem and report it to the user so that the cluster can be brought back to a good state. Minimally this should be reported in the server log, or perhaps via an alert. Specifically, we need to report that the cluster is in an inconsistent state that prevents us from authenticating against the second node, and that the first node needs to be back up and running so that we can bring the cluster back into a consistent state, which entails updating the replication_factor and running repair.

Additional info:
At start up, and maybe periodically as a scheduled job, we should check that the replication_factor is what we expect it to be for the system_auth and rhq keyspaces. Of course, in the scenario described above this won't be possible, since we cannot authenticate against the new node. We store and track the state of cluster maintenance in the rhq_storage_node table in the RDBMS. I think we need an explicit state stored somewhere in the RDBMS that allows us to easily and immediately (at startup) identify the problem. State is tracked using the StorageNode.OperationMode enum. Maybe we could add two additional values such as UPDATE_SYSTEM_AUTH_SCHEMA and UPDATE_RHQ_SCHEMA. The one problem with storing state this way is that if another deploy or undeploy process is started, we essentially lose this state information. That problem is not specific to this situation; it is a general problem with how we store and track state for cluster maintenance.
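To make the suggested check concrete, here is a minimal sketch, not the RHQ implementation, of reading the replication settings through the schema metadata of the DataStax Java driver (assuming driver 2.x, SimpleStrategy keyspaces, and an illustrative contact point, credentials, and expected factor):

import java.util.Map;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.KeyspaceMetadata;

public class ReplicationFactorCheck {

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")            // assumed storage node address
            .withCredentials("rhqadmin", "rhqadmin") // assumed storage credentials
            .build();
        try {
            int expectedFactor = 2; // e.g. a two-node storage cluster
            checkKeyspace(cluster, "system_auth", expectedFactor);
            checkKeyspace(cluster, "rhq", expectedFactor);
        } finally {
            cluster.close();
        }
    }

    private static void checkKeyspace(Cluster cluster, String keyspace, int expected) {
        KeyspaceMetadata meta = cluster.getMetadata().getKeyspace(keyspace);
        Map<String, String> replication = meta.getReplication();
        int actual = Integer.parseInt(replication.get("replication_factor"));
        if (actual != expected) {
            // In the server this would be written to the log or raised as an alert.
            System.err.println("Keyspace " + keyspace + " has replication_factor " + actual
                + " but " + expected + " was expected; the cluster is in an inconsistent state.");
        }
    }
}

In the failure scenario above this check can only run once the first node is reachable again, since the driver cannot authenticate against the new node until the auth data has been replicated.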
branch: master
link: https://github.com/rhq-project/rhq/commit/278fc3a2a
time: 2015-09-30 15:41:46 +0200
commit: 278fc3a2a95c7eb1ce0af7b0ff80f73d0f309b8d
author: Libor Zoubek - lzoubek
message: Bug 1234912 - Do not authenticate against new storage node when replication_factor of system_auth keyspace is wrong

For the system_auth keyspace, set replication_factor=clusterSize so that each node keeps its own copy of the auth data. Created a recurring job which checks the replication_factor of the rhq and system_auth keyspaces; when an invalid replication_factor is detected, the job tries to fix it and then recommends running clusterMaintenance. This commit also changes the "expected" replication factor of the system_auth keyspace to be equal to the number of nodes.
branch: release/jon3.3.x
link: https://github.com/rhq-project/rhq/commit/ee4afd78d
time: 2015-09-30 19:33:16 +0200
commit: ee4afd78df30af016539b925de06179827c40773
author: Libor Zoubek - lzoubek
message: Bug 1234912 - Do not authenticate against new storage node when replication_factor of system_auth keyspace is wrong

For the system_auth keyspace, set replication_factor=clusterSize so that each node keeps its own copy of the auth data. Created a recurring job which checks the replication_factor of the rhq and system_auth keyspaces; when an invalid replication_factor is detected, the job tries to fix it and then recommends running clusterMaintenance. This commit also changes the "expected" replication factor of the system_auth keyspace to be equal to the number of nodes.

(cherry picked from commit 278fc3a2a95c7eb1ce0af7b0ff80f73d0f309b8d)
Signed-off-by: Libor Zoubek <lzoubek>
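For reference, a minimal sketch of the kind of correction the recurring job applies according to the commit message: raise the system_auth factor to the cluster size, restore the rhq factor, and then recommend repair. This is not the actual RHQ code; the contact point, credentials, and factor values are illustrative assumptions, and it assumes the DataStax Java driver 2.x with SimpleStrategy keyspaces.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ReplicationFactorFix {

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")            // assumed storage node address
            .withCredentials("rhqadmin", "rhqadmin") // assumed storage credentials
            .build();
        Session session = cluster.connect();
        try {
            int clusterSize = 2; // number of deployed storage nodes (assumed)
            int rhqFactor = 2;   // expected factor for the rhq keyspace (assumed)

            // system_auth: one replica per node so every node can authenticate on its own.
            session.execute("ALTER KEYSPACE system_auth WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': " + clusterSize + "}");
            // rhq: restore the expected factor for the data keyspace.
            session.execute("ALTER KEYSPACE rhq WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': " + rhqFactor + "}");

            // The schema change only affects where new writes go; existing data is not
            // copied to the additional replicas until an anti-entropy repair runs, which
            // is why the job recommends running cluster maintenance afterwards.
        } finally {
            session.close();
            cluster.close();
        }
    }
}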
branch: master
link: https://github.com/rhq-project/rhq/commit/7fb9222c8
time: 2015-10-05 15:41:16 +0200
commit: 7fb9222c80981fb876d8a7eea472304761f42555
author: Libor Zoubek - lzoubek
message: Bug 1234912 - Do not authenticate against new storage node when replication_factor of system_auth keyspace is wrong

Correctly close storage cluster session and fix scheduling interval of job.

branch: release/jon3.3.x
link: https://github.com/rhq-project/rhq/commit/3ef061530
time: 2015-10-05 15:42:13 +0200
commit: 3ef06153042b4105a1da6dd678944e3240a25f4f
author: Libor Zoubek - lzoubek
message: Bug 1234912 - Do not authenticate against new storage node when replication_factor of system_auth keyspace is wrong

Correctly close storage cluster session and fix scheduling interval of job.

(cherry picked from commit 7fb9222c80981fb876d8a7eea472304761f42555)
Signed-off-by: Libor Zoubek <lzoubek>
branch: master
link: https://github.com/rhq-project/rhq/commit/e1fa9edbe
time: 2015-10-08 16:34:35 +0200
commit: e1fa9edbe0a53bf39c86312cf7a8848e934ac57b
author: Libor Zoubek - lzoubek
message: Bug 1234912 - Do not authenticate against new storage node when replication_factor of system_auth keyspace is wrong

Fix "healthy" replication factor definition.

branch: release/jon3.3.x
link: https://github.com/rhq-project/rhq/commit/fa7b1a1f8
time: 2015-10-08 17:00:58 +0200
commit: fa7b1a1f8dc55140e8b9fc900db044bde3892f98
author: Libor Zoubek - lzoubek
message: Bug 1234912 - Do not authenticate against new storage node when replication_factor of system_auth keyspace is wrong

Fix "healthy" replication factor definition.

(cherry picked from commit e1fa9edbe0a53bf39c86312cf7a8848e934ac57b)
Signed-off-by: Libor Zoubek <lzoubek>
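The follow-up commit does not spell out the corrected "healthy" definition, but based on the earlier commit message (system_auth equal to the number of nodes) the policy being checked has roughly this shape; the cap of 3 used for the rhq keyspace below is purely an illustrative assumption, not taken from the commits.

/**
 * Hypothetical illustration of a "healthy" replication factor policy; not the
 * actual RHQ definition. Per the earlier commit message, system_auth should
 * equal the number of nodes. The rhq cap of 3 is an assumption.
 */
public class ReplicationFactorPolicy {

    // system_auth: every node keeps its own copy of the authentication data.
    static int expectedSystemAuthFactor(int clusterSize) {
        return clusterSize;
    }

    // rhq: replication grows with the cluster but is capped (cap value assumed).
    static int expectedRhqFactor(int clusterSize) {
        return Math.max(1, Math.min(clusterSize, 3));
    }

    // A keyspace is "healthy" when its actual factor matches the expected one.
    static boolean isHealthy(int actualFactor, int expectedFactor) {
        return actualFactor == expectedFactor;
    }
}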
Moving to ON_QA as available to test with the following build: https://brewweb.devel.redhat.com/buildinfo?buildID=460382 *Note: jon-server-patch-3.3.0.GA.zip maps to ER01 build of jon-server-3.3.0.GA-update-04.zip.
Moving target milestone to ER02 to retest after latest Cassandra changes.
Moving to ON_QA as available to test with the following build: https://brewweb.devel.redhat.com//buildinfo?buildID=461043 *Note: jon-server-patch-3.3.0.GA.zip maps to ER02 build of jon-server-3.3.0.GA-update-04.zip.
Verified on:
Version: 3.3.0.GA Update 04
Build Number: e9ed05b:aa79ebd

Verification steps: Deployed and removed up to 4 storage nodes, manually changed the replication factor of the rhq and system_auth keyspaces, and checked that the values were automatically reset to the correct ones.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1947.html