Bug 1139735 - There is no way to upgrade an HA environment having two servers, each with a co-located storage node, when C* schema changes are included
Summary: There is no way to upgrade an HA environment having two servers, each with a ...
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Installer, Upgrade, Storage Node
Version: JON 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: CR01
: JON 3.3.0
Assignee: Jay Shaughnessy
QA Contact: Filip Brychta
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: 1151073
 
Reported: 2014-09-09 14:14 UTC by Filip Brychta
Modified: 2014-12-11 13:59 UTC (History)
5 users

Clone Of:
: 1151073 (view as bug list)
Last Closed: 2014-12-11 13:59:31 UTC


Attachments (Terms of Use)
server1 upgrade log (16.37 KB, text/plain)
2014-10-27 11:09 UTC, Filip Brychta
no flags Details
server2 upgrade log (22.75 KB, text/plain)
2014-10-27 11:10 UTC, Filip Brychta
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Bugzilla 1157480 None None None Never

Internal Trackers: 1157480

Description Filip Brychta 2014-09-09 14:14:38 UTC
Description of problem:
Upgrading an HA environment with co-located storage nodes when C* schema changes are included should be done in the following way:
1-    Upgrade each storage node
2-    Restart each storage node
3-    Upgrade servers

but this is not possible right now: there is no option to upgrade just the storage node without the other components.

Version-Release number of selected component (if applicable):
Version :	
3.3.0.ER02
Build Number :	
4fbb183:7da54e2


Additional info:
When upgrading an HA environment, you are supposed to take down all servers and then perform the upgrade one server at a time.  The storage node schema is installed/updated by the server installer. When there are storage schema changes, as will be the case in JON 3.3.0, *all* storage nodes should be up and running while the server installer runs. Users will need to:

    Upgrade each storage node
    Restart each storage node
    Upgrade servers


This will probably cause confusion for users. Suppose I have two servers, S1 and S2, and two storage nodes, N1 and N2. I can easily see users shutting everything down, upgrading S1 and N1, and then upgrading S2 and N2. They should not do that. They need to upgrade N1 and N2 and restart them. Then upgrade S1 and S2.
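The ordering constraint above can be sketched as a small check. This is a hypothetical illustration (the class and method names are not RHQ code): given a planned upgrade sequence where "N*" entries are storage nodes and "S*" entries are servers, it verifies that every storage node is upgraded before the first server.

```java
import java.util.List;

// Illustrative sketch only, not RHQ code: validates that an upgrade sequence
// touches every storage node ("N*") before any server ("S*").
public class UpgradeOrderCheck {

    /** Returns true if all storage nodes appear before the first server. */
    public static boolean isValidOrder(List<String> steps, int storageNodeCount) {
        int nodesUpgraded = 0;
        for (String step : steps) {
            if (step.startsWith("N")) {
                nodesUpgraded++;
            } else if (step.startsWith("S") && nodesUpgraded < storageNodeCount) {
                // A server was upgraded before the whole storage cluster was ready.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Correct: upgrade N1 and N2 (then restart them), then S1 and S2.
        System.out.println(isValidOrder(List.of("N1", "N2", "S1", "S2"), 2));
        // Incorrect: pairing S1+N1 then S2+N2 upgrades S1 before N2.
        System.out.println(isValidOrder(List.of("N1", "S1", "N2", "S2"), 2));
    }
}
```

The "shut down S1+N1, upgrade, then S2+N2" mistake described above fails this check because S1 is reached while N2 is still on the old version.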

Comment 2 Simeon Pinder 2014-09-29 08:12:45 UTC
Moving into ER05 as it didn't make the ER04 cut.

Comment 3 Jay Shaughnessy 2014-09-30 18:10:24 UTC
Discussed with jsanda and mazz to figure out what our go-forward approach should be for handling rhq storage and server cluster upgrades.  Here is our current thinking, given the following constraints:

- keep things as similar to the past as possible
- protect the users from messing up as best as possible
- handle the fact that schema upgrade for storage is a "cluster-wide" operation
  requiring all storage-nodes to be not only running but already running the
  upgraded Cassandra bits.

The use case is that an upgrade to version V must be applied to N storage nodes and M servers (where N and M are both >= 1).

The general approach:

Users will continue to use the same 'rhqctl upgrade' command as before but we will be doing more version tracking and upgrade 'progress reporting' than done previously. Servers will not start until the total upgrade of the HA and Storage clusters is complete.

To facilitate this we will now add an installed-version-stamp to the Server and StorageNode entities.  We'll be using these to track the progress of the overall upgrade.  We may also need to track the Storage schema version in the system config table, analogous to what we do for the RDB schema version.
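The version-stamp idea can be sketched roughly as follows. This is an illustrative simplification (the class and method names are invented, not the actual Server/StorageNode entities): given the installed-version stamp of each component, report which components still need upgrading.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative sketch only, not RHQ code: computes upgrade progress from
// per-component installed-version stamps.
public class UpgradeProgress {

    /** Returns the (sorted) names of components not yet at the target version. */
    public static List<String> remaining(Map<String, String> installedVersions, String target) {
        List<String> pending = new ArrayList<>();
        for (Map.Entry<String, String> e : installedVersions.entrySet()) {
            if (!target.equals(e.getValue())) {
                pending.add(e.getKey());
            }
        }
        Collections.sort(pending); // deterministic report order
        return pending;
    }

    public static void main(String[] args) {
        Map<String, String> stamps = Map.of(
                "S1", "3.3.0", "S2", "3.2.0",
                "N1", "3.3.0", "N2", "3.2.0");
        System.out.println("Still to upgrade: " + remaining(stamps, "3.3.0"));
    }
}
```

The overall upgrade would be considered complete only when this remaining set is empty and the storage schema version in system config matches the target.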

Workflow Example:

1) User unzips version V bits, as before, on either a standalone Server,
   a standalone Storage Node, or a co-located install.
2) User runs 'rhqctl upgrade --from-server-dir <old-version-dir>'

We perform the following based on various situations, in the following order:

IF this is a Server and all Storage Nodes are upgraded and running and the
   Cassandra schema has not been updated, perform schema update on the
   cluster. If successful update SN schema version in system config.

IF everything is upgraded report to the user that upgrade is complete and
   components can be started. exit. [if optional --start option is specified,
   we can start the server now]

DO upgrade the SN and/or Server bits. Note that the RDB schema will be upgraded when the first Server is upgraded.  Leave the SN running after upgrade, if applicable.

IF this is a Server and all Storage Nodes are upgraded and running and the
   Cassandra schema has not been updated, perform schema update on the
   cluster. If successful update SN schema version in system config.

IF everything is upgraded report to the user that upgrade is complete and
   components can be started. exit.
ELSE
   report on upgrade progress (what is done, what remains). exit.
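The decision order above can be linearized as a small sketch. This is a simplified, hypothetical rendering (the enum and flag names are invented, not rhqctl internals), and it collapses the repeated schema-update check into one branch:

```java
// Illustrative sketch only, not rhqctl code: the next action for
// 'rhqctl upgrade' given a simplified view of the cluster state.
public class UpgradeDecision {

    enum Action { UPDATE_STORAGE_SCHEMA, REPORT_COMPLETE, UPGRADE_BITS, REPORT_PROGRESS }

    static Action next(boolean isServer, boolean allNodesUpgradedAndRunning,
                       boolean schemaUpdated, boolean everythingUpgraded,
                       boolean thisComponentUpgraded) {
        // IF this is a Server, all SNs are upgraded and running, and the
        // cluster-wide schema update has not happened: do it first.
        if (isServer && allNodesUpgradedAndRunning && !schemaUpdated) {
            return Action.UPDATE_STORAGE_SCHEMA;
        }
        // IF everything is upgraded: report completion and exit.
        if (everythingUpgraded && schemaUpdated) {
            return Action.REPORT_COMPLETE;
        }
        // DO upgrade this component's bits if that has not been done yet.
        if (!thisComponentUpgraded) {
            return Action.UPGRADE_BITS;
        }
        // ELSE report what is done and what remains, and exit.
        return Action.REPORT_PROGRESS;
    }

    public static void main(String[] args) {
        System.out.println(next(true, true, false, false, true));
        System.out.println(next(false, false, false, false, false));
    }
}
```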


Notes:
* the --start option will likely be made non-functional and deprecated on the
  upgrade command, unless we want to start the last upgraded server
* for this first release we may require that a Server be upgraded prior to a
  standalone SN, to get the new version-stamp fields into the RDB.

Comment 4 Jay Shaughnessy 2014-10-07 21:49:42 UTC
Here is the merge commit for this branch work.  Pending are the wiki updates for the upgrade process. I'll post the links for reference when ready.


Master commit 281e1bc21252e4f15e2af7fa079fe200231895bf
Merge: 5c0b296 28089e3
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Tue Oct 7 17:30:51 2014 -0400

    Merge branch 'jshaughn/1139735'

    Conflicts:
        modules/common/cassandra-installer/src/main/resources/module/main/module.xml

Comment 5 Jay Shaughnessy 2014-10-09 01:00:40 UTC
master commit 6bc27076ce1a8aee205ede7c30ff83285e23d8f4
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Wed Oct 8 20:59:02 2014 -0400

 Add undo of the storage node version stamping in case of an upgrade failure
 invoking undo logic.   The current thinking is that this is not necessary for
 the server stamp since it happens at the end of server upgrade.

 Also, deprecate the upgrade --start option and make it do nothing if specified.
 This is because upgrade is a multi-step process and as such an automatic
 start does not make much sense, and more likely will cause problems. Start
 should only be done manually, after all upgrade steps are complete.

Comment 6 Michael Burman 2014-10-09 08:26:39 UTC
Fixed a little bit in the master (previous commits prevented clean installation):

commit a6ac0b9a43a330f43bc6c160410807e988e6228f
Author: Michael Burman <miburman@redhat.com>
Date:   Thu Oct 9 11:25:20 2014 +0300

    [BZ 1139735] Storage node schema version should not be attempted to upgrade in case of clean installation

Comment 7 Jay Shaughnessy 2014-10-09 14:02:56 UTC
A little more info regarding the commit in Comment 6:

Don't stamp version at storage node install-time.  The db row does not exist at that time. Instead, the version will be applied on insert, either when installing the server (for SN#1), or at mergeInventory time (when SN2..N are discovered and reported as inventory).

Comment 8 Jay Shaughnessy 2014-10-09 18:14:03 UTC
release/jon3.3.x commit 9b36b19233b7e833ed4c3bb8571985f8fc1da65c
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Tue Oct 7 17:30:51 2014 -0400

    (cherry picked from commit 281e1bc21252e4f15e2af7fa079fe200231895bf)
    Signed-off-by: Jay Shaughnessy <jshaughn@redhat.com>


release/jon3.3.x commit 9c52c977d8a08af6a99a93f79237351c067094b3
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Wed Oct 8 20:59:02 2014 -0400

    (cherry picked from commit 6bc27076ce1a8aee205ede7c30ff83285e23d8f4)
    Signed-off-by: Jay Shaughnessy <jshaughn@redhat.com>


release/jon3.3.x commit f5a941af8a68a8c0341ddcd5abfb729cd0bc5283
Author: Michael Burman <miburman@redhat.com>
Date:   Thu Oct 9 11:25:20 2014 +0300

    (cherry picked from commit a6ac0b9a43a330f43bc6c160410807e988e6228f)
    Signed-off-by: Jay Shaughnessy <jshaughn@redhat.com>

Comment 9 Thomas Segismont 2014-10-10 07:51:21 UTC
Additional commit in master

commit 5851e041b91c1db2374ec5474695ad53e67a4796
Author: Thomas Segismont <tsegismo@redhat.com>
Date:   Fri Oct 10 09:49:40 2014 +0200
    
    There's no need to change serialVersionUid when adding a field
    
    This will repair the API check job in core-domain

Comment 10 Thomas Segismont 2014-10-10 07:52:46 UTC
Cherry-picked over to release/jon3.3.x

commit 6138c9e10c84d18eab2596aeed4442b9fc0f267a
Author: Thomas Segismont <tsegismo@redhat.com>
Date:   Fri Oct 10 09:49:40 2014 +0200
    
    There's no need to change serialVersionUid when adding a field
    
    This will repair the API check job in core-domain
    
    (cherry picked from commit 5851e041b91c1db2374ec5474695ad53e67a4796)
    Signed-off-by: Thomas Segismont <tsegismo@redhat.com>

Comment 12 Jay Shaughnessy 2014-10-14 18:21:07 UTC
Test issues fixed by Thomas: because upgraded dbs will not have the new 'version' fields, they must get created manually in the tests, analogous to a production upgrade.

master commit: 28134c1ef13181ce8d1b9186f80c95614aee87a4
  Author: Thomas Segismont <tsegismo@redhat.com>
  Date:   2014-10-14 (Tue, 14 Oct 2014)

itests-2 were broken because Server and StorageNode entities now have a version column


master Commit: 5054f8fe4e1e81b6167f2538f8f0ad74b3ddf56f
  Author: Thomas Segismont <tsegismo@redhat.com>
  Date:   2014-10-14 (Tue, 14 Oct 2014)

Fix core domain tests and server itests when database does not have Server and StorageNode version columns

Version columns are not created as part of the db-upgrade process, but only when running the installer.
So we need something when running itests or core domain tests against an upgraded database.



release/jon3.3.x Commit: 10a2cc3a4d965c6c04b22ecf3b1dba47c63481d1
  Author: Thomas Segismont <tsegismo@redhat.com>
  Date:   2014-10-14 (Tue, 14 Oct 2014)

(cherry picked from commit 28134c1ef13181ce8d1b9186f80c95614aee87a4)
Signed-off-by: Thomas Segismont <tsegismo@redhat.com>


release/jon3.3.x Commit: 708737e525c67b58774a6a1bf79556b6f3133ffc
  Author: Thomas Segismont <tsegismo@redhat.com>
  Date:   2014-10-14 (Tue, 14 Oct 2014)

(cherry picked from commit 5054f8fe4e1e81b6167f2538f8f0ad74b3ddf56f)
Signed-off-by: Jay Shaughnessy <jshaughn@redhat.com>

Comment 13 Jay Shaughnessy 2014-10-15 16:14:09 UTC
More test fallout:

master commit 9ad8a28735c9c9f0a64288f6ffda6973feede014
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Wed Oct 15 10:44:00 2014 -0400

    Now that the storage installer needs to contact the DB to version-stamp the
    rhq_storage_node table, the cassandra installer upgrade tests failed, because
    they support no DB infrastructure.  To get around this added an installer
    option to ignore the version stamping.  These options are not user-visible,
    and who knows, maybe someday support will have a reason to use it.


release/jon3.3.x commit b2c08d57fbc1b185a4b50408fc3e19b103e0818f
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Wed Oct 15 10:44:00 2014 -0400

    (cherry picked from commit 9ad8a28735c9c9f0a64288f6ffda6973feede014)
    Signed-off-by: Jay Shaughnessy <jshaughn@redhat.com>

Comment 14 Jay Shaughnessy 2014-10-16 16:48:13 UTC
commit ae3d37af4896bea2af151e82d24a43147643b46a
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Thu Oct 16 12:45:44 2014 -0400

    [1139735] More i-test fallout, oracle only
    Unlike Postgres, where DDL changes can be transactional, Oracle executes
    DDL (like adding a column) as if autocommit=true.  As such, it can't happen
    within an existing Tx, which is what we have by default in our itest
    setup beans (@PostConstruct).  Change things to execute the DDL outside
    of an existing Tx.


commit 6d86c9ad0a0074c77653587850bd7c612c960b67
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Thu Oct 16 12:45:44 2014 -0400

    (cherry picked from commit ae3d37af4896bea2af151e82d24a43147643b46a)
    Signed-off-by: Jay Shaughnessy <jshaughn@redhat.com>

Comment 15 Simeon Pinder 2014-10-21 20:24:22 UTC
Moving to ON_QA as available to test with the latest brew build:
https://brewweb.devel.redhat.com//buildinfo?buildID=394734

Comment 16 Filip Brychta 2014-10-27 11:09:08 UTC
I have the following setup (all using JON3.2.0.GA):
- server1: JON server (master - postgres db is here) + SN + agent
- server2: JON server (slave) + SN + agent
- server3: JON server (slave) + agent
- server4: SN + agent

Following the upgrade manual https://docs.jboss.org/author/display/RHQ/Upgrading+RHQ, the upgrade failed.

1- stop all components (./rhqctl stop) on all servers
2- unzip jon-server-3.3.0.ER05.zip on all servers
3- I tried to upgrade server4 but it failed (separate bz1157480)
4- tried to upgrade server1 (jon-server-3.3.0.ER05/bin/rhqctl upgrade --from-server-dir /home/hudson/jon-server-3.2.0.GA/)

Result:
Authentication error on host fbr-ha.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.105: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level QUORUM

5- tried to upgrade server2

Result:
Caused by: com.datastax.driver.core.exceptions.UnavailableException: Not enough replica available for query at consistency ONE (1 required but only 0 alive)

6- tried to upgrade server3 (jon-server-3.3.0.ER05/bin/rhqctl upgrade --from-server-dir /home/hudson/jon-server-3.2.0.GA/ --use-remote-storage-node true)

Result:
05:25:31,122 ERROR [org.rhq.enterprise.server.installer.InstallerServiceImpl] Failed to connect to the storage cluster. Please check the following:
	1) At least one storage node is running
	2) The rhq.storage.nodes property specifies the correct hostname/address of at least one storage node
	3) The rhq.storage.cql-port property has the correct value


This last error is expected but it shows (step 3 as well) that it's necessary to follow a certain order during the upgrade. If I understand it correctly, for my setup I need to do:
1- stop everything
2- upgrade server1 (to workaround bz1157480)
3- now it should be possible to upgrade server4
4- upgrade server2
5- start some storage node (if it's not already running) and upgrade server3
6- start all storage nodes (if they are not already running) and upgrade the storage node schema
7- start everything

All this should be well documented, and when a user does something wrong, error messages should be clear and advice should be provided.

The main problems here, of course, are the issues during steps 4 and 5. Complete logs from steps 4 and 5 are attached.

Comment 17 Filip Brychta 2014-10-27 11:09:46 UTC
Created attachment 950949 [details]
server1 upgrade log

Comment 18 Filip Brychta 2014-10-27 11:10:09 UTC
Created attachment 950950 [details]
server2 upgrade log

Comment 19 John Sanda 2014-10-27 14:34:56 UTC
I think the problem is in the documentation. The docs say to shut down the storage nodes during the server upgrade, but that will not work. The server installer checks the storage node schema. This check involves querying the storage node cluster, hence the UnavailableException. And all nodes should be up and running for that check.

Comment 20 Filip Brychta 2014-10-27 15:25:01 UTC
I tried to follow John's advice in comment 19 and I successfully upgraded servers 1 - 3, but the working scenario is not very user friendly.

To avoid the errors from steps 4 and 5 in comment 16, it's necessary to have at least two SNs running, so I used this scenario:
1- keep all SN running and stop everything else
2- upgrade server1 -> Ok
3- upgrade server2 -> Ok
4- tried to upgrade server3 -> failed - Not enough replica available for query at consistency ONE (1 required but only 0 alive), because the SNs on server1 and server2 were stopped after the upgrade
5- I stopped SN on server4 and started upgraded SNs on server1 and server2
6- upgrade of server3 worked
7- upgrade of server4 still fails even with valid DB properties

I still need to try rhqctl upgrade --storage-schema on a different setup since I'm not able to upgrade server4.

Comment 21 John Sanda 2014-10-27 19:36:52 UTC
I have been discussing this with Jay, and I think I have a better handle on what happened. I am assuming that all nodes were shut down prior to upgrading server1. When we run rhqctl upgrade on server1, rhqctl starts the co-located storage node prior to running the server installer. From InstallerServiceImpl.prepareDatabase(), we call InstallerServiceImpl.prepareStorageSchema(). And from there we call storageNodeSchemaManager.checkCompatibility().

At the point at which we call the checkCompatibility method, there is no connection to the storage cluster. The first thing this method does, though, is to try to connect to the cluster using the credentials specified by the rhq.storage.username and rhq.storage.password properties. If the replicas that own the rows in the system_auth keyspace where the credentials are stored are down, then we will fail to connect to the cluster with a com.datastax.driver.core.exceptions.AuthenticationException. The checkCompatibility method catches and simply re-throws the AuthenticationException.

InstallerServiceImpl.prepareStorageSchema catches the AuthenticationException and logs the message,

05:06:37,651 INFO  [org.rhq.enterprise.server.installer.InstallerServiceImpl] Install RHQ schema along with updates to storage nodes.

And then storageNodeSchemaManager.install(schemaProperties) is called. That install method calls VersionManager.install which in turn tries to initialize a cluster connection. That connection attempt also fails with an AuthenticationException. At this point VersionManager thinks it needs to create a new schema, so in the catch block at line 89 (in VersionManager.java), the create method is invoked.

The create method attempts to connect to the cluster using the default system username/password. The Cassandra node receives an org.apache.cassandra.transport.messages.CredentialsMessage from the driver. This tells Cassandra to authenticate the user, which in this case is the default super user, i.e., cassandra/cassandra. CredentialsMessage calls ClientState.login(Map<String, String>) which in turn calls PasswordAuthenticator.authenticate(Map<String, String>). If the username in the authentication query is the default super user, then the query is done at a consistency level of QUORUM. This explains the following message in the server1 log,

05:06:37,718 ERROR [org.rhq.enterprise.server.installer.InstallerServiceImpl] Could not complete storage cluster schema installation: Authentication error on host fbr-ha.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.105: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level QUORUM: java.lang.RuntimeException: com.datastax.driver.core.exceptions.AuthenticationException: Authentication error on host fbr-ha.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.105: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level QUORUM

The real problem here is that we wound up trying to install the schema from scratch. The initial AuthenticationException occurred because the replicas holding the credentials were down. We can prevent this situation by requiring that all nodes are up, which is an easy check, e.g.:

for (Host host : session.getCluster().getMetadata().getAllHosts()) {
    if (!host.isUp()) {
        // do not proceed with any schema updates
    }
}

Comment 22 Filip Brychta 2014-10-27 20:02:58 UTC
I tried it again with only servers server1 - server3, and the following scenario worked correctly:
1- stop everything except storage nodes
2- upgrade server1 and server2
3- start upgraded storage nodes on server1 and server2 to be able to upgrade server3
4- upgrade server3
5- upgrade storage-schema
6- start everything

So I guess the expected result of this bz would be:
- fixed documentation
- better error handling to avoid the ugly errors from comment 16 (Cannot achieve consistency level QUORUM, Not enough replica available for query at consistency ONE) and to provide a clear error message advising that all storage nodes must be running
- no idea what to do about the fact that the upgrade process stops the upgraded SN, so a user must start it again to proceed with the upgrade of the other servers

Comment 23 Jay Shaughnessy 2014-10-29 14:03:45 UTC

master commit 1b45c2af77fe3a4670bc4317508780f80ae3f0bc
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Wed Oct 29 09:54:13 2014 -0400

    [1139735, 1157480] More upgrade use cases that didn't work.
    [1139735] An upgrade with a number of storage nodes greater than our
    replication factor (>= 4) could fail because our check for the storage
    cluster version can fail if the auth info or version info are not
    replicated to the currently running SNs.  To avoid this, for upgrades we
    now push all storage cluster interaction to the post-upgrade step (i.e.
    rhqctl --storage-schema).  This includes the schema creation for a new,
    remote storage cluster.

    [1157480] An upgrade of a standalone storage node can fail because those
    installs may not have set proper DB props in rhq-server.properties.  It
    wasn't required in the past but it is now.  In the code we now make sure
    the properties are copied forward on upgrade. But there is also doc needed
    here because, prior to upgrading from an earlier version using standalone
    SNs, the old rhq-server.properties files will need to be updated.

    Also, update the --list-versions report to better reflect that it may need
    all SNs to be running to perform the storage schema version check.


release/jon3.3.x commit 8999436f3bfdc8bb7c311d1f27615c69d455f845
Author: Jay Shaughnessy <jshaughn@redhat.com>
Date:   Wed Oct 29 09:54:13 2014 -0400

    (cherry picked from commit 1b45c2af77fe3a4670bc4317508780f80ae3f0bc)
    Signed-off-by: Jay Shaughnessy <jshaughn@redhat.com>

Comment 24 Jay Shaughnessy 2014-10-30 14:14:37 UTC
FYI, see Bug 1158924 for pointers to relevant wiki updates for install/upgrade of standalone storage nodes.

Comment 25 Simeon Pinder 2014-11-03 19:03:36 UTC
Moving to ON_QA as available to test with latest brew build:
https://brewweb.devel.redhat.com//buildinfo?buildID=396547

Comment 26 Filip Brychta 2014-11-04 11:12:32 UTC
Verified on
Version :	
3.3.0.CR01
Build Number :	
08c2f39:6ac97ac

