Bug 1185375

Summary: Upgrade from 3.2 to 3.3 fails at upgrade --storage-schema due to MigrateAggregateMetrics taking too long
Product: JBoss Operations Network
Reporter: dsteigne
Component: Upgrade, Storage Node
Assignee: John Sanda <jsanda>
Status: CLOSED ERRATA
QA Contact: Filip Brychta <fbrychta>
Severity: urgent
Priority: high
Version: JON 3.3.0
CC: fbrychta, jsanda, loleary, rhatlapa
Target Milestone: DR01
Keywords: Triaged
Target Release: JON 3.3.2
Hardware: All
OS: All
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-04-30 16:10:45 UTC
Bug Depends On: 1206387, 1206641
Bug Blocks: 1198845
Attachments:
migration log for comment 17

Description dsteigne 2015-01-23 15:09:29 UTC
Description of problem:
Upgrade from 3.2 to 3.3 fails at upgrade --storage-schema because MigrateAggregateMetrics takes too long. After more than 3 hours it had migrated only 9436 schedules, and then it lost the connection to the storage node:

    15:27:11,633 INFO  [org.rhq.cassandra.schema.AbstractManager] Applying update file: schema/update/0005.xml
    15:27:19,757 INFO  [org.rhq.cassandra.schema.MigrateAggregateMetrics] Migrating aggregate metrics for 176143 schedule ids
    15:27:19,759 INFO  [org.rhq.cassandra.schema.MigrateAggregateMetrics] Migrating one_hour data for 176143 schedules
    15:27:19,766 INFO  [org.rhq.cassandra.schema.MigrateAggregateMetrics] There are 176143 remaining schedules for the one_hour data migration
    15:27:49,783 INFO  [org.rhq.cassandra.schema.MigrateAggregateMetrics] There are 176102 remaining schedules for the one_hour data migration
    15:28:19,800 INFO  [org.rhq.cassandra.schema.MigrateAggregateMetrics] There are 176091 remaining schedules for the one_hour data migration
    ...
    18:39:32,454 INFO  [org.rhq.cassandra.schema.MigrateAggregateMetrics] There are 166707 remaining schedules for the one_hour data migration
    18:39:33,561 ERROR [org.rhq.enterprise.server.installer.InstallerServiceImpl] Failed to connect to the storage cluster. Please check the following:
        1) At least one storage node is running
        2) The rhq.storage.nodes property specifies the correct hostname/address of at least one storage node
        3) The rhq.storage.cql-port property has the correct value

    18:39:33,583 ERROR [org.rhq.enterprise.server.installer.Installer] java.lang.Exception:Could not connect to the storage cluster: All host(s) tried for query failed (tried: jon-storage-01.example.com/192.168.1.50 (Timeout during read))

Version-Release number of selected component (if applicable):
3.3.0


Comment 2 John Sanda 2015-02-03 03:56:20 UTC
A little background on MigrateAggregateMetrics may be useful. It throttles reads and writes in the following way. It allows only one concurrent read: MigrateAggregateMetrics loads all of the aggregate data for one measurement schedule from the 1hr table, then submits a task to a thread pool to iterate over the result set and write to the new aggregate_metrics table. By default we allow 20k writes per second per node, so for a 3 node cluster we would allow up to 60k writes per second. Only when that task finishes writing its data do we query data for the next measurement schedule. The reason for this throttling was to avoid placing memory pressure on storage node clusters, particularly on single node clusters, where we have run into various performance problems.
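To make the scheme concrete, here is a minimal sketch of the throttling described above, assuming Guava's RateLimiter and the DataStax Java driver 2.x. The keyspace name, thread pool size, and the writeToAggregateMetrics() helper are placeholders for illustration, not the actual MigrateAggregateMetrics code.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.google.common.util.concurrent.RateLimiter;

    public class ThrottledMigrationSketch {

        // Hypothetical helper: converts one legacy 1hr row into an INSERT
        // against the new aggregate_metrics table.
        void writeToAggregateMetrics(Session session, Row row) { /* ... */ }

        void migrateOneHourData(Session session, List<Integer> scheduleIds) throws Exception {
            RateLimiter writePermits = RateLimiter.create(20000); // 20k writes/sec per node
            ExecutorService writers = Executors.newFixedThreadPool(4);

            for (Integer scheduleId : scheduleIds) {
                // Only one read is in flight at a time: load all 1hr data for this schedule.
                ResultSet rows = session.execute(
                    "SELECT * FROM rhq.one_hour_metrics WHERE schedule_id = " + scheduleId);

                // Hand the result set to a writer task and wait for it before the next read.
                Future<?> writeTask = writers.submit(() -> {
                    for (Row row : rows) {
                        writePermits.acquire();                 // throttles to the configured writes/sec
                        writeToAggregateMetrics(session, row);
                    }
                });
                writeTask.get(); // block until this schedule's data has been written
            }
            writers.shutdown();
        }
    }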

I ran the MigrateAggregateMetrics upgrade step on my dev machine using the same data set mentioned in the original description with a single node cluster. My machine has 8 cores, 16 GB of RAM, and an SSD. It took a little over 6 hours and 20 minutes to finish. Storage node heap usage never went over 192 MB, and overall memory usage peaked at just under 430 MB.

More than half of the queries against the 1 hr table touched 7 SSTables, meaning up to 7 disk seeks per read. Running a major compaction against the legacy 1hr, 6hr, and 24hr tables prior to upgrade could certainly improve read performance in cases like this.

We can also make the read throttling configurable so that we load data for multiple schedules in parallel. This can have a really big impact on performance with a tradeoff of course. The more concurrent reads we do, the more memory pressure there will be on the storage node.

I am not sure how much of an impact increasing the write parallelism will have. The max number of live 1hr data points, for example, is 336, which translates to 1008 CQL rows. For each data point we perform one write to the new aggregate_metrics table. We should not even be bumping into the 20k/sec write limit for a single node. Of course, if we increase the read concurrency, we might then want to adjust the write concurrency as well.

It is also worth looking at batching writes. I do not know how much it would help, but it is definitely worth a look.
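For reference, the batching idea could look roughly like the sketch below, using unlogged batches from the DataStax driver. The batch size of 30 and the bindInsert() helper are illustrative assumptions, not the actual upgrade code.

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.Statement;

    public abstract class BatchedWriteSketch {

        // Hypothetical: binds a prepared INSERT against the new aggregate_metrics table.
        protected abstract Statement bindInsert(Row row);

        void writeBatched(Session session, Iterable<Row> rows) {
            final int batchSize = 30;
            BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
            for (Row row : rows) {
                batch.add(bindInsert(row));
                if (batch.size() == batchSize) {
                    session.execute(batch);
                    batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
                }
            }
            if (batch.size() > 0) {
                session.execute(batch);  // flush the final partial batch
            }
        }
    }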

I will run through the migration step again using the same data set and increase the read concurrency to get a feel for what kind of improvement that brings.

Comment 3 John Sanda 2015-02-03 12:43:18 UTC
I increased the read parallelism to 5 and also increased the heap size of my storage node to 512 MB (it was previously 256 MB). The migration step completed in 2:48 hr. We should see further improvements if we increase the read parallelism even more and apply some of the other changes I mentioned in comment 2. I will continue investigating.

Comment 4 John Sanda 2015-02-03 17:12:24 UTC
I did another test run that batches writes in addition to the parallel writes. I used a batch size of 30. The execution time was 1:05 hr. I think we can still shave off more time with some more tweaks so I am going to do a few more test runs.

Comment 7 John Sanda 2015-02-05 15:17:00 UTC
I wanted to give a brief update. I have brought the execution time (on my dev box) from 6:30 hr originally to 21 min. I am going to do one more run with a couple more changes to see how much more time if any I can shave off. Then I will work on getting the changes doc'd and merged.

Comment 9 Larry O'Leary 2015-02-13 18:23:12 UTC
As it stands, it is not practical to upgrade a JBoss ON 3.2 system to 3.3 in environments that have more than 50,000 schedules with more than 4 gigabytes of metric data from the JBoss ON 3.2 system.

I am increasing the severity to urgent.

Comment 11 John Sanda 2015-02-16 15:05:09 UTC
The execution time of 21 minutes I reported in comment 7 was erroneous. I learned after the fact that it was from a bad test run. The upgrade code first queries the relational db to get a list of all numeric measurement schedule ids, including disabled measurement schedules. We wind up querying Cassandra for a lot of schedule ids that do not have any data at all. I spent some time trying to find a way to determine precisely which schedule ids have data, and I found a way using the legacy Thrift APIs with the Hector library. This is not like a "SELECT DISTINCT id FROM table" SQL query that executes very quickly and efficiently. It requires some iteration over the row keys; however, it is fast enough to incorporate into the upgrade code, and it is worth doing for two reasons. First, it can potentially dramatically reduce the number of queries we perform. Second, it provides valuable feedback, since we can log exactly how many metrics have to be migrated.
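Roughly, the key scan looks like the following sketch using Hector's range slice queries over Thrift. The cluster/keyspace names, Thrift port, page size, and paging details here are assumptions for illustration, not the actual upgrade code.

    import java.util.HashSet;
    import java.util.Set;

    import me.prettyprint.cassandra.serializers.BytesArraySerializer;
    import me.prettyprint.cassandra.serializers.IntegerSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.OrderedRows;
    import me.prettyprint.hector.api.beans.Row;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.RangeSlicesQuery;

    public class ScheduleIdScanSketch {

        Set<Integer> findScheduleIdsWithData(String host) {
            Cluster cluster = HFactory.getOrCreateCluster("rhq", host + ":9160");
            Keyspace keyspace = HFactory.createKeyspace("rhq", cluster);

            Set<Integer> ids = new HashSet<Integer>();
            Integer start = null;                       // null start key = beginning of the ring
            int pageSize = 1000;

            while (true) {
                RangeSlicesQuery<Integer, byte[], byte[]> query = HFactory.createRangeSlicesQuery(
                    keyspace, IntegerSerializer.get(), BytesArraySerializer.get(), BytesArraySerializer.get());
                query.setColumnFamily("one_hour_metrics");
                query.setKeys(start, null);
                query.setRowCount(pageSize);
                query.setReturnKeysOnly();              // we only need the partition keys (schedule ids)

                OrderedRows<Integer, byte[], byte[]> rows = query.execute().get();
                for (Row<Integer, byte[], byte[]> row : rows) {
                    ids.add(row.getKey());
                }
                if (rows.getCount() < pageSize) {
                    break;                              // last page
                }
                start = rows.peekLast().getKey();       // next page starts at (and repeats) the last key
            }
            return ids;
        }
    }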

Comment 12 John Sanda 2015-02-16 15:37:56 UTC
I have done a lot more testing over the past few days using the sample data set and have some pretty specific numbers to share. I have been testing on my dev machine which has the following specs,

processor: quad-core, intel i7, 2.7 GHz
RAM: 16 GB
hard disk: SSD (have to look up specs on model/speeds)

I have been running against a single node. Both the node and the upgrade code are running on my machine. I have changed the heap settings of my storage node in cassandra-jvm.properties as follows:

heap_min=-Xms2048M
heap_max=-Xmx2048M
heap_new=-Xmn1024M

With my setup, I am able to reliably do 100 reads/sec, which results in a migration rate of about 6k per minute. In other words, I have been able to consistently migrate 6k metrics per minute without error. This results in a processing time of about 27 minutes for the sample data set. I am doing 20k writes/sec. Read and write limits are configured using a RateLimiter. I have tested numerous different configurations, including several with higher write limits (higher write throughput might be a better way of referring to it), and they do not improve execution times; in fact, in some cases they have made execution time worse. Read throughput is the limiting factor here.
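As an illustration of how the limits could be exposed, the sketch below reads the two rates from system properties and wraps them in RateLimiters. The property names and defaults are hypothetical placeholders, not the actual properties used by the upgrade code.

    import com.google.common.util.concurrent.RateLimiter;

    public class MigrationRateConfig {

        // Hypothetical property names; defaults match the rates described above.
        static final int READ_LIMIT = Integer.getInteger("rhq.storage.migration.read-limit", 100);     // reads/sec
        static final int WRITE_LIMIT = Integer.getInteger("rhq.storage.migration.write-limit", 20000); // writes/sec

        static final RateLimiter READ_PERMITS = RateLimiter.create(READ_LIMIT);
        static final RateLimiter WRITE_PERMITS = RateLimiter.create(WRITE_LIMIT);

        // Call READ_PERMITS.acquire() before each query against the legacy 1hr/6hr/24hr
        // tables, and WRITE_PERMITS.acquire() before each INSERT into aggregate_metrics.
    }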

I tried increasing my read throughput to 105, 110, and 120 reads/sec. Anything over 100 reads/sec results in some number of failed migrations. I also tested with the storage node's heap settings doubled and a read rate of 120 reads/sec, and this still resulted in failed migrations.

These numbers are of course specific to my hardware configuration, but they are precise and reliable. The settings I am using for reads and writes might be too slow for some deployments and too fast for others. This is why they are both configurable via system properties. 

If the upgrade ends with failed migrations, perhaps because it was going too fast, or if the user kills the upgrade, perhaps because it was going too slow, then it can be restarted with different read/write limits to slow down or speed up as necessary. On subsequent runs, we will not redo any migrations that have already completed successfully.

Comment 13 John Sanda 2015-02-16 15:46:13 UTC
I mentioned that with my hardware configuration, I can reliably perform 6k metric migrations per minute. Let's say I have 50k measurement schedules with data to migrate. 50k schedules roughly means 150k metrics, since we are dealing with 1 hr, 6 hr, and 24 hr metrics. The migration would then take about 25 minutes on my machine.
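Spelled out, that estimate is:

    50,000 schedules x 3 buckets (1 hr, 6 hr, 24 hr) = 150,000 metrics
    150,000 metrics / 6,000 metrics per minute       = 25 minutes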

Comment 14 John Sanda 2015-02-17 19:02:58 UTC
I did another extensive round of testing after a machine reboot and after shutting down a bunch of stuff like browsers, IntelliJ, etc. I was able to increase my reads and writes as follows:

500 reads/second
40000 writes/second

This reduced the execution time of the data migration to about 7 minutes.

I am going to test with a two node cluster using virtual machines and spinning disks. I will report back the results of those tests as soon as I have some data.

Comment 15 John Sanda 2015-03-17 02:34:47 UTC
This wound up being a lot more work, for a couple of reasons: I implemented dynamic throttling, and I improved fault tolerance.

With the throttling, the migration keeps increasing request throughput as long as there are no errors. Once an error threshold is exceeded, request throughput is decreased. The idea is to go as fast as we can, whether we are running a default deployment on a two core virtual machine with default heap settings or a 5 node cluster with each node running on dedicated hardware with 8 CPU cores, 16 GB RAM, etc.
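The adjustment loop could look roughly like the following sketch built on Guava's RateLimiter: raise the rate while the error count in the current window stays low, and back off once it crosses a threshold. The constants and method names are illustrative assumptions, not the actual implementation.

    import java.util.concurrent.atomic.AtomicInteger;

    import com.google.common.util.concurrent.RateLimiter;

    public class DynamicThrottlingSketch {

        private final RateLimiter permits = RateLimiter.create(100); // starting rate, requests/sec
        private final AtomicInteger errorsInWindow = new AtomicInteger();

        private static final int ERROR_THRESHOLD = 25;      // errors tolerated per window
        private static final double INCREASE_FACTOR = 1.25;
        private static final double DECREASE_FACTOR = 0.5;

        // Called periodically (e.g., every 30 seconds) while migrations are running.
        void adjustRate() {
            int errors = errorsInWindow.getAndSet(0);
            double rate = permits.getRate();
            if (errors > ERROR_THRESHOLD) {
                permits.setRate(Math.max(10, rate * DECREASE_FACTOR));  // slow down
            } else {
                permits.setRate(rate * INCREASE_FACTOR);                // speed up
            }
        }

        void beforeRequest() {
            permits.acquire();
        }

        void onRequestFailed() {
            errorsInWindow.incrementAndGet();
        }
    }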

For the fault tolerance changes, requests will be retried until all metric migrations complete successfully. The migration task still maintains a commit log of completed migrations so that if the upgrade is restarted for whatever reason, it will not redo already finished work.
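The commit log idea can be illustrated with the simple sketch below: record each schedule id whose migration finished, and on a restart skip everything already recorded. The file format and API shown here are assumptions, not the actual upgrade code.

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class MigrationLogSketch {

        private final File logFile;
        private final Set<Integer> completed = new HashSet<Integer>();

        public MigrationLogSketch(File logFile) throws IOException {
            this.logFile = logFile;
            if (logFile.exists()) {
                // Reload previously completed migrations so a restart skips them.
                completed.addAll(Files.readAllLines(logFile.toPath()).stream()
                    .map(Integer::valueOf)
                    .collect(Collectors.toSet()));
            }
        }

        public boolean isMigrated(int scheduleId) {
            return completed.contains(scheduleId);
        }

        public synchronized void markMigrated(int scheduleId) throws IOException {
            try (PrintWriter out = new PrintWriter(new FileWriter(logFile, true))) {
                out.println(scheduleId);   // append-only, so a crash never loses prior entries
            }
            completed.add(scheduleId);
        }
    }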

The migration logs progress every 30 seconds. The number of remaining 1 hr, 6 hr, and 24 hr metrics is reported, along with the total number of read and write errors, which serves as an indicator of how well the storage node cluster is keeping up with the load. Lastly, the rate at which migrations are completing is reported every minute.

The one_hour_metrics, six_hour_metrics, and twenty_four_hour_metrics tables are dropped after all of the migrations have completed successfully. When a table is dropped, Cassandra does *not* delete the data. It creates a snapshot of the SSTable files and deletes the originals. This means that whatever space was being consumed is still being consumed by the snapshots. Those snapshots can be deleted after the upgrade completes in order to reclaim disk space.
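For example, the snapshots could be cleared on each storage node with Cassandra's nodetool; the install path placeholder, keyspace name, and JMX port shown here are assumptions that may differ per deployment:

    <rhq-storage-install-dir>/bin/nodetool -p 7299 clearsnapshot rhq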

Changes have been pushed to the release/jon3.3.x branch.

commit hash: b76f9f35a7

Comment 16 Simeon Pinder 2015-03-27 02:57:39 UTC
Moving to ON_QA as available to test with the following cumulative patch brew build:
https://brewweb.devel.redhat.com//buildinfo?buildID=427025

Comment 17 Filip Brychta 2015-04-09 06:57:26 UTC
First test on JON 3.3.2.DR01 looks fine.
Set up (all using JON 3.2.0.GA):
- server1: JON server (master - postgres db is here) + SN + agent
- server2: JON server (slave) + SN + agent
- server3: JON server (slave) + agent
- server4: SN + agent
- two remote agents with EAP6

Scenario:
1 - stop everything
2 - for each server unzip JON 3.3.0.GA and 3.3.2.DR01 patch
3 - upgrade server4 (note that rhq-server.properties must be manually updated - bz1157480)
4 - upgrade server1,2,3
5 - apply the patch on all servers
6 - start all storage nodes
7 - rhqctl upgrade --storage-schema
8 - start server1 and wait until it's completely UP
9 - start other servers one by one (wait for each server to be completely UP before starting another one)

This setup (JON 3.2.0) had been running for ~3 weeks and there were the following numbers of schedule ids:
one_hour: 16644
six_hour: 16644
twenty_four_hour: 16650


Note that step 5 (apply the patch on all servers) is after steps 3 and 4 because of bz 1210181.

Complete log is attached.
Migration took ~18 minutes on a VM with 2 CPUs, 4 GB of memory, and shared storage (definitely slower than a local disk).

Comment 18 Filip Brychta 2015-04-09 07:00:46 UTC
Created attachment 1012529 [details]
migration log for comment 17

Comment 19 Filip Brychta 2015-04-21 09:43:00 UTC
Verified on ER01
Two issues were found while testing this (bz1213782 and bz1213366) which I am not able to reproduce. They are being tracked separately, so I am marking this bz as verified.

Comment 21 errata-xmlrpc 2015-04-30 16:10:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0920.html