Description of problem:
Upgrade from 3.2 to 3.3 fails at upgrade --storage-schema because MigrateAggregateMetrics takes too long; after more than 3 hours it had migrated only 9436 schedules and then lost the connection to the storage node
15:27:11,633 INFO [org.rhq.cassandra.schema.AbstractManager] Applying update file: schema/update/0005.xml
15:27:19,757 INFO [org.rhq.cassandra.schema.MigrateAggregateMetrics] Migrating aggregate metrics for 176143 schedule ids
15:27:19,759 INFO [org.rhq.cassandra.schema.MigrateAggregateMetrics] Migrating one_hour data for 176143 schedules
15:27:19,766 INFO [org.rhq.cassandra.schema.MigrateAggregateMetrics] There are 176143 remaining schedules for the one_hour data migration
15:27:49,783 INFO [org.rhq.cassandra.schema.MigrateAggregateMetrics] There are 176102 remaining schedules for the one_hour data migration
15:28:19,800 INFO [org.rhq.cassandra.schema.MigrateAggregateMetrics] There are 176091 remaining schedules for the one_hour data migration
18:39:32,454 INFO [org.rhq.cassandra.schema.MigrateAggregateMetrics] There are 166707 remaining schedules for the one_hour data migration
18:39:33,561 ERROR [org.rhq.enterprise.server.installer.InstallerServiceImpl] Failed to connect to the storage cluster. Please check the following:
1) At least one storage node is running
2) The rhq.storage.nodes property specifies the correct hostname/address of at least one storage node
3) The rhq.storage.cql-port property has the correct value
18:39:33,583 ERROR [org.rhq.enterprise.server.installer.Installer] java.lang.Exception:Could not connect to the storage cluster: All host(s) tried for query failed (tried: jon-storage-01.example.com/192.168.1.50 (Timeout during read))
Version-Release number of selected component (if applicable):
Steps to Reproduce:
A little background on MigrateAggregateMetrics may be useful. It throttles reads and writes in the following way. It only allows one concurrent read. MigrateAggregateMetrics loads all of the aggregate data for one measurement schedule from the 1 hr table. It then submits a task to a thread pool to iterate over the result set and write to the new aggregate_metrics table. By default we allow 20k concurrent writes per second per node, so for a 3 node cluster we would allow up to 60k concurrent writes per second. Only when that task finishes writing its data do we query the data for the next measurement schedule. The reason for this throttling was to avoid placing memory pressure on storage node clusters, particularly on single node clusters, where we have run into various performance problems.
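The scheme above can be sketched roughly as follows. All names here are illustrative, not the actual MigrateAggregateMetrics code, and an in-memory store stands in for Cassandra:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the throttling scheme: one read at a time, with the next read
// issued only after the write task for the previous result set finishes.
public class ThrottledMigrationSketch {

    interface MetricStore {
        List<double[]> readOneHourData(int scheduleId);
        void writeAggregate(int scheduleId, double[] row);
    }

    static void migrate(MetricStore store, List<Integer> scheduleIds,
                        ExecutorService writers) {
        for (int scheduleId : scheduleIds) {
            // single concurrent read: blocks until the result set is loaded
            List<double[]> rows = store.readOneHourData(scheduleId);
            Future<?> writeTask = writers.submit(() -> {
                for (double[] row : rows) {
                    store.writeAggregate(scheduleId, row); // real code rate-limits these
                }
            });
            try {
                writeTask.get(); // wait before issuing the next read
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }
    }

    // Self-contained demo with an in-memory store; returns total rows written.
    static int demo() {
        AtomicInteger written = new AtomicInteger();
        MetricStore store = new MetricStore() {
            public List<double[]> readOneHourData(int scheduleId) {
                return Arrays.asList(new double[] {1.0}, new double[] {2.0});
            }
            public void writeAggregate(int scheduleId, double[] row) {
                written.incrementAndGet();
            }
        };
        ExecutorService writers = Executors.newFixedThreadPool(2);
        migrate(store, Arrays.asList(1, 2, 3), writers);
        writers.shutdown();
        return written.get();
    }
}
```

The key point is the blocking get(): read concurrency stays at one regardless of how large the write pool is.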
I ran the MigrateAggregateMetrics upgrade step on my dev machine using the same data set mentioned in the original description with a single node cluster. My machine has 8 cores, 16 GB of RAM, and an SSD. It took a little over 6 hours and 20 minutes to finish. Storage node heap usage never went over 192 MB, and overall memory usage peaked at just under 430 MB.
More than half of the queries against the 1 hr table touched 7 SSTables, meaning 7 disk reads per query. Running a major compaction against the legacy 1hr, 6hr, and 24hr tables prior to upgrade could certainly improve read performance in cases like this.
We can also make the read throttling configurable so that we load data for multiple schedules in parallel. This can have a really big impact on performance with a tradeoff of course. The more concurrent reads we do, the more memory pressure there will be on the storage node.
I am not sure how much of an impact increasing the write parallelism will have. The max number of live 1 hr data points per schedule, for example, is 336, which translates to 1008 CQL rows. For each data point we perform one write to the new aggregate_metrics table. We should not even be bumping into the 20k/sec write limit for a single node. Of course, if we increase the read concurrency, we might then want to adjust the write concurrency as well.
It is also worth looking at batching writes. I do not know how much, if at all, it could help, but it is definitely worth a look.
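For illustration, batching here just means grouping the per-data-point writes into fixed-size chunks before sending them as one request (partition() below is a hypothetical helper, not RHQ code):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of write batching: split a list of pending writes into
// fixed-size batches, each of which would be sent as a single request.
public class WriteBatching {
    static <T> List<List<T>> partition(List<T> writes, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < writes.size(); i += batchSize) {
            // subList is a view; callers should not mutate the source list
            batches.add(writes.subList(i, Math.min(i + batchSize, writes.size())));
        }
        return batches;
    }
}
```

With a batch size of 30, 100 pending writes become 4 requests instead of 100, which is where the round-trip savings come from.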
I will run through the migration step again using the same data set and increase the read concurrency to get a feel for what kind of improvement that brings.
I increased the read parallelism to 5 and also increased the heap size of my storage node to 512 MB (it was previously 256 MB). The migration step completed in 2:48 hr. We should see further improvements if we increase the read parallelism even more and apply some of the other changes I mentioned in comment 2. I will continue investigating.
I did another test run that batches writes in addition to the parallel reads. I used a batch size of 30. The execution time was 1:05 hr. I think we can still shave off more time with some more tweaks, so I am going to do a few more test runs.
I wanted to give a brief update. I have brought the execution time (on my dev box) from 6:30 hr originally to 21 min. I am going to do one more run with a couple more changes to see how much more time if any I can shave off. Then I will work on getting the changes doc'd and merged.
As it stands, it is not practical to upgrade a JBoss ON 3.2 system to 3.3 in environments that have more than 50,000 schedules with more than 4 gigabytes of metric data from the JBoss ON 3.2 system.
I am increasing the severity to urgent.
The execution time of 21 minutes I reported in comment 7 was erroneous. I learned after the fact that it was from a bad test run. The upgrade code first queries the relational db to get a list of all numeric measurement schedule ids, including disabled measurement schedules. We wind up querying Cassandra for a lot of schedule ids that do not have any data at all. I spent some time trying to find a way to determine precisely which schedule ids have data. I found a way using the legacy Thrift APIs with the Hector library. This is not like a "SELECT DISTINCT id FROM table" SQL query that executes quickly and efficiently. It requires some iteration; however, it is fast enough to incorporate into the upgrade code, for two reasons. First, it can potentially dramatically reduce the number of queries we perform. Second, it provides valuable feedback, as we can log exactly the number of metrics that have to be migrated.
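The iteration pattern looks roughly like the following. The real code pages through row keys with Hector's range queries over Thrift; here an in-memory sorted set stands in for the table so the sketch is runnable, and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

// Sketch of paging through row keys to find schedule ids that actually have
// data: fetch a page of keys, then start the next page after the last key seen.
public class ScheduleIdScan {
    static List<Integer> scheduleIdsWithData(NavigableSet<Integer> rowKeys, int pageSize) {
        List<Integer> ids = new ArrayList<>();
        Integer cursor = rowKeys.isEmpty() ? null : rowKeys.first();
        while (cursor != null) {
            int fetched = 0;
            for (Integer key : rowKeys.tailSet(cursor, true)) {
                ids.add(key);
                fetched++;
                if (fetched == pageSize) {
                    break; // page full; resume from the next key
                }
            }
            Integer last = ids.get(ids.size() - 1);
            cursor = rowKeys.higher(last); // null once the key space is exhausted
        }
        return ids;
    }
}
```

Unlike SELECT DISTINCT, each page is a separate round trip, which is why it is slower than the equivalent SQL but still cheap relative to the migration itself.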
I have done a lot more testing over the past few days using the sample data set and have some pretty specific numbers to share. I have been testing on my dev machine, which has the following specs:
processor: quad-core, intel i7, 2.7 GHz
RAM: 16 GB
hard disk: SSD (have to look up specs on model/speeds)
I have been running against a single node. Both the node and the upgrade code are running on my machine. I have changed the heap settings of my storage node in cassandra-jvm.properties as follows:
With my setup, I am able to reliably do 100 reads/sec, which results in a migration rate of about 6k/minute. In other words, I have been able to consistently migrate 6k metrics per minute without error. This results in a processing time of about 27 minutes for the sample data set. I am doing 20k writes/sec. Read and write limits are configured using a RateLimiter. I have tested numerous different configurations, including a bunch with higher write limits (higher write throughput might be a better way of referring to it), and it does not improve execution times. In fact, in some cases it has made execution times worse. Read throughput is the limiting factor here.
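For reference, the RateLimiter in use is Guava's, which hands out permits at a fixed rate. This minimal token-bucket stand-in shows the idea; the class and method names are illustrative, not RHQ or Guava code:

```java
// Minimal token-bucket rate limiter: permits accumulate at a fixed rate up to
// a one-second burst, and each request consumes one permit.
public class SimpleRateLimiter {
    private final double permitsPerSecond;
    private double storedPermits;
    private long lastRefillNanos;

    SimpleRateLimiter(double permitsPerSecond) {
        this.permitsPerSecond = permitsPerSecond;
        this.storedPermits = permitsPerSecond; // allow an initial 1-second burst
        this.lastRefillNanos = System.nanoTime();
    }

    // Blocks until a permit is available.
    synchronized void acquire() throws InterruptedException {
        refill();
        while (storedPermits < 1.0) {
            long sleepMillis = (long) Math.ceil((1.0 - storedPermits) / permitsPerSecond * 1000);
            Thread.sleep(Math.max(1, sleepMillis));
            refill();
        }
        storedPermits -= 1.0;
    }

    // Non-blocking variant: take a permit if one is available right now.
    synchronized boolean tryAcquire() {
        refill();
        if (storedPermits >= 1.0) {
            storedPermits -= 1.0;
            return true;
        }
        return false;
    }

    private void refill() {
        long now = System.nanoTime();
        storedPermits = Math.min(permitsPerSecond,
                storedPermits + (now - lastRefillNanos) / 1e9 * permitsPerSecond);
        lastRefillNanos = now;
    }
}
```

Configured at 100 permits/sec for reads and 20k permits/sec for writes, every request first acquires a permit, which is what caps the sustained rates described above.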
I tried increasing my read throughput to 105, 110, and 120 reads/sec. Anything over 100 reads/sec results in some number of failed migrations. I also did a test with the storage node's heap settings doubled and a read rate of 120 reads/sec. This still resulted in failed migrations.
These numbers are of course specific to my hardware configuration, but they are precise and reliable. The settings I am using for reads and writes might be too slow for some deployments and too fast for others. This is why they are both configurable via system properties.
If the upgrade ends with failed migrations (maybe because it was going too fast), or if the user kills the upgrade (maybe because it was going too slow), then it can be restarted with different read/write limits to slow down or speed up as necessary. And on subsequent runs, we will not redo any migrations that have already completed successfully.
I mentioned that with my hardware configuration, I can reliably perform 6k metric migrations per minute. Let's say I have 50k measurement schedules with data to migrate. 50k schedules roughly means 150k metrics, since we are dealing with 1 hr, 6 hr, and 24 hr metrics. The migration would then take about 25 minutes on my machine.
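Checking the arithmetic above (method names are just for illustration):

```java
// Back-of-the-envelope estimate: each schedule has data in the 1 hr, 6 hr,
// and 24 hr tables, so 50k schedules means ~150k metric migrations; at
// 6k migrations/minute that is 150k / 6k = 25 minutes.
public class MigrationEstimate {
    static long estimateMinutes(long schedules, long tablesPerSchedule, long migrationsPerMinute) {
        return (schedules * tablesPerSchedule) / migrationsPerMinute;
    }
}
```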
I did another extensive round of testing after a machine reboot and shutting down a bunch of stuff like browsers, intellij, etc. I was able to increase my reads and writes as follows,
This reduced the execution time of the data migration to about 7 minutes.
I am going to test with a two node cluster using virtual machines and spinning disks. I will report back the results of those tests as soon as I have some data.
This wound up being a lot more work, for a couple of reasons. First, I implemented dynamic throttling. Second, I improved fault tolerance.
With the throttling, the migration will continue increasing request throughput as long as there are no errors. Once an error threshold is exceeded, request throughput is decreased. The idea is to go as fast as we can whether we are running a default deployment on a two core virtual machine with default heap settings, or if we are running a 5 node cluster with each node running on dedicated hardware having 8 CPU cores, 16 GB RAM, etc.
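A minimal sketch of that policy, assuming a simple additive-increase/multiplicative-decrease rule (the actual implementation may differ in how it ramps and backs off):

```java
// Sketch of dynamic throttling: ramp the request rate up while migrations
// succeed, and halve it once errors in a reporting window cross a threshold.
public class DynamicThrottle {
    private double rate;              // current requests/sec target
    private final double minRate;     // never throttle below this floor
    private final double increment;   // additive ramp-up per healthy window
    private final int errorThreshold; // errors per window before backing off

    DynamicThrottle(double initialRate, double minRate, double increment, int errorThreshold) {
        this.rate = initialRate;
        this.minRate = minRate;
        this.increment = increment;
        this.errorThreshold = errorThreshold;
    }

    // Called once per reporting window with that window's failed-request count;
    // returns the rate to use for the next window.
    double adjust(int errorsInWindow) {
        if (errorsInWindow > errorThreshold) {
            rate = Math.max(minRate, rate / 2); // multiplicative back-off on errors
        } else {
            rate += increment;                  // gentle ramp-up while healthy
        }
        return rate;
    }
}
```

The halving on error and slow ramp-up mean the rate converges near whatever the cluster can sustain, whether that is a two-core VM or a 5 node cluster on dedicated hardware.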
For the fault tolerance changes, requests will be retried until all metric migrations complete successfully. The migration task still maintains a commit log of completed migrations so that if the upgrade is restarted for whatever reason, it will not redo already finished work.
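The commit log idea can be sketched like this, assuming a hypothetical one-schedule-id-per-line file format (not the actual format the migration uses):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

// Sketch of the migration commit log: append each finished schedule id to a
// file so a restarted upgrade can skip already-completed work.
public class MigrationLog {
    private final Path log;
    private final Set<String> done = new HashSet<>();

    MigrationLog(Path log) {
        this.log = log;
        try {
            if (Files.exists(log)) {
                done.addAll(Files.readAllLines(log)); // replay on restart
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean alreadyMigrated(int scheduleId) {
        return done.contains(Integer.toString(scheduleId));
    }

    void markMigrated(int scheduleId) {
        String id = Integer.toString(scheduleId);
        try {
            Files.write(log, (id + System.lineSeparator()).getBytes(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        done.add(id);
    }

    // Helper for demos/tests: an empty log file in a temp location.
    static Path tempLog() {
        try {
            return Files.createTempFile("migration", ".log");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On restart, the migration would check alreadyMigrated() before issuing any queries for a schedule, which is why an interrupted upgrade does not redo finished work.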
The migration logs progress every 30 seconds. The number of remaining 1 hr, 6 hr, and 24 hr metrics is reported. The total number of read and write errors is also reported, as an indicator of how well the storage node cluster is keeping up with the load. Lastly, the rate at which migrations are completed is reported every minute.
The one_hour_metrics, six_hour_metrics, and twenty_four_hour_metrics tables are dropped after all of the migrations have completed successfully. When a table is dropped, Cassandra does *not* delete the data. It creates a snapshot of the SSTable files and deletes the originals. This means that whatever space was being consumed is still being consumed by the snapshots. Those snapshots can be deleted after the upgrade completes in order to reclaim disk space.
Changes have been pushed to the release/jon3.3.x branch.
commit hash: b76f9f35a7
Moving to ON_QA as available to test with the following cumulative patch brew build:
First test on JON 3.3.2.DR01 looks fine.
(all using JON3.2.0.GA):
- server1: JON server (master - postgres db is here) + SN + agent
- server2: JON server (slave) + SN + agent
- server3: JON server (slave) + agent
- server4: SN + agent
- two remote agents with EAP6
1 - stop everything
2 - for each server unzip JON 3.3.0.GA and 3.3.2.DR01 patch
3 - upgrade server4 (note that rhq-server.properties must be manually updated - bz1157480)
4 - upgrade server1,2,3
5 - apply the patch on all servers
6 - start all storage nodes
7 - rhqctl upgrade --storage-schema
8 - start server1 and wait until it's completely UP
9 - start other servers one by one (wait for each server to be completely UP before starting another one)
This setup (JON 3.2.0) was running for ~3 weeks, and there were the following numbers of schedule ids:
Note that step 5 (apply the patch on all servers) is after steps 3 and 4 because of bz 1210181.
Complete log is attached.
Migration took ~18 minutes on a VM with 2 CPUs, 4 GB of memory, and shared storage (definitely slower than local disk).
Created attachment 1012529 [details]
migration log for comment 17
Verified on ER01
Two issues were found while testing this (bz1213782 and 1213366) which I'm not able to reproduce. Tracking them separately and marking this bz as verified.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.