+++ This bug was initially created as a clone of Bug #1135629 +++

Description of problem:
In RHQ 4.11 we had the metrics_index table, which was replaced with the metrics_cache_index table in RHQ 4.12. With both tables, the number of rows per (CQL) partition is N, where N is the number of measurement schedules having data stored during a given time slice, e.g., 04:00 - 05:00. As N gets bigger, we wind up with increasingly large partitions, often referred to as "wide rows" in Cassandra terminology. The problem is even worse with the new metrics_cache_index table because we store even more data in each row. Because of bug 1135603, we no longer need the metrics_cache_index table.

We need to avoid really big wide rows and also deal with the read timeouts. The following schema change will address these concerns:

    CREATE TABLE rhq.metrics_idx (
        bucket text,
        partition int,
        time timestamp,
        schedule_id int,
        PRIMARY KEY ((bucket, partition, time), schedule_id)
    )

This is almost the same as the original metrics_index table except that it now has a partition column. We will need to track the number of partitions, i.e., the number of possible values for the partition column. Let's say it is 5, which means the column can range from zero to four. Now the N schedule ids will be spread across 5 partitions, roughly N / 5 ids in each, instead of all landing in a single one. The index queries used during aggregation will also need to be refactored to avoid retrieving too much data at once.
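To make the partitioning idea concrete, here is a minimal sketch in Python. It is not the RHQ implementation. It assumes schedule ids are assigned to partitions with a simple modulo, which the description does not specify, and it models the table as an in-memory dict keyed by (bucket, partition, time). The names partition_for, build_index, and read_index, the bucket value "raw", and the page size are all hypothetical.

```python
from collections import defaultdict

NUM_PARTITIONS = 5  # the "let's say it is 5" example from the description


def partition_for(schedule_id, num_partitions=NUM_PARTITIONS):
    """Map a schedule id to a partition in [0, num_partitions).

    Modulo assignment is an assumption made for this sketch.
    """
    return schedule_id % num_partitions


def build_index(schedule_ids, bucket, time_slice, num_partitions=NUM_PARTITIONS):
    """Group schedule ids by the (bucket, partition, time) partition key."""
    index = defaultdict(list)
    for schedule_id in schedule_ids:
        key = (bucket, partition_for(schedule_id, num_partitions), time_slice)
        index[key].append(schedule_id)
    return index


def read_index(index, bucket, time_slice,
               num_partitions=NUM_PARTITIONS, page_size=1000):
    """Yield schedule ids one partition at a time, in bounded pages.

    This stands in for index queries that avoid pulling a whole
    time slice in a single read.
    """
    for partition in range(num_partitions):
        ids = sorted(index.get((bucket, partition, time_slice), []))
        for start in range(0, len(ids), page_size):
            yield ids[start:start + page_size]
```

With 100 schedule ids and 5 partitions, each partition key holds 20 ids, and a reader with a page size of 8 never receives more than 8 ids at once. Whether modulo gives an even spread in practice depends on how schedule ids are distributed.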
Note that the metrics_cache_index table is not in JON. It was first introduced in RHQ 4.12. We need this bug though because code changes are needed in the release/jon3.3.x branch.
Is it possible to get some documentation on how big a problem this is, and how effective the solution is? I would like to avoid "premature optimizations" involving possibly disruptive schema changes.
Changes have been pushed to the release/jon3.3.x branch. See bug 1135629 for details on the changes.

Commit hashes: 6368fae4f52 58673a3c18 091f947f1a 44b37973be f3af0274cce 457cb6cf3e78 531cee0d4 16a044e176 43265ad40d 274915480 99de2c6d 805d07479
Moving to ON_QA as available for test with the following brew build: https://brewweb.devel.redhat.com//buildinfo?buildID=385149
Verified on JON 3.3 ER03, on a setup under high load generated with the perftest plugin and the agentcopy tool. The environment ran for a couple of days. There are no timeout errors in the server log and no errors in rhq-storage.log.