Description of problem:
In RHQ 4.11 we had the metrics_index table which was replaced with the metrics_cache_index table in RHQ 4.12. With both tables the number of rows per (CQL) partition would be N, where N is the number of measurement schedules having data stored during a given time slice, e.g., 04:00 - 05:00. As N gets bigger, we wind up with increasingly large partitions, often referred to as "wide rows" in Cassandra terminology. The problem is even worse with the new metrics_cache_index table because we store even more data in each row.
Because of bug 1135603, we no longer need the metrics_cache_index table. We need to avoid really big wide rows and also deal with the read timeouts. The following schema change will address these concerns:
CREATE TABLE rhq.metrics_idx (
    bucket text,
    partition int,
    time timestamp,
    schedule_id int,
    PRIMARY KEY ((bucket, partition, time), schedule_id)
);
This is almost the same as the original metrics_index table except that it now has a partition column. We will need to track the number of partitions, i.e., the number of possible values for the partition column. Say it is 5; then the column can range from zero to four, and the N schedule ids will be spread across 5 partitions, roughly N / 5 ids per partition, instead of all landing in a single one. The index queries used during aggregation will also need to be refactored to avoid retrieving too much data at once.
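As a minimal sketch of how schedule ids could be spread across index partitions (this is illustrative, not RHQ's actual code; the class and method names, and the choice of modulo hashing, are assumptions):

```java
public class IndexPartitioner {

    // Number of possible values for the partition column; 5 in the
    // example above, so partitions range from 0 to 4.
    static final int NUM_PARTITIONS = 5;

    // Map a schedule id to one of the index partitions. Any stable,
    // roughly uniform mapping works; modulo is the simplest choice.
    static int partitionFor(int scheduleId) {
        return Math.abs(scheduleId) % NUM_PARTITIONS;
    }
}
```

Both the writer (when inserting index entries) and the reader (when querying during aggregation) must use the same mapping, which is why the partition count has to be tracked somewhere shared.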
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Having implemented my own partitioning scheme for bulk data in Cassandra, I found that 100 partitions take very little time to query (a fraction of a second), so 1000 partitions for many schedules may be a good starting point.
If you can figure out the number of active metric schedules (excluding trait schedules), you can come up with a better estimate. For new installations, I'm not sure how this might work. Of course, you can always increase the partition count at runtime.
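One simple heuristic for deriving a partition count from the number of active schedules: pick enough partitions that each index row stays under a target size. This is a hypothetical sketch; the method name and the target-row-size parameter are assumptions, not anything from the RHQ code:

```java
public class PartitionEstimate {

    // Estimate how many index partitions are needed so that no single
    // partition holds more than targetRowSize schedule ids.
    static int estimatePartitions(int activeSchedules, int targetRowSize) {
        int needed = (int) Math.ceil((double) activeSchedules / targetRowSize);
        return Math.max(1, needed); // always at least one partition
    }
}
```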
I was actually thinking smaller, a lot smaller. I think the biggest problem with the current and past approaches is that we only do a single query for the index, and that can result in timeouts as the partitions get bigger. I think we resolve most of the problems with the following queries:
SELECT schedule_id FROM metrics_idx
WHERE bucket = 'raw' AND partition = 2 AND time = 16:00

// and to get the next page...
SELECT schedule_id FROM metrics_idx
WHERE bucket = 'raw' AND partition = 2 AND time = 16:00 AND schedule_id > 11000
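The paging scheme in the queries above can be sketched against an in-memory sorted set standing in for one index partition. This is an illustration of the iteration logic only (the real implementation, IndexIterator.java, issues the equivalent CQL with `schedule_id > lastSeen` to fetch each next page); the names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

public class IndexPager {

    // Walk one index partition page by page. Each iteration models
    // "WHERE ... AND schedule_id > lastSeen LIMIT pageSize".
    static List<List<Integer>> pages(SortedSet<Integer> partition, int pageSize) {
        List<List<Integer>> result = new ArrayList<>();
        Integer lastSeen = null;
        while (true) {
            SortedSet<Integer> tail =
                (lastSeen == null) ? partition : partition.tailSet(lastSeen + 1);
            List<Integer> page = new ArrayList<>();
            for (Integer id : tail) {
                page.add(id);
                if (page.size() == pageSize) {
                    break;
                }
            }
            if (page.isEmpty()) {
                break; // no more schedule ids in this partition
            }
            result.add(page);
            lastSeen = page.get(page.size() - 1);
        }
        return result;
    }
}
```

Because schedule_id is the clustering column, rows within a partition are already sorted, so the `schedule_id > lastSeen` predicate is an efficient slice rather than a filter.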
We would implement paging within a single partition as well as breaking up the index for a given time slice into multiple partitions. As for how many, I was actually thinking just a handful to start with. Maybe I am wrong, but here is my reasoning. Assuming the data is not already in memory, a read for a partition key requires a disk seek for each SSTable that has to be read. After that initial seek, columns are read sequentially, which should be very efficient. Based on this alone, we know that querying N partitions will require S disk seeks, where S >= N. Of course, with multiple nodes, queries can be executed in parallel to reduce the overall latency.

When Cassandra performs a read from disk, the scanned file segments, which I believe are 64k by default, get loaded into memory. So it is entirely possible that when we query for a page of data from the index, Cassandra will load more into memory than we need, and the subsequent query will not have to hit the disk at all.
I agree that the number of schedules needs to be taken into account when trying to determine how many index partitions we should have. But I also think we have to take the cluster size into consideration as well.
I think the file segment size is determined by the column_index_size_in_kb property in cassandra.yaml.
In terms of inserting data, I think it is more efficient to insert into narrower rows than wider ones. Intuitively that seems to be the case, since maintaining sorted rows requires more shuffling of data. I'm also thinking a large enough partition count would avoid any need for paged queries. But perhaps you're right that it is more I/O efficient to simply have fairly wide rows.
I'd like it if there might be a way to try both and decide which is more efficient, both in reads and in writes.
I have removed the metrics_cache and metrics_cache_index tables. References to them in code have been removed as well. Paging is now used when querying the index during data aggregation. The paging functionality is implemented in IndexIterator.java. There is a new schema upgrade step, ReplaceIndex.java, that migrates data from the legacy index tables into the new metrics_idx table.
More work is needed to make the number of index partitions fully configurable. That can come later though. There was no reason for that work to hold up these changes from getting into master. If need be we can address it in a separate BZ.
I created bug 1140849 for tracking the work for making the number of partitions configurable. Moving this to ON_QA.