Bug 1135629

Summary: Wide rows in metrics index table can cause read timeouts
Product: [Other] RHQ Project
Reporter: John Sanda <jsanda>
Component: Core Server, Storage Node
Assignee: Nobody <nobody>
Status: ON_QA
Severity: unspecified
Priority: unspecified
Version: 4.12
CC: genman, hrupp
Target Milestone: ---
Target Release: RHQ 4.13
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Clones: 1135630 (view as bug list)
Type: Bug
Bug Blocks: 1135630

Description John Sanda 2014-08-29 19:57:33 UTC
Description of problem:
In RHQ 4.11 we had the metrics_index table which was replaced with the metrics_cache_index table in RHQ 4.12. With both tables the number of rows per (CQL) partition would be N, where N is the number of measurement schedules having data stored during a given time slice, e.g., 04:00 - 05:00. As N gets bigger, we wind up with increasingly large partitions, often referred to as "wide rows" in Cassandra terminology. The problem is even worse with the new metrics_cache_index table because we store even more data in each row.

Because of bug 1135603, we no longer need the metrics_cache_index table. We need to avoid really big wide rows and also deal with the read timeouts. The following schema change will address these concerns:

CREATE TABLE rhq.metrics_idx (
  bucket text,
  partition int,
  time timestamp,
  schedule_id int,
  PRIMARY KEY ((bucket, partition, time), schedule_id)
);

This is almost the same as the original metrics_index table except that it now has a partition column. We will need to track the number of partitions, i.e., the number of possible values for the partition column. Let's say it is 5, which means the column can range from zero to four. The N schedule ids will then be spread across the 5 partitions, with roughly N / 5 schedule ids in each, instead of all landing in a single partition. The index queries used during aggregation will also need to be refactored to avoid retrieving too much data at once.
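
For illustration, here is a minimal write-side sketch of how a schedule id could be spread across the index partitions, assuming the DataStax Java driver; the class name, contact point, and the fixed partition count of 5 are made up for the example:

import java.util.Date;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class IndexWriteSketch {

    // Hypothetical fixed partition count (5 in the example above)
    private static final int NUM_PARTITIONS = 5;

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("rhq");
        PreparedStatement insert = session.prepare(
            "INSERT INTO metrics_idx (bucket, partition, time, schedule_id) VALUES (?, ?, ?, ?)");

        int scheduleId = 11000;
        long timeSlice = 1409328000000L; // start of the current hour, in millis

        // Spread schedule ids evenly across the index partitions
        int partition = scheduleId % NUM_PARTITIONS;
        session.execute(insert.bind("raw", partition, new Date(timeSlice), scheduleId));

        cluster.close();
    }
}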

Comment 1 Elias Ross 2014-09-04 18:11:25 UTC
Having implemented my own partition scheme for bulk data in Cassandra, I found that 100 partitions take very little time to query (a fraction of a second), so 1000 partitions for many schedules may be a good starting point.

If you can figure out the number of active metric schedules (excluding trait schedules), you can come up with a better estimate. For new installations, I'm not sure how this might work. Of course, you can always increase the number of partitions at runtime.
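
As a back-of-the-envelope illustration of that sizing idea (the target of roughly 100 schedule ids per index partition is an assumption for the example, not something stated in this bug):

// Hypothetical sizing helper: pick a partition count so that each index
// partition ends up holding roughly targetIdsPerPartition schedule ids.
public class PartitionCountEstimate {

    public static int estimate(int activeScheduleCount, int targetIdsPerPartition) {
        return Math.max(1, (int) Math.ceil((double) activeScheduleCount / targetIdsPerPartition));
    }

    public static void main(String[] args) {
        // e.g., 100,000 active metric schedules at ~100 ids per partition -> 1000 partitions
        System.out.println(estimate(100000, 100));
    }
}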

Comment 2 John Sanda 2014-09-04 19:23:40 UTC
I was actually thinking smaller, a lot smaller. I think the biggest problem with the current and past approaches is that we only do a single query against the index, and that can result in timeouts as the partitions get bigger. I think we can resolve most of the problems with the following queries:

// querying the 16:00 time slice
SELECT schedule_id
FROM metrics_idx
WHERE bucket = 'raw' AND partition = 2 AND time = '2014-09-04 16:00:00'
LIMIT 2000;

// and to get the next page...
SELECT schedule_id
FROM metrics_idx
WHERE bucket = 'raw' AND partition = 2 AND time = '2014-09-04 16:00:00' AND schedule_id > 11000
LIMIT 2000;

We implement paging within a single partition as well as breaking the index for a given time slice up into multiple partitions. As for how many partitions, I was actually thinking of just a handful to start with. Maybe I am wrong, but here is my reasoning. Assuming the data is not already in memory, a read for a partition key requires a disk seek for each SSTable that has to be read. After that initial seek, columns are read sequentially, which should be very efficient. Based on this alone, we know that querying N partitions will require S disk seeks where S >= N. Of course, with multiple nodes the queries can be executed in parallel to reduce the overall latency. When Cassandra performs a read from disk, the scanned file segments, which I believe are 64 KB in size by default, get loaded into memory. So it is entirely possible that when we query for a page of index data, Cassandra will load more into memory than we need, and the subsequent query will not have to hit the disk at all.
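
To make the paging concrete, here is a rough sketch of that query loop, assuming the DataStax Java driver; the class name, method signature, and page size are illustrative and not the actual implementation:

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class IndexPagingSketch {

    private static final int PAGE_SIZE = 2000; // illustrative page size

    // Returns all schedule ids stored in one index partition for one time slice,
    // fetched a page at a time so that no single query pulls back a huge result.
    public static List<Integer> loadScheduleIds(Session session, String bucket, int partition, Date time) {
        PreparedStatement firstPage = session.prepare(
            "SELECT schedule_id FROM metrics_idx " +
            "WHERE bucket = ? AND partition = ? AND time = ? LIMIT " + PAGE_SIZE);
        PreparedStatement nextPage = session.prepare(
            "SELECT schedule_id FROM metrics_idx " +
            "WHERE bucket = ? AND partition = ? AND time = ? AND schedule_id > ? LIMIT " + PAGE_SIZE);

        List<Integer> scheduleIds = new ArrayList<Integer>();
        ResultSet page = session.execute(firstPage.bind(bucket, partition, time));
        while (true) {
            int rowsInPage = 0;
            int lastScheduleId = 0;
            for (Row row : page) {
                lastScheduleId = row.getInt("schedule_id");
                scheduleIds.add(lastScheduleId);
                rowsInPage++;
            }
            if (rowsInPage < PAGE_SIZE) {
                break; // a short page means we have read the whole partition
            }
            page = session.execute(nextPage.bind(bucket, partition, time, lastScheduleId));
        }
        return scheduleIds;
    }
}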

I agree that the number of schedules needs to be taken into account when determining how many index partitions we should have, but I think we also have to take the cluster size into consideration.

Comment 3 John Sanda 2014-09-04 19:34:23 UTC
I think the file segment size is determined by the column_index_size_in_kb property in cassandra.yaml.

Comment 4 Elias Ross 2014-09-05 02:44:30 UTC
In terms of inserting data, I think it is more efficient to insert into narrower rows than wider ones. Intuitively that would seem to be the case, since maintaining sorted rows requires more shuffling of data around. I'm also thinking a large enough partition count would avoid any need for paged queries. But perhaps you're right that it is more I/O efficient to simply have fairly wide rows.

I'd like there to be a way to try both and decide which is more efficient, both for reads and for writes.

Comment 5 John Sanda 2014-09-10 02:16:40 UTC
I have removed the metrics_cache and metrics_cache_index tables. References to them in code have been removed as well. Paging is now used when querying the index during data aggregation. The paging functionality is implemented in IndexIterator.java. There is a new schema upgrade step, ReplaceIndex.java, that migrates data from the legacy index tables into the new metrics_idx table.
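
As a rough illustration of that migration step (not the actual ReplaceIndex code), assuming a DataStax 2.x driver and assuming the legacy metrics_index table exposes bucket, time, and schedule_id columns, which may not match the real 4.11/4.12 schema:

import java.util.Date;

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class IndexMigrationSketch {

    private static final int NUM_PARTITIONS = 5; // hypothetical partition count

    // Copies legacy index entries into metrics_idx, assigning each schedule id
    // to one of the new index partitions. Legacy column names are assumptions.
    public static void migrate(Session session) {
        PreparedStatement insert = session.prepare(
            "INSERT INTO metrics_idx (bucket, partition, time, schedule_id) VALUES (?, ?, ?, ?)");

        ResultSet legacyRows = session.execute("SELECT bucket, time, schedule_id FROM metrics_index");
        for (Row row : legacyRows) {
            String bucket = row.getString("bucket");
            Date time = row.getDate("time");   // driver 2.x returns java.util.Date for timestamps
            int scheduleId = row.getInt("schedule_id");

            int partition = scheduleId % NUM_PARTITIONS;
            session.execute(insert.bind(bucket, partition, time, scheduleId));
        }
    }
}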

More work is needed to make the number of index partitions fully configurable. That can come later though; there was no reason for that work to hold up getting these changes into master. If need be, we can address it in a separate BZ.

commit hashes:
6c0ebd22f5
e352c0cdc4d
58d3eb6c8
75aeb92e1a
b4eaebb1f9
36e8d5030
428d9aa0c
a3a59ffbe1

Comment 6 John Sanda 2014-09-11 20:20:37 UTC
I created bug 1140849 for tracking the work for making the number of partitions configurable. Moving this to ON_QA.