Bug 1135629
| Summary: | Wide rows in metrics index table can cause read timeouts | | |
|---|---|---|---|
| Product: | [Other] RHQ Project | Reporter: | John Sanda <jsanda> |
| Component: | Core Server, Storage Node | Assignee: | Nobody <nobody> |
| Status: | ON_QA | QA Contact: | |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.12 | CC: | genman, hrupp |
| Target Milestone: | --- | | |
| Target Release: | RHQ 4.13 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1135630 (view as bug list) | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1135630 | | |
Description
John Sanda
2014-08-29 19:57:33 UTC
Having implemented my own partition scheme for bulk data in Cassandra, I found that 100 partitions take very little time to query (a fraction of a second), so 1000 partitions for many schedules may be a good starting point. If you can figure out the number of active metric schedules (excluding trait schedules), you can come up with a better estimate. For new installations, I'm not sure how this might work. Of course, you can always increase the partition sizes at runtime.

I was actually thinking smaller, a lot smaller. I think the biggest problem with the current and past approaches is that we only do a single query for the index, and that can result in timeouts as the partitions get bigger. I think we resolve most of the problems with the following queries:

```
SELECT schedule_id FROM metrics_idx
WHERE bucket = 'raw' AND partition = 2 AND time = 16:00
LIMIT 2000;

// and to get the next page...
SELECT schedule_id FROM metrics_idx
WHERE bucket = 'raw' AND partition = 2 AND time = 16:00 AND schedule_id > 11000
LIMIT 2000;
```

We implement paging across a single partition as well as break up the index for a given time slice into multiple partitions (a paging sketch and a partition-mapping sketch follow at the end of this comment). As for how many partitions, I was actually thinking just a handful to start with. Maybe I am wrong, but here is my reasoning. Assuming that the data is not already in memory, a read for a partition key requires a disk seek for each SSTable that has to be read. After that initial seek, columns are read sequentially, which should be very efficient. Just based on this, we know that querying N partitions will require S disk seeks, where S >= N. Of course, with multiple nodes, queries can be executed in parallel to reduce the overall latency. When Cassandra performs a read from disk, the scanned file segments, which I believe are 64k in size by default, get loaded into memory. So it is entirely possible that when we query for a page of data from the index, Cassandra will load into memory more than we need, and the subsequent query will not have to hit the disk at all. I agree that the number of schedules needs to be taken into account when trying to determine how many index partitions we should have, but I also think we have to take the cluster size into consideration as well.

I think the file segment size is determined by the column_index_size_in_kb property in cassandra.yaml. In terms of inserting data, I think it is more efficient to insert into narrower rows than wider ones. Intuitively, that would seem to be the case, as maintaining sorted rows requires more moving around of data. I'm also thinking a large enough partition count would avoid any need for paged queries. But perhaps you're right that it is more I/O efficient to simply have fairly wide rows. I'd like it if there were a way to try both and decide which is more efficient, both in reads and in writes.

I have removed the metrics_cache and metrics_cache_index tables. References to them in code have been removed as well. Paging is now used when querying the index during data aggregation. The paging functionality is implemented in IndexIterator.java. There is a new schema upgrade step, ReplaceIndex.java, that migrates data from the legacy index tables into the new metrics_idx table (see the migration sketch below). More work is needed to make the number of index partitions fully configurable, but that can come later; there was no reason for that work to hold up these changes from getting into master. If need be, we can address it in a separate BZ.
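A minimal sketch of the paging pattern described above. This is not the actual IndexIterator.java; it assumes the DataStax Java driver 2.x and a simplified metrics_idx schema along the lines of (bucket text, partition int, time timestamp, schedule_id int, PRIMARY KEY ((bucket, partition, time), schedule_id)). The class and method names are illustrative.

```java
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

/** Illustrative only -- not the actual IndexIterator.java. */
public class IndexPager {

    private final Session session;
    private final int pageSize;   // e.g. 2000, as in the queries above

    public IndexPager(Session session, int pageSize) {
        this.session = session;
        this.pageSize = pageSize;
    }

    /** Reads every schedule id in one index partition for one time slice, one page at a time. */
    public List<Integer> scheduleIds(String bucket, int partition, Date timeSlice) {
        List<Integer> ids = new ArrayList<Integer>();
        Integer lastId = null;
        while (true) {
            // First page has no lower bound; later pages resume after the last id seen.
            ResultSet page = (lastId == null)
                ? session.execute(
                    "SELECT schedule_id FROM metrics_idx WHERE bucket = ? AND partition = ? " +
                    "AND time = ? LIMIT " + pageSize,
                    bucket, partition, timeSlice)
                : session.execute(
                    "SELECT schedule_id FROM metrics_idx WHERE bucket = ? AND partition = ? " +
                    "AND time = ? AND schedule_id > ? LIMIT " + pageSize,
                    bucket, partition, timeSlice, lastId);

            int rowsInPage = 0;
            for (Row row : page) {
                lastId = row.getInt("schedule_id");
                ids.add(lastId);
                rowsInPage++;
            }
            if (rowsInPage < pageSize) {
                return ids;   // a short (or empty) page means the partition is exhausted
            }
        }
    }
}
```

An actual iterator implementation would presumably stream pages lazily instead of collecting all ids into a list, but the query shape is the same.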
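On the write side, the report does not spell out how schedules are assigned to index partitions; a simple modulo on the schedule id is one plausible scheme, sketched here with a hypothetical IndexWriter class under the same schema assumptions as above.

```java
import com.datastax.driver.core.Session;
import java.util.Date;

/** Illustrative only -- the actual partition assignment is not specified in this report. */
public class IndexWriter {

    private final Session session;
    private final int numPartitions;   // "just a handful to start with"

    public IndexWriter(Session session, int numPartitions) {
        this.session = session;
        this.numPartitions = numPartitions;
    }

    /** Records that the given schedule reported data during the given time slice. */
    public void update(String bucket, Date timeSlice, int scheduleId) {
        // Spread schedules across index partitions so no single row grows too wide.
        int partition = scheduleId % numPartitions;
        session.execute(
            "INSERT INTO metrics_idx (bucket, partition, time, schedule_id) VALUES (?, ?, ?, ?)",
            bucket, partition, timeSlice, scheduleId);
    }
}
```

Aggregation would then visit partitions 0 through numPartitions - 1 for the time slice being rolled up, paging each one as sketched above. More partitions mean narrower rows per time slice but more partition reads (and disk seeks) per aggregation run, which is the trade-off being weighed in the comments.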
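A very rough sketch of what a schema upgrade step like ReplaceIndex could look like, again assuming the DataStax Java driver 2.x. The legacy table name metrics_index and its columns are assumptions made for the example; the report only refers to "the legacy index tables."

```java
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

/**
 * Illustrative only -- not the actual ReplaceIndex.java. The legacy table name
 * "metrics_index" and its columns are assumptions for the sake of the example.
 */
public class IndexMigration {

    private final Session session;
    private final int numPartitions;

    public IndexMigration(Session session, int numPartitions) {
        this.session = session;
        this.numPartitions = numPartitions;
    }

    /** Copies every legacy index entry into the new, partitioned metrics_idx table. */
    public void migrate() {
        ResultSet legacyRows = session.execute(
            "SELECT bucket, time, schedule_id FROM metrics_index");
        for (Row row : legacyRows) {
            int scheduleId = row.getInt("schedule_id");
            int partition = scheduleId % numPartitions;   // same assignment as the write path
            session.execute(
                "INSERT INTO metrics_idx (bucket, partition, time, schedule_id) VALUES (?, ?, ?, ?)",
                row.getString("bucket"), partition, row.getDate("time"), scheduleId);
        }
    }
}
```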
commit hashes: 6c0ebd22f5, e352c0cdc4d, 58d3eb6c8, 75aeb92e1a, b4eaebb1f9, 36e8d5030, 428d9aa0c, a3a59ffbe1

I created bug 1140849 for tracking the work for making the number of partitions configurable. Moving this to ON_QA.