Bug 1135630 - Wide rows in metrics index table can cause read timeouts
Summary: Wide rows in metrics index table can cause read timeouts
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Core Server, Storage Node
Version: JON 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ER03
Target Release: JON 3.3.0
Assignee: John Sanda
QA Contact: Sunil Kondkar
URL:
Whiteboard:
Depends On: 1135629
Blocks: 1126410 1133609 1135604
 
Reported: 2014-08-29 19:58 UTC by John Sanda
Modified: 2019-04-16 14:17 UTC
CC List: 4 users

Fixed In Version:
Clone Of: 1135629
Environment:
Last Closed: 2014-12-11 14:01:42 UTC
Type: Bug
Embargoed:




Links:
Red Hat Knowledge Base (Solution) 1374703 (Last Updated: Never)

Description John Sanda 2014-08-29 19:58:24 UTC
+++ This bug was initially created as a clone of Bug #1135629 +++

Description of problem:
In RHQ 4.11 we had the metrics_index table, which was replaced with the metrics_cache_index table in RHQ 4.12. With both tables, the number of rows per (CQL) partition is N, where N is the number of measurement schedules that have data stored during a given time slice, e.g., 04:00 - 05:00. As N grows, we wind up with increasingly large partitions, often referred to as "wide rows" in Cassandra terminology. The problem is even worse with the new metrics_cache_index table because it stores even more data in each row.
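
For illustration, here is a sketch of the original single-partition layout; the exact RHQ 4.11 column definitions are an assumption reconstructed from the description above, not copied from the actual schema. Every schedule id written during a time slice lands in the same (bucket, time) partition, so the partition grows linearly with N.

-- Hypothetical reconstruction of the RHQ 4.11 metrics_index layout;
-- column names are inferred from this report.
CREATE TABLE rhq.metrics_index (
  bucket text,
  time timestamp,
  schedule_id int,
  PRIMARY KEY ((bucket, time), schedule_id)
);
-- All N schedule ids for a given bucket and time slice share one
-- partition, which is what produces the "wide rows" described above.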

Because of bug 1135603, we no longer need the metrics_cache_index table. We need to avoid very wide rows and also deal with the read timeouts. The following schema change addresses both concerns:

CREATE TABLE rhq.metrics_idx (
  bucket text,
  partition int,
  time timestamp,
  schedule_id int,
  PRIMARY KEY ((bucket, partition, time), schedule_id)
);

This is almost the same as the original metrics_index table except that it now has a partition column. We will need to track the number of partitions, i.e., the possible values of the partition column. Say it is 5; then the column ranges from zero to four, and the N schedule ids are spread across the 5 partitions, roughly N / 5 per partition, instead of all landing in a single one. The index queries used during aggregation will also need to be refactored to avoid retrieving too much data at once; see the sketch below.
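
As a sketch of how this could work (the modulo routing and example values here are assumptions, not confirmed details from this bug), the writer routes each schedule id to a partition in application code, and the aggregation job then issues one bounded query per partition instead of a single unbounded query per time slice:

-- Assumed routing: partition = schedule_id % 5, computed in application
-- code since CQL cannot evaluate expressions; e.g. schedule id 10023
-- maps to partition 3.
INSERT INTO rhq.metrics_idx (bucket, partition, time, schedule_id)
VALUES ('raw', 3, '2014-08-29 04:00:00+0000', 10023);

-- Aggregation then reads each of the five smaller partitions in turn:
SELECT schedule_id FROM rhq.metrics_idx
WHERE bucket = 'raw' AND partition = 0 AND time = '2014-08-29 04:00:00+0000';
-- ...repeated for partition = 1 through 4, with paging or a LIMIT so that
-- no single read pulls back too much data at once.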

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 John Sanda 2014-08-29 19:58:48 UTC
Note that the metrics_cache_index table is not in JON; it was first introduced in RHQ 4.12. We still need this bug, though, because code changes are needed in the release/jon3.3.x branch.

Comment 3 Mike Foley 2014-09-10 14:54:46 UTC
Is it possible to get some documentation on how big a problem this is ... and how effective the solution is? I would like to avoid "premature optimizations" involving possibly disruptive schema changes.

Comment 5 John Sanda 2014-09-12 01:36:56 UTC
Changes have been pushed to the release/jon3.3.x branch. See bug 1135629 for details on changes.

commit hashes:

6368fae4f52
58673a3c18
091f947f1a
44b37973be
f3af0274cce
457cb6cf3e78
531cee0d4
16a044e176
43265ad40d
274915480
99de2c6d
805d07479

Comment 6 Simeon Pinder 2014-09-17 02:49:22 UTC
Moving to ON_QA as this is available for testing with the following brew build:
https://brewweb.devel.redhat.com//buildinfo?buildID=385149

Comment 7 Sunil Kondkar 2014-09-30 15:22:43 UTC
Verified on JON 3.3 ER03.

Verified on a setup with high load using the perftest plugin and the agentcopy tool, letting the environment run for a couple of days.
There are no timeout errors in the server log and no errors in rhq-storage.log.

