Bug 1135602 - Store aggregate metrics in a single (CQL) row
Summary: Store aggregate metrics in a single (CQL) row
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Core Server, Performance, Storage Node
Version: JON 3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ER03
Target Release: JON 3.3.0
Assignee: John Sanda
QA Contact: Garik Khachikyan
URL:
Whiteboard:
Depends On: 1049054
Blocks: 1133605 1133609
 
Reported: 2014-08-29 18:36 UTC by John Sanda
Modified: 2019-04-16 14:17 UTC
CC List: 6 users

Fixed In Version:
Clone Of: 1049054
Environment:
Last Closed: 2014-12-11 14:02:42 UTC
Type: Bug
Embargoed:




Links:
Red Hat Knowledge Base (Solution) 1374703

Description John Sanda 2014-08-29 18:36:03 UTC
+++ This bug was initially created as a clone of Bug #1049054 +++

Description of problem:
This change has the potential to be a big win as it will reduce the number of writes for storing aggregate metrics by a factor of three. The C* schema for aggregate metrics looks like,

CREATE TABLE one_hour_metrics (
    schedule_id int,
    time timestamp,
    type int,
    value double,
    PRIMARY KEY (schedule_id, time, type)
) WITH COMPACT STORAGE;

where the type column identifies the value as one of max, min, or average. Storing an aggregate metric therefore requires three separate write requests. Batch statements could be used to reduce the number of network roundtrips; however, I found the performance of individual prepared statements to be better than that of unprepared batches. Inserting an aggregate would look like,

INSERT INTO one_hour_metrics (schedule_id, time, type, value)
VALUES (100, '2014-01-01', 0, 3.14);
INSERT INTO one_hour_metrics (schedule_id, time, type, value)
VALUES (100, '2014-01-01', 1, 3.14);
INSERT INTO one_hour_metrics (schedule_id, time, type, value)
VALUES (100, '2014-01-01', 2, 3.14);
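
For reference, the batched alternative mentioned above would look roughly like the following (a sketch; a logged batch like this one would also make the three inserts atomic, at the cost of extra coordinator overhead):

BEGIN BATCH
  INSERT INTO one_hour_metrics (schedule_id, time, type, value)
  VALUES (100, '2014-01-01', 0, 3.14);
  INSERT INTO one_hour_metrics (schedule_id, time, type, value)
  VALUES (100, '2014-01-01', 1, 3.14);
  INSERT INTO one_hour_metrics (schedule_id, time, type, value)
  VALUES (100, '2014-01-01', 2, 3.14);
APPLY BATCH;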

Using CQL collections we can reduce the number of writes for an aggregate metric to one while the on-disk storage remains nearly identical. The schema would look like,

CREATE TABLE one_hour_metrics (
    schedule_id int,
    time timestamp,
    value map<int, double>,
    PRIMARY KEY (schedule_id, time)
);

And inserting an aggregate would look like,

INSERT INTO one_hour_metrics (schedule_id, time, value)
VALUES (100, '2014-01-01', {0: 3.14, 1: 3.14, 2: 3.14});
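
Reading the aggregate back then takes a single row per timestamp; with the data inserted above, a cqlsh query would return something like,

SELECT time, value FROM one_hour_metrics WHERE schedule_id = 100;

--  time                     | value
-- --------------------------+-----------------------------
--  2014-01-01 00:00:00+0000 | {0: 3.14, 1: 3.14, 2: 3.14}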

Because the table does not use the WITH COMPACT STORAGE directive, there is an extra column of overhead per CQL row. That overhead should become largely insignificant once compression is re-enabled (see bug 1015628). 


--- Additional comment from Heiko W. Rupp on 2014-05-08 10:42:37 EDT ---

Bump the target version now that 4.11 is out.

--- Additional comment from Jay Shaughnessy on 2014-07-07 12:55:39 EDT ---


Bumping to 4.13 due to time constraints

--- Additional comment from John Sanda on 2014-08-27 22:04:29 EDT ---

The biggest problem here is that we can wind up with partial aggregates. Persisting an aggregate metric requires three separate writes. Since we do not use atomic batches, we can wind up (and have wound up) with partial aggregates, which in turn can lead to difficult bugs.

With respect to the proposed schema changes in the description, I think collections are overkill. They would be more appropriate if the set of aggregate types stored changed dynamically, but that has never been the case. Given this, I propose the following new table,

CREATE TABLE aggregate_metrics (
  schedule_id int,
  bucket text,
  time timestamp,
  avg double,
  max double,
  min double,
  PRIMARY KEY ((schedule_id, bucket), time)
);

This table will replace the current one_hour_metrics, six_hour_metrics, and twenty_four_hour_metrics tables. An upgrade task will be needed to migrate data from the old tables to the new table.
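
Assuming the bucket values simply mirror the old table names (my assumption, e.g. 'one_hour'), a single write per aggregate would then look like,

INSERT INTO aggregate_metrics (schedule_id, bucket, time, avg, max, min)
VALUES (100, 'one_hour', '2014-01-01', 2.72, 3.14, 1.41);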

--- Additional comment from John Sanda on 2014-08-29 14:34:24 EDT ---

I want to point out that the upgrade task will be part of the regular server installation/upgrade. Unlike an RDBMS, where migrating data from one table to another can be done with SQL statements alone, Cassandra requires custom code, sketched below.
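
As a rough sketch of what that custom code would issue (bucket value assumed as above), the migration amounts to reading the three typed rows per timestamp from each old table and collapsing them into one row in the new table:

-- Read all typed values for a schedule from the old table:
SELECT time, type, value FROM one_hour_metrics WHERE schedule_id = ?;

-- Application code groups the rows by timestamp (the type column
-- identifies max, min, or average), then issues one write per aggregate:
INSERT INTO aggregate_metrics (schedule_id, bucket, time, avg, max, min)
VALUES (?, 'one_hour', ?, ?, ?, ?);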

Comment 4 John Sanda 2014-09-11 15:54:59 UTC
Changes have been pushed to the release/jon3.3.x branch.

commit hashes:
3bcfc405
47209c00
d64b1aa
8574a396
ed578d7224
7c5cd65aa
f5b9a9a2
02dde462c
4ada1478
2cbae976cc
9874544778

Comment 5 John Sanda 2014-09-11 22:27:44 UTC
Here are additional commit hashes I missed before,

8af1e117

Comment 6 Simeon Pinder 2014-09-17 02:49:35 UTC
Moving to ON_QA as this is available for testing with the following brew build:
https://brewweb.devel.redhat.com//buildinfo?buildID=385149

Comment 8 Garik Khachikyan 2014-09-24 11:55:26 UTC
# VERIFIED

It seems to be working. Here is the scenario I performed:
1. install JON 3.2.0 GA (leave it running a while, at least over midnight, to collect the 24-hour metrics) and plug in an agent with a standalone EAP 6.
2. nodetool -p 7299 enablethrift (from the rhq-storage/bin/ directory)
3. change the auto-generated password for the storage cluster (the username cannot be changed :) so just write it down and use it)
4. run ./cqlsh and gather the row counts of the 3 metrics tables:
* select count(*) from rhq.one_hour_metrics LIMIT 1000000; // 170808
* select count(*) from rhq.six_hour_metrics LIMIT 1000000; // 27237
* select count(*) from rhq.twenty_four_hour_metrics LIMIT 1000000; // 9081
5. now take all servers down
6. download JON 3.3 ER03, extract it, and apply the upgrade: `rhqctl upgrade --from-server-dir /home/hudson/jon-server-3.2.0.GA`
7. start the services and wait some 3-5 minutes for all related agents, jars, etc. to update
8. look at some resources and check the charts, like "RHQ Storage Node -> Cassandra Server JVM -> Memory Subsystem -> Heap Usage" - it shows results!
9. run cqlsh again
10. select count(*) from rhq.aggregate_metrics LIMIT 1000000; // 66769

Quite promising: the old tables held 170808 + 27237 + 9081 = 207126 rows in total, which at three rows per aggregate corresponds to roughly 69042 aggregates, in line with the 66769 rows now in aggregate_metrics (the small gap is plausibly data that aged out in the meantime).
version:
===
13:55:11,628 INFO  [SystemInfoManager] (http-/0.0.0.0:7080-7) SystemInformation: ********
ACTIVE_DRIFT_PLUGIN: [drift-jpa]
AGENT_MAX_QUIET_TIME_ALLOWED: [300000]
ALERT_PURGE: [2678400000]
AS config dir: [/home/hudson/jon-server-3.3.0.ER03/jbossas/standalone/configuration]
AS product name: [EAP]
AS product version: [6.3.0.GA]
AS version: [7.4.0.Final-redhat-19]
AVAILABILITY_PURGE: [31536000000]
Agent cloud-qe-1-vm-1.idmqe.lab.eng.bos.redhat.com: [Agent[id=10011,name=cloud-qe-1-vm-1.idmqe.lab.eng.bos.redhat.com,address=10.16.96.146,port=16163,remote-endpoint=socket://10.16.96.146:16163/?rhq.communications.connector.rhqtype=agent&numAcceptThreads=1&maxPoolSize=303&clientMaxPoolSize=304&socketTimeout=60000&enableTcpNoDelay=true&backlog=200,last-availability-ping=1411559654430,last-availability-report=1411559452823]]
Agent ibm-x3550m3-11.lab.eng.brq.redhat.com: [Agent[id=10001,name=ibm-x3550m3-11.lab.eng.brq.redhat.com,address=10.34.36.137,port=16163,remote-endpoint=socket://10.34.36.137:16163/?rhq.communications.connector.rhqtype=agent&numAcceptThreads=1&maxPoolSize=303&clientMaxPoolSize=304&socketTimeout=60000&enableTcpNoDelay=true&backlog=200,last-availability-ping=1411559678527,last-availability-report=1411559705458]]
AlertCount: [0]
AlertDefinitionCount: [8]
BuildNumber: [4aefe39:44e33a4]
CAM_BASELINE_DATASET: [604800000]
CAM_BASELINE_FREQUENCY: [259200000]
CAM_BASE_URL: [http://10.34.36.137:7080/]
CAM_DATA_MAINTENANCE: [3600000]
CAM_DATA_PURGE_1D: [31536000000]
CAM_DATA_PURGE_1H: [1209600000]
CAM_DATA_PURGE_6H: [2678400000]
CAM_GUIDE_ENABLED: [true]
CAM_HELP_PASSWORD: [- non null -]
CAM_HELP_USER: [web]
CAM_JAAS_PROVIDER: [false]
CAM_LDAP_BASE_DN: [o=JBoss,c=US]
CAM_LDAP_BIND_DN: []
CAM_LDAP_BIND_PW: [- non null -]
CAM_LDAP_FILTER: []
CAM_LDAP_FOLLOW_REFERRALS: [false]
CAM_LDAP_LOGIN_PROPERTY: [cn]
CAM_LDAP_NAMING_FACTORY_INITIAL: [com.sun.jndi.ldap.LdapCtxFactory]
CAM_LDAP_NAMING_PROVIDER_URL: [ldap://localhost/]
CAM_LDAP_PROTOCOL: [false]
CAM_RT_COLLECT_IP_ADDRS: [true]
CAM_SYSLOG_ACTIONS_ENABLED: [false]
DATABASE_CONNECTION_URL: [jdbc:postgresql://127.0.0.1:5432/rhq?loginTimeout=0&socketTimeout=0&prepareThreshold=5&unknownLength=2147483647&loglevel=0&tcpkeepalive=false&binaryTransfer=true]
DATABASE_DRIVER_NAME: [PostgreSQL Native Driver]
DATABASE_DRIVER_VERSION: [PostgreSQL 9.2 JDBC4 (build 1002)]
DATABASE_PRODUCT_NAME: [PostgreSQL]
DATABASE_PRODUCT_VERSION: [8.4.18]
DATA_REINDEX_NIGHTLY: [false]
DB_SCHEMA_VERSION: [2.160]
DRIFT_FILE_PURGE: [2678400000]
ENABLE_AGENT_AUTO_UPDATE: [true]
ENABLE_LOGIN_WITHOUT_ROLES: [false]
EVENT_PURGE: [1209600000]
FullName: [JBoss Operations Network]
Name: [JBoss ON]
OPERATION_HISTORY_PURGE: [0]
PlatformCount: [2]
RESOURCE_GENERIC_PROPERTIES_UPGRADE: [false]
RHQ_SESSION_TIMEOUT: [3600000]
RT_DATA_PURGE: [2678400000]
SERVER_HOME_DIR: [/home/hudson/jon-server-3.3.0.ER03/jbossas/standalone]
SERVER_IDENTITY: [ibm-x3550m3-11.lab.eng.brq.redhat.com]
SERVER_INSTALL_DIR: [/home/hudson/jon-server-3.3.0.ER03]
SERVER_LOCAL_TIME: [September 24, 2014 1:55:11 PM CEST]
SERVER_TIMEZONE: [Central European Time]
SERVER_VERSION: [4.9.0.JON320GA]
SchedulesPerMinute: [315]
ServerCount: [10]
ServiceCount: [1480]
Storage_Node ibm-x3550m3-11.lab.eng.brq.redhat.com: [storageNode.addresss=ibm-x3550m3-11.lab.eng.brq.redhat.com, hostname=ibm-x3550m3-11.lab.eng.brq.redhat.com, beginTime=1411530911580, beginTime=1411530911580, unackAlerts=0, heapUsed=null, heapPercentageUsed=Min: 0.07378296142944886, Max: 0.7839118296585503, Avg: 0.37766811044203025 (%), load=null, dataUsedPercentage=null, dataDiskUsed=null, tokens=null, actuallyOwns=null]
TRAIT_PURGE: [31536000000]
Version: [3.3.0.ER03]
********
===

