Bug 1135602

Summary: Store aggregate metrics in a single (CQL) row
Product: [JBoss] JBoss Operations Network
Reporter: John Sanda <jsanda>
Component: Core Server, Performance, Storage Node
Assignee: John Sanda <jsanda>
Status: CLOSED CURRENTRELEASE
QA Contact: Garik Khachikyan <gkhachik>
Severity: medium
Docs Contact:
Priority: unspecified
Version: JON 3.2
CC: bkramer, gkhachik, hrupp, jshaughn, mfoley, mkoci
Target Milestone: ER03
Target Release: JON 3.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1049054
Environment:
Last Closed: 2014-12-11 14:02:42 UTC
Type: Bug
Bug Depends On: 1049054    
Bug Blocks: 1133605, 1133609    

Description John Sanda 2014-08-29 18:36:03 UTC
+++ This bug was initially created as a clone of Bug #1049054 +++

Description of problem:
This change has the potential to be a big win, as it will reduce the number of writes for storing aggregate metrics by a factor of three. The C* schema for aggregate metrics looks like,

CREATE TABLE one_hour_metrics (
    schedule_id int,
    time timestamp,
    type int,
    value double,
    PRIMARY KEY (schedule_id, time, type)
) WITH COMPACT STORAGE;

where the type column identifies the value as one of max, min, or average. Storing an aggregate metric therefore requires three separate write requests. Batch statements could be used to reduce the number of network round trips; however, I found the performance of prepared statements to be better than that of unprepared batches. Inserting an aggregate would look like,

INSERT INTO one_hour_metrics (schedule_id, time, type, value) VALUES
(100, '2014-01-01', 0, 3.14);
INSERT INTO one_hour_metrics (schedule_id, time, type, value) VALUES
(100, '2014-01-01', 1, 3.14);
INSERT INTO one_hour_metrics (schedule_id, time, type, value) VALUES
(100, '2014-01-01', 2, 3.14);
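
For comparison, the batch alternative mentioned above would look something like this (a sketch; BEGIN BATCH is the logged, atomic form, while BEGIN UNLOGGED BATCH trades that atomicity for lower coordination overhead):

BEGIN BATCH
  INSERT INTO one_hour_metrics (schedule_id, time, type, value) VALUES (100, '2014-01-01', 0, 3.14);
  INSERT INTO one_hour_metrics (schedule_id, time, type, value) VALUES (100, '2014-01-01', 1, 3.14);
  INSERT INTO one_hour_metrics (schedule_id, time, type, value) VALUES (100, '2014-01-01', 2, 3.14);
APPLY BATCH;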

Using CQL collections we can reduce the number of writes for an aggregate metric to one while the on-disk storage remains nearly identical. The schema would look like,

CREATE TABLE one_hour_metrics (
    schedule_id int,
    time timestamp,
    value map<int, double>,
    PRIMARY KEY (schedule_id, time)
);

And inserting an aggregate would look like,

INSERT INTO one_hour_metrics (schedule_id, time, value) VALUES
(100, '2014-01-01', {0: 3.14, 1: 3.14, 2: 3.14});

Because the table does not use the WITH COMPACT STORAGE directive, there is an extra column of overhead per CQL row. That overhead should become largely insignificant once compression is re-enabled (see bug 1015628). 

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from Heiko W. Rupp on 2014-05-08 10:42:37 EDT ---

Bump the target version now that 4.11 is out.

--- Additional comment from Jay Shaughnessy on 2014-07-07 12:55:39 EDT ---


Bumping to 4.13 due to time constraints

--- Additional comment from John Sanda on 2014-08-27 22:04:29 EDT ---

The biggest problem here is that we can wind up with partial aggregates. Persisting an aggregate metric requires three separate writes. Since we do not use atomic batches, we can wind up (and have wound up) with partial aggregates, which in turn can lead to difficult bugs.

With respect to the proposed schema changes in the description, I think collections are overkill. They would be more appropriate if the set of aggregate types stored changed dynamically, but that has never been the case. Given this, I propose the following new table,

CREATE TABLE aggregate_metrics (
  schedule_id int,
  bucket text,
  time timestamp,
  avg double,
  max double,
  min double,
  PRIMARY KEY ((schedule_id, bucket), time)
);
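
Persisting an aggregate then becomes a single write; for example (the 'one_hour' bucket value is illustrative):

INSERT INTO aggregate_metrics (schedule_id, bucket, time, avg, max, min)
VALUES (100, 'one_hour', '2014-01-01', 3.14, 3.14, 3.14);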

This table will replace the current one_hour_metrics, six_hour_metrics, and twenty_four_hour_metrics tables. An upgrade task will be needed to migrate data from the old tables to the new table.

--- Additional comment from John Sanda on 2014-08-29 14:34:24 EDT ---

I want to point out that the upgrade task will be part of the regular server installation/upgrade. Unlike an RDBMS, where migrating data from one table to another can be done with SQL statements, Cassandra requires custom code.
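
To sketch the shape of that custom code (values illustrative, not the actual upgrade task): it reads the three type rows per (schedule_id, time) out of each old table, folds them into a single aggregate, and writes one row to the new table,

SELECT schedule_id, time, type, value FROM one_hour_metrics;
-- application code folds the three type rows (max, min, avg) per (schedule_id, time), then:
INSERT INTO aggregate_metrics (schedule_id, bucket, time, avg, max, min)
VALUES (100, 'one_hour', '2014-01-01', 3.14, 3.14, 3.14);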

Comment 4 John Sanda 2014-09-11 15:54:59 UTC
Changes have been pushed to the release/jon3.3.x branch.

commit hashes:
3bcfc405
47209c00
d64b1aa
8574a396
ed578d7224
7c5cd65aa
f5b9a9a2
02dde462c
4ada1478
2cbae976cc
9874544778

Comment 5 John Sanda 2014-09-11 22:27:44 UTC
Here is an additional commit hash I missed before:

8af1e117

Comment 6 Simeon Pinder 2014-09-17 02:49:35 UTC
Moving to ON_QA as available for test with the following brew build:
https://brewweb.devel.redhat.com//buildinfo?buildID=385149

Comment 8 Garik Khachikyan 2014-09-24 11:55:26 UTC
# VERIFIED

It seems to be working. I performed the following scenario:
1. install JON 3.2.0 GA (leave it running for a while, at least over midnight, so the 24-hour metrics are collected) + plug in an agent with a standalone EAP6 there.
2. nodetool -p 7299 enablethrift (from rhq-storage/bin/ directory)
3. change the auto-generated password for the storage cluster (the username cannot be changed :) so just write it down and use it)
4. ./cqlsh there and gather the counts of those 3 metrics tables:
* select count(*) from rhq.one_hour_metrics LIMIT 1000000; // 170808
* select count(*) from rhq.six_hour_metrics LIMIT 1000000; // 27237
* select count(*) from rhq.twenty_four_hour_metrics LIMIT 1000000; // 9081
5. now take all servers down
6. download JON 3.3 ER03, extract it, and apply the upgrade: `rhqctl upgrade --from-server-dir /home/hudson/jon-server-3.2.0.GA`
7. start the services and wait some 3-5 minutes for all related agents, jars, etc. to update.
8. go to some resources and check the charts, like "RHQ Storage Node -> Cassandra Server JVM -> Memory Subsystem -> Heap Usage" - it shows results!
9. cqlsh again 
10. select count(*) from rhq.aggregate_metrics LIMIT 1000000; // 66769

quite promising.
version:
===
13:55:11,628 INFO  [SystemInfoManager] (http-/0.0.0.0:7080-7) SystemInformation: ********
ACTIVE_DRIFT_PLUGIN: [drift-jpa]
AGENT_MAX_QUIET_TIME_ALLOWED: [300000]
ALERT_PURGE: [2678400000]
AS config dir: [/home/hudson/jon-server-3.3.0.ER03/jbossas/standalone/configuration]
AS product name: [EAP]
AS product version: [6.3.0.GA]
AS version: [7.4.0.Final-redhat-19]
AVAILABILITY_PURGE: [31536000000]
Agent cloud-qe-1-vm-1.idmqe.lab.eng.bos.redhat.com: [Agent[id=10011,name=cloud-qe-1-vm-1.idmqe.lab.eng.bos.redhat.com,address=10.16.96.146,port=16163,remote-endpoint=socket://10.16.96.146:16163/?rhq.communications.connector.rhqtype=agent&numAcceptThreads=1&maxPoolSize=303&clientMaxPoolSize=304&socketTimeout=60000&enableTcpNoDelay=true&backlog=200,last-availability-ping=1411559654430,last-availability-report=1411559452823]]
Agent ibm-x3550m3-11.lab.eng.brq.redhat.com: [Agent[id=10001,name=ibm-x3550m3-11.lab.eng.brq.redhat.com,address=10.34.36.137,port=16163,remote-endpoint=socket://10.34.36.137:16163/?rhq.communications.connector.rhqtype=agent&numAcceptThreads=1&maxPoolSize=303&clientMaxPoolSize=304&socketTimeout=60000&enableTcpNoDelay=true&backlog=200,last-availability-ping=1411559678527,last-availability-report=1411559705458]]
AlertCount: [0]
AlertDefinitionCount: [8]
BuildNumber: [4aefe39:44e33a4]
CAM_BASELINE_DATASET: [604800000]
CAM_BASELINE_FREQUENCY: [259200000]
CAM_BASE_URL: [http://10.34.36.137:7080/]
CAM_DATA_MAINTENANCE: [3600000]
CAM_DATA_PURGE_1D: [31536000000]
CAM_DATA_PURGE_1H: [1209600000]
CAM_DATA_PURGE_6H: [2678400000]
CAM_GUIDE_ENABLED: [true]
CAM_HELP_PASSWORD: [- non null -]
CAM_HELP_USER: [web]
CAM_JAAS_PROVIDER: [false]
CAM_LDAP_BASE_DN: [o=JBoss,c=US]
CAM_LDAP_BIND_DN: []
CAM_LDAP_BIND_PW: [- non null -]
CAM_LDAP_FILTER: []
CAM_LDAP_FOLLOW_REFERRALS: [false]
CAM_LDAP_LOGIN_PROPERTY: [cn]
CAM_LDAP_NAMING_FACTORY_INITIAL: [com.sun.jndi.ldap.LdapCtxFactory]
CAM_LDAP_NAMING_PROVIDER_URL: [ldap://localhost/]
CAM_LDAP_PROTOCOL: [false]
CAM_RT_COLLECT_IP_ADDRS: [true]
CAM_SYSLOG_ACTIONS_ENABLED: [false]
DATABASE_CONNECTION_URL: [jdbc:postgresql://127.0.0.1:5432/rhq?loginTimeout=0&socketTimeout=0&prepareThreshold=5&unknownLength=2147483647&loglevel=0&tcpkeepalive=false&binaryTransfer=true]
DATABASE_DRIVER_NAME: [PostgreSQL Native Driver]
DATABASE_DRIVER_VERSION: [PostgreSQL 9.2 JDBC4 (build 1002)]
DATABASE_PRODUCT_NAME: [PostgreSQL]
DATABASE_PRODUCT_VERSION: [8.4.18]
DATA_REINDEX_NIGHTLY: [false]
DB_SCHEMA_VERSION: [2.160]
DRIFT_FILE_PURGE: [2678400000]
ENABLE_AGENT_AUTO_UPDATE: [true]
ENABLE_LOGIN_WITHOUT_ROLES: [false]
EVENT_PURGE: [1209600000]
FullName: [JBoss Operations Network]
Name: [JBoss ON]
OPERATION_HISTORY_PURGE: [0]
PlatformCount: [2]
RESOURCE_GENERIC_PROPERTIES_UPGRADE: [false]
RHQ_SESSION_TIMEOUT: [3600000]
RT_DATA_PURGE: [2678400000]
SERVER_HOME_DIR: [/home/hudson/jon-server-3.3.0.ER03/jbossas/standalone]
SERVER_IDENTITY: [ibm-x3550m3-11.lab.eng.brq.redhat.com]
SERVER_INSTALL_DIR: [/home/hudson/jon-server-3.3.0.ER03]
SERVER_LOCAL_TIME: [September 24, 2014 1:55:11 PM CEST]
SERVER_TIMEZONE: [Central European Time]
SERVER_VERSION: [4.9.0.JON320GA]
SchedulesPerMinute: [315]
ServerCount: [10]
ServiceCount: [1480]
Storage_Node ibm-x3550m3-11.lab.eng.brq.redhat.com: [storageNode.addresss=ibm-x3550m3-11.lab.eng.brq.redhat.com, hostname=ibm-x3550m3-11.lab.eng.brq.redhat.com, beginTime=1411530911580, beginTime=1411530911580, unackAlerts=0, heapUsed=null, heapPercentageUsed=Min: 0.07378296142944886, Max: 0.7839118296585503, Avg: 0.37766811044203025 (%), load=null, dataUsedPercentage=null, dataDiskUsed=null, tokens=null, actuallyOwns=null]
TRAIT_PURGE: [31536000000]
Version: [3.3.0.ER03]
********
===