+++ This bug was initially created as a clone of Bug #1049054 +++ Description of problem: This change has the potential to be a big win as it will reduce the number of writes for storing aggregate metrics by a factor of three. The C* schema for aggregate metrics looks like, CREATE TABLE one_hour_metrics ( schedule_id int, time timestamp, type int, value double, PRIMARY KEY (schedule_id, time, type) ) WITH COMPACT STORAGE; where the type column identifies the value as one of max, min, or average. Storing an aggregate metric requires three separate write requests. Batch statements could be used to reduce the number of network roundtrips; however, I found that the performance of prepared statements to be better than that of unprepared batches. Inserting an aggregate would look like, insert into one_hour_metrics (schedule_id, time, type, value) VALUES (100, '2014-01-01', 0, 3.14); insert into one_hour_metrics (schedule_id, time, type, value) VALUES (100, '2014-01-01', 1, 3.14); insert into one_hour_metrics (schedule_id, time, type, value) VALUES (100, '2014-01-01', 2, 3.14); Using CQL collections we can reduce the number of writes for an aggregate metric to one while the on-disk storage remains nearly identical. The schema would look like, CREATE TABLE one_hour_metrics ( schedule_id int, time timestamp, value map<int, double>, PRIMARY KEY (schedule_id, time, type) ); And inserting an aggregate would look like, insert into one_hour_metrics (schedule_id, time, value) VALUES (100, '2014-01-01', {0: 3,14, 1: 3.14, 2: 314}); Because the table does not use the WITH COMPACT STORAGE directive, there is an extra column of overhead per CQL row. That overhead should become largely insignificant once compression is re-enabled (see bug 1015628). Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: --- Additional comment from Heiko W. Rupp on 2014-05-08 10:42:37 EDT --- Bump the target version now that 4.11 is out. --- Additional comment from Jay Shaughnessy on 2014-07-07 12:55:39 EDT --- Bumping to 4.13 due to time constraints --- Additional comment from John Sanda on 2014-08-27 22:04:29 EDT --- The biggest problem here is that we can wind up with partial aggregates. Persisting an aggregate metric requires three separate writes. Since we do not use atomic batches, we can (and have) wound up with partial aggregates which in turn can lead to difficult bugs. With respect to the proposed schema changes in the description, I think collections are over kill. I think they would be more appropriate if the types of aggregates stored changed dynamically, but that has never been the case. Given this I propose the following new table, CREATE TABLE aggregate_metrics ( schedule_id int, bucket text, time timestamp, avg double, max double, min double, PRIMARY KEY ((schedule_id, bucket), time) ); This table will replace the current one_hour_metrics, six_hour_metrics, and twenty_four_hour_metrics tables. An upgrade task will be needed to migrate data from the old tables to the new table. --- Additional comment from John Sanda on 2014-08-29 14:34:24 EDT --- I want to point out that the upgrade task will be part of the regular server installation/upgrade. Unlike the RDBMS where migrating data from one table to another can be done with SQL statements, Cassandra requires custom code.
Changes have been pushed to the release/jon3.3.x branch. commit hashes: 3bcfc405 47209c00 d64b1aa 8574a396 ed578d7224 7c5cd65aa f5b9a9a2 02dde462c 4ada1478 2cbae976cc 9874544778
Here are additional commit hashes I missed before, 8af1e117
Moving to ON_QA as available for test with the following brew build: https://brewweb.devel.redhat.com//buildinfo?buildID=385149
# VERIFIED Seems to me it is working. Following scenario I did performed: 1. install JON 3.2.0 GA (leave it a bit running to collect the 24 hours metrics report over midnight at least) + plug an agent with Standalone EAP6 there. 2. nodetool -p 7299 enablethrift (from rhq-storage/bin/ directory) 3. change the auto-generated password for cluster storage (and the username which cannot be changed :) just write it down and use) 4. ./cqlsh there and gather count of those 3 metrics tables: * select count(*) from rhq.one_hour_metrics LIMIT 1000000; // 170808 * select count(*) from rhq.six_hour_metrics LIMIT 1000000; // 27237 * select count(*) from rhq.twenty_four_hour_metrics LIMIT 1000000; // 9081 5. now take servers all down 6. download JON 3.3 ER03, extract and apply the upgrade: `rhqctl upgrade --from-server-dir /home/hudson/jon-server-3.2.0.GA` 7. start services, wait for some 3-5 min to update all related agents, jars etc. 8. refer to some resources and check the charts,like "RHQ Storage Node -> Cassandra Server JVM -> Memory Subsystem -> Heap Usage" - it shows results! 9. cqlsh again 10. select count(*) from rhq.aggregate_metrics LIMIT 1000000; // 66769 quite promising. version: === 13:55:11,628 INFO [SystemInfoManager] (http-/0.0.0.0:7080-7) SystemInformation: ******** ACTIVE_DRIFT_PLUGIN: [drift-jpa] AGENT_MAX_QUIET_TIME_ALLOWED: [300000] ALERT_PURGE: [2678400000] AS config dir: [/home/hudson/jon-server-3.3.0.ER03/jbossas/standalone/configuration] AS product name: [EAP] AS product version: [6.3.0.GA] AS version: [7.4.0.Final-redhat-19] AVAILABILITY_PURGE: [31536000000] Agent cloud-qe-1-vm-1.idmqe.lab.eng.bos.redhat.com: [Agent[id=10011,name=cloud-qe-1-vm-1.idmqe.lab.eng.bos.redhat.com,address=10.16.96.146,port=16163,remote-endpoint=socket://10.16.96.146:16163/?rhq.communications.connector.rhqtype=agent&numAcceptThreads=1&maxPoolSize=303&clientMaxPoolSize=304&socketTimeout=60000&enableTcpNoDelay=true&backlog=200,last-availability-ping=1411559654430,last-availability-report=1411559452823]] Agent ibm-x3550m3-11.lab.eng.brq.redhat.com: [Agent[id=10001,name=ibm-x3550m3-11.lab.eng.brq.redhat.com,address=10.34.36.137,port=16163,remote-endpoint=socket://10.34.36.137:16163/?rhq.communications.connector.rhqtype=agent&numAcceptThreads=1&maxPoolSize=303&clientMaxPoolSize=304&socketTimeout=60000&enableTcpNoDelay=true&backlog=200,last-availability-ping=1411559678527,last-availability-report=1411559705458]] AlertCount: [0] AlertDefinitionCount: [8] BuildNumber: [4aefe39:44e33a4] CAM_BASELINE_DATASET: [604800000] CAM_BASELINE_FREQUENCY: [259200000] CAM_BASE_URL: [http://10.34.36.137:7080/] CAM_DATA_MAINTENANCE: [3600000] CAM_DATA_PURGE_1D: [31536000000] CAM_DATA_PURGE_1H: [1209600000] CAM_DATA_PURGE_6H: [2678400000] CAM_GUIDE_ENABLED: [true] CAM_HELP_PASSWORD: [- non null -] CAM_HELP_USER: [web] CAM_JAAS_PROVIDER: [false] CAM_LDAP_BASE_DN: [o=JBoss,c=US] CAM_LDAP_BIND_DN: [] CAM_LDAP_BIND_PW: [- non null -] CAM_LDAP_FILTER: [] CAM_LDAP_FOLLOW_REFERRALS: [false] CAM_LDAP_LOGIN_PROPERTY: [cn] CAM_LDAP_NAMING_FACTORY_INITIAL: [com.sun.jndi.ldap.LdapCtxFactory] CAM_LDAP_NAMING_PROVIDER_URL: [ldap://localhost/] CAM_LDAP_PROTOCOL: [false] CAM_RT_COLLECT_IP_ADDRS: [true] CAM_SYSLOG_ACTIONS_ENABLED: [false] DATABASE_CONNECTION_URL: [jdbc:postgresql://127.0.0.1:5432/rhq?loginTimeout=0&socketTimeout=0&prepareThreshold=5&unknownLength=2147483647&loglevel=0&tcpkeepalive=false&binaryTransfer=true] DATABASE_DRIVER_NAME: [PostgreSQL Native Driver] DATABASE_DRIVER_VERSION: [PostgreSQL 9.2 JDBC4 (build 1002)] DATABASE_PRODUCT_NAME: [PostgreSQL] DATABASE_PRODUCT_VERSION: [8.4.18] DATA_REINDEX_NIGHTLY: [false] DB_SCHEMA_VERSION: [2.160] DRIFT_FILE_PURGE: [2678400000] ENABLE_AGENT_AUTO_UPDATE: [true] ENABLE_LOGIN_WITHOUT_ROLES: [false] EVENT_PURGE: [1209600000] FullName: [JBoss Operations Network] Name: [JBoss ON] OPERATION_HISTORY_PURGE: [0] PlatformCount: [2] RESOURCE_GENERIC_PROPERTIES_UPGRADE: [false] RHQ_SESSION_TIMEOUT: [3600000] RT_DATA_PURGE: [2678400000] SERVER_HOME_DIR: [/home/hudson/jon-server-3.3.0.ER03/jbossas/standalone] SERVER_IDENTITY: [ibm-x3550m3-11.lab.eng.brq.redhat.com] SERVER_INSTALL_DIR: [/home/hudson/jon-server-3.3.0.ER03] SERVER_LOCAL_TIME: [September 24, 2014 1:55:11 PM CEST] SERVER_TIMEZONE: [Central European Time] SERVER_VERSION: [4.9.0.JON320GA] SchedulesPerMinute: [315] ServerCount: [10] ServiceCount: [1480] Storage_Node ibm-x3550m3-11.lab.eng.brq.redhat.com: [storageNode.addresss=ibm-x3550m3-11.lab.eng.brq.redhat.com, hostname=ibm-x3550m3-11.lab.eng.brq.redhat.com, beginTime=1411530911580, beginTime=1411530911580, unackAlerts=0, heapUsed=null, heapPercentageUsed=Min: 0.07378296142944886, Max: 0.7839118296585503, Avg: 0.37766811044203025 (%), load=null, dataUsedPercentage=null, dataDiskUsed=null, tokens=null, actuallyOwns=null] TRAIT_PURGE: [31536000000] Version: [3.3.0.ER03] ******** ===