Description of problem:
In OCP 3.7, we introduced "temp" tables for raw data in Hawkular Metrics. A different table is used for raw data for every two-hour block. After the two-hour block has passed, the table is never written to again. The compression job that runs in the Hawkular Metrics server subsequently fetches the raw data from the table, compresses it, writes the compressed data to the data_compressed table, and finally drops the temp table. Handling raw data this way allows space on disk to be reclaimed more quickly and reduces compaction, which in turn reduces I/O and CPU usage.

Due to a bug in the job scheduling code in Hawkular Metrics, temp tables were not getting dropped. Hawkular Metrics is supposed to maintain a day's worth of tables as a buffer to ensure that there is always a table to write to. At any given time there should be 13 or 14 temp tables. I observed some clusters that had as many as 300 temp tables. The large number of tables can result in performance problems. A table has about 1 MB of JVM heap space overhead. For Cassandra pods with smaller heap sizes in particular, this could contribute to excessive GC, which would result in high CPU usage and requests timing out.

Version-Release number of selected component (if applicable):

How reproducible:
We would end up with orphaned (in the sense that they won't ever get deleted) temp tables any time the compression job fails to complete successfully.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
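As a quick way to spot this on a running cluster (a sketch, not part of the original report; the pod name is a placeholder and cqlsh access to the Cassandra pod is assumed), the temp tables can be counted from the schema. A healthy cluster should report roughly 12-14:

$ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl \
    -e "select table_name from system_schema.tables where keyspace_name = 'hawkular_metrics'" \
    | grep -c data_temp

Counts well above that range (e.g. the 300 tables mentioned above) indicate the compression job is not dropping tables.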
@John,
What is the checkpoint of this defect? Is the checkpoint "At any given time there should be 13 or 14 temp tables"? But I see 15 temp tables in my env:

$ ls -alR /cassandra_data/ | grep -i temp
drwxr-xr-x. 3 cassandra cassandra 21 May 22 03:00 data_temp_2018052200-46457d905d5611e881192b75a818259a
drwxr-xr-x. 4 cassandra cassandra 38 May 22 05:00 data_temp_2018052202-4617dde05d5611e881192b75a818259a
drwxr-xr-x. 4 cassandra cassandra 258 May 22 05:00 data_temp_2018052204-4672f6305d5611e881192b75a818259a
drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052206-8fa29e505d5611e881192b75a818259a
drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052208-8e627d305d5611e881192b75a818259a
drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052210-8e901ce05d5611e881192b75a818259a
drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052212-8f194a105d5611e881192b75a818259a
drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052214-8eeb83505d5611e881192b75a818259a
drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052216-8fd065105d5611e881192b75a818259a
drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052218-8f74d7905d5611e881192b75a818259a
drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052220-8ebdbc905d5611e881192b75a818259a
drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052222-8f4737e05d5611e881192b75a818259a
drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052300-51e88e105d6711e881192b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 22 04:24 data_temp_2018052302-f6840a205d7711e88adb0b295fcd6148
drwxr-xr-x. 3 cassandra cassandra 21 May 22 06:24 data_temp_2018052304-ba086d505d8811e88adb0b295fcd6148
There is a job, TempTableCreator.java, which runs every two hours and creates tables for the next 24 hours relative to the job start time. While it is possible to wind up with more, in general I think we should normally see 12 of these tables.

Can you please provide logs for hawkular-metrics and the output of the following:

$ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "select * from hawkular_metrics.scheduled_jobs_idx"

$ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "select * from hawkular_metrics.finished_jobs_idx"
(In reply to John Sanda from comment #3)
> There is a job, TempTableCreator.java, which runs every two hours and
> creates tables for the next 24 hours relative to the job start time. While
> it is possible to wind up with more, in general I think we should normally
> see 12 of these tables.

There are 12 temp tables now in my new env; not sure why I got 15 tables yesterday.

# oc get po
NAME                            READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-jnv86      1/1       Running     0          1h
hawkular-metrics-d5q7x          1/1       Running     0          1h
hawkular-metrics-schema-dlzch   0/1       Completed   0          1h
heapster-gkmsj                  1/1       Running     0          1h

# oc -n openshift-infra exec hawkular-cassandra-1-jnv86 -- ls -alR /cassandra_data/ | grep -i temp
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:18 data_temp_2018052300-357621f05e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:18 data_temp_2018052302-35485b305e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:18 data_temp_2018052304-35a3c1a05e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052306-7ebef7605e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052308-7dac75f05e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052310-7e954f505e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052312-7e35a3205e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052314-7f1a84e05e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052316-7eecbe205e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052318-7dda15a05e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052320-7e0803705e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052322-7e6390f05e2711e8a6a22b75a818259a

> Can you please provide logs for hawkular-metrics and the output of the
> following:
>
> $ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "select *
> from hawkular_metrics.scheduled_jobs_idx"

# oc -n openshift-infra exec hawkular-cassandra-1-jnv86 -- cqlsh --ssl -e "select * from hawkular_metrics.scheduled_jobs_idx"

 time_slice               | job_id                               | job_name             | job_params | job_type             | status | trigger
--------------------------+--------------------------------------+----------------------+------------+----------------------+--------+--------------------------------------------------------------------------------------------------------------------
 2018-05-23 03:00:00+0000 | 4814c669-40fc-413d-b42b-17a8beedc6df | TEMP_DATA_COMPRESSOR | {}         | TEMP_DATA_COMPRESSOR |   null | {type: 1, trigger_time: 1527044400000, delay: 60000, interval: 7200000, repeat_count: null, execution_count: null}
 2018-05-23 03:20:00+0000 | 26a09d0e-f6de-4984-8087-66cc477d3e17 | TEMP_TABLE_CREATOR   | {}         | TEMP_TABLE_CREATOR   |   null | {type: 1, trigger_time: 1527045600000, delay: 60000, interval: 7200000, repeat_count: null, execution_count: null}

(2 rows)

> $ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "select *
> from hawkular_metrics.finished_jobs_idx"

# oc -n openshift-infra exec hawkular-cassandra-1-jnv86 -- cqlsh --ssl -e "select * from hawkular_metrics.finished_jobs_idx"

 time_slice | job_id
------------+--------

(0 rows)
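For reference when reading the trigger column above (assuming GNU date is available on the workstation): interval 7200000 ms is the two-hour period, delay 60000 ms is one minute, and trigger_time is the next fire time in epoch milliseconds, e.g.:

# 1527044400000 ms -> 1527044400 s
$ date -u -d @1527044400
Wed May 23 03:00:00 UTC 2018

which matches the 2018-05-23 03:00:00+0000 time_slice of the TEMP_DATA_COMPRESSOR job.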
I think the checkpoint is that the number of temp tables is maintained at 12~14 and there should not be too many temp tables. After running for 2 hours, the number of temp tables is 13, not 12.

# oc -n openshift-infra exec hawkular-cassandra-1-jnv86 -- ls -alR /cassandra_data/ | grep -i temp
drwxr-xr-x. 3 cassandra cassandra 21 May 23 03:00 data_temp_2018052300-357621f05e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 241 May 23 03:00 data_temp_2018052302-35485b305e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:18 data_temp_2018052304-35a3c1a05e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052306-7ebef7605e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052308-7dac75f05e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052310-7e954f505e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052312-7e35a3205e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052314-7f1a84e05e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052316-7eecbe205e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052318-7dda15a05e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052320-7e0803705e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052322-7e6390f05e2711e8a6a22b75a818259a
drwxr-xr-x. 3 cassandra cassandra 21 May 23 03:20 data_temp_2018052400-413238b05e3811e8a6a22b75a818259a
Created attachment 1440413 [details]
metrics pods log
I failed to mention that the dates/times in the table names are reported in UTC. The temp tables listed in comment 5 look correct. I should also note that the oldest tables get dropped first, and not until the raw data in those tables has been compressed. If the data in a table does not get compressed for whatever reason, then by design it should not get dropped.

The log files also revealed a new 3.10 regression:

ERROR [org.hawkular.metrics.api.jaxrs.util.ApiUtils] (RxComputationScheduler-3) HAWKMETRICS200010: Failed to process request: java.lang.NullPointerException
	at org.hawkular.metrics.core.service.tags.SimpleTagQueryParser.lambda$tagValueFilter$40(SimpleTagQueryParser.java:557)
	...

I will create a new ticket for the NPE.
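A rough way to check that compression is keeping up (a sketch, assuming GNU date): print the current hour in the same UTC YYYYMMDDHH form used in the table names and compare it against the oldest data_temp_* suffix. If the oldest suffix keeps falling more than about a day behind the current block, the compression job is not completing:

# current UTC hour, in the same format as the data_temp table-name suffix
$ date -u +%Y%m%d%H
# e.g. 2018052306, to compare against the oldest data_temp_YYYYMMDDHH table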
(In reply to John Sanda from comment #7)
> The log files also revealed a new 3.10 regression:
>
> ERROR [org.hawkular.metrics.api.jaxrs.util.ApiUtils]
> (RxComputationScheduler-3) HAWKMETRICS200010: Failed to process request:
> java.lang.NullPointerException
> at
> org.hawkular.metrics.core.service.tags.SimpleTagQueryParser.
> lambda$tagValueFilter$40(SimpleTagQueryParser.java:557)
> ...
>
> I will create a new ticket for the NPE.

It should be the same defect as https://bugzilla.redhat.com/show_bug.cgi?id=1581060: endpoint POST /hawkular/metrics/m/stats/query fails with HTTP code 500.
(In reply to Junqi Zhao from comment #5)
> I think the checkpoint is that the number of temp tables is maintained at
> 12~14 and there should not be too many temp tables. After running for 2
> hours, the number of temp tables is 13, not 12.
>
> # oc -n openshift-infra exec hawkular-cassandra-1-jnv86 -- ls -alR /cassandra_data/ | grep -i temp
> drwxr-xr-x. 3 cassandra cassandra 21 May 23 03:00 data_temp_2018052300-357621f05e2711e8a6a22b75a818259a
> drwxr-xr-x. 3 cassandra cassandra 241 May 23 03:00 data_temp_2018052302-35485b305e2711e8a6a22b75a818259a
> drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:18 data_temp_2018052304-35a3c1a05e2711e8a6a22b75a818259a
> drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052306-7ebef7605e2711e8a6a22b75a818259a
> drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052308-7dac75f05e2711e8a6a22b75a818259a
> drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052310-7e954f505e2711e8a6a22b75a818259a
> drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052312-7e35a3205e2711e8a6a22b75a818259a
> drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052314-7f1a84e05e2711e8a6a22b75a818259a
> drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052316-7eecbe205e2711e8a6a22b75a818259a
> drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052318-7dda15a05e2711e8a6a22b75a818259a
> drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052320-7e0803705e2711e8a6a22b75a818259a
> drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052322-7e6390f05e2711e8a6a22b75a818259a
> drwxr-xr-x. 3 cassandra cassandra 21 May 23 03:20 data_temp_2018052400-413238b05e3811e8a6a22b75a818259a

Yes, that is correct. Everything looks good here.
The number of temp tables is growing after running metrics for one day; it is not between 12-14, it is 23 now, and the size of /cassandra_data/data/hawkular_metrics is 9.4M. I think we should check the data file size, not just the number of data files.

Attached metrics logs; there is one exception in the hawkular-metrics pod log:
WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)

# oc -n openshift-infra exec hawkular-cassandra-1-h4ckm -- ls -alR /cassandra_data/ | grep -i temp
drwxr-xr-x. 3 cassandra cassandra 21 May 28 05:00 data_temp_2018052802-43137220622511e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 07:00 data_temp_2018052804-42e5d270622511e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 09:00 data_temp_2018052806-434138e0622511e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 11:00 data_temp_2018052808-8b8bd830622511e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 13:00 data_temp_2018052810-8c70b9f0622511e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 15:00 data_temp_2018052812-8c152c70622511e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 17:00 data_temp_2018052814-8b5e1170622511e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 19:00 data_temp_2018052816-8b3098d0622511e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 21:00 data_temp_2018052818-8bb99ef0622511e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 23:00 data_temp_2018052820-8c9ea7c0622511e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 29 01:00 data_temp_2018052822-8c42f330622511e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 241 May 29 01:00 data_temp_2018052900-8be765b0622511e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 05:16 data_temp_2018052902-4ebaef70623611e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 07:16 data_temp_2018052904-124d3550624711e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 09:16 data_temp_2018052906-d5e01770625711e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 11:16 data_temp_2018052908-99743210626811e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 13:16 data_temp_2018052910-5d0650e0627911e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 15:16 data_temp_2018052912-20990bf0628a11e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 17:16 data_temp_2018052914-e42d4da0629a11e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 19:16 data_temp_2018052916-a7c5fc2062ab11e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 21:16 data_temp_2018052918-6b5a16c062bc11e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 28 23:16 data_temp_2018052920-2ee6422062cd11e8a2c26330552d2565
drwxr-xr-x. 3 cassandra cassandra 21 May 29 01:16 data_temp_2018052922-f2794b5062dd11e8a2c26330552d2565

# oc -n openshift-infra exec hawkular-cassandra-1-h4ckm -- du -sh /cassandra_data/data/hawkular_metrics
9.4M	/cassandra_data/data/hawkular_metrics
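A compact way to track both numbers over time (a sketch reusing the commands already shown in this report; the pod name and the assumption that the temp table directories live under the keyspace directory shown by the du command above will vary per environment):

# number of temp table directories in the keyspace data directory
$ oc -n openshift-infra exec hawkular-cassandra-1-h4ckm -- ls /cassandra_data/data/hawkular_metrics | grep -c data_temp

# total on-disk size of the keyspace
$ oc -n openshift-infra exec hawkular-cassandra-1-h4ckm -- du -sh /cassandra_data/data/hawkular_metrics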
Created attachment 1445005 [details]
metrics pods log - 20180529
metrics version is v3.10.0-0.53.0.0
Since I provided correct verification steps in https://bugzilla.redhat.com/show_bug.cgi?id=1571045#c5, can you please retest this?
There are not too many data_temp tables after metrics has been running for one day; metrics version is v3.10.0-0.64.0.0.

# oc -n openshift-infra exec hawkular-cassandra-1-n8h6q -- cqlsh --ssl -e "select table_name from system_schema.tables where keyspace_name = 'hawkular_metrics'"

 table_name
----------------------
 active_time_slices
 cassalog
 data
 data_0
 data_compressed
 data_temp_2018060812
 data_temp_2018060814
 data_temp_2018060816
 data_temp_2018060818
 data_temp_2018060820
 data_temp_2018060822
 data_temp_2018060900
 data_temp_2018060902
 data_temp_2018060904
 data_temp_2018060906
 data_temp_2018060908
 data_temp_2018060910
 finished_jobs_idx
 jobs
 leases
 locks
 metrics_idx
 metrics_tags_idx
 retentions_idx
 scheduled_jobs_idx
 sys_config
 tasks
 tenants

(28 rows)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816