Bug 1570981 - Large number of data_temp tables cause request timeouts and other performance problems
Summary: Large number of data_temp tables cause request timeouts and other performance...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: ---
: 3.10.0
Assignee: John Sanda
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1570982 1571045
TreeView+ depends on / blocked
 
Reported: 2018-04-23 22:03 UTC by John Sanda
Modified: 2018-07-30 19:14 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1570982 1571045 (view as bug list)
Environment:
Last Closed: 2018-07-30 19:13:42 UTC
Target Upstream Version:


Attachments (Terms of Use)
metrics pods log (31.72 KB, application/x-gzip)
2018-05-23 04:37 UTC, Junqi Zhao
no flags Details
metrics pods log - 20180529 (37.96 KB, application/x-gzip)
2018-05-29 02:03 UTC, Junqi Zhao
no flags Details


Links
System ID Priority Status Summary Last Updated
JBoss Issue Tracker HWKMETRICS-763 'Major' 'Resolved' 'Temp tables are not getting deleted' 2019-11-27 06:53:04 UTC
JBoss Issue Tracker HWKMETRICS-771 'Major' 'Resolved' 'Compress job should compress all possible tables' 2019-11-27 06:53:04 UTC
Red Hat Product Errata RHBA-2018:1816 None None None 2018-07-30 19:14:17 UTC

Description John Sanda 2018-04-23 22:03:54 UTC
Description of problem:
In OCP 3.7, we introduced "temp" tables for raw data in Hawkular Metrics. A different table is used for raw data per every two hour block. After the two hour block has passed, the table is never written to again. The compression job that runs in the Hawkular Metrics server will subsequently fetch the raw data from the table, compress it, write the compressed data to the data_compressed table, and then finally drop the temp table. Handling raw data this way allows space on disk to be reclaimed more quickly, reduces compaction, which in turn reduces I/O and CPU usage.

Due to a bug in the job scheduling code in Hawkular Metrics, temp tables were not getting dropped. Hawkular Metrics is supposed to maintain a days worth of tables as a bit of a buffer to ensure that there is always a table to which to write. At any given time there should be 13 or 14 temp tables. I observed some clusters which had as many as 300 temp tables.

The large number of tables can result in performance problems. A table has about 1 MB of JVM heap space overhead. For Cassandra pods particularly with smaller heap sizes, this could contribute to excessive GC which would result in high CPU usage and requests timing out. 


Version-Release number of selected component (if applicable):


How reproducible:
We would end up with orphaned (in the sense that they won't ever get deleted) temp tables any time the compression job fails to complete successfully.


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Junqi Zhao 2018-05-22 06:54:10 UTC
@John,

What is the checkpoint of this defect, is the checkpoint is "At any given time there should be 13 or 14 temp tables"? But I see there are 15 temp file in my env

$ ls -alR /cassandra_data/ | grep -i temp
drwxr-xr-x.  3 cassandra cassandra   21 May 22 03:00 data_temp_2018052200-46457d905d5611e881192b75a818259a
drwxr-xr-x.  4 cassandra cassandra   38 May 22 05:00 data_temp_2018052202-4617dde05d5611e881192b75a818259a
drwxr-xr-x.  4 cassandra cassandra  258 May 22 05:00 data_temp_2018052204-4672f6305d5611e881192b75a818259a
drwxr-xr-x.  4 cassandra cassandra   38 May 22 03:29 data_temp_2018052206-8fa29e505d5611e881192b75a818259a
drwxr-xr-x.  4 cassandra cassandra   38 May 22 03:29 data_temp_2018052208-8e627d305d5611e881192b75a818259a
drwxr-xr-x.  4 cassandra cassandra   38 May 22 03:29 data_temp_2018052210-8e901ce05d5611e881192b75a818259a
drwxr-xr-x.  4 cassandra cassandra   38 May 22 03:29 data_temp_2018052212-8f194a105d5611e881192b75a818259a
drwxr-xr-x.  4 cassandra cassandra   38 May 22 03:29 data_temp_2018052214-8eeb83505d5611e881192b75a818259a
drwxr-xr-x.  4 cassandra cassandra   38 May 22 03:29 data_temp_2018052216-8fd065105d5611e881192b75a818259a
drwxr-xr-x.  4 cassandra cassandra   38 May 22 03:29 data_temp_2018052218-8f74d7905d5611e881192b75a818259a
drwxr-xr-x.  4 cassandra cassandra   38 May 22 03:29 data_temp_2018052220-8ebdbc905d5611e881192b75a818259a
drwxr-xr-x.  4 cassandra cassandra   38 May 22 03:29 data_temp_2018052222-8f4737e05d5611e881192b75a818259a
drwxr-xr-x.  4 cassandra cassandra   38 May 22 03:29 data_temp_2018052300-51e88e105d6711e881192b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 22 04:24 data_temp_2018052302-f6840a205d7711e88adb0b295fcd6148
drwxr-xr-x.  3 cassandra cassandra   21 May 22 06:24 data_temp_2018052304-ba086d505d8811e88adb0b295fcd6148

Comment 3 John Sanda 2018-05-22 16:13:22 UTC
There is a job, TempTableCreator.java, which runs every two hours and creates tables for the next 24 hours relative to the job start time. While it is possible to wind up with more, in general I think we should normally see 12 of these tables. 

Can you please provide logs for hawkular-metrics and the output of the following:

$ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "select * from hawkular_metrics.scheduled_jobs_idx"

$ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "select * from hawkular_metrics.finished_jobs_idx"

Comment 4 Junqi Zhao 2018-05-23 02:27:52 UTC
(In reply to John Sanda from comment #3)
> There is a job, TempTableCreator.java, which runs every two hours and
> creates tables for the next 24 hours relative to the job start time. While
> it is possible to wind up with more, in general I think we should normally
> see 12 of these tables. 
It is 12 temp tables now in my new env, not sure why I get 15 tables yesterday
# oc get po
NAME                            READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-jnv86      1/1       Running     0          1h
hawkular-metrics-d5q7x          1/1       Running     0          1h
hawkular-metrics-schema-dlzch   0/1       Completed   0          1h
heapster-gkmsj                  1/1       Running     0          1h

# oc -n openshift-infra exec hawkular-cassandra-1-jnv86 -- ls -alR /cassandra_data/ | grep -i temp
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:18 data_temp_2018052300-357621f05e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:18 data_temp_2018052302-35485b305e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:18 data_temp_2018052304-35a3c1a05e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052306-7ebef7605e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052308-7dac75f05e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052310-7e954f505e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052312-7e35a3205e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052314-7f1a84e05e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052316-7eecbe205e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052318-7dda15a05e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052320-7e0803705e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052322-7e6390f05e2711e8a6a22b75a818259a

> Can you please provide logs for hawkular-metrics and the output of the
> following:
> 
> $ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "select *
> from hawkular_metrics.scheduled_jobs_idx"
# oc -n openshift-infra exec hawkular-cassandra-1-jnv86 -- cqlsh --ssl -e "select * from hawkular_metrics.scheduled_jobs_idx"

 time_slice               | job_id                               | job_name             | job_params | job_type             | status | trigger
--------------------------+--------------------------------------+----------------------+------------+----------------------+--------+--------------------------------------------------------------------------------------------------------------------
 2018-05-23 03:00:00+0000 | 4814c669-40fc-413d-b42b-17a8beedc6df | TEMP_DATA_COMPRESSOR |           {} | TEMP_DATA_COMPRESSOR |   null | {type: 1, trigger_time: 1527044400000, delay: 60000, interval: 7200000, repeat_count: null, execution_count: null}
 2018-05-23 03:20:00+0000 | 26a09d0e-f6de-4984-8087-66cc477d3e17 |   TEMP_TABLE_CREATOR |           {} |   TEMP_TABLE_CREATOR |   null | {type: 1, trigger_time: 1527045600000, delay: 60000, interval: 7200000, repeat_count: null, execution_count: null}

(2 rows)

> $ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "select *
> from hawkular_metrics.finished_jobs_idx"
# oc -n openshift-infra exec hawkular-cassandra-1-jnv86 -- cqlsh --ssl -e "select * from hawkular_metrics.finished_jobs_idx"

 time_slice | job_id
------------+--------

(0 rows)

Comment 5 Junqi Zhao 2018-05-23 04:36:09 UTC
I think the checkpoint is the number of temp files is maintained at 12~14, there should not be two many temp files, since after running for 2 hours, the number of temp files is 13, not 12.

# oc -n openshift-infra exec hawkular-cassandra-1-jnv86 -- ls -alR /cassandra_data/ | grep -i temp
drwxr-xr-x.  3 cassandra cassandra   21 May 23 03:00 data_temp_2018052300-357621f05e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra  241 May 23 03:00 data_temp_2018052302-35485b305e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:18 data_temp_2018052304-35a3c1a05e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052306-7ebef7605e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052308-7dac75f05e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052310-7e954f505e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052312-7e35a3205e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052314-7f1a84e05e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052316-7eecbe205e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052318-7dda15a05e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052320-7e0803705e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20 data_temp_2018052322-7e6390f05e2711e8a6a22b75a818259a
drwxr-xr-x.  3 cassandra cassandra   21 May 23 03:20 data_temp_2018052400-413238b05e3811e8a6a22b75a818259a

Comment 6 Junqi Zhao 2018-05-23 04:37:06 UTC
Created attachment 1440413 [details]
metrics pods log

Comment 7 John Sanda 2018-05-23 13:52:06 UTC
I failed to mention that the dates/time in the table name are reported in UTC. The temp tables listed in comment 5 look correct. I should also note that the oldest tables get dropped first and not until the raw data in those tables is compressed. If the data in a table does not get compressed for whatever reason, then by design it should not get dropped. 

The logs files also revealed a new 3.10 regression:

ERROR [org.hawkular.metrics.api.jaxrs.util.ApiUtils] (RxComputationScheduler-3) HAWKMETRICS200010: Failed to process request: java.lang.NullPointerException
        at org.hawkular.metrics.core.service.tags.SimpleTagQueryParser.lambda$tagValueFilter$40(SimpleTagQueryParser.java:557)
...

I will create a new ticket for the NPE.

Comment 8 Junqi Zhao 2018-05-24 00:40:56 UTC
(In reply to John Sanda from comment #7)
 
> The logs files also revealed a new 3.10 regression:
> 
> ERROR [org.hawkular.metrics.api.jaxrs.util.ApiUtils]
> (RxComputationScheduler-3) HAWKMETRICS200010: Failed to process request:
> java.lang.NullPointerException
>         at
> org.hawkular.metrics.core.service.tags.SimpleTagQueryParser.
> lambda$tagValueFilter$40(SimpleTagQueryParser.java:557)
> ...
> 
> I will create a new ticket for the NPE.

It should be the same defect with https://bugzilla.redhat.com/show_bug.cgi?id=1581060, Endpoint POST /hawkular/metrics/m/stats/query fails with HTTP code: 500

Comment 9 John Sanda 2018-05-24 20:36:32 UTC
(In reply to Junqi Zhao from comment #5)
> I think the checkpoint is the number of temp files is maintained at 12~14,
> there should not be two many temp files, since after running for 2 hours,
> the number of temp files is 13, not 12.
> 
> # oc -n openshift-infra exec hawkular-cassandra-1-jnv86 -- ls -alR
> /cassandra_data/ | grep -i temp
> drwxr-xr-x.  3 cassandra cassandra   21 May 23 03:00
> data_temp_2018052300-357621f05e2711e8a6a22b75a818259a
> drwxr-xr-x.  3 cassandra cassandra  241 May 23 03:00
> data_temp_2018052302-35485b305e2711e8a6a22b75a818259a
> drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:18
> data_temp_2018052304-35a3c1a05e2711e8a6a22b75a818259a
> drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20
> data_temp_2018052306-7ebef7605e2711e8a6a22b75a818259a
> drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20
> data_temp_2018052308-7dac75f05e2711e8a6a22b75a818259a
> drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20
> data_temp_2018052310-7e954f505e2711e8a6a22b75a818259a
> drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20
> data_temp_2018052312-7e35a3205e2711e8a6a22b75a818259a
> drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20
> data_temp_2018052314-7f1a84e05e2711e8a6a22b75a818259a
> drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20
> data_temp_2018052316-7eecbe205e2711e8a6a22b75a818259a
> drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20
> data_temp_2018052318-7dda15a05e2711e8a6a22b75a818259a
> drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20
> data_temp_2018052320-7e0803705e2711e8a6a22b75a818259a
> drwxr-xr-x.  3 cassandra cassandra   21 May 23 01:20
> data_temp_2018052322-7e6390f05e2711e8a6a22b75a818259a
> drwxr-xr-x.  3 cassandra cassandra   21 May 23 03:20
> data_temp_2018052400-413238b05e3811e8a6a22b75a818259a

Yes, that is correct. Everything looks good here.

Comment 10 Junqi Zhao 2018-05-29 02:02:58 UTC
The number temp file is growing after running metrics for one day, not between 12-14, it is 23 now
and the size of /cassandra_data/data/hawkular_metrics is 9.4M, I think we should check the data file size, not data files number.

Attached metrics logs, there is one exception in hawkular-metrics pod log,
WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)

# oc -n openshift-infra exec hawkular-cassandra-1-h4ckm -- ls -alR /cassandra_data/ | grep -i temp
drwxr-xr-x.  3 cassandra cassandra   21 May 28 05:00 data_temp_2018052802-43137220622511e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 07:00 data_temp_2018052804-42e5d270622511e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 09:00 data_temp_2018052806-434138e0622511e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 11:00 data_temp_2018052808-8b8bd830622511e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 13:00 data_temp_2018052810-8c70b9f0622511e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 15:00 data_temp_2018052812-8c152c70622511e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 17:00 data_temp_2018052814-8b5e1170622511e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 19:00 data_temp_2018052816-8b3098d0622511e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 21:00 data_temp_2018052818-8bb99ef0622511e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 23:00 data_temp_2018052820-8c9ea7c0622511e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 29 01:00 data_temp_2018052822-8c42f330622511e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra  241 May 29 01:00 data_temp_2018052900-8be765b0622511e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 05:16 data_temp_2018052902-4ebaef70623611e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 07:16 data_temp_2018052904-124d3550624711e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 09:16 data_temp_2018052906-d5e01770625711e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 11:16 data_temp_2018052908-99743210626811e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 13:16 data_temp_2018052910-5d0650e0627911e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 15:16 data_temp_2018052912-20990bf0628a11e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 17:16 data_temp_2018052914-e42d4da0629a11e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 19:16 data_temp_2018052916-a7c5fc2062ab11e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 21:16 data_temp_2018052918-6b5a16c062bc11e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 28 23:16 data_temp_2018052920-2ee6422062cd11e8a2c26330552d2565
drwxr-xr-x.  3 cassandra cassandra   21 May 29 01:16 data_temp_2018052922-f2794b5062dd11e8a2c26330552d2565


# oc -n openshift-infra  exec hawkular-cassandra-1-h4ckm -- du -sh /cassandra_data/data/hawkular_metrics
9.4M	/cassandra_data/data/hawkular_metrics

Comment 11 Junqi Zhao 2018-05-29 02:03:42 UTC
Created attachment 1445005 [details]
metrics pods log - 20180529

Comment 12 Junqi Zhao 2018-05-29 02:09:33 UTC
metrics version is v3.10.0-0.53.0.0

Comment 13 John Sanda 2018-06-07 21:15:11 UTC
Since I provided correct verification steps in https://bugzilla.redhat.com/show_bug.cgi?id=1571045#c5, can you please retest this?

Comment 14 Junqi Zhao 2018-06-08 13:02:36 UTC
There are not two many data_temp tables after metrics running for one day,
metrics version is v3.10.0-0.64.0.0

# oc -n openshift-infra exec hawkular-cassandra-1-n8h6q -- cqlsh --ssl -e "select table_name from system_schema.tables where keyspace_name = 'hawkular_metrics'"

 table_name
----------------------
   active_time_slices
             cassalog
                 data
               data_0
      data_compressed
 data_temp_2018060812
 data_temp_2018060814
 data_temp_2018060816
 data_temp_2018060818
 data_temp_2018060820
 data_temp_2018060822
 data_temp_2018060900
 data_temp_2018060902
 data_temp_2018060904
 data_temp_2018060906
 data_temp_2018060908
 data_temp_2018060910
    finished_jobs_idx
                 jobs
               leases
                locks
          metrics_idx
     metrics_tags_idx
       retentions_idx
   scheduled_jobs_idx
           sys_config
                tasks
              tenants

(28 rows)

Comment 16 errata-xmlrpc 2018-07-30 19:13:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816


Note You need to log in before you can comment on or make changes to this bug.