1570982 – Large number of data_temp tables cause request timeouts and other performance problems

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Hawkular
Sub Component:
Version:	3.9.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	3.9.z
Assignee:	John Sanda
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:	1570981 1571045
Blocks:

Reported:	2018-04-23 22:07 UTC by John Sanda
Modified:	2018-06-06 15:47 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1570981
Environment:
Last Closed:	2018-06-06 15:46:20 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	HWKMETRICS-780	Major	Closed	Temp tables are not getting deleted	2019-05-27 13:29:18 UTC
Red Hat Issue Tracker	HWKMETRICS-784	Major	Closed	Compress job should compress all possible tables	2019-05-27 13:29:18 UTC
Red Hat Product Errata	RHBA-2018:1796	None	None	None	2018-06-06 15:47:23 UTC

Description John Sanda 2018-04-23 22:07:11 UTC

+++ This bug was initially created as a clone of Bug #1570981 +++

Description of problem:
In OCP 3.7, we introduced "temp" tables for raw data in Hawkular Metrics. Each two-hour block of raw data gets its own table. Once the two-hour block has passed, its table is never written to again. The compression job that runs in the Hawkular Metrics server then fetches the raw data from that table, compresses it, writes the compressed data to the data_compressed table, and finally drops the temp table. Handling raw data this way allows disk space to be reclaimed more quickly and reduces compaction, which in turn reduces I/O and CPU usage.
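
To illustrate the two-hour blocking, here is a minimal Python sketch that maps a timestamp to the temp table covering it. The `data_temp_YYYYMMDDHH` naming and the use of UTC are assumptions inferred from the table listing in comment 2, not taken from the Hawkular Metrics source; `block_start` and `temp_table_name` are hypothetical helper names.

```python
from datetime import datetime, timezone

BLOCK_HOURS = 2  # block size stated in this report


def block_start(ts: datetime) -> datetime:
    """Floor a timezone-aware timestamp to the start of its two-hour block (UTC assumed)."""
    ts = ts.astimezone(timezone.utc)
    return ts.replace(hour=ts.hour - ts.hour % BLOCK_HOURS,
                      minute=0, second=0, microsecond=0)


def temp_table_name(ts: datetime) -> str:
    """Return the assumed temp table name for the block containing ts."""
    # Assumption: the suffix is the block start formatted as YYYYMMDDHH.
    return "data_temp_" + block_start(ts).strftime("%Y%m%d%H")


if __name__ == "__main__":
    sample = datetime(2018, 5, 30, 7, 45, tzinfo=timezone.utc)
    print(temp_table_name(sample))
```

Under these assumptions, any write between 06:00 and 07:59 UTC on 2018-05-30 would land in the same table, and that table would become eligible for compression once 08:00 has passed.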

Due to a bug in the job scheduling code in Hawkular Metrics, temp tables were not getting dropped. Hawkular Metrics is supposed to maintain a day's worth of tables as a buffer so that there is always a table to write to. At any given time there should be 13 or 14 temp tables. I observed some clusters with as many as 300 temp tables.

The large number of tables can cause performance problems. Each table has about 1 MB of JVM heap overhead. For Cassandra pods, particularly those with smaller heap sizes, this can contribute to excessive GC, which results in high CPU usage and requests timing out.
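
As a rough back-of-the-envelope check, the sketch below compares the heap overhead of the expected table count with the overhead seen on affected clusters. The 1 MB per table figure is the approximation given above; the 512 MB heap size is purely a hypothetical value chosen for illustration, and the function names are made up for this example.

```python
MB_PER_TABLE = 1  # approximate per-table JVM heap overhead from this report


def table_heap_overhead_mb(table_count: int) -> int:
    """Estimated heap consumed by table overhead alone, in MB."""
    return table_count * MB_PER_TABLE


def heap_share(table_count: int, heap_mb: int) -> float:
    """Fraction of the heap taken up by table overhead."""
    return table_heap_overhead_mb(table_count) / heap_mb


if __name__ == "__main__":
    hypothetical_heap_mb = 512  # assumption, not a value from this report
    for count in (14, 300):
        share = heap_share(count, hypothetical_heap_mb)
        print(f"{count} tables: ~{table_heap_overhead_mb(count)} MB "
              f"({share:.0%} of a {hypothetical_heap_mb} MB heap)")
```

With 14 tables the overhead is a small fraction of such a heap, while 300 tables would take up more than half of it before any data is cached, which is consistent with the GC pressure described above.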


Version-Release number of selected component (if applicable):


How reproducible:
Orphaned temp tables (orphaned in the sense that they will never be deleted) are left behind any time the compression job fails to complete successfully.


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Junqi Zhao 2018-05-30 09:17:35 UTC

There are not too many data_temp tables after metrics has been running for one day.
Metrics version is v3.9.30-1.

# oc -n openshift-infra exec hawkular-cassandra-1-nx9qk -- cqlsh --ssl -e "select table_name from system_schema.tables where keyspace_name = 'hawkular_metrics'"

 table_name
----------------------
   active_time_slices
             cassalog
                 data
               data_0
      data_compressed
 data_temp_2018053006
 data_temp_2018053008
 data_temp_2018053010
 data_temp_2018053012
 data_temp_2018053014
 data_temp_2018053016
 data_temp_2018053018
 data_temp_2018053020
 data_temp_2018053022
 data_temp_2018053100
 data_temp_2018053102
 data_temp_2018053104
    finished_jobs_idx
                 jobs
               leases
                locks
          metrics_idx
     metrics_tags_idx
       retentions_idx
   scheduled_jobs_idx
           sys_config
                tasks
              tenants

(28 rows)
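
For repeated checks, a listing like the one above could be parsed with a small script. This is only a sketch: it assumes the suffix of each data_temp table is the UTC start of its two-hour block in YYYYMMDDHH form, and the `max_age` cutoff for calling a table stale is a hypothetical threshold, not a value defined by Hawkular Metrics. The limit of 14 is the expected maximum stated in the description.

```python
import re
from datetime import datetime, timedelta, timezone

TEMP_TABLE = re.compile(r"^data_temp_(\d{10})$")
MAX_EXPECTED_TEMP_TABLES = 14  # upper bound given in the bug description


def temp_tables(cqlsh_output: str) -> dict:
    """Map each data_temp table name in the cqlsh output to its assumed block start."""
    tables = {}
    for line in cqlsh_output.splitlines():
        match = TEMP_TABLE.match(line.strip())
        if match:
            start = datetime.strptime(match.group(1), "%Y%m%d%H")
            tables[match.group(0)] = start.replace(tzinfo=timezone.utc)
    return tables


def stale_tables(tables: dict, now: datetime,
                 max_age: timedelta = timedelta(hours=6)) -> list:
    """Names of tables whose block started more than max_age ago (hypothetical cutoff)."""
    return sorted(name for name, start in tables.items() if now - start > max_age)


if __name__ == "__main__":
    listing = """
     data_compressed
     data_temp_2018053006
     data_temp_2018053008
     finished_jobs_idx
    """
    found = temp_tables(listing)
    print(len(found), "temp tables; too many:", len(found) > MAX_EXPECTED_TEMP_TABLES)
    print(stale_tables(found, datetime(2018, 5, 30, 13, 0, tzinfo=timezone.utc)))
```

The listing would come from the same cqlsh query shown in this comment; a count well above 14, or tables far older than the cutoff, would suggest that the compression job is not dropping tables.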

Comment 4 errata-xmlrpc 2018-06-06 15:46:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1796
