Bug 1570981
| Summary: | Large number of data_temp tables cause request timeouts and other performance problems | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | John Sanda <jsanda> | ||||||
| Component: | Hawkular | Assignee: | John Sanda <jsanda> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> | ||||||
| Severity: | medium | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | 3.9.0 | CC: | aos-bugs, jsanda, juzhao | ||||||
| Target Milestone: | --- | ||||||||
| Target Release: | 3.10.0 | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | |||||||||
| : | 1570982 1571045 (view as bug list) | Environment: | |||||||
| Last Closed: | 2018-07-30 19:13:42 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 1570982, 1571045 | ||||||||
| Attachments: |
|
||||||||
|
Description
John Sanda
2018-04-23 22:03:54 UTC
@John, What is the checkpoint of this defect, is the checkpoint is "At any given time there should be 13 or 14 temp tables"? But I see there are 15 temp file in my env $ ls -alR /cassandra_data/ | grep -i temp drwxr-xr-x. 3 cassandra cassandra 21 May 22 03:00 data_temp_2018052200-46457d905d5611e881192b75a818259a drwxr-xr-x. 4 cassandra cassandra 38 May 22 05:00 data_temp_2018052202-4617dde05d5611e881192b75a818259a drwxr-xr-x. 4 cassandra cassandra 258 May 22 05:00 data_temp_2018052204-4672f6305d5611e881192b75a818259a drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052206-8fa29e505d5611e881192b75a818259a drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052208-8e627d305d5611e881192b75a818259a drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052210-8e901ce05d5611e881192b75a818259a drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052212-8f194a105d5611e881192b75a818259a drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052214-8eeb83505d5611e881192b75a818259a drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052216-8fd065105d5611e881192b75a818259a drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052218-8f74d7905d5611e881192b75a818259a drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052220-8ebdbc905d5611e881192b75a818259a drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052222-8f4737e05d5611e881192b75a818259a drwxr-xr-x. 4 cassandra cassandra 38 May 22 03:29 data_temp_2018052300-51e88e105d6711e881192b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 22 04:24 data_temp_2018052302-f6840a205d7711e88adb0b295fcd6148 drwxr-xr-x. 3 cassandra cassandra 21 May 22 06:24 data_temp_2018052304-ba086d505d8811e88adb0b295fcd6148 There is a job, TempTableCreator.java, which runs every two hours and creates tables for the next 24 hours relative to the job start time. While it is possible to wind up with more, in general I think we should normally see 12 of these tables. Can you please provide logs for hawkular-metrics and the output of the following: $ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "select * from hawkular_metrics.scheduled_jobs_idx" $ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "select * from hawkular_metrics.finished_jobs_idx" (In reply to John Sanda from comment #3) > There is a job, TempTableCreator.java, which runs every two hours and > creates tables for the next 24 hours relative to the job start time. While > it is possible to wind up with more, in general I think we should normally > see 12 of these tables. It is 12 temp tables now in my new env, not sure why I get 15 tables yesterday # oc get po NAME READY STATUS RESTARTS AGE hawkular-cassandra-1-jnv86 1/1 Running 0 1h hawkular-metrics-d5q7x 1/1 Running 0 1h hawkular-metrics-schema-dlzch 0/1 Completed 0 1h heapster-gkmsj 1/1 Running 0 1h # oc -n openshift-infra exec hawkular-cassandra-1-jnv86 -- ls -alR /cassandra_data/ | grep -i temp drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:18 data_temp_2018052300-357621f05e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:18 data_temp_2018052302-35485b305e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:18 data_temp_2018052304-35a3c1a05e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052306-7ebef7605e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052308-7dac75f05e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052310-7e954f505e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052312-7e35a3205e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052314-7f1a84e05e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052316-7eecbe205e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052318-7dda15a05e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052320-7e0803705e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052322-7e6390f05e2711e8a6a22b75a818259a > Can you please provide logs for hawkular-metrics and the output of the > following: > > $ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "select * > from hawkular_metrics.scheduled_jobs_idx" # oc -n openshift-infra exec hawkular-cassandra-1-jnv86 -- cqlsh --ssl -e "select * from hawkular_metrics.scheduled_jobs_idx" time_slice | job_id | job_name | job_params | job_type | status | trigger --------------------------+--------------------------------------+----------------------+------------+----------------------+--------+-------------------------------------------------------------------------------------------------------------------- 2018-05-23 03:00:00+0000 | 4814c669-40fc-413d-b42b-17a8beedc6df | TEMP_DATA_COMPRESSOR | {} | TEMP_DATA_COMPRESSOR | null | {type: 1, trigger_time: 1527044400000, delay: 60000, interval: 7200000, repeat_count: null, execution_count: null} 2018-05-23 03:20:00+0000 | 26a09d0e-f6de-4984-8087-66cc477d3e17 | TEMP_TABLE_CREATOR | {} | TEMP_TABLE_CREATOR | null | {type: 1, trigger_time: 1527045600000, delay: 60000, interval: 7200000, repeat_count: null, execution_count: null} (2 rows) > $ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "select * > from hawkular_metrics.finished_jobs_idx" # oc -n openshift-infra exec hawkular-cassandra-1-jnv86 -- cqlsh --ssl -e "select * from hawkular_metrics.finished_jobs_idx" time_slice | job_id ------------+-------- (0 rows) I think the checkpoint is the number of temp files is maintained at 12~14, there should not be two many temp files, since after running for 2 hours, the number of temp files is 13, not 12. # oc -n openshift-infra exec hawkular-cassandra-1-jnv86 -- ls -alR /cassandra_data/ | grep -i temp drwxr-xr-x. 3 cassandra cassandra 21 May 23 03:00 data_temp_2018052300-357621f05e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 241 May 23 03:00 data_temp_2018052302-35485b305e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:18 data_temp_2018052304-35a3c1a05e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052306-7ebef7605e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052308-7dac75f05e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052310-7e954f505e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052312-7e35a3205e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052314-7f1a84e05e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052316-7eecbe205e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052318-7dda15a05e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052320-7e0803705e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 data_temp_2018052322-7e6390f05e2711e8a6a22b75a818259a drwxr-xr-x. 3 cassandra cassandra 21 May 23 03:20 data_temp_2018052400-413238b05e3811e8a6a22b75a818259a Created attachment 1440413 [details]
metrics pods log
I failed to mention that the dates/time in the table name are reported in UTC. The temp tables listed in comment 5 look correct. I should also note that the oldest tables get dropped first and not until the raw data in those tables is compressed. If the data in a table does not get compressed for whatever reason, then by design it should not get dropped. The logs files also revealed a new 3.10 regression: ERROR [org.hawkular.metrics.api.jaxrs.util.ApiUtils] (RxComputationScheduler-3) HAWKMETRICS200010: Failed to process request: java.lang.NullPointerException at org.hawkular.metrics.core.service.tags.SimpleTagQueryParser.lambda$tagValueFilter$40(SimpleTagQueryParser.java:557) ... I will create a new ticket for the NPE. (In reply to John Sanda from comment #7) > The logs files also revealed a new 3.10 regression: > > ERROR [org.hawkular.metrics.api.jaxrs.util.ApiUtils] > (RxComputationScheduler-3) HAWKMETRICS200010: Failed to process request: > java.lang.NullPointerException > at > org.hawkular.metrics.core.service.tags.SimpleTagQueryParser. > lambda$tagValueFilter$40(SimpleTagQueryParser.java:557) > ... > > I will create a new ticket for the NPE. It should be the same defect with https://bugzilla.redhat.com/show_bug.cgi?id=1581060, Endpoint POST /hawkular/metrics/m/stats/query fails with HTTP code: 500 (In reply to Junqi Zhao from comment #5) > I think the checkpoint is the number of temp files is maintained at 12~14, > there should not be two many temp files, since after running for 2 hours, > the number of temp files is 13, not 12. > > # oc -n openshift-infra exec hawkular-cassandra-1-jnv86 -- ls -alR > /cassandra_data/ | grep -i temp > drwxr-xr-x. 3 cassandra cassandra 21 May 23 03:00 > data_temp_2018052300-357621f05e2711e8a6a22b75a818259a > drwxr-xr-x. 3 cassandra cassandra 241 May 23 03:00 > data_temp_2018052302-35485b305e2711e8a6a22b75a818259a > drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:18 > data_temp_2018052304-35a3c1a05e2711e8a6a22b75a818259a > drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 > data_temp_2018052306-7ebef7605e2711e8a6a22b75a818259a > drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 > data_temp_2018052308-7dac75f05e2711e8a6a22b75a818259a > drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 > data_temp_2018052310-7e954f505e2711e8a6a22b75a818259a > drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 > data_temp_2018052312-7e35a3205e2711e8a6a22b75a818259a > drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 > data_temp_2018052314-7f1a84e05e2711e8a6a22b75a818259a > drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 > data_temp_2018052316-7eecbe205e2711e8a6a22b75a818259a > drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 > data_temp_2018052318-7dda15a05e2711e8a6a22b75a818259a > drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 > data_temp_2018052320-7e0803705e2711e8a6a22b75a818259a > drwxr-xr-x. 3 cassandra cassandra 21 May 23 01:20 > data_temp_2018052322-7e6390f05e2711e8a6a22b75a818259a > drwxr-xr-x. 3 cassandra cassandra 21 May 23 03:20 > data_temp_2018052400-413238b05e3811e8a6a22b75a818259a Yes, that is correct. Everything looks good here. The number temp file is growing after running metrics for one day, not between 12-14, it is 23 now and the size of /cassandra_data/data/hawkular_metrics is 9.4M, I think we should check the data file size, not data files number. Attached metrics logs, there is one exception in hawkular-metrics pod log, WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write) # oc -n openshift-infra exec hawkular-cassandra-1-h4ckm -- ls -alR /cassandra_data/ | grep -i temp drwxr-xr-x. 3 cassandra cassandra 21 May 28 05:00 data_temp_2018052802-43137220622511e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 07:00 data_temp_2018052804-42e5d270622511e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 09:00 data_temp_2018052806-434138e0622511e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 11:00 data_temp_2018052808-8b8bd830622511e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 13:00 data_temp_2018052810-8c70b9f0622511e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 15:00 data_temp_2018052812-8c152c70622511e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 17:00 data_temp_2018052814-8b5e1170622511e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 19:00 data_temp_2018052816-8b3098d0622511e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 21:00 data_temp_2018052818-8bb99ef0622511e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 23:00 data_temp_2018052820-8c9ea7c0622511e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 29 01:00 data_temp_2018052822-8c42f330622511e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 241 May 29 01:00 data_temp_2018052900-8be765b0622511e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 05:16 data_temp_2018052902-4ebaef70623611e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 07:16 data_temp_2018052904-124d3550624711e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 09:16 data_temp_2018052906-d5e01770625711e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 11:16 data_temp_2018052908-99743210626811e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 13:16 data_temp_2018052910-5d0650e0627911e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 15:16 data_temp_2018052912-20990bf0628a11e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 17:16 data_temp_2018052914-e42d4da0629a11e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 19:16 data_temp_2018052916-a7c5fc2062ab11e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 21:16 data_temp_2018052918-6b5a16c062bc11e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 28 23:16 data_temp_2018052920-2ee6422062cd11e8a2c26330552d2565 drwxr-xr-x. 3 cassandra cassandra 21 May 29 01:16 data_temp_2018052922-f2794b5062dd11e8a2c26330552d2565 # oc -n openshift-infra exec hawkular-cassandra-1-h4ckm -- du -sh /cassandra_data/data/hawkular_metrics 9.4M /cassandra_data/data/hawkular_metrics Created attachment 1445005 [details]
metrics pods log - 20180529
metrics version is v3.10.0-0.53.0.0 Since I provided correct verification steps in https://bugzilla.redhat.com/show_bug.cgi?id=1571045#c5, can you please retest this? There are not two many data_temp tables after metrics running for one day,
metrics version is v3.10.0-0.64.0.0
# oc -n openshift-infra exec hawkular-cassandra-1-n8h6q -- cqlsh --ssl -e "select table_name from system_schema.tables where keyspace_name = 'hawkular_metrics'"
table_name
----------------------
active_time_slices
cassalog
data
data_0
data_compressed
data_temp_2018060812
data_temp_2018060814
data_temp_2018060816
data_temp_2018060818
data_temp_2018060820
data_temp_2018060822
data_temp_2018060900
data_temp_2018060902
data_temp_2018060904
data_temp_2018060906
data_temp_2018060908
data_temp_2018060910
finished_jobs_idx
jobs
leases
locks
metrics_idx
metrics_tags_idx
retentions_idx
scheduled_jobs_idx
sys_config
tasks
tenants
(28 rows)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816 |