1570140 – Hawkular is not clearing out old data even though hawkular.metrics.default-ttl is specified

Keywords:
Status:	CLOSED DUPLICATE of bug 1567222
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Hawkular
Sub Component:
Version:	3.7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	3.7.z
Assignee:	John Sanda
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:	1567222
Blocks:

Reported:	2018-04-20 17:28 UTC by Luke Stanton
Modified:	2021-09-09 13:48 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-05-31 01:40:48 UTC
Target Upstream Version:
Embargoed:

Attachments
Older output from du /cassandra_data/data/hawkular_metrics (30.72 KB, application/x-gzip) 2018-04-20 22:35 UTC, Luke Stanton

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2018:1798	0	normal	SHIPPED_LIVE	OpenShift Container Platform 3.7 bug fix update	2018-06-26 22:41:33 UTC

Description Luke Stanton 2018-04-20 17:28:02 UTC

Description of problem:
Hawkular is not honoring the hawkular.metrics.default-ttl setting. The user has set the value and restarted the metrics components to pick up the change, but older data isn't being cleared out and the volume is running out of space.

One of the Hawkular logs has the following recurring error...

2018-04-10 05:01:11,450 ERROR [org.hawkular.metrics.core.service.MetricsServiceImpl] (RxComputationScheduler-4) Failure while trying to apply compression, skipping block:
  java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/10.131.7.65:9042 
  (com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/10.131.7.65:9042] Timed out waiting for server response))...

How reproducible:
Consistently reproducible by customer

Steps to Reproduce:
1. Set -Dhawkular.metrics.default-ttl=3 in the hawkular-metrics replication controller
2. Restart hawkular-metrics pod

Actual results:
Metrics data in Cassandra does not appear to get cleaned up per the ttl setting.

Expected results:
Older data should be cleared out automatically per the ttl setting.

Comment 2 John Sanda 2018-04-20 18:02:55 UTC

The customer might be hitting bug 1567222. Can I get the output of `du -h /cassandra_data/data/hawkular_metrics`?

Comment 3 John Sanda 2018-04-20 18:20:21 UTC

Cassandra is getting bogged down with garbage collection, which is very likely the cause of most of the exceptions you are seeing in the logs. I recommend doubling the memory to 4 GB for the Cassandra pod.

Comment 4 Luke Stanton 2018-04-20 22:35:13 UTC

Created attachment 1424759 [details]
Older output from du /cassandra_data/data/hawkular_metrics

I don't know if this is still useful, but the customer had attached this data in an earlier comment.

Comment 5 John Sanda 2018-04-20 23:28:32 UTC

(In reply to Luke Stanton from comment #4)
> Created attachment 1424759 [details]
> Older output from du /cassandra_data/data/hawkular_metrics
> 
> I don't know if this is still useful but customer had attached this data in
> an earlier comment.

Definitely useful. It does look like the customer is hitting bug 1567222. As a temporary workaround until the fix is pushed out, run:

$ oc -n openshift-infra exec <cassandra pod> -- nodetool clearsnapshot
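
Before clearing snapshots, it may help to see how much of the volume they actually occupy. The following is only a rough sketch, not part of the original report: snapshot_kb is a hypothetical helper name, and it assumes the input is `du -k` output (size in KB, a tab, then a path with no spaces) in which each table's snapshot data sits under a directory named "snapshots".

```shell
# Hypothetical helper: sum the size in KB of every directory named
# "snapshots" found in `du -k` output read from stdin.
# Only lines whose path ends in "/snapshots" are counted, so the
# per-snapshot subdirectories beneath them are not double-counted.
snapshot_kb() {
  awk '$2 ~ /\/snapshots$/ { total += $1 } END { print total + 0 }'
}
```

It could be fed from the Cassandra pod, for example with `oc -n openshift-infra exec <cassandra pod> -- du -k /cassandra_data/data/hawkular_metrics | snapshot_kb`, and run again after the clearsnapshot workaround to compare the two totals.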

Comment 10 Junqi Zhao 2018-05-31 01:40:48 UTC


*** This bug has been marked as a duplicate of bug 1567222 ***
