Bug 1570140 - Hawkular is not clearing out old data even though hawkular.metrics.default-ttl is specified
Keywords:
Status: CLOSED DUPLICATE of bug 1567222
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.7.z
Assignee: John Sanda
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1567222
Blocks:
 
Reported: 2018-04-20 17:28 UTC by Luke Stanton
Modified: 2021-09-09 13:48 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-31 01:40:48 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Older output from du /cassandra_data/data/hawkular_metrics (30.72 KB, application/x-gzip)
2018-04-20 22:35 UTC, Luke Stanton


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:1798 0 normal SHIPPED_LIVE OpenShift Container Platform 3.7 bug fix update 2018-06-26 22:41:33 UTC

Description Luke Stanton 2018-04-20 17:28:02 UTC
Description of problem:
Hawkular is not honoring the hawkular.metrics.default-ttl setting. The user has set the value and restarted the metrics components to pick up the change, but older data is not being cleared out and the volume is running out of space.

One of the Hawkular logs has the following recurring error...

2018-04-10 05:01:11,450 ERROR [org.hawkular.metrics.core.service.MetricsServiceImpl] (RxComputationScheduler-4) Failure while trying to apply compression, skipping block:
  java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/10.131.7.65:9042
  (com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/10.131.7.65:9042] Timed out waiting for server response))...

How reproducible:
Consistently reproducible by customer

Steps to Reproduce:
1. Set -Dhawkular.metrics.default-ttl=3 in hawkular-metrics replication controller
2. Restart hawkular-metrics pod
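
The steps above can be sketched with `oc` as follows. This is a minimal sketch: the replication controller name hawkular-metrics and the exact location of the JVM flag are assumptions based on a typical openshift-infra metrics deployment, and the TTL value of 3 is in days.

# 1. Add the JVM flag to the Hawkular Metrics startup arguments
#    (where the flag goes in the rc spec may differ per deployment):
oc -n openshift-infra edit rc hawkular-metrics
#    ... -Dhawkular.metrics.default-ttl=3 ...

# 2. Recreate the pod so the new setting is picked up:
oc -n openshift-infra scale rc hawkular-metrics --replicas=0
oc -n openshift-infra scale rc hawkular-metrics --replicas=1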

Actual results:
Metrics data in Cassandra does not appear to get cleaned up per the ttl setting.

Expected results:
Older data should be cleared out automatically per the ttl setting.

Comment 2 John Sanda 2018-04-20 18:02:55 UTC
The customer might be hitting bug 1567222. Can I get the output of `du -h /cassandra_data/data/hawkular_metrics`?

Comment 3 John Sanda 2018-04-20 18:20:21 UTC
Cassandra is getting bogged down with garbage collection, which is very likely the cause of most of the exceptions you are seeing in the logs. I recommend doubling the memory to 4 GB for the Cassandra pod.
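
One way to apply that recommendation is sketched below. The rc name hawkular-cassandra-1 and the pod label are assumptions; adjust them to match the actual Cassandra deployment, and note that the in-container heap (e.g. a MAX_HEAP_SIZE environment variable, if the image honors one) may need to be raised alongside the pod limit.

# Raise the memory limit on the Cassandra pod to 4Gi (rc name assumed):
oc -n openshift-infra set resources rc/hawkular-cassandra-1 \
  --requests=memory=4Gi --limits=memory=4Gi

# Recreate the pod so the new limit takes effect (label assumed):
oc -n openshift-infra delete pod -l type=hawkular-cassandra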

Comment 4 Luke Stanton 2018-04-20 22:35:13 UTC
Created attachment 1424759 [details]
Older output from du /cassandra_data/data/hawkular_metrics

I don't know if this is still useful but customer had attached this data in an earlier comment.

Comment 5 John Sanda 2018-04-20 23:28:32 UTC
(In reply to Luke Stanton from comment #4)
> Created attachment 1424759 [details]
> Older output from du /cassandra_data/data/hawkular_metrics
> 
> I don't know if this is still useful but customer had attached this data in
> an earlier comment.

Definitely useful. It does look like the customer is hitting bug 1567222. As a temporary workaround until the fix is pushed out, run:

$ oc -n openshift-infra exec <cassandra pod> -- nodetool clearsnapshot
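
For context, bug 1567222 concerns old Cassandra snapshots accumulating on the data volume; the workaround reclaims that space. Before and after running it, you can check how much space snapshots occupy. These commands are a sketch: `<cassandra pod>` is a placeholder, and the data path matches the `du` output attached earlier.

# Show existing snapshots and the space they hold:
oc -n openshift-infra exec <cassandra pod> -- nodetool listsnapshots

# Alternatively, sum the snapshot directories directly on the volume:
oc -n openshift-infra exec <cassandra pod> -- \
  sh -c 'du -sh /cassandra_data/data/hawkular_metrics/*/snapshots 2>/dev/null'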

Comment 10 Junqi Zhao 2018-05-31 01:40:48 UTC

*** This bug has been marked as a duplicate of bug 1567222 ***

