Bug 1570140

Summary: Hawkular is not clearing out old data even though hawkular.metrics.default-ttl is specified
Product: OpenShift Container Platform Reporter: Luke Stanton <lstanton>
Component: Hawkular    Assignee: John Sanda <jsanda>
Status: CLOSED DUPLICATE QA Contact: Junqi Zhao <juzhao>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.7.0    CC: aos-bugs, juzhao, lstanton
Target Milestone: ---   
Target Release: 3.7.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-05-31 01:40:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1567222    
Bug Blocks:    
Attachments:
Older output from du /cassandra_data/data/hawkular_metrics (flags: none)

Description Luke Stanton 2018-04-20 17:28:02 UTC
Description of problem:
Hawkular is not honoring the hawkular.metrics.default-ttl setting. The user has set the value and restarted the metrics components to pick up the change, but older data is not being cleared out and the volume is running out of space.

One of the Hawkular logs has the following recurring error...

2018-04-10 05:01:11,450 ERROR [org.hawkular.metrics.core.service.MetricsServiceImpl] (RxComputationScheduler-4) Failure while trying to apply compression, skipping block: 
  java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/10.131.7.65:9042 
  (com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/10.131.7.65:9042] Timed out waiting for server response))...
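
The OperationTimedOutException indicates Cassandra is not answering queries in time. As a hedged diagnostic sketch (the <cassandra pod> placeholder and namespace are assumptions, not commands from the original report), thread-pool and GC statistics can help confirm whether the node is overloaded:

  $ oc -n openshift-infra exec <cassandra pod> -- nodetool tpstats   # look for pending/blocked tasks
  $ oc -n openshift-infra exec <cassandra pod> -- nodetool gcstats   # look for long GC pauses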

How reproducible:
Consistently reproducible by the customer

Steps to Reproduce:
1. Set -Dhawkular.metrics.default-ttl=3 in the hawkular-metrics replication controller (see the sketch below)
2. Restart hawkular-metrics pod
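
An illustrative sketch only; the container command shown is an assumption about the stock hawkular-metrics replication controller and may differ in the customer's deployment. The property (interpreted as a number of days) is appended to the container args, and the pod is then recreated:

  # Edit the RC and add the system property to the hawkular-metrics container command:
  $ oc -n openshift-infra edit rc hawkular-metrics
      command:
      - /opt/hawkular/scripts/hawkular-metrics-wrapper.sh
      - -b
      - 0.0.0.0
      - -Dhawkular.metrics.default-ttl=3
  # Delete the pod so the RC recreates it with the new argument
  # (the label selector is assumed; adjust to the actual pod labels):
  $ oc -n openshift-infra delete pod -l metrics-infra=hawkular-metrics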

Actual results:
Metrics data in Cassandra does not appear to get cleaned up per the TTL setting.

Expected results:
Older data should be cleared out automatically per the TTL setting.

Comment 2 John Sanda 2018-04-20 18:02:55 UTC
The customer might be hitting bug 1567222. Can I get the output of `du -h /cassandra_data/data/hawkular_metrics`?
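
For completeness, a sketch of how that output could be collected from outside the pod (the <cassandra pod> placeholder stands for the actual Cassandra pod name, as in the workaround below):

  $ oc -n openshift-infra exec <cassandra pod> -- du -h /cassandra_data/data/hawkular_metrics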

Comment 3 John Sanda 2018-04-20 18:20:21 UTC
Cassandra is getting bogged down with garbage collection, which is very likely the cause of most of the exceptions you are seeing in the logs. I recommend doubling the memory for the Cassandra pod to 4 GB.
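
A minimal sketch of one way to apply that, assuming the stock Cassandra replication controller name hawkular-cassandra-1 (adjust to the actual RC name in the cluster):

  $ oc -n openshift-infra set resources rc hawkular-cassandra-1 --limits=memory=4Gi --requests=memory=4Gi
  # Delete the Cassandra pod so it is recreated with the new limits; depending on the
  # image, the MAX_HEAP_SIZE environment variable may also need to be raised to match.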

Comment 4 Luke Stanton 2018-04-20 22:35:13 UTC
Created attachment 1424759 [details]
Older output from du /cassandra_data/data/hawkular_metrics

I don't know if this is still useful, but the customer had attached this data in an earlier comment.

Comment 5 John Sanda 2018-04-20 23:28:32 UTC
(In reply to Luke Stanton from comment #4)
> Created attachment 1424759 [details]
> Older output from du /cassandra_data/data/hawkular_metrics
> 
> I don't know if this is still useful, but the customer had attached this
> data in an earlier comment.

Definitely useful. It does look like the customer is hitting bug 1567222. As a temporary workaround until the fix is pushed out, run:

$ oc -n openshift-infra exec <cassandra pod> -- nodetool clearsnapshot
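
After clearing snapshots, disk usage can be rechecked to confirm space was reclaimed (a suggested follow-up, not part of the original comment):

  $ oc -n openshift-infra exec <cassandra pod> -- nodetool listsnapshots
  $ oc -n openshift-infra exec <cassandra pod> -- du -h /cassandra_data/data/hawkular_metrics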

Comment 10 Junqi Zhao 2018-05-31 01:40:48 UTC

*** This bug has been marked as a duplicate of bug 1567222 ***