Description of problem:
Hawkular is not honoring hawkular.metrics.default-ttl. The user has set the value and restarted the metrics components to pick up the change, but older data isn't being cleared out and the volume is running out of space.
One of the Hawkular logs has the following recurring error...
2018-04-10 05:01:11,450 ERROR [org.hawkular.metrics.core.service.MetricsServiceImpl] (RxComputationScheduler-4) Failure while trying to apply compression, skipping block:
java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/10.131.7.65:9042
(com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/10.131.7.65:9042] Timed out waiting for server response))...
Consistently reproducible by customer
Steps to Reproduce:
1. Set -Dhawkular.metrics.default-ttl=3 in hawkular-metrics replication controller
2. Restart hawkular-metrics pod
Actual results:
Metrics data in Cassandra does not appear to get cleaned up per the TTL setting.
Expected results:
Older data should be cleared out automatically per the TTL setting.
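For reference, a rough sketch of how the TTL option can be applied; the exact container args and pod label are assumptions and may differ per deployment, so adjust to match your environment:
$ oc -n openshift-infra edit rc hawkular-metrics
  (add -Dhawkular.metrics.default-ttl=3 to the container's JVM arguments, then recycle the pod)
$ oc -n openshift-infra delete pod -l metrics-infra=hawkular-metrics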
The customer might be hitting bug 1567222. Can I get the output of `du -h /cassandra_data/data/hawkular_metrics`?
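Something along these lines should capture that output from inside the Cassandra pod (the pod name is a placeholder):
$ oc -n openshift-infra exec <cassandra pod> -- du -h /cassandra_data/data/hawkular_metrics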
Cassandra is getting bogged down with garbage collection, which is very likely the cause of most of the exceptions you are seeing in the logs. I recommend doubling the memory for the Cassandra pod to 4 GB.
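One way to bump the limit, assuming the Cassandra replication controller is named hawkular-cassandra-1 (that name is an assumption, check with `oc get rc -n openshift-infra`):
$ oc -n openshift-infra set resources rc hawkular-cassandra-1 --limits=memory=4Gi --requests=memory=4Gi
The Cassandra pod needs to be restarted for the new limits to take effect.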
Created attachment 1424759 [details]
Older output from du /cassandra_data/data/hawkular_metrics
I don't know if this is still useful, but the customer had attached this data in an earlier comment.
(In reply to Luke Stanton from comment #4)
> Created attachment 1424759 [details]
> Older output from du /cassandra_data/data/hawkular_metrics
> I don't know if this is still useful but customer had attached this data in
> an earlier comment.
Definitely useful. It does look like the customer is hitting bug 1567222. As a temporary workaround until the fix is pushed out, run:
$ oc -n openshift-infra exec <cassandra pod> -- nodetool clearsnapshot
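To confirm the snapshots were dropped and the space was reclaimed, something like the following should work (pod name is a placeholder):
$ oc -n openshift-infra exec <cassandra pod> -- nodetool listsnapshots
$ oc -n openshift-infra exec <cassandra pod> -- du -h /cassandra_data/data/hawkular_metrics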
*** This bug has been marked as a duplicate of bug 1567222 ***