Bug 1567222

Summary: Metrics Cassandra PV running out of space due to snapshots
Product: OpenShift Container Platform
Component: Hawkular
Version: 3.7.0
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Target Milestone: ---
Target Release: 3.7.z
Reporter: Paul Dwyer <pdwyer>
Assignee: John Sanda <jsanda>
QA Contact: Junqi Zhao <juzhao>
CC: aos-bugs, jsanda, lstanton, pdwyer, rvargasp
Doc Type: If docs needed, set a value
Clones: 1567250 (view as bug list)
Last Closed: 2018-06-07 08:40:56 UTC
Type: Bug
Bug Depends On: 1567251
Bug Blocks: 1570140

Description Paul Dwyer 2018-04-13 15:17:06 UTC
Description of problem:
The Cassandra container used in Hawkular Metrics is configured by default with auto_snapshot=true, so a snapshot is created every time a table is dropped. These snapshots are filling up the PVs and causing outages for metrics.

Seeing lots of snapshots going back several months

Version-Release number of selected component (if applicable):
OpenShift 3.7

How reproducible:
Continuously

Snapshots can be cleared with:
# run against each Cassandra pod
$ oc -n openshift-infra exec <cassandra pod> -- nodetool clearsnapshot hawkular_metrics
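
Before clearing them, the accumulated snapshots can be listed to confirm what will be removed (a sketch, assuming the bundled Cassandra version supports nodetool listsnapshots, i.e. 2.1 or later):

# list snapshots and their on-disk size on each Cassandra pod
$ oc -n openshift-infra exec <cassandra pod> -- nodetool listsnapshots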

We need a better way of managing the snapshots

Comment 4 John Sanda 2018-04-13 16:17:05 UTC
This issue also exists in 3.9 and will have to be fixed for 3.10 as well; I will create separate BZs. When auto_snapshot is true, Cassandra creates a snapshot before truncating or dropping a table. Prior to 3.7, hawkular-metrics wrote raw data to a single table, and that data was subsequently compressed and written to the data_compressed table. Starting in 3.7, raw data is written to a different table every two hours, so there is a separate raw data table per two-hour block. After a two-hour block is compressed and written to the data_compressed table, the raw table is dropped. Over time this leads to a lot of snapshots that eat up disk space, as reported.
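
For reference, the per-table snapshot directories can also be inspected directly on the PV (a sketch; the /cassandra_data mount path is an assumption based on the usual metrics Cassandra deployment, adjust if your data directory differs):

# show disk usage of the snapshots directory under each hawkular_metrics table
# (Cassandra lays data out as <data_dir>/<keyspace>/<table-id>/snapshots)
$ oc -n openshift-infra exec <cassandra pod> -- \
    sh -c 'du -sh /cassandra_data/data/hawkular_metrics/*/snapshots'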

Part of the solution is to turn off auto_snapshot. We do not need to snapshot the raw tables because we only drop them after the compressed version of the data has been successfully stored.
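
For reference, auto_snapshot is a standard setting in cassandra.yaml; disabling it would look like the following (a sketch of the relevant cassandra.yaml entry, not the exact change shipped in the image):

# cassandra.yaml: do not snapshot tables before TRUNCATE or DROP
auto_snapshot: false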

I do think it would be good to have the ability to create a snapshot whenever schema changes are being made. This happens at installation time. The Cassandra image has a post-startup script. I would like to add a flag to that script that determines whether or not to generate a snapshot of the keyspace (i.e., all tables). The flag will be turned off by default because we don't want to generate snapshots every time the container starts up. The installer, though, can override that flag so that we do generate snapshots at installation/upgrade time. This can be controlled with a property in the inventory file.
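
A minimal sketch of how that could be wired together, using the inventory variable and environment variable names later verified in comment 17 (the script fragment is illustrative, not the actual image script):

# inventory: request a keyspace snapshot at installation/upgrade time
# [OSEv3:vars]
# openshift_metrics_cassandra_take_snapshot=true

# post-startup script fragment (illustrative): snapshot the keyspace before applying schema changes
if [ "${TAKE_SNAPSHOT:-false}" = "true" ]; then
    nodetool snapshot hawkular_metrics
fi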

Paul, does this sound like a reasonable solution?

Comment 5 Paul Dwyer 2018-04-13 16:25:22 UTC
Thanks John,
Yes that sounds reasonable to me

Comment 17 Junqi Zhao 2018-05-28 06:30:28 UTC
The default value of openshift_metrics_cassandra_take_snapshot is false, and the environment variable TAKE_SNAPSHOT is also false by default. When openshift_metrics_cassandra_take_snapshot is set to true, TAKE_SNAPSHOT is set to true and a snapshot can be taken.
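
A sketch of how this can be verified against a running pod (the pod name is a placeholder, as above):

# confirm the flag reached the Cassandra container
$ oc -n openshift-infra exec <cassandra pod> -- env | grep TAKE_SNAPSHOT
TAKE_SNAPSHOT=true

# confirm a snapshot of the hawkular_metrics keyspace exists
$ oc -n openshift-infra exec <cassandra pod> -- nodetool listsnapshots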

openshift-ansible version
openshift-ansible-3.7.51-1.git.0.f9b681c.el7.noarch

Images:
openshift3-metrics-cassandra-v3.7.51-1
metrics-hawkular-metrics-v3.7.51-1
openshift3-metrics-heapster-v3.7.51-1

Comment 18 Junqi Zhao 2018-05-31 01:40:48 UTC
*** Bug 1570140 has been marked as a duplicate of this bug. ***

Comment 20 errata-xmlrpc 2018-06-07 08:40:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1798