Description of problem:
The Cassandra container used in Hawkular Metrics is configured by default with auto_snapshot=true, which creates a snapshot every time a table is dropped. These snapshots are filling up the PVs and causing outages for metrics.
Seeing lots of snapshots going back several months.
Version-Release number of selected component (if applicable):
Snapshots can be cleared with:
# run against each cassandra pod
$ oc -n openshift-infra exec <cassandra pod> nodetool clearsnapshot hawkular_metrics
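To avoid running the command by hand for each pod, the workaround above could be wrapped in a small loop. This is only a sketch: the function name is made up for illustration, and it assumes the Cassandra pod names contain "hawkular-cassandra", which may not hold in every deployment.

```shell
#!/usr/bin/env bash
# Sketch only: clear hawkular_metrics snapshots on every Cassandra pod.
# Assumption: Cassandra pod names contain "hawkular-cassandra".
# clear_cassandra_snapshots is a hypothetical helper, not part of any tool.
clear_cassandra_snapshots() {
  local ns="${1:-openshift-infra}"
  local pod
  for pod in $(oc -n "$ns" get pods -o name | grep hawkular-cassandra); do
    # "oc get -o name" prefixes the resource type; keep only the pod name.
    pod="${pod##*/}"
    oc -n "$ns" exec "$pod" -- nodetool clearsnapshot hawkular_metrics
  done
}
```

Calling `clear_cassandra_snapshots` with no argument targets the openshift-infra project; pass a different project name as the first argument if metrics is deployed elsewhere.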
We need a better way of managing the snapshots
This issue also exists in 3.9 and will have to be fixed for 3.10 as well. I will create separate BZs. When auto_snapshot is true, Cassandra creates a snapshot before truncating or dropping a table. Prior to 3.7, hawkular-metrics wrote raw data to a single table. The data would subsequently get compressed and written to the data_compressed table. Starting in 3.7, raw data is written to a different table every two hours, so there is a separate raw data table per two-hour block. After a two-hour block is compressed and written to the data_compressed table, its raw table is dropped. With auto_snapshot enabled, every one of those drops leaves a snapshot behind, and over time the snapshots eat up disk space as reported.
Part of the solution is to turn off auto_snapshot. We do not need to snapshot the raw tables because we only drop them after the compressed version of the data is successfully stored.
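For illustration, flipping the setting in a cassandra.yaml file might look like the sketch below. The function name is hypothetical, the config path is passed in by the caller because this report does not say where the image keeps it, and the actual fix may set the value differently (for example in the image's config template).

```shell
#!/usr/bin/env bash
# Sketch only: set auto_snapshot to false in a given cassandra.yaml.
# disable_auto_snapshot is a hypothetical helper; the config path is
# supplied by the caller and is not taken from this bug report.
disable_auto_snapshot() {
  local conf="$1"
  local tmp
  tmp="$(mktemp)"
  # Rewrite the existing top-level auto_snapshot line; other lines are untouched.
  sed 's/^auto_snapshot:.*/auto_snapshot: false/' "$conf" > "$tmp" && cat "$tmp" > "$conf"
  rm -f "$tmp"
}
```

Cassandra reads cassandra.yaml at startup, so a change like this only takes effect after the node is restarted.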
I do think it would be good to have the ability to create a snapshot whenever schema changes are being made. This happens at installation time. The Cassandra image has a post startup script. I would like to add a flag to that script that determines whether or not to generate a snapshot of the keyspace (i.e., all tables). The flag will be turned off by default because we don't want to generate snapshots every time the container starts up. The installer though can override that flag so that we do generate snapshots at installation/upgrade time. This can be controlled with a property in the inventory file.
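Roughly, the flag check in the post-startup script could look like the following. This is a sketch of the idea, not the actual script: the function name is hypothetical, and TAKE_SNAPSHOT is assumed to be the environment variable that carries the flag.

```shell
#!/usr/bin/env bash
# Sketch only: snapshot the keyspace when the flag is turned on.
# maybe_take_snapshot is a hypothetical name; TAKE_SNAPSHOT is assumed to be
# the environment variable the installer overrides.
maybe_take_snapshot() {
  local keyspace="${1:-hawkular_metrics}"
  if [ "${TAKE_SNAPSHOT:-false}" = "true" ]; then
    # Naming a keyspace snapshots all of its tables.
    nodetool snapshot "$keyspace"
  else
    echo "TAKE_SNAPSHOT is not true; skipping snapshot of $keyspace"
  fi
}
```

Defaulting to "false" when the variable is unset keeps ordinary container restarts from generating snapshots, so only an explicit override from the installer triggers one.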
Paul, does this sound like a reasonable solution?
Yes that sounds reasonable to me
The default value of openshift_metrics_cassandra_take_snapshot is false, in which case the environment variable TAKE_SNAPSHOT is also false. When openshift_metrics_cassandra_take_snapshot is set to true, TAKE_SNAPSHOT is true and a snapshot can be taken.
*** Bug 1570140 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.