Bug 1567222 - Metrics Cassandra PV running out of space due to snapshots
Summary: Metrics Cassandra PV running out of space due to snapshots
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.7.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.7.z
Assignee: John Sanda
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1570140 (view as bug list)
Depends On: 1567251
Blocks: 1570140
 
Reported: 2018-04-13 15:17 UTC by Paul Dwyer
Modified: 2021-09-09 13:42 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1567250 (view as bug list)
Environment:
Last Closed: 2018-06-07 08:40:56 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:1798 0 None None None 2018-06-07 08:41:31 UTC

Description Paul Dwyer 2018-04-13 15:17:06 UTC
Description of problem:
The Cassandra container used in Hawkular Metrics is configured by default with auto_snapshot=true, which creates a snapshot every time a table is dropped. These snapshots are filling up the PVs and causing metrics outages.

We are seeing many snapshots going back several months.
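
For reference, existing snapshots and their on-disk sizes can be listed per node with nodetool (a minimal sketch, assuming the image ships a Cassandra version that includes nodetool listsnapshots; <cassandra pod> is a placeholder):

# run against each cassandra pod to see which snapshots exist and how much space they use
$ oc -n openshift-infra exec <cassandra pod> -- nodetool listsnapshots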

Version-Release number of selected component (if applicable):
OpenShift 3.7

How reproducible:
Continuously

Snapshots can be cleared with:
# run against each cassandra pod
$ oc -n openshift-infra exec <cassandra pod> -- nodetool clearsnapshot hawkular_metrics
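
To clear snapshots across all Cassandra pods in one pass, a loop along these lines should work (a sketch; the metrics-infra=hawkular-cassandra label selector is an assumption about how the pods are labeled):

# clear the hawkular_metrics snapshots on every Cassandra pod
$ for pod in $(oc -n openshift-infra get pods -l metrics-infra=hawkular-cassandra -o name); do
      oc -n openshift-infra exec "${pod##*/}" -- nodetool clearsnapshot hawkular_metrics
  done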

We need a better way of managing these snapshots.

Comment 4 John Sanda 2018-04-13 16:17:05 UTC
This issue also exists in 3.9 and will have to be fixed for 3.10 as well. I will create separate BZs. When auto_snapshot is true, Cassandra creates a snapshot before truncating or dropping a table. Prior to 3.7, hawkular-metrics wrote raw data to a single table. The data would subsequently get compressed and written to the data_compressed table. Starting in 3.7, raw data is written to a different table every two hours, so there is a separate raw data table per two-hour block. After a two-hour block is compressed and written to the data_compressed table, the raw table is dropped. Over time this leads to a lot of snapshots that eat up disk space, as reported.

Part of the solution is to turn off auto_snapshot. We do not need to snapshot the raw tables because we only drop them after the compressed version of the data has been successfully stored.
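
Concretely, that corresponds to the following setting (a sketch; the exact location of cassandra.yaml inside the metrics image may differ):

# cassandra.yaml -- stop snapshotting tables on TRUNCATE/DROP
auto_snapshot: false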

I do think it would be good to have the ability to create a snapshot whenever schema changes are being made, which happens at installation time. The Cassandra image has a post-startup script. I would like to add a flag to that script that determines whether or not to generate a snapshot of the keyspace (i.e., all tables). The flag will be off by default because we do not want to generate snapshots every time the container starts up. The installer, however, can override that flag so that we do generate snapshots at installation/upgrade time. This can be controlled with a property in the inventory file (see the sketch below).
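
A sketch of what that inventory property could look like (the variable name openshift_metrics_cassandra_take_snapshot matches what was later verified in comment 17):

# Ansible inventory -- opt in to a keyspace snapshot at install/upgrade time
[OSEv3:vars]
openshift_metrics_cassandra_take_snapshot=true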

Paul, does this sound like a reasonable solution?

Comment 5 Paul Dwyer 2018-04-13 16:25:22 UTC
Thanks John,
Yes, that sounds reasonable to me.

Comment 17 Junqi Zhao 2018-05-28 06:30:28 UTC
The default value of openshift_metrics_cassandra_take_snapshot is false, and the environment variable TAKE_SNAPSHOT is also false by default. When openshift_metrics_cassandra_take_snapshot is set to true, TAKE_SNAPSHOT is set to true and a snapshot can be taken.
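
One way to confirm the setting on a running pod (a sketch; <cassandra pod> is a placeholder):

# verify that TAKE_SNAPSHOT is set in the Cassandra container environment
$ oc -n openshift-infra exec <cassandra pod> -- env | grep TAKE_SNAPSHOT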

openshift-ansible version
openshift-ansible-3.7.51-1.git.0.f9b681c.el7.noarch

Images:
openshift3-metrics-cassandra-v3.7.51-1
metrics-hawkular-metrics-v3.7.51-1
openshift3-metrics-heapster-v3.7.51-1

Comment 18 Junqi Zhao 2018-05-31 01:40:48 UTC
*** Bug 1570140 has been marked as a duplicate of this bug. ***

Comment 20 errata-xmlrpc 2018-06-07 08:40:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1798

