Created attachment 1244797 [details] cassandra log
Description of problem: Cassandra gets OOM killed due to a high number of tombstone cells. As a result, heapster loses its connection to Cassandra and gets into a restart loop.
Version-Release number of selected component (if applicable): 3.3.1.0
How reproducible: Always in our staging environment.
Steps to Reproduce: 1. 2. 3.
Actual results:
Expected results:
Additional info:
Created attachment 1244799 [details] heapster log
@wesley: can you please attach the Cassandra and Hawkular Metrics logs?
Can you also please provide the following:
* Data retention used
* Date ranges for queries. I would like to know if it is for the past hour, past day, past week, etc.
* Output of `nodetool tablestats hawkular_metrics` when the OOME happens
Created attachment 1244804 [details] hawkular log
* Data retention used - It is the default, which I believe is 7 days.
* Date ranges for queries - On the web console it is a max of 1 week.
Created attachment 1244805 [details] cassandra2 log
I should have a patch to test later today.
Can you provide the output of the following:
for f in /cassandra_data/data/hawkular_metrics/data-*/*Data.db; do
  meta=$(/opt/apache-cassandra/tools/bin/sstablemetadata "$f")
  echo -e "Max:" $(date --date=@$(echo "$meta" | grep Maximum\ time | cut -d" " -f3 | cut -c 1-10) '+%m/%d/%Y') \
    "Min:" $(date --date=@$(echo "$meta" | grep Minimum\ time | cut -d" " -f3 | cut -c 1-10) '+%m/%d/%Y') \
    $(echo "$meta" | grep droppable) ' \t ' \
    $(ls -lh "$f" | awk '{print $5" "$6" "$7" "$8" "$9}')
done | sort
This will help in configuring the new compaction strategy.
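For anyone reading the loop above, the timestamp handling can be isolated into a small sketch. It assumes sstablemetadata prints lines of the form "Maximum timestamp: <microseconds since epoch>", which is what the grep and cut calls rely on. The function name and the sample value are hypothetical and not taken from the attached logs, and `-u` is added here only so the result does not depend on the local time zone (GNU date is assumed).

```shell
# Convert one "Maximum timestamp: ..." / "Minimum timestamp: ..." line to a date.
# ts_line_to_date is a hypothetical helper name used only for illustration.
ts_line_to_date() {
  local secs
  # Third space-separated field is the timestamp in microseconds;
  # its first 10 digits are the epoch seconds.
  secs=$(echo "$1" | cut -d" " -f3 | cut -c 1-10)
  date -u --date=@"$secs" '+%m/%d/%Y'
}

# Hypothetical sample line, not real output from this environment.
ts_line_to_date "Maximum timestamp: 1485435600000000"
```

Truncating to 10 characters works for current epoch values because they have 10 digits in seconds; it is a shortcut rather than a general unit conversion.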
We are running Cassandra 2.2.7 in OpenShift 3.3.1. Time Window Compaction Strategy (TWCS) was first included in later versions of Cassandra, but we can still use it in 2.2.7: the TWCS jar file needs to be placed in the Cassandra lib directory. Once that is done, we can change the compaction strategy with CQL commands. Matt, can you assist with creating a new image that includes the TWCS jar? And to be clear, I am not asking to backport any changes. Right now I want to make the change only to Wesley's environment in hopes that it makes things more stable.
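As a rough sketch of the CQL change described above, the statement could be generated like this. The class name is the one that later appears in the Hawkular Metrics log in this bug, and hawkular_metrics.data is inferred from the SSTable paths. The function name and the window unit and size defaults are assumptions for illustration only, not the values chosen for the fix.

```shell
# Print a CQL statement that switches a table to TWCS.
# twcs_cql is a hypothetical helper; the DAYS/1 defaults are illustrative assumptions.
twcs_cql() {
  local table="$1" unit="${2:-DAYS}" size="${3:-1}"
  cat <<EOF
ALTER TABLE ${table} WITH compaction = {
  'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy',
  'compaction_window_unit': '${unit}',
  'compaction_window_size': '${size}'
};
EOF
}

twcs_cql hawkular_metrics.data
```

The printed statement could then be passed to cqlsh, for example with `cqlsh -e "$(twcs_cql hawkular_metrics.data)"`, adding whatever connection and SSL options the environment requires. It will only succeed once the TWCS jar is on the Cassandra classpath.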
Max: 01/26/2017 Min: 01/19/2017 Estimated droppable tombstones: 0.8471775259322556 494M Jan 26 13:00 /cassandra_data/data/hawkular_metrics/data-5d696540dd9611e6a550c1486de5a810/lb-295-big-Data.db
Max: 01/30/2017 Min: 01/26/2017 Estimated droppable tombstones: 0.0 801M Jan 30 00:08 /cassandra_data/data/hawkular_metrics/data-5d696540dd9611e6a550c1486de5a810/lb-708-big-Data.db
Max: 01/30/2017 Min: 01/30/2017 Estimated droppable tombstones: 0.0 198M Jan 30 23:48 /cassandra_data/data/hawkular_metrics/data-5d696540dd9611e6a550c1486de5a810/lb-809-big-Data.db
Max: 01/31/2017 Min: 01/30/2017 Estimated droppable tombstones: 0.0 57M Jan 31 09:26 /cassandra_data/data/hawkular_metrics/data-5d696540dd9611e6a550c1486de5a810/lb-838-big-Data.db
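A quick way to read a report like the one above is to filter it for SSTables whose droppable-tombstone estimate is high. The sketch below assumes the report format shown (ratio in field 8, path in the last field). The function name is hypothetical, and the 0.2 default is an assumption borrowed from Cassandra's default `tombstone_threshold` compaction sub-option, not a value discussed in this bug.

```shell
# Print path and ratio for SSTables whose droppable-tombstone estimate exceeds a limit.
# flag_droppable is a hypothetical helper; it reads the report on stdin.
flag_droppable() {
  awk -v limit="${1:-0.2}" '($8 + 0) > (limit + 0) { print $NF, $8 }'
}

# Two of the report lines from this bug, used as sample input.
flag_droppable <<'EOF'
Max: 01/26/2017 Min: 01/19/2017 Estimated droppable tombstones: 0.8471775259322556 494M Jan 26 13:00 /cassandra_data/data/hawkular_metrics/data-5d696540dd9611e6a550c1486de5a810/lb-295-big-Data.db
Max: 01/30/2017 Min: 01/26/2017 Estimated droppable tombstones: 0.0 801M Jan 30 00:08 /cassandra_data/data/hawkular_metrics/data-5d696540dd9611e6a550c1486de5a810/lb-708-big-Data.db
EOF
```

With this input only lb-295-big-Data.db is listed, which matches the observation that the oldest SSTable holds most of the droppable tombstones.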
Not sure why it cleared his needinfo flag.
I need to verify the data retention or TTL being used, because if it is less than the default of seven days, then console queries going back a week will be scanning tombstones. Wesley, for each of the Data.db files, can you run `/opt/apache-cassandra/tools/bin/sstablemetadata` on it and also provide the file creation time? From that I will be able to determine the data retention.
Created attachment 1248093 [details] ls -l and sstablemetadata dump
*** Bug 1411427 has been marked as a duplicate of this bug. ***
This should be fixed with images openshift3/metrics-cassandra:3.3.1-3 and openshift3/metrics-hawkular-metrics:3.3.1-4, or newer. These images should be in all regular testing areas. Attaching this to errata.
I'm very sorry. The images were built but were not pushed to the testing areas (registry.ops). They have been pushed, and I have verified that they are there now.
# docker pull registry.ops.openshift.com/openshift3/metrics-hawkular-metrics:3.3.1-4
Trying to pull repository registry.ops.openshift.com/openshift3/metrics-hawkular-metrics ...
3.3.1-4: Pulling from registry.ops.openshift.com/openshift3/metrics-hawkular-metrics
239425a20f14: Already exists
019908b75ec4: Already exists
0deb2bff8875: Already exists
b9187e9d6fd8: Already exists
7cac29ec0f61: Already exists
6cc45390c873: Already exists
Digest: sha256:2f5c0f826f39cd3607ef726e901dd129df1d625743a7c52f89da40bf129be3b6
Status: Downloaded newer image for registry.ops.openshift.com/openshift3/metrics-hawkular-metrics:3.3.1-4
# docker pull registry.ops.openshift.com/openshift3/metrics-cassandra:3.3.1-3
Trying to pull repository registry.ops.openshift.com/openshift3/metrics-cassandra ...
3.3.1-3: Pulling from registry.ops.openshift.com/openshift3/metrics-cassandra
7bd78273b666: Pull complete
c196631bd9ac: Pull complete
c18565cb9832: Pull complete
759980c6d702: Pull complete
3a8066aceffb: Pull complete
Digest: sha256:2f6b6f05d8421949d64ebfd00cf0afc970985531ec50c696f1ed2103d9c89f1f
Status: Downloaded newer image for registry.ops.openshift.com/openshift3/metrics-cassandra:3.3.1-3
Verified that the latest 3.3.1 Metrics images use TWCS now.
21:29:54,776 INFO [org.hawkular.metrics.schema.SchemaService] (metricsservice-lifecycle-thread) The compaction strategy for the data table has been updated to com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy
registry.ops.openshift.com/openshift3/metrics-deployer 3.3.1 d79a58d52ca8 3 days ago 759.2 MB
registry.ops.openshift.com/openshift3/metrics-cassandra 3.3.1 6d3670affa15 9 days ago 533.1 MB
registry.ops.openshift.com/openshift3/metrics-hawkular-metrics 3.3.1 306b85a45f53 9 days ago 1.772 GB
registry.ops.openshift.com/openshift3/metrics-heapster 3.3.1 8234c1028f0f 11 days ago 277.8 MB
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0512
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days