Description of problem:
A little background is necessary. It is a common design with Cassandra to implement indexes as ordinary tables whose primary purpose is to support or optimize one or more queries. Cassandra does have secondary indexes, but they are generally avoided for reasons outside the scope of this ticket. The "tags index" refers specifically to the metrics_tags_idx table.
The metrics_tags_idx table is used in at least a couple of places; for example, the console overview page uses it when rendering graphs of deployments.
Heapster collects and reports on a set of metrics for every pod in an OpenShift cluster. Each of those is stored in a separate time series. Each metric also has a set of tags associated with it. When a new pod is deployed in an OpenShift cluster, Heapster sends HTTP requests to Hawkular Metrics to store the tags for the new metrics associated with the pod. Those tags get stored in the metrics_tags_idx table. The metric data points that Heapster collects are stored elsewhere.
A project in OpenShift can have any number of pods. While the number of concurrently running pods is bounded by physical constraints, if you consider all pods, including those that have been terminated, the number of pods is effectively unbounded.
All of the data points that get stored have an expiration attached to them via a TTL. Metric data points are never stored indefinitely. When pods are deleted, there is no mechanism in place for removing corresponding tags from the metrics_tags_idx table.
There is a background job that runs in Hawkular Metrics that is supposed to help with cleaning up index tables; however, that job was not working as intended and can actually cause OutOfMemoryErrors. See bug 1559440 for details.
In one production cluster I recently saw a warning in the hawkular-metrics log that reported this:
/hawkular/metrics/m/stats/query took: 424148 ms, exceeds 10000 ms threshold, tenant-id: <tenant id>
That is over 7 minutes. I did some more investigation to figure out what was going on. I directly ran a Cassandra query that gets executed by the /stats/query endpoint.
$ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "select count(*) from hawkular_metrics.metrics_tags_idx where tenant_id = '<tenant id>' and tname = 'type' and tvalue = 'pod'"
There were over 1 million rows for just that one tag, and at the time there were only 44 pods in this particular project.
The Cassandra driver has paging built in, and we use a page size of 1,000. This means that more than 1,000 round trips to Cassandra are required to get all of the rows. This alone explains long HTTP response times that often result in exceptions in the hawkular-metrics log like this:
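To make the arithmetic concrete, here is a minimal sketch of the round-trip estimate, using the observed row count and the driver's fetch size of 1,000 (the exact row count above one million is not known, so this uses one million as a lower bound):

```shell
# Each page the driver fetches is one round trip to Cassandra.
rows=1000000     # lower bound on observed rows for the single tag
page_size=1000   # driver fetch size used by Hawkular Metrics

# Ceiling division gives the number of pages, i.e. round trips.
round_trips=$(( (rows + page_size - 1) / page_size ))
echo "round trips: $round_trips"
```

With anything over a million rows, that is at least 1,000 sequential round trips for a single tag query.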
ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (RxComputationScheduler-2) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception
There are some other problems as well. Multiple tags queries, including the one above, get executed for the /stats/query endpoint. The result sets for those queries get fully realized in memory. When dealing with really large result sets, this will cause a lot of heap pressure in the Hawkular Metrics JVM which could result in lots of garbage collection or even possibly an OutOfMemoryError. Excessive GC can seriously degrade performance.
We need to put a proper solution in place for removing rows from the metrics_tags_idx table. This will most likely involve using Kubernetes watch APIs to get notifications of when pods and projects are deleted so that we can perform the necessary cleanup.
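As a very rough sketch of what such a cleanup hook might do once a deletion notification arrives: the function below issues a delete against the tags index for a given tenant. This is illustrative only; it assumes tenant_id alone identifies the rows to remove (which must be verified against the actual table schema), and a real fix would use the Kubernetes watch API from within hawkular-metrics rather than shelling out.

```shell
# Hypothetical cleanup step, NOT the actual implementation.
# $CASSANDRA_POD is assumed to name a running Cassandra pod.
cleanup_tenant_tags() {
  tenant_id=$1
  # Assumes deleting by tenant_id is valid for this table's primary key;
  # verify against the hawkular_metrics schema before relying on this.
  oc -n openshift-infra exec "$CASSANDRA_POD" -- cqlsh --ssl \
    -e "delete from hawkular_metrics.metrics_tags_idx where tenant_id = '$tenant_id'"
}
```

A watcher on project deletion events would then call cleanup_tenant_tags with the tenant id corresponding to the deleted project.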
Version-Release number of selected component (if applicable):
Steps to Reproduce:
There is a bit of a brute-force workaround for this. First, scale down heapster and hawkular-metrics. Next, you need to truncate a couple of tables in Cassandra.
# These commands only need to be run from one Cassandra pod if there
# are multiple Cassandra pods.
$ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "truncate table hawkular_metrics.metrics_tags_idx"
$ oc -n openshift-infra exec <cassandra pod> -- cqlsh --ssl -e "truncate table hawkular_metrics.metrics_idx"
Scale hawkular-metrics back up, then scale heapster back up. On restart, heapster will resend tags to repopulate the metrics_tags_idx and metrics_idx tables. When truncating a table, Cassandra creates a snapshot by default; if anything goes wrong, you can copy the files from the snapshot directory back into the parent directory to effectively revert the changes.
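The steps above could be wrapped in a small function along these lines. This is a sketch only: the namespace, the deployment config names (heapster, hawkular-metrics), and the replica count of 1 are assumptions to check against the actual cluster, and the Cassandra pod name is passed in explicitly rather than discovered.

```shell
# Sketch of the brute-force workaround; verify names and replica
# counts against the actual deployment before running.
run_truncate_workaround() {
  pod=$1   # name of a running Cassandra pod
  ns=openshift-infra

  # Stop writers first so the index tables are not repopulated
  # while we are truncating them.
  oc -n "$ns" scale dc/heapster --replicas=0
  oc -n "$ns" scale dc/hawkular-metrics --replicas=0

  oc -n "$ns" exec "$pod" -- cqlsh --ssl \
    -e "truncate table hawkular_metrics.metrics_tags_idx"
  oc -n "$ns" exec "$pod" -- cqlsh --ssl \
    -e "truncate table hawkular_metrics.metrics_idx"

  # Bring everything back; heapster re-sends tags on restart.
  oc -n "$ns" scale dc/hawkular-metrics --replicas=1
  oc -n "$ns" scale dc/heapster --replicas=1
}
```

Note the ordering: writers go down before the truncates and come back up only afterwards, so no tags are lost mid-truncate.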
Since this issue has not been fixed, we are seeing it appear more and more often (attaching a new customer case).
If the truncate workaround resolves the problem, I would imagine it would only be a temporary fix. How often would you guess we would need to rerun the truncate process to keep the cluster happy?
It is hard to say how often it should be run, maybe weekly. Ruben and I had discussed the possibility of providing a script to automate this. I am going to reassign to him.
Eric, please discuss with Ruben about whether or not providing some automation for this makes sense. Thanks.
Hey John and Ruben,
I think that depends on how the automation would be implemented.
Are you thinking some tooling in the pod, or a script that customers could run to modify things as appropriate?
*** Bug 1614084 has been marked as a duplicate of this bug. ***
Any updates on this?
The customer (02395295) is asking whether this has been fixed.
Associate Manager - Openshift
Created attachment 1640570 [details]
metrics pod logs as requested