Description of problem:

A lot of these exceptions:

2017-03-23 18:57:19,001 ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (RxComputationScheduler-15) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception

Environment:
- 2 bare metal RHEL 7 servers, 64G RAM each; 1 regular node, 1 infra/master node
- 239 pods of a simple Node.js app on the regular node

Version-Release number of selected component (if applicable):
- OSE: 3.5
- Metrics: 3.5
- RHEL 7.2

How reproducible:
100%

Steps to Reproduce:
1. Install OSE and Metrics using the ansible playbook
2. Install the Hawkular OpenShift Agent
3. Deploy some user pods and scale up to 239 pods

Actual results:
- The OpenShift console shows Memory/CPU and Network graphs as expected, but the Hawkular Metrics server log contains a lot of RESTEASY002020 exceptions (see attached log)
Created attachment 1265877 [details] metric server log
Clarification: the exceptions occurred when I ran the Python client against Metrics to gather raw metrics over a 30-minute period.
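For context, the client queries the Hawkular Metrics REST API for raw data over a 30-minute window, roughly along these lines (a sketch only -- the route host, tenant, and metric id are placeholders, not taken from the actual client code):

START=$(($(date +%s%3N) - 30*60*1000))   # 30 minutes ago, in milliseconds
curl -k \
  -H "Authorization: Bearer $(oc whoami -t)" \
  -H "Hawkular-Tenant: myproject" \
  "https://hawkular-metrics.example.com/hawkular/metrics/gauges/<metric-id>/raw?start=${START}"

With this endpoint each request covers one metric id (URL-encoded if it contains slashes), so pulling raw data for a few hundred pods adds up to a lot of reads against the server.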
@viet: can you run and paste the output of:

oc exec -it ${HAWKULAR_CASSANDRA_POD_NAME} nodetool tpstats

@jsanda: can you take a look?
Created attachment 1265916 [details] cnode-tpstats
(In reply to Viet Nguyen from comment #2)
> Clarification: the exceptions occurred when I ran the Python client against
> Metrics to gather raw metrics over a 30-minute period.

Viet, can you also attach the Cassandra logs, including the debug.log file? Can you either point me to or attach the Python code you are running?
Created attachment 1266228 [details] cnode logs
Thanks for the Cassandra logs, Viet. There are a lot of GC pauses, which are almost certainly causing many of the problems you are seeing. If your cluster is still up, can you provide the output of the following commands run in your Cassandra pod:

nodetool tablehistograms hawkular_metrics data
nodetool tablehistograms hawkular_metrics metrics_idx
nodetool tablestats hawkular_metrics

If your Cassandra pod has been restarted since you ran your Python script, the output won't be as useful because these statistics reset on restart. If the pod has restarted, could you run your Python program again and then get the nodetool output? Thanks.
Python client https://github.com/vnugent/pyme/tree/master/pyme
Created attachment 1266231 [details] cnode info
Created attachment 1266232 [details] ansible inventory file for metrics install

As you can see, each pod was given 2G.
Created attachment 1266272 [details] 15 minutes raw metrics via VPN
Created attachment 1266274 [details] 30 minutes raw metrics on Master node
So it would appear the slow VPN is the cause of these issues?
While the problem may be network connectivity, there are two other issues I think we need to address.

First, the error reporting coming out of Hawkular Metrics is pretty poor. We have no way of knowing what endpoints were involved, for example. In this particular case Viet gave us all of the relevant information; however, in production support tickets that is often not the case.

The second issue is the GC activity in the Cassandra log. Cassandra was logging lots of warnings for long GC pauses under what does not seem to be a heavy read load. We probably need to look at doing some tuning.

We need upstream Jira tickets for these. I will create them and refer back to this ticket.
Viet, can you run your Python program again and provide the GC logs for Cassandra? I would like to review the GC logs so we can do some tuning to reduce GC pause times.
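In case it helps with gathering them: if GC logging is already enabled in the image, the log can be pulled straight from the pod; otherwise it can be turned on with the standard HotSpot flags. The log path below is the usual cassandra-env.sh default and may differ in our image:

oc exec -it ${HAWKULAR_CASSANDRA_POD_NAME} -- cat /var/log/cassandra/gc.log

# Typical Java 8 flags that produce the log, if it is not already enabled:
# -Xloggc:/var/log/cassandra/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime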
Created attachment 1270510 [details] OOM logs

- Increased the number of metrics per pod to 30 (30s collection interval, heapster disabled)
- The Python client (running on the Master) triggered an OOM in the Cassandra log
Created attachment 1271174 [details] cassandra oom log (230 pods)
This seems to be a bug in Cassandra 3.0.9 that was fixed in 3.0.11: https://issues.apache.org/jira/browse/CASSANDRA-13114
What else do we need to do with this? Are we going to need to update our releases from 3.0.9 to 3.0.11?
3.0.13 is the newest release and it has the fix for that Netty SSL / OpenSSL bug. In theory, the following tickets could also have some effect:

https://issues.apache.org/jira/browse/CASSANDRA-13126
https://issues.apache.org/jira/browse/CASSANDRA-13221

One option, of course, is to try to pick up those fixes as well (to reduce potential SSL errors), although those errors should not usually happen in our environment.

The bug itself happens because Netty uses off-heap (direct) buffers to manage SSL connections, and those buffers are not necessarily freed quickly enough: our Cassandra configuration sets DisableExplicitGC, which disables the explicit GC that would otherwise free them when off-heap memory looks like it is running out. The problem with allowing explicit GC is that it would technically be possible to flood Cassandra and trigger "stop-the-world" GC pauses that reduce performance while they run. For stability reasons, though, we might think about removing that option from our Cassandra configuration - it would prevent the out-of-direct-memory errors. This is also discussed in the first ticket above.
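For clarity, the option under discussion is the standard HotSpot flag set in the Cassandra JVM arguments (whether it lives in cassandra-env.sh or jvm.options depends on how our image is put together):

-XX:+DisableExplicitGC

With that flag removed, the explicit System.gc() that the JDK/Netty triggers when direct memory runs low is allowed to run and reclaim the off-heap buffers, at the cost of occasional full "stop-the-world" collections.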
Given how far back the dates have been moved for OCP 3.6, it makes sense for us to move to Cassandra 3.0.13.
Do we need this for 3.6 only, or would we need to backport the update? It looks like this issue is affecting 3.5.
We should backport this to 3.5 as well. What about 3.4?
Viet, can you retest with the latest 3.5 images? You may have hit a memory leak bug that was addressed by https://bugzilla.redhat.com/show_bug.cgi?id=1457501.
Created attachment 1295969 [details] Rerun against OSE 3.5
Re-ran against a new OSE 3.5 bare metal cluster (3 x 64GB RAM nodes; the master only contains the default and openshift-infra projects). 400 test pods.
There is a lot of old gen GC, which may be the main culprit. The young generation of the heap is set to 2400 MB while the max heap size, i.e., the total heap size, is set to 953 MB. I am surprised that didn't cause an error that would prevent the JVM from starting. In any event, I suggest increasing the max heap size to at least 2 GB and setting the young gen (via the HEAP_NEWSIZE env var) to half the max heap.
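If it saves a round trip, something along these lines should apply those sizes -- assuming the image honors the standard MAX_HEAP_SIZE variable alongside HEAP_NEWSIZE, and that the Cassandra pods are managed by an rc named hawkular-cassandra-1 (adjust names to the actual deployment):

oc set env rc/hawkular-cassandra-1 MAX_HEAP_SIZE=2G HEAP_NEWSIZE=1G -n openshift-infra
# existing pods keep the old environment; delete the Cassandra pod so the rc recreates it with the new values
oc delete pod ${HAWKULAR_CASSANDRA_POD_NAME} -n openshift-infra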
Closing this issue as Insufficient_Data; please reopen if we get the data requested in comment 29.