Created attachment 1309070
Logs after increasing heap size

Description of problem:
A dedicated cluster customer reported seeing "An error occurred getting metrics" message in their OpenShift console. I looked at the logs for the hawkular-metrics pod and saw numerous instances of the message:

11:40:06,667 ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (default task-161) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception

See attached logs for full traceback and related Cassandra/Heapster logs.

Version-Release number of selected component (if applicable):
oc v3.4.1.18
kubernetes v1.4.0+776c994
It looks like the problem here is that the system is currently overloaded and can't keep up with the requests.

Has anyone tried scaling up the number of Cassandra instances to 2? The Hawkular Metrics instance can also be scaled to 2.

Note: for Cassandra, you cannot just scale up its RC. You need to either set the deployer parameter for the number of Cassandra nodes (e.g. CASSANDRA_NODES) or use the template to deploy another instance.

To use the template (assuming you are using a persistent volume and want a 100Gi persistent volume), you can run the following:

$ oc process hawkular-cassandra-node-pv \
    -v IMAGE_VERSION=3.4.1 \
    -v PV_SIZE=100Gi \
    -v NODE=2

If you are not using a persistent volume, the name of the template is just 'hawkular-cassandra-node-emptydir' and you don't need to set the PV_SIZE option.

The output of 'oc get pods -o yaml -n openshift-infra' is also usually required as an attachment for metrics Bugzillas.
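The two template variants described above can be sketched as a small dry-run helper that only prints the command it would run. The function name cassandra_node_cmd is hypothetical; the template and parameter names are the ones quoted in this comment, and passing IMAGE_VERSION and NODE to the emptydir template is an assumption based on the note that only PV_SIZE can be dropped.

```shell
# Dry-run helper: prints the `oc process` command for adding a Cassandra node.
# It does not call oc itself, so it is safe to run anywhere.
# cassandra_node_cmd is a hypothetical name used for illustration only.
cassandra_node_cmd() {
  node="$1"           # index of the new Cassandra node, e.g. 2
  image_version="$2"  # image tag matching the deployed metrics version, e.g. 3.4.1
  pv_size="${3:-}"    # e.g. 100Gi; leave empty to use the emptydir template

  if [ -n "$pv_size" ]; then
    # Persistent volume variant: PV_SIZE is required.
    echo "oc process hawkular-cassandra-node-pv -v IMAGE_VERSION=$image_version -v PV_SIZE=$pv_size -v NODE=$node"
  else
    # Non-persistent variant: same parameters, minus PV_SIZE (assumption).
    echo "oc process hawkular-cassandra-node-emptydir -v IMAGE_VERSION=$image_version -v NODE=$node"
  fi
}

cassandra_node_cmd 2 3.4.1 100Gi
cassandra_node_cmd 2 3.4.1
```

Note that 'oc process' only renders the template into a list of objects. Piping the result to 'oc create -f -' is the usual way to actually create them, but whether that is the intended workflow here is an assumption.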
As suggested, the customer has changed the setup to 2 Cassandra instances and 2 Hawkular containers. Attached is the output of "oc get pods -o yaml -n openshift-infra".

However, in the log of the first Hawkular container they see the following error again:

05:58:13,317 ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (default task-205) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception
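To compare how often the error occurs before and after a change such as the scale-up, one rough approach is to count the matching lines in a saved copy of the pod log. This is a minimal sketch: the helper name count_resteasy_errors is hypothetical, and the sample file contains only the two log lines quoted in this report.

```shell
# Count RESTEASY002020 errors in a saved hawkular-metrics log file.
# count_resteasy_errors is a hypothetical helper name.
count_resteasy_errors() {
  # grep -c prints the number of matching lines. It exits non-zero when
  # there are no matches, so `|| true` keeps the helper from failing.
  grep -c 'RESTEASY002020' "$1" || true
}

# Demo on a temporary file holding the two log lines quoted in this report.
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
11:40:06,667 ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (default task-161) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception
05:58:13,317 ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (default task-205) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception
EOF
count_resteasy_errors "$sample_log"
rm -f "$sample_log"
```

In practice the input would be a log saved with 'oc logs' for the hawkular-metrics pod in the openshift-infra namespace; the pod name varies per deployment.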
Can you update this BZ once images including HWKMETRICS-733 are available? I see the ticket is resolved, but I don't know whether it has already been publicly released.
(In reply to Ruben Romero Montes from comment #16)
> Can you update this BZ once images including HWKMETRICS-733 are available?
> I see ticket is resolved but I don't know whether it is already publicly
> released.

The Jira tickets are resolved because the changes have been merged into the upstream branches and pushed out into the upstream hawkular-metrics releases.

This ticket is still set to ASSIGNED, though, since new images are not yet available. We will update this ticket when those images are available.

Thanks