Description of problem:
A dedicated cluster customer reported seeing an "An error occurred getting metrics" message in their OpenShift console. I looked at the logs for the hawkular-metrics pod and saw numerous instances of the following message:

ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (RxComputationScheduler-1) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception

See the attached logs for the full traceback and the related Cassandra/Heapster logs.

Version-Release number of selected component (if applicable):
oc v3.4.1.18
kubernetes v1.4.0+776c994
Created attachment 1295241 [details] another set of tablestats
Created attachment 1295248 [details] tem2r6b3-node-logs
Hello Matt, please find the attached file logs-13-07-17.tar.gz with the requested information. Lex.
The reason for the failure in the attached logs is exit code 137, which means the process was sent a SIGKILL signal. So the process itself didn't exit in error; something killed it. The describe output doesn't show any events, but that is likely because events are only stored for a certain amount of time. I suspect this might have been caused by the liveness probe failing, but there is no way for me to confirm that from this information. The most likely explanation is that Hawkular Metrics is under so much load that it is taking too long to respond to the liveness probe. This could simply be a situation where you need to run more Hawkular Metrics or Cassandra instances, or give them more resources. @jsanda: any thoughts on this?
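For reference, one way to confirm the exit code and look for probe-related events is sketched below; the pod name and the openshift-infra project are placeholders for whatever your deployment actually uses:

  # Show the exit code of the last terminated hawkular-metrics container (137 = SIGKILL)
  oc get pod <hawkular-metrics-pod> -n openshift-infra \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

  # Liveness probe failures show up as events while they are still retained
  oc get events -n openshift-infra | grep -i 'unhealthy\|killing'

If the events have already aged out, as appears to be the case here, the jsonpath query above is the main remaining evidence.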
Alexander, the logs you attached in comment 87 only include the hawkular-metrics logs. Can you also upload the Cassandra and Heapster logs?
Hello John,

Please find the attached logs. Here are the current limits and requests:

OT Cassandra:
  Limits:   cpu: 8,    memory: 10G
  Requests: cpu: 1,    memory: 8G
OT Hawkular:
  Limits:   cpu: 2,    memory: 5000Mi
  Requests: cpu: 500m, memory: 2000Mi

E2E Cassandra:
  Limits:   cpu: 8,    memory: 10G
  Requests: cpu: 1,    memory: 8G
E2E Hawkular:
  Limits:   cpu: 2,    memory: 5000Mi
  Requests: cpu: 500m, memory: 2000Mi
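As a quick cross-check that these values are actually applied to the running pods (the pod names and the openshift-infra project below are placeholders, not taken from the logs):

  # Print the resources block of the first container in each pod
  oc get pod <cassandra-pod> -n openshift-infra -o jsonpath='{.spec.containers[0].resources}'
  oc get pod <hawkular-metrics-pod> -n openshift-infra -o jsonpath='{.spec.containers[0].resources}'

If the limits or requests need to be raised, they live in the corresponding replication controllers or deployment configs and can be adjusted with oc edit.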
Created attachment 1302480 [details] logs after reducing chunk size in hawkular_metrics
The Cassandra logs look much better. There is still a good bit of GC activity that could be problematic. I suggest increasing the size of the new generation to address messages like:

INFO 06:39:06 ParNew GC in 419ms. CMS Old Gen: 1150861008 -> 1167920776; Par Eden Space: 671088640 -> 0; Par Survivor Space: 28148448 -> 26667320

Those are stop-the-world collections, meaning all application threads are paused. I recommend increasing the new generation to half of the total heap. This can be done by setting the HEAP_NEWSIZE environment variable. It looks like the total heap is 2384 MB, so I would set HEAP_NEWSIZE=1192M.
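A sketch of how that could be applied, assuming the Cassandra pods are managed by a replication controller named hawkular-cassandra-1 in the openshift-infra project (adjust the names to match your deployment):

  # Set the new generation to half of the 2384 MB heap
  oc set env rc/hawkular-cassandra-1 HEAP_NEWSIZE=1192M -n openshift-infra

  # The change only takes effect once the Cassandra pod is recreated with the new environment
  oc delete pod <cassandra-pod> -n openshift-infra

Older oc clients expose the same command as oc env instead of oc set env.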
This ticket has been open for quite a while and unfortunately has ended up covering different issues that probably should have had separate tickets. Can we close this?
Created attachment 1308790 [details] Logs after increasing heap size
I am closing this out since the reporter has not replied as to whether or not the problem has been resolved. This ticket wound up being used to cover several different issues. If there are still problems related to this ticket, please open new tickets.