Bug 1449844 - RESTEASY002020: Unhandled asynchronous exception, sending back 500
Summary: RESTEASY002020: Unhandled asynchronous exception, sending back 500
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.7.0
Assignee: John Sanda
QA Contact: Liming Zhou
URL:
Whiteboard:
Depends On:
Blocks: OSOPS_V3
 
Reported: 2017-05-10 21:33 UTC by bmorriso
Modified: 2021-09-09 12:17 UTC
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-14 13:37:39 UTC
Target Upstream Version:
Embargoed:


Attachments
another set of tablestats (2.68 KB, application/x-gzip) - 2017-07-07 09:47 UTC, Javier Ramirez
tem2r6b3-node-logs (17.39 MB, application/x-xz) - 2017-07-07 09:59 UTC, Javier Ramirez
logs after reducing chunk size in hawkular_metrics (38.50 KB, application/zip) - 2017-07-21 14:55 UTC, Javier Ramirez
Logs after increasing heap size (65.58 KB, application/zip) - 2017-08-03 16:20 UTC, Javier Ramirez


Links
Red Hat Bugzilla 1457499 (priority unspecified, CLOSED): Upgrade Cassandra to fix Netty memory leak - last updated 2021-02-22 00:41:40 UTC

Internal Links: 1457499

Description bmorriso 2017-05-10 21:33:01 UTC
Description of problem:

A dedicated cluster customer reported seeing the message "An error occurred getting metrics" in their OpenShift console. I looked at the logs for the hawkular-metrics pod and saw numerous instances of the message:

ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (RxComputationScheduler-1) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception

See the attached logs for the full stack trace and the related Cassandra/Heapster output.

Version-Release number of selected component (if applicable):

oc v3.4.1.18
kubernetes v1.4.0+776c994

Comment 81 Javier Ramirez 2017-07-07 09:47:57 UTC
Created attachment 1295241 [details]
another set of tablestats

Comment 82 Javier Ramirez 2017-07-07 09:59:52 UTC
Created attachment 1295248 [details]
tem2r6b3-node-logs

Comment 87 Alexander Koksharov 2017-07-13 10:26:42 UTC
Hello Matt,
please find attached file logs-13-07-17.tar.gz with requested information.

Lex.

Comment 88 Matt Wringe 2017-07-13 13:05:44 UTC
The reason for the failure in the attached logs is exit code 137, which means the process was sent a SIGKILL signal (128 + 9). So the process itself didn't exit with an error; something killed it.

The describe output doesn't show any events, but that is likely because events are only stored for a certain amount of time.

I suspect this might have been caused by the liveness probe failing, but there is no way for me to confirm that from this information. The most likely explanation is that Hawkular Metrics is under so much load that it is taking too long to respond to the liveness probe.
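
If it helps to dig further, the last termination state and the configured probe can be inspected roughly like this (pod names are placeholders and the openshift-infra project is assumed):

  # Exit code/reason of the last terminated container; 137 = 128 + SIGKILL
  oc get pod hawkular-metrics-<hash> -n openshift-infra \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

  # Liveness probe currently configured on the pod
  oc get pod hawkular-metrics-<hash> -n openshift-infra \
    -o jsonpath='{.spec.containers[0].livenessProbe}'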

This could just be a situation where you need to run more Hawkular Metrics or Cassandra instances, or give them more resources.
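
If scaling turns out to be the answer, a rough sketch of the commands (object type and names are assumptions; adjust to however metrics was deployed, e.g. rc vs. dc):

  # Run an additional Hawkular Metrics replica (object name assumed)
  oc scale rc hawkular-metrics --replicas=2 -n openshift-infra

  # Raise limits/requests on the same object (values are examples only)
  oc set resources rc hawkular-metrics \
    --limits=cpu=2,memory=4Gi --requests=cpu=1,memory=2Gi \
    -n openshift-infra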

@jsanda: any thoughts on this?

Comment 89 John Sanda 2017-07-13 13:53:08 UTC
Alexander, the logs you attached in comment 87 only include hawkular-metrics logs. Can you also upload the Cassandra and Heapster logs?

Comment 92 Alexander Koksharov 2017-07-17 10:07:34 UTC
Hello John,

Please find the attached logs. Here are the current limits and requests:

OT
Cassandra
   Limits:
      cpu:      8
      memory:   10G
   Requests:
      cpu:      1
      memory:   8G

Hawkular
   Limits:
      cpu:      2
      memory:   5000Mi
   Requests:
      cpu:      500m
      memory:   2000Mi


E2E
Cassandra
   Limits:
      cpu:      8
      memory:   10G
   Requests:
      cpu:      1
      memory:   8G

Hawkular
   Limits:
      cpu:      2
      memory:   5000Mi
   Requests:
      cpu:      500m
      memory:   2000Mi
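
For completeness, the values above can be checked against what is actually applied to the running pods, e.g. (pod names are placeholders):

  # Full describe output includes the Limits/Requests sections
  oc describe pod hawkular-cassandra-1-<hash> -n openshift-infra

  # Or pull just the resources block from the pod spec
  oc get pod hawkular-metrics-<hash> -n openshift-infra \
    -o jsonpath='{.spec.containers[0].resources}'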

Comment 96 Javier Ramirez 2017-07-21 14:55:06 UTC
Created attachment 1302480 [details]
logs after reducing chunk size in hawkular_metrics

Comment 97 John Sanda 2017-07-21 17:33:52 UTC
The Cassandra logs look much better. There is still a good bit of GC activity that could be problematic. I suggest increasing the size of the new generation to address messages like the following:

INFO  06:39:06 ParNew GC in 419ms.  CMS Old Gen: 1150861008 -> 1167920776; Par Eden Space: 671088640 -> 0; Par Survivor Space: 28148448 -> 26667320

Those are stop-the-world collections, meaning all application threads are paused. I recommend increasing the new generation to half of the total heap, which can be done by setting the HEAP_NEWSIZE environment variable. It looks like the total heap is 2384 MB, so I would set HEAP_NEWSIZE=1192M.
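
For anyone applying this, a rough sketch of setting the variable (the object name hawkular-cassandra-1 and the pod label are assumptions based on a typical metrics deployer install):

  # Set the new-generation size on the Cassandra deployment object (name assumed)
  oc set env rc/hawkular-cassandra-1 HEAP_NEWSIZE=1192M -n openshift-infra

  # Recreate the pod so it picks up the new environment (label assumed)
  oc delete pod -l type=hawkular-cassandra -n openshift-infra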

Comment 98 John Sanda 2017-07-25 19:34:12 UTC
This ticket has been open for quite a while and unfortunately has covered several different issues that probably should have had separate tickets. Can we close this?

Comment 102 Javier Ramirez 2017-08-03 16:20:08 UTC
Created attachment 1308790 [details]
Logs after increasing heap size

Comment 105 John Sanda 2017-08-14 13:37:39 UTC
I am closing this out since the reporter has yet to reply as to whether or not the problem has been resolved. This ticket wound up being used to cover several different issues. If there are still problems related to this ticket, please open new tickets.

