Bug 1449844

Summary: RESTEASY002020: Unhandled asynchronous exception, sending back 500
Product: OpenShift Container Platform
Reporter: bmorriso
Component: Hawkular
Assignee: John Sanda <jsanda>
Status: CLOSED WONTFIX
QA Contact: Liming Zhou <lizhou>
Severity: urgent
Priority: urgent
Version: 3.4.1
CC: akaiser, akokshar, aos-bugs, bmorriso, bvincell, erich, erjones, javier.ramirez, jgoulding, jkaur, jsanda, mwhittin, mwringe, rromerom, rvargasp, snegrea, stwalter
Target Milestone: ---
Keywords: OpsBlocker
Target Release: 3.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-14 13:37:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1303130
Attachments:
  another set of tablestats (flags: none)
  tem2r6b3-node-logs (flags: none)
  logs after reducing chunk size in hawkular_metrics (flags: none)
  Logs after increasing heap size (flags: none)

Description bmorriso 2017-05-10 21:33:01 UTC
Description of problem:

A dedicated cluster customer reported seeing the message "An error occurred getting metrics" in their OpenShift console. I looked at the logs for the hawkular-metrics pod and saw numerous instances of the following message:

ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (RxComputationScheduler-1) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception

See the attached logs for the full traceback and the related Cassandra/Heapster logs.

Version-Release number of selected component (if applicable):

oc v3.4.1.18
kubernetes v1.4.0+776c994

Comment 81 Javier Ramirez 2017-07-07 09:47:57 UTC
Created attachment 1295241 [details]
another set of tablestats

Comment 82 Javier Ramirez 2017-07-07 09:59:52 UTC
Created attachment 1295248 [details]
tem2r6b3-node-logs

Comment 87 Alexander Koksharov 2017-07-13 10:26:42 UTC
Hello Matt,
please find the attached file logs-13-07-17.tar.gz with the requested information.

Lex.

Comment 88 Matt Wringe 2017-07-13 13:05:44 UTC
The reason for the failure in the attached logs is exit code 137 (128 + 9), which means the process was sent a SIGKILL signal. So the process itself didn't exit in error; something killed it.

The describe output doesn't show any events, but that is likely because events are only stored for a certain amount of time.

I suspect this might have been caused by the liveness probe failing, but there is no way for me to confirm that from this information. The most likely explanation is that Hawkular Metrics is under so much load that it is taking too long to respond to the liveness probe.
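For illustration only, one way to tolerate slow responses while the load problem is investigated would be to relax the liveness probe timing on the hawkular-metrics container. This is just a sketch; the probe type, endpoint, port, and timing values below are assumptions, not the values from this deployment:

    livenessProbe:
      httpGet:
        path: /hawkular/metrics/status   # assumed status endpoint
        port: 8443
        scheme: HTTPS
      initialDelaySeconds: 180           # give the JVM time to start up
      timeoutSeconds: 10                 # tolerate slow responses under load
      periodSeconds: 30
      failureThreshold: 3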

This could just be a situation where you need to run more Hawkular Metrics or Cassandra instances, or give them more resources.

@jsanda: any thoughts on this?

Comment 89 John Sanda 2017-07-13 13:53:08 UTC
Alexander, the logs you attached in comment 87 only include hawkular-metrics logs. Can you also upload cassandra and heapster logs?

Comment 92 Alexander Koksharov 2017-07-17 10:07:34 UTC
Hello John,

Please find the attached logs. Here are the current limits and requests:

OT
Cassandra
  Limits:
    cpu:      8
    memory:   10G
  Requests:
    cpu:      1
    memory:   8G

Hawkular
  Limits:
    cpu:      2
    memory:   5000Mi
  Requests:
    cpu:      500m
    memory:   2000Mi


E2E
Cassandra
  Limits:
    cpu:      8
    memory:   10G
  Requests:
    cpu:      1
    memory:   8G

Hawkular
  Limits:
    cpu:      2
    memory:   5000Mi
  Requests:
    cpu:      500m
    memory:   2000Mi

Comment 96 Javier Ramirez 2017-07-21 14:55:06 UTC
Created attachment 1302480 [details]
logs after reducing chunk size in hawkular_metrics

Comment 97 John Sanda 2017-07-21 17:33:52 UTC
The Cassandra logs look much better. There is still a good bit of GC activity that could be problematic. I suggest increasing the size of the new generation to address messages like the following:

INFO  06:39:06 ParNew GC in 419ms.  CMS Old Gen: 1150861008 -> 1167920776; Par Eden Space: 671088640 -> 0; Par Survivor Space: 28148448 -> 26667320

Those are stop-the-world collections, meaning all application threads are paused. I recommend increasing the new generation to half of the total heap. This can be done by setting the HEAP_NEWSIZE environment variable. It looks like the total heap is 2384 MB, so I would set HEAP_NEWSIZE=1192M.
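As a rough sketch, on the Cassandra container spec this could look like the snippet below. The env layout and the MAX_HEAP_SIZE entry are assumptions; only the HEAP_NEWSIZE value (half of the 2384 MB total heap mentioned above) comes from this comment:

    env:
      - name: MAX_HEAP_SIZE    # assumed to stay at the current total heap
        value: "2384M"
      - name: HEAP_NEWSIZE     # new generation sized to half of the total heap
        value: "1192M"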

Comment 98 John Sanda 2017-07-25 19:34:12 UTC
This ticket has been open for quite a while and unfortunately has ended up covering different issues that probably should have had separate tickets. Can we close this?

Comment 102 Javier Ramirez 2017-08-03 16:20:08 UTC
Created attachment 1308790 [details]
Logs after increasing heap size

Comment 105 John Sanda 2017-08-14 13:37:39 UTC
I am closing this out since the reporter has yet to reply as to whether or not the problem has been resolved. This ticket wound up being used to cover several different issues. If there are still problems related to this ticket, please open new tickets.