Bug 1435436 - Exceptions in Hawkular-Metrics pod
Summary: Exceptions in Hawkular-Metrics pod
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.5.z
Assignee: John Sanda
QA Contact: Liming Zhou
URL:
Whiteboard:
Depends On: 1433061
Blocks:
 
Reported: 2017-03-23 19:45 UTC by Viet Nguyen
Modified: 2020-06-11 13:28 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1433061
Environment:
Last Closed: 2017-10-05 19:15:24 UTC
Target Upstream Version:
Embargoed:


Attachments
metric server log (11.74 KB, application/zip) - 2017-03-23 19:47 UTC, Viet Nguyen
cnode-tpstats (2.91 KB, text/plain) - 2017-03-23 22:51 UTC, Viet Nguyen
cnode logs (1.55 MB, application/zip) - 2017-03-24 19:05 UTC, Viet Nguyen
cnode info (18.60 KB, text/plain) - 2017-03-24 19:51 UTC, Viet Nguyen
ansible inventory file for metrics install (678 bytes, text/plain) - 2017-03-24 19:56 UTC, Viet Nguyen
15 minutes raw metrics via VPN (123.88 KB, text/plain) - 2017-03-24 22:06 UTC, Viet Nguyen
30 minutes raw metrics on Master node (10.63 KB, application/zip) - 2017-03-24 22:34 UTC, Viet Nguyen
OOM logs (82.41 KB, application/zip) - 2017-04-10 14:50 UTC, Viet Nguyen
cassandra oom log (230 pods) (1.43 MB, application/zip) - 2017-04-12 14:42 UTC, Viet Nguyen
Rerun against OSE 3.5 (120.07 KB, application/zip) - 2017-07-10 22:49 UTC, Viet Nguyen

Description Viet Nguyen 2017-03-23 19:45:25 UTC
Description of problem:

- A lot of these exceptions:

2017-03-23 18:57:19,001 ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (RxComputationScheduler-15) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception


Environment
- 2 bare metal RHEL 7 servers, 64G RAM each, 1 regular node, 1 infra/master node
- 239 pods of a simple Node.js app on the regular node

Version-Release number of selected component (if applicable):
- OSE: 3.5
- Metrics: 3.5
- RHEL 7.2


How reproducible:
100%

Steps to Reproduce:
1. Install OSE, Metrics using ansible playbook
2. Install Hawkular OpenShift Agent
3. Deploy some user pods, scale up to 239 pods
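
For reference, step 3 can be done with a single scale command; the sketch below assumes the test application runs from a DeploymentConfig named nodejs-app (that name is an assumption, it is not recorded in this bug):

  # scale the (hypothetical) test app up to 239 replicas
  oc scale dc/nodejs-app --replicas=239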

Actual results:
- OpenShift console shows Mem/CPU, Network graphs as expected, but Hawkular metrics server log has a lot of RESTEASY002020 exceptions (see attached log)

Comment 1 Viet Nguyen 2017-03-23 19:47:29 UTC
Created attachment 1265877 [details]
metric server log

Comment 2 Viet Nguyen 2017-03-23 19:53:29 UTC
Clarification: the exceptions occurred when I ran the Python client against Metrics to gather raw metrics over a 30-minute period.
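
For context, the request below is roughly the kind of query such a client sends to the Hawkular Metrics REST API; the route, tenant, and metric id are placeholders, and older Hawkular Metrics releases expose raw data points under /data instead of /raw:

  # fetch 30 minutes of raw data points for one gauge (timestamps are epoch milliseconds)
  curl -k -H "Authorization: Bearer $(oc whoami -t)" \
       -H "Hawkular-Tenant: <project-name>" \
       "https://<hawkular-metrics-route>/hawkular/metrics/gauges/<metric-id>/raw?start=1490300000000&end=1490301800000"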

Comment 3 Matt Wringe 2017-03-23 21:27:05 UTC
@viet: can you run and paste the output of:

oc exec -it ${HAWKULAR_CASSANDRA_POD_NAME} nodetool tpstats

@jsanda: can you take a look?
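
In case anyone else needs to reproduce this, a sketch of looking the pod up first; the openshift-infra project and the metrics-infra=hawkular-cassandra label are the defaults used by the metrics installer, so adjust them if your install differs:

  # find the Cassandra pod deployed by the metrics installer
  oc get pods -n openshift-infra -l metrics-infra=hawkular-cassandra
  # then run nodetool inside it
  oc exec -it <hawkular-cassandra-pod> -n openshift-infra -- nodetool tpstats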

Comment 4 Viet Nguyen 2017-03-23 22:51:48 UTC
Created attachment 1265916 [details]
cnode-tpstats

Comment 5 John Sanda 2017-03-24 02:40:51 UTC
(In reply to Viet Nguyen from comment #2)
> Clarification: the exceptions occurred when I ran the Python client against
> Metrics to gather raw metrics over a 30-minute period.

Viet can you also attach Cassandra logs including the debug.log file? Can you either point me to or attach the python code you are running?

Comment 6 Viet Nguyen 2017-03-24 19:05:02 UTC
Created attachment 1266228 [details]
cnode logs

Comment 7 John Sanda 2017-03-24 19:25:52 UTC
Thanks for the Cassandra logs, Viet. There are a lot of GC pauses, which are almost certainly causing a lot of the problems you are seeing. If your cluster is still up, can you provide the output of the following commands run in your Cassandra pod:

nodetool tablehistograms hawkular_metrics data
nodetool tablehistograms hawkular_metrics metrics_idx
nodetool tablestats hawkular_metrics

If your Cassandra pod has been restarted since you ran your python script, then the output won't be as useful because the metrics will reset on restart. If the pod has restarted, could you run your Python program again and then get the nodetool output? Thanks.
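
A sketch of collecting all three outputs in one pass from outside the pod (the pod name and project are assumptions based on a default metrics install):

  for cmd in "tablehistograms hawkular_metrics data" \
             "tablehistograms hawkular_metrics metrics_idx" \
             "tablestats hawkular_metrics"; do
    oc exec <hawkular-cassandra-pod> -n openshift-infra -- nodetool $cmd
  done > nodetool-output.txt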

Comment 8 Viet Nguyen 2017-03-24 19:42:51 UTC
Python client
https://github.com/vnugent/pyme/tree/master/pyme

Comment 9 Viet Nguyen 2017-03-24 19:51:23 UTC
Created attachment 1266231 [details]
cnode info

Comment 10 Viet Nguyen 2017-03-24 19:56:06 UTC
Created attachment 1266232 [details]
ansible inventory file for metrics install

As you can see, each pod was given 2G.

Comment 11 Viet Nguyen 2017-03-24 22:06:10 UTC
Created attachment 1266272 [details]
15 minutes raw metrics via VPN

Comment 12 Viet Nguyen 2017-03-24 22:34:36 UTC
Created attachment 1266274 [details]
30 minutes raw metrics on Master node

Comment 13 Michael Burman 2017-03-25 16:27:49 UTC
So it would appear the slow VPN is the cause of these issues?

Comment 14 John Sanda 2017-03-26 03:03:36 UTC
While the problem may be network connectivity, there are two other issues I think we need to address. First, the error reporting coming out of Hawkular Metrics is pretty poor. We have no way of knowing, for example, which endpoints were involved. In this particular case, Viet gave us all of the relevant information; however, in production support tickets that is often not the case. The second issue is the GC activity in the Cassandra log. Cassandra was logging lots of warnings about long GC pauses for what does not seem to be a heavy read load. We probably need to look at doing some tuning. We need upstream Jira tickets for these. I will create them and refer back to this ticket.

Comment 15 John Sanda 2017-03-29 15:53:38 UTC
Viet,

Can you run your python program again and provide GC logs for Cassandra? I would like to review GC logs so we can do some tuning to reduce GC pause times.
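
In case it helps, GC logging can normally be enabled by adding the standard JDK 8 flags to Cassandra's JVM options; in a 3.0-era image that usually means cassandra-env.sh, and the log path below is only an assumption:

  # standard JDK 8 GC logging flags
  JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
  JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime"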

Comment 17 Viet Nguyen 2017-04-10 14:50:22 UTC
Created attachment 1270510 [details]
OOM logs

- Increased number of metrics per pod to 30 (30s collection interval, heapster disabled)
- Python client (running on Master) triggered OOM in Cassandra log

Comment 18 Viet Nguyen 2017-04-12 14:42:27 UTC
Created attachment 1271174 [details]
cassandra oom log (230 pods)

Comment 19 Michael Burman 2017-04-12 15:00:02 UTC
This seems to be a bug in Cassandra 3.0.9 that was fixed in 3.0.11:

https://issues.apache.org/jira/browse/CASSANDRA-13114
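
A quick way to confirm which Cassandra release a given image is actually running (pod name and project are placeholders):

  oc exec <hawkular-cassandra-pod> -n openshift-infra -- nodetool version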

Comment 21 Matt Wringe 2017-04-24 15:57:14 UTC
What else do we need to do with this? Are we going to need to update our releases from 3.0.9 to 3.0.11?

Comment 22 Michael Burman 2017-04-25 09:31:49 UTC
3.0.13 is the newest and it has the fix for that Netty SSL / OpenSSL bug. In theory, the following tickets could have some effect:

https://issues.apache.org/jira/browse/CASSANDRA-13126
https://issues.apache.org/jira/browse/CASSANDRA-13221

One option is of course to try to pick up those fixes (to reduce potential SSL errors), although these errors should not usually happen in our environment.

The bug itself happens because Netty uses off-heap buffers to manage SSL connections, and those buffers are not necessarily freed quickly enough: our version of Cassandra runs with DisableExplicitGC, which suppresses the explicit GC that would free them when off-heap memory looks like it is running out. The problem with allowing explicit GC is that it would technically be possible to flood Cassandra in a way that triggers "stop-the-world" GCs, which could reduce performance during those pauses.

For stability reasons, though, we might think about removing that option from our Cassandra configuration, since it will prevent the out-of-direct-memory errors.

This is also discussed in the first ticket linked above.
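
For reference, the option in question is a plain JVM flag in the Cassandra startup options (cassandra-env.sh in 3.0-era images; the exact file is an assumption). Removing or commenting out the line re-enables explicit GC so the JVM can reclaim Netty's direct buffers:

  # remove or comment out this line to allow explicit GC to free direct buffers
  JVM_OPTS="$JVM_OPTS -XX:+DisableExplicitGC"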

Comment 23 John Sanda 2017-04-25 13:53:26 UTC
Given how far back the dates have been moved for OCP 3.6, it makes sense for us to move to Cassandra 3.0.13.

Comment 24 Matt Wringe 2017-04-25 21:05:22 UTC
Do we need this for 3.6 only? Or would we need to backport the update? It looks like this issue is affecting 3.5

Comment 25 John Sanda 2017-04-28 01:03:37 UTC
We should backport this to 3.5 as well. What about 3.4?

Comment 26 John Sanda 2017-06-27 13:32:12 UTC
Viet, can you retest with the latest 3.5 images? You may have hit a memory leak bug that was addressed by https://bugzilla.redhat.com/show_bug.cgi?id=1457501.

Comment 27 Viet Nguyen 2017-07-10 22:49:56 UTC
Created attachment 1295969 [details]
Rerun against OSE 3.5

Comment 28 Viet Nguyen 2017-07-10 23:10:33 UTC
Re-ran against a new OSE 3.5 bare metal cluster (3 x 64 GB RAM nodes; the master only contains the default and openshift-infra projects). 400 test pods.

Comment 29 John Sanda 2017-07-11 00:40:58 UTC
There is a lot of old gen GC, which may be the main culprit. The young generation of the heap is set to 2400 MB, while the max heap size, i.e., the total heap size, is set to 953 MB. I am surprised that didn't cause an error that would prevent the JVM from starting. In any event, I suggest increasing the max heap size to at least 2 GB and setting the young gen (via the HEAP_NEWSIZE env var) to half the max heap.
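
A sketch of what that suggestion looks like in practice, assuming the replication controller is named hawkular-cassandra-1 and that the image honors the standard MAX_HEAP_SIZE variable alongside HEAP_NEWSIZE (only HEAP_NEWSIZE is named above, so MAX_HEAP_SIZE is an assumption):

  # 2 GB max heap, young gen at half the max heap
  oc set env rc/hawkular-cassandra-1 -n openshift-infra MAX_HEAP_SIZE=2G HEAP_NEWSIZE=1G
  # replication controllers do not redeploy on env changes, so restart the pod to pick them up
  oc delete pod -l metrics-infra=hawkular-cassandra -n openshift-infra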

Comment 30 Matt Wringe 2017-10-05 19:15:24 UTC
Closing this issue as INSUFFICIENT_DATA; please reopen if we get the data requested in comment 29.

