Bug 1411427 - Hawkular High CPU usage
Summary: Hawkular High CPU usage
Keywords:
Status: CLOSED DUPLICATE of bug 1416850
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.3.1
Hardware: All
OS: All
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.4.z
Assignee: John Sanda
QA Contact: Peng Li
URL:
Whiteboard:
Duplicates: 1391731
Depends On:
Blocks:
 
Reported: 2017-01-09 17:03 UTC by Wesley Hearn
Modified: 2020-05-14 15:31 UTC
CC List: 16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: In some rare circumstances, Hawkular Metrics would start to consume too much CPU.
Consequence: This could cause the Hawkular Metrics pod to stop responding and metrics to no longer be collected.
Fix: The root of the problem appears to be the Netty native epoll transport used by the Cassandra driver. The pod is now configured to use NIO instead of the native epoll transport.
Result: The Hawkular Metrics pod should no longer fail in this manner due to high CPU usage.
Clone Of:
Environment:
Last Closed: 2017-02-17 21:45:32 UTC
Target Upstream Version:


Attachments
hawkular log (13.39 KB, text/plain)
2017-01-12 14:16 UTC, Wesley Hearn
hawkular metrics logs post FORCE_NIO=true (327.78 KB, text/plain)
2017-01-25 15:15 UTC, Joe Stewart
tuned profiles applied by openshift (469 bytes, text/plain)
2017-01-25 18:46 UTC, Joe Stewart
sysctl running values (84.48 KB, text/plain)
2017-01-25 18:46 UTC, Joe Stewart
thread dump (131.58 KB, text/plain)
2017-01-26 17:39 UTC, Joe Stewart
hot threads (198.65 KB, image/png)
2017-01-26 17:41 UTC, Joe Stewart
packet capture with connection reset issue (876.28 KB, application/octet-stream)
2017-01-26 20:30 UTC, Joe Stewart
local image info (6.38 KB, text/plain)
2017-01-30 18:18 UTC, Joe Stewart
redhat image info (6.47 KB, text/plain)
2017-01-30 18:19 UTC, Joe Stewart
thread dumps (5.46 MB, text/plain)
2017-01-30 19:15 UTC, Joe Stewart
even more dumps (9.17 MB, text/plain)
2017-01-31 18:58 UTC, Joe Stewart


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0218 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.4.1.2 bug fix update 2017-02-01 01:18:20 UTC

Description Wesley Hearn 2017-01-09 17:03:39 UTC
Description of problem:
Hawkular is failing to return data and is using a high amount of CPU (244% on a 4-core system).

Version-Release number of selected component (if applicable):
3.3.0; image ID is bb2780a2bbdb

How reproducible:
Unknown

Steps to Reproduce:
1. Unknown. We restarted the pod; it worked fine for about an hour, then started returning 500 errors.
2.
3.

Actual results:


Expected results:


Additional info:
The pod was restarted and worked fine for an hour or so, then started erroring on calls to the URL. Calls to the IP just hang (maybe a timeout at the ELB, or the router pod closing the connection).

Comment 2 Heiko W. Rupp 2017-01-11 09:36:54 UTC
Can you please provide the (last part of the) log of the hawkular-metrics container?

Comment 3 Wesley Hearn 2017-01-12 14:16:21 UTC
Created attachment 1239942 [details]
hawkular log

It is the tail end of the log

Comment 10 John Sanda 2017-01-20 15:59:28 UTC
Wesley, do you know if the customer has applied this change and if so has it made any difference?

Comment 13 Peng Li 2017-01-25 09:46:20 UTC
The 3.4.1 images do have this code change:

metrics-deployer   3.4.1               b9377449e4c2        5 hours ago         864.5 MB
metrics-hawkular-metrics   3.4.1               ea4c68d376ca        18 hours ago        1.5 GB

oc edit rc/hawkular-metrics -n openshift-infra
        - -Dcom.datastax.driver.FORCE_NIO=true

@whearn, if you still see this issue, you can re-open this bug.

Comment 14 Joe Stewart 2017-01-25 14:28:51 UTC
I believe we are also having this issue. It is unclear from above what the fix actually is. Are the new versions of metrics required?

Comment 15 Matt Wringe 2017-01-25 14:39:43 UTC
Sorry, I am not sure why exactly we had the instructions set to private.

Posting them below:

The issue appears to be with the netty-transport-native-epoll library that the Cassandra driver in Hawkular Metrics is using. This is an optional library, and we can configure it to use NIO instead.

The steps to follow are as below:

Note: you do not need a new image to test this out, but we are setting this to the default in new releases.

1) `oc edit rc hawkular-metrics`

2) under the container command section, please add the following line:

"- -Dcom.datastax.driver.FORCE_NIO=true"

E.g. it should look something like this:

    spec:
      containers:
      - command:
        - /opt/hawkular/scripts/hawkular-metrics-wrapper.sh
        - -b
        - 0.0.0.0
        - -Dhawkular.metrics.cassandra.nodes=hawkular-cassandra
        - -Dhawkular.metrics.cassandra.use-ssl
        - -Dhawkular.metrics.openshift.auth-methods=openshift-oauth,htpasswd
        ....
        - -Dcom.datastax.driver.FORCE_NIO=true
        ....

3) scale down Hawkular Metrics to 0 and then back up.
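
For example, a minimal sketch of step 3 using `oc scale` (assuming the default openshift-infra project mentioned earlier in this bug):

    oc scale rc hawkular-metrics --replicas=0 -n openshift-infra
    # wait for the pod to terminate, then bring it back up
    oc scale rc hawkular-metrics --replicas=1 -n openshift-infra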

From the logs you should be able to see a message indicating: "com.datastax.driver.FORCE_NIO is defined, using value true"

Hawkular Metrics should continue to function as normal when these changes are applied.


Something else you may want to try is to set CPU resource limits on your Hawkular Metrics pods. This will limit the amount of CPU the pod can use, but it doesn't prevent the underlying issue of the pod wanting to use that much CPU.
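
As a rough sketch, a limit like this could be added to the same rc (the container name and the 500m value here are placeholders, not recommendations):

    spec:
      containers:
      - name: hawkular-metrics
        resources:
          limits:
            cpu: 500m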

Comment 16 Joe Stewart 2017-01-25 15:00:19 UTC
Thank you. We found that cloudforms made the problem happen for us. Our logs were identical. Our OpenShift is 3.3.

07:12:56,597 ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (RxComputationScheduler-2) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception
               …
Caused by: java.io.IOException: Connection reset by peer …

After restart we see this -

09:56:31,421 WARN  [com.datastax.driver.core.SystemProperties] (metricsservice-lifecycle-thread) com.datastax.driver.FORCE_NIO is defined, using value true
09:56:31,432 INFO  [com.datastax.driver.core.NettyUtil] (metricsservice-lifecycle-thread) Found Netty's native epoll transport in the classpath, but NIO was forced through the FORCE_NIO system property.

Monitoring now.

-JS

Comment 17 John Sanda 2017-01-25 15:06:34 UTC
(In reply to Joe Stewart from comment #16)
> Thank you. We found that cloudforms made the problem happen for us. Our logs
> were identical. Our OpenShift is 3.3.
> 
> 07:12:56,597 ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n]
> (RxComputationScheduler-2) RESTEASY002020: Unhandled asynchronous exception,
> sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770:
> Response is committed, can't handle exception
>                …
> Caused by: java.io.IOException: Connection reset by peer …
> 
> After restart we see this -
> 
> 09:56:31,421 WARN  [com.datastax.driver.core.SystemProperties]
> (metricsservice-lifecycle-thread) com.datastax.driver.FORCE_NIO is defined,
> using value true
> 09:56:31,432 INFO  [com.datastax.driver.core.NettyUtil]
> (metricsservice-lifecycle-thread) Found Netty's native epoll transport in
> the classpath, but NIO was forced through the FORCE_NIO system property.
> 
> Monitoring now.
> 
> -JS

Would you please report back to let us know whether the change makes any difference. Thanks.

Comment 18 Joe Stewart 2017-01-25 15:14:41 UTC
It did not seem to help us. Exceptions look similar and excessive CPU consumption continues. Attaching our logs.

Comment 19 Joe Stewart 2017-01-25 15:15:39 UTC
Created attachment 1244277 [details]
hawkular metrics logs post FORCE_NIO=true

Comment 20 Joe Stewart 2017-01-25 15:45:34 UTC
We just opened a support case: 01779821

Comment 21 Dan Yocum 2017-01-25 17:25:17 UTC
Some suggestions:

Are any Cassandra timeouts set (default_timeout, connect_timeout, control_connection_timeout, request_timeout)? If so, please increase these and test again.

Please set the following sysctl parameters on the host system, and test:

###############################
# The first set of settings is intended to open up the network stack performance by
# raising memory limits and adjusting features for high-bandwidth/low-latency
# networks.
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.netdev_max_backlog = 2500
net.core.somaxconn = 65000
net.ipv4.tcp_ecn = 0
net.ipv4.tcp_window_scaling = 1
net.ipv4.ip_local_port_range = 10000 65535

# this block is designed for and only known to work in a single physical DC
# TODO: validate on multi-DC and provide alternatives
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_fack = 1
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_orphan_retries = 1

# significantly reduce the amount of data the kernel is allowed to store
# in memory between fsyncs
# dirty_background_bytes says how many bytes can be dirty before the kernel
# starts flushing in the background. Set this as low as you can get away with.
# It is basically equivalent to trickle_fsync=true but more efficient since the
# kernel is doing it. Older kernels will need to use vm.dirty_background_ratio
# instead.
vm.dirty_background_bytes = 10485760

# Same deal as dirty_background but the whole system will pause when this threshold
# is hit, so be generous and set this to a much higher value, e.g. 1GB.
# Older kernels will need to use dirty_ratio instead.
vm.dirty_bytes = 1073741824

# disable zone reclaim for IO-heavy applications like Cassandra
vm.zone_reclaim_mode = 0

# there is no good reason to limit these on server systems, so set them
# to 2^31 to avoid any issues
# Very large values in max_map_count may cause instability in some kernels.
fs.file-max = 1073741824
vm.max_map_count = 1073741824

# only swap if absolutely necessary
# some kernels have trouble with 0 as a value, so stick with 1
vm.swappiness = 1
###############################
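
For reference, a minimal sketch of applying and persisting these on the host: put the settings above into a drop-in file under /etc/sysctl.d/ (the file name is arbitrary) and reload:

    # e.g. the block above saved as /etc/sysctl.d/99-cassandra-tuning.conf
    sysctl --system            # reload all sysctl configuration files
    sysctl net.core.somaxconn  # spot-check one of the new values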

Comment 22 John Sanda 2017-01-25 17:28:12 UTC
(In reply to Joe Stewart from comment #18)
> Did not seem to help us. Exceptions look similar and excessive CPU consumed
> continues. Attaching our logs.

Thanks for the quick feedback, Joe. I can see from the stack traces in your logs that there are REST endpoints being called to fetch metric data points. This looks consistent with other stack traces I have seen for this issue. Do you know if your environment has any consumers of metric data besides the OpenShift console? And by any chance do you know how far back the queries are looking for data, i.e., 2 hours, a day, a week, etc.?

Comment 23 Joe Stewart 2017-01-25 17:40:22 UTC
We're using CloudForms 4.1. As far as we can tell, CloudForms is what breaks it: when it's off, everything seems fine; turn it on, and metrics falls over. Of course, we have a 5-node, 3-infra, 3-master cluster with only a handful of containers running that aren't OpenShift-specific right now, so we are not 100% sure whether CloudForms is truly the cause or whether it is just consuming enough to hit whatever limit breaks metrics.

We're using whatever length of time the default cloudforms provider uses. (We're having trouble finding out how to check or modify that.)

We'll look at the tuning stuff mentioned above.

Comment 24 Joe Stewart 2017-01-25 18:33:31 UTC
We're using the OpenShift-set tuned settings and are a little wary of changing them.

-JS

[root@qa0zpsnode1005 ~]# cat /etc/tuned/active_profile 
atomic-openshift-node-guest
[root@qa0zpsnode1005 ~]# cat /etc/tuned/tuned-main.conf 
# Global tuned configuration file.

# Whether to use daemon. Without daemon it just applies tuning. It is
# not recommended, because many functions don't work without daemon,
# e.g. there will be no D-Bus, no rollback of settings, no hotplug,
# no dynamic tuning, ...
daemon = 1

# Dynamicaly tune devices, if disabled only static tuning will be used.
dynamic_tuning = 0

# How long to sleep before checking for events (in seconds)
# higher number means lower overhead but longer response time.
sleep_interval = 1

# Update interval for dynamic tunings (in seconds).
# It must be multiply of the sleep_interval.
update_interval = 10

# Recommend functionality, if disabled "recommend" command will be not
# available in CLI, daemon will not parse recommend.conf but will return
# one hardcoded profile (by default "balanced").
recommend_command = 1

# Whether to reapply sysctl from the e.g /etc/sysctl.conf, /etc/sysctl.d, ...
# If enabled these sysctls will be re-appliead after Tuned sysctls are
# applied, i.e. Tuned sysctls will not override system sysctls.
reapply_sysctl = 1
[root@qa0zpsnode1005 ~]#

Comment 25 Joe Stewart 2017-01-25 18:46:33 UTC
Created attachment 1244357 [details]
tuned profiles applied by openshift

Comment 26 Joe Stewart 2017-01-25 18:46:55 UTC
Created attachment 1244358 [details]
sysctl running values

Comment 32 Dan Yocum 2017-01-26 16:22:38 UTC
Joe - the logs are truncated, cutting off the last 32 lines of errors.

When CPU utilization goes high again, please get a full thread dump of the java process consuming the CPU - run 'kill -3 <pid>' and attach the output.

Thanks.
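
For anyone following along, a minimal sketch of capturing that dump from the hawkular-metrics pod (pod name and pid are placeholders; with kill -3 the JVM writes the thread dump to stdout, so it lands in the container log rather than on the terminal):

    # find the java pid inside the container
    oc exec <hawkular-metrics-pod> -- ps -ef | grep java
    # request a thread dump (SIGQUIT); the JVM prints it and keeps running
    oc exec <hawkular-metrics-pod> -- kill -3 <pid>
    # pull the dump out of the container log
    oc logs <hawkular-metrics-pod> > thread-dump.txt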

Comment 33 Joe Stewart 2017-01-26 17:39:30 UTC
Done. Attaching.

-JS

Comment 34 Joe Stewart 2017-01-26 17:39:58 UTC
Created attachment 1244813 [details]
thread dump

Comment 35 Joe Stewart 2017-01-26 17:41:01 UTC
Created attachment 1244814 [details]
hot threads

Comment 36 Matt Wringe 2017-01-26 18:38:55 UTC
Did anyone try setting a CPU limit on Hawkular Metrics as mentioned here: https://bugzilla.redhat.com/show_bug.cgi?id=1391731#c46

I don't believe that is the root cause of this, but if it's reproducible by setting the resource limit, it might give us some more information to dig down into the root cause.

Comment 38 Joe Stewart 2017-01-26 20:18:09 UTC
Set a limit of 500 millicores and 4000 MB.

To be clear, the issue isn't just the high CPU: the stats in the UI take 25 seconds or so to show up, occasionally have gaps, and are also occasionally unavailable.

We did a tcpdump between CloudForms and the Hawkular pods and see a lot of connection resets. The problem seems to be somehow related to that.
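
For reference, a rough sketch of such a capture from the CloudForms side (the interface, host, and port here are assumptions to adapt to the environment):

    # capture traffic to the hawkular-metrics endpoint
    tcpdump -i any -w hawkular.pcap host <hawkular-metrics-host> and port 443
    # afterwards, list only the packets with the RST flag set
    tcpdump -r hawkular.pcap 'tcp[tcpflags] & tcp-rst != 0'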

Comment 39 Joe Stewart 2017-01-26 20:30:26 UTC
Created attachment 1244841 [details]
packet capture with connection reset issue

packet capture with connection reset issue.

This is from cloudforms. The connection resets line up with the hawkular errors.

Comment 40 Matt Wringe 2017-01-27 19:23:46 UTC
Do we know the exact tag/hash for the docker images being used here? Just to make sure what I am testing with is exactly the same.

I am trying to trigger a similar reaction using various rest calls, but so far I have not been able to get anything close to this issue to occur.

Comment 41 Joe Stewart 2017-01-27 19:39:34 UTC
I just did an `oc describe all` on the project. Please confirm that has what you need.

(Attaching)

Comment 43 Matt Wringe 2017-01-30 14:31:27 UTC
Can you please see if there is a newer docker image of the version you are using?

The sha for the docker image mentioned does not match the latest one we have.

Comment 44 Joe Stewart 2017-01-30 17:42:50 UTC
We think we have the correct one. We have a disconnected install and had to pull/push the images through a couple of registries to get them there. That may have changed the ID#. We're confirming.
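
For reference, a minimal sketch of one way to compare the two images beyond the ID (the image reference below is an assumption based on the 3.4.1 tag mentioned earlier):

    # image ID plus any registry digests
    docker inspect --format '{{.Id}} {{.RepoDigests}}' openshift3/metrics-hawkular-metrics:3.4.1
    # the build labels embedded in the image also identify the exact build
    docker inspect --format '{{json .Config.Labels}}' openshift3/metrics-hawkular-metrics:3.4.1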

Comment 45 Joe Stewart 2017-01-30 18:18:06 UTC
We believe we have the same (correct) image. Attaching extended info for each. Can you confirm we're correct here?

Comment 46 Joe Stewart 2017-01-30 18:18:46 UTC
Created attachment 1245969 [details]
local image info

Comment 47 Joe Stewart 2017-01-30 18:19:14 UTC
Created attachment 1245970 [details]
redhat image info

Comment 48 Joe Stewart 2017-01-30 19:15:39 UTC
Created attachment 1245985 [details]
thread dumps

As requested.

Comment 51 Joe Stewart 2017-01-30 20:03:53 UTC
Please make the last attachment private.

It's a capture from cloudforms side.

Comment 52 Joe Stewart 2017-01-31 14:00:07 UTC
Definitely not happy. This is with CloudForms running.

8:10:58 AM  Pod hawkular-metrics-mofte  Unhealthy  Readiness probe failed: Failed to access the status endpoint : timed out. This may be due to Hawkular Metrics not being ready yet. Will try again. (49 times in the last 18 hours)
8:10:56 AM  Pod hawkular-metrics-mofte  Unhealthy  Liveness probe failed: Failed to access the status endpoint : timed out. Hawkular metrics has been running for 13738 seconds. Aborting
7:58:07 AM  Pod hawkular-metrics-mofte  Unhealthy  Liveness probe failed: Failed to access the status endpoint : timed out. Hawkular metrics has been running for 12969 seconds. Aborting
7:23:18 AM  Pod hawkular-metrics-mofte  Unhealthy  Liveness probe failed: Failed to access the status endpoint : timed out. Hawkular metrics has been running for 10879 seconds. Aborting
7:04:06 AM  Pod hawkular-metrics-mofte  Unhealthy  Liveness probe failed: Failed to access the status endpoint : timed out. Hawkular metrics has been running for 9728 seconds. Aborting

Comment 54 Joe Stewart 2017-01-31 18:56:26 UTC
Ran this for a while. Uploading.

sh-4.2$ while true ; do top -H -n 1 ; /etc/alternatives/java_sdk_1.8.0/bin/jstack 212 ; date ; echo "+++++++++++++++++++++++++++NEW THREAD++++++++++++++++" ; sleep 1 ; done

Comment 56 Joe Stewart 2017-01-31 18:58:27 UTC
Created attachment 1246394 [details]
even more dumps

Comment 58 errata-xmlrpc 2017-01-31 20:19:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0218

Comment 59 Matt Wringe 2017-01-31 20:22:05 UTC
The solution we had in place did not actually fix the issue; I am reopening this bug as it is not yet resolved.

Comment 60 Matt Wringe 2017-02-09 20:36:36 UTC
*** Bug 1391731 has been marked as a duplicate of this bug. ***

Comment 96 Matt Wringe 2017-02-17 21:45:32 UTC
The root cause of this issue has been determined to be the compaction strategy being used, and is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1416850

*** This bug has been marked as a duplicate of bug 1416850 ***

