Description of problem:
Hawkular is failing to return data and is using a high amount of CPU (244% on a 4-core system).
Version-Release number of selected component (if applicable):
3.3.0 image id is bb2780a2bbdb
Steps to Reproduce:
1. Unknown. We restarted the pod; it worked fine for about an hour, then started returning 500 errors.
The pod was restarted and worked fine for an hour or so, then started erroring on calls to the URL. Calls to the IP just hang (possibly a timeout at the ELB, or the router pod closing the connection).
Can you please provide the (last part of the) log of the hawkular-metrics container?
Created attachment 1239942 [details]
It is the tail end of the log
Wesley, do you know if the customer has applied this change and if so has it made any difference?
The 3.4.1 image does have this code change:
metrics-deployer 3.4.1 b9377449e4c2 5 hours ago 864.5 MB
metrics-hawkular-metrics 3.4.1 ea4c68d376ca 18 hours ago 1.5 GB
oc edit rc/hawkular-metrics -n openshift-infra
@whearn, if you still see this issue, you can re-open this bug.
I believe we are also having this issue. It is unclear from above what the fix actually is. Are the new versions of metrics required?
Sorry, I am not sure why exactly we had the instructions set to private.
Posting them below:
The issue appears to be with the netty-transport-native-epoll library that the Cassandra driver in Hawkular Metrics is using. This is an optional library, and we can configure it to use NIO instead.
The steps to follow are as below:
Note: you do not need a new image to test this out, but we are setting this to the default in new releases.
1) `oc edit rc hawkular-metrics`
2) Under the container command section, add the following line.
E.g., it should look something like this:
3) scale down Hawkular Metrics to 0 and then back up.
From the logs you should be able to see a message indicating: "com.datastax.driver.FORCE_NIO is defined, using value true"
Hawkular Metrics should continue to function as normal when these changes are applied.
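Putting steps 1–3 together, a minimal sketch of the edited rc is below. The wrapper script path and the `-b 0.0.0.0` arguments are assumptions about a typical deployment; the system property itself matches the log message quoted above.

```yaml
spec:
  containers:
  - name: hawkular-metrics
    command:
    - /opt/hawkular/scripts/hawkular-metrics-wrapper.sh   # existing entrypoint (assumed)
    - -b
    - 0.0.0.0
    - -Dcom.datastax.driver.FORCE_NIO=true                # the added line
```

After saving, scale the replication controller to 0 and back up so the new command takes effect.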
Something else you may want to try is setting CPU resource limits on your Hawkular Metrics pods. This will limit the amount of CPU the pod can use, but it doesn't address the underlying issue of the pod wanting so much CPU in the first place.
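A hedged sketch of what such a limit could look like under the container spec in the rc (edit it with `oc edit rc hawkular-metrics -n openshift-infra`; the values here are illustrative only):

```yaml
resources:
  limits:
    cpu: 500m        # e.g. half a core
    memory: 4000Mi
```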
Thank you. We found that cloudforms made the problem happen for us. Our logs were identical. Our OpenShift is 3.3.
07:12:56,597 ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (RxComputationScheduler-2) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception
Caused by: java.io.IOException: Connection reset by peer …
After restart we see this -
09:56:31,421 WARN [com.datastax.driver.core.SystemProperties] (metricsservice-lifecycle-thread) com.datastax.driver.FORCE_NIO is defined, using value true
09:56:31,432 INFO [com.datastax.driver.core.NettyUtil] (metricsservice-lifecycle-thread) Found Netty's native epoll transport in the classpath, but NIO was forced through the FORCE_NIO system property.
(In reply to Joe Stewart from comment #16)
> Thank you. We found that cloudforms made the problem happen for us. Our logs
> were identical. Our OpenShift is 3.3.
> 07:12:56,597 ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n]
> (RxComputationScheduler-2) RESTEASY002020: Unhandled asynchronous exception,
> sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770:
> Response is committed, can't handle exception
> Caused by: java.io.IOException: Connection reset by peer …
> After restart we see this -
> 09:56:31,421 WARN [com.datastax.driver.core.SystemProperties]
> (metricsservice-lifecycle-thread) com.datastax.driver.FORCE_NIO is defined,
> using value true
> 09:56:31,432 INFO [com.datastax.driver.core.NettyUtil]
> (metricsservice-lifecycle-thread) Found Netty's native epoll transport in
> the classpath, but NIO was forced through the FORCE_NIO system property.
> Monitoring now.
Would you please report back to let us know whether the change makes any difference? Thanks.
That did not seem to help us. The exceptions look similar and the excessive CPU consumption continues. Attaching our logs.
Created attachment 1244277 [details]
hawkular metrics logs post FORCE_NIO=true
We just opened a support case: 01779821.
Are any Cassandra timeouts set (default_timeout, connect_timeout, control_connection_timeout, request_timeout)? If so, please increase them and test again.
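As a sketch only: the parameter names above are quoted from the question, and where they are set depends on the client. For `cqlsh` they would live in a `cqlshrc`-style file; other drivers expose equivalents under different names. The values below are illustrative, not recommendations.

```ini
[connection]
; illustrative values only; increase from the defaults and re-test
connect_timeout = 30
request_timeout = 60
```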
Please set the following sysctl parameters on the host system, and test:
# The first set of settings is intended to open up the network stack performance by
# raising memory limits and adjusting features for high-bandwidth/low-latency
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.netdev_max_backlog = 2500
net.core.somaxconn = 65000
net.ipv4.tcp_ecn = 0
net.ipv4.tcp_window_scaling = 1
net.ipv4.ip_local_port_range = 10000 65535
# this block is designed for and only known to work in a single physical DC
# TODO: validate on multi-DC and provide alternatives
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_fack = 1
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_orphan_retries = 1
# significantly reduce the amount of data the kernel is allowed to store
# in memory between fsyncs
# dirty_background_bytes says how many bytes can be dirty before the kernel
# starts flushing in the background. Set this as low as you can get away with.
# It is basically equivalent to trickle_fsync=true but more efficient since the
# kernel is doing it. Older kernels will need to use vm.dirty_background_ratio
vm.dirty_background_bytes = 10485760
# Same deal as dirty_background but the whole system will pause when this threshold
# is hit, so be generous and set this to a much higher value, e.g. 1GB.
# Older kernels will need to use dirty_ratio instead.
vm.dirty_bytes = 1073741824
# disable zone reclaim for IO-heavy applications like Cassandra
vm.zone_reclaim_mode = 0
# there is no good reason to limit these on server systems, so set them
# to 2^31 to avoid any issues
# Very large values in max_map_count may cause instability in some kernels.
fs.file-max = 1073741824
vm.max_map_count = 1073741824
# only swap if absolutely necessary
# some kernels have trouble with 0 as a value, so stick with 1
vm.swappiness = 1
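To persist these settings across reboots, they can go in a sysctl drop-in file. A minimal sketch follows; the file name and the subset of keys are illustrative, and `SYSCTL_DIR` is a stand-in so the snippet can be dry-run locally. On a real host, write to `/etc/sysctl.d` (requires root) and apply with `sysctl --system`.

```shell
# Where to write the drop-in; defaults to the current directory for a dry run.
# On a real host: SYSCTL_DIR=/etc/sysctl.d (as root).
SYSCTL_DIR="${SYSCTL_DIR:-.}"

# Write a subset of the recommended tunings to a drop-in file.
cat > "$SYSCTL_DIR/99-cassandra-tuning.conf" <<'EOF'
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.somaxconn = 65000
vm.zone_reclaim_mode = 0
vm.swappiness = 1
EOF

# Apply all drop-in files (root only):
# sysctl --system
```

Note that tuned may reapply its own sysctls depending on the `reapply_sysctl` setting discussed below, so verify the running values with `sysctl <key>` afterwards.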
(In reply to Joe Stewart from comment #18)
> Did not seem to help us. Exceptions look similar and excessive CPU consumed
> continues. Attaching our logs.
Thanks for the quick feedback, Joe. I can see from the stack traces in your logs that there are REST endpoints being called to fetch metric data points. This looks consistent with other stack traces I have seen for this issue. Do you know if your environment has any consumers of metric data besides the OpenShift console? And by any chance, do you know how far back the queries are looking for data, e.g., 2 hours, a day, a week?
We're using CloudForms 4.1. As far as we can tell, CloudForms is what breaks it: when it's off, everything seems fine; turn it on, and metrics falls over. That said, we have a 5-node, 3-infra, 3-master cluster with only a handful of non-OpenShift containers running right now, so we are not 100% sure whether CloudForms is truly the cause or whether it's just consuming enough to hit whatever limit breaks metrics.
We're using whatever length of time the default cloudforms provider uses. (We're having trouble finding out how to check or modify that.)
We'll look at the tuning stuff mentioned above.
We're using the OpenShift-set tuned settings and are a little wary of changing them.
[root@qa0zpsnode1005 ~]# cat /etc/tuned/active_profile
[root@qa0zpsnode1005 ~]# cat /etc/tuned/tuned-main.conf
# Global tuned configuration file.
# Whether to use daemon. Without daemon it just applies tuning. It is
# not recommended, because many functions don't work without daemon,
# e.g. there will be no D-Bus, no rollback of settings, no hotplug,
# no dynamic tuning, ...
daemon = 1
# Dynamicaly tune devices, if disabled only static tuning will be used.
dynamic_tuning = 0
# How long to sleep before checking for events (in seconds)
# higher number means lower overhead but longer response time.
sleep_interval = 1
# Update interval for dynamic tunings (in seconds).
# It must be multiply of the sleep_interval.
update_interval = 10
# Recommend functionality, if disabled "recommend" command will be not
# available in CLI, daemon will not parse recommend.conf but will return
# one hardcoded profile (by default "balanced").
recommend_command = 1
# Whether to reapply sysctl from the e.g /etc/sysctl.conf, /etc/sysctl.d, ...
# If enabled these sysctls will be re-appliead after Tuned sysctls are
# applied, i.e. Tuned sysctls will not override system sysctls.
reapply_sysctl = 1
Created attachment 1244357 [details]
tuned profiles applied by openshift
Created attachment 1244358 [details]
sysctl running values
Joe - the logs are truncated, cutting off the last 32 lines of errors.
When CPU utilization goes high again, please get a full thread dump of the java process consuming the CPU - run 'kill -3 <pid>' and attach the output.
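A sketch of collecting that dump from inside the pod (the label selector is an assumption; adjust to your environment). `kill -3` sends SIGQUIT, which makes the JVM write the thread dump to its stdout, so it appears in the container log rather than on the terminal:

```shell
# Find the hawkular-metrics pod (label is an assumption; adjust as needed).
POD=$(oc get pods -n openshift-infra -l metrics-infra=hawkular-metrics -o name | head -1)
POD=${POD#pod/}

# Send SIGQUIT to the JVM inside the container.
oc exec -n openshift-infra "$POD" -- sh -c 'kill -3 $(pgrep -f java | head -1)'

# The thread dump shows up in the container log.
oc logs -n openshift-infra "$POD" | tail -n 300
```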
Created attachment 1244813 [details]
Created attachment 1244814 [details]
Did anyone try setting a CPU limit on Hawkular Metrics as mentioned here: https://bugzilla.redhat.com/show_bug.cgi?id=1391731#c46
I don't believe that is the root cause of this, but if it's reproducible with the resource limit set, it might give us more information to dig into the root cause.
Set a limit of 500 millicores and 4000 MB.
To be clear, the issue isn't just the high CPU: the stats in the UI take 25 seconds or so to show up, occasionally have gaps, and are occasionally unavailable.
We did a tcpdump between cloudforms and the hawkular pods and see a lot of connection resets. The problem seems to be somehow related to that.
Created attachment 1244841 [details]
packet capture with connection reset issue
This is from cloudforms. The connection resets line up with the hawkular errors.
Do we know the exact tag/hash for the docker images being used here? Just to make sure what I am testing with is exactly the same.
I am trying to trigger a similar reaction using various rest calls, but so far I have not been able to get anything close to this issue to occur.
I just did an `oc describe all` on the project. Please confirm that has what you need.
Can you please see if there is a newer docker image of the version you are using?
The sha for the docker image mentioned does not match the latest one we have.
We think we have the correct one. We have a disconnected install and had to pull/push the images through a couple of registries to get there. That may have changed the ID#. We're confirming.
We believe we have the same (correct) image. Attaching extended info for each. Can you confirm we're correct here?
Created attachment 1245969 [details]
local image info
Created attachment 1245970 [details]
redhat image info
Created attachment 1245985 [details]
Please make the last attachment private.
It's a capture from cloudforms side.
Definitely not happy - This is with cloudforms running.
Readiness probe failed: Failed to access the status endpoint : timed out. This may be due to Hawkular Metrics not being ready yet. Will try again.
49 times in the last 18 hours
Liveness probe failed: Failed to access the status endpoint : timed out. Hawkular metrics has been running for 13738 seconds. Aborting
Liveness probe failed: Failed to access the status endpoint : timed out. Hawkular metrics has been running for 12969 seconds. Aborting
Liveness probe failed: Failed to access the status endpoint : timed out. Hawkular metrics has been running for 10879 seconds. Aborting
Liveness probe failed: Failed to access the status endpoint : timed out. Hawkular metrics has been running for 9728 seconds. Aborting
Ran this for a while. Uploading.
sh-4.2$ while true ; do top -H -n 1 ; /etc/alternatives/java_sdk_1.8.0/bin/jstack 212 ; date ; echo "+++++++++++++++++++++++++++NEW THREAD++++++++++++++++" ; sleep 1 ; done
Created attachment 1246394 [details]
even more dumps
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
The solution we had in place did not actually fix the issue, so I am reopening this bug as it is not yet resolved.
*** Bug 1391731 has been marked as a duplicate of this bug. ***
The root cause of this issue has been determined to be the compaction strategy in use; it is the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1416850
*** This bug has been marked as a duplicate of bug 1416850 ***