Red Hat Bugzilla – Bug 1472451
Periodically there are Gaps in Charts
Last modified: 2018-03-01 14:37:23 EST
Description of problem:
Pods are stable, not crashing, but periodically there are gaps in the metrics charts.
The customer also noted that the hawkular-metrics pod and one of the two Cassandra pods saw a CPU/memory spike before the gap (but not the other Cassandra pod or Heapster).
Version-Release number of selected component (if applicable):
newest metrics components as of 2017/07/14
Attaching pod logs and screenshots shortly
The gaps I am seeing in the logs seem to indicate the metrics were not being collected during that time. This could be caused by a pod restarting or by one of the pods being unresponsive during that window.
Can you get the output of 'oc get pods -o yaml -n openshift-infra'? That should let us know whether there have been any pod restarts, and the cause of any restart.
That should help us better understand what happened.
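For a quicker view than reading the full YAML, a jsonpath query along these lines should surface the restart counts (a sketch; the namespace matches this cluster, `<pod-name>` is a placeholder):

```shell
# List each pod in openshift-infra with its container restart counts.
# A non-zero count points at a pod that died during the gap.
oc get pods -n openshift-infra \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].restartCount}{"\n"}{end}'

# For a restarted pod, the previous container's termination reason
# (e.g. OOMKilled) shows up in the describe output:
oc describe pod <pod-name> -n openshift-infra | grep -A 2 'Last State'
```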
@jsanda: any idea what would be causing the spikes which are seen for Hawkular Metrics and Cassandra pods? Could the compression job be causing something like this?
I cannot rule out the compression job, but it would not be my first guess. I do not know which 3.4 release we are dealing with, but remember that we had the netty bug in Cassandra 3.0.9 that was pretty bad. As for hawkular-metrics, it is hard to say.
Eric, can you please get logs for hawkular-metrics, cassandra, and heapster.
Tested on 3.6. This issue happens intermittently; most of the time I did not see the error. See the attached picture. Checking the hawkular-metrics pod log, there are UnhandledException and java.io.IOException: Connection reset by peer entries, the same as in this defect:
2017-07-21 07:49:47,260 ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (RxComputationScheduler-1) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
Created attachment 1302208 [details]
metrics gap in UI
Eric can you please provide logs for hawkular-metrics, cassandra, and heapster. Thanks.
The gaps in the charts are because Heapster was not able to push metrics out to Hawkular Metrics.
From the Heapster logs:
E0817 18:07:00.969883 1 client.go:243] Post https://hawkular-metrics:443/hawkular/metrics/counters/data: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
It looks like it's taking Hawkular Metrics too long to process the request (and the Hawkular Metrics logs confirm this with all the 'connection reset by peer' messages).
When this happens, it usually means that the metric pods are overloaded and are not able to keep up with the reads and writes.
You have a few options here:
- horizontally scale the Hawkular Metrics and/or Cassandra pods
- or vertically scale the Hawkular Metrics and/or Cassandra pods (e.g. increase any resource limits they may have applied)
@jsanda: any good recommendation for what to try first?
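To make the two options concrete, a rough sketch of each for the Hawkular Metrics side (the rc name and limit value are assumptions; adjust to what is actually deployed, and note that scaling Cassandra horizontally is a separate, template-based procedure):

```shell
# Horizontal: Hawkular Metrics is stateless, so adding a replica is safe.
oc scale rc hawkular-metrics --replicas=2 -n openshift-infra

# Vertical: raise the memory limit on the pod template instead.
# (New limits take effect when the pods are recreated.)
oc set resources rc/hawkular-metrics --limits=memory=3Gi -n openshift-infra
```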
(In reply to Matt Wringe from comment #26)
> @jsanda: any good recommendation for what to try first?
I am having trouble correlating anything in the Cassandra logs with the Heapster timeout. Eric, can you see if we can get the Cassandra debug logs? They are located in the Cassandra pods at /opt/apache-cassandra/logs/debug.log.
At the prompting of John, I reached out to the customer to confirm the timezone for the cluster:
[root@njrarltapp00171 ~]# timedatectl
Local time: Wed 2017-08-23 16:29:54 EDT
Universal time: Wed 2017-08-23 20:29:54 UTC
RTC time: Wed 2017-08-23 20:29:54
Time zone: America/New_York (EDT, -0400)
NTP enabled: yes
NTP synchronized: yes
RTC in local TZ: no
DST active: yes
Last DST change: DST began at
Sun 2017-03-12 01:59:59 EST
Sun 2017-03-12 03:00:00 EDT
Next DST change: DST ends (the clock jumps one hour backwards) at
Sun 2017-11-05 01:59:59 EDT
Sun 2017-11-05 01:00:00 EST
We are still working on getting the documentation updated to properly reflect how Cassandra should be scaled.
But for 3.4, the instructions are as follows.
If you are installing using the deployer, you can use the 'CASSANDRA_NODES' template parameter to specify how many Cassandra nodes you would like to deploy.
If you are already running a cluster and would just like to deploy another node, then you can add one while the cluster is still running. This needs to be done via a template.
For instance, if you want to deploy a second Cassandra node with 100Gi of disk space and using the 3.4.1 version, then you would run the following command:
$ oc process hawkular-cassandra-node-pv \
    -v IMAGE_VERSION=3.4.1 \
    -v PV_SIZE=100Gi \
    | oc create -f -
If you are not using a persistent volume, then you need to use the hawkular-cassandra-node-emptydir template instead (and you do not need to specify the PV_SIZE parameter).
Note: when you add a new Cassandra node to the cluster, data will need to move from one node to another and this can take time depending on how much data was originally being stored. You can 'exec' into the pod and run the following command to get a progress update on the file transfers: `nodetool netstats -H`
To scale down a Cassandra node, you will need to 'exec' into the pod and run 'nodetool decommission'. This will move all the data from the second node back to the first, and it will also take time to complete.
Once the decommission has completed, you can scale the second Cassandra rc down to zero. At that point you can also remove the rc and pvc for the second Cassandra node, but it is recommended to keep the data around until you can verify that everything is functioning properly. Deleting the pvc will permanently delete that data.
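Putting the scale-down steps together, the sequence would look roughly like this (pod and rc names are illustrative):

```shell
# 1. Stream this node's data back to the rest of the cluster.
oc exec hawkular-cassandra-2-abc12 -n openshift-infra -- nodetool decommission

# 2. Optionally watch the transfer progress from inside the pod.
oc exec hawkular-cassandra-2-abc12 -n openshift-infra -- nodetool netstats -H

# 3. Once the decommission finishes, scale the rc down to zero.
oc scale rc hawkular-cassandra-2 --replicas=0 -n openshift-infra
```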
@erjones: I have attached the instructions for how to scale up the Cassandra nodes. Let us know if that resolves the issue.
I am closing this issue as we never received feedback on whether scaling up the Cassandra instances fixed the issue.
If the scaling did not fix the problem, please reopen.
Reopening this issue since the customer has reported gaps in the metrics charts even after scaling up the Cassandra instances.
Is there any more information we can capture to help with troubleshooting?
Gaps in the charts are usually due to a performance problem where Heapster cannot push all of the metrics out to Hawkular Metrics in time. Heapster tries to push metrics every 30 seconds; if a push is still in progress at the next scheduled time, it skips that cycle and waits for the following one.
If you have larger gaps in your charts, then it's likely due to one of the pods being OOMKilled.
You will need to attach:
- the logs for Hawkular Metrics, Cassandra and Heapster. If your pods are being killed, also attach the logs for the failed pods (e.g. 'oc logs -p $POD_NAME')
- 'oc get pods -o yaml -n openshift-infra'
- 'oc describe pods -n openshift-infra'
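A small script along these lines could gather everything into one directory for attaching (a sketch; it assumes 'oc' is logged in with access to the openshift-infra project):

```shell
#!/bin/sh
# Collect pod status and logs from openshift-infra for attachment.
OUT=metrics-debug
mkdir -p "$OUT"

oc get pods -o yaml -n openshift-infra > "$OUT/pods.yaml"
oc describe pods -n openshift-infra > "$OUT/pods-describe.txt"

for pod in $(oc get pods -n openshift-infra -o name); do
  name=${pod##*/}
  oc logs -n openshift-infra "$name" > "$OUT/$name.log" 2>&1
  # Previous container's log, if the pod has restarted (may be empty).
  oc logs -p -n openshift-infra "$name" > "$OUT/$name.previous.log" 2>&1 || true
done
```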