Bug 1472451 - Periodically there are Gaps in Charts [NEEDINFO]
Status: NEW
Product: OpenShift Container Platform
Classification: Red Hat
Component: Metrics
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.4.z
Assigned To: Matt Wringe
QA Contact: Junqi Zhao
Verified: Unconfirmed
Depends On:
Blocks:
Reported: 2017-07-18 15:03 EDT by Eric Jones
Modified: 2017-09-14 13:57 EDT (History)
CC: 8 users

Type: Bug
mwringe: needinfo? (erjones)


Attachments
metrics gap in UI (121.48 KB, image/png), 2017-07-21 03:59 EDT, Junqi Zhao
Description Eric Jones 2017-07-18 15:03:34 EDT
Description of problem:
Pods are stable, not crashing, but periodically there are gaps in the metrics charts.

Customer even noted that the hawkular-metrics pod and one of the two Cassandra pods saw a CPU/memory spike before the gap (but not the other Cassandra pod or Heapster).

Version-Release number of selected component (if applicable):
3.4
newest metrics components as of 2017/07/14

Additional info:
Attaching pod logs and screenshots shortly
Comment 2 Matt Wringe 2017-07-19 11:39:11 EDT
The gaps I am seeing in the logs seem to indicate the metrics were not being collected during that time. This could be caused by a pod being restarted or one of the pods being unresponsive during that time.

Can you get the output of 'oc get pods -o yaml -n openshift-infra'? That should let us know if there have been any pod restarts and the cause of any restart.

That should help us better understand what happened.
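A sketch of how to capture that output (the namespace matches the one above; the output file name and the custom-columns view are illustrative additions, not part of the original request):

```shell
# Dump full pod definitions for later analysis.
oc get pods -o yaml -n openshift-infra > openshift-infra-pods.yaml

# Quick check for restart counts (standard Kubernetes pod status
# field path; pod names will differ per cluster).
oc get pods -n openshift-infra \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount
```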

@jsanda: any idea what would be causing the spikes which are seen for Hawkular Metrics and Cassandra pods? Could the compression job be causing something like this?
Comment 3 John Sanda 2017-07-19 13:16:08 EDT
I cannot rule out the compression job, but it would not be my first guess. I do not know which 3.4 release we are dealing with, but remember that we had the netty bug in Cassandra 3.0.9 that was pretty bad. As for hawkular-metrics, it is hard to say.

Eric, can you please get logs for hawkular-metrics, cassandra, and heapster?
Comment 6 Junqi Zhao 2017-07-21 03:58:30 EDT
Tested on 3.6; this issue happens only occasionally, and most of the time I did not see the error (see the attached picture). Checking the hawkular-metrics pod logs, there are UnhandledException and java.io.IOException: Connection reset by peer entries, the same as in this defect:

2017-07-21 07:49:47,260 ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (RxComputationScheduler-1) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception
	at org.jboss.resteasy.core.SynchronousDispatcher.writeException(SynchronousDispatcher.java:174)
	at org.jboss.resteasy.core.SynchronousDispatcher.asynchronousExceptionDelivery(SynchronousDispatcher.java:444)
	at org.jboss.resteasy.core.AbstractAsynchronousResponse.internalResume(AbstractAsynchronousResponse.java:196)
	at org.jboss.resteasy.core.AbstractAsynchronousResponse.internalResume(AbstractAsynchronousResponse.java:185)
	at org.jboss.resteasy.plugins.server.servlet.Servlet3AsyncHttpRequest$Servlet3ExecutionContext$S

Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:51)
Comment 7 Junqi Zhao 2017-07-21 03:59 EDT
Created attachment 1302208 [details]
metrics gap in UI
Comment 8 John Sanda 2017-07-25 22:26:41 EDT
Eric can you please provide logs for hawkular-metrics, cassandra, and heapster. Thanks.
Comment 26 Matt Wringe 2017-08-22 16:25:19 EDT
The gaps in the charts are because Heapster was not able to push metrics out to Hawkular Metrics.

From the Heapster logs:
E0817 18:07:00.969883       1 client.go:243] Post https://hawkular-metrics:443/hawkular/metrics/counters/data: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

It looks like it's taking Hawkular Metrics too long to process the request (and the Hawkular Metrics logs confirm this with all the 'connection reset by peer' messages in the logs).

When this happens, it usually means that the metric pods are overloaded and are not able to keep up with the reads and writes.

You have a few options here:

- horizontally scale the Hawkular Metrics and/or Cassandra pods

- or vertically scale the Hawkular Metrics and/or Cassandra pods (e.g., increase any resource limits they may have applied)
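As a sketch of the vertical option, resource limits could be raised on the metrics replication controllers. The names and values below are illustrative, and 'oc set resources' may not be present on older clients, in which case 'oc edit rc' achieves the same thing:

```shell
# Raise memory/CPU limits on the Hawkular Metrics rc (illustrative values).
oc set resources rc/hawkular-metrics \
  --limits=memory=4Gi,cpu=2 -n openshift-infra

# Cycle the pod so the new limits take effect.
oc scale rc/hawkular-metrics --replicas=0 -n openshift-infra
oc scale rc/hawkular-metrics --replicas=1 -n openshift-infra
```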

@jsanda: any good recommendation for what to try first?
Comment 27 John Sanda 2017-08-23 10:29:08 EDT
(In reply to Matt Wringe from comment #26)
> The gaps in the charts are because Heapster was not able to push metrics out
> to Hawkular Metrics.
> 
> From the Heapster logs:
> E0817 18:07:00.969883       1 client.go:243] Post
> https://hawkular-metrics:443/hawkular/metrics/counters/data: net/http:
> request canceled (Client.Timeout exceeded while awaiting headers)
> 
> It looks like it's taking Hawkular Metrics too long to process the request
> (and the Hawkular Metrics logs confirm this with all the 'connection reset by
> peer' messages in the logs).
> 
> When this happens, it usually means that the metric pods are overloaded and
> are not able to keep up with the reads and writes.
> 
> You have a few options here:
> 
> - horizontally scale the Hawkular Metrics and/or Cassandra pods
> 
> - or vertically scale the Hawkular Metrics and/or Cassandra pods (e.g.,
> increase any resource limits they may have applied)
> 
> @jsanda: any good recommendation for what to try first?

I am having trouble correlating anything in the Cassandra logs with the heapster timeout. Eric, can you see if we can get the Cassandra debug logs? They are located in the Cassandra pods at /opt/apache-cassandra/logs/debug.log.
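One way to pull that file out of a pod (the pod name below is hypothetical; 'oc exec' with output redirection avoids depending on 'oc cp', which older clients may lack):

```shell
# Copy Cassandra's debug log out of the pod so it can be attached to the bug.
POD=hawkular-cassandra-1-abcde   # hypothetical pod name
oc exec -n openshift-infra $POD -- \
  cat /opt/apache-cassandra/logs/debug.log > cassandra-debug.log
```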
Comment 29 Eric Jones 2017-08-23 17:24:08 EDT
At the prompting of John, I reached out to the customer to confirm the timezone for the cluster:

[root@njrarltapp00171 ~]# timedatectl
      Local time: Wed 2017-08-23 16:29:54 EDT
  Universal time: Wed 2017-08-23 20:29:54 UTC
        RTC time: Wed 2017-08-23 20:29:54
       Time zone: America/New_York (EDT, -0400)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: yes
 Last DST change: DST began at
                  Sun 2017-03-12 01:59:59 EST
                  Sun 2017-03-12 03:00:00 EDT
 Next DST change: DST ends (the clock jumps one hour backwards) at
                  Sun 2017-11-05 01:59:59 EDT
                  Sun 2017-11-05 01:00:00 EST
Comment 32 Matt Wringe 2017-09-13 16:29:06 EDT
We are still working on getting the documentation updated to properly reflect how Cassandra should be scaled.

But for 3.4, the instructions are as follows.

If you are installing using the deployer, you can use the 'CASSANDRA_NODES' template parameter to specify how many Cassandra nodes you would like to deploy.

If you are already running a cluster and would just like to deploy another node, then you can add one while the cluster is still running. This needs to be done via a template.

For instance, if you want to deploy a second Cassandra node with 100Gi of disk space and using the 3.4.1 version, then you would run the following command:

$ oc process hawkular-cassandra-node-pv \
-v IMAGE_VERSION=3.4.1 \
-v PV_SIZE=100Gi \
-v NODE=2 | oc create -f -

If they are not using a persistent volume, then they would need to use the hawkular-cassandra-node-emptydir template instead (and they don't need to specify the PV_SIZE parameter).

Note: when you add a new Cassandra node to the cluster, data will need to move from one node to another and this can take time depending on how much data was originally being stored. You can 'exec' into the pod and run the following command to get a progress update on the file transfers: `nodetool netstats -H`
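For example (the pod name here is hypothetical):

```shell
# Watch streaming progress while the new node bootstraps.
POD=hawkular-cassandra-2-abcde   # hypothetical pod name
oc exec -n openshift-infra $POD -- nodetool netstats -H
```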

To scale down a Cassandra node, you will need to 'exec' into the pod and run the following command 'nodetool decommission'. This will move all the data from the second node back to the first and will also take time to complete.

Once the decommission has completed you can then scale down the second Cassandra rc to zero. At this point you can also remove the rc and pvc for the second Cassandra node, but it is recommended to keep this data around until you can verify that everything is functioning properly. Deleting the pvc can permanently delete that data.
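The scale-down sequence described above might look like this; the pod and rc names are assumptions based on the NODE=2 template parameter, and the pvc name placeholder is deliberately left generic:

```shell
# 1. Stream the second node's data back to the rest of the ring.
POD=hawkular-cassandra-2-abcde   # hypothetical pod name
oc exec -n openshift-infra $POD -- nodetool decommission

# 2. Once decommission finishes, scale the rc down to zero.
oc scale rc/hawkular-cassandra-2 --replicas=0 -n openshift-infra

# 3. Only after verifying the cluster is healthy, remove the rc and pvc.
#    Deleting the pvc can permanently delete that data.
# oc delete rc/hawkular-cassandra-2 pvc/<cassandra-2-pvc> -n openshift-infra
```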
Comment 33 Matt Wringe 2017-09-13 16:29:56 EDT
@erjones: I have attached the instructions for how to scale up the Cassandra nodes. Let us know if that resolves the issue.
