Bug 1472451

Summary: Periodically there are Gaps in Charts
Product: OpenShift Container Platform
Reporter: Eric Jones <erjones>
Component: Hawkular
Assignee: John Sanda <jsanda>
Status: CLOSED WORKSFORME
QA Contact: Junqi Zhao <juzhao>
Severity: high
Docs Contact:
Priority: high
Version: 3.4.1
CC: aos-bugs, erich, erjones, hasha, jcantril, jsanda, pweil, rbost, suchaudh
Target Milestone: ---
Keywords: Reopened, Unconfirmed
Target Release: 3.4.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-11-12 19:14:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
  metrics gap in UI (flags: none)

Description Eric Jones 2017-07-18 19:03:34 UTC
Description of problem:
Pods are stable, not crashing, but periodically there are gaps in the metrics charts.

The customer also noted that the hawkular-metrics pod and one of the two cassandra pods saw a CPU/memory spike before the gap (but not the other cassandra pod or heapster).

Version-Release number of selected component (if applicable):
3.4
newest metrics components as of 2017/07/14

Additional info:
Attaching pod logs and screenshots shortly

Comment 2 Matt Wringe 2017-07-19 15:39:11 UTC
The gaps I am seeing in the logs seem to indicate that metrics were not being collected during that time. This could be caused by a pod being restarted or by one of the pods being unresponsive.

Can you get the output of 'oc get pods -o yaml -n openshift-infra'? That should let us know if there have been any pod restarts and the cause of each restart.

That should help us better understand what happened.
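As a side note, a couple of illustrative commands can summarize restarts without reading through the full YAML ($POD_NAME is a placeholder for an actual pod name):

# Restart counts per pod:
$ oc get pods -n openshift-infra \
    -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount

# Reason recorded for the last restart of a specific pod:
$ oc get pod $POD_NAME -n openshift-infra \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}'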

@jsanda: any idea what would be causing the spikes which are seen for Hawkular Metrics and Cassandra pods? Could the compression job be causing something like this?

Comment 3 John Sanda 2017-07-19 17:16:08 UTC
I cannot rule out the compression job, but it would not be my first guess. I do not know which 3.4 release we are dealing with, but remember that we had the netty bug in Cassandra 3.0.9 that was pretty bad. As for hawkular-metrics, it is hard to say.

Eric, can you please get logs for hawkular-metrics, cassandra, and heapster?

Comment 6 Junqi Zhao 2017-07-21 07:58:30 UTC
Tested on 3.6. This issue sometimes happens, but most of the time I did not see the error; see the attached picture. Checking the hawkular-metrics pod logs, there are UnhandledException and java.io.IOException: Connection reset by peer errors, the same as in this defect:

2017-07-21 07:49:47,260 ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (RxComputationScheduler-1) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception
	at org.jboss.resteasy.core.SynchronousDispatcher.writeException(SynchronousDispatcher.java:174)
	at org.jboss.resteasy.core.SynchronousDispatcher.asynchronousExceptionDelivery(SynchronousDispatcher.java:444)
	at org.jboss.resteasy.core.AbstractAsynchronousResponse.internalResume(AbstractAsynchronousResponse.java:196)
	at org.jboss.resteasy.core.AbstractAsynchronousResponse.internalResume(AbstractAsynchronousResponse.java:185)
	at org.jboss.resteasy.plugins.server.servlet.Servlet3AsyncHttpRequest$Servlet3ExecutionContext$S

Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:51)

Comment 7 Junqi Zhao 2017-07-21 07:59:02 UTC
Created attachment 1302208 [details]
metrics gap in UI

Comment 8 John Sanda 2017-07-26 02:26:41 UTC
Eric, can you please provide logs for hawkular-metrics, cassandra, and heapster? Thanks.

Comment 26 Matt Wringe 2017-08-22 20:25:19 UTC
The gaps in the charts are because Heapster was not able to push metrics out to Hawkular Metrics.

From the Heapster logs:
E0817 18:07:00.969883       1 client.go:243] Post https://hawkular-metrics:443/hawkular/metrics/counters/data: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

It looks like it's taking Hawkular Metrics too long to process the request (and the Hawkular Metrics logs confirm this with all the 'connection reset by peer' messages).

When this happens, it usually means that the metric pods are overloaded and are not able to keep up with the reads and writes.
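For anyone hitting the same symptom, an illustrative quick check ($HEAPSTER_POD is a placeholder for the actual Heapster pod name):

# Count push timeouts in the Heapster log:
$ oc logs $HEAPSTER_POD -n openshift-infra | grep -c "Client.Timeout"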

You have a few options here:

- horizontally scale the Hawkular Metrics and/or Cassandra pods

- or vertically scale the Hawkular Metrics and/or Cassandra pods (e.g. increase any resource limits they may have applied); see the sketch below
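A rough sketch of the vertical option, assuming the default rc names from the 3.4 metrics deployment (adjust names and limit values for your environment):

# Raise the limits in the rc pod templates:
$ oc edit rc hawkular-metrics -n openshift-infra
$ oc edit rc hawkular-cassandra-1 -n openshift-infra

# Running pods do not pick up the new limits automatically; delete them so the
# rcs recreate them with the updated resources ($POD_NAME is a placeholder):
$ oc delete pod $POD_NAME -n openshift-infra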

@jsanda: any good recommendation for what to try first?

Comment 27 John Sanda 2017-08-23 14:29:08 UTC
(In reply to Matt Wringe from comment #26)
> The gaps in the charts are because Heapster was not able to push metrics out
> to Hawkular Metrics.
> 
> From the Heapster logs:
> E0817 18:07:00.969883       1 client.go:243] Post
> https://hawkular-metrics:443/hawkular/metrics/counters/data: net/http:
> request canceled (Client.Timeout exceeded while awaiting headers)
> 
> It looks like it's taking Hawkular Metrics too long to process the request
> (and the Hawkular Metrics log confirm this with all the 'connection reset by
> peer' messages in the logs).
> 
> When this happens, it usually means that the metric pods are overloaded and
> are not able to keep up with the reads and writes.
> 
> You have a few options here:
> 
> - horizontally scale the Hawkular Metrics and/or Cassandra pods
> 
> - or vertically scale the Hawkular Metrics and/or Cassandra pods (e.g. increase
> any resource limits they may have applied)
> 
> @jsanda: any good recommendation for what to try first?

I am having trouble correlating anything in the Cassandra logs with the heapster timeout. Eric, can you see if we can get the Cassandra debug logs? They are located in the cassandra pods at /opt/apache-cassandra/logs/debug.log.
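One way to pull them out of the pods (just a sketch; $CASSANDRA_POD is a placeholder for the actual pod name):

# Copy the whole Cassandra log directory out of the pod:
$ oc rsync $CASSANDRA_POD:/opt/apache-cassandra/logs/ ./cassandra-logs/ -n openshift-infra

# Or dump only the debug log:
$ oc exec $CASSANDRA_POD -n openshift-infra -- cat /opt/apache-cassandra/logs/debug.log > debug.log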

Comment 29 Eric Jones 2017-08-23 21:24:08 UTC
At the prompting of John, I reached out to the customer to confirm the timezone for the cluster:

[root@njrarltapp00171 ~]# timedatectl
      Local time: Wed 2017-08-23 16:29:54 EDT
  Universal time: Wed 2017-08-23 20:29:54 UTC
        RTC time: Wed 2017-08-23 20:29:54
       Time zone: America/New_York (EDT, -0400)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: yes
 Last DST change: DST began at
                  Sun 2017-03-12 01:59:59 EST
                  Sun 2017-03-12 03:00:00 EDT
 Next DST change: DST ends (the clock jumps one hour backwards) at
                  Sun 2017-11-05 01:59:59 EDT
                  Sun 2017-11-05 01:00:00 EST

Comment 32 Matt Wringe 2017-09-13 20:29:06 UTC
We are still working on getting the documentation updated to properly reflect how Cassandra should be scaled.

But for 3.4, the instructions are as follows.

If you are installing using the deployer, you can use the 'CASSANDRA_NODES' template parameter to specify how many Cassandra nodes you would like to deploy.
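For example (illustrative only; the deployer template file name and hostname below are placeholders, and any other required parameters depend on your install):

# Deploy metrics with two Cassandra nodes:
$ oc new-app -f metrics-deployer.yaml \
    -p HAWKULAR_METRICS_HOSTNAME=hawkular-metrics.example.com \
    -p CASSANDRA_NODES=2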

If you are already running a cluster and would just like to deploy another node, then you can add one while the cluster is still running. This needs to be done via a template.

For instance, if you want to deploy a second Cassandra node with 100Gi of disk space using the 3.4.1 image version, you would run the following command:

$ oc process hawkular-cassandra-node-pv \
-v IMAGE_VERSION=3.4.1 \
-v PV_SIZE=100Gi \
-v NODE=2 | oc create -f -

If you are not using a persistent volume, then you need to use the hawkular-cassandra-node-emptydir template instead (and you don't need to specify the PV_SIZE parameter).

Note: when you add a new Cassandra node to the cluster, data will need to move between nodes, and this can take time depending on how much data is stored. You can 'exec' into the pod and run the following command to get a progress update on the file transfers: `nodetool netstats -H`
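For instance, without opening an interactive shell ($CASSANDRA_POD is a placeholder for the actual pod name):

# Check streaming progress while the new node joins:
$ oc exec $CASSANDRA_POD -n openshift-infra -- nodetool netstats -H

# The new node shows as UN (Up/Normal) once it has finished joining:
$ oc exec $CASSANDRA_POD -n openshift-infra -- nodetool status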

To scale down a Cassandra node, you will need to 'exec' into the pod and run 'nodetool decommission'. This will move all the data from the second node back to the first and will also take time to complete.

Once the decommission has completed, you can scale the second Cassandra rc down to zero. At this point you can also remove the rc and pvc for the second Cassandra node, but it is recommended to keep them around until you can verify that everything is functioning properly. Deleting the pvc can permanently delete that data.
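Sketched out, assuming the second node's rc follows the default hawkular-cassandra-<N> naming ($CASSANDRA_2_POD is a placeholder):

# Drain the second node's data back onto the rest of the cluster:
$ oc exec $CASSANDRA_2_POD -n openshift-infra -- nodetool decommission

# Once the decommission finishes, scale the second node's rc to zero:
$ oc scale rc hawkular-cassandra-2 --replicas=0 -n openshift-infra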

Comment 33 Matt Wringe 2017-09-13 20:29:56 UTC
@erjones: I have attached the instructions for how to scale up the Cassandra nodes. Let us know if that resolves the issue.

Comment 34 Matt Wringe 2017-10-05 18:37:05 UTC
I am closing this issue as we never got feedback on whether scaling up the cassandra instances fixed the issue or not.

If the scaling did not fix the problem, please reopen.

Comment 35 Robert Bost 2018-01-08 14:18:09 UTC
Reopening this issue since customer has reported gaps in metrics charts after scaling up cassandra instances.

Is there any more information we can capture to help with troubleshooting?

Comment 36 Matt Wringe 2018-01-08 14:54:24 UTC
Gaps in the charts are usually due to a performance problem where Heapster cannot push all of the metrics out to Hawkular Metrics in time (it tries to push metrics every 30 seconds; if it is still pushing metrics when the next scheduled run comes around, it skips that run and pushes new metrics at the following scheduled time).

If you have larger gaps in your charts, then it's likely due to one of the pods being OOMKilled.
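An illustrative quick check for that case:

# Show each pod's last termination reason; OOMKilled means the container hit its memory limit:
$ oc get pods -n openshift-infra \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'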

You will need to attach the following (one way to capture it all is sketched after the list):

- the logs for Hawkular Metrics, Cassandra and Heapster. If your pods are being killed, you will also need to attach the logs for the failed pods (e.g. 'oc logs -p $POD_NAME')
- 'oc get pods -o yaml -n openshift-infra'
- 'oc describe pods -n openshift-infra'
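A minimal way to capture all of the above into files (illustrative; $POD_NAME is a placeholder and the log commands are repeated per pod):

# Current and previous logs for a metrics pod:
$ oc logs $POD_NAME -n openshift-infra > $POD_NAME.log
$ oc logs -p $POD_NAME -n openshift-infra > $POD_NAME.previous.log

# Pod definitions and events:
$ oc get pods -o yaml -n openshift-infra > pods.yaml
$ oc describe pods -n openshift-infra > pods-describe.txt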