Bug 1388815 - Metrics graphs empty with many nodes (210) and pods (18.7k) in OpenShift cluster
Summary: Metrics graphs empty with many nodes (210) and pods (18.7k) in OpenShift cluster
Keywords:
Status: CLOSED DUPLICATE of bug 1465532
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.3.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.7.0
Assignee: Matt Wringe
QA Contact: Peng Li
URL:
Whiteboard: aos-scalability-34
Depends On:
Blocks:
 
Reported: 2016-10-26 08:43 UTC by Elvir Kuric
Modified: 2017-08-04 15:47 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-04 15:47:06 UTC
Target Upstream Version:
Embargoed:


Attachments
logs from hawkular (oc get logs) (28.18 KB, text/plain) - 2016-10-26 08:43 UTC, Elvir Kuric
heapster log (323.54 KB, text/plain) - 2016-10-26 08:45 UTC, Elvir Kuric
cassandra pod1 logs (6.21 MB, text/plain) - 2016-10-26 08:46 UTC, Elvir Kuric
cassandra pod2 logs (2.22 MB, text/plain) - 2016-10-26 08:47 UTC, Elvir Kuric

Description Elvir Kuric 2016-10-26 08:43:36 UTC
Created attachment 1214208 [details]
logs from hawkular (oc get logs)

Description of problem:

OpenShift metrics graphs do not show metrics data when the cluster contains many nodes and pods.

In this specific case there were:

212 OpenShift nodes
3 OpenShift masters
3 OpenShift etcd servers
1 OpenShift router

219 OpenShift machines in total.

There were 18,700 pods scheduled to run in the cluster.
One set of metrics pods was deployed in the openshift-infra project and was running.
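
(Not part of the original report; an illustrative check only, assuming the default openshift-infra project and component names.)

# list the metrics pods and confirm heapster, hawkular-metrics and
# hawkular-cassandra are all in the Running state
oc get pods -n openshift-infra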

Version-Release number of selected component (if applicable):

OpenShift packages :
atomic-openshift-dockerregistry-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-pod-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-clients-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-node-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-tests-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-clients-redistributable-3.3.1.1-1.git.0.629a1d8.el7.x86_64
tuned-profiles-atomic-openshift-node-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-master-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-sdn-ovs-3.3.1.1-1.git.0.629a1d8.el7.x86_64

Metrics images tagged with: v3.3

How reproducible:
I have seen this issue for the first time, and I think creating a similar (or larger) number of pods will lead to it again.




Steps to Reproduce:
See "How reproducible" 

Actual results:

Metrics graphs are empty when accessed via the web interface.

Expected results:
Metrics graphs show the collected metrics data.

Additional info:
Log files attached

Comment 1 Elvir Kuric 2016-10-26 08:45:57 UTC
Created attachment 1214209 [details]
heapster log

Comment 2 Elvir Kuric 2016-10-26 08:46:41 UTC
Created attachment 1214211 [details]
cassandra pod1 logs

Comment 3 Elvir Kuric 2016-10-26 08:47:31 UTC
Created attachment 1214212 [details]
cassandra pod2 logs

Comment 4 Michael Burman 2016-10-26 12:11:26 UTC
So what happened here? 

E1019 10:11:41.473744       1 client.go:243] Post https://hawkular-metrics:443/hawkular/metrics/counters/data: dial tcp 172.27.125.238:443: getsockopt: no route to host

This is clearly not coming from Hawkular Metrics; even if it didn't answer, this wouldn't be the reply. Did the OpenShift/Kubernetes networking stack fail at this point? These requests never reached the network.

Next up:

All host(s) tried for query failed (tried: /172.20.214.4:9042 (com.datastax.driver.core.exceptions.ConnectionException: [/172.20.214.4] Write attempt on defunct connection), hawkular-cassandra/172.26.122.99:9042 (com.datastax.driver.core.exceptions.ConnectionException: [hawkular-cassandra/172.26.122.99] Write attempt on defunct connection))

This is also a network connection error (the network connection failed at some point and a defunct connection was placed back in the pool). I don't think this can be caused by overload either (you would see timeouts in that case).

Both of these may come from the same networking problem, even though the error messages differ.

This is going to be trickier to track down, but I think we'll need someone who knows the networking setup to answer these questions. These are not application errors.
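
(Not part of the original comment; a rough, illustrative way to check whether the addresses in these errors are reachable from inside the cluster. Pod names are placeholders, and which client tools exist inside the images may vary.)

# can the heapster pod reach the hawkular-metrics service at all?
oc exec -n openshift-infra <heapster-pod> -- curl -k -s -o /dev/null -w '%{http_code}\n' https://hawkular-metrics/hawkular/metrics/status

# is the Cassandra CQL port (9042) reachable from the hawkular-metrics pod?
oc exec -n openshift-infra <hawkular-metrics-pod> -- nc -zv hawkular-cassandra 9042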

Comment 5 Matt Wringe 2016-10-31 16:05:40 UTC
Is there any more information that the OpenShift Integration Services or Hawkular team can provide here?

Comment 7 Matt Wringe 2016-10-31 18:45:17 UTC
I don't really see anything in the Hawkular Metrics or Cassandra logs which would indicate to me that there is something wrong with either of them.

@jsanda: does anything in those logs make you think it's something wrong with either Hawkular Metrics or Cassandra?

From the Heapster logs:

"Node dhcp8-207.example.net is not ready"

- I am assuming that there are a bunch of nodes which are not fully started yet (or have been shut down) causing these issues, and that the other nodes are indeed running properly. If none of your nodes are in the Ready state, that would indicate something is wrong with the setup.
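
(Illustrative only: a quick way to list the nodes that are not in the Ready state, assuming cluster-admin access.)

# show only nodes whose STATUS is not Ready
oc get nodes | grep -v ' Ready'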

A bunch of potential issues about the network:

"error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "https://172.16.7.169:10250/stats/container/": Post https://172.16.7.169:10250/stats/container/: dial tcp 172.16.7.169:10250: getsockopt: connection refused"

This is coming from the kubelet itself; it's not allowing a connection. You may want to check the logs for the node where this is coming from. This could be due to something like the policy not being updated after the OpenShift cluster was updated.
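
(Not part of the original comment; a sketch of what could be checked on the node that refuses connections on 10250. The service name below is the OCP 3.3 default.)

# on the affected node (e.g. via ssh):
systemctl status atomic-openshift-node              # is the node service running?
ss -tlnp | grep 10250                               # is the kubelet port listening?
journalctl -u atomic-openshift-node | tail -n 50    # recent node service errors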

"Post https://hawkular-metrics:443/hawkular/metrics/counters/data: dial tcp 172.27.125.238:443: getsockopt: no route to host"

This is coming from the network setup and is not related to the metric components.

"Could not update tags: Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (tried: /172.20.214.4:9042 (com.datastax.driver.core.exceptions.ConnectionException: [/172.20.214.4] Write attempt on defunct connection), hawkular-cassandra/172.26.122.99:9042 (com.datastax.driver.core.exceptions.ConnectionException: [hawkular-cassandra/172.26.122.99] Write attempt on defunct connection))"

I suspect this too is caused by something going wrong with the network.
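
(Illustrative only: if the Cassandra cluster itself were unhealthy, nodetool would normally show it; the pod name is a placeholder.)

# check that all Cassandra nodes report Up/Normal (UN)
oc exec -n openshift-infra <hawkular-cassandra-pod> -- nodetool status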

@jsanda: any ideas about this issue?

Comment 8 Elvir Kuric 2016-10-31 18:56:58 UTC
In the very same environment everything works fine with < 10k pods across 220 nodes. Once I push the number of pods towards 20k, the graphs stop showing metrics data.

Comment 10 Matt Wringe 2016-10-31 19:04:31 UTC
Have you had any luck increasing the number of Hawkular Metrics or Cassandra nodes?
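
(Not from the original comment; a hedged sketch. With the OCP 3.3 metrics deployer, additional Cassandra nodes are normally requested at deploy time, e.g. via the deployer's CASSANDRA_NODES parameter, while Hawkular Metrics replicas can be scaled on their replication controller. The name below is the default and should be verified with oc get rc -n openshift-infra.)

# scale the Hawkular Metrics replication controller to two replicas
oc scale rc hawkular-metrics --replicas=2 -n openshift-infra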

Comment 13 John Sanda 2016-10-31 20:48:12 UTC
(In reply to Matt Wringe from comment #7)
> I don't really see anything in the Hawkular Metrics or Cassandra logs which
> would indicate to me that there is something wrong with either of them.
> 
> @jsanda: does anything in those logs make you think its something wrong with
> either Hawkular Metrics or Cassandra?
> 
> From the Heapster logs:
> 
> "Node dhcp8-207.example.net is not ready"
> 
> - I am assuming that there are a bunch of nodes which are not fully started
> yet (or have been shutdown) which are causing these issues and that there
> are a bunch of other nodes which are indeed running properly. If all your
> nodes are not in the ready state, then this would indicate something is
> wrong with the setup.
> 
> A bunch of potential issues about the network:
> 
> "error while getting containers from Kubelet: failed to get all container
> stats from Kubelet URL "https://172.16.7.169:10250/stats/container/": Post
> https://172.16.7.169:10250/stats/container/: dial tcp 172.16.7.169:10250:
> getsockopt: connection refused"
> 
> This is coming from the kubelet itself, its not allowing a connection. You
> may want to check the logs for the node where this is coming from. This
> could be due to something like the policy not being updated after the
> OpenShift cluster was updated.
> 
> "Post https://hawkular-metrics:443/hawkular/metrics/counters/data: dial tcp
> 172.27.125.238:443: getsockopt: no route to host"
> 
> This is coming from the network setup and is not related to the metric
> components.
> 
> "Could not update tags: Hawkular returned status code 500, error message:
> Failed to perform operation due to an error: All host(s) tried for query
> failed (tried: /172.20.214.4:9042
> (com.datastax.driver.core.exceptions.ConnectionException: [/172.20.214.4]
> Write attempt on defunct connection), hawkular-cassandra/172.26.122.99:9042
> (com.datastax.driver.core.exceptions.ConnectionException:
> [hawkular-cassandra/172.26.122.99] Write attempt on defunct connection))"
> 
> I suspect this too is caused by something going wrong with the network.
> 
> @jsanda: any ideas about this issue?

The Cassandra logs look normal. Nothing there to suggest an issue with Cassandra. The connection exceptions can happen with nodes restarting, but it doesn't sound like that is the case here.

Comment 37 Stefan Negrea 2017-08-04 15:47:06 UTC

*** This bug has been marked as a duplicate of bug 1465532 ***

