Created attachment 1214208 [details]
logs from hawkular oc get logs

Description of problem:
OpenShift metrics graphs are not showing metrics data when the cluster contains many nodes and pods. In this specific case there were:

212 OpenShift nodes
3 OpenShift masters
3 OpenShift etcd servers
1 OpenShift router

for a total of 219 OpenShift machines.

There were 18700 pods scheduled to run in the cluster. One set of metrics pods was deployed in the openshift-infra project and was running.

Version-Release number of selected component (if applicable):
OpenShift packages:
atomic-openshift-dockerregistry-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-pod-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-clients-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-node-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-tests-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-clients-redistributable-3.3.1.1-1.git.0.629a1d8.el7.x86_64
tuned-profiles-atomic-openshift-node-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-master-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-sdn-ovs-3.3.1.1-1.git.0.629a1d8.el7.x86_64

Metrics images tagged with: v3.3

How reproducible:
This is the first time I have seen this issue. I believe creating approximately this number of pods (or more) will reproduce it.

Steps to Reproduce:
See "How reproducible"

Actual results:
Metrics graphs are empty when accessed via the web interface.

Expected results:
Metrics graphs show the collected metrics data.

Additional info:
Log files attached
Created attachment 1214209 [details] heapster log
Created attachment 1214211 [details] cassandra pod1 logs
Created attachment 1214212 [details] cassandra pod2 logs
So what happened here?

E1019 10:11:41.473744 1 client.go:243] Post https://hawkular-metrics:443/hawkular/metrics/counters/data: dial tcp 172.27.125.238:443: getsockopt: no route to host

This is clearly not coming from Hawkular Metrics. Even if it didn't answer, this wouldn't be the reply. Did the OpenShift/Kubernetes networking stack fail at this point? These requests never reached the network.

Next up:

All host(s) tried for query failed (tried: /172.20.214.4:9042 (com.datastax.driver.core.exceptions.ConnectionException: [/172.20.214.4] Write attempt on defunct connection), hawkular-cassandra/172.26.122.99:9042 (com.datastax.driver.core.exceptions.ConnectionException: [hawkular-cassandra/172.26.122.99] Write attempt on defunct connection))

This is also a network connection error (a network connection failed at some point and the defunct connection was left in the pool). I don't think this can be caused by overload either (you would see timeouts in that case). Both errors may share the same networking cause even though the messages differ. This is going to be trickier to track down; I think we'll need someone who knows the networking stack to answer these questions. These are not application errors.
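To separate the two failure signatures when reading the attached heapster log, a simple grep pass works. The sketch below is illustrative only: it runs against an embedded sample (two lines paraphrased from the log above) so it is self-contained; against a live cluster you would grep the real heapster pod log instead.

```shell
# Sample lines taken from the error messages quoted above.
cat > /tmp/heapster-sample.log <<'EOF'
E1019 10:11:41.473744 1 client.go:243] Post https://hawkular-metrics:443/hawkular/metrics/counters/data: dial tcp 172.27.125.238:443: getsockopt: no route to host
Could not update tags: Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed
EOF

# "no route to host": the request never left the SDN toward the service IP,
# so this is a networking failure, not a Hawkular Metrics reply.
grep -c 'no route to host' /tmp/heapster-sample.log

# HTTP 500 with "All host(s) tried for query failed": Hawkular relaying a
# defunct Cassandra connection, again network-level rather than application-level.
grep -c 'All host(s) tried for query failed' /tmp/heapster-sample.log
```

Counting occurrences of each signature over the full log would show whether the failures are continuous or clustered around a specific time window.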
Is there any more information that the OpenShift Integration Services or Hawkular team can provide here?
I don't really see anything in the Hawkular Metrics or Cassandra logs which would indicate to me that there is something wrong with either of them.

@jsanda: does anything in those logs make you think something is wrong with either Hawkular Metrics or Cassandra?

From the Heapster logs:

"Node dhcp8-207.example.net is not ready"

- I am assuming that there are a bunch of nodes which are not fully started yet (or have been shut down) which are causing these issues, and that there are a bunch of other nodes which are indeed running properly. If all your nodes are not in the Ready state, that would indicate something is wrong with the setup.

A bunch of potential issues about the network:

"error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "https://172.16.7.169:10250/stats/container/": Post https://172.16.7.169:10250/stats/container/: dial tcp 172.16.7.169:10250: getsockopt: connection refused"

This is coming from the kubelet itself; it is not allowing a connection. You may want to check the logs for the node where this is coming from. This could be due to something like the policy not being updated after the OpenShift cluster was updated.

"Post https://hawkular-metrics:443/hawkular/metrics/counters/data: dial tcp 172.27.125.238:443: getsockopt: no route to host"

This is coming from the network setup and is not related to the metrics components.

"Could not update tags: Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (tried: /172.20.214.4:9042 (com.datastax.driver.core.exceptions.ConnectionException: [/172.20.214.4] Write attempt on defunct connection), hawkular-cassandra/172.26.122.99:9042 (com.datastax.driver.core.exceptions.ConnectionException: [hawkular-cassandra/172.26.122.99] Write attempt on defunct connection))"

I suspect this too is caused by something going wrong with the network.

@jsanda: any ideas about this issue?
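Listing the nodes that are out of the Ready state would confirm the "Node ... is not ready" warnings. A minimal sketch, using sample output so it is runnable as-is (the node name is taken from the log line above; the column layout is assumed to match `oc get nodes`, which is what you would pipe in on a live cluster):

```shell
# Hypothetical sample of `oc get nodes` output; replace the heredoc with the
# real command output when running against the cluster.
cat > /tmp/nodes.txt <<'EOF'
NAME                    STATUS     AGE
dhcp8-207.example.net   NotReady   12d
dhcp8-208.example.net   Ready      12d
EOF

# Print the name of every node whose STATUS column is not exactly "Ready",
# skipping the header row.
awk 'NR>1 && $2!="Ready" {print $1}' /tmp/nodes.txt
```

If the resulting list is large relative to 219 machines, the empty graphs are more likely a cluster-health problem than a metrics-stack problem.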
On the very same environment everything works fine with < 10k pods across 220 nodes. Once I push the number of pods toward 20k, the graphs stop showing metrics data.
Have you had any luck increasing the number of Hawkular Metrics or Cassandra nodes?
(In reply to Matt Wringe from comment #7) > I don't really see anything in the Hawkular Metrics or Cassandra logs which > would indicate to me that there is something wrong with either of them. > > @jsanda: does anything in those logs make you think its something wrong with > either Hawkular Metrics or Cassandra? > > From the Heapster logs: > > "Node dhcp8-207.example.net is not ready" > > - I am assuming that there are a bunch of nodes which are not fully started > yet (or have been shutdown) which are causing these issues and that there > are a bunch of other nodes which are indeed running properly. If all your > nodes are not in the ready state, then this would indicate something is > wrong with the setup. > > A bunch of potential issues about the network: > > "error while getting containers from Kubelet: failed to get all container > stats from Kubelet URL "https://172.16.7.169:10250/stats/container/": Post > https://172.16.7.169:10250/stats/container/: dial tcp 172.16.7.169:10250: > getsockopt: connection refused" > > This is coming from the kubelet itself, its not allowing a connection. You > may want to check the logs for the node where this is coming from. This > could be due to something like the policy not being updated after the > OpenShift cluster was updated. > > "Post https://hawkular-metrics:443/hawkular/metrics/counters/data: dial tcp > 172.27.125.238:443: getsockopt: no route to host" > > This is coming from the network setup and is not related to the metric > components. 
> > "Could not update tags: Hawkular returned status code 500, error message: > Failed to perform operation due to an error: All host(s) tried for query > failed (tried: /172.20.214.4:9042 > (com.datastax.driver.core.exceptions.ConnectionException: [/172.20.214.4] > Write attempt on defunct connection), hawkular-cassandra/172.26.122.99:9042 > (com.datastax.driver.core.exceptions.ConnectionException: > [hawkular-cassandra/172.26.122.99] Write attempt on defunct connection))" > > I suspect this too is caused by something going wrong with the network. > > @jsanda: any ideas about this issue? The Cassandra logs look normal. Nothing there to suggest an issue with Cassandra. The connection exceptions can happen with nodes restarting, but it doesn't sound like that is the case here.
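One quick way to rule restarts in or out is to look for a fresh startup banner in the Cassandra pod logs during the incident window. The sketch below is illustrative and self-contained: the sample line is assumed to be typical Cassandra startup output, and on a live cluster you would grep `oc logs` for the Cassandra pods instead.

```shell
# Sample line assumed typical of a Cassandra startup sequence; a count greater
# than 1 in a real pod log would suggest the process restarted.
cat > /tmp/cassandra-sample.log <<'EOF'
INFO  10:05:12 Starting listening for CQL clients on /0.0.0.0:9042
EOF

grep -c 'Starting listening for CQL clients' /tmp/cassandra-sample.log
```

Since the logs here show only the initial startup, the defunct-connection errors are more plausibly explained by the network dropping established connections than by Cassandra bouncing.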
*** This bug has been marked as a duplicate of bug 1465532 ***