Created attachment 1305626 [details]
metrics container logs

Description of problem:
Deploy a cluster with 15K pause pods and metrics installed, then restart heapster a couple of times. CPU consumption on the metrics pod goes high and the Hawkular Metrics pod gets restarted.

Cluster:
1 master
3 infra nodes
100 compute nodes

Version-Release number of selected component (if applicable):
openshift v3.6.151
kubernetes v1.6.1+5115d708d7
etcd 3.2.1
metrics image versions: v3.6.152

Steps to Reproduce:
1. Deploy a cluster with 15K pods
2. Install metrics
3. Let metrics stabilize
4. Restart heapster by scaling it down to 0 and back to 1
5. Observe that the metrics pod restarts

How reproducible:
Always with this configuration.

Actual results:
Metrics restarts

Expected results:
Metrics should not restart

Additional info:
Please see the attached metrics container logs; node logs are also attached.
/var/log/messages from infra node http://file.rdu.redhat.com/vlaad/logs/1476001/messages.zip
Created attachment 1305639 [details] cassandra logs
We tested this issue with 2 Cassandra pods today:
openshift_metrics_cassandra_replicas=2

Result: hawkular-metrics did not restart after bouncing heapster 3 times. This could be a solution or workaround for this bug.

Catching up on metrics data is slower than in the 1-Cassandra case: 30 mins vs 23 mins. "Catching up" means Hawkular starts feeding data again after heapster is bounced.
Do we know why Hawkular Metrics was restarted? Was it OOMKilled, did the liveness probe fail, or was it something else?

When we encounter these types of problems, we need to make sure that we get the logs for heapster, hawkular metrics and cassandra, as well as the output of 'oc get pods -o yaml -n openshift-infra'. Without this information we cannot determine what went wrong.

I am closing this issue as insufficient data. If you encounter this problem again in your testing, please reopen this issue and attach the requested information.
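As a triage aid for the restart-reason question, here is a minimal Python sketch that reads the JSON form of the pod listing (for example from 'oc get pods -o json -n openshift-infra') and reports each container's last recorded termination. The function name `last_terminations` is hypothetical and not part of any tool. It relies only on the standard Kubernetes pod status fields `containerStatuses[].lastState.terminated` and `restartCount`. A reason of `OOMKilled` points at the memory limit; other reasons would still need the pod events and logs to confirm, for example, a liveness probe failure.

```python
import json


def last_terminations(pod_list):
    """Return one tuple per container that has a recorded previous termination.

    Each tuple is (pod name, container name, reason, exit code, restart count).
    `pod_list` is the parsed JSON of a Kubernetes PodList.
    """
    rows = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            # lastState is empty ({}) when the container has never been restarted.
            term = cs.get("lastState", {}).get("terminated")
            if term:
                rows.append((
                    pod_name,
                    cs["name"],
                    term.get("reason"),
                    term.get("exitCode"),
                    cs.get("restartCount", 0),
                ))
    return rows


def load_pod_list(path):
    """Load a PodList previously saved to a file as JSON."""
    with open(path) as fh:
        return json.load(fh)
```

Usage, assuming the listing was saved to a file named pods.json (an assumed name): `for row in last_terminations(load_pod_list("pods.json")): print(row)`.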
Did another round of testing with 15K pods; attaching all the logs. This time the metrics image deployed was:
docker.io/sanstefan/hawkular_metrics   nobatch   99f4798319ef   27 hours ago   929.6 MB

Network metrics from the web console when metrics is stable:
Metrics:   Sent 4208 KiB/s, Recvd 3456 KiB/s
Cassandra: Sent 4195 KiB/s, Recvd 895 KiB/s

In this attempt, metrics restarts every time I restart heapster. Attaching the following logs for 2 heapster restart attempts:
1. container logs for the metrics container, taken with the "docker logs" command before the container was deleted
2. yaml for all the pods under the openshift-infra project
3. cassandra logs
4. heapster logs

Please let me know if you need anything else.
Created attachment 1308782 [details] metrics_container_logs_second_restart
Created attachment 1308783 [details] metrics_container_logs_first_restart
Created attachment 1308784 [details] cassandra logs
Created attachment 1308785 [details] heapster logs
Created attachment 1308786 [details] metrics pod describe
Created attachment 1308787 [details] all pods under openshift-infra
Created attachment 1308805 [details] metrics rc
Created attachment 1308806 [details] cassandra rc
Created attachment 1308807 [details] heapster rc
The error is caused by an OutOfMemory exception, which can be fixed by increasing the memory allocation for the Hawkular Metrics pods. Please re-open if you think we need to investigate further.