Red Hat Bugzilla – Bug 1476001
Restarting heapster a couple of times with 15K pods on the cluster causes the Hawkular Metrics pod to restart
Last modified: 2017-10-05 15:11:32 EDT
Created attachment 1305626 [details]
metrics container logs
Description of problem:
Deploy a cluster with 15K pause pods and metrics installed, then restart heapster a couple of times. CPU consumption on the Hawkular Metrics pod climbs and the pod gets restarted.
3 infra nodes
100 compute nodes
Version-Release number of selected component (if applicable):
metrics image versions : v3.6.152
Steps to Reproduce:
1. Deploy a cluster with 15K pods.
2. Install metrics.
3. Let metrics stabilize.
4. Restart heapster by scaling it down to 0 and back to 1.
5. Observe that the Hawkular Metrics pod restarts.
How reproducible:
Always with this configuration.
Expected results:
Metrics should not restart.
Please see the attached metrics container logs. Node logs are attached as well.
/var/log/messages from infra node http://file.rdu.redhat.com/vlaad/logs/1476001/messages.zip
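For anyone re-running the heapster bounce from the steps to reproduce, here is a minimal sketch of how it could be scripted. It assumes heapster is deployed as a replication controller named `heapster` in the `openshift-infra` project; both names, the 30-second pause, and the helper function names are assumptions for illustration, not something confirmed in this report.

```python
import subprocess
import time


def heapster_bounce_commands(namespace="openshift-infra", rc_name="heapster"):
    """Build the two `oc scale` invocations for one heapster bounce.

    The namespace and replication controller name are assumed defaults.
    """
    target = ["rc", rc_name, "-n", namespace]
    return [
        ["oc", "scale", "--replicas=0"] + target,  # scale heapster down to 0
        ["oc", "scale", "--replicas=1"] + target,  # scale heapster back to 1
    ]


def bounce_heapster(run=subprocess.check_call, pause=time.sleep, settle_seconds=30):
    """Scale heapster down, wait settle_seconds, then scale it back up.

    `run` and `pause` are injectable so the sequence can be dry-run
    without a cluster.
    """
    down, up = heapster_bounce_commands()
    run(down)
    pause(settle_seconds)
    run(up)
```

Calling `bounce_heapster()` on a host with a logged-in `oc` client performs one bounce; repeating the call reproduces the "couple of times" restart pattern.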
Created attachment 1305639 [details]
We tested this issue with 2 cassandra pods today:
hawkular-metrics did not restart after bouncing heapster 3 times. This could be a solution or workaround for this bug.
Catching up on metrics data is slower than in the 1-cassandra case: 30 mins vs 23 mins. Catching up here means the time until hawkular starts feeding data again after heapster is bounced.
Do we know why Hawkular Metrics was restarted? Was it due to being OOMKilled, was it due to the liveness probe failing? Something else?
When we encounter these types of problems, we need to make sure that we get the logs for heapster, hawkular metrics and cassandra, as well as the output of 'oc get pods -o yaml -n openshift-infra'.
Without this information we cannot determine what went wrong.
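To make the restart cause easier to read out of the pod dump, here is a rough sketch that lists restarted containers together with the last termination reason recorded in the pod status. It assumes the dump was taken with `-o json` instead of `-o yaml` so that only the Python standard library is needed; the function name and the file name in the usage note are hypothetical.

```python
import json


def restart_reasons(pod_list):
    """Return (pod, container, restart_count, reason, exit_code) tuples
    for every container that has restarted at least once.

    `pod_list` is the parsed output of `oc get pods -o json`.
    """
    findings = []
    for pod in pod_list.get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "<unknown>")
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            # lastState.terminated describes the previous container instance
            terminated = status.get("lastState", {}).get("terminated")
            if status.get("restartCount", 0) > 0 and terminated:
                findings.append((
                    pod_name,
                    status.get("name"),
                    status["restartCount"],
                    terminated.get("reason"),
                    terminated.get("exitCode"),
                ))
    return findings
```

Usage, assuming the dump was saved with `oc get pods -o json -n openshift-infra > pods.json`: `restart_reasons(json.load(open("pods.json")))`. A reason of `OOMKilled` points at the memory limit. Any other reason would need the pod events (for example from `oc describe pod`) to tell whether the liveness probe failed.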
I am closing this issue as insufficient data. If you encounter this problem again in your testing, please reopen this issue and attach the requested information.
Did another round of testing with 15K pods and am attaching all the logs. This time the metrics image that was deployed was docker.io/sanstefan/hawkular_metrics nobatch 99f4798319ef 27 hours ago 929.6 MB
Network metrics from web console, when metrics is stable
Sent 4208 KiB/s
Recvd 3456 KiB/s
Sent 4195 KiB/s
Recvd 895 KiB/s
In this attempt, metrics restarts every time I restart heapster. I am attaching the following logs for 2 attempts of heapster restart:
1. container logs for metrics container from "docker logs" command before the container was deleted.
2. yaml for all the pods under openshift-infra project
3. cassandra logs
4. heapster logs
Please let me know if you need anything else.
Created attachment 1308782 [details]
Created attachment 1308783 [details]
Created attachment 1308784 [details]
Created attachment 1308785 [details]
Created attachment 1308786 [details]
metrics pod describe
Created attachment 1308787 [details]
all pods under openshift-infra
Created attachment 1308805 [details]
Created attachment 1308806 [details]
Created attachment 1308807 [details]
This error is caused by an OutOfMemory exception, which can be fixed by increasing the memory allocation for the Hawkular Metrics pods.
Please re-open if you think we need to investigate further.