Bug 1476001 - restarting heapster couple of times with 15K pods on cluster causes hawkular metrics pod to restart
Summary: restarting heapster couple of times with 15K pods on cluster causes hawkular ...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.7.0
Assignee: Matt Wringe
QA Contact: Vikas Laad
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-07-27 19:30 UTC by Vikas Laad
Modified: 2017-10-05 19:11 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-05 19:11:32 UTC
Target Upstream Version:
Embargoed:


Attachments
metrics container logs (102.04 KB, text/plain), 2017-07-27 19:30 UTC, Vikas Laad
cassandra logs (93.81 KB, text/plain), 2017-07-27 19:54 UTC, Vikas Laad
metrics_container_logs_second_restart (62.87 KB, text/plain), 2017-08-03 15:39 UTC, Vikas Laad
metrics_container_logs_first_restart (60.05 KB, text/plain), 2017-08-03 15:40 UTC, Vikas Laad
cassandra logs (59.18 KB, text/plain), 2017-08-03 15:41 UTC, Vikas Laad
heapster logs (12.93 KB, text/plain), 2017-08-03 15:41 UTC, Vikas Laad
metrics pod describe (5.98 KB, text/plain), 2017-08-03 15:41 UTC, Vikas Laad
all pods under openshift-infra (17.00 KB, text/plain), 2017-08-03 15:42 UTC, Vikas Laad
metrics rc (2.95 KB, text/plain), 2017-08-03 17:19 UTC, Vikas Laad
cassandra rc (2.15 KB, text/plain), 2017-08-03 17:19 UTC, Vikas Laad
heapster rc (2.72 KB, text/plain), 2017-08-03 17:20 UTC, Vikas Laad

Description Vikas Laad 2017-07-27 19:30:02 UTC
Created attachment 1305626 [details]
metrics container logs

Description of problem:
On a cluster with 15K pause pods and metrics deployed, restarting heapster a couple of times drives CPU consumption on the metrics pod up, and the Hawkular Metrics pod gets restarted.

Cluster
1 Master
3 infra nodes
100 compute nodes

Version-Release number of selected component (if applicable):
openshift v3.6.151
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

metrics image versions : v3.6.152

Steps to Reproduce:
1. Deploy a cluster with 15K pods.
2. Install metrics.
3. Let metrics stabilize.
4. Restart heapster by scaling it down to 0 and back to 1.
5. Observe that metrics restarts.
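
The heapster bounce in step 4 can be scripted so that repeated runs are identical. The following is a minimal sketch, not the exact procedure used in this report: it assumes the replication controller is named heapster and lives in the openshift-infra project, and the helper names bounce_commands and bounce are hypothetical. It only relies on `oc scale` with `--replicas` and `-n`.

```python
import subprocess

NAMESPACE = "openshift-infra"  # project that holds the metrics pods in this report
RC_NAME = "heapster"           # assumption: name of the heapster replication controller


def bounce_commands(rc=RC_NAME, namespace=NAMESPACE):
    """Build the two oc invocations: scale the rc to 0, then back to 1."""
    return [
        ["oc", "scale", "rc", rc, "--replicas=%d" % replicas, "-n", namespace]
        for replicas in (0, 1)
    ]


def bounce(run=subprocess.check_call):
    """Run the scale-down and scale-up commands in order.

    `run` is injectable so the sequence can be inspected without a cluster.
    """
    for cmd in bounce_commands():
        run(cmd)
```

Calling bounce() on a host with a logged-in oc client performs one restart; calling it in a loop reproduces the "couple of times" case. The sketch does not wait for the old pod to terminate between the two commands, which may matter on a loaded cluster.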

How reproducible:
Always with this configuration.

Actual results:
Metrics restarts

Expected results:
Metrics should not restart

Additional info:
Please see the attached metrics container logs; node logs are attached as well.

Comment 2 Vikas Laad 2017-07-27 19:39:49 UTC
/var/log/messages from infra node http://file.rdu.redhat.com/vlaad/logs/1476001/messages.zip

Comment 3 Vikas Laad 2017-07-27 19:54:02 UTC
Created attachment 1305639 [details]
cassandra logs

Comment 4 Hongkai Liu 2017-07-28 20:13:15 UTC
We tested this issue with 2 cassandra pods today:
openshift_metrics_cassandra_replicas=2

Result:
Restart of hawkular-metrics did not occur after bouncing heapster 3 times. This could be a solution/workaround of this bug.

Metrics data catches up more slowly than in the 1-cassandra case: 30 mins vs 23 mins. Catching up means Hawkular starts to feed data again after heapster is bounced.

Comment 9 Matt Wringe 2017-08-03 13:38:51 UTC
Do we know why Hawkular Metrics was restarted? Was it due to being OOMKilled, was it due to the liveness probe failing? Something else?

When we encounter these types of problems, we need to make sure that we get the logs for heapster, hawkular metrics and cassandra.

As well as the output of 'oc get pods -o yaml -n openshift-infra'.

Without this information we cannot determine what went wrong.

I am closing this issue as insufficient data. If you encounter this problem again in your testing, please reopen this issue and attach the requested information.
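
The restart-reason question above can usually be answered from the pod dump itself by reading each container's last terminated state. The following is a minimal sketch, assuming the dump is taken with `-o json` rather than `-o yaml` so that only the standard library is needed. restart_reasons is a hypothetical helper name; the fields it reads (status.containerStatuses[].restartCount and lastState.terminated) are part of the Kubernetes pod status.

```python
import json


def restart_reasons(pods_json):
    """Return (pod, container, restarts, reason, exit_code) for restarted containers.

    `pods_json` is the text produced by `oc get pods -o json -n openshift-infra`
    (a List with "items") or by the same command for a single pod.
    """
    doc = json.loads(pods_json)
    pods = doc.get("items", [doc])  # a single pod has no "items" key
    found = []
    for pod in pods:
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            if cs.get("restartCount", 0) == 0:
                continue  # never restarted, nothing to report
            # lastState.terminated is absent when no previous run was recorded
            term = cs.get("lastState", {}).get("terminated") or {}
            found.append((
                pod["metadata"]["name"],
                cs["name"],
                cs["restartCount"],
                term.get("reason"),
                term.get("exitCode"),
            ))
    return found
```

A reason of OOMKilled indicates that the container was killed for exceeding its memory limit. Any other reason still needs the pod events and the container logs to tell a failed liveness probe apart from an error inside the process.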

Comment 10 Vikas Laad 2017-08-03 15:39:18 UTC
Did another round of testing with 15K pods; attaching all the logs. This time the metrics image deployed was docker.io/sanstefan/hawkular_metrics, tag nobatch (image ID 99f4798319ef, created 27 hours ago, 929.6 MB).

Network metrics from web console, when metrics is stable
Metrics
Sent 4208 KiB/s
Recvd 3456 KiB/s

Cassandra
Sent 4195 KiB/s
Recvd 895 KiB/s

In this attempt, metrics restarts every time I restart heapster. Attaching the following for 2 heapster restart attempts:
1. container logs for metrics container from "docker logs" command before the container was deleted. 
2. yaml for all the pods under openshift-infra project
3. cassandra logs
4. heapster logs

Please let me know if you need anything else.

Comment 11 Vikas Laad 2017-08-03 15:39:52 UTC
Created attachment 1308782 [details]
metrics_container_logs_second_restart

Comment 12 Vikas Laad 2017-08-03 15:40:15 UTC
Created attachment 1308783 [details]
metrics_container_logs_first_restart

Comment 13 Vikas Laad 2017-08-03 15:41:01 UTC
Created attachment 1308784 [details]
cassandra logs

Comment 14 Vikas Laad 2017-08-03 15:41:22 UTC
Created attachment 1308785 [details]
heapster logs

Comment 15 Vikas Laad 2017-08-03 15:41:58 UTC
Created attachment 1308786 [details]
metrics pod describe

Comment 16 Vikas Laad 2017-08-03 15:42:37 UTC
Created attachment 1308787 [details]
all pods under openshift-infra

Comment 17 Vikas Laad 2017-08-03 17:19:14 UTC
Created attachment 1308805 [details]
metrics rc

Comment 18 Vikas Laad 2017-08-03 17:19:37 UTC
Created attachment 1308806 [details]
cassandra rc

Comment 19 Vikas Laad 2017-08-03 17:20:03 UTC
Created attachment 1308807 [details]
heapster rc

Comment 20 Matt Wringe 2017-10-05 19:11:32 UTC
The restart is caused by an OutOfMemory exception, which can be fixed by increasing the memory allocation for the Hawkular Metrics pods.

Please re-open if you think we need to investigate further.

