Bug 1476001 - restarting heapster couple of times with 15K pods on cluster causes hawkular metrics pod to restart
Status: ASSIGNED
Product: OpenShift Container Platform
Classification: Red Hat
Component: Metrics
Version: 3.6.0
Hardware: Unspecified OS: Unspecified
Priority: unspecified Severity: high
Target Milestone: ---
Target Release: 3.7.0
Assigned To: Matt Wringe
QA Contact: Vikas Laad
Keywords: Reopened
Depends On:
Blocks:
Reported: 2017-07-27 15:30 EDT by Vikas Laad
Modified: 2017-08-03 13 EDT

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-03 09:38:51 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
metrics container logs (102.04 KB, text/plain), 2017-07-27 15:30 EDT, Vikas Laad
cassandra logs (93.81 KB, text/plain), 2017-07-27 15:54 EDT, Vikas Laad
metrics_container_logs_second_restart (62.87 KB, text/plain), 2017-08-03 11:39 EDT, Vikas Laad
metrics_container_logs_first_restart (60.05 KB, text/plain), 2017-08-03 11:40 EDT, Vikas Laad
cassandra logs (59.18 KB, text/plain), 2017-08-03 11:41 EDT, Vikas Laad
heapster logs (12.93 KB, text/plain), 2017-08-03 11:41 EDT, Vikas Laad
metrics pod describe (5.98 KB, text/plain), 2017-08-03 11:41 EDT, Vikas Laad
all pods under openshift-infra (17.00 KB, text/plain), 2017-08-03 11:42 EDT, Vikas Laad
metrics rc (2.95 KB, text/plain), 2017-08-03 13:19 EDT, Vikas Laad
cassandra rc (2.15 KB, text/plain), 2017-08-03 13:19 EDT, Vikas Laad
heapster rc (2.72 KB, text/plain), 2017-08-03 13:20 EDT, Vikas Laad

Description Vikas Laad 2017-07-27 15:30:02 EDT
Created attachment 1305626 [details]
metrics container logs

Description of problem:
Deploy a cluster with 15K pause pods and metrics, then restart heapster a couple of times. CPU consumption on the Hawkular Metrics pod climbs and the pod gets restarted.

Cluster
1 Master
3 infra nodes
100 compute nodes

Version-Release number of selected component (if applicable):
openshift v3.6.151
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

metrics image versions : v3.6.152

Steps to Reproduce:
1. Deploy a cluster with 15K pods.
2. Install metrics.
3. Let metrics stabilize.
4. Restart heapster by scaling it down to 0 and back to 1.
5. Observe that the metrics pod restarts.
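
Step 4 can be sketched as a small script. This is a minimal sketch, not the reporter's exact procedure: it assumes heapster is managed by a replication controller named heapster in the openshift-infra project (the report attaches a "heapster rc" and mentions that project, but does not give the exact commands), and the 30-second pause is an arbitrary placeholder.

```python
import subprocess
import time


def heapster_bounce_commands(namespace="openshift-infra", rc_name="heapster"):
    """Return the two `oc scale` invocations that bounce heapster (0, then 1).

    The namespace and rc_name defaults are assumptions, not taken from the report.
    """
    return [
        ["oc", "scale", "rc", rc_name, "--replicas=%d" % replicas, "-n", namespace]
        for replicas in (0, 1)
    ]


def bounce_heapster(dry_run=True, pause_seconds=30):
    """Print the commands; run them only when dry_run is False."""
    for cmd in heapster_bounce_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.check_call(cmd)
            # Assumed pause between scale-down and scale-up; the report gives no timing.
            time.sleep(pause_seconds)


if __name__ == "__main__":
    bounce_heapster(dry_run=True)
```

With dry_run left at True the script only prints the two commands, which makes it safe to check them before running against a cluster.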

How reproducible:
Always with this configuration.

Actual results:
Metrics restarts

Expected results:
Metrics should not restart

Additional info:
Please see the attached metrics container logs. I am also attaching node logs.
Comment 2 Vikas Laad 2017-07-27 15:39:49 EDT
/var/log/messages from infra node http://file.rdu.redhat.com/vlaad/logs/1476001/messages.zip
Comment 3 Vikas Laad 2017-07-27 15:54 EDT
Created attachment 1305639 [details]
cassandra logs
Comment 4 Hongkai Liu 2017-07-28 16:13:15 EDT
We tested this issue with 2 cassandra pods today:
openshift_metrics_cassandra_replicas=2

Result:
Restart of hawkular-metrics did not occur after bouncing heapster 3 times. This could be a solution or workaround for this bug.

The catch-up of metrics data is slower than in the 1-cassandra case: 30 mins vs 23 mins. Catching up means hawkular starts to feed data again after heapster is bounced.
Comment 9 Matt Wringe 2017-08-03 09:38:51 EDT
Do we know why Hawkular Metrics was restarted? Was it OOMKilled, or did the liveness probe fail? Something else?

When we encounter these types of problems, we need to make sure that we get the logs for heapster, hawkular metrics and cassandra, as well as the output of 'oc get pods -o yaml -n openshift-infra'.

Without this information we cannot determine what went wrong.

I am closing this issue as insufficient data. If you encounter this problem again in your testing, please reopen this issue and attach the requested information.
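
The OOMKilled question above can be partly answered from the pod status itself. The following is a minimal sketch, assuming the pod list was saved as JSON rather than YAML (for example with 'oc get pods -o json -n openshift-infra > pods.json', where pods.json is a hypothetical file name). It reads the standard Kubernetes containerStatuses fields (restartCount, lastState.terminated.reason, exitCode). A reason of OOMKilled points at memory; any other reason does not by itself prove a liveness-probe kill, so the pod events from 'oc describe pod' are still needed to confirm that case.

```python
import json
import sys


def summarize_restarts(pod_list):
    """Return (pod, container, restart_count, reason, exit_code) for restarted containers."""
    rows = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            if cs.get("restartCount", 0) == 0:
                continue
            # lastState.terminated describes how the previous container instance ended.
            terminated = (cs.get("lastState") or {}).get("terminated") or {}
            rows.append((
                pod_name,
                cs["name"],
                cs["restartCount"],
                terminated.get("reason"),
                terminated.get("exitCode"),
            ))
    return rows


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for row in summarize_restarts(json.load(handle)):
            print("%s/%s restarts=%d reason=%s exit=%s" % row)
```

Containers that never restarted are skipped, so an empty result means no restart was recorded in the saved pod list.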
Comment 10 Vikas Laad 2017-08-03 11:39:18 EDT
Did another round of testing with 15K pods; attaching all the logs. This time the deployed metrics image was docker.io/sanstefan/hawkular_metrics (tag nobatch, image ID 99f4798319ef, created 27 hours ago, size 929.6 MB).

Network metrics from web console, when metrics is stable
Metrics
Sent 4208 KiB/s
Recvd 3456 KiB/s

Cassandra
Sent 4195 KiB/s
Recvd 895 KiB/s

In this attempt, metrics restarts every time I restart heapster. Attaching the following logs for 2 heapster restarts:
1. container logs for metrics container from "docker logs" command before the container was deleted. 
2. yaml for all the pods under openshift-infra project
3. cassandra logs
4. heapster logs

Please let me know if you need anything else.
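
As an alternative to capturing "docker logs" output before the container is deleted, the same set of artifacts can usually be gathered through oc. The sketch below only builds the command lines. It assumes the openshift-infra project named earlier in this bug, and the pod names passed in are hypothetical placeholders. 'oc logs --previous' returns the log of the prior container instance and fails if the container has not restarted.

```python
def log_collection_commands(pod_names, namespace="openshift-infra"):
    """Build oc commands for the pod YAML plus current and previous logs of each pod."""
    commands = [["oc", "get", "pods", "-o", "yaml", "-n", namespace]]
    for pod in pod_names:
        # Log of the currently running container instance.
        commands.append(["oc", "logs", pod, "-n", namespace])
        # Log of the instance that ran before the last restart, if there was one.
        commands.append(["oc", "logs", pod, "--previous", "-n", namespace])
    return commands


if __name__ == "__main__":
    # Placeholder pod names; substitute the real names from `oc get pods`.
    for cmd in log_collection_commands(["hawkular-metrics-xxxxx", "heapster-xxxxx"]):
        print(" ".join(cmd))
```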
Comment 11 Vikas Laad 2017-08-03 11:39 EDT
Created attachment 1308782 [details]
metrics_container_logs_second_restart
Comment 12 Vikas Laad 2017-08-03 11:40 EDT
Created attachment 1308783 [details]
metrics_container_logs_first_restart
Comment 13 Vikas Laad 2017-08-03 11:41 EDT
Created attachment 1308784 [details]
cassandra logs
Comment 14 Vikas Laad 2017-08-03 11:41 EDT
Created attachment 1308785 [details]
heapster logs
Comment 15 Vikas Laad 2017-08-03 11:41 EDT
Created attachment 1308786 [details]
metrics pod describe
Comment 16 Vikas Laad 2017-08-03 11:42 EDT
Created attachment 1308787 [details]
all pods under openshift-infra
Comment 17 Vikas Laad 2017-08-03 13:19 EDT
Created attachment 1308805 [details]
metrics rc
Comment 18 Vikas Laad 2017-08-03 13:19 EDT
Created attachment 1308806 [details]
cassandra rc
Comment 19 Vikas Laad 2017-08-03 13:20 EDT
Created attachment 1308807 [details]
heapster rc
