Bug 1338794

Summary:	Heapster was constantly restarted because the hawkular metrics pod was not ready
Product:	OpenShift Container Platform	Reporter:	Miheer Salunke <misalunk>
Component:	Hawkular	Assignee:	Matt Wringe <mwringe>
Status:	CLOSED CURRENTRELEASE	QA Contact:	chunchen <chunchen>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3.1.0	CC:	aos-bugs, boris.ruppert, misalunk, wsun
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-07-20 14:44:24 UTC	Type:	Bug

Description Miheer Salunke 2016-05-23 12:34:31 UTC

Description of problem:
Heapster was constantly restarted because the hawkular metrics pod was not ready: 

[...]
[qxn7076@ose3adm ops]$ oc logs heapster-vfgt1
Starting Heapster with the following arguments: --source=kubernetes:https://kubernetes.default.svc:443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250 --sink=hawkular:https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=hawkular&pass=WqwkwJMUf0031EE&filter=label(container_name:^/system.slice.*|^/user.slice) --logtostderr=true --tls_cert=/secrets/heapster.cert --tls_key=/secrets/heapster.key --tls_client_ca=/secrets/heapster.client-ca --allowed_users=system:master-proxy --stats_resolution=30s
I0512 09:43:49.185681       1 heapster.go:60] heapster --source=kubernetes:https://kubernetes.default.svc:443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250 --sink=hawkular:https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=hawkular&pass=WqwkwJMUf0031EE&filter=label(container_name:^/system.slice.*|^/user.slice) --logtostderr=true --tls_cert=/secrets/heapster.cert --tls_key=/secrets/heapster.key --tls_client_ca=/secrets/heapster.client-ca --allowed_users=system:master-proxy --stats_resolution=30s
I0512 09:43:49.190906       1 heapster.go:61] Heapster version 0.18.0
I0512 09:43:49.191397       1 kube_factory.go:168] Using Kubernetes client with master "https://kubernetes.default.svc:443" and version "v1"
I0512 09:43:49.191412       1 kube_factory.go:169] Using kubelet port 10250
I0512 09:43:49.192312       1 driver.go:491] Initialised Hawkular Sink with parameters {_system https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=hawkular&pass=WqwkwJMUf0031EE&filter=label(container_name:^/system.slice.*|^/user.slice) 0xc20817eea0 }
I0512 09:43:50.592720       1 heapster.go:71] Starting heapster on port 8082
E0512 09:44:08.772517       1 model_handlers.go:620] unable to get pod list metric: the model is not populated yet
E0512 09:44:38.796927       1 model_handlers.go:620] unable to get pod list metric: the model is not populated yet
E0512 09:45:08.836620       1 model_handlers.go:620] unable to get pod list metric: the model is not populated yet
E0512 09:45:38.874711       1 model_handlers.go:620] unable to get pod list metric: the model is not populated yet
E0512 09:46:08.895800       1 model_handlers.go:620] unable to get pod list metric: the model is not populated yet
[qxn7076@ose3adm ops]
[...]

However, the Hawkular Metrics pod showed no errors, and I had to restart it manually to get metrics working again.

Version-Release number of selected component (if applicable):
OpenShift Enterprise 3.1

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Matt Wringe 2016-05-24 13:16:55 UTC

For 3.2 we have mitigated this somewhat by making the time between restarts far longer, but a similar issue can still occur. If Heapster cannot properly connect to Hawkular Metrics after a certain grace period, we consider this an error condition and restart the pod (just as any pod should be restarted if it enters an error state).
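
As a rough illustration of the grace-period behaviour described above, here is a minimal Python sketch. It is not the actual Heapster or OpenShift probe logic; the names `SinkProbeState` and `should_restart` and the 300-second default are hypothetical, chosen only to show the idea of tolerating a temporarily unreachable sink before treating it as an error.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SinkProbeState:
    """Timestamps (in seconds) tracked by a hypothetical health probe."""
    started_at: float                     # when the Heapster container started
    last_success: Optional[float] = None  # last successful contact with the sink


def should_restart(state: SinkProbeState, now: float,
                   grace_period: float = 300.0) -> bool:
    """Return True once the sink has been unreachable longer than the grace period.

    The grace period is measured from the last successful contact, or from
    container start if the sink has never been reached. The 300-second default
    is an assumed value for illustration, not a documented OpenShift setting.
    """
    if state.last_success is not None:
        reference = state.last_success
    else:
        reference = state.started_at
    return (now - reference) > grace_period
```

With a longer `grace_period`, a slow-starting Hawkular Metrics pod gets more time to become ready before Heapster is restarted, which is the kind of mitigation this comment refers to.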

For 3.2 we have also made this easier to diagnose by changing how the lifecycle of the pod functions and by having these error messages show up in the events log (there are currently edge cases in OpenShift where the old lifecycle handling did not function properly).

Heapster should have automatically connected to Hawkular Metrics once it was properly started, though. Are you sure there weren't any error messages in the Hawkular Metrics logs, and that the state was shown as ready on the Hawkular Metrics status page (e.g. by visiting https://HAWKULAR_METRICS_HOSTNAME/hawkular/metrics in a browser)?
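
The status check suggested above could also be scripted instead of done in a browser. The following Python sketch rests on assumptions that should be verified against the deployed version: that a `/status` path exists under `/hawkular/metrics`, and that it returns a JSON object whose `MetricsService` field reads `STARTED` when the service is ready. The function names and the `ca_file` parameter are hypothetical.

```python
import json
import ssl
import urllib.request
from typing import Optional


def is_metrics_ready(status_body: str) -> bool:
    """Interpret a status response body.

    Assumes (not confirmed by this bug report) a JSON object with a
    "MetricsService" field that equals "STARTED" once the service is ready.
    """
    try:
        payload = json.loads(status_body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return payload.get("MetricsService") == "STARTED"


def fetch_status(hostname: str, ca_file: Optional[str] = None,
                 timeout: float = 10.0) -> str:
    """Fetch the status page body over HTTPS.

    The "/status" suffix is an assumed path; ca_file may point at the
    CA certificate used by the Hawkular Metrics route.
    """
    url = f"https://{hostname}/hawkular/metrics/status"
    context = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.read().decode("utf-8")
```

For example, `is_metrics_ready(fetch_status("HAWKULAR_METRICS_HOSTNAME"))` would report whether the service claims to be started, with the placeholder replaced by the real route hostname.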

Comment 2 Matt Wringe 2016-07-20 14:44:24 UTC

Closing this as it has been fixed in OSE 3.2.