1413658 – Heapster reports no metrics for pods.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Hawkular
Sub Component:
Version:	3.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Solly Ross
QA Contact:	Peng Li
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-01-16 15:59 UTC by Wesley Hearn
Modified:	2020-03-11 15:36 UTC
CC List:	13 users
Fixed In Version:
Doc Type:	Known Issue
Doc Text:	thin_ls support in the base operating system, coupled with thin_ls metrics support in cAdvisor, could cause cAdvisor's metrics collection interval to become significantly larger than expected. This in turn causes Heapster to skip rate calculation, leading to missing CPU metric values.
Clone Of:
Environment:
Last Closed:	2017-04-12 19:09:25 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2017:0884	0	normal	SHIPPED_LIVE	Red Hat OpenShift Container Platform 3.5 RPM Release Advisory	2017-04-12 22:50:07 UTC

Description Wesley Hearn 2017-01-16 15:59:20 UTC

Description of problem:
Pods in a ready state (1/1) are not reporting metrics to Heapster. This does not seem to be limited to a single node; from a quick look, it appears to be a namespace-wide issue.

Version-Release number of selected component (if applicable):
heapster-1.1.0-3.el7.x86_64 in metrics install 3.3.0

How reproducible:
Unknown, first time we have seen this

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
I0116 10:55:50.566507       1 handlers.go:242] No metrics for container <application> in pod <namespace>/<application>-14-2p2kw
I0116 10:55:50.566514       1 handlers.go:178] No metrics for pod <namespace>/<application>-14-2p2kw
I0116 10:55:50.566537       1 handlers.go:178] No metrics for pod <namespace>/<application>-14-ajpt2
I0116 10:55:50.566551       1 handlers.go:178] No metrics for pod <namespace>/<application>-14-dlfm7
I0116 10:55:50.566562       1 handlers.go:178] No metrics for pod <namespace>/<application>-14-w0pia
I0116 10:55:50.566575       1 handlers.go:178] No metrics for pod <namespace>/<application>-14-3qxv0

Comment 1 Matt Wringe 2017-01-16 17:06:33 UTC

There are a few reasons why this might occur.

The nodes themselves only store metrics for a brief amount of time. If Heapster requests this information outside of this range, it will not receive any metrics back.

Things to check:
1) The version. There was an issue with older Heapster versions where Heapster would get out of sync with the time range it requested from the nodes. This should have been fixed in Heapster 1.0.

From IRC, it was determined that this is not the cause, as it's running Heapster 1.1.0.

2) Node clocks being out of sync. If the nodes are not in sync, then when Heapster requests metrics from a node for a time range, that range may fall outside of what the node currently has stored. If this were the case, then only certain nodes would return metrics, while others would not.

From IRC, it was determined that the clocks on the nodes are synced with NTP and that this issue doesn't affect particular nodes only.

3) The pods are not in the running state. Heapster will sometimes try to get metrics for pods even if they are not currently in the running state. Since such a pod is not running, there are no metrics to collect, and Heapster will log a message like this.

From IRC, it was determined that these pods are all in the running state.

4) The metrics are being gathered and the graphs appear in the console, but someone has increased the log verbosity and incorrectly assumes that metrics are not being gathered. This can happen if the node is collecting metrics every 20 seconds but we are asking for metrics every 15 seconds. Some 15-second windows will then not have any metric available in them, and we can get debugging messages in the logs about this.

From IRC, it was determined that this is limited to a specific namespace. If reason 4 were the cause, we would be seeing this for all pods, not just pods in a specific namespace.
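
To illustrate reason 4, here is a minimal sketch of how a 15-second query window can miss every sample when the node only produces one every 20 seconds. It is not Heapster code: the function name, the half-open window convention, and the assumption that samples land exactly at multiples of the collection interval are all simplifications made for illustration.

```python
def empty_windows(sample_interval, window, duration):
    """Return the start times of query windows that contain no sample.

    Assumptions (illustrative only): samples land at
    t = 0, sample_interval, 2 * sample_interval, ... and each query
    window is half-open, covering [start, start + window).
    """
    sample_times = range(0, duration, sample_interval)
    empty = []
    for start in range(0, duration, window):
        end = start + window
        # A window is empty when no sample timestamp falls inside it.
        if not any(start <= t < end for t in sample_times):
            empty.append(start)
    return empty


# Node samples every 20s, queries cover 15s windows, over two minutes.
print(empty_windows(sample_interval=20, window=15, duration=120))
```

With these numbers, the windows starting at 45s and 105s contain no sample, which would produce the occasional "No metrics" message even though collection is healthy. Such gaps would be expected across all pods, not only those of one namespace.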


Requests:
Can you please attach the full Heapster logs? That might help more than just seeing a small snippet of the logs.

Can you please verify that this is limited to a specific namespace? Are all pods in this namespace not showing up, or only certain pods in this namespace? Is this limited to just one namespace, or are you seeing this from multiple namespaces?

Can we please verify that other pods on these nodes are having their metrics collected?

Is there anything special about this namespace? Does its name contain any special characters or anything else unusual?

For these pods, does anything show up in the graphs in the console, or are they completely empty?

Comment 2 Matt Wringe 2017-01-16 17:08:16 UTC

Also, could you describe the setup this is running under? E.g. roughly how many nodes there are in the OpenShift cluster, how many pods are deployed, etc.

Comment 3 Wesley Hearn 2017-01-16 19:36:05 UTC

Passing the log through "grep -v <namespace>", there is not a single instance of "No metrics for pod" in 2+ days of logs.
Searching for other applications in the namespace returns 0 results, so it may be limited to this specific application.
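
The same check can be sketched as a small script that counts "No metrics for pod" messages per namespace, assuming the log line layout shown in the description. The namespace and pod names in the sample (`ns-a`, `app-14-...`) are hypothetical placeholders standing in for the redacted values, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the pod-level message only; the container-level message
# ("No metrics for container ... in pod ...") is deliberately skipped.
POD_LINE = re.compile(r"No metrics for pod (?P<namespace>[^/\s]+)/(?P<pod>\S+)")


def count_by_namespace(lines):
    """Count 'No metrics for pod' log lines per namespace."""
    counts = Counter()
    for line in lines:
        match = POD_LINE.search(line)
        if match:
            counts[match.group("namespace")] += 1
    return counts


# Hypothetical sample lines in the format seen in the Heapster log.
sample = [
    "I0116 10:55:50.566507       1 handlers.go:242] No metrics for container app in pod ns-a/app-14-2p2kw",
    "I0116 10:55:50.566514       1 handlers.go:178] No metrics for pod ns-a/app-14-2p2kw",
    "I0116 10:55:50.566537       1 handlers.go:178] No metrics for pod ns-a/app-14-ajpt2",
]
print(count_by_namespace(sample))
```

If only one namespace ever appears in the resulting counts, that matches what the grep shows.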

It is a scaled app and this error shows for pods on different nodes.

Nothing special for this NS from what I can see.

Number of running pods cluster wide:
[root@<snip>-master-93315 ~]# oc get pods --all-namespaces | grep Running | wc -l
62

Cluster Size:
[root@<snip>-master-93315 ~]# oc get nodes -l type=compute # Compute Nodes
NAME                                              STATUS    AGE
ip-node1.ap-southeast-1.compute.internal   Ready     33d
ip-node2.ap-southeast-1.compute.internal   Ready     33d
ip-node3.ap-southeast-1.compute.internal   Ready     33d
ip-node4.ap-southeast-1.compute.internal   Ready     33d

[root@<snip>-master-93315 ~]# oc get nodes -l type=infra # Infra Nodes
NAME                                              STATUS    AGE
ip-inode1.ap-southeast-1.compute.internal   Ready     33d
ip-inode2.ap-southeast-1.compute.internal   Ready     33d

Comment 13 Matt Wringe 2017-01-26 15:38:10 UTC

Assigning to Solly as he seems to know more about what is going on here with Heapster and how the HPA call could be affecting this.

Comment 20 Peng Li 2017-02-27 02:45:25 UTC

Set to verified, since this is not a functional issue but rather a scalability/stability issue, and the underlying issue is fixed per https://bugzilla.redhat.com/show_bug.cgi?id=1405347#c25

Comment 24 errata-xmlrpc 2017-04-12 19:09:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884
