Description of problem:
Pods in a ready state (1/1) are not reporting metrics to Heapster. This does not appear to be limited to a single node; from a quick look it is a namespace-wide issue.
Version-Release number of selected component (if applicable):
heapster-1.1.0-3.el7.x86_64 in metrics install 3.3.0
How reproducible:
Unknown; this is the first time we have seen this.
Steps to Reproduce:
I0116 10:55:50.566507 1 handlers.go:242] No metrics for container <application> in pod <namespace>/<application>-14-2p2kw
I0116 10:55:50.566514 1 handlers.go:178] No metrics for pod <namespace>/<application>-14-2p2kw
I0116 10:55:50.566537 1 handlers.go:178] No metrics for pod <namespace>/<application>-14-ajpt2
I0116 10:55:50.566551 1 handlers.go:178] No metrics for pod <namespace>/<application>-14-dlfm7
I0116 10:55:50.566562 1 handlers.go:178] No metrics for pod <namespace>/<application>-14-w0pia
I0116 10:55:50.566575 1 handlers.go:178] No metrics for pod <namespace>/<application>-14-3qxv0
There are a few reasons why this might occur.
The nodes themselves only store metrics for a brief period; if Heapster requests this information outside that window, it will not receive any metrics back.
Things to check:
1) The Heapster version. There was an issue with older Heapster versions where Heapster would get out of sync with the time range it requested from the nodes. This was fixed in Heapster 1.0.
From IRC, it was determined that this is not the cause, as it is running Heapster 1.1.0.
2) Node clocks being out of sync. If the node clocks are not in sync, then when Heapster asks a node for metrics in a given time range, that range may fall outside what the node currently has stored. If this were the case, only certain nodes would return metrics while others would not.
From IRC, it was determined that the clocks on the nodes are synced with NTP and that this issue doesn't affect particular nodes only.
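A rough way to quantify the skew described in item 2 is to gather one clock reading per node and compare them. The node names and timestamps below are made up for illustration; on a live cluster you might collect them over ssh with `date +%s`:

```shell
# Hypothetical sample of per-node clock readings (epoch seconds).
# On a real cluster these could be gathered with something like:
#   for n in node1 node2 node3; do echo "$n $(ssh $n date +%s)"; done
cat <<'EOF' > /tmp/node_clocks.txt
node1 1484564150
node2 1484564151
node3 1484564149
EOF

# Report the spread between the fastest and slowest clock.
awk '{ if (min == "" || $2 < min) min = $2; if ($2 > max) max = $2 }
     END { print "max skew (s):", max - min }' /tmp/node_clocks.txt
```

A skew of more than a few seconds between nodes would make this explanation worth pursuing.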
3) The pods are not in the running state. Heapster will sometimes try to get metrics for pods even if a pod is not currently running. Since a non-running pod has no metrics to collect, Heapster returns an error like this.
From IRC, it was determined that these pods are all in the running state.
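One way to double-check item 3 is to filter the pod list for anything not Running. The pod names below are placeholders fed through a sample file; on a live cluster you would pipe `oc get pods -n <namespace> --no-headers` instead:

```shell
# Hypothetical sample of `oc get pods --no-headers` output.
cat <<'EOF' > /tmp/pods.txt
app-14-2p2kw   1/1   Running            0    2d
app-14-ajpt2   1/1   Running            0    2d
app-14-dlfm7   0/1   CrashLoopBackOff   12   2d
EOF

# Any line printed here is a pod for which "No metrics" messages
# would be expected.
grep -v Running /tmp/pods.txt
```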
4) The metrics are being gathered and the graphs appear in the console, but someone has increased the log verbosity and incorrectly assumes that metrics are not being gathered. This can happen if the node collects metrics every 20 seconds but Heapster asks for metrics every 15 seconds; some 15-second windows will then contain no sample, producing debug messages like these in the logs.
From IRC, it was determined that this is limited to a specific namespace. If reason 4 were the cause, we would be seeing this for all pods, not just pods in a specific namespace.
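The 20-second vs. 15-second mismatch in item 4 can be illustrated with a toy calculation; this is not Heapster code, just arithmetic over a simulated timeline:

```shell
# Samples arrive every 20s (at t = 0, 20, 40, ...); requests cover
# 15s windows. Some windows contain no sample at all.
seq 0 15 120 | awk '{
  start = $1; end = $1 + 15; hit = 0
  for (s = 0; s <= 120; s += 20)
    if (s >= start && s < end) hit = 1
  printf "window [%3d,%3d): %s\n", start, end, (hit ? "sample" : "EMPTY")
}'
```

Two of the nine windows come back empty, which would show up as debug messages even though metrics collection is healthy.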
Can you please attach the full Heapster logs? That would help more than seeing a small snippet of the logs.
Can you please verify that this is limited to a specific namespace? Are all pods in this namespace not showing up, or only certain pods in this namespace? Is this limited to just one namespace, or are you seeing this from multiple namespaces?
Can we please verify that other pods on this node are having their metrics collected?
Is there anything special about this namespace? Does its name have any special characters or anything else unusual in it?
For these pods, does anything show up in the graphs in the console, or are they completely empty?
Also, could you describe the setup this is running under? E.g. roughly how many nodes are in the OpenShift cluster, how many pods are deployed, etc.
Passing the log through "grep -v <namespace>", there is not a single instance of "No metrics for pod" in 2+ days.
Grepping for other applications in the namespace turns up 0 results, so it may be limited to this specific application.
It is a scaled app, and this error shows for pods on different nodes.
Nothing special for this NS from what I can see.
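The "grep -v" check above can be extended to group the "No metrics" messages by namespace. The log lines below are placeholders modeled on the snippet at the top of this report; in practice you would pipe the actual Heapster log (e.g. via `oc logs` on the Heapster pod) instead of the sample file:

```shell
# Hypothetical Heapster log excerpt (namespaces/pods are placeholders).
cat <<'EOF' > /tmp/heapster.log
I0116 10:55:50.566514 1 handlers.go:178] No metrics for pod ns-a/app-14-2p2kw
I0116 10:55:50.566537 1 handlers.go:178] No metrics for pod ns-a/app-14-ajpt2
I0116 10:55:51.100000 1 handlers.go:178] No metrics for pod ns-b/other-1-xyz12
EOF

# Count "No metrics" messages per namespace.
grep 'No metrics for pod' /tmp/heapster.log \
  | sed 's#.*No metrics for pod ##' \
  | cut -d/ -f1 | sort | uniq -c
```

A count heavily concentrated in one namespace would confirm the namespace-specific pattern described here.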
Number of running pods cluster wide:
[root@<snip>-master-93315 ~]# oc get pods --all-namespaces | grep Running | wc -l
[root@<snip>-master-93315 ~]# oc get nodes -l type=compute # Compute Nodes
NAME                                       STATUS    AGE
ip-node1.ap-southeast-1.compute.internal   Ready     33d
ip-node2.ap-southeast-1.compute.internal   Ready     33d
ip-node3.ap-southeast-1.compute.internal   Ready     33d
ip-node4.ap-southeast-1.compute.internal   Ready     33d
[root@<snip>-master-93315 ~]# oc get nodes -l type=infra # Infra Nodes
NAME                                        STATUS    AGE
ip-inode1.ap-southeast-1.compute.internal   Ready     33d
ip-inode2.ap-southeast-1.compute.internal   Ready     33d
Assigning to Solly as he seems to know more about what is going on here with Heapster and how the HPA call could be affecting this.
Setting this to VERIFIED, since this is not a functional issue but rather a scalability/stability issue, and the underlying issue is fixed per https://bugzilla.redhat.com/show_bug.cgi?id=1405347#c25
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.