1337853 – failed to get CPU consumption and request: metrics obtained for 0/X of pods every 2m25s

Keywords:
Status:	CLOSED DUPLICATE of bug 1397593
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	3.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Solly Ross
QA Contact:	DeShuai Ma
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1267746 1397593

Reported:	2016-05-20 09:07 UTC by Takayoshi Kimura
Modified:	2020-04-15 14:29 UTC
CC List:	18 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1397593
Environment:
Last Closed:	2017-02-02 20:47:47 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2016:1479	0	normal	SHIPPED_LIVE	Red Hat OpenShift Enterprise 3.1 metrics image bug fix	2016-07-21 22:36:06 UTC

Description Takayoshi Kimura 2016-05-20 09:07:50 UTC

Description of problem:

This is likely a heapster issue.

In the master log, "failed to get CPU consumption and request: metrics obtained for 0/X of pods" messages appear at intervals of 2m25s * n (5.0 min, 7.5 min, etc.) when using the horizontal pod autoscaler.

I created a KCS article for this issue. Depending on the access timing, the heapster metrics are not available when the data is published to the API side. It looks like a race condition, and it would be great if we could ensure that heapster doesn't return false empty metrics.

Periodic failed to get CPU consumption and request: metrics obtained for 0/X of pods logs in OpenShift 
https://access.redhat.com/solutions/2329271

Version-Release number of selected component (if applicable):

openshift v3.1.1.6-64-g80b61da

How reproducible:

Always, may depend on load

Steps to Reproduce:
1. Deploy some apps using hpa.
2. Execute "journalctl -u atomic-openshift-master | grep Failed" after a few hours.

Actual results:

failed to get CPU consumption and request: metrics obtained for 0/X of pods every 2m25s

Expected results:

No such log messages under healthy conditions.

Additional info:

Comment 1 Boris Kurktchiev 2016-05-20 14:29:28 UTC

This also affects the speed at which the HPA scales pods, as described here: https://access.redhat.com/solutions/2332541

Comment 4 Solly Ross 2016-05-23 20:58:09 UTC

I believe I may have reproduced the issue on my end (using shorter durations for the model resolution, cache duration, etc.).  I'm currently investigating the cause.

Comment 7 Solly Ross 2016-05-26 19:10:09 UTC

A quick update: I've identified one issue that could be causing this and fixed it, but two more issues which yielded the same symptom have appeared.  I'm currently working to identify the exact cause of the latter of those.

Comment 9 Michael Napolis 2016-05-27 04:50:36 UTC

That's great news that the issue is identified and fixed.  Thanks for that.  It's unfortunate, though, that 2 more came up.
Are you able to provide more details (at least at a high level) on what has been identified as an issue so far and what the fix for it is?
Also, what are the 2 new issues that were discovered?

Cheers,
Michael

Comment 10 Solly Ross 2016-05-27 15:55:48 UTC

The issues I found were as such:

- If the timing was just right when updating the aggregated pod-level metrics, it was possible for the model's upper bound to be exactly equal to the cached metric time, which could cause one of the in-memory stores to incorrectly report that no metrics were available for the requested time period.  The fix for this was to ensure that the correct data was considered by the in-memory store when the times were equal.

- Because of the way the locking was written in the model updating phase, it was possible, with the right timing, for the model API to serve a request in between the pod list update phase and the aggregation phase, which could make it appear that there were no metrics available.  The fix for this was to move the locks so that they surrounded the entire model update operation, and not just the individual steps.
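
To illustrate the locking change described in the second item, here is a minimal Go sketch. It is not the actual heapster code: the `model` type, its fields, and the `update` and `podsWithMetrics` functions are hypothetical names chosen for illustration. It only shows the general idea of holding one lock across both the pod list update and the aggregation, so that a reader never sees a pod list without its metrics.

```go
package main

import (
	"fmt"
	"sync"
)

// model is a hypothetical stand-in for an in-memory metrics model.
// The names are illustrative and are not taken from heapster.
type model struct {
	mu      sync.RWMutex
	pods    []string           // known pod names
	metrics map[string]float64 // aggregated CPU usage per pod
}

// update runs both phases (pod list refresh, then aggregation) under
// a single write lock, so readers never observe the state in between.
func (m *model) update(samples map[string]float64) {
	m.mu.Lock()
	defer m.mu.Unlock()

	// Phase 1: refresh the pod list and reset the aggregates.
	m.pods = m.pods[:0]
	m.metrics = make(map[string]float64, len(samples))
	for name := range samples {
		m.pods = append(m.pods, name)
	}

	// Phase 2: aggregate metrics for every listed pod.
	for _, name := range m.pods {
		m.metrics[name] = samples[name]
	}
}

// podsWithMetrics reports how many listed pods have metrics, and the
// total number of listed pods, as one consistent snapshot.
func (m *model) podsWithMetrics() (have, total int) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	for _, name := range m.pods {
		if _, ok := m.metrics[name]; ok {
			have++
		}
	}
	return have, len(m.pods)
}

func main() {
	m := &model{}
	samples := map[string]float64{"pod-a": 0.25, "pod-b": 0.5}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			m.update(samples)
		}
	}()

	// Concurrent readers should always see a complete snapshot.
	for i := 0; i < 1000; i++ {
		if have, total := m.podsWithMetrics(); have != total {
			fmt.Printf("incomplete snapshot: %d/%d\n", have, total)
		}
	}
	wg.Wait()

	have, total := m.podsWithMetrics()
	fmt.Printf("metrics obtained for %d/%d pods\n", have, total)
}
```

If the two phases each took and released the lock separately, a reader scheduled between them could see a populated pod list with an empty metrics map, which would match the "0/X of pods" symptom.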

I suspect that one of these two issues is the root cause here.
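
The first issue above (the equal-timestamp case) can be sketched in the same spirit. This is an assumption-laden illustration, not the heapster implementation: the `point` type and `inRange` function are hypothetical, timestamps are plain Unix seconds, and the comment does not say which comparison the real fix changed. The sketch assumes the problem was an exclusive bound that dropped a sample whose time equalled the requested upper bound.

```go
package main

import "fmt"

// point is a hypothetical timestamped sample held by an in-memory store.
type point struct {
	unix  int64   // sample timestamp, Unix seconds
	value float64 // metric value
}

// inRange returns the points with start <= unix <= end. The upper
// bound is inclusive; a strict comparison (unix < end) would drop a
// sample whose timestamp is exactly equal to end.
func inRange(points []point, start, end int64) []point {
	var out []point
	for _, p := range points {
		if p.unix >= start && p.unix <= end {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	cached := []point{{unix: 1463735220, value: 0.25}}

	// The requested window ends exactly at the cached sample's time.
	got := inRange(cached, 1463735220-60, 1463735220)
	fmt.Println(len(got)) // prints 1
}
```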

An additional compounding factor is that the pod-list metrics API code does not return an error when metrics are missing for a given pod (and instead returns empty metrics), while the single-pod API code does return an error in this case (making the results potentially slightly confusing, and potentially adding additional causes of the symptom described in this BZ).
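
The difference between the two API behaviours can be sketched as follows. This is a hypothetical illustration only: `store`, `podCPU`, `podListCPU`, and `countObtained` are invented names and do not correspond to heapster's actual API. It shows why a caller of a list-style call has to count the empty entries itself, while a single-pod call surfaces the problem as an error.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoMetrics is a hypothetical error for a pod without metrics.
var errNoMetrics = errors.New("no metrics available")

// store is a hypothetical map from pod name to CPU usage.
type store map[string]float64

// podMetrics is one entry of a list response; hasCPU is false when
// the entry is empty.
type podMetrics struct {
	name   string
	cpu    float64
	hasCPU bool
}

// podCPU models the single-pod behaviour: missing metrics are an error.
func (s store) podCPU(name string) (float64, error) {
	v, ok := s[name]
	if !ok {
		return 0, errNoMetrics
	}
	return v, nil
}

// podListCPU models the pod-list behaviour: missing metrics yield an
// empty entry and no error.
func (s store) podListCPU(names []string) []podMetrics {
	out := make([]podMetrics, 0, len(names))
	for _, name := range names {
		v, ok := s[name]
		out = append(out, podMetrics{name: name, cpu: v, hasCPU: ok})
	}
	return out
}

// countObtained counts the non-empty entries in a list response.
func countObtained(list []podMetrics) (have, total int) {
	for _, m := range list {
		if m.hasCPU {
			have++
		}
	}
	return have, len(list)
}

func main() {
	s := store{"pod-a": 0.25}

	have, total := countObtained(s.podListCPU([]string{"pod-a", "pod-b"}))
	fmt.Printf("metrics obtained for %d/%d of pods\n", have, total)

	if _, err := s.podCPU("pod-b"); err != nil {
		fmt.Println("single-pod lookup:", err)
	}
}
```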

Comment 27 Scott Dodson 2016-07-21 13:48:43 UTC

This bug was fixed in a previous erratum, but was not properly attached to that erratum.

Please see https://access.redhat.com/errata/product/290/ver=3.2/rhel---7/x86_64/RHBA-2016:1343

Comment 28 dlbewley 2016-10-10 23:24:11 UTC

I see this error in project events on OCP v3.3.0.34.

Comment 29 Takayoshi Kimura 2016-10-11 00:43:50 UTC

(In reply to dlbewley from comment #28)
> I see this error in project events on OCP v3.3.0.34.

Please open a support case at https://access.redhat.com/support/cases/ so we can verify it.

Regards,
Takayoshi

Comment 35 Boris Kurktchiev 2016-10-17 15:47:01 UTC

Yeah, I see this as well. Although it no longer affects scaling UP, it makes scaling DOWN take a good 5-10 minutes once you scale above 1-2 pods (an app with 5 containers took nearly 8 minutes to scale back down to 1), with numerous "X/Y pods CPU info not available" messages.
