Description of problem:
This is likely a heapster issue.
In the master log we see "failed to get CPU consumption and request: metrics obtained for 0/X of pods" messages at intervals of 2m25s * n (5.0 min, 7.5 min, etc.) when using the horizontal pod autoscaler.
I created a KCS article for this issue. Depending on the timing of the request, the heapster metrics may not be available when data is published to the API side. It looks like a race condition, and it would be great if we could ensure that heapster doesn't return falsely empty metrics.
Periodic failed to get CPU consumption and request: metrics obtained for 0/X of pods logs in OpenShift
Version-Release number of selected component (if applicable):
Always, may depend on load
Steps to Reproduce:
1. Deploy some apps using HPA.
2. Execute "journalctl -u atomic-openshift-master | grep Failed" after a few hours.
failed to get CPU consumption and request: metrics obtained for 0/X of pods every 2m25s
no logs above under healthy condition
This also affects the speed at which the HPA scales pods, as described here: https://access.redhat.com/solutions/2332541
I believe I may have reproduced the issue on my end (using shorter duration for the model resolution, cache duration, etc). I'm currently investigating the cause.
A quick update: I've identified one issue that could be causing this and fixed it, but two more issues which yielded the same symptom have appeared. I'm currently working to identify the exact cause of the latter of those.
It's great news that the issue has been identified and fixed. Thanks for that. It's unfortunate, though, that two more came up.
Are you able to give us more details (at least at a high level) on what has been identified as an issue so far and what the fix for it is?
Also, what are the two new issues that were discovered?
The issues I found were as such:
- If the timing was just right when updating the aggregated pod-level metrics, the model's upper bound could be exactly equal to the cached metric time, which could cause one of the in-memory stores to incorrectly report that no metrics were available for the requested time period. The fix for this was to ensure that the correct data was considered by the in-memory store when the times were equal.
- The way the locking was written during the model updating phase, it was possible with the correct timing for the model API to serve a request in between the pod list update phase and the aggregation phase, which could cause it to appear that there were no metrics available. The fix for this was to move the locks so that they surrounded the entire model update operation, and not just the individual steps.
I suspect that one of these two issues is the root cause here.
An additional compounding factor is that the pod-list metrics API code does not return an error when metrics are missing for a given pod (it returns empty metrics instead), while the single-pod API code does return an error in this case. This makes the results potentially confusing and may add further causes of the symptom described in this BZ.
This bug was fixed in a previous erratum, but it was not properly attached to that erratum.
Please see https://access.redhat.com/errata/product/290/ver=3.2/rhel---7/x86_64/RHBA-2016:1343
I see this error in project events on OCP v18.104.22.168.
(In reply to dlbewley from comment #28)
> I see this error in project events on OCP v22.214.171.124.
Please open a support case at https://access.redhat.com/support/cases/ so we can verify it.
Yeah, I see this as well. Although it no longer affects scaling UP, it makes scaling DOWN take a good 5-10 minutes once you scale above 1-2 pods (5 containers took nearly 8 minutes to scale back down to 1), with numerous "X/Y pods CPU info not available" messages.