Bug 1867477

Summary: HPA monitoring cpu utilization fails for deployments which have init containers
Product: OpenShift Container Platform Reporter: Arnab Ghosh <arghosh>
Component: Node Assignee: Joel Smith <joelsmith>
Node sub component: Autoscaler (HPA, VPA) QA Contact: Weinan Liu <weinliu>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: adeshpan, akaris, amer.ezahir, andbartl, aos-bugs, christopher.obrien, ddelcian, fshaikh, jean.froment, jeder, jlee, joelsmith, john.macleod, jokerman, jseunghw, kperrier, ksathe, mfiedler, nmaynard, oarribas, ocasalsa, openshift-bugs-escalate, pbergene, pkanthal, rpalathi, sgarciam, skrenger, tkonishi, tmckay, tsweeney, vjaypurk, weinliu, xingli
Version: 4.5 Keywords: ServiceDeliveryImpact
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: HPA ignores pods with incomplete metrics, such as those sent by the Prometheus adapter for pods with init containers.
Consequence: Any pod with an init container would not be scaled.
Fix: Make the Prometheus adapter send complete metrics for init containers.
Result: HPA can scale pods with init containers.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:15:27 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 1895532    

Description Arnab Ghosh 2020-08-10 06:38:51 UTC
Description of problem:
This bug could be a duplicate of bug[1]. I am creating it because the issue persists even after upgrading the cluster to 4.5.4, although the errata for bug[1] says that it was fixed in OpenShift version 4.5.1.

[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1749468

~~~
$ oc get clusterversion -oyaml
...
    - lastTransitionTime: "2020-04-15T19:32:18Z"
      message: Done applying 4.5.4
      status: "True"
      type: Available

$ oc describe hpa mongo-ss
Name:                                                  mongo-ss
Namespace:                                             default
...
Reference:                                             StatefulSet/mongo-ss
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  <unknown> / 12%
Min replicas:                                          1
Max replicas:                                          10
StatefulSet pods:                                      1 current / 0 desired
Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetResourceMetric  the HPA was unable to compute the replica count: did not receive metrics for any ready pods
Events:
  Type     Reason                        Age                    From                       Message
  ----     ------                        ----                   ----                       -------
  Warning  FailedGetResourceMetric       4m39s (x3 over 5m9s)   horizontal-pod-autoscaler  unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  4m39s (x3 over 5m9s)   horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  2m24s (x9 over 4m24s)  horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: did not receive metrics for any ready pods
  Warning  FailedGetResourceMetric       9s (x18 over 4m24s)    horizontal-pod-autoscaler  did not receive metrics for any ready pods
~~~

Version-Release number of selected component (if applicable):
OpenShift 4.5.4

How reproducible:
Always

Steps to Reproduce:
The reproduction steps in bug[1] were followed.

Actual results:
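HPA does not show a proper status: CPU utilization is reported as <unknown> and ScalingActive is False with reason FailedGetResourceMetric.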

Expected results:
HPA should be able to handle init containers.

Additional info:
Refer to comment section of this bug.

Comment 4 Neelesh Agrawal 2020-09-09 14:07:04 UTC
*** Bug 1749468 has been marked as a duplicate of this bug. ***

Comment 29 Oscar Casal Sanchez 2020-11-09 07:37:08 UTC
Hello!

I was reviewing the bugs linked to this Bugzilla and was able to find ones for the 4.6 and 4.4 target releases, but not for 4.5. Do you know whether one already exists?

Regards,
Oscar

Comment 31 Weinan Liu 2020-11-11 10:26:54 UTC
Failed Test

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-11-11-033756   True        False         3h7m    Cluster version is 4.7.0-0.nightly-2020-11-11-033756

Got the same output as above.

Comment 34 Weinan Liu 2020-11-12 06:53:57 UTC
Thanks, @Joel,
I thought the warnings should also get cleared.

As per comments #33 and #30, the issue is fixed on:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-11-11-033756   True        False         3h7m    Cluster version is 4.7.0-0.nightly-2020-11-11-033756

Comment 37 Joel Smith 2020-11-13 17:01:15 UTC
Zero CPU usage for the init container is the fix we added. It makes it so that HPA will not consider the metrics invalid. 

If the metrics report a container with a memory metric but no CPU metric, then HPA will think that something is wrong with the metrics and it won't scale. That's what caused this bug. So the metrics either have to completely remove the init container, or include it with zero values for both CPU and memory. We decided that the cleanest fix was to include it with the zero values.

If you see an init container metric like this:

        {
          "name": "empty-init",
          "usage": {
            "cpu": "0",
            "memory": "0"
          }
        },

that is good, and expected. Because the init container finishes running before the main container starts, we would expect its CPU usage to stay at zero for the rest of the pod's lifetime.

If you see an init container metric like this:

        {
          "name": "empty-init",
          "usage": {
            "memory": "0"
          }
        },

then HPA will fail due to the missing CPU metrics.
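
The rule described above can be sketched roughly as follows. This is only an illustration of the "every container must report the resource" check, not the actual HPA code. The function name `pods_with_complete_metric`, the pod names, the `app` container, and its usage values are all hypothetical; only the `empty-init` entries mirror the examples above.

```python
def pods_with_complete_metric(pod_metrics_items, resource="cpu"):
    """Return names of pods whose every container reports `resource`.

    Hypothetical helper: mimics the rule that one container without
    the requested metric makes the whole pod's metrics unusable.
    """
    usable = []
    for pod in pod_metrics_items:
        containers = pod.get("containers", [])
        # A single container lacking the resource disqualifies the pod.
        if containers and all(resource in c.get("usage", {}) for c in containers):
            usable.append(pod["metadata"]["name"])
    return usable


# Init container reported with zero values (the behaviour after the fix).
fixed_pod = {
    "metadata": {"name": "pod-after-fix"},  # hypothetical pod name
    "containers": [
        {"name": "empty-init", "usage": {"cpu": "0", "memory": "0"}},
        {"name": "app", "usage": {"cpu": "5m", "memory": "64Mi"}},  # made-up values
    ],
}

# Init container reported with memory only (the behaviour that caused the bug).
broken_pod = {
    "metadata": {"name": "pod-before-fix"},  # hypothetical pod name
    "containers": [
        {"name": "empty-init", "usage": {"memory": "0"}},
        {"name": "app", "usage": {"cpu": "5m", "memory": "64Mi"}},  # made-up values
    ],
}

print(pods_with_complete_metric([fixed_pod, broken_pod]))  # ['pod-after-fix']
```

With only `broken_pod` as input the list comes back empty, which corresponds to the "did not receive metrics for any ready pods" condition in the description.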

Comment 38 Weinan Liu 2020-11-18 09:08:10 UTC
@oarribas,

Do we have any other items to check when verifying this issue?

Comment 44 Weinan Liu 2020-11-30 16:08:24 UTC
https://github.com/openshift/cucushift/pull/8246 qe_test_coverage+

Comment 51 Amer EZAHIR 2021-02-16 09:33:20 UTC
Hi,

I'm on OCP 4.5.19 and I'm facing the same issue.
Has this been definitively resolved in OCP 4.6, or is there a workaround for OCP 4.5.19, please?

Thanks
Kind regards

Comment 52 Tom Sweeney 2021-02-16 15:23:26 UTC
Joel do you have an answer to Amer's question in comment #51?

Comment 55 errata-xmlrpc 2021-02-24 15:15:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633