Bug 1711070 - OLM/OperatorHub components should not run in BestEffort QoS
Summary: OLM/OperatorHub components should not run in BestEffort QoS
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Kevin Rizza
QA Contact: Salvatore Colangelo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-16 21:14 UTC by Seth Jennings
Modified: 2019-10-16 06:29 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:28:56 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:29:11 UTC)

Description Seth Jennings 2019-05-16 21:14:21 UTC
The following pods run in the BestEffort QoS class with no resource requests:

openshift-operator-lifecycle-manager/catalog-operator
openshift-operator-lifecycle-manager/olm-operator
openshift-operator-lifecycle-manager/olm-operators
openshift-operator-lifecycle-manager/packageserver

https://github.com/openshift/origin/pull/22787

This can cause eviction, OOMKilling, and CPU starvation.

Please add the following resource requests to the pods in this component:

Memory:
olm-operator     160Mi
catalog-operator 80Mi
olm-operators    50Mi
packageserver    50Mi

CPU:
10m for all
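
(For illustration only -- a sketch, not the actual fix, which belongs in the operator/CVO manifests that own these deployments and would likely be reverted if applied by hand: requests like the above could be set on a deployment with `oc set resources`. Setting requests without limits puts the pods in the Burstable QoS class rather than Guaranteed.)

$ oc set resources deploy/olm-operator -n openshift-operator-lifecycle-manager \
    --requests=cpu=10m,memory=160Mi
$ oc set resources deploy/catalog-operator -n openshift-operator-lifecycle-manager \
    --requests=cpu=10m,memory=80Mi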

Comment 1 Seth Jennings 2019-05-16 21:42:01 UTC
Additionally

openshift-marketplace/certified-operators
openshift-marketplace/community-operators
openshift-marketplace/marketplace-operator
openshift-marketplace/redhat-operators

Memory:
certified-operators 80Mi
community-operators 80Mi
marketplace-operator 50Mi
redhat-operators 50Mi

CPU:
10m for all

Comment 2 Seth Jennings 2019-06-05 13:33:06 UTC
Could I get some feedback on this bug?

This effort targets 4.2 and blocks this story in Jira https://jira.coreos.com/browse/POD-144

These are the last components to bring into compliance.

Comment 3 Evan Cordell 2019-07-12 15:47:52 UTC
https://jira.coreos.com/browse/OLM-1130 is slated for this sprint, so it should be done soon.

Comment 6 Jian Zhang 2019-07-23 02:41:15 UTC
Looks good for the OLM component; the `QoS Class` of the pods is `Burstable` now.

mac:~ jianzhang$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-548956f758-vhldk   1/1     Running   0          24h
olm-operator-85f7475cf-kqb49        1/1     Running   0          23h
packageserver-7c6b67fc64-sh8td      1/1     Running   0          5h44m
packageserver-7c6b67fc64-z7mhq      1/1     Running   0          5h44m

mac:~ jianzhang$ oc describe pods |grep Requests: -A 2 
    Requests:
      cpu:      10m
      memory:   80Mi
--
    Requests:
      cpu:      10m
      memory:   160Mi
--
    Requests:
      cpu:        10m
      memory:     50Mi
--
    Requests:
      cpu:        10m
      memory:     50Mi

mac:~ jianzhang$ oc describe pods |grep "QoS Class" 
QoS Class:       Burstable
QoS Class:       Burstable
QoS Class:       Burstable
QoS Class:       Burstable

But for the marketplace part, the pods still use `BestEffort` and no requests are specified. Changing the status to `ASSIGNED` and moving on to the marketplace sub-component.
mac:~ jianzhang$ oc describe pods -n openshift-marketplace | grep "Requests"
mac:~ jianzhang$ oc describe pods -n openshift-marketplace | grep "QoS Class"
QoS Class:       BestEffort
QoS Class:       BestEffort
QoS Class:       BestEffort
QoS Class:       BestEffort
QoS Class:       BestEffort
QoS Class:       BestEffort
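
(For reference, a custom-columns query -- a sketch, not the command actually run above -- lists the QoS class per pod by name, which the grep output does not show:)

$ oc get pods -n openshift-marketplace \
    -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass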

Comment 7 Kevin Rizza 2019-07-23 12:36:01 UTC
(In reply to Seth Jennings from comment #1)
> Additionally
> 
> openshift-marketplace/certified-operators
> openshift-marketplace/community-operators
> openshift-marketplace/marketplace-operator
> openshift-marketplace/redhat-operators
> 
> Memory:
> certified-operators 80Mi
> community-operators 80Mi
> marketplace-operator 50Mi
> redhat-operators 50Mi
> 
> CPU:
> 10m for all

I'm not sure this is a good way to handle these. We can certainly add these constraints for the openshift-marketplace/marketplace-operator pod, but the other three are operands that happen to be created by default, and their needs will change over time as the content they host changes. In fact, the suggestion that certified-operators and community-operators should request different resources implies exactly that: those pods are identical aside from the external content they are serving.

How were these numbers obtained? Do you have a better suggestion than using BestEffort on these pods, given that they will definitely have different resource constraints depending on what content they are serving from Quay?

Comment 9 Salvatore Colangelo 2019-08-28 13:01:12 UTC
[scolange@scolange operators]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-27-072819


Now we have:

[scolange@scolange operators]$ oc describe pods community-operators-69ff689f5d-rwpwk -n openshift-marketplace | grep "QoS Class"
QoS Class:       BestEffort

[scolange@scolange operators]$ oc describe pods certified-operators-64bc446dcf-9r4zb -n openshift-marketplace | grep "QoS Class"
QoS Class:       BestEffort


[scolange@scolange operators]$ oc describe pods redhat-operators-7b9cd9c994-9cqts -n openshift-marketplace | grep "QoS Class"
QoS Class:       Burstable
[scolange@scolange operators]$ oc describe pods marketplace-operator-df8d68d67-xddq9 -n openshift-marketplace | grep "QoS Class"
QoS Class:       Burstable

I think they should be set to Burstable to handle any changes in the future. What do you think?

Comment 10 Kevin Rizza 2019-08-28 13:33:13 UTC
Hi Salvatore,

I'm not totally sure how that is possible, given that the pods `community-operators-*`, `certified-operators-*`, and `redhat-operators-*` are generated in code in exactly the same way.

Can you give me some context on this cluster? Is it an upgrade from a previous version? Can you try killing the pods and recreating them, then checking again?

Can you also try to get all of the deployments to see if spec.resources.requests is specified in all of them, and what the values are if so?

Thanks!
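
(One way to check this -- a sketch; note that in a Deployment the requests actually live under spec.template.spec.containers[*].resources.requests:)

$ oc get deployments -n openshift-marketplace \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[*].resources.requests}{"\n"}{end}'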

Comment 11 Salvatore Colangelo 2019-08-28 16:32:27 UTC
Hi Kevin,

you are right, it was probably the wrong cluster, sorry!

Below are the steps:


[scolange@scolange ]$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.ci-2019-08-28-103038   True        False         153m    Cluster version is 4.2.0-0.ci-2019-08-28-103038

[scolange@scolange ]$ oc get pods -n openshift-marketplace
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-77c9c6b9c9-znx47    1/1     Running   0          165m
community-operators-d5cb7dbf4-hgbzp     1/1     Running   0          165m
marketplace-operator-65d498f785-kvbqp   1/1     Running   0          165m
redhat-operators-66fdd79ff5-4h8mm       1/1     Running   0          104m



[scolange@scolange ]$ oc describe pods -n openshift-marketplace|grep "QoS Class" 
QoS Class:       Burstable
QoS Class:       Burstable
QoS Class:       Burstable
QoS Class:       Burstable

[scolange@scolange ]$ oc describe pods -n openshift-marketplace | grep "Requests"
    Requests:
    Requests:
    Requests:
    Requests:

Comment 12 Seth Jennings 2019-08-28 22:23:35 UTC
I updated openshift-tests to not exclude openshift-marketplace, and it does indeed pass.

Closing the gate to prevent regression
https://github.com/openshift/origin/pull/23690

Comment 15 Kevin Rizza 2019-08-29 14:42:16 UTC
Based on your comments (and the fact that I deployed a 4.2 cluster today and saw the same behavior), I am going to mark this as verified.

Comment 17 errata-xmlrpc 2019-10-16 06:28:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

