Bug 1711070

Summary: OLM/OperatorHub components should not run in BestEffort QoS
Product: OpenShift Container Platform
Component: OLM (sub component: OperatorHub)
Reporter: Seth Jennings <sjenning>
Assignee: Kevin Rizza <krizza>
QA Contact: Salvatore Colangelo <scolange>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: bandrade, chezhang, chuo, ecordell, jfan, krizza, scolange
Version: 4.1.0
Target Release: 4.2.0
Type: Bug
Last Closed: 2019-10-16 06:28:56 UTC

Description Seth Jennings 2019-05-16 21:14:21 UTC
The following pods run in the BestEffort QoS class with no resource requests:

openshift-operator-lifecycle-manager/catalog-operator
openshift-operator-lifecycle-manager/olm-operator
openshift-operator-lifecycle-manager/olm-operators
openshift-operator-lifecycle-manager/packageserver

https://github.com/openshift/origin/pull/22787

This can cause eviction, OOMKilling, and CPU starvation.

Please add the following resource requests to the pods in this component (a sketch of the resulting stanza follows the lists):

Memory:
olm-operator     160Mi
catalog-operator 80Mi
olm-operators    50Mi
packageserver    50Mi

CPU:
10m for all
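
For illustration, this is roughly the stanza those values translate to in each deployment's pod template (a sketch using the olm-operator numbers; the container name is assumed for the example):

    spec:
      template:
        spec:
          containers:
          - name: olm-operator
            resources:
              requests:        # requests only, no limits:
                cpu: 10m       # pod lands in Burstable QoS, not Guaranteed
                memory: 160Mi

Requests without limits are enough to move the pods out of BestEffort, so the kubelet evicts them after BestEffort pods under node pressure and the scheduler reserves capacity for them.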

Comment 1 Seth Jennings 2019-05-16 21:42:01 UTC
Additionally

openshift-marketplace/certified-operators
openshift-marketplace/community-operators
openshift-marketplace/marketplace-operator
openshift-marketplace/redhat-operators

Memory:
certified-operators 80Mi
community-operators 80Mi
marketplace-operator 50Mi
redhat-operators 50Mi

CPU:
10m for all
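
For a quick local experiment, a patch along these lines should do it (a sketch, assuming the container is named `marketplace-operator`; note that the operator/CVO reconciles managed deployments back to their shipped manifests, so the permanent fix has to land there):

    $ oc patch deployment marketplace-operator -n openshift-marketplace \
        -p '{"spec":{"template":{"spec":{"containers":[{"name":"marketplace-operator","resources":{"requests":{"cpu":"10m","memory":"50Mi"}}}]}}}}'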

Comment 2 Seth Jennings 2019-06-05 13:33:06 UTC
Could I get some feedback on this bug?

This effort targets 4.2 and blocks this story in Jira https://jira.coreos.com/browse/POD-144

These are the last components to bring into compliance.

Comment 3 Evan Cordell 2019-07-12 15:47:52 UTC
https://jira.coreos.com/browse/OLM-1130 is slated for this sprint, so it should be done soon.

Comment 6 Jian Zhang 2019-07-23 02:41:15 UTC
Looks good for the OLM component; the `QoS Class` of the pods is `Burstable` now.

mac:~ jianzhang$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-548956f758-vhldk   1/1     Running   0          24h
olm-operator-85f7475cf-kqb49        1/1     Running   0          23h
packageserver-7c6b67fc64-sh8td      1/1     Running   0          5h44m
packageserver-7c6b67fc64-z7mhq      1/1     Running   0          5h44m

mac:~ jianzhang$ oc describe pods |grep Requests: -A 2 
    Requests:
      cpu:      10m
      memory:   80Mi
--
    Requests:
      cpu:      10m
      memory:   160Mi
--
    Requests:
      cpu:        10m
      memory:     50Mi
--
    Requests:
      cpu:        10m
      memory:     50Mi

mac:~ jianzhang$ oc describe pods |grep "QoS Class" 
QoS Class:       Burstable
QoS Class:       Burstable
QoS Class:       Burstable
QoS Class:       Burstable

But for the marketplace part, the pods still run as `BestEffort` with no requests specified. Changing status to `ASSIGNED` and moving on to the marketplace sub-component.
mac:~ jianzhang$ oc describe pods -n openshift-marketplace | grep "Requests"
mac:~ jianzhang$ oc describe pods -n openshift-marketplace | grep "QoS Class"
QoS Class:       BestEffort
QoS Class:       BestEffort
QoS Class:       BestEffort
QoS Class:       BestEffort
QoS Class:       BestEffort
QoS Class:       BestEffort
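
(As an aside, one jsonpath query gives the same overview without grep; `.status.qosClass` is the field that `oc describe` renders as `QoS Class`:)

    $ oc get pods -n openshift-marketplace \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.qosClass}{"\n"}{end}'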

Comment 7 Kevin Rizza 2019-07-23 12:36:01 UTC
(In reply to Seth Jennings from comment #1)
> Additionally
> 
> openshift-marketplace/certified-operators
> openshift-marketplace/community-operators
> openshift-marketplace/marketplace-operator
> openshift-marketplace/redhat-operators
> 
> Memory:
> certified-operators 80Mi
> community-operators 80Mi
> marketplace-operator 50Mi
> redhat-operators 50Mi
> 
> CPU:
> 10m for all

I'm not sure this is a good way to handle these. We can certainly add these constraints for the openshift-marketplace/marketplace-operator pod, but the other three are operands that happen to be created by default, and their needs will change over time as the content they host changes. In fact, the suggestion itself assigns different values to catalog pods (80Mi for community-operators, 50Mi for redhat-operators) that are identical aside from the external content they serve, which implies their resource needs track that content.

How were these numbers obtained? Do you have a better suggestion than BestEffort for these pods, given that their resource needs will definitely vary with the content they serve from Quay?

Comment 9 Salvatore Colangelo 2019-08-28 13:01:12 UTC
[scolange@scolange operators]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-27-072819


Now we have:

[scolange@scolange operators]$ oc describe pods community-operators-69ff689f5d-rwpwk -n openshift-marketplace | grep "QoS Class"
QoS Class:       BestEffort

[scolange@scolange operators]$ oc describe pods certified-operators-64bc446dcf-9r4zb -n openshift-marketplace | grep "QoS Class"
QoS Class:       BestEffort


[scolange@scolange operators]$ oc describe pods redhat-operators-7b9cd9c994-9cqts -n openshift-marketplace | grep "QoS Class"
QoS Class:       Burstable
[scolange@scolange operators]$ oc describe pods marketplace-operator-df8d68d67-xddq9 -n openshift-marketplace | grep "QoS Class"
QoS Class:       Burstable

I think they should be set to Burstable to accommodate any future change. What do you think?

Comment 10 Kevin Rizza 2019-08-28 13:33:13 UTC
Hi Salvatore,

I'm not totally sure how that is possible, given that the `community-operators-*`, `certified-operators-*`, and `redhat-operators-*` pods are generated in code in exactly the same way.

Can you give me some context on this cluster? Is it an upgrade from a previous version? Can you try killing the pods and recreating them, then checking again?

Can you also try to get all of the deployments to see if spec.resources.requests is specified in all of them, and what the values are if so?

Thanks!
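
(For reference, a sketch of one way to dump the requests from every deployment at once; in a Deployment the field sits at `spec.template.spec.containers[*].resources.requests`:)

    $ oc get deployments -n openshift-marketplace \
        -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[*].resources.requests}{"\n"}{end}'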

Comment 11 Salvatore Colangelo 2019-08-28 16:32:27 UTC
Hi Kevin,

You are right, it was probably the wrong cluster, sorry!

Below are the steps:


[scolange@scolange ]$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.ci-2019-08-28-103038   True        False         153m    Cluster version is 4.2.0-0.ci-2019-08-28-103038

[scolange@scolange ]$ oc get pods -n openshift-marketplace
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-77c9c6b9c9-znx47    1/1     Running   0          165m
community-operators-d5cb7dbf4-hgbzp     1/1     Running   0          165m
marketplace-operator-65d498f785-kvbqp   1/1     Running   0          165m
redhat-operators-66fdd79ff5-4h8mm       1/1     Running   0          104m

[scolange@scolange ]$ oc describe pods -n openshift-marketplace|grep "QoS Class" 
QoS Class:       Burstable
QoS Class:       Burstable
QoS Class:       Burstable
QoS Class:       Burstable

[scolange@scolange ]$ oc describe pods -n openshift-marketplace | grep "Requests"
    Requests:
    Requests:
    Requests:
    Requests:
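
(The grep above matches only the `Requests:` header line; `-A 2`, as used in comment 6, would show the values. A jsonpath query over the pods prints them directly, e.g.:)

    $ oc get pods -n openshift-marketplace \
        -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].resources.requests}{"\n"}{end}'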

Comment 12 Seth Jennings 2019-08-28 22:23:35 UTC
I updated openshift-tests to stop excluding openshift-marketplace, and indeed it passes.

Closing the gate to prevent regressions:
https://github.com/openshift/origin/pull/23690

Comment 15 Kevin Rizza 2019-08-29 14:42:16 UTC
Based on your comments (and the fact that I deployed a 4.2 cluster today and saw the same behavior), I am going to mark this as verified.

Comment 17 errata-xmlrpc 2019-10-16 06:28:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922