Bug 1711070 - OLM/OperatorHub components should not run in BestEffort QoS
Summary: OLM/OperatorHub components should not run in BestEffort QoS
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Kevin Rizza
QA Contact: Salvatore Colangelo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-16 21:14 UTC by Seth Jennings
Modified: 2019-10-16 06:29 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:28:56 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:29:11 UTC)

Description Seth Jennings 2019-05-16 21:14:21 UTC
The following pods run in the BestEffort QoS class with no resource requests:

openshift-operator-lifecycle-manager/catalog-operator
openshift-operator-lifecycle-manager/olm-operator
openshift-operator-lifecycle-manager/olm-operators
openshift-operator-lifecycle-manager/packageserver

https://github.com/openshift/origin/pull/22787

This can cause eviction, OOMKilling, and CPU starvation.

Please add the following resource requests to the pods in this component:

Memory:
olm-operator     160Mi
catalog-operator 80Mi
olm-operators    50Mi
packageserver    50Mi

CPU:
10m for all
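
(For illustration only -- a sketch, not the actual fix, which belongs in the operator/CVO manifests that own these deployments and would likely be reverted if applied by hand: requests like the above could be set on a deployment with `oc set resources`. Setting requests without limits puts the pods in the Burstable QoS class rather than Guaranteed.)

$ oc set resources deploy/olm-operator -n openshift-operator-lifecycle-manager \
    --requests=cpu=10m,memory=160Mi
$ oc set resources deploy/catalog-operator -n openshift-operator-lifecycle-manager \
    --requests=cpu=10m,memory=80Mi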

Comment 1 Seth Jennings 2019-05-16 21:42:01 UTC
Additionally

openshift-marketplace/certified-operators
openshift-marketplace/community-operators
openshift-marketplace/marketplace-operator
openshift-marketplace/redhat-operators

Memory:
certified-operators 80Mi
community-operators 80Mi
marketplace-operator 50Mi
redhat-operators 50Mi

CPU:
10m for all

Comment 2 Seth Jennings 2019-06-05 13:33:06 UTC
Could I get some feedback on this bug?

This effort targets 4.2 and blocks this story in Jira https://jira.coreos.com/browse/POD-144

These are the last components to bring into compliance.

Comment 3 Evan Cordell 2019-07-12 15:47:52 UTC
https://jira.coreos.com/browse/OLM-1130 is slated for this sprint, so it should be done soon.

Comment 6 Jian Zhang 2019-07-23 02:41:15 UTC
Looks good for the OLM component; the `QoS Class` of the pods is `Burstable` now.

mac:~ jianzhang$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-548956f758-vhldk   1/1     Running   0          24h
olm-operator-85f7475cf-kqb49        1/1     Running   0          23h
packageserver-7c6b67fc64-sh8td      1/1     Running   0          5h44m
packageserver-7c6b67fc64-z7mhq      1/1     Running   0          5h44m

mac:~ jianzhang$ oc describe pods |grep Requests: -A 2 
    Requests:
      cpu:      10m
      memory:   80Mi
--
    Requests:
      cpu:      10m
      memory:   160Mi
--
    Requests:
      cpu:        10m
      memory:     50Mi
--
    Requests:
      cpu:        10m
      memory:     50Mi

mac:~ jianzhang$ oc describe pods |grep "QoS Class" 
QoS Class:       Burstable
QoS Class:       Burstable
QoS Class:       Burstable
QoS Class:       Burstable

But for the marketplace part, the pods still use `BestEffort` and no requests are specified. Changing the status to `ASSIGNED` and moving on to the marketplace sub-component.
mac:~ jianzhang$ oc describe pods -n openshift-marketplace | grep "Requests"
mac:~ jianzhang$ oc describe pods -n openshift-marketplace | grep "QoS Class"
QoS Class:       BestEffort
QoS Class:       BestEffort
QoS Class:       BestEffort
QoS Class:       BestEffort
QoS Class:       BestEffort
QoS Class:       BestEffort
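
(For reference, a custom-columns query -- a sketch, not the command actually run above -- lists the QoS class per pod by name, which the grep output does not show:)

$ oc get pods -n openshift-marketplace \
    -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass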

Comment 7 Kevin Rizza 2019-07-23 12:36:01 UTC
(In reply to Seth Jennings from comment #1)
> Additionally
> 
> openshift-marketplace/certified-operators
> openshift-marketplace/community-operators
> openshift-marketplace/marketplace-operator
> openshift-marketplace/redhat-operators
> 
> Memory:
> certified-operators 80Mi
> community-operators 80Mi
> marketplace-operator 50Mi
> redhat-operators 50Mi
> 
> CPU:
> 10m for all

I'm not sure this is a good way to handle these. We can certainly add these constraints for the openshift-marketplace/marketplace-operator pod, but the other three are operands that happen to be created by default, and their needs will change over time as the content they host changes. In fact, the suggestion that certified-operators and community-operators should request different resources implies exactly that: those pods are identical aside from the external content they are serving.

How were these numbers obtained? Do you have a better suggestion than using BestEffort on these pods, given that they will definitely have different resource constraints depending on what content they are serving from Quay?

Comment 9 Salvatore Colangelo 2019-08-28 13:01:12 UTC
[scolange@scolange operators]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-27-072819


Now we have:

[scolange@scolange operators]$ oc describe pods community-operators-69ff689f5d-rwpwk -n openshift-marketplace | grep "QoS Class"
QoS Class:       BestEffort

[scolange@scolange operators]$ oc describe pods certified-operators-64bc446dcf-9r4zb -n openshift-marketplace | grep "QoS Class"
QoS Class:       BestEffort


[scolange@scolange operators]$ oc describe pods redhat-operators-7b9cd9c994-9cqts -n openshift-marketplace | grep "QoS Class"
QoS Class:       Burstable
[scolange@scolange operators]$ oc describe pods marketplace-operator-df8d68d67-xddq9 -n openshift-marketplace | grep "QoS Class"
QoS Class:       Burstable

I think they should be set to Burstable to handle any changes in the future. What do you think?

Comment 10 Kevin Rizza 2019-08-28 13:33:13 UTC
Hi Salvatore,

I'm not totally sure how that is possible, given that the pods `community-operators-*`, `certified-operators-*`, and `redhat-operators-*` are generated in code in exactly the same way.

Can you give me some context on this cluster? Is it an upgrade from a previous version? Can you try killing the pods and recreating them, then checking again?

Can you also try to get all of the deployments to see if spec.resources.requests is specified in all of them, and what the values are if so?

Thanks!
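
(One way to check this -- a sketch; note that in a Deployment the requests actually live under spec.template.spec.containers[*].resources.requests:)

$ oc get deployments -n openshift-marketplace \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[*].resources.requests}{"\n"}{end}'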

Comment 11 Salvatore Colangelo 2019-08-28 16:32:27 UTC
Hi Kevin,

you are right, it was probably the wrong cluster, sorry!

Below are the steps:


[scolange@scolange ]$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.ci-2019-08-28-103038   True        False         153m    Cluster version is 4.2.0-0.ci-2019-08-28-103038

[scolange@scolange ]$ oc get pods -n openshift-marketplace
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-77c9c6b9c9-znx47    1/1     Running   0          165m
community-operators-d5cb7dbf4-hgbzp     1/1     Running   0          165m
marketplace-operator-65d498f785-kvbqp   1/1     Running   0          165m
redhat-operators-66fdd79ff5-4h8mm       1/1     Running   0          104m



[scolange@scolange ]$ oc describe pods -n openshift-marketplace|grep "QoS Class" 
QoS Class:       Burstable
QoS Class:       Burstable
QoS Class:       Burstable
QoS Class:       Burstable

[scolange@scolange ]$ oc describe pods -n openshift-marketplace | grep "Requests"
    Requests:
    Requests:
    Requests:
    Requests:

Comment 12 Seth Jennings 2019-08-28 22:23:35 UTC
I updated openshift-tests to not exclude openshift-marketplace, and it does indeed pass.

Closing the gate to prevent regression
https://github.com/openshift/origin/pull/23690

Comment 15 Kevin Rizza 2019-08-29 14:42:16 UTC
Based on your comments (and the fact that I deployed a 4.2 cluster today and saw the same behavior), I am going to mark this as verified.

Comment 17 errata-xmlrpc 2019-10-16 06:28:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

