Bug 1836359

Summary: Update OSD requests and limits
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Kyle Bader <kbader>
Component: ocs-operator
Assignee: Jose A. Rivera <jarrpa>
Status: CLOSED ERRATA
QA Contact: krishnaram Karthick <kramdoss>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.5
CC: ebenahar, ekuric, jcall, madam, mbukatov, ocs-bugs, owasserm, sostapov
Target Milestone: ---
Keywords: AutomationBackLog, FutureFeature
Target Release: OCS 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-09-15 10:17:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 2 Yaniv Kaul 2020-06-24 08:34:59 UTC
This looks like an important item for performance/sizing. What's the status?

Comment 3 Jose A. Rivera 2020-06-29 15:26:14 UTC
I guess we can do this for OCS 4.5, though there's the question of whether we want to have OSDs with Guaranteed QoS or not (see https://bugzilla.redhat.com/show_bug.cgi?id=1781785). If so, we'd have to make sure the CPU and memory limits match the requests. Kyle?
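
For reference, the Kubernetes rule behind the Guaranteed QoS question can be sketched roughly as below. This is a simplified, single-container illustration and not ocs-operator code: the function name `qos_class`, the dict layout, and the sample values are assumptions made for the example, and quantities are compared as plain strings rather than parsed.

```python
def qos_class(requests, limits):
    """Approximate the Kubernetes QoS class for a single container.

    Simplified sketch: real pods are classified across all of their
    containers, and quantities are parsed rather than compared as strings.
    """
    resources = ("cpu", "memory")

    # No requests and no limits at all: BestEffort.
    if not any(r in requests or r in limits for r in resources):
        return "BestEffort"

    # Kubernetes defaults a missing request to the limit for that resource.
    effective_requests = {r: requests.get(r, limits.get(r)) for r in resources}

    # Guaranteed needs a limit for both CPU and memory, each equal to its request.
    guaranteed = all(
        r in limits and effective_requests[r] == limits[r] for r in resources
    )
    return "Guaranteed" if guaranteed else "Burstable"


# Hypothetical sample values, chosen only to exercise each branch.
print(qos_class({"cpu": "1", "memory": "4Gi"}, {"cpu": "1", "memory": "4Gi"}))
print(qos_class({"cpu": "1", "memory": "4Gi"}, {"cpu": "2", "memory": "4Gi"}))
print(qos_class({}, {}))
```

The second call shows why a CPU limit above the CPU request would leave the pod Burstable instead of Guaranteed.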

Comment 4 Orit Wasserman 2020-06-30 08:01:02 UTC
Let's make it Guaranteed QoS:
Memory 5G
CPU: 2

Users who want more performance will need to change the CPU requests/limits in the CRD.
We can improve it in 4.6.
Kyle, does this work for you?
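
As an illustration of this proposal, the sketch below builds a Kubernetes-style resources mapping in which requests equal limits, using the values from this comment. The helper name `guaranteed_resources` is hypothetical, the memory value is copied as written here ("5G"), and where such a mapping lives in the CRD is not shown in this bug, so it is left out.

```python
import json


def guaranteed_resources(cpu, memory):
    """Return a resources mapping whose requests equal its limits.

    Hypothetical helper for illustration only; not taken from ocs-operator.
    """
    limits = {"cpu": cpu, "memory": memory}
    # Copy the limits so that requests and limits are separate dicts.
    return {"requests": dict(limits), "limits": limits}


# Values proposed in this comment: CPU 2, memory 5G (as written).
osd_resources = guaranteed_resources(cpu="2", memory="5G")
print(json.dumps(osd_resources, indent=2, sort_keys=True))

# A user who wants more CPU would raise the request and the limit together,
# which keeps the two equal. The value "4" is an arbitrary example.
tuned = guaranteed_resources(cpu="4", memory="5G")
print(tuned["requests"] == tuned["limits"])
```

Raising only the limit, or only the request, would break the equality that Guaranteed QoS depends on, which is why the helper takes a single value per resource.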

Comment 5 Michael Adam 2020-06-30 21:39:21 UTC
There was a PR for this (https://github.com/openshift/ocs-operator/pull/521) which has been closed again.
It was raised against release-4.5, and at least to my mind the suggested changes seemed to come a bit out of the blue.
But the performance test results mentioned in the description of this BZ give a good indication.

It seems that we first need a bit more discussion about changing these defaults.
And the patch needs to be done against master first.
If we can reach consensus and explain the changes in the PR, I agree that we could take it into 4.5.

Comment 6 Michael Adam 2020-06-30 21:47:22 UTC
I think as far as robustness is concerned, this is also somewhat related to the topic of adding priority classes: 

https://bugzilla.redhat.com/show_bug.cgi?id=1776876

Comment 7 Jose A. Rivera 2020-07-01 13:53:12 UTC
We don't have time to resolve this before this Thursday. If we're going to take it in OCS 4.5 we'll need an exception.

Comment 8 Kyle Bader 2020-07-01 16:41:14 UTC
I think Orit's proposal is a great compromise, short of doing something dynamically.

Comment 9 Michael Adam 2020-07-01 16:42:56 UTC
(In reply to Jose A. Rivera from comment #7)
> We don't have time to resolve this before this Thursday. If we're going to
> take it in OCS 4.5 we'll need an exception.

My point was that *if* we reach consensus, then it'll be easy to include it in 4.5 :-)

(In reply to Kyle Bader from comment #8)
> I think Orit's proposal is a great compromise, short of doing something
> dynamically.

It seems we got the consensus now :-D

Comment 10 Jose A. Rivera 2020-07-01 16:48:02 UTC
PR is up: https://github.com/openshift/ocs-operator/pull/597

It should also be noted that this will impact this Jira: https://issues.redhat.com/browse/RHSTOR-967

Comment 11 Elad 2020-07-01 17:27:58 UTC
Will require full regression testing, with an emphasis on performance, for verification.

Comment 12 Jose A. Rivera 2020-07-01 19:31:22 UTC
PR merged.

Comment 13 Michael Adam 2020-07-02 06:49:10 UTC
bot not working... adding missing ACKs.


POST is the correct status - that PR was the master PR

Comment 15 Michael Adam 2020-07-02 06:53:38 UTC
backport PR https://github.com/openshift/ocs-operator/pull/599

Comment 16 Michael Adam 2020-07-02 07:17:56 UTC
d/s patch merged

Comment 19 krishnaram Karthick 2020-08-18 13:38:30 UTC
With the 4.5.0-54.ci build, no memory- or CPU-related issues were seen in the performance automation run.
Also, tests around heavy IO + OSD failures were run. Failure and recovery were seamless.

Build used to verify:

oc get csv -n openshift-storage
NAME                        DISPLAY                       VERSION       REPLACES   PHASE
ocs-operator.v4.5.0-54.ci   OpenShift Container Storage   4.5.0-54.ci              Succeeded

I'll wait for tier1 & scale test automation analysis to ensure no issues are seen around this change before moving this bug to verified.

Comment 20 krishnaram Karthick 2020-08-21 03:46:49 UTC
No issues were seen around OCS related resources with other automation tiers. Moving the bug to verified.

Scale analysis thread: http://post-office.corp.redhat.com/archives/ocs-ci/2020-August/msg00456.html
Performance analysis thread: http://post-office.corp.redhat.com/archives/ocs-ci/2020-August/msg00427.html
tier1 analysis thread: http://post-office.corp.redhat.com/archives/ocs-ci/2020-August/msg00425.html


Comment 22 errata-xmlrpc 2020-09-15 10:17:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754