Bug 1827157

Summary: OSD hitting default CPU limit on AWS i3en.2xlarge instances limiting performance
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Manoj Pillai <mpillai>
Component: ocs-operator
Assignee: Jose A. Rivera <jarrpa>
Status: CLOSED ERRATA
QA Contact: krishnaram Karthick <kramdoss>
Severity: high
Priority: unspecified
Version: 4.3
CC: ebenahar, ekuric, jarrpa, kramdoss, madam, mbukatov, muagarwa, ocs-bugs, owasserm, sostapov
Target Milestone: ---
Keywords: AutomationBackLog, Performance
Target Release: OCS 4.6.0
Hardware: Unspecified
OS: Linux
Doc Type: No Doc Update
Last Closed: 2020-12-17 06:22:30 UTC
Type: Bug

Description Manoj Pillai 2020-04-23 11:29:15 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Deploying OCS on 3 AWS i3en.2xlarge instances according to our documentation in:

https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.3/html-single/deploying_openshift_container_storage/index#installing-openshift-container-storage-using-local-storage-devices_rhocs

The cluster is created using only one of the local disks on each instance, so the resulting OCS cluster has 3 nodes and 3 OSDs.

I ran an fio random read test and monitored CPU usage from the OCP Grafana dashboard, as well as via top output collected as described here: https://github.com/manojtpillai/kubuculum/blob/b3e4007c41923132911177b67b617bafaf93e2c9/roles/stats_sysstat/templates/sysstat_daemonset.j2#L57
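For reference, a random read test of this kind can be sketched as an fio job file like the one below. This is an illustrative assumption, not the exact job used in the report: block size, queue depth, file size, and runtime were not stated in the bug.

```ini
; Hypothetical fio job approximating the random read test described above.
; All parameter values here are assumed, not taken from the report.
[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=32
time_based=1
runtime=300

[randread-job]
numjobs=4
size=10g
```

Run against a file on a CephFS or RBD-backed mount, a job like this keeps the OSDs busy enough to expose CPU throttling at the pod limit.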

The load is not evenly distributed across OSDs (see https://bugzilla.redhat.com/show_bug.cgi?id=1790500), and one of the OSDs is hitting the default CPU limit of 2:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 172302 167       20   0 4375496   3.1g  35456 S 199.6  4.9  34:57.29 ceph-osd
 306851 root      20   0       0      0      0 I  11.7  0.0   0:10.21 kworker/+
 271313 root      20   0       0      0      0 R  11.6  0.0   0:15.32 kworker/+

The other 2 OSDs are below the limit:
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 230370 167       20   0 4327276   2.9g  35332 S 146.7  4.7  31:28.87 ceph-osd

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 137736 167       20   0 4171328   3.1g  35340 S 160.9  4.9  33:10.54 ceph-osd

When the OSD CPU limit is increased to 3 in the CR, the change is applied, and the test is rerun, all 3 OSDs cross the old limit of 2:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 429126 167       20   0 4465464   3.0g  34664 S 219.8  4.9  17:10.58 ceph-osd

 243448 167       20   0 4333228   3.1g  34084 S 237.2  5.0  18:14.66 ceph-osd

 322644 167       20   0 4470988   3.1g  35400 S 280.6  5.0  20:21.99 ceph-osd

And the most overloaded OSD gets close to the new CPU limit of 3. The same observations are also visible in the Grafana dashboard.
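The CR change described above can be sketched as a resource override on the StorageCluster's storage device sets. This is an illustrative fragment, not the exact CR from this cluster; field names and values (device set name, memory sizes) are assumptions and may vary across OCS versions:

```yaml
# Hypothetical StorageCluster fragment raising the OSD CPU limit to 3.
# Names and memory values are illustrative, not from this report.
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  storageDeviceSets:
  - name: ocs-deviceset
    count: 3
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        cpu: "3"        # raised from the default of 2
        memory: 8Gi
```

Applying an edit like this causes the operator to reconcile the OSD pods with the new CPU limit, after which the test can be rerun.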

What does this translate to in terms of performance seen via fio? On the fio random read test, IOPS increased from 42.3k to 69.3k. That's a whopping ~64% improvement.
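As a quick sanity check on the arithmetic (a throwaway calculation, not part of the original report), the improvement works out to roughly 64%:

```python
# Relative IOPS improvement from raising the OSD CPU limit from 2 to 3,
# using the fio random read figures reported above.
iops_limit_2 = 42.3e3  # IOPS with the default CPU limit of 2
iops_limit_3 = 69.3e3  # IOPS with the CPU limit raised to 3

improvement_pct = (iops_limit_3 - iops_limit_2) / iops_limit_2 * 100
print(f"{improvement_pct:.1f}%")  # prints "63.8%", i.e. roughly 64%
```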


Version of all relevant components (if applicable):

OCS 4.3 installed via UI

LSO:
NAME                                         DISPLAY         VERSION               REPLACES   PHASE
local-storage-operator.4.3.13-202004131016   Local Storage   4.3.13-202004131016              Succeeded

Comment 2 Manoj Pillai 2020-04-23 12:07:24 UTC
(In reply to Manoj Pillai from comment #0)

Repeating the same comparison on an fio random write test, I get a ~56% improvement in IOPS going from OSDs with a CPU limit of 2 to OSDs with a CPU limit of 3. CPU usage while the test is running with limit 3:


    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 429126 167       20   0 5466568   2.4g  34692 S 277.8  3.9  41:12.24 ceph-osd
  26172 1000140+  20   0 1551596   1.3g  43684 S  15.9  2.1  22:24.34 promethe+
   1412 root      20   0 2856260 246516  96412 S   8.8  0.4  14:40.98 hyperkube

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 243448 167       20   0 5567764   2.4g  34136 S 284.0  3.9  42:23.24 ceph-osd

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 322644 167       20   0 5639956   2.5g  35392 S 284.4  4.0  45:00.73 ceph-osd

Comment 3 Manoj Pillai 2020-04-23 19:00:21 UTC
The outcome I'm looking for here is that the OSD CPU limit picked by OCS should not leave something on the table as far as performance is concerned. A single static limit will probably not work well given the range of devices and configurations OCS needs to handle.

Comment 4 Yaniv Kaul 2020-04-24 08:36:38 UTC
(In reply to Manoj Pillai from comment #3)
> The outcome I'm looking for here is that the OSD CPU limit picked by OCS
> should not leave something on the table as far as performance is concerned.
> A single static limit will probably not work well given the range of devices
> and configurations OCS needs to handle.

The other way around: we should limit the range of devices and configurations we support (ESPECIALLY in the cloud!) and be opinionated about the deployment, not leaving it to the user to choose whatever they feel may be the correct settings, as they are likely clueless about it.
We can and should consider 2 deployment options:
1. Converged and dedicated - are the OCS nodes running also app workloads, or are they dedicated to OCS pods?
2. OSD-only nodes vs. OCS nodes - do we want to run the control plane pods (MGR, MONs, RGWs to some extent?, MDS, Noobaa, ...?) on other workers, and have some nodes dedicated to OSDs only.

This should give us a large enough matrix of options, and add to it the different possible machine types (just the M5 4xlarge and i3en.2xlarge...) and you get enough options, I think, for deployment.

An alternative to this large matrix is 'performance' and 'balanced' profiles for the OSDs perhaps.

Comment 5 Jose A. Rivera 2020-04-24 13:49:47 UTC
In general, we are only able to provide static resource limits; they need to be manually changed. And we don't want to introduce the ability to arbitrarily change these values, as that puts them at high risk of not being supported. Having an option of various tunings for performance vs. resource consumption may be viable.

However, since nothing is crashing and no data is being lost, this is not a blocker and not something we need to consider right away. Moving this to OCS 4.6.

Comment 6 Manoj Pillai 2020-04-27 13:28:00 UTC
(In reply to Yaniv Kaul from comment #4)
> (In reply to Manoj Pillai from comment #3)
> > The outcome I'm looking for here is that the OSD CPU limit picked by OCS
> > should not leave something on the table as far as performance is concerned.
> > A single static limit will probably not work well given the range of devices
> > and configurations OCS needs to handle.
> 
> The other way around: we should limit the range of devices and configuration
> we support (ESPECIALLY in the cloud!) and be opinionated about the
> deployment, not leaving it to the user to choose
> whatever they feel may be the correct settings, as they are likely clueless
> about it.

Agreed, it should not be left to the user to choose the correct settings. OCS should choose, and that choice should be based on some knowledge of the configuration. An operator after all is expected to encode the knowledge of a smart admin.

> We can and should consider 2 deployment options:
> 1. Converged and dedicated - are the OCS nodes running also app workloads,
> or are they dedicated to OCS pods?
> 2. OSD-only nodes vs. OCS nodes - do we want to run the control plane pods
> (MGR, MONs, RGWs to some extent?, MDS, Noobaa, ...?) on other workers, and
> have some nodes dedicated to OSDs only.
> 
> This should give us a large enough matrix of options, and add to it the
> different possible machine types (just the M5 4xlarge and i3en.2xlarge...)
> and you get enough options, I think, for deployment.

AFAICT, our docs currently specify minimum requirements for OCS nodes; they do not enumerate the supported instance types. Enumerating the supported instances would probably make this specific problem somewhat simpler. Not sure if you're saying we should do that.

> 
> An alternative to this large matrix is 'performance' and 'balanced' profiles
> for the OSDs perhaps.

Is any of the above work in progress, or does it need to start?

Comment 8 Jose A. Rivera 2020-10-09 13:53:35 UTC
(In reply to Manoj Pillai from comment #6)
> (In reply to Yaniv Kaul from comment #4)
> > 
> > An alternative to this large matrix is 'performance' and 'balanced' profiles
> > for the OSDs perhaps.
> 
> Is any of the above work-in-progress or it needs to start?


This is not being worked on right now, and it is not on our radar for the immediate future. As such, moving this to OCS 4.7.

Comment 9 Jose A. Rivera 2020-10-09 13:56:46 UTC
My mistake, we have a JIRA for this already: https://issues.redhat.com/browse/KNIP-1472

As such, bringing it back to OCS 4.6 and moving it to MODIFIED.

Comment 14 krishnaram Karthick 2020-11-19 14:23:49 UTC
(In reply to Jose A. Rivera from comment #9)
> My mistake, we have a JIRA for this already:
> https://issues.redhat.com/browse/KNIP-1472
> 
> As such, bringing it back to OCS 4.6 and moving it to MODIFIED.

I don't understand how KNIP-1472 provides a complete solution for what is asked here. Can you please help me understand how we are addressing the 'performance' part with KNIP-1472?

IIUC, KNIP-1472 provides a way to deploy OCS with fewer resources (for entry-level deployments): lower CPU requests, lower performance. But do we have a way to give more CPU to OCS deployments that need better performance?

Comment 15 krishnaram Karthick 2020-11-23 04:20:18 UTC
Based on comment #14, moving this bug to ASSIGNED, as I believe there is still some work left.

Comment 16 Mudit Agarwal 2020-11-23 11:51:38 UTC
This is not a release blocker; moving it out of 4.6 until we have clarity.

Comment 17 Yaniv Kaul 2020-11-23 12:40:28 UTC
(In reply to krishnaram Karthick from comment #15)
> Based on comment#14 moving this bug to assigned as I believe there is still
> some work left.

Doesn't the same (manual) way to deploy OCS with fewer resources also allow you to deploy it with more resources?

Comment 18 krishnaram Karthick 2020-11-23 13:27:56 UTC
(In reply to Yaniv Kaul from comment #17)
> (In reply to krishnaram Karthick from comment #15)
> > Based on comment#14 moving this bug to assigned as I believe there is still
> > some work left.
> 
> Isn't the same (manual) way to deploy OCS with fewer resources allows you to
> also deploy it with more resources?

Yes, but I'd at least like to see the recommended values for someone who expects the best performance out of OCS. Ideally, we want our operator to handle all of this, as Manoj has requested in comment #6.

Comment 19 Yaniv Kaul 2020-11-23 13:31:18 UTC
(In reply to krishnaram Karthick from comment #18)
> (In reply to Yaniv Kaul from comment #17)
> > (In reply to krishnaram Karthick from comment #15)
> > > Based on comment#14 moving this bug to assigned as I believe there is still
> > > some work left.
> > 
> > Isn't the same (manual) way to deploy OCS with fewer resources allows you to
> > also deploy it with more resources?
> 
> Yes, but I'd at least like to see the recommended values if someone expects
> the best performance out of OCS. We ideally want our operator to handle all
> of this as Manoj has requested in comment#6

I'd argue that this is a default work item that we need to complete - the operator doesn't know right now what the user wants - a dedicated node for storage, consume all resources, have all (OSDs, MDS, Noobaa) on the nodes, or just OSDs, etc.

We'll enable some kind of simple configuration 'profiles' in the future, and those will make the decision. For now, let's go with manual command-line settings. Please ensure that works.

Comment 20 krishnaram Karthick 2020-11-23 13:46:47 UTC
(In reply to Yaniv Kaul from comment #19)
> (In reply to krishnaram Karthick from comment #18)
> > (In reply to Yaniv Kaul from comment #17)
> > > (In reply to krishnaram Karthick from comment #15)
> > > > Based on comment#14 moving this bug to assigned as I believe there is still
> > > > some work left.
> > > 
> > > Isn't the same (manual) way to deploy OCS with fewer resources allows you to
> > > also deploy it with more resources?
> > 
> > Yes, but I'd at least like to see the recommended values if someone expects
> > the best performance out of OCS. We ideally want our operator to handle all
> > of this as Manoj has requested in comment#6
> 
> I'd argue that this is a default work item that we need to complete - the
> operator doesn't know right now what the user wants - a dedicated node for
> storage, consume all resources, have all (OSDs, MDS, Noobaa) on the nodes,
> or just OSDs, etc.
> 
> We'll enable some kind of simple configuration 'profiles' in the future, and
> those will make the decision. Right now, let's go with the command line
> manual settings. Please ensure it works.

Manoj has already tried this as part of the original test, i.e., by setting the OSD CPU limit to 3.
What we don't have for the 'performance' profile is a recommendation for CPU & RAM configurations like we have for the 'balanced' profile in KNIP-1472.

Comment 21 Manoj Pillai 2020-11-23 13:59:27 UTC
(In reply to krishnaram Karthick from comment #20)
> (In reply to Yaniv Kaul from comment #19)

> > 
> > We'll enable some kind of simple configuration 'profiles' in the future, and
> > those will make the decision. Right now, let's go with the command line
> > manual settings. Please ensure it works.
> 
> Manoj has already tried this as part of the original test. i.e., by trying
> to have the CPU request set to 3. 
> What we don't have for the 'performance' profile is a recommendation for CPU
> & RAM configurations like we have for the 'balanced' profile in KNIP-1472.

See Also: https://bugzilla.redhat.com/show_bug.cgi?id=1828883#c8

In that case, a CPU limit of 5 was roughly the right setting for a 'performance profile'. Hopefully, you can build on that.

Comment 23 errata-xmlrpc 2020-12-17 06:22:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

Comment 24 Red Hat Bugzilla 2023-09-14 05:55:48 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days