Bug 1848907 - OCS 4.4 on Azure: Random I/O performance on RBD/CEPHfs is below expectations
Summary: OCS 4.4 on Azure: Random I/O performance on RBD/CEPHfs is below expectations
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Greg Farnum
QA Contact: Yuli Persky
URL:
Whiteboard:
Depends On: 1873161 1928197
Blocks: 1797475
 
Reported: 2020-06-19 08:38 UTC by Shekhar Berry
Modified: 2023-09-15 00:32 UTC
CC List: 15 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-04 07:16:10 UTC
Embargoed:




Links
GitHub: rook/rook issue 6153 (closed): Ability to set 'fast' deviceClass (tuneFastDeviceClass). Last updated 2021-02-16 12:21:45 UTC

Description Shekhar Berry 2020-06-19 08:38:09 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

On an OCS 4.4 setup which was hosted on Microsoft Azure Platform, performance analysis was done for both RBD and CEPHfs volume types.

The Flexible I/O tester (fio) was used to measure both sequential and random read/write performance.

The instance type used on Azure is D16s_v3, which is capable of delivering 32,000 IOPS or 256 MBps of throughput. The disk used in both cases was a P40 premium SSD, which can deliver 7,500 IOPS or 250 MBps if the VM instance type supports it.

Based on these limits, we expected random write performance to be close to 7,500 IOPS, but we are getting random write performance in the range of 3,500 to 3,800 IOPS across various block sizes. This is almost 50% below our expectation of 7,500 IOPS, and it holds for both RBD and CephFS volume types.

For a detailed report of the various performance tests conducted, please refer to this Google document: https://docs.google.com/document/d/1XJPXMcV-DOEcXVKhuxOCSth9fCWfBAhDClRe_K0yTrA/edit#


Version of all relevant components (if applicable):

oc version
Client Version: 4.4.0-0.nightly-2020-06-01-021027
Server Version: 4.4.3
Kubernetes Version: v1.17.1

ceph version
ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)

Steps to Reproduce
1. Setup OCS 4.4 on Azure 
2. Run Random performance tests using kubuculum tool (https://github.com/manojtpillai/kubuculum)
3. See Results


Additional Information

cat jobfile.fiorandwrite 
[randomwrite]
rw=randwrite
ioengine=libaio
direct=1
iodepth=32
time_based=1
runtime=120
directory=/dataset
filename_format=f.$jobnum.$filenum
bs=8k
filesize=32g
numjobs=2
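
For reference, a minimal sketch of how such a job file is typically launched from inside the client pod (assuming fio is installed there and /dataset is the mounted PVC):

# run the job file above and emit machine-readable results
fio --output-format=json jobfile.fiorandwrite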

OCS must-gather can be found here: http://perf1.perf.lab.eng.bos.redhat.com/pub/shberry/OCS_on_Azure/ocs_44_azure_must_gather/

Comment 2 Sahina Bose 2020-06-22 16:32:57 UTC
Orit, who can look at this?

Comment 3 Orit Wasserman 2020-06-23 07:31:01 UTC
(In reply to Sahina Bose from comment #2)
> Orit, who can look at this?

We suspect the network is the issue.
Shekhar, can you run a networking benchmark? We need to understand the network better.
Since cloud providers cap the network, this needs to be a long test, as the capping takes effect only after a while.
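A sketch of the kind of long-running test meant here, assuming iperf3 (or a similar tool) can be run on two of the OCS nodes or in pods on them; the address and duration are illustrative:

# on the first node/pod
iperf3 -s
# on the second node/pod; run long enough for any provider-side capping to kick in
iperf3 -c <first-node-address> -t 1800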

Comment 4 Shekhar Berry 2020-06-23 08:18:01 UTC
Hi Orit,

Thank you for your reply.

Here are a few of my thoughts, which do point to a network issue:

1) When we scale the number of OSDs from 1 to 2 on the same OCS machines, the write performance improves from 3.5K IOPS to 6.5K IOPS (the expectation then becomes 15K IOPS, since we have 2 OSDs). If we were getting bottlenecked at the VM network, we wouldn't have seen a jump in IOPS with more OSDs, IMO.

2) In the environment where we have only 1 OSD per VM, we see read performance of 35K IOPS. If we were network-bound, we would probably have seen much lower read performance.

3) I moved to a larger instance type (D32s_v3) and tested random RBD write performance, and we were still getting results in the same range, around 3.5K-3.8K IOPS, with one 2TB OSD per VM.

Please let me know your thoughts on this.

Also, I am sorry, I no longer have the setup available to run further tests at this point.

--Shekhar

Comment 5 Shekhar Berry 2020-06-23 08:19:30 UTC
(In reply to Shekhar Berry from comment #4)
> Hi Orit,
> 
> Thank you for your reply.
> 
> Here are few of my thoughts which does point to a network issue:

Sorry for typo. I meant "which does not point to a network issue:"

> 
> 1) When we scale the number of OSDs from 1 to 2 on the same OCS machines,
> the write performance improves from 3.5K IOPS to 6.5K IOPS (Again the
> expectation now was 15K IOPS, since we have 2 OSDs now). If we were getting
> bottle necked at VM network we wouldn't have seen jump in IOPS with
> increased OSDs, IMO.
> 
> 2) On the environment where we have only 1 OSD per VM we see read
> performance of 35K IOPS. If we were getting bound by network we should have
> probably seen much lower Read performance.
> 
> 3) I went to a higher instance type (D32s_v3) and tested the random RBD
> write performance and we were still getting it in the same range around
> 3.5K/3.8K with 1 2TB OSD per VM. 
> 
> Please let me know your thoughts on this.
> 
> Also I am sorry, I don't have the setup with me anymore to run further test
> at this point of time.
> 
> --Shekhar

Comment 6 Yaniv Kaul 2020-06-25 12:29:42 UTC
(In reply to Shekhar Berry from comment #4)
> Hi Orit,
> 
> Thank you for your reply.
> 
> Here are few of my thoughts which does point to a network issue:
> 
...

> 2) On the environment where we have only 1 OSD per VM we see read
> performance of 35K IOPS. If we were getting bound by network we should have
> probably seen much lower Read performance.

Unlikely - I don't see why read would use the same amount of network as writes.

Comment 7 Josh Durgin 2020-06-25 21:43:25 UTC
Moving to 4.6 since this is an investigation with no cluster available now.

Comment 8 Manoj Pillai 2020-09-08 13:15:39 UTC
[I'm looking at Azure performance on a setup loaned from Shekhar]

The advertised limits of the instance/disk are captured in comment #0. I'll repeat them here for clarity, with a focus on IOPS since we are dealing with random I/O:

The instance has read caching enabled with a cache size of 400GiB. It can deliver up to 32K IOPS. The disk type can deliver 7.5K IOPS. That gives us the following targets for different workloads:

1. random read that is cache-friendly (data set size << 400GB): 32K IOPS per instance
2. random read on a data set much larger than 400GB: 7.5K IOPS per instance (or more because of some data being cached).
3. random write: 7.5K IOPS per instance

fio random I/O tests on a single instance support these expectations:

fio test using the managed-premium storage class (and a 2TB PVC) with a 128g data set:

read: IOPS=30.8k
write: IOPS=7611

fio test using the managed-premium storage class (and a 2TB PVC) with a 960g data set:

read: IOPS=12.3k
write: IOPS=7631

So far so good. For a 3-node OCS setup with 1 managed premium disk per node, and with the default 3x replication (each client write lands on all three disks, so the per-disk write limit is also the cluster-wide client write limit), that gives us the following targets:

128g data set: approx. 96K IOPS on random read and 7.5K IOPS on random write.
960g data set: approx. 22.5K IOPS on random read and 7.5K IOPS on random write.

The results from the fio tests are far below this:

fio (single instance) test using the ocs-storagecluster-ceph-rbd storage class with a 128g data set:

read: IOPS=19.9k
write: IOPS=3121

So random read is giving about 20K instead of 96K; random write is giving about 3K instead of 7.5K.

fio (single instance) test using the ocs-storagecluster-ceph-rbd storage class with a 960g data set:

read: IOPS=3545
write: IOPS=2269

So random read is giving 3.5K instead of 22.5K; random write is giving 2.2K instead of 7.5K.


Let's focus on the cache-unfriendly random read test (the one with 960g data set). Why is it giving 3.5K instead of 22.5K? Can network be the limiting factor? I don't think so, because the same network supported 19.9K in the cache-friendly test (the one with 128g data set).

This data strongly points to some bottleneck at the OSD that is preventing us from hitting the expected IOPS targets. I'm hoping to follow up with some more analysis.
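
For reference, a quick way to watch for the suspected OSD-side bottleneck during a run is to check OSD pod CPU usage; a sketch, assuming cluster metrics are available and the usual Rook label on OSD pods:

oc adm top pods -n openshift-storage -l app=rook-ceph-osd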

Comment 9 Yaniv Kaul 2020-09-08 14:00:11 UTC
There's a different bug that Ceph identifies the Azure disks as HDDs and not SSDs, could that explain this?

Comment 10 Manoj Pillai 2020-09-08 14:50:00 UTC
(In reply to Yaniv Kaul from comment #9)
> There's a different bug that Ceph identifies the Azure disks as HDDs and not
> SSDs, could that explain this?

Yes, that's bz #1873161. That's the prime suspect at the moment.

We know that Bluestore has different policies for HDDs vs SSDs, e.g. choice of minimum allocation size is different for HDDs vs SSDs. We are trying to find out if there are other differences as well that might explain the numbers we are seeing here.

Another thing to try is to explicitly indicate that the disks are SSD while setting up OCS. Based on https://bugzilla.redhat.com/show_bug.cgi?id=1873161#c8 it seems that is possible.
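
For reference, a sketch of how to check both views of the device class (run lsblk from a node debug shell and the ceph commands from the rook-ceph toolbox pod); these are standard commands, not specific to this cluster:

# kernel view: ROTA=1 means the device is reported as rotational (HDD)
lsblk -d -o NAME,ROTA
# Ceph view: CLASS column and the device classes currently defined
ceph osd tree
ceph osd crush class ls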

Comment 11 Manoj Pillai 2020-09-08 19:51:20 UTC
(In reply to Manoj Pillai from comment #10)
> (In reply to Yaniv Kaul from comment #9)
> > There's a different bug that Ceph identifies the Azure disks as HDDs and not
> > SSDs, could that explain this?
> 
> Yes, that's bz #1873161 . That's a prime suspect, ATM.
> 
> We know that Bluestore has different policies for HDDs vs SSDs, e.g. choice
> of minimum allocation size is different for HDDs vs SSDs. We are trying to
> find out if there are other differences as well that might explain the
> numbers we are seeing here.

We had this discussion in the Perf sync-up call today. Wrongly detecting SSDs as HDDs can apparently affect performance adversely in a number of ways. 
Josh, can you please list out some of the big ones?

So the focus is squarely on bz #1873161.

Comment 12 Josh Durgin 2020-09-09 13:17:48 UTC
(In reply to Manoj Pillai from comment #11)
> (In reply to Manoj Pillai from comment #10)
> > (In reply to Yaniv Kaul from comment #9)
> > > There's a different bug that Ceph identifies the Azure disks as HDDs and not
> > > SSDs, could that explain this?
> > 
> > Yes, that's bz #1873161 . That's a prime suspect, ATM.
> > 
> > We know that Bluestore has different policies for HDDs vs SSDs, e.g. choice
> > of minimum allocation size is different for HDDs vs SSDs. We are trying to
> > find out if there are other differences as well that might explain the
> > numbers we are seeing here.
> 
> We had this discussion in the Perf sync-up call today. Wrongly detecting
> SSDs as HDDs can apparently affect performance adversely in a number of
> ways. 
> Josh, can you please list out some of the big ones?
> 
> So the focus is squarely on bz #1873161.

There are many options that have different defaults for ssd and hdd - here's a list for nautilus:

https://gist.github.com/jdurgin/cf63bf0ec61bdc9a3ce4e60a5b0c4b30

The most likely to have an effect here are those related to threads/shards/cache/alloc size.

For any of these options, you can override the disk-specific variant by setting the option without an ssd/hdd suffix, e.g. osd_op_num_threads_per_shard=16 would take effect regardless of the disk type.

If you switch to all the ssd settings, the next bottleneck is likely to be the OSD cpu limit, which OCS defaults to a very low 2 or 3 per osd.
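
A sketch of what switching to the ssd settings could look like using the suffix-free override described above (the values shown are illustrative, not recommendations; bluestore_min_alloc_size is only read at OSD creation, so changing it requires redeploying the OSDs):

ceph config set osd osd_op_num_threads_per_shard 2
ceph config set osd osd_op_num_shards 8
ceph config set osd bluestore_min_alloc_size 4096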

Comment 13 Shekhar Berry 2020-09-14 15:32:59 UTC
As mentioned by Manoj in comment 8 above, we have narrowed the performance bottleneck down to the OSD level. For further experiments, I deployed a fresh OCS cluster on Azure and captured the default out-of-the-box random performance baseline:

Default Write IOPS: 3090
Default Read IOPS:  16.1K

As mentioned in comments 9, 10, and 11 above, we felt that the primary reason for the slow performance was bz #1873161 (SSDs being detected with the HDD device class). Based on this assumption, we changed the device class of the recognized devices from HDD to SSD using the commands below:

ceph osd crush rm-device-class osd.0 osd.1 osd.2
ceph osd crush set-device-class ssd osd.0 osd.1 osd.2

But the above change did not change the performance; IOPS remained the same.

Write IOPS after the device class was changed to SSD: 2970
Read IOPS after the device class was changed to SSD:  16K

We next tried to manually set the SSD options in the Ceph config, based on the list provided by Josh in comment 12. We edited the Ceph config and changed the value of osd_op_num_threads_per_shard to 2, 3, and 4 across the various experiments. The command to make this change is:

ceph config set osd osd_op_num_threads_per_shard 2

ceph config dump
WHO                                                 MASK LEVEL    OPTION                             VALUE                              RO 
global                                                   advanced mon_allow_pool_delete              true                                  
global                                                   advanced mon_pg_warn_min_per_osd            0                                     
global                                                   advanced osd_pool_default_pg_autoscale_mode on                                    
global                                                   advanced rbd_default_features               3                                     
  mgr                                                    advanced mgr/balancer/active                true                                  
  mgr                                                    advanced mgr/balancer/mode                  upmap                                 
  mgr                                                    advanced mgr/orchestrator_cli/orchestrator  rook                               *  
  osd                                                    advanced osd_op_num_threads_per_shard       2                                  *  
    mds.ocs-storagecluster-cephfilesystem-a              basic    mds_cache_memory_limit             4294967296                            
    mds.ocs-storagecluster-cephfilesystem-b              basic    mds_cache_memory_limit             4294967296                            
    client.rgw.ocs.storagecluster.cephobjectstore.a      advanced rgw_enable_usage_log               true                                  
    client.rgw.ocs.storagecluster.cephobjectstore.a      advanced rgw_log_nonexistent_bucket         true                                  
    client.rgw.ocs.storagecluster.cephobjectstore.a      advanced rgw_log_object_name_utc            true                                  
    client.rgw.ocs.storagecluster.cephobjectstore.a      advanced rgw_zone                           ocs-storagecluster-cephobjectstore *  
    client.rgw.ocs.storagecluster.cephobjectstore.a      advanced rgw_zonegroup                      ocs-storagecluster-cephobjectstore *  

For some of the experiments we also edited the StorageCluster CR and changed the CPU core count per OSD to 3 or 4. Here are the lines we added to the StorageCluster CR:

resources:
  limits:
    cpu: "3"
    memory: "8Gi"
  requests:
    cpu: "3"
    memory: "4Gi"
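
For context, a hedged sketch of where such a block typically sits in the StorageCluster CR (under spec.storageDeviceSets[].resources; the device-set name and surrounding fields here are illustrative, and exact field placement may differ between OCS versions):

spec:
  storageDeviceSets:
  - name: ocs-deviceset-managed-premium   # illustrative name
    count: 1
    replica: 3
    resources:
      limits:
        cpu: "3"
        memory: "8Gi"
      requests:
        cpu: "3"
        memory: "4Gi"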

===============================================================================================================================================================================================================

Here I will summarize the results of the experiments performed after tuning the setup as described above:

1st Experiment
==============
1 FIO Instance, osd_op_num_threads_per_shard 2, Write IOPS: 3048 
1 FIO Instance, osd_op_num_threads_per_shard 2, Read IOPS:  23.7K

As you can see, read improves from 16.1K to 23.7K when we change osd_op_num_threads_per_shard from the default of 0 (which resolves to the HDD value here) to 2, while write remains the same.

2nd Experiment
==============
1 FIO Instance, osd_op_num_threads_per_shard 3, Write IOPS: 2824 
1 FIO Instance, osd_op_num_threads_per_shard 3, Read IOPS:  23.3K

When osd_op_num_threads_per_shard was increased to 3, the results remained the same as with 2. This pointed to a CPU bottleneck, as one of the OSDs was fully utilizing the entire CPU allotted to it.

3rd Experiment
==============
1 FIO Instance, osd_op_num_threads_per_shard 2, CPU core/OSD is 3 Write IOPS: 3362
1 FIO Instance, osd_op_num_threads_per_shard 2, CPU Core/OSD is 3 Read IOPS:  26.7K

Increasing the CPU count to 3 bumped up the performance slightly and showed that CPU was indeed a limiting factor. The next experiment was to test with osd_op_num_threads_per_shard set to 3 and an increased CPU cores/OSD count.

4th Experiment
==============
1 FIO Instance, osd_op_num_threads_per_shard 3, CPU core/OSD is 3 Write IOPS: 3052
1 FIO Instance, osd_op_num_threads_per_shard 3, CPU Core/OSD is 3 Read IOPS:  27.4K

Here we observed that with threads per shard set to 3, CPU again becomes a bottleneck, so in the next experiment we increased the CPU cores/OSD to 4 while keeping osd_op_num_threads_per_shard at 3.

5th Experiment
==============
1 FIO Instance, osd_op_num_threads_per_shard 3, CPU core/OSD is 4 Write IOPS: 3607
1 FIO Instance, osd_op_num_threads_per_shard 3, CPU Core/OSD is 4 Read IOPS:  27K

Write performance improved but read stayed the same. There was no longer a CPU bottleneck, so we increased the amount of I/O in the data path by increasing the number of FIO instances to 3.

6th Experiment
==============
3 FIO Instance, osd_op_num_threads_per_shard 3, CPU core/OSD is 4 Write IOPS: 4409
3 FIO Instance, osd_op_num_threads_per_shard 3, CPU Core/OSD is 4 Read IOPS:  36.1K

As you can see, we got a significant improvement here. Read performance is 125% better than the default and write performance is 42% better than the default.

==============================================================================================================================================================================================================

To summarize, we see two problems affecting random performance on Azure:

1) Performance is degraded because OCS is not recognizing the devices correctly (SSD is seen as HDD).
2) CPU cores/OSD will become a bottleneck once OCS starts recognizing the devices correctly.

Question:

Which other options from the list https://gist.github.com/jdurgin/cf63bf0ec61bdc9a3ce4e60a5b0c4b30 do you feel may significantly improve performance?

We are still conducting more experiments and analysis. Will update the BZ once we have more to share.

Comment 14 Josh Durgin 2020-09-16 15:38:16 UTC
Rook is working on a workaround for the disks being marked rotational: https://github.com/rook/rook/issues/6153

For now I'd suggest setting all of those ssd-specific tunings and redeploying the osds (min_alloc_size can't be changed without redeployment).

As you saw with experiment 6, you may need to increase client load to saturate the cluster.

Comment 15 Shekhar Berry 2020-09-17 05:53:14 UTC
Continuing the experiments described in comment 13, I ran a few more; the results are below:

7th Experiment
==============
3 FIO Instance, osd_op_num_threads_per_shard 4, CPU core/OSD is 6 Write IOPS: 4211
3 FIO Instance, osd_op_num_threads_per_shard 4, CPU Core/OSD is 6 Read IOPS:  38.7K

As you can see, if we increase osd_op_num_threads_per_shard from 3 to 4 and correspondingly increase CPU cores/OSD to prevent CPU saturation, write IOPS drops (4409 --> 4211) but read IOPS increases (36.1K --> 38.7K).

8th Experiment
==============
To further see the effect of increasing osd_op_num_threads_per_shard, we conducted one more experiment, increasing it to 8:

3 FIO Instance, osd_op_num_threads_per_shard 8, CPU core/OSD is 6 Write IOPS: 3818
3 FIO Instance, osd_op_num_threads_per_shard 8, CPU Core/OSD is 6 Read IOPS:  46.9K

Once again read improves drastically, to 46.9K, but write IOPS drops further. This shows that the current default value of osd_op_num_threads_per_shard_ssd (2) may be too low to achieve good read performance, but we also cannot set it too high, as that starts to affect writes adversely.

The experiments have shown that we can improve random performance by tuning the Ceph config and CPU cores/OSD accordingly. Bug 1873161 needs to be fixed for this performance issue to be resolved.

To set expectations for OCS users, we also need to explicitly mention in our product documentation that with the current default CPU cores/OSD (request 1, limit 2), random performance of OCS on Azure will be limited. If a customer/user wants to achieve more, this value needs to be increased per OSD.

We are stopping our experiments here for now. We will revisit this bug and conduct more testing to confirm the state of performance once bug 1873161 is resolved.

Thank you

Comment 16 Yaniv Kaul 2020-09-17 07:46:01 UTC
1. Can you verify, especially in the case of write, that network is not the bottleneck?
2. I hope that Multus will improve this - it'll reduce CPU consumption and improve network throughput.

Comment 17 Shekhar Berry 2020-09-22 07:15:59 UTC
(In reply to Yaniv Kaul from comment #16)
> 1. Can you verify, especially in the case of write, that network is not the
> bottleneck?

During the read operation, here's the network utilization (the reason we see both rxkB/s and txkB/s during reads is that the FIO client pod and OCS are hosted on the same node, so between the FIO pod and the OSD one side is transmitting while the other is receiving):

      IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
      eth0  90483.20  85532.50 104163.45  95172.32      0.00      0.00      0.00

During Write Operation, here's the network utilization:

      IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
      eth0  33713.80  33205.40  40940.57  39787.91      0.00      0.00      0.00

As the numbers above show, the network is utilized far less during the write workload than during the read workload, so there is still plenty of headroom. The network is not the bottleneck for writes.
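
For reference, the interface counters above are in the format produced by sar; a sketch of collecting them on a node during a run (assuming the sysstat tools are available there, e.g. via a node debug shell):

# report per-interface network statistics every 5 seconds
sar -n DEV 5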

Comment 18 Mudit Agarwal 2020-10-09 11:00:09 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1873161 is approved for 4.7

Comment 21 Sahina Bose 2021-01-29 05:39:34 UTC
Addressed by bug 1903973

Comment 22 Shekhar Berry 2021-02-18 07:24:14 UTC
Retested OCS performance on Azure with OCP 4.6 and OCS 4.6.3, but no improvement is seen, as the OSDs are still not correctly recognized as SSDs. OCS still sees the OSDs as HDDs and applies tuning accordingly.

See https://bugzilla.redhat.com/show_bug.cgi?id=1925004#c9 for more details.

Comment 23 Shekhar Berry 2021-02-23 10:08:43 UTC
Based on a discussion over Google Chat, the OCS 4.6.3_RC5 build introduced a change that applies the fast-device tuning to all OSDs.

Snippet from oc get cephcluster -oyaml

resources:
          limits:
            cpu: "2"
            memory: 5Gi
          requests:
            cpu: "2"
            memory: 5Gi
        tuneFastDeviceClass: true

Just FYI, on the OCS side the OSDs are still shown as HDDs; see the output of ceph osd tree:

ceph osd tree
ID  CLASS WEIGHT  TYPE NAME                                                     STATUS REWEIGHT PRI-AFF 
 -1       6.00000 root default                                                                          
 -5       6.00000     region eastus                                                                     
-14       2.00000         zone eastus-1                                                                 
-13       2.00000             host ocs-deviceset-managed-premium-1-data-0-nvxqx                         
  1   hdd 2.00000                 osd.1                                             up  1.00000 1.00000 
-10       2.00000         zone eastus-2                                                                 
 -9       2.00000             host ocs-deviceset-managed-premium-2-data-0-sd266                         
  2   hdd 2.00000                 osd.2                                             up  1.00000 1.00000 
 -4       2.00000         zone eastus-3                                                                 
 -3       2.00000             host ocs-deviceset-managed-premium-0-data-0-vqs5m                         
  0   hdd 2.00000                 osd.0                                             up  1.00000 1.00000 


Nevertheless, I re-evaluated OCS performance on OCP 4.6 with the OCS 4.6.3_RC5 build and still observe poor RBD random performance:

Write IOPS : 3855 (expected value: 7500), which is almost 50% below expectation
Read IOPS  : 19400 (expected value: 90000), which is almost 75% below expectation

Based on the results above it seems that tuneFastDeviceClass: true has no effect on the tuning values being set for the OSDs. I think OCS is still setting the values based on the OSD CLASS type (which is HDD here). To confirm this I am setting the SSD values manually and re-running the tests. I will update the bz with the results.

oc version
Client Version: 4.6.16
Server Version: 4.6.16
Kubernetes Version: v1.19.0+e49167a

oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.3-732.ci   OpenShift Container Storage   4.6.3-732.ci              Succeeded

Comment 24 Sahina Bose 2021-02-23 10:31:08 UTC
(In reply to Shekhar Berry from comment #23)
> Based on discussion over Google Chat,OCS 4.6.3_RC5 build introduced a change
> by virtue of which it will apply faster tuning to all OSDs,
> 
> Snippet from oc get cephcluster -oyaml
> 
> resources:
>           limits:
>             cpu: "2"
>             memory: 5Gi
>           requests:
>             cpu: "2"
>             memory: 5Gi
>         tuneFastDeviceClass: true
> 
> Just FYI on the OCS side the OSD is still visible as HDDs, See output of
> ceph osd tree
> 
> ceph osd tree
> ID  CLASS WEIGHT  TYPE NAME                                                 
> STATUS REWEIGHT PRI-AFF 
>  -1       6.00000 root default                                              
> 
>  -5       6.00000     region eastus                                         
> 
> -14       2.00000         zone eastus-1                                     
> 
> -13       2.00000             host
> ocs-deviceset-managed-premium-1-data-0-nvxqx                         
>   1   hdd 2.00000                 osd.1                                     
> up  1.00000 1.00000 
> -10       2.00000         zone eastus-2                                     
> 
>  -9       2.00000             host
> ocs-deviceset-managed-premium-2-data-0-sd266                         
>   2   hdd 2.00000                 osd.2                                     
> up  1.00000 1.00000 
>  -4       2.00000         zone eastus-3                                     
> 
>  -3       2.00000             host
> ocs-deviceset-managed-premium-0-data-0-vqs5m                         
>   0   hdd 2.00000                 osd.0                                     
> up  1.00000 1.00000 
> 
> 
> Nevertheless Re-Evaluated OCS performance On OCP 4.6 with OCS 4.6.3_RC5
> build and I still observe poor RBD Random Performance:
> 
> Write IOPS : 3855 (Expected Value: 7500) which is almost 50% below
> expectation
> Read IOPS  : 19400 (Expected Value: 90000) which is almost 75% below
> expectation
> 
> Based on results above it seems like tuneFastDeviceClass: true has no affect
> on tuning values being set for OSD. I think OCS is still setting the value
> based on OSD CLASS type (which is HDD here). to confirm this I am setting
> SSD values manually and re-ruuning the tests. Will update the bz with
> results.

Before doing this, can you check the config assigned to the OSD via ceph config show osd.<id>
and check whether the tuning values are applied?

> 
> oc version
> Client Version: 4.6.16
> Server Version: 4.6.16
> Kubernetes Version: v1.19.0+e49167a
> 
> oc get csv
> NAME                         DISPLAY                       VERSION       
> REPLACES   PHASE
> ocs-operator.v4.6.3-732.ci   OpenShift Container Storage   4.6.3-732.ci     
> Succeeded

Comment 25 Sébastien Han 2021-02-23 10:45:12 UTC
To properly validate that the flags are being passed correctly, you must exec into an OSD pod and look at the startup flags of the ceph-osd process. Just run "ps fauwwwwwx" and you should see a number of flags such as "bluestore_cache_size", etc.
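
A sketch of doing that from outside the pod (the pod name is whatever "oc get pods" reports for the OSD in question; the "osd" container name and the app=rook-ceph-osd label are the ones Rook normally uses):

oc -n openshift-storage get pods -l app=rook-ceph-osd
oc -n openshift-storage exec <rook-ceph-osd-pod> -c osd -- ps auxwww | grep ceph-osd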

Comment 26 Shekhar Berry 2021-02-23 10:56:00 UTC
Here's the output of ceph config show osd.0:

ceph config show osd.0   
NAME                                 VALUE                                                                                                                               SOURCE   OVERRIDES             IGNORES 
bluestore_cache_size                 3221225472                                                                                                                          cmdline                                
bluestore_compression_max_blob_size  65536                                                                                                                               cmdline                                
bluestore_compression_min_blob_size  8912                                                                                                                                cmdline                                
bluestore_deferred_batch_ops         16                                                                                                                                  cmdline                                
bluestore_max_blob_size              65536                                                                                                                               cmdline                                
bluestore_min_alloc_size             4096                                                                                                                                cmdline                                
bluestore_prefer_deferred_size       0                                                                                                                                   cmdline                                
bluestore_throttle_cost_per_io       4000                                                                                                                                cmdline                                
crush_location                       root=default host=ocs-deviceset-managed-premium-0-data-0-vqs5m region=eastus zone=eastus-3                                          cmdline                                
daemonize                            false                                                                                                                               override                               
err_to_stderr                        true                                                                                                                                cmdline                                
keyring                              $osd_data/keyring                                                                                                                   default                                
leveldb_log                                                                                                                                                              default                                
log_file                                                                                                                                                                 mon                                    
log_stderr_prefix                    debug                                                                                                                               cmdline                                
log_to_file                          false                                                                                                                               default                                
log_to_stderr                        true                                                                                                                                cmdline                                
mon_allow_pool_delete                true                                                                                                                                mon                                    
mon_cluster_log_file                                                                                                                                                     mon                                    
mon_cluster_log_to_file              false                                                                                                                               default                                
mon_cluster_log_to_stderr            true                                                                                                                                cmdline                                
mon_host                             [v2:172.30.77.5:3300,v1:172.30.77.5:6789],[v2:172.30.4.107:3300,v1:172.30.4.107:6789],[v2:172.30.26.243:3300,v1:172.30.26.243:6789] override                               
mon_max_pg_per_osd                   600                                                                                                                                 file                                   
mon_osd_backfillfull_ratio           0.800000                                                                                                                            file                                   
mon_osd_full_ratio                   0.850000                                                                                                                            file                                   
mon_osd_nearfull_ratio               0.750000                                                                                                                            file                                   
mon_pg_warn_min_per_osd              0                                                                                                                                   mon                                    
ms_learn_addr_from_peer              false                                                                                                                               cmdline                                
osd_delete_sleep                     0.000000                                                                                                                            cmdline                                
osd_memory_target                    2684354560                                                                                                                          env      (default[2684354560])         
osd_memory_target_cgroup_limit_ratio 0.500000                                                                                                                            file                                   
osd_op_num_shards                    8                                                                                                                                   cmdline                                
osd_op_num_threads_per_shard         2                                                                                                                                   cmdline                                
osd_pool_default_pg_autoscale_mode   on                                                                                                                                  mon                                    
osd_recovery_sleep                   0.000000                                                                                                                            cmdline                                
osd_snap_trim_sleep                  0.000000                                                                                                                            cmdline                                
rbd_default_features                 3                                                                                                                                   mon      default[61]                   
setgroup                             ceph                                                                                                                                cmdline                                
setuser                              ceph                                                                                                                                cmdline            

From the above it seems that the SSD values are being passed correctly, but we may have to tune further to extract more performance from OCS. I will work on this.

Also, from inside the OSD pod:

ps fauwwwwwx | grep osd

root       50753  0.0  0.0 143476  2784 ?        Ssl  06:55   0:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/bb0b4635e008ded69e29efafe6b53bb2a4af32d10dea1f127346f92dd80d1981/userdata -c bb0b4635e008ded69e29efafe6b53bb2a4af32d10dea1f127346f92dd80d1981 --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-storage_rook-ceph-osd-0-555f66cdd5-dnngn_c5fe2c96-6487-4a32-b155-029d8917e805/osd/0.log --log-level info -n k8s_osd_rook-ceph-osd-0-555f66cdd5-dnngn_openshift-storage_c5fe2c96-6487-4a32-b155-029d8917e805_0 -P /var/run/containers/storage/overlay-containers/bb0b4635e008ded69e29efafe6b53bb2a4af32d10dea1f127346f92dd80d1981/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/bb0b4635e008ded69e29efafe6b53bb2a4af32d10dea1f127346f92dd80d1981/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/bb0b4635e008ded69e29efafe6b53bb2a4af32d10dea1f127346f92dd80d1981/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u bb0b4635e008ded69e29efafe6b53bb2a4af32d10dea1f127346f92dd80d1981 -s
ceph       50764 25.8  2.2 4215944 1460224 ?     Ssl  06:55  61:24  \_ ceph-osd --foreground --id 0 --fsid f54a79a1-7607-4e07-8003-dfc1376f11d6 --setuser ceph --setgroup ceph --crush-location=root=default host=ocs-deviceset-managed-premium-0-data-0-vqs5m region=eastus zone=eastus-3 --osd-op-num-shards=8 --osd-delete-sleep=0 --bluestore-compression-min-blob-size=8912 --bluestore-cache-size=3221225472 --bluestore-deferred-batch-ops=16 --osd-op-num-threads-per-shard=2 --osd-snap-trim-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-throttle-cost-per-io=4000 --osd-recovery-sleep=0 --log-to-stderr=true --err-to-stderr=true --mon-cluster-log-to-stderr=true --log-stderr-prefix=debug  --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false

Comment 28 Sébastien Han 2021-02-23 15:56:18 UTC
Ok

Comment 34 Blaine Gardner 2021-03-03 15:56:26 UTC
You should be able to edit the rook-ceph-config-override configmap to apply settings that cannot be modified at runtime.

Upstream docs can be found here:
https://rook.io/docs/rook/v1.5/ceph-advanced-configuration.html#custom-cephconf-settings
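
A hedged sketch of what such an override looks like per those upstream docs (the configmap name and the "config" key below follow the upstream Rook documentation; confirm the exact name the operator uses in the openshift-storage namespace before editing, and note that settings read at startup, such as bluestore_min_alloc_size, only take effect after the OSDs are redeployed or restarted):

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: openshift-storage
data:
  config: |
    [osd]
    bluestore_min_alloc_size = 4096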

Comment 37 Blaine Gardner 2021-03-03 17:31:12 UTC
I have been looking at the cluster, and it appears that the `--osd-op-num-threads-per-shard=2` CLI flag is set on OSD Pods when running with the tune fast settings, so there is no way to override the value other than to set the tune fast config to false.

Comment 39 Yuli Persky 2021-03-09 10:45:17 UTC
@all,

Similarly problematic performance was seen during performance tests run on Azure for both RBD and CephFS.

OCP version :4.6.17
OCS version: 4.6.3-271.ci
Ceph version: 14.2.11-95.el8cp

RBD sequential FIO test results:

4KiB read IO rate in 4.5:  30,176
4KiB read IO rate in 4.6:  33,540
4KiB read IO rate in 4.6.3 (2 independent runs): 25,206 and 31,778.67

The test results can be found here: 

http://10.0.78.167:9200/ripsaw-fio-fullres/_search?q=uuid:3b78c61f-b459-56cf-93bf-747b4d98604f 
and
http://10.0.78.167:9200/ripsaw-fio-fullres/_search?q=uuid:2999146d-4be3-57ba-991d-a5a95dfa439e


CephFS sequential FIO test results:

4KiB read IO rate in 4.5:  33,181
4KiB read IO rate in 4.6:  30,641
4KiB read IO rate in 4.6.3 (2 independent runs): 22,134.33 and 27,953.67

The test results can be found here: 
http://10.0.78.167:9200/ripsaw-fio-fullres/_search?q=uuid:acef724d-736c-5227-a646-a6593942394a
and 
http://10.0.78.167:9200/ripsaw-fio-fullres/_search?q=uuid:3b78c61f-b459-56cf-93bf-747b4d98604f

The must gather logs output will be uploaded shortly.

Comment 40 Josh Durgin 2021-03-09 16:51:54 UTC
You should be able to change the settings of a running daemon (don't restart the pod or these will get lost) via 'ceph tell osd.* osd config set' - this changes the config of a running daemon directly, rather than updating the centralized config like 'ceph config set'.
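
A sketch of that runtime override, under the assumption that the Nautilus "ceph tell ... config set" form is meant (whether a given option actually takes effect at runtime depends on the option):

# apply to all running OSD daemons without touching the centralized config
ceph tell osd.* config set osd_op_num_threads_per_shard 4
# confirm what a running OSD reports
ceph config show osd.0 | grep osd_op_num_threads_per_shard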

Comment 42 Pulkit Kundra 2021-03-09 16:55:47 UTC
(In reply to Josh Durgin from comment #40)
> You should be able to change the settings of a running daemon (don't restart
> the pod or these will get lost) via 'ceph tell osd.* osd config set' - this
> changes the config of a running daemon directly, rather than updating the
> centralized config like 'ceph config set'.

we tried this https://bugzilla.redhat.com/show_bug.cgi?id=1848907#c33.

Comment 43 Yuli Persky 2021-03-10 22:30:50 UTC
In continuation of comment #39, the must-gather logs are located here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz_1848907/

Comment 44 Yuli Persky 2021-03-11 06:30:34 UTC
In continuation of comment #39, the Azure 4.6.3 performance report is available here: https://docs.google.com/document/d/1xohm7HPNqI4vhcx9LtRKZ6eXzW-kYK4QGHy4hnrm8j4/edit#

Comment 45 Sébastien Han 2021-03-11 14:42:09 UTC
I don't know why the config set is not being applied; you can also look at the OSD admin socket directly for the config.
Or else try Josh's suggestion from https://bugzilla.redhat.com/show_bug.cgi?id=1848907#c40

Comment 46 Mudit Agarwal 2021-03-11 17:16:20 UTC
Did we try https://bugzilla.redhat.com/show_bug.cgi?id=1896810#c40?

Comment 47 Blaine Gardner 2021-03-15 16:29:11 UTC
We were seeing `ceph config set ...` not working for some configs when `tuneFastDeviceClass` was set to `true` because some params are set on the commandline. Make sure all `tune...` settings on the storageClassDeviceSet are `false` or unset.

Comment 48 Pulkit Kundra 2021-03-16 12:29:48 UTC
(In reply to Mudit Agarwal from comment #46)
> Did we try https://bugzilla.redhat.com/show_bug.cgi?id=1896810#c40

Yes, we did try that: https://bugzilla.redhat.com/show_bug.cgi?id=1848907#c33

For now, editing the options in the deployment directly is working.

Comment 50 Sahina Bose 2021-03-22 09:50:59 UTC
Jason, Shekhar, are there any recommendations for improving performance on Azure that we can implement?
This bug is currently acked for 4.7, but apart from the SSD tunings to be applied (which is already done), we have not heard of other settings that would improve performance.

Comment 51 Mudit Agarwal 2021-03-29 11:01:32 UTC
Based on the offline discussion with Karthick, moving this out of 4.7; there is still work in progress required to achieve the expected performance.

https://chat.google.com/room/AAAAREGEba8/i5Ecuobu2a4

Comment 56 Shekhar Berry 2021-04-06 07:44:03 UTC
Hi All,

TL;DR version: By increasing CPU cores/OSD and pushing more I/O through the network pipe, OCS 4.7 achieves 6400 write IOPS (85% of Azure capability) and 71200 read IOPS (80% of Azure capability). This is a 107% and 326% improvement for write and read respectively over OCS 4.4. We can further improve performance if all OCS nodes are in the same availability zone, but that would affect HA.

Here’s a detailed description of where things stand related to this bug:

-- With the D16s_v3 instance type and P40 drive type in Azure, the expected OCS performance was ~7500 random write IOPS and ~90000 random read IOPS.

-- We are currently evaluating OCS 4.7 performance in the Azure configuration. In OCS 4.7 the tune-fast settings corresponding to SSD are applied. Out of the box with this configuration we were getting ~4000 random write IOPS (53% of Azure hardware capability) and ~30000 random read IOPS (33% of Azure hardware capability).

-- In order to troubleshoot the difference between what Azure can deliver and what OCS is able to extract from it, the following troubleshooting exercise was performed:
         --- Configured an OCS cluster with all OCS nodes in the same availability zone with a proximity placement group enabled (this ensures Azure creates the VMs on the same rack, reducing latency between them) and ran a uperf analysis between OCS nodes.
         --- Compared the above uperf analysis with the configuration where OCS nodes are spread across 3 availability zones (the default configuration).
         --- The uperf analysis showed that the 3-AZ configuration was almost 2.5 times slower than the single-AZ configuration, confirming our hypothesis that network latency is one of the major bottlenecks in OCS random performance.

-- We then moved back to the default 3-AZ cluster configuration, with the aim of pushing more I/O into the network pipe to overcome the network round-trip latency identified above.

-- Once we increased the number of threads writing in parallel to the SSD drive (thus filling the network pipe and queuing up a large number of outstanding I/Os), we started to see an increase in I/O performance.

-- With the network latency bottleneck addressed by pushing more I/O, we started to hit the CPU bottleneck on the OSDs. To overcome it we increased CPU cores/OSD from the default of 2 to 3, 4, and 5, and captured I/O performance at each step.

Here are the IOPS numbers based on the above configuration changes, with OCS nodes in 3 different availability zones (the default):

2 CPU Cores/ OSD, 5G Memory (Default OCS resource Configuration)

Random Write: 5143
Random Read: 35600

3 CPU Cores/ OSD, 5G Memory

Random Write: 5915
Random Read: 45700

4 CPU Cores/ OSD, 5G Memory

Random Write: 5952
Random Read: 65300

5 CPU Cores/ OSD, 5G Memory

Random Write: 6400 (85% of Azure capability)
Random Read: 71200 (80% of Azure capability)

If all OCS nodes are in the same AZ with a proximity placement group enabled, here are the performance numbers we get:

2 CPU Cores/ OSD, 5G Memory (Default OCS resource Configuration)

Random Write: 5200
Random Read: 37500

3 CPU Cores/ OSD, 5G Memory

Random Write: 6881 (~90% of Azure capability)
Random Read: 54400

Tests with higher CPU values were not performed, as we moved back to testing the default OCS configuration (3 AZs).

Please let me know if you have any questions.

Shekhar

Comment 59 Sahina Bose 2021-06-04 07:16:10 UTC
Looking at the perf results from Shekhar in comment 56, if we assign 5 CPU cores/OSD and 5G memory we reach around 80% of Azure performance with 3 AZs, and within a single AZ around 90% of Azure performance.

We can recommend that customers increase the requests and limits on the OSD deployment, but we don't want to make this the default.
Closing this bug, as the network round-trip latency appears to be the cause of the issue.

Please re-open if any other fix is required.

Comment 61 Red Hat Bugzilla 2023-09-15 00:32:57 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

