Description of problem (please be as detailed as possible and provide log snippets):

On an OCS 4.4 setup hosted on the Microsoft Azure platform, performance analysis was done for both RBD and CephFS volume types. The Flexible I/O tester (fio) was used to measure both sequential and random read/write performance. The instance type used on Azure is D16s_v3, which is capable of delivering 32000 IOPS or 256 MBps of throughput. The disk type used in both cases was a P40 premium SSD, which is capable of delivering 7500 IOPS or 250 MBps if the VM instance type supports it. Based on these limits we expected random write performance close to 7500 IOPS, but we are getting random write performance in the range of 3500 to 3800 IOPS for various block sizes. This is almost 50% less than our expectation of 7500 IOPS. This is true for both RBD and CephFS volume types.

For a detailed report of the various performance tests conducted, please refer to this Google document:
https://docs.google.com/document/d/1XJPXMcV-DOEcXVKhuxOCSth9fCWfBAhDClRe_K0yTrA/edit#

Version of all relevant components (if applicable):

oc version
Client Version: 4.4.0-0.nightly-2020-06-01-021027
Server Version: 4.4.3
Kubernetes Version: v1.17.1

ceph version
ceph version 14.2.4-125.el8cp (db63624068590e593c47150c7574d08c1ec0d3e4) nautilus (stable)

Steps to Reproduce:
1. Set up OCS 4.4 on Azure
2. Run random performance tests using the kubuculum tool (https://github.com/manojtpillai/kubuculum)
3. See results

Additional Information:

cat jobfile.fiorandwrite
[randomwrite]
rw=randwrite
ioengine=libaio
direct=1
iodepth=32
time_based=1
runtime=120
directory=/dataset
filename_format=f.$jobnum.$filenum
bs=8k
filesize=32g
numjobs=2

OCS must-gather can be found here:
http://perf1.perf.lab.eng.bos.redhat.com/pub/shberry/OCS_on_Azure/ocs_44_azure_must_gather/
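For reference, a jobfile like the one above would typically be run from inside the client pod against the mounted PVC with something along these lines (the exact invocation is an assumption, not taken from the kubuculum harness):

cd /dataset && fio /path/to/jobfile.fiorandwrite --output-format=json --output=randwrite.json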
Orit, who can look at this?
(In reply to Sahina Bose from comment #2)
> Orit, who can look at this?

We suspect the network is the issue. Shekhar, can you run a networking benchmark? We need to understand the network better. Since cloud providers cap the network, this needs to be a long test, as the capping takes effect after a while.
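A long-running node-to-node benchmark along these lines would cover that (iperf3 shown as one option; the tool, duration and placeholder IP are suggestions, not what was actually run):

# on one OCS node
iperf3 -s
# on another OCS node; run long enough for any provider capping to kick in
iperf3 -c <server-node-ip> -t 1800 -P 4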
Hi Orit,

Thank you for your reply.

Here are a few of my thoughts which does point to a network issue:

1) When we scale the number of OSDs from 1 to 2 on the same OCS machines, the write performance improves from 3.5K IOPS to 6.5K IOPS (the expectation now was 15K IOPS, since we have 2 OSDs). If we were getting bottlenecked at the VM network we wouldn't have seen a jump in IOPS with increased OSDs, IMO.

2) On the environment where we have only 1 OSD per VM we see read performance of 35K IOPS. If we were bound by the network we should probably have seen much lower read performance.

3) I went to a higher instance type (D32s_v3) and tested random RBD write performance, and we were still getting it in the same range, around 3.5K/3.8K, with one 2TB OSD per VM.

Please let me know your thoughts on this.

Also, I am sorry, I don't have the setup with me anymore to run further tests at this point.

--Shekhar
(In reply to Shekhar Berry from comment #4)
> Here are a few of my thoughts which does point to a network issue:

Sorry for the typo. I meant "which does not point to a network issue:"
(In reply to Shekhar Berry from comment #4)
> 2) On the environment where we have only 1 OSD per VM we see read
> performance of 35K IOPS. If we were bound by the network we should
> probably have seen much lower read performance.

Unlikely - I don't see why read would use the same amount of network as writes.
Moving to 4.6 since this is an investigation with no cluster available now.
[I'm looking at Azure performance on a setup loaned from Shekhar]

The advertised limits of the instance/disk are captured in comment #0. I'll repeat them here for clarity, with a focus on IOPS since we are dealing with random I/O:

The instance has read caching enabled with a cache size of 400GiB. It can deliver up to 32K IOPS. The disk type can deliver 7.5K IOPS.

That gives us the following targets for different workloads:
1. random read that is cache-friendly (data set size << 400GB): 32K IOPS per instance
2. random read on a data set much larger than 400GB: 7.5K IOPS per instance (or more, because of some data being cached)
3. random write: 7.5K IOPS per instance

fio random I/O tests on a single instance support these expectations:

fio test using the managed-premium storage class (and a 2TB PVC) with a 128g data set:
  read: IOPS=30.8k
  write: IOPS=7611

fio test using the managed-premium storage class (and a 2TB PVC) with a 960g data set:
  read: IOPS=12.3k
  write: IOPS=7631

So far so good.

For a 3-node OCS setup with 1 managed premium disk per node, that gives us the following targets:
128g data set: approx. 96K IOPS on random read and 7.5K IOPS on random write.
960g data set: approx. 22.5K IOPS on random read and 7.5K IOPS on random write.

The results from the fio tests are far below this:

fio (single instance) test using the ocs-storagecluster-ceph-rbd storage class with a 128g data set:
  read: IOPS=19.9k
  write: IOPS=3121

So random read is giving about 20K instead of 96K; random write is giving about 3K instead of 7.5K.

fio (single instance) test using the ocs-storagecluster-ceph-rbd storage class with a 960g data set:
  read: IOPS=3545
  write: IOPS=2269

So random read is giving 3.5K instead of 22.5K; random write is giving 2.2K instead of 7.5K.

Let's focus on the cache-unfriendly random read test (the one with the 960g data set). Why is it giving 3.5K instead of 22.5K? Can the network be the limiting factor? I don't think so, because the same network supported 19.9K in the cache-friendly test (the one with the 128g data set).

This data strongly points to some bottleneck at the OSD that is preventing us from hitting the expected IOPS targets. I'm hoping to follow up with some more analysis.
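For context, a 960g data set could be produced by scaling up the jobfile from comment #0; the parameters below are purely an assumption for illustration (only filesize changes, 2 jobs x 480g = 960g), not the actual jobfile used:

[randomread]
rw=randread
ioengine=libaio
direct=1
iodepth=32
time_based=1
runtime=120
directory=/dataset
filename_format=f.$jobnum.$filenum
bs=8k
filesize=480g    ; assumption: 2 jobs x 480g = 960g data set
numjobs=2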
There's a different bug that Ceph identifies the Azure disks as HDDs and not SSDs, could that explain this?
(In reply to Yaniv Kaul from comment #9)
> There's a different bug that Ceph identifies the Azure disks as HDDs and not
> SSDs, could that explain this?

Yes, that's bz #1873161. That's a prime suspect, ATM.

We know that Bluestore has different policies for HDDs vs SSDs, e.g. the choice of minimum allocation size is different for HDDs vs SSDs. We are trying to find out if there are other differences as well that might explain the numbers we are seeing here.

Another thing to try is to explicitly indicate that the disks are SSDs while setting up OCS. Based on https://bugzilla.redhat.com/show_bug.cgi?id=1873161#c8 it seems that is possible.
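For anyone reproducing this, the misdetection can be checked with standard Ceph/Linux commands along these lines (device name is a placeholder):

# device class as seen by Ceph
ceph osd tree
ceph osd crush class ls
# rotational flag as seen by the kernel on the OSD node (1 = HDD, 0 = SSD)
cat /sys/block/<device>/queue/rotational
lsblk -d -o NAME,ROTA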
(In reply to Manoj Pillai from comment #10)
> Yes, that's bz #1873161. That's a prime suspect, ATM.
>
> We know that Bluestore has different policies for HDDs vs SSDs, e.g. the choice
> of minimum allocation size is different for HDDs vs SSDs. We are trying to
> find out if there are other differences as well that might explain the
> numbers we are seeing here.

We had this discussion in the Perf sync-up call today. Wrongly detecting SSDs as HDDs can apparently affect performance adversely in a number of ways. Josh, can you please list out some of the big ones?

So the focus is squarely on bz #1873161.
(In reply to Manoj Pillai from comment #11)
> We had this discussion in the Perf sync-up call today. Wrongly detecting
> SSDs as HDDs can apparently affect performance adversely in a number of ways.
> Josh, can you please list out some of the big ones?

There are many options that have different defaults for ssd and hdd - here's a list for nautilus:
https://gist.github.com/jdurgin/cf63bf0ec61bdc9a3ce4e60a5b0c4b30

The most likely to have an effect here are those related to threads/shards/cache/alloc size. For any of these options, you can override the disk-specific variant by setting the option without an ssd/hdd suffix, e.g. osd_op_num_threads_per_shard=16 would take effect regardless of the disk type.

If you switch to all the ssd settings, the next bottleneck is likely to be the OSD cpu limit, which OCS defaults to a very low 2 or 3 per osd.
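As a concrete illustration of overriding the disk-type-specific variants via the centralized config (values here are examples only; the correct ssd-equivalent values should be taken from the gist above):

ceph config set osd osd_op_num_shards 8
ceph config set osd osd_op_num_threads_per_shard 2
ceph config set osd bluestore_prefer_deferred_size 0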
As mentioned by Manoj in comment 8 above, we have narrowed the performance bottleneck down to the OSD level. For further experiments, I deployed a fresh OCS cluster on Azure and captured the default random performance baseline out of the box:

Default Write IOPS: 3090
Default Read IOPS: 16.1K

As mentioned in comments 9, 10 and 11 above, we felt the primary reason for the slow performance was bz #1873161 (SSD being detected as HDD in device class). Based on this assumption, we changed the class of the recognized devices from HDD to SSD using the commands below:

ceph osd crush rm-device-class osd.0 osd.1 osd.2
ceph osd crush set-device-class ssd osd.0 osd.1 osd.2

But the above change did not change the performance and IOPS remained the same.

Write performance after device class was changed to SSD: 2970
Read performance after device class was changed to SSD: 16K

We next tried to manually set the options for SSD in the Ceph config based on the list provided by Josh in comment 12. We edited the Ceph config and changed the value of osd_op_num_threads_per_shard to 2, 3 and 4 for our various experiments. The command to do this is:

ceph config set osd osd_op_num_threads_per_shard 2

ceph config dump
WHO                                              MASK  LEVEL     OPTION                              VALUE                               RO
global                                                 advanced  mon_allow_pool_delete               true
global                                                 advanced  mon_pg_warn_min_per_osd             0
global                                                 advanced  osd_pool_default_pg_autoscale_mode  on
global                                                 advanced  rbd_default_features                3
mgr                                                    advanced  mgr/balancer/active                 true
mgr                                                    advanced  mgr/balancer/mode                   upmap
mgr                                                    advanced  mgr/orchestrator_cli/orchestrator   rook                                *
osd                                                    advanced  osd_op_num_threads_per_shard        2                                   *
mds.ocs-storagecluster-cephfilesystem-a                basic     mds_cache_memory_limit              4294967296
mds.ocs-storagecluster-cephfilesystem-b                basic     mds_cache_memory_limit              4294967296
client.rgw.ocs.storagecluster.cephobjectstore.a        advanced  rgw_enable_usage_log                true
client.rgw.ocs.storagecluster.cephobjectstore.a        advanced  rgw_log_nonexistent_bucket          true
client.rgw.ocs.storagecluster.cephobjectstore.a        advanced  rgw_log_object_name_utc             true
client.rgw.ocs.storagecluster.cephobjectstore.a        advanced  rgw_zone                            ocs-storagecluster-cephobjectstore  *
client.rgw.ocs.storagecluster.cephobjectstore.a        advanced  rgw_zonegroup                       ocs-storagecluster-cephobjectstore  *

For some of the experiments we also edited the storagecluster CR and changed the CPU core count per OSD to 3 or 4. Here are the lines that we added to the storagecluster CR:

resources:
  limits:
    cpu: "3"
    memory: "8Gi"
  requests:
    cpu: "3"
    memory: "4Gi"

===============================================================================

Here I will summarize the results of the experiments performed after tuning the setup as described above:

1st Experiment
==============
1 FIO instance, osd_op_num_threads_per_shard 2:
  Write IOPS: 3048
  Read IOPS: 23.7K

As you see, read improves from 16.1K to 23.7K when we change osd_op_num_threads_per_shard from the default of 0 (for HDD) to 2, while write remains the same.

2nd Experiment
==============
1 FIO instance, osd_op_num_threads_per_shard 3:
  Write IOPS: 2824
  Read IOPS: 23.3K

When osd_op_num_threads_per_shard was increased to 3, the values remained the same as when it was 2. This pointed to a bottleneck in CPU, as one of the OSDs was fully utilizing the entire CPU allotted to it.
3rd Experiment
==============
1 FIO instance, osd_op_num_threads_per_shard 2, CPU cores/OSD = 3:
  Write IOPS: 3362
  Read IOPS: 26.7K

Increasing the CPU count to 3 bumped up the performance slightly and showed that CPU was indeed a limiting factor. The next experiment was to test with osd_op_num_threads_per_shard 3 and an increased CPU cores/OSD count.

4th Experiment
==============
1 FIO instance, osd_op_num_threads_per_shard 3, CPU cores/OSD = 3:
  Write IOPS: 3052
  Read IOPS: 27.4K

Here we observed that when the shard thread count is set to 3, CPU again becomes a bottleneck, so in the next experiment we increased CPU cores/OSD to 4 while keeping osd_op_num_threads_per_shard at 3.

5th Experiment
==============
1 FIO instance, osd_op_num_threads_per_shard 3, CPU cores/OSD = 4:
  Write IOPS: 3607
  Read IOPS: 27K

We saw write performance improving but read was still the same. There was no bottleneck on CPU, and we felt the need to increase the I/O in the data path by increasing the number of FIO instances to 3.

6th Experiment
==============
3 FIO instances, osd_op_num_threads_per_shard 3, CPU cores/OSD = 4:
  Write IOPS: 4409
  Read IOPS: 36.1K

As you see, we got a significant improvement here. Read performance is 125% better than the default and write performance is 42% better than the default.

===============================================================================

To summarize, we see two problems here which are affecting random performance on Azure:

1) Performance is degraded because OCS is not recognizing devices correctly (SSD is seen as HDD).
2) CPU cores/OSD will be a bottleneck once OCS starts recognizing the devices correctly.

Question: Which other options from the list https://gist.github.com/jdurgin/cf63bf0ec61bdc9a3ce4e60a5b0c4b30 do you feel may significantly improve performance?

We are still conducting more experiments and analysis. Will update the BZ once we have more to share.
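(For reference, the OSD CPU saturation mentioned above is the kind of thing that can be spotted with standard tooling, e.g. something like the following; the exact commands used are not recorded in this bz:)

# per-pod CPU/memory usage in the OCS namespace
oc adm top pods -n openshift-storage
# or look at the ceph-osd process directly on the node
oc debug node/<node-name> -- chroot /host top -b -n 1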
Rook is working on a workaround for the disks being marked rotational: https://github.com/rook/rook/issues/6153

For now I'd suggest setting all of those ssd-specific tunings and redeploying the osds (min_alloc_size can't be changed without redeployment). As you saw with experiment 6, you may need to increase client load to saturate the cluster.
In continuation of my experiments described in comment 13, I did a few more, whose results are described below:

7th Experiment
==============
3 FIO instances, osd_op_num_threads_per_shard 4, CPU cores/OSD = 6:
  Write IOPS: 4211
  Read IOPS: 38.7K

As you see, if we increase osd_op_num_threads_per_shard from 3 to 4 and correspondingly also increase CPU cores/OSD to prevent CPU saturation, write IOPS drops (4409 --> 4211) but read IOPS increases (36.1K --> 38.7K).

8th Experiment
==============
To further see the effect of increasing osd_op_num_threads_per_shard, we conducted one more experiment with it set to 8:

3 FIO instances, osd_op_num_threads_per_shard 8, CPU cores/OSD = 6:
  Write IOPS: 3818
  Read IOPS: 46.9K

Once again we see read improving drastically, to 46.9K, but write IOPS drops further. This shows that the current default value of osd_op_num_threads_per_shard_ssd (2) may be too low to achieve good read performance, but we cannot make it too large either, as that starts to affect writes adversely.

The experiments have shown that we can improve random performance by tuning the Ceph config and CPU cores/OSD accordingly. Bug 1873161 needs to be fixed for this performance issue to be resolved. For a better understanding by users of OCS, we also need to explicitly mention in our product documentation that with the current default CPU cores/OSD (request 1, limit 2) we get limited random performance from OCS on Azure. If any customer/user wants to achieve more, we need to increase this value per OSD.

For now we are stopping our experiments here. We will revisit this bug and conduct more testing to confirm the state of performance once bug 1873161 is resolved.

Thank you
1. Can you verify, especially in the case of writes, that the network is not the bottleneck?
2. I hope that Multus will improve this - it'll reduce CPU consumption and improve network throughput.
(In reply to Yaniv Kaul from comment #16)
> 1. Can you verify, especially in the case of writes, that the network is not
> the bottleneck?

During the read operation, here is the network utilization (the reason we see both rxkB/s and txkB/s during reads is that we hosted the FIO client pod and OCS on the same node, so between the FIO pod and the OSD, one is transmitting and the other is receiving):

IFACE     rxpck/s   txpck/s     rxkB/s    txkB/s  rxcmp/s  txcmp/s  rxmcst/s
eth0     90483.20  85532.50  104163.45  95172.32     0.00     0.00      0.00

During the write operation, here is the network utilization:

IFACE     rxpck/s   txpck/s     rxkB/s    txkB/s  rxcmp/s  txcmp/s  rxmcst/s
eth0     33713.80  33205.40   40940.57  39787.91     0.00     0.00      0.00

As you can see, during the write workload the network could still be utilized much more than it is, compared to the read workload. So the network is not a bottleneck here during writes.
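(The utilization above looks like standard sysstat output; one way to collect it on the node during the fio run would be the following, shown as an assumption since the original capture method isn't stated:)

# sample network device statistics every second for the duration of the run
sar -n DEV 1 120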
https://bugzilla.redhat.com/show_bug.cgi?id=1873161 is approved for 4.7
Addressed by bug 1903973
Retested OCS performance on Azure with OCP 4.6 and OCS 4.6.3, but no improvement in performance was seen, as the OSDs are still not correctly recognized as SSDs. OCS still sees the OSDs as HDD and applies tuning accordingly. See https://bugzilla.redhat.com/show_bug.cgi?id=1925004#c9 for more details.
Based on discussion over Google Chat, the OCS 4.6.3_RC5 build introduced a change by virtue of which it will apply the "fast" (SSD) tuning to all OSDs.

Snippet from oc get cephcluster -oyaml:

resources:
  limits:
    cpu: "2"
    memory: 5Gi
  requests:
    cpu: "2"
    memory: 5Gi
tuneFastDeviceClass: true

Just FYI, on the OCS side the OSDs are still visible as HDDs, see the output of ceph osd tree:

ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                                                 STATUS  REWEIGHT  PRI-AFF
 -1         6.00000  root default
 -5         6.00000      region eastus
-14         2.00000          zone eastus-1
-13         2.00000              host ocs-deviceset-managed-premium-1-data-0-nvxqx
  1   hdd   2.00000                  osd.1                                        up   1.00000  1.00000
-10         2.00000          zone eastus-2
 -9         2.00000              host ocs-deviceset-managed-premium-2-data-0-sd266
  2   hdd   2.00000                  osd.2                                        up   1.00000  1.00000
 -4         2.00000          zone eastus-3
 -3         2.00000              host ocs-deviceset-managed-premium-0-data-0-vqs5m
  0   hdd   2.00000                  osd.0                                        up   1.00000  1.00000

Nevertheless, I re-evaluated OCS performance on OCP 4.6 with the OCS 4.6.3_RC5 build and I still observe poor RBD random performance:

Write IOPS: 3855 (expected value: 7500), which is almost 50% below expectation
Read IOPS: 19400 (expected value: 90000), which is almost 75% below expectation

Based on the results above it seems like tuneFastDeviceClass: true has no effect on the tuning values being set for the OSDs. I think OCS is still setting the values based on the OSD CLASS type (which is HDD here). To confirm this I am setting the SSD values manually and re-running the tests. Will update the bz with the results.

oc version
Client Version: 4.6.16
Server Version: 4.6.16
Kubernetes Version: v1.19.0+e49167a

oc get csv
NAME                         DISPLAY                       VERSION        REPLACES  PHASE
ocs-operator.v4.6.3-732.ci   OpenShift Container Storage   4.6.3-732.ci             Succeeded
(In reply to Shekhar Berry from comment #23)
> Based on the results above it seems like tuneFastDeviceClass: true has no
> effect on the tuning values being set for the OSDs. I think OCS is still
> setting the values based on the OSD CLASS type (which is HDD here). To
> confirm this I am setting the SSD values manually and re-running the tests.

Before doing this, can you check the config assigned to the OSD via

ceph config show osd.<id>

and check whether the tuning values are applied?
To properly validate that the flags are being passed correctly, you must exec into an OSD pod and look at the startup flags of the ceph-osd process. So just run "ps fauwwwwwx" and you should see a few lines with "bluestore_cache_size", etc.
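From the OCP side that would look roughly like this (pod and container names here are taken from the output further below and are just an example):

oc -n openshift-storage exec rook-ceph-osd-0-555f66cdd5-dnngn -c osd -- ps fauwwwwwx | grep ceph-osd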
Here's the output of ceph config show osd.0:

ceph config show osd.0
NAME                                   VALUE                                                                                       SOURCE    OVERRIDES  IGNORES
bluestore_cache_size                   3221225472                                                                                  cmdline
bluestore_compression_max_blob_size    65536                                                                                       cmdline
bluestore_compression_min_blob_size    8912                                                                                        cmdline
bluestore_deferred_batch_ops           16                                                                                          cmdline
bluestore_max_blob_size                65536                                                                                       cmdline
bluestore_min_alloc_size               4096                                                                                        cmdline
bluestore_prefer_deferred_size         0                                                                                           cmdline
bluestore_throttle_cost_per_io         4000                                                                                        cmdline
crush_location                         root=default host=ocs-deviceset-managed-premium-0-data-0-vqs5m region=eastus zone=eastus-3  cmdline
daemonize                              false                                                                                       override
err_to_stderr                          true                                                                                        cmdline
keyring                                $osd_data/keyring                                                                           default
leveldb_log                                                                                                                        default
log_file                                                                                                                           mon
log_stderr_prefix                      debug                                                                                       cmdline
log_to_file                            false                                                                                       default
log_to_stderr                          true                                                                                        cmdline
mon_allow_pool_delete                  true                                                                                        mon
mon_cluster_log_file                                                                                                               mon
mon_cluster_log_to_file                false                                                                                       default
mon_cluster_log_to_stderr              true                                                                                        cmdline
mon_host                               [v2:172.30.77.5:3300,v1:172.30.77.5:6789],[v2:172.30.4.107:3300,v1:172.30.4.107:6789],[v2:172.30.26.243:3300,v1:172.30.26.243:6789]  override
mon_max_pg_per_osd                     600                                                                                         file
mon_osd_backfillfull_ratio             0.800000                                                                                    file
mon_osd_full_ratio                     0.850000                                                                                    file
mon_osd_nearfull_ratio                 0.750000                                                                                    file
mon_pg_warn_min_per_osd                0                                                                                           mon
ms_learn_addr_from_peer                false                                                                                       cmdline
osd_delete_sleep                       0.000000                                                                                    cmdline
osd_memory_target                      2684354560                                                                                  env       (default[2684354560])
osd_memory_target_cgroup_limit_ratio   0.500000                                                                                    file
osd_op_num_shards                      8                                                                                           cmdline
osd_op_num_threads_per_shard           2                                                                                           cmdline
osd_pool_default_pg_autoscale_mode     on                                                                                          mon
osd_recovery_sleep                     0.000000                                                                                    cmdline
osd_snap_trim_sleep                    0.000000                                                                                    cmdline
rbd_default_features                   3                                                                                           mon       default[61]
setgroup                               ceph                                                                                        cmdline
setuser                                ceph                                                                                        cmdline

From the above it seems that the values are being passed correctly per the SSD settings, but we may have to tune further to extract more performance from OCS. I will work on this.

Also, from inside the OSD pod:

ps fauwwwwwx | grep osd
root  50753  0.0  0.0  143476  2784 ?  Ssl  06:55  0:00  /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/bb0b4635e008ded69e29efafe6b53bb2a4af32d10dea1f127346f92dd80d1981/userdata -c bb0b4635e008ded69e29efafe6b53bb2a4af32d10dea1f127346f92dd80d1981 --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-storage_rook-ceph-osd-0-555f66cdd5-dnngn_c5fe2c96-6487-4a32-b155-029d8917e805/osd/0.log --log-level info -n k8s_osd_rook-ceph-osd-0-555f66cdd5-dnngn_openshift-storage_c5fe2c96-6487-4a32-b155-029d8917e805_0 -P /var/run/containers/storage/overlay-containers/bb0b4635e008ded69e29efafe6b53bb2a4af32d10dea1f127346f92dd80d1981/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/bb0b4635e008ded69e29efafe6b53bb2a4af32d10dea1f127346f92dd80d1981/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/bb0b4635e008ded69e29efafe6b53bb2a4af32d10dea1f127346f92dd80d1981/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u bb0b4635e008ded69e29efafe6b53bb2a4af32d10dea1f127346f92dd80d1981 -s
ceph  50764 25.8  2.2  4215944 1460224 ?  Ssl  06:55  61:24  \_ ceph-osd --foreground --id 0 --fsid f54a79a1-7607-4e07-8003-dfc1376f11d6 --setuser ceph --setgroup ceph --crush-location=root=default host=ocs-deviceset-managed-premium-0-data-0-vqs5m region=eastus zone=eastus-3 --osd-op-num-shards=8 --osd-delete-sleep=0 --bluestore-compression-min-blob-size=8912 --bluestore-cache-size=3221225472 --bluestore-deferred-batch-ops=16 --osd-op-num-threads-per-shard=2 --osd-snap-trim-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-throttle-cost-per-io=4000 --osd-recovery-sleep=0 --log-to-stderr=true --err-to-stderr=true --mon-cluster-log-to-stderr=true --log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false
Ok
You should be able to edit the rook-ceph-config-override configmap to apply settings that cannot be modified at runtime. Upstream docs can be found here: https://rook.io/docs/rook/v1.5/ceph-advanced-configuration.html#custom-cephconf-settings
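The linked docs describe an override ConfigMap roughly along these lines (a sketch only; the ConfigMap name and namespace should be checked against the docs for the deployed Rook/OCS version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: openshift-storage
data:
  config: |
    [osd]
    osd_op_num_shards = 8
    osd_op_num_threads_per_shard = 2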
I have been looking at the cluster, and it appears that the `--osd-op-num-threads-per-shard=2` CLI flag is set on OSD Pods when running with the tune fast settings, so there is no way to override the value other than to set the tune fast config to false.
@all,

Similar problematic performance was seen during performance tests run on Azure for both RBD and CephFS.

OCP version: 4.6.17
OCS version: 4.6.3-271.ci
Ceph version: 14.2.11-95.el8cp

RBD sequential FIO test results:
4KiB read IO rate in 4.5: 30,176
4KiB read IO rate in 4.6: 33,540
4KiB read IO rate in 4.6.3 (2 independent runs): 25,206 and 31,778.67

The test results can be found here:
http://10.0.78.167:9200/ripsaw-fio-fullres/_search?q=uuid:3b78c61f-b459-56cf-93bf-747b4d98604f
and
http://10.0.78.167:9200/ripsaw-fio-fullres/_search?q=uuid:2999146d-4be3-57ba-991d-a5a95dfa439e

CephFS sequential FIO test results:
4KiB read IO rate in 4.5: 33,181
4KiB read IO rate in 4.6: 30,641
4KiB read IO rate in 4.6.3 (2 independent runs): 22,134.33 and 27,953.67

The test results can be found here:
http://10.0.78.167:9200/ripsaw-fio-fullres/_search?q=uuid:acef724d-736c-5227-a646-a6593942394a
and
http://10.0.78.167:9200/ripsaw-fio-fullres/_search?q=uuid:3b78c61f-b459-56cf-93bf-747b4d98604f

The must-gather logs output will be uploaded shortly.
You should be able to change the settings of a running daemon (don't restart the pod or these will get lost) via 'ceph tell osd.* osd config set' - this changes the config of a running daemon directly, rather than updating the centralized config like 'ceph config set'.
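One hedged form of this for a specific option (injectargs is the long-standing way to push a setting into running OSDs; exact behavior varies by option):

ceph tell osd.* injectargs '--osd_op_num_threads_per_shard=2'
# note: some options (e.g. shard/thread counts) are only read at OSD startup,
# so a runtime change may not take effect until restart, at which point the
# command-line flags from the pod spec (comment 26) win again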
(In reply to Josh Durgin from comment #40)
> You should be able to change the settings of a running daemon (don't restart
> the pod or these will get lost) via 'ceph tell osd.* osd config set' - this
> changes the config of a running daemon directly, rather than updating the
> centralized config like 'ceph config set'.

We tried this: https://bugzilla.redhat.com/show_bug.cgi?id=1848907#c33
In continuation of comment#39 - the must gather logs are located here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz_1848907/
In continuation of comment#39 - Azure 4.6.3 Performance report it available here: https://docs.google.com/document/d/1xohm7HPNqI4vhcx9LtRKZ6eXzW-kYK4QGHy4hnrm8j4/edit#
I don't know why the config set is not applied. You can look at the OSD socket directly for the config too, or else try Josh's suggestion from https://bugzilla.redhat.com/show_bug.cgi?id=1848907#c40
Did we try https://bugzilla.redhat.com/show_bug.cgi?id=1896810#c40
We were seeing `ceph config set ...` not working for some configs when `tuneFastDeviceClass` was set to `true` because some params are set on the commandline. Make sure all `tune...` settings on the storageClassDeviceSet are `false` or unset.
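For clarity, that corresponds to the device set section of the CephCluster CR; a minimal sketch (the tuneFastDeviceClass field name comes from the cephcluster snippet in comment 23, everything else is illustrative):

storageClassDeviceSets:
  - name: ocs-deviceset-managed-premium
    count: 3
    tuneFastDeviceClass: false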
(In reply to Mudit Agarwal from comment #46)
> Did we try https://bugzilla.redhat.com/show_bug.cgi?id=1896810#c40

Yes, we did try that: https://bugzilla.redhat.com/show_bug.cgi?id=1848907#c33

For now, editing the options in the deployment directly is working.
Jason, Shekhar, are there any recommendations on improving performance on Azure that we can implement? This bug is currently acked for 4.7, but apart from the SSD tunings to be applied (which is already done), we have not heard of other settings to improve performance.
Based on the offline discussion with Karthick, moving this out of 4.7; there is still WIP required to achieve the expected performance. https://chat.google.com/room/AAAAREGEba8/i5Ecuobu2a4
Hi All,

TL;DR version: By increasing CPU cores/OSD and pumping more IO through the network pipe, OCS achieves 6400 write IOPS (85% of Azure capability) and 71200 read IOPS (80% of Azure capability). In OCS 4.7 this is a 107% and 326% improvement for write and read respectively over OCS 4.4. We can further improve performance if all OCS nodes are in the same availability zone, but that will affect HA.

Here's a detailed description of where things stand related to this bug:

-- With the D16s_v3 instance type and P40 drive type in Azure, expected OCS performance was ~7500 random write IOPS and ~90000 random read IOPS.

-- Currently we are evaluating OCS 4.7 performance on Azure. In OCS 4.7 the TuneFast settings corresponding to SSD are set. Out of the box with this configuration we were getting ~4000 random write IOPS (53% of Azure H/W capability) and ~30000 random read IOPS (33% of Azure H/W capability).

-- In order to troubleshoot the difference between what Azure can deliver and what OCS is able to extract from it, the following troubleshooting exercise was performed:

--- Configured an OCS cluster with all OCS nodes in the same availability zone with a Proximity Placement Group enabled (this ensures Azure will create the VMs on the same rack and reduce latency between them) and did a uperf analysis between OCS nodes.

--- Compared the above uperf analysis with the configuration where OCS nodes were in 3 different availability zones (the default configuration).

--- The results of the uperf analysis showed that the 3-AZ configuration was almost 2.5 times slower than the single-AZ configuration, supporting our hypothesis that network latency was one of the major bottlenecks in OCS random performance.

-- We then moved back to the default 3-AZ cluster configuration with the aim of pumping more IO into the network pipe to overcome the network round-trip latency identified above.

-- Once we increased the number of threads writing in parallel to the SSD drive (and thus filling the network pipe and queuing up a large number of outstanding IOs), we started to see an increase in IO performance.

-- With the network latency bottleneck taken care of by pumping more IO, we started to hit the CPU bottleneck on the OSDs. To overcome the CPU bottleneck we increased CPU cores/OSD from the default 2 to 3, 4 and 5, and captured IO performance.

Here are the IOPS numbers based on the above changes in configuration, with OCS nodes in 3 different availability zones (default):

2 CPU cores/OSD, 5G memory (default OCS resource configuration)
Random Write: 5143
Random Read: 35600

3 CPU cores/OSD, 5G memory
Random Write: 5915
Random Read: 45700

4 CPU cores/OSD, 5G memory
Random Write: 5952
Random Read: 65300

5 CPU cores/OSD, 5G memory
Random Write: 6400 (85% of Azure capability)
Random Read: 71200 (80% of Azure capability)

If all OCS nodes are in the same AZ with a proximity placement group enabled, here are the performance numbers we get:

2 CPU cores/OSD, 5G memory (default OCS resource configuration)
Random Write: 5200
Random Read: 37500

3 CPU cores/OSD, 5G memory
Random Write: 6881 (~90% of Azure capability)
Random Read: 54400

Tests with higher CPU values were not performed, as we moved back to testing the default OCS configuration (3 AZs).

Please let me know if you have any questions.

Shekhar
Looking at the comment 56 perf results from Shekhar: if we assign 5 CPU cores/OSD and 5G memory, we reach around 80% of Azure performance with 3 AZs, and within a single AZ around 90% of Azure performance. We can recommend that customers increase the requests and limits on the OSD deployment, but we don't want to make this the default. Closing this bug, as the network round-trip latency seems to cause the issue. Please re-open if any other fix is required.
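For anyone landing here later, raising the OSD requests/limits would be done on the StorageCluster CR, roughly as below (a sketch only; the values are illustrative and mirror the snippet in comment 13, and the exact location of the resources block should be checked for the deployed OCS version):

oc -n openshift-storage edit storagecluster
# then, under the relevant device set / OSD resources section:
#   resources:
#     requests:
#       cpu: "5"
#       memory: "5Gi"
#     limits:
#       cpu: "5"
#       memory: "5Gi"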
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days