+++ This bug was initially created as a clone of Bug #1903973 +++

Description of problem (please be detailed as possible and provide log snippets):
For OSDs created on virtual/cloud environments, the OSD is detected as a rotational disk even though the underlying device is usually an SSD. We want to default to the SSD tuning options for these devices in order to get better performance results. See bug 1848907#c13 for an example. We could either default to using tuneFastDeviceClass for specific environments or for all environments. Considering that the majority of workloads run on SSD-backed devices, I think it makes sense to set this as the default.

Version of all relevant components (if applicable):
4.6

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Reduced performance

Is there any workaround available to the best of your knowledge?
Manually setting the SSD tune options
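A minimal sketch of the manual workaround mentioned above, assuming the rook-ceph toolbox deployment is named rook-ceph-tools and reusing the values that the automatic fast tuning applies (taken from the osd config shown later in this bug). Note that some of these only take effect after the OSDs restart, and bluestore_min_alloc_size only applies when an OSD is created, so it is omitted here:

  # enter the toolbox pod (deployment name assumed)
  oc -n openshift-storage rsh deploy/rook-ceph-tools
  # store SSD-oriented values for all OSDs in the mon config database
  ceph config set osd osd_op_num_shards 8
  ceph config set osd osd_op_num_threads_per_shard 2
  ceph config set osd osd_recovery_sleep 0
  ceph config set osd osd_snap_trim_sleep 0
  ceph config set osd osd_delete_sleep 0
  ceph config set osd bluestore_throttle_cost_per_io 4000
  # confirm the effective values on one OSD
  ceph config show osd.0 | grep -E 'osd_op_num|sleep|throttle_cost'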
Does this also require a backport of the Rook and ocs-operator patches?
OK, here is more analysis of all the relevant PRs, in the order they were merged into master:

1) https://github.com/openshift/ocs-operator/pull/864
   Expand the existing throttling (slow tuning) for EBS storage class based OSD devices to add automatic fast tuning for Azure storage class devices.

2) https://github.com/openshift/ocs-operator/pull/946
   Refactoring, no behavior change.

3) https://github.com/openshift/ocs-operator/pull/947
   Committing lost CRD changes, just book-keeping.

4) https://github.com/openshift/ocs-operator/pull/945
   A small follow-up fix for 864.

5) https://github.com/openshift/ocs-operator/pull/944
   Introduce a new DeviceType setting in the StorageDeviceSet that overrides the other automatic tuning behavior and allows fast or slow tuning to be enforced by setting the device type to hdd or ssd (see the sketch after this list).

6) https://github.com/openshift/ocs-operator/pull/955 (pending merge)
   Add automatic default fast tuning for some platforms (Azure, IBMCloud, oVirt). This is overridden by the storage class based tuning, which in turn is overridden by the device type tuning.

So yeah, we could probably backport only 864, but I still prefer to do all or nothing if we can.
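For illustration, a minimal sketch of how the device type override from PR 944 would look in the StorageCluster CR. The deviceType field name is inferred from the PR description and not confirmed here, and the device set name, count, size, and the Azure managed-premium storage class are placeholders:

  spec:
    storageDeviceSets:
    - name: ocs-deviceset
      count: 3
      deviceType: ssd            # assumed field; forces fast (SSD) tuning regardless of platform/storage class detection
      dataPVCTemplate:
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 512Gi
          storageClassName: managed-premium
          volumeMode: Block

If my reading of the Rook side is right and the setting is propagated as tuneFastDeviceClass on the CephCluster device sets, the result could be spot-checked with something like:

  oc -n openshift-storage get cephcluster -o jsonpath='{.items[*].spec.storage.storageClassDeviceSets[*].tuneFastDeviceClass}'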
As per QE, this is not a blocker for Azure, so we don't have to backport it and can close this in the next release.
I am a bit surprised to learn that apparently we not only need the backport of the ocs-operator patch, but also a backport of a Rook patch for this to work. It seems only the Rook upstream version 1.4.9 has the latest tuning patches that the ocs-operator patch requires, and these would need to be applied to the downstream rook release-4.6 branch. @Pulkit - can you confirm?
Sahina, should we add doc_text for this?
(In reply to Mudit Agarwal from comment #15)
> Sahina, should we add doc_text for this?

No, since we haven't released on Azure before, there's no change in behavior that we want to call out.
In order to verify this fix I've deployed 4.6.3 on the Azure platform. The build is:

(yulidir) [ypersky@qpas ocs-ci]$ oc -n openshift-storage get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.3-261.ci   OpenShift Container Storage   4.6.3-261.ci              Succeeded
(yulidir) [ypersky@qpas ocs-ci]$

When checking whether the SSD tuning is automatically applied on all OSDs, I ran the following commands and saw that the tunings are NOT applied.

1)
(yulidir) [ypersky@qpas ocs-ci]$ oc rsh -n openshift-storage rook-ceph-tools-57c7996cd8-j7mjc
sh-4.4# ceph config dump
WHO                                      MASK  LEVEL     OPTION                              VALUE                             RO
global                                         basic     log_file                                                              *
global                                         advanced  mon_allow_pool_delete               true
global                                         advanced  mon_cluster_log_file
global                                         advanced  mon_pg_warn_min_per_osd             0
global                                         advanced  osd_pool_default_pg_autoscale_mode  on
global                                         advanced  rbd_default_features                3
mgr                                            advanced  mgr/balancer/active                 true
mgr                                            advanced  mgr/balancer/mode                   upmap
mgr.                                           advanced  mgr/prometheus/rbd_stats_pools      ocs-storagecluster-cephblockpool  *
mgr.a                                          advanced  mgr/dashboard/a/server_addr         10.128.2.13                       *
mgr.a                                          advanced  mgr/prometheus/a/server_addr        10.128.2.13                       *
mds.ocs-storagecluster-cephfilesystem-a        basic     mds_cache_memory_limit              4294967296
mds.ocs-storagecluster-cephfilesystem-b        basic     mds_cache_memory_limit              4294967296
sh-4.4#

2) The outputs of "oc get sc -oyaml" and "oc get cephcluster -oyaml" are saved in a file here:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1909793/

Reopening this bug.
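One note on the check, if I understand the config commands correctly: ceph config dump only lists what is stored in the mon config database, while tuning that is passed to the OSDs as command-line flags would only be visible in the effective config of a running daemon, for example:

  sh-4.4# ceph config show osd.0 | grep -E 'bluestore|osd_op_num'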
The fix is verified in 4.6.3-267.ci. I ran ceph config show osd.0 from the toolbox pod and got the following:

(yulidir) [ypersky@qpas ~]$ oc rsh rook-ceph-tools-857ffbdd95-bnptl
sh-4.4# ceph config show osd.0
NAME                                   VALUE                                       SOURCE    OVERRIDES  IGNORES
bluestore_cache_size                   3221225472                                  cmdline
bluestore_compression_max_blob_size    65536                                       cmdline
bluestore_compression_min_blob_size    8912                                        cmdline
bluestore_deferred_batch_ops           16                                          cmdline
bluestore_max_blob_size                65536                                       cmdline
bluestore_min_alloc_size               4096                                        cmdline
bluestore_prefer_deferred_size         0                                           cmdline
bluestore_throttle_cost_per_io         4000                                        cmdline
crush_location                         root=default host=ocs-deviceset-0-data-0-xfgxt rack=rack0 region=eastus zone=eastus-1  cmdline
daemonize                              false                                       override
err_to_stderr                          true                                        cmdline
keyring                                $osd_data/keyring                           default
leveldb_log                                                                        default
log_file                                                                           mon
log_stderr_prefix                      debug                                       cmdline
log_to_file                            false                                       default
log_to_stderr                          true                                        cmdline
mon_allow_pool_delete                  true                                        mon
mon_cluster_log_file                                                               mon
mon_cluster_log_to_file                false                                       default
mon_cluster_log_to_stderr              true                                        cmdline
mon_host                               [v2:172.30.229.105:3300,v1:172.30.229.105:6789],[v2:172.30.31.41:3300,v1:172.30.31.41:6789],[v2:172.30.80.121:3300,v1:172.30.80.121:6789]  override
mon_max_pg_per_osd                     600                                         file
mon_osd_backfillfull_ratio             0.800000                                    file
mon_osd_full_ratio                     0.850000                                    file
mon_osd_nearfull_ratio                 0.750000                                    file
mon_pg_warn_min_per_osd                0                                           mon
ms_learn_addr_from_peer                false                                       cmdline
osd_delete_sleep                       0.000000                                    cmdline
osd_memory_target                      2684354560                                  env       (default[2684354560])
osd_memory_target_cgroup_limit_ratio   0.500000                                    file
osd_op_num_shards                      8                                           cmdline
osd_op_num_threads_per_shard           2                                           cmdline
osd_pool_default_pg_autoscale_mode     on                                          mon
osd_recovery_sleep                     0.000000                                    cmdline
osd_snap_trim_sleep                    0.000000                                    cmdline
rbd_default_features                   3                                           mon                  default[61]
setgroup                               ceph                                        cmdline
setuser                                ceph                                        cmdline
sh-4.4#
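Since the SOURCE column reports cmdline for the tuning options, they are passed as flags on the OSD processes; assuming the OSD deployments carry the usual app=rook-ceph-osd label, the same values can also be spot-checked from outside the toolbox by looking at the deployment args, e.g.:

  oc -n openshift-storage get deployment -l app=rook-ceph-osd -o yaml | grep -i bluestore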
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.6.3 container bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0718