Bug 1925004
| Summary: | [Azure] [Rook Changes] Set SSD tuning (tuneFastDeviceClass) as default for all OSD devices | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Mudit Agarwal <muagarwa> |
| Component: | rook | Assignee: | Pulkit Kundra <pkundra> |
| Status: | CLOSED ERRATA | QA Contact: | Yuli Persky <ypersky> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.6 | CC: | ebenahar, kramdoss, madam, muagarwa, ocs-bugs, pkundra, ratamir, rcyriac, sabose, shan, shberry, sostapov, swilson, ypersky |
| Target Milestone: | --- | Keywords: | AutomationBackLog, Performance, ZStream |
| Target Release: | OCS 4.6.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1909793 | Environment: | |
| Last Closed: | 2021-03-03 22:53:23 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1903973, 1909793 | | |
| Bug Blocks: | | | |
Comment 2
Mudit Agarwal
2021-02-04 07:43:44 UTC
Agreed. On Azure we had to set udev rules via a MachineConfig to expose the disks as SSDs to OCS, which sets SSD as the device class in the crush map. Without the udev rule the device shows up as HDD (a sketch of the rule follows the build listing below).

In order to verify this fix I deployed 4.6.3 on the Azure platform.
The build is:
(yulidir) [ypersky@qpas ocs-ci]$ oc -n openshift-storage get csv
NAME DISPLAY VERSION REPLACES PHASE
ocs-operator.v4.6.3-261.ci OpenShift Container Storage 4.6.3-261.ci Succeeded
(yulidir) [ypersky@qpas ocs-ci]$
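For reference, the udev workaround mentioned in the comment above would look roughly as follows. This is a sketch only; the device match, file path, and rule wording are assumptions, not the exact rule used on this cluster:

# On the node, Azure managed disks report as rotational unless a udev rule overrides it:
cat /sys/block/sdb/queue/rotational   # 1 = reported as HDD, 0 = reported as SSD (device name is illustrative)
# A rule along these lines, delivered to the worker nodes via a MachineConfig, forces the SSD classification:
#   ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}="0"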
To check whether the SSD tunings are automatically applied to all OSDs, I ran the following commands and saw that the tunings are NOT applied.
1)
(yulidir) [ypersky@qpas ocs-ci]$ oc rsh -n openshift-storage rook-ceph-tools-57c7996cd8-j7mjc
sh-4.4# ceph config dump
WHO MASK LEVEL OPTION VALUE RO
global basic log_file *
global advanced mon_allow_pool_delete true
global advanced mon_cluster_log_file
global advanced mon_pg_warn_min_per_osd 0
global advanced osd_pool_default_pg_autoscale_mode on
global advanced rbd_default_features 3
mgr advanced mgr/balancer/active true
mgr advanced mgr/balancer/mode upmap
mgr. advanced mgr/prometheus/rbd_stats_pools ocs-storagecluster-cephblockpool *
mgr.a advanced mgr/dashboard/a/server_addr 10.128.2.13 *
mgr.a advanced mgr/prometheus/a/server_addr 10.128.2.13 *
mds.ocs-storagecluster-cephfilesystem-a basic mds_cache_memory_limit 4294967296
mds.ocs-storagecluster-cephfilesystem-b basic mds_cache_memory_limit 4294967296
sh-4.4#
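As an additional check of how the OSD devices are being classified, the following could be run from the same toolbox pod. This is a sketch, not part of the original verification; OSD id 0 is illustrative:

sh-4.4# ceph osd crush class ls                    # device classes currently present in the crush map
sh-4.4# ceph osd metadata 0 | grep -i rotational   # whether OSD 0 detected its device as rotational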
2) The outputs of oc get sc -oyaml and oc get cephcluster -oyaml are saved in a file here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1909793/
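For context, the setting this bug tracks surfaces in the Rook CephCluster CR under the storage class device sets. A quick way to look for it (the field name tuneFastDeviceClass comes from the bug summary; the exact location in the CR is an assumption):

$ oc -n openshift-storage get cephcluster -o yaml | grep -i tuneFastDeviceClass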
Reopening this bug.
Hi, just to add to comment 8 above, I also tested this on my Azure setup with the OCS 4.6.3 RC build and the issue still persists.

oc get csv
NAME DISPLAY VERSION REPLACES PHASE
ocs-operator.v4.6.3-724.ci OpenShift Container Storage 4.6.3-724.ci Succeeded

oc version
Client Version: 4.6.16
Server Version: 4.6.16
Kubernetes Version: v1.19.0+e49167a

ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 6.00000 root default
-5 6.00000 region eastus
-14 2.00000 zone eastus-1
-13 2.00000 host ocs-deviceset-managed-premium-0-data-0-s86x4
2 hdd 2.00000 osd.2 up 1.00000 1.00000
-10 2.00000 zone eastus-2
-9 2.00000 host ocs-deviceset-managed-premium-1-data-0-78g7j
1 hdd 2.00000 osd.1 up 1.00000 1.00000
-4 2.00000 zone eastus-3
-3 2.00000 host ocs-deviceset-managed-premium-2-data-0-ghlvw
0 hdd 2.00000 osd.0 up 1.00000 1.00000

--Shekhar

Shekhar, the way to validate this fix is by checking the osd CLI arguments. So exec into any OSD pod and look for the run flags by running "ps fauxwwww|grep ceph-os[d]". You should see a few flags like "--osd_op_num_threads_per_shard=2" (a scripted version of this check across all OSD pods is sketched at the end of this report). Moving back ON_QA.

I've performed the following steps:
1) rsh to one of the osd pods
2) ps fauxwwww|grep osd
I've got the following output:

22960 14.2 3.8 4861740 2519528 ? Ssl Feb25 1051:46 \_ ceph-osd --foreground --id 0 --fsid cc8613e6-2114-420b-9409-53640664e54f --setuser ceph --setgroup ceph --crush-location=root=default host=ocs-deviceset-0-data-0-hqgj9 rack=rack0 region=eastus zone=eastus-1 --osd-op-num-threads-per-shard=2

sh-4.4# ps fauxwwww| grep osd
root 1010699 0.0 0.0 9188 996 pts/0 S+ 17:17 0:00 \_ grep osd
root 22483 0.0 0.0 143476 2812 ? Ssl Feb25 0:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/2ac7cf6d1c6411c77d48548a06a4e11e587a10ff508b693e7bcca204d322209e/userdata -c 2ac7cf6d1c6411c77d48548a06a4e11e587a10ff508b693e7bcca204d322209e --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-storage_rook-ceph-osd-0-6dd687c6cf-p7k9w_af8ec3f7-4d46-48d1-bfc7-6311021576c1/2ac7cf6d1c6411c77d48548a06a4e11e587a10ff508b693e7bcca204d322209e.log --log-level info -n k8s_POD_rook-ceph-osd-0-6dd687c6cf-p7k9w_openshift-storage_af8ec3f7-4d46-48d1-bfc7-6311021576c1_0 -P /var/run/containers/storage/overlay-containers/2ac7cf6d1c6411c77d48548a06a4e11e587a10ff508b693e7bcca204d322209e/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/2ac7cf6d1c6411c77d48548a06a4e11e587a10ff508b693e7bcca204d322209e/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/2ac7cf6d1c6411c77d48548a06a4e11e587a10ff508b693e7bcca204d322209e/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 2ac7cf6d1c6411c77d48548a06a4e11e587a10ff508b693e7bcca204d322209e -s
root 22948 0.0 0.0 143476 2744 ? Ssl Feb25 0:01 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/5cdb1106c1797431bd9ed1c600e5bc893e145ef2be1b00f2c7b0b1ac7ec858b8/userdata -c 5cdb1106c1797431bd9ed1c600e5bc893e145ef2be1b00f2c7b0b1ac7ec858b8 --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-storage_rook-ceph-osd-0-6dd687c6cf-p7k9w_af8ec3f7-4d46-48d1-bfc7-6311021576c1/osd/0.log --log-level info -n k8s_osd_rook-ceph-osd-0-6dd687c6cf-p7k9w_openshift-storage_af8ec3f7-4d46-48d1-bfc7-6311021576c1_0 -P /var/run/containers/storage/overlay-containers/5cdb1106c1797431bd9ed1c600e5bc893e145ef2be1b00f2c7b0b1ac7ec858b8/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/5cdb1106c1797431bd9ed1c600e5bc893e145ef2be1b00f2c7b0b1ac7ec858b8/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/5cdb1106c1797431bd9ed1c600e5bc893e145ef2be1b00f2c7b0b1ac7ec858b8/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 5cdb1106c1797431bd9ed1c600e5bc893e145ef2be1b00f2c7b0b1ac7ec858b8 -s
ceph 22960 14.2 3.8 4861740 2519528 ? Ssl Feb25 1051:46 \_ ceph-osd --foreground --id 0 --fsid cc8613e6-2114-420b-9409-53640664e54f --setuser ceph --setgroup ceph --crush-location=root=default host=ocs-deviceset-0-data-0-hqgj9 rack=rack0 region=eastus zone=eastus-1 --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8912 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --log-to-stderr=true --err-to-stderr=true --mon-cluster-log-to-stderr=true --log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false
sh-4.4#

We can see that process 22960 has the requested property "--osd-op-num-threads-per-shard=2". I've verified this for each of the OSD 0/1/2 pods. => closing the bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.6.3 container bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0718

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
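A scripted version of the per-OSD flag check referenced above could look roughly like the following. This is a sketch that assumes the default openshift-storage namespace and the standard app=rook-ceph-osd pod label:

# List every OSD pod and print any per-shard thread tuning flag found on its ceph-osd process.
for pod in $(oc -n openshift-storage get pods -l app=rook-ceph-osd -o name); do
  echo "== ${pod}"
  oc -n openshift-storage exec "${pod}" -- ps auxwwww | grep 'ceph-os[d]' | grep -o -- '--osd-op-num-threads-per-shard=[0-9]*'
done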