Bug 1925004
| Summary: | [Azure] [Rook Changes] Set SSD tuning (tuneFastDeviceClass) as default for all OSD devices | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Mudit Agarwal <muagarwa> |
| Component: | rook | Assignee: | Pulkit Kundra <pkundra> |
| Status: | CLOSED ERRATA | QA Contact: | Yuli Persky <ypersky> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.6 | CC: | ebenahar, kramdoss, madam, muagarwa, ocs-bugs, pkundra, ratamir, rcyriac, sabose, shan, shberry, sostapov, swilson, ypersky |
| Target Milestone: | --- | Keywords: | AutomationBackLog, Performance, ZStream |
| Target Release: | OCS 4.6.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1909793 | Environment: | |
| Last Closed: | 2021-03-03 22:53:23 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1903973, 1909793 | | |
| Bug Blocks: | | | |
Comment 2
Mudit Agarwal
2021-02-04 07:43:44 UTC
Agreed. On Azure we had to set udev rules via a MachineConfig to expose the disks as SSDs to OCS, which sets SSD as the device class in the crush map. Without the udev rule the device shows up as HDD (a sketch of the rule follows the build listing below).

In order to verify this fix I deployed 4.6.3 on the Azure platform.
The build is:
(yulidir) [ypersky@qpas ocs-ci]$ oc -n openshift-storage get csv
NAME DISPLAY VERSION REPLACES PHASE
ocs-operator.v4.6.3-261.ci OpenShift Container Storage 4.6.3-261.ci Succeeded
(yulidir) [ypersky@qpas ocs-ci]$
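For reference, the udev workaround mentioned in the comment above would look roughly as follows. This is a sketch only; the device match, file path, and rule wording are assumptions, not the exact rule used on this cluster:

# On the node, Azure managed disks report as rotational unless a udev rule overrides it:
cat /sys/block/sdb/queue/rotational   # 1 = reported as HDD, 0 = reported as SSD (device name is illustrative)
# A rule along these lines, delivered to the worker nodes via a MachineConfig, forces the SSD classification:
#   ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}="0"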
To check whether the SSD tunings are automatically applied to all OSDs, I ran the following commands and saw that the tunings are NOT applied.
1)
(yulidir) [ypersky@qpas ocs-ci]$ oc rsh -n openshift-storage rook-ceph-tools-57c7996cd8-j7mjc
sh-4.4# ceph config dump
WHO MASK LEVEL OPTION VALUE RO
global basic log_file *
global advanced mon_allow_pool_delete true
global advanced mon_cluster_log_file
global advanced mon_pg_warn_min_per_osd 0
global advanced osd_pool_default_pg_autoscale_mode on
global advanced rbd_default_features 3
mgr advanced mgr/balancer/active true
mgr advanced mgr/balancer/mode upmap
mgr. advanced mgr/prometheus/rbd_stats_pools ocs-storagecluster-cephblockpool *
mgr.a advanced mgr/dashboard/a/server_addr 10.128.2.13 *
mgr.a advanced mgr/prometheus/a/server_addr 10.128.2.13 *
mds.ocs-storagecluster-cephfilesystem-a basic mds_cache_memory_limit 4294967296
mds.ocs-storagecluster-cephfilesystem-b basic mds_cache_memory_limit 4294967296
sh-4.4#
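As an additional check of how the OSD devices are being classified, the following could be run from the same toolbox pod. This is a sketch, not part of the original verification; OSD id 0 is illustrative:

sh-4.4# ceph osd crush class ls                    # device classes currently present in the crush map
sh-4.4# ceph osd metadata 0 | grep -i rotational   # whether OSD 0 detected its device as rotational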
2) The outputs of oc get sc -oyaml and oc get cephcluster -oyaml are saved in a file here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1909793/
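For context, the setting this bug tracks surfaces in the Rook CephCluster CR under the storage class device sets. A quick way to look for it (the field name tuneFastDeviceClass comes from the bug summary; the exact location in the CR is an assumption):

$ oc -n openshift-storage get cephcluster -o yaml | grep -i tuneFastDeviceClass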
Reopening this bug.
Hi, just to add to comment 8 above, I also tested this on my Azure setup with the OCS 4.6.3 RC build and the issue still persists.

oc get csv
NAME DISPLAY VERSION REPLACES PHASE
ocs-operator.v4.6.3-724.ci OpenShift Container Storage 4.6.3-724.ci Succeeded

oc version
Client Version: 4.6.16
Server Version: 4.6.16
Kubernetes Version: v1.19.0+e49167a

ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 6.00000 root default
-5 6.00000 region eastus
-14 2.00000 zone eastus-1
-13 2.00000 host ocs-deviceset-managed-premium-0-data-0-s86x4
2 hdd 2.00000 osd.2 up 1.00000 1.00000
-10 2.00000 zone eastus-2
-9 2.00000 host ocs-deviceset-managed-premium-1-data-0-78g7j
1 hdd 2.00000 osd.1 up 1.00000 1.00000
-4 2.00000 zone eastus-3
-3 2.00000 host ocs-deviceset-managed-premium-2-data-0-ghlvw
0 hdd 2.00000 osd.0 up 1.00000 1.00000

--Shekhar

Shekhar, the way to validate this fix is by checking the osd CLI arguments. So exec into any OSD pod and look for the run flags by running "ps fauxwwww|grep ceph-os[d]". You should see a few flags like "--osd_op_num_threads_per_shard=2" (a scripted version of this check across all OSD pods is sketched at the end of this report). Moving back ON_QA.

I've performed the following steps:
1) rsh to one of the osd pods
2) ps fauxwwww|grep osd
I've got the following output:

22960 14.2 3.8 4861740 2519528 ? Ssl Feb25 1051:46 \_ ceph-osd --foreground --id 0 --fsid cc8613e6-2114-420b-9409-53640664e54f --setuser ceph --setgroup ceph --crush-location=root=default host=ocs-deviceset-0-data-0-hqgj9 rack=rack0 region=eastus zone=eastus-1 --osd-op-num-threads-per-shard=2

sh-4.4# ps fauxwwww| grep osd
root 1010699 0.0 0.0 9188 996 pts/0 S+ 17:17 0:00 \_ grep osd
root 22483 0.0 0.0 143476 2812 ? Ssl Feb25 0:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/2ac7cf6d1c6411c77d48548a06a4e11e587a10ff508b693e7bcca204d322209e/userdata -c 2ac7cf6d1c6411c77d48548a06a4e11e587a10ff508b693e7bcca204d322209e --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-storage_rook-ceph-osd-0-6dd687c6cf-p7k9w_af8ec3f7-4d46-48d1-bfc7-6311021576c1/2ac7cf6d1c6411c77d48548a06a4e11e587a10ff508b693e7bcca204d322209e.log --log-level info -n k8s_POD_rook-ceph-osd-0-6dd687c6cf-p7k9w_openshift-storage_af8ec3f7-4d46-48d1-bfc7-6311021576c1_0 -P /var/run/containers/storage/overlay-containers/2ac7cf6d1c6411c77d48548a06a4e11e587a10ff508b693e7bcca204d322209e/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/2ac7cf6d1c6411c77d48548a06a4e11e587a10ff508b693e7bcca204d322209e/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/2ac7cf6d1c6411c77d48548a06a4e11e587a10ff508b693e7bcca204d322209e/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 2ac7cf6d1c6411c77d48548a06a4e11e587a10ff508b693e7bcca204d322209e -s
root 22948 0.0 0.0 143476 2744 ? Ssl Feb25 0:01 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/5cdb1106c1797431bd9ed1c600e5bc893e145ef2be1b00f2c7b0b1ac7ec858b8/userdata -c 5cdb1106c1797431bd9ed1c600e5bc893e145ef2be1b00f2c7b0b1ac7ec858b8 --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-storage_rook-ceph-osd-0-6dd687c6cf-p7k9w_af8ec3f7-4d46-48d1-bfc7-6311021576c1/osd/0.log --log-level info -n k8s_osd_rook-ceph-osd-0-6dd687c6cf-p7k9w_openshift-storage_af8ec3f7-4d46-48d1-bfc7-6311021576c1_0 -P /var/run/containers/storage/overlay-containers/5cdb1106c1797431bd9ed1c600e5bc893e145ef2be1b00f2c7b0b1ac7ec858b8/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/5cdb1106c1797431bd9ed1c600e5bc893e145ef2be1b00f2c7b0b1ac7ec858b8/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/5cdb1106c1797431bd9ed1c600e5bc893e145ef2be1b00f2c7b0b1ac7ec858b8/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 5cdb1106c1797431bd9ed1c600e5bc893e145ef2be1b00f2c7b0b1ac7ec858b8 -s
ceph 22960 14.2 3.8 4861740 2519528 ? Ssl Feb25 1051:46 \_ ceph-osd --foreground --id 0 --fsid cc8613e6-2114-420b-9409-53640664e54f --setuser ceph --setgroup ceph --crush-location=root=default host=ocs-deviceset-0-data-0-hqgj9 rack=rack0 region=eastus zone=eastus-1 --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8912 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --log-to-stderr=true --err-to-stderr=true --mon-cluster-log-to-stderr=true --log-stderr-prefix=debug --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false
sh-4.4#

We can see that process 22960 has the requested property "--osd-op-num-threads-per-shard=2". I've verified this for each of the OSD 0/1/2 pods. => closing the bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.6.3 container bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0718

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
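A scripted version of the per-OSD flag check referenced above could look roughly like the following. This is a sketch that assumes the default openshift-storage namespace and the standard app=rook-ceph-osd pod label:

# List every OSD pod and print any per-shard thread tuning flag found on its ceph-osd process.
for pod in $(oc -n openshift-storage get pods -l app=rook-ceph-osd -o name); do
  echo "== ${pod}"
  oc -n openshift-storage exec "${pod}" -- ps auxwwww | grep 'ceph-os[d]' | grep -o -- '--osd-op-num-threads-per-shard=[0-9]*'
done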