Bug 1817228

Summary: [baremetal][RFE] OCS does not distinguish between SSD and HDD
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Ben England <bengland>
Component: unclassified
Assignee: N Balachandran <nibalach>
Status: CLOSED WONTFIX
QA Contact: Petr Balogh <pbalogh>
Severity: high
Priority: high
Docs Contact:
Version: 4.3
CC: assingh, bniver, ebenahar, ekuric, etamir, gmeno, madam, muagarwa, ocs-bugs, odf-bz-bot, owasserm, rcyriac, sabose, shan, shberry, sostapov, tmuthami
Target Milestone: ---
Keywords: AutomationBackLog, FutureFeature, Performance
Target Release: ---
Hardware: All
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-31 13:47:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  dittybopper graph of block device throughput (flags: none)
  screenshot graph showing NVM device approaching max throughput (flags: none)

Description Ben England 2020-03-25 21:07:11 UTC
Created attachment 1673619 [details]
dittybopper graph of block device throughput

Description of problem 

For a baremetal install, OCS does not distinguish between NVMe devices and HDD devices. Consequently it treats each NVMe device as just another OSD and puts NVMe OSDs in the same storage pool as HDD OSDs, even though they have different sizes. Since NVMe SSDs are normally smaller (in TB) than HDDs, each NVMe device would thus do less I/O work (IOPS, MB/s) than each HDD, even though each NVMe device can do at least 2 orders of magnitude more IOPS than an HDD.

This is a regression compared to RHCS, where the ceph-ansible installer is able to utilize NVMe devices intelligently (more below).

Why do we care?  HDDs are not needed for low-density/high-IOPS configurations, but they are still needed for high-density/low-IOPS configurations.  All the performance data supporting the conclusions below was collected for RHCS years ago, and nothing about it has changed.


Version:

OpenShift (OCP) 4.3.3
quay.io/rhceph-dev/ocs-olm-operator:4.3.0-rc2
Ceph nautilus 14.2.4-125 
quay.io/rhceph-dev/rhceph@sha256:1ec55227084f058c468df5cfff2cd55623668a72ec742af3e8b1c05b52d44d0a


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Yes, from a performance perspective.


Is there any workaround available to the best of your knowledge?

No.  I could most likely use the ceph command from the toolbox pod and hack on the cluster to make it work better, but that is not how the OCS installer is supposed to work.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1 - vanilla install on commodity server hardware


Is this issue reproducible?

Every time


Can this issue be reproduced from the UI?

Never tried it.


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Run baremetal OCS install (described below)
2. run ripsaw fio workload with default CR, modifying storage class to be one provided by OCS
3. monitor cluster block device throughput with cloud-bulldozer/dittybopper (attachment included)

Detailed install method at bottom.


Actual results:

/dev/nvme0n1 was not doing nearly as much throughput as it was capable of.  Consequently cluster performance was well below what it could have been.


Expected results:

/dev/nvme0n1 is used intelligently in 2 ways:

a) as a metadata device and journal device for the HDD OSDs, avoiding the double-write penalty and metadata-access seek time.

To do this correctly, you have to allocate enough space on the NVMe partitions to hold the metadata for the HDDs.  This can be calculated if you know the average object size.

- First allocate a 1-GiB WAL (write-ahead log) partition on an NVMe device for each HDD OSD.
- Next allocate the RocksDB partition for that OSD.  By default, you take a percentage (the Ceph docs say 4%) of the HDD size and allocate an NVMe partition of that size for RocksDB.  But if a user has unusually small objects/files, or uses erasure coding extensively, additional space may be needed; we should ask them what their average object/file size is and what kind of replication they use (3-way or EC k+m) and calculate it from that.
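As a back-of-the-envelope check, the sizing rule above can be expressed as a tiny shell calculation.  The 12 TiB HDD size here is a made-up example; the 4% figure is the Ceph default mentioned above:

```shell
#!/bin/sh
# Sketch of the sizing rule: 1 GiB WAL per HDD OSD, plus a RocksDB
# partition of ~4% of the HDD's capacity, both carved from NVMe.
HDD_SIZE_GIB=12288                       # hypothetical 12 TiB HDD
WAL_GIB=1                                # fixed 1 GiB write-ahead log
DB_GIB=$(( HDD_SIZE_GIB * 4 / 100 ))     # 4% of HDD capacity for RocksDB
NVME_NEEDED_GIB=$(( WAL_GIB + DB_GIB ))
echo "per-HDD NVMe space needed: ${NVME_NEEDED_GIB} GiB"
```

For small-object or erasure-coded workloads the 4% would be scaled up, per the caveat above.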


b) as a separate all-flash storage pool that can achieve an order of magnitude more random or small file/object IOPS than a collection of HDDs can.


One can make OCS do b), an all-flash pool, post-install by retargeting the pool at the "ssd" device class; this can be done without OCS being aware of it.  But you could argue that this should already be done by default for the cephfs metadata pool and for certain RGW pools that take very little space but require high IOPS.  It also would cost nothing to create separate pools targeted at SSD and HDD.  This is an opinionated installer!

Suggestions for mixed HDD+SSD baremetal configurations:
RBD pool - have 2 pools, 1 on SSD, and another on HDD
cephfs data pool - have 2 pools, 1 on SSD and another on HDD
cephfs metadata pool - 1 pool on SSD
RGW - have 2 data pools, 1 on SSD and another on HDD.
RGW metadata pools should be on SSD

By default, Cephfs storage class should land your data on HDD and your metadata on SSD.
By default, RGW should land your data on HDD and all metadata on SSD

Of course, if there are only SSD devices available, then you don't have to create both kinds of pools.  And if there are only HDD devices (not recommended) then all pools live there, obviously.
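The layout suggested above comes down to device-class CRUSH rules, roughly as sketched below (run from the toolbox pod; the "fast-rule"/"slow-rule" names and the pg counts are made up for illustration, and the pool names are the ones from this cluster):

```shell
# Sketch only: one CRUSH rule per device class, then pools pinned to them.
ceph osd crush rule create-replicated fast-rule default host ssd
ceph osd crush rule create-replicated slow-rule default host hdd

# RBD: two pools, one per device class
ceph osd pool create rbd-fast 128 128 replicated fast-rule
ceph osd pool create rbd-slow 512 512 replicated slow-rule

# CephFS: metadata belongs on flash, data defaults to HDD
ceph osd pool set example-storagecluster-cephfilesystem-metadata crush_rule fast-rule
ceph osd pool set example-storagecluster-cephfilesystem-data0 crush_rule slow-rule
```

The RGW data/metadata pools would be retargeted the same way with `ceph osd pool set <pool> crush_rule <rule>`.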


Additional info:

Ceph understands which devices are NVMe and which are HDD.

sh-4.2$ ceph df
RAW STORAGE:
    CLASS     SIZE       AVAIL      USED        RAW USED     %RAW USED 
    hdd       83 TiB     83 TiB     128 GiB      179 GiB          0.21 
    ssd       13 TiB     13 TiB      21 GiB       26 GiB          0.19 
    TOTAL     97 TiB     97 TiB     149 GiB      205 GiB          0.21 
 
And it's easy for Ceph to create a pool that either lives entirely on SSD or lives entirely on HDD.  This ancient article documents how to use Ceph device classes to create such storage pools.

https://ceph.io/community/new-luminous-crush-device-classes/

Here are examples of pools for RBD and Cephfs:

# ocos rsh $(ocos get pod | awk '/tools/{print $1}') ceph osd pool ls detail
...
pool 1 'example-storagecluster-cephblockpool' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode on last_change 190 lfor 0/0/179 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.49 application rbd
	removed_snaps [1~3]
...
pool 4 'example-storagecluster-cephfilesystem-metadata' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 160 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
...
pool 6 'example-storagecluster-cephfilesystem-data0' replicated size 3 min_size 2 crush_rule 5 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode on last_change 190 lfor 0/0/188 flags hashpspool stripe_width 0 target_size_ratio 0.49 application cephfs

Each pool has its own corresponding CRUSH rule, but if you look at the CRUSH rules, they are all the same.  For example, pool 6, the cephfs data pool, uses crush rule 5, but all the rules look alike and all blindly select OSDs based on rack, not device type.
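This is easy to confirm on a live cluster by dumping the rules from the toolbox pod.  A device-class-aware rule's "take" step references a shadow root such as "default~ssd", whereas the identical rules described here all take plain "default":

```shell
# Sketch (toolbox pod): list each rule name and what its "take" step selects.
# Class-aware rules show an item_name like "default~ssd"; class-blind rules
# show plain "default".
ceph osd crush rule dump | grep -E '"rule_name"|"item_name"'
```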


Installation methods:

I used this repo to install OCP4 on baremetal Alias lab cluster:

https://github.com/bengland2/ocp4_upi_baremetal

I then used Alex Calhoun's document to learn how to install OCS on baremetal; it comes down to this script, which uses various files in the same URL directory:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/openshift/upi/ocs-bringup/all-of-it.sh

Once OCS was up and running, I used benchmarks from this URL to test:

https://github.com/cloud-bulldozer/ripsaw

This particular problem can be observed with the fio benchmark; the attached graph was from an fio run.

Comment 2 Sébastien Han 2020-03-26 09:07:44 UTC
Setting to 4.6 as a tentative, but this requires work in LSO.

Comment 3 Michael Adam 2020-03-26 17:28:59 UTC
@Seb, if you set one release flag, you should remove the other. Bugzilla is rather dumb. ;-)

Comment 4 Ben England 2020-03-27 13:45:34 UTC
Created attachment 1674072 [details]
screenshot graph showing NVM device approaching max throughput

This graph was the result of an fio test where we used an all-NVMe storage pool in an OCS cluster.  It was easy to create from the shell like this.

oc create -f toolbox.yaml
sleep 5
alias cephpod="ocos rsh $(ocos get pod | awk '/tools/{print $1}') ceph "
cephpod osd crush rule create-replicated fast default host ssd
cephpod osd pool create fast 256 256 replicated fast
cat > fast-sc.yaml <<EOF
allowVolumeExpansion: false
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-storagecluster-ceph-rbd-fast
parameters:
  clusterID: openshift-storage
  csi.storage.k8s.io/fstype: ext4
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
  imageFeatures: layering
  imageFormat: "2"
  pool: fast
provisioner: openshift-storage.rbd.csi.ceph.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
EOF

oc create -f fast-sc.yaml

using 14 ripsaw fio pods spread across 7 hosts (2 per host), with this CR:

apiVersion: ripsaw.cloudbulldozer.io/v1alpha1
kind: Benchmark
metadata:
  name: fio-benchmark
  namespace: my-ripsaw
spec:
  elasticsearch:
    server: "marquez.perf.lab.eng.rdu2.redhat.com"
    port: 9200
  clustername: "bene-alias-cloud02-2020-03-24"
  test_user: bene
  workload:
    name: "fio_distributed"
    args:
      samples: 1
      servers: 14
      pin_server: ''
      jobs:
        - "write"
        - "randread"
      bs:
        - 4MiB
        - 4KiB
      numjobs:
      - 1
      iodepth: 4
      read_runtime: 60
      read_ramp_time: 5
      filesize: 2GiB
      log_sample_rate: 1000
      storageclass: example-storagecluster-ceph-rbd-fast
      accessmode: ReadWriteOnce
      storagesize: 30Gi

I got up to 1.4 GB/s throughput from each NVMe SSD during a write.  This is a significant percentage of the NVMe device's capability (I'd have to take down the Ceph cluster and dd to the NVMe device to find out what percentage).  I am measuring random IOPS next.  This 1 NVMe is the equivalent of 7 HDDs for a sequential workload, but for a random workload it should be 10-50 times faster than HDD if the Ceph pods can keep up.  Will update with random I/O results.

Comment 5 Elad 2020-09-12 11:43:14 UTC
Hi Sahina, Seb,

If this requires work in LSO, is there an OCP BZ to track this work?

One more thing: the performance difference seems significant enough to strive for having this done in OCS 4.6.  Can we consider retargeting?

Comment 6 Sahina Bose 2020-09-14 15:21:25 UTC
Not possible for 4.6
For 4.7, we have 2 epics that are related to this

1. Using SSD for metadata and HDD for data PVs
2. Creating multiple pools based on device type of OSDs.

The LSO work involved is to ensure we can identify the device type from the LSO PV & StorageClass.  The epic is not yet created in OCP storage.

Comment 7 Sahina Bose 2021-01-29 06:17:51 UTC
We have support for segregating Metadata and data via https://issues.redhat.com/browse/KNIP-1546
and support for specifying deviceClass (and overriding the auto-detected one) and pools based on deviceClass via https://issues.redhat.com/browse/KNIP-1545

We don't have a way to correct the auto-detection, as we rely on the rotational property of the device reported by the lsblk command.

Does this cover the asks of the bug?

Comment 8 Ben England 2021-02-08 16:17:20 UTC
Sorry, I didn't see the needinfo; too much e-mail.

Unfortunately, the previous comment mentions a workaround for the problem, but does not address the feature need for automatically distinguishing between SSDs and HDDs.  A typical storage customer would expect the storage system to understand which devices are SSDs and which are HDDs.  I understand the technical reasons why this is hard to do: Ceph just defaults to whatever /sys/block/sdX/queue/rotational says, but this is a lie in many cases (example: RAID controllers).  I thought Sebastien Han had suggested some solutions to this bug for several storage classes, based on querying storage-class-specific attributes.  For example, in AWS the storage type tells you (e.g. gp2 is SSD, st1/sc1 is HDD).

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html

I'm sure Azure is the same.  Is there any public cloud vendor that doesn't let the user query attributes of the device that indicate what kind of performance to expect?  Not sure about VMware.  For baremetal, there may be a way to use libstoragemgmt or something similar to dig out the metadata that distinguishes SAS/SATA SSDs from HDDs.

Secondly, until recently the only kind of device supported by OCS was SSD, so the detected device class should have been overridden to ssd from day 1.  But now that we are expanding support to HDDs, this problem has to be solved by better automatic detection; otherwise it could become a support nightmare.
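For reference, the rotational-flag detection under discussion comes down to something like this sketch; the `classify` helper is hypothetical, and as noted above, the kernel's flag can lie (e.g. behind RAID controllers):

```shell
#!/bin/sh
# Sketch: classify block devices the way the installer currently can,
# using only the kernel's rotational flag.
classify() {     # $1 = contents of /sys/block/<dev>/queue/rotational
    if [ "$1" = "0" ]; then echo ssd; else echo hdd; fi
}

# Walk every block device the kernel knows about and print its class.
for f in /sys/block/*/queue/rotational; do
    dev=$(basename "$(dirname "$(dirname "$f")")")
    echo "$dev: $(classify "$(cat "$f")")"
done
```

A smarter detector would cross-check this against provider metadata (EBS volume type, libstoragemgmt, etc.) as suggested above.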

Comment 9 Mudit Agarwal 2021-02-09 14:07:05 UTC
Looks like there are more requirements that need to be addressed; moving this to 4.8.

Comment 10 Sahina Bose 2021-06-04 07:20:00 UTC
Hi Seb,
Ben mentions in Comment 8 possible solutions suggested by you.  Do we have any reliable way of detecting disk type in cloud/virtualized environments?

Comment 11 Sébastien Han 2021-06-04 14:07:03 UTC
Hi Sahina,

I guess I was thinking we could build some kind of matrix based on the information the cloud/virt providers give us (through their respective documentation).
As Ben mentioned, we don't have any reliable way to determine the underlying disk family, so that's what we were thinking about.

Comment 12 Rejy M Cyriac 2021-09-06 14:59:19 UTC
Based on request from engineering, the 'installation' component has been deprecated

Comment 14 Mudit Agarwal 2022-02-14 15:49:38 UTC
Eran, please create a Jira epic for this.