
Bug 1622597

Summary: RHCS 3.2 documentation must cover how to configure Bluestore
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Ben England <bengland>
Component: Documentation
Assignee: Aron Gunn <agunn>
Status: CLOSED CURRENTRELEASE
QA Contact: Parikshith <pbyregow>
Severity: high
Priority: high
Version: 3.1
CC: agunn, dfuller, hnallurv, jdurgin, jharriga, kdreyer, mnelson, pasik, vakulkar, vumrao
Target Milestone: rc
Target Release: 3.2
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2019-08-26 06:55:34 UTC
Bug Blocks: 1641792

Description Ben England 2018-08-27 14:51:34 UTC
Description of problem:

As of RHCS 3.2, BlueStore will be the default object store instead of FileStore, and it will no longer be a Technology Preview feature, so the documentation needs to cover it explicitly and explain to the customer how to configure the right hardware for it and install it correctly.  AFAIK the RHCS 3.1 documentation does not.  There are two areas:

1) The highest priority is to explain how to avoid "RocksDB spillover" in HDD configurations.  This is the situation where the SSD-resident RocksDB partition is too small to hold the BlueStore metadata, so BlueStore starts putting RocksDB data on the HDDs, with a consequent loss of performance.  The default RocksDB partition size is 1 GB, which is 1-2 orders of magnitude too small.  This misconfiguration cannot easily be fixed once a production Ceph cluster is running - we have to prevent it from happening in the first place!

The upstream Ceph documentation for Bluestore configuration was recently updated to cover this topic, see:

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing

as well as the rest of the page.  However, that page does not mention ceph-ansible, so the RHCS documentation needs to be very specific about how to use ceph-ansible to achieve the appropriate sizing.
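To make the sizing concrete: the upstream page suggests that block.db should be no smaller than roughly 4% of the data device (re-check the exact figure against the current docs).  A quick sketch of what that implies for an HDD-backed OSD:

```python
# Sketch of the upstream sizing guidance: block.db of at least ~4% of
# the data device. The 4% ratio is taken from the upstream page above;
# verify it against current documentation before relying on it.

def min_db_partition_bytes(data_device_bytes: int, ratio: float = 0.04) -> int:
    """Smallest suggested RocksDB (block.db) partition for one data device."""
    return int(data_device_bytes * ratio)

hdd = 4 * 1000**4                               # a 4 TB data HDD
print(min_db_partition_bytes(hdd) // 1000**3)   # -> 160 (GB)
```

At that ratio a 4 TB HDD wants a 160 GB block.db, which is why the 1 GB default spills so quickly.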


2) If RHCS 3.2 still requires the sysadmin to specify parameters for the BlueStore OSD cache, the documentation must explain how to do this correctly, since it has a major impact on performance.  I suspect the user may still be required to specify what percentage of memory should be used for OSD caching, because of the possibility of hyperconverged storage (application and OSD on the same node).
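For reference, the Luminous-era knobs in question look like this in ceph.conf; the option names are the real BlueStore cache settings, but the values are illustrative and the shipped defaults should be confirmed against the release:

```ini
[osd]
# Per-OSD BlueStore cache size, by backing-device type (bytes).
bluestore_cache_size_hdd = 1073741824    # 1 GiB for HDD-backed OSDs
bluestore_cache_size_ssd = 3221225472    # 3 GiB for SSD-backed OSDs
```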



The information must appear here:

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/red_hat_ceph_storage_hardware_selection_guide/

It should explain how much NVM SSD capacity is recommended for BlueStore RocksDB partitions and how to select the right NVM SSD.

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/installation_guide_for_red_hat_enterprise_linux/

should explain how to configure ceph-ansible to correctly deploy BlueStore OSDs.
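A minimal group_vars/osds.yml for such a deployment might look like the sketch below; osd_scenario, osd_objectstore, and devices are real ceph-ansible variables, but the device paths are placeholders:

```yaml
# group_vars/osds.yml -- BlueStore deployment sketch (paths illustrative)
osd_scenario: lvm
osd_objectstore: bluestore
devices:
  - /dev/sda          # data HDDs
  - /dev/sdb
  - /dev/nvme0n1      # fast device; ceph-volume lvm batch places block.db here
```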

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/troubleshooting_guide/

should explain how to determine whether RocksDB spillover to HDD devices has occurred, and what the customer's options are for resolving it.
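On a Luminous-based release, one indicator is the bluefs section of `ceph daemon osd.<id> perf dump`: a non-zero slow_used_bytes means BlueFS has placed RocksDB data on the slow (HDD) device.  A minimal sketch of checking that output (the exact JSON shape is an assumption to verify against a live OSD):

```python
import json

def has_spillover(perf_dump_json: str) -> bool:
    """True if BlueFS reports any bytes on the slow (HDD) device."""
    bluefs = json.loads(perf_dump_json).get("bluefs", {})
    return bluefs.get("slow_used_bytes", 0) > 0

# e.g. output captured from: ceph daemon osd.0 perf dump
sample = '{"bluefs": {"db_used_bytes": 524288000, "slow_used_bytes": 1048576}}'
print(has_spillover(sample))   # -> True: RocksDB has spilled to HDD
```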

Comment 2 Ben England 2018-09-19 12:57:18 UTC
re 1) ceph-volume lvm batch solves part of this problem by letting the user specify just the device names - the user no longer has to set up LVM volumes with the correct sizes on every host.

http://docs.ceph.com/docs/master/ceph-volume/lvm/batch/

However, in some cases the user has to set the bluestore_block_db_size parameter in ceph.conf if they do not want the entire NVM device to be used only for RocksDB.  This is not covered in the lvm batch command documentation above.
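For that non-default case, the override would look something like this in ceph.conf before running the batch command; the 64 GiB value is purely illustrative:

```ini
[global]
# Cap each OSD's RocksDB (block.db) volume so ceph-volume lvm batch
# does not hand the entire NVM device to RocksDB. Value illustrative.
bluestore_block_db_size = 68719476736    # 64 GiB
```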

re 2) Mark Nelson's recent PRs attempt to automate the optimal division of OSD cache memory between RocksDB, onodes, and data.  If his PR works, then the user only has to specify how much memory the OSD should get.  This could be expressed in ceph-ansible/rook as a percentage of physical memory for all OSDs and then divided evenly among the OSDs by the installer/operator.

But this is non-trivial: you have to know whether Ceph OSDs are located on the same host as other Ceph services or applications, and how much memory those would need.  Perhaps the answer is different for different hosts?  The conservative approach would be to default to 1/2 of physical memory (still better than the current default) and then allow the user to increase it when they are sure that Ceph OSDs are not sharing the hardware with other services or apps.
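The conservative split described above can be sketched as follows; the 50% default and the even per-OSD division come from this comment, not from any shipped ceph-ansible formula:

```python
def per_osd_memory_target(phys_mem_bytes: int, num_osds: int,
                          osd_fraction: float = 0.5) -> int:
    """Give a fraction of physical memory to Ceph, split evenly per OSD."""
    if num_osds < 1:
        raise ValueError("need at least one OSD")
    return int(phys_mem_bytes * osd_fraction) // num_osds

# 256 GiB host running 12 OSDs, half of memory reserved for the OSDs
print(per_osd_memory_target(256 * 2**30, 12))   # bytes per OSD (~10.7 GiB)
```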

Comment 5 Ben England 2018-11-30 18:33:28 UTC
Unfortunately I think this documentation change is somewhat out of date.  ceph-ansible has been changed to try to set the OSD cache size automatically, but it's very conservative and the user can override with osd_memory_limit var, I think.   Also ceph-ansible tries to use as much SSD space as possible for RocksDB, so the user should not have to set it at all in a normal install, unless they are doing something non-standard.

Comment 7 Josh Durgin 2018-12-14 23:23:18 UTC
(In reply to Ben England from comment #5)
> Unfortunately I think this documentation change is somewhat out of date. 
> ceph-ansible has been changed to try to set the OSD cache size
> automatically, but it's very conservative and the user can override with
> osd_memory_limit var, I think.   Also ceph-ansible tries to use as much SSD
> space as possible for RocksDB, so the user should not have to set it at all
> in a normal install, unless they are doing something non-standard.

Yes, in 3.2 the cache settings shouldn't be changed by the user - the new 'osd_memory_target' setting instead controls the BlueStore cache size, tuning it dynamically to keep the OSD within the desired total memory usage. osd_memory_target is set by ceph-ansible automatically, so the user does not need to be aware of it by default.

To avoid confusion, we could mention that an OSD using BlueStore will use more memory than one using FileStore, since BlueStore does its own caching rather than using the page cache, so that memory is attributed to the OSD process instead of the kernel.
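If a user on dedicated OSD hosts does want to raise the target, one place to express it in ceph-ansible is the real ceph_conf_overrides variable; the 8 GiB value below is only an illustration:

```yaml
# group_vars/all.yml -- optional override; by default ceph-ansible
# computes osd_memory_target itself and nothing needs to be set.
ceph_conf_overrides:
  osd:
    osd_memory_target: 8589934592    # 8 GiB per OSD (illustrative)
```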

Comment 8 Ben England 2019-01-27 16:11:21 UTC
The osd_memory_target variable is discussed in this section of the RHCS 3.2 documentation:

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/administration_guide/index#adding-osd-that-use-bluestore

I verified that osd_scenario: lvm did the right thing with the RocksDB partition size, so I think we can close this now.