Bug 1583839

Summary:	bluestore osd_memory_target not holding
Product:	[Red Hat Storage] Red Hat Ceph Storage	Reporter:	Ben England <bengland>
Component:	Distribution	Assignee:	Andrew Schoen <aschoen>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Tejas <tchandra>
Severity:	high	Docs Contact:
Priority:	high
Version:	3.2	CC:	acalhoun, adeza, aschoen, assingh, ceph-eng-bugs, flucifre, gfidente, gsitlani, jdurgin, johfulto, kbader, kdreyer, mnelson, nthomas, pasik, wredtex
Target Milestone:	z2	Keywords:	FutureFeature, Tracking
Target Release:	3.3
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-10-01 16:38:15 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1578730

Description Ben England 2018-05-29 21:13:27 UTC

Description of problem:

This bug is sort of a tracker bug that may be linked to other more specific problems in Ceph (such as ceph-volume or ceph-ansible or Ceph trackers). It also requires performance work.

RHCS 3.2 will be the release that supports Bluestore for all customers, if I understand PM's intent. It should be *easy* to install in the sense that a minimal and comprehensible set of tuning knobs should be exposed to the sysadmin. It must have reasonable defaults. If we fail, it can be almost impossible to correct some of the problems, such as the RocksDB partition size, at a later date in a production environment (i.e. by rebuilding OSDs). Other problems, such as poor defaults, will cause the customer to conclude that Bluestore performance is inferior to, or no better than, Filestore, although clearly Bluestore represents a major performance improvement.

Here are several areas where Bluestore may require tuning today:

- RocksDB partition size
- reducing write amplification through RocksDB tuning (pg log for example)
- reducing tail latency (increasing fairness) to multiple clients
- memory allocation among competing Bluestore OSD processes

We need this to be much easier for the average system administrator. Specifically:

- The Ceph installer s/w could accept a user input for the average RADOS object size on an OSD or subset of OSDs. It could then calculate how big to make the RocksDB partition based on knowledge of device sizes, assuming each RADOS object requires roughly the same amount of metadata (I'm trying to measure that now). The default might be based on an assumption of 4-MB object size (i.e. RBD, Cephfs with large files, or RGW with large objects).

- The Ceph installer s/w could calculate how much RAM to assign to each OSD.
The user input here could be the fraction of physical RAM to assign to OSD caching. The installer would then divide this by the number of OSDs, assuming all OSDs are the same size (we can also weight OSD RAM allocation by size of device). A default might be 3/4 of physical memory on a large server (hyperconvergence of apps and OSDs could require lowering this). This default allows for co-resident monitors, MDS, etc. By specifying a fraction, we help to avoid the situation where the user oversubscribes memory.

- if necessary a web-app calculator or spreadsheet similar to the PG calculator can be used to help people derive the correct values for complicated configurations.

- Testing and analysis is needed to show how much of Ryan Meredith's (and Mark Nelson's) Micron tunings are necessary and, if they are, can be made default behavior. For example, should these tunings be Luminous defaults, and if not, why not?
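
The two installer calculations proposed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything the installer actually does: the function names are hypothetical, all OSDs on a host are assumed to be the same size, and the per-object metadata cost is a required input because it had not been measured when this bug was filed.

```python
GiB = 2**30
MiB = 2**20


def osd_memory_bytes(physical_ram_bytes, num_osds, ram_fraction=0.75):
    """Split a fraction of physical RAM evenly across the OSDs on one host.

    ram_fraction defaults to the 3/4 suggested in the description.
    """
    if num_osds <= 0:
        raise ValueError("num_osds must be positive")
    if not 0 < ram_fraction <= 1:
        raise ValueError("ram_fraction must be in (0, 1]")
    return int(physical_ram_bytes * ram_fraction) // num_osds


def rocksdb_partition_bytes(device_bytes, metadata_bytes_per_object,
                            avg_object_bytes=4 * MiB):
    """Estimate RocksDB partition size for one OSD data device.

    Assumes every RADOS object costs the same amount of metadata.
    metadata_bytes_per_object is a hypothetical input; no measured
    value is implied here.
    """
    if avg_object_bytes <= 0:
        raise ValueError("avg_object_bytes must be positive")
    max_objects = device_bytes // avg_object_bytes
    return max_objects * metadata_bytes_per_object


if __name__ == "__main__":
    # A host with 256 GiB of RAM and 8 OSDs, using the 3/4 default.
    print(osd_memory_bytes(256 * GiB, 8) / GiB, "GiB per OSD")
```

Taking a fraction rather than an absolute size as input keeps the total at or below physical RAM no matter how many OSDs the host has, which is the oversubscription argument made above.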

How reproducible:

John Harrigan's work in the Red Hat scale lab shows the effect of RocksDB spillover to HDD on increasing RGW latency over time. This was in part coupled with slowness of RGW garbage collection, but it could still happen whenever you fill up the OSD.

https://docs.google.com/presentation/d/1wwtYf9ymHwd8B1Utjn2JsqXRCTlAdeTm8VDoxdIWYxQ/edit#slide=id.g35e2252ad7_0_94

(available to non-Red-Hat upon request)

Micron had initial difficulties with tuning Bluestore to outperform Filestore, and Micron is an experienced Ceph user. In the end they got great results, particularly for lower tail latency.

https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21803/unleashing-the-power-of-flash-for-ceph-data-stores-an-all-nvme-ceph-performance-deep-dive

https://www.micron.com/about/blogs/2018/may/ceph-bluestore-vs-filestoreblock-performance-comparison-when-leveraging-micron-nvme-ssds

Steps to Reproduce:

1. Install Ceph with Bluestore using the documented RHCS installer, without tuning.
2. Compare performance to filestore using RBD or RGW, for an aged, fully-populated workload.

Actual results:

Bluestore can be slower than filestore when default caching per OSD (1 GB) and default RocksDB partition size of 1 GB are used.

Expected results:

Bluestore should never be worse, and should be significantly better than Filestore, as shown by the Micron (Ryan Meredith) article. Mark Nelson is the authority on Bluestore tuning, and has published a great deal of performance data to the Ceph community.

Alex Calhoun measured big speedups for Bluestore on HDD when the write size was a multiple of bluestore_min_alloc_size (which can avoid the double-write penalty), and Ben England measured a 4-5x speedup on RADOS object create speed for small 4-KB objects. See:

https://mojo.redhat.com/groups/product-performance-scale-community-of-practice/content?query=Bluestore

Additional info:

https://ceph.com/community/new-luminous-bluestore/
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

There has been some difficulty in measuring metadata per-object cost:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026333.html

Li Xiaoyan from Intel discusses Bluestore RocksDB I/O in:
https://www.youtube.com/watch?v=jKdNFaZHrf0

In her talk, she mentions that about half of the writes to Bluestore seem to be "pg log" writes, some of which may be transitory states and therefore maybe shouldn't be written to RocksDB persistent storage.

Comment 3 Ben England 2018-06-25 14:07:21 UTC

Alex Calhoun has reproduced user problems with getting Bluestore to outperform Filestore on an all-flash config; I cc'ed him. He is currently working on reproducing the Micron results using the smallest possible subset of Mark's tuning.

Comment 4 acalhoun 2019-01-29 20:52:29 UTC

NVMe SSD results: https://mojo.redhat.com/docs/DOC-1189708

HDD results: TBD

observations:

-Bluestore auto-tuning

--NVMe SSD resident memory usage ranged from 15% to 45% of total memory, or 4.5GB to 14.5GB per OSD (8 OSDs per host). When manually setting the memory to 50% of total available memory, or 16GB per OSD, performance had a ~1.7% gain. Based on these results it seemed unnecessary to modify the auto-tuning parameters in order to achieve better results.

--HDD resident memory usage ranged from 56% to 65% of total memory, or 6GB to 7GB per OSD (24 OSDs per host).
With both configurations memory appeared to fluctuate in response to the load, although in the case of HDDs (24 OSDs per host) variability seemed to be low, fluctuating between 1-2 GB, whereas with SSDs, with fewer OSDs per host (8), fluctuation was between 7-14 GB.

--On filestore, resident memory usage appeared to range from ~500MB to ~2GB per OSD on HDDs and ~200MB to ~900MB on SSD.

--We believe that osd_memory_target was set to 4GB. If osd_memory_target is to be treated as a ceiling, it is not functioning accordingly based on these results. An investigation targeted specifically at osd_memory_target and actual OSD memory usage may be needed as a follow-on experiment.

-RAM/OSD?
--SSD ranged from ~4.5GB to ~14.5GB
--HDD ranged from ~6GB to ~7GB

-RocksDB partition size?
--testing utilized an osd scenario of LVM; sizing was automatic and seemed to work with RocksDB sizing.
--rocksdb was partitioned on both NVMe SSD’s equally. Had 24 HDDs and 12 “ceph-block-dbs” were partitioned on each SSD.

-Micron tunings
--NVMe SSD testing shows that Micron tunings, and subset configurations, only provided a performance gain of ~3%. Given the number of parameters added and the minimal performance impact, it's suggested that these specific tunables are not necessary.
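
To put the memory observations above in terms of the suspected target, the short Python sketch below divides the reported per-OSD resident memory by osd_memory_target. It uses only the figures quoted in this comment. The 4GB target is the value we believe was set, not a confirmed setting, and the helper names are made up for illustration.

```python
# Target in GB as believed in this comment; not a confirmed setting.
OSD_MEMORY_TARGET_GB = 4.0

# (low, high) per-OSD resident memory in GB, and OSDs per host,
# as reported in the observations above.
REPORTED = {
    "NVMe SSD": {"rss_gb": (4.5, 14.5), "osds_per_host": 8},
    "HDD": {"rss_gb": (6.0, 7.0), "osds_per_host": 24},
}


def overshoot_ratio(rss_gb, target_gb=OSD_MEMORY_TARGET_GB):
    """Return how many times larger the resident memory is than the target."""
    return rss_gb / target_gb


def summarize(reported):
    """Yield (name, low ratio, high ratio, peak GB per host) per config."""
    for name, info in reported.items():
        low, high = info["rss_gb"]
        peak_host_gb = high * info["osds_per_host"]
        yield name, overshoot_ratio(low), overshoot_ratio(high), peak_host_gb


if __name__ == "__main__":
    for name, low, high, peak in summarize(REPORTED):
        print(f"{name}: {low:.2f}x to {high:.2f}x of target, "
              f"up to {peak:.0f} GB per host")
```

Even the lowest reported figure is above the target in both configurations, which is why the target does not look like a ceiling in these runs.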

Comment 5 Ben England 2019-03-26 20:46:29 UTC

I believe the RocksDB spillover issue is resolved in the case where the user specifies the simplest scenario (just osd_scenario: lvm + a list of devices).

osd_memory_target must be a firm limit under all conditions. Without this, we can't set reasonable limits for containerized OSD memory size in an HCI cluster, and we cannot easily co-locate OSDs with other Ceph daemons or applications. So I'm changing the bz name to be more specific. Once this issue is resolved, we can close the bz.
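
One way to check whether the limit holds is to compare each OSD process's resident set size with the configured target. The Python sketch below is a minimal illustration of that check, assuming a Linux host where the VmRSS field of /proc/<pid>/status is readable. The function names are hypothetical, and how the OSD pids and the configured target are discovered is left out.

```python
import re

# /proc/<pid>/status reports resident memory as e.g. "VmRSS:   123456 kB".
_VMRSS_RE = re.compile(r"^VmRSS:\s+(\d+)\s+kB$", re.MULTILINE)


def parse_vmrss_bytes(status_text):
    """Extract the resident set size in bytes from /proc/<pid>/status text."""
    match = _VMRSS_RE.search(status_text)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1)) * 1024


def read_rss_bytes(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vmrss_bytes(status_file.read())


def exceeds_target(rss_bytes, osd_memory_target_bytes):
    """True if resident memory is above the configured target."""
    return rss_bytes > osd_memory_target_bytes
```

Sampling this periodically under load, rather than once, would show whether usage fluctuates above the target the way the earlier measurements suggest.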

Comment 7 Giridhar Ramaraju 2019-08-05 13:12:19 UTC

Updating the QA Contact to Hemant. Hemant will be rerouting these bugs to the appropriate QE Associate.

Regards,
Giri

Comment 10 Ben England 2019-10-01 13:55:17 UTC

Please close this bug; it is stale. I have not heard of reported problems with osd_memory_target of late in Nautilus or even later versions of Luminous, and there is another bz, 1637153, that covers the problem of extreme random write workloads with Luminous. I am not concerned about Filestore at this point.