Description of problem: RHCS 3.1 supports specifying expected_num_objects, and procedural documentation should be added to the "Ceph Object Gateway for Production" guide. Version-Release number of selected component (if applicable): "Ceph Object Gateway for Production" guide. Additional info: The goal is to help users avoid Filestore splitting operations, which can dramatically slow client I/O performance. While this behaviour can affect all Ceph users, it is especially likely to impact RGW customers, since they typically have pools with many objects. Guide users through the procedure for determining the correct value for expected_num_objects, and illustrate it with several customer use cases.
https://access.qa.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_object_gateway_for_production/#considering-expected-object-count-rgw-adv https://access.qa.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_object_gateway_for_production/#creating-a-bucket-index-pool-rgw-adv https://access.qa.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_object_gateway_for_production/#creating-a-data-pool-rgw-adv
@John, will you be verifying this bug?
Doug, how are we going to resolve this if pgcalc does not add support? Can we recommend values for expected_num_objects, perhaps based on cluster size (small, medium, large)?
(In reply to John Wilkins from comment #3) > https://access.qa.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_object_gateway_for_production/#considering-expected-object-count-rgw-adv There is a gotcha here: expected_num_objects only works correctly as of RHCS 3.1, so the notes on "RHCS 3.1 and earlier" need to be rephrased. STATES - 5.1. Considering Expected Object Count (RHCS 3.1 and Earlier) SHOULD BE - 5.1. Considering Expected Object Count (RHCS 3.1 and Later, When Using Filestore) > https://access.qa.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_object_gateway_for_production/#creating-a-bucket-index-pool-rgw-adv Section 5.6.1 STATES - For RHCS 3.1 and earlier releases, SHOULD BE - For RHCS 3.1 and later releases, when using Filestore > https://access.qa.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_object_gateway_for_production/#creating-a-data-pool-rgw-adv Section 5.6.2 STATES - For RHCS 3.1 and earlier releases, SHOULD BE - For RHCS 3.1 and later releases, when using Filestore
(In reply to Harish NV Rao from comment #4) > @John, will you be verifying this bug? Yes, I can do that.
We are still missing an RGW pool creation procedure here, to guide an RGW admin. This is not trivial. During Scale Lab testing I used this script: https://github.com/jharriga/GCrate/blob/master/resetRGW.sh Perhaps it can be used to document a procedure.
(In reply to John Harrigan from comment #8) > We are still missing an RGW pools creation procedure here, to guide an RGW > admin. > This is not trivial > During Scale Lab testing I used this script > https://github.com/jharriga/GCrate/blob/master/resetRGW.sh > > Perhaps it can be used to document a procedure I've fixed the headings. Is there something else you want to do on a pool procedure? The PG calculator isn't something our team controls. https://access.qa.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_object_gateway_for_production/#considering-expected-object-count-rgw-adv https://access.qa.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_object_gateway_for_production/#creating-an-index-pool-rgw-adv https://access.qa.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_object_gateway_for_production/#creating-a-data-pool-rgw-adv
(In reply to John Harrigan from comment #5) > Doug, how are we going to resolve this if pgcalc does not add support? > Can we recommend values for expected_num_objects, perhaps based on cluster > size (small, medium, large)? That's a good idea. I'll ask Neha to suggest some values.
We have discussed different values for expected_num_objects, like 1M, 5M, 50M, 500M, but I am not sure if there is a way to determine values based on small, medium and large clusters. Perhaps, we could use the guidelines that Josh mentioned here: https://bugzilla.redhat.com/show_bug.cgi?id=1592497#c29 "E.g. the default rgw data pool could be created with expected_num_objects = num_osds * 10M"
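The guideline quoted above (expected_num_objects = num_osds * 10M) can be sketched as a small shell calculation. The OSD count below is an illustrative assumption, not a recommendation:

```shell
#!/bin/sh
# Hypothetical worked example of the suggested guideline
# expected_num_objects = num_osds * 10M. The OSD count is illustrative.
NUM_OSDS=312                 # example cluster size (assumption)
PER_OSD_OBJECTS=10000000     # 10M objects per OSD, per the quoted guideline
EXPECTED_NUM_OBJECTS=$((NUM_OSDS * PER_OSD_OBJECTS))
echo "$EXPECTED_NUM_OBJECTS" # prints 3120000000
```

An admin would substitute the actual OSD count for NUM_OSDS; this only computes the value, it does not create a pool.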
(In reply to nojha from comment #11) > We have discussed different values for expected_num_objects, like 1M, 5M, > 50M, 500M, but I am not sure if there is a way to determine values based on > small, medium and large clusters. > > Perhaps, we could use the guidelines that Josh mentioned here: > https://bugzilla.redhat.com/show_bug.cgi?id=1592497#c29 > > "E.g. the default rgw data pool could be created with expected_num_objects = > num_osds * 10M" In the Scale Lab we had 12x OSD nodes, each with 26 OSD devices (24 HDDs and 2 bucket index OSDs), for a total of 312 OSDs. I ran testing with 500M, and for our workload it mitigated Filestore splitting. We did run into OSD suicide timeouts when expected_num_objects was set very high (i.e., one trillion). You also need to allow time for pool creation when using expected_num_objects. My attempts to determine a pattern for predicting that time lag in a repeatable manner were not successful.
(In reply to John Harrigan from comment #12) > In the Scale Lab we had 12x OSD nodes, each with 26 OSD devices (24 HDDs and > 2 bucket index OSDs), for a total of 312 OSDs. I ran testing with 500M and > for our workload it mitigated filestore splitting. > > We did run into OSD suicide timeouts when expected_num_objects was set very > high (ie. one trillion). Would it make sense to suggest an upper bound for expected_num_objects, based on your experience, John, along with the suggested formula? > You need to leave time for the pool creation when using > expected_num_objects. My > attempts to determine a pattern for predicting that time lag in a repeatable > manner was not successful.
Looks pretty close, but we should emphasize that setting expected_num_objects is only needed when creating pools which will have high object counts, such as the rgw data pool. expected_num_objects is the expected number of objects for the pool. By setting this value (together with a negative filestore merge threshold), the PG folder splitting happens at pool creation time, avoiding the latency impact of runtime folder splitting. While this behaviour can affect all Ceph users, it is especially likely to impact RGW customers, since they likely have pools with many objects (i.e., "default.rgw.buckets.data"). And yes, the above referenced Red Hat Ceph Storage Hardware Selection Guide is a public link.
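Putting the pieces of this thread together, a pool-creation step might be sketched as below. This is an assumption-laden sketch only: the pool name, PG counts, CRUSH rule name, OSD count, and merge threshold are illustrative, and the script prints the command rather than running it against a cluster:

```shell
#!/bin/sh
# Hypothetical sketch: pre-split the RGW data pool at creation time.
# All values below (pool name, PG count, rule, OSD count, threshold)
# are illustrative assumptions, not verified recommendations.
#
# Filestore only: a negative merge threshold in ceph.conf prevents
# PG folder merging, e.g.:
#   [osd]
#   filestore merge threshold = -10
NUM_OSDS=312
EXPECTED_NUM_OBJECTS=$((NUM_OSDS * 10000000))  # num_osds * 10M guideline
# Syntax: ceph osd pool create <name> <pg-num> <pgp-num> replicated \
#           <crush-rule-name> <expected-num-objects>
CMD="ceph osd pool create default.rgw.buckets.data 4096 4096 replicated replicated_rule $EXPECTED_NUM_OBJECTS"
echo "$CMD"   # review, then run manually once values are confirmed
```

Creating the pool with a high expected_num_objects can take noticeable time, per the Scale Lab observations earlier in this thread, so the admin should plan for that.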
Looks good. Thanks.