Created attachment 1452684 [details]
perf stats from start of cluster fill workload job

Description of problem:
Using build RHCEPH-3.1-RHEL-7-20180530.ci.0
  filestore journals on NVMe devices
  bucket index pools on NVMe devices
  12x OSD nodes, 24 HDDs and 2 NVMe's each
  24 Client nodes
  COSbench workload generator

While performing a 25% cluster fill workload, the client I/O performance as
recorded by COSbench in throughput (ops/sec) and bandwidth (MB/s) drops
dramatically over the life of the run (see attached files: w21start.png and
w21end.png).

COSbench workload:
# cat fillWorkload.xml
<?xml version="1.0" encoding="UTF-8" ?>
<workload name="fillCluster" description="RGW GCrate testing">
  <!-- Initialization -->
  <storage type="swift" config="timeout=900000" retry="3"/>
  <auth type="swauth" config="username=johndoe:swift;password=KOVjOkEJbapU2RDyihys96FtOyAEGUtwCmRDNwUe;auth_url=http://localhost:5000/auth/1.0" />

  <workflow>
    <!-- Initialization -->
    <workstage name="init_containers">
      <work type="init" workers="1" config="containers=r(1,5)"/>
    </workstage>

    <!-- Prepare using container counts, object counts and object sizes from vars.shinc -->
    <workstage name="preparing_cluster">
      <work type="prepare" workers="72" config="containers=r(1,5);objects=r(1,400000);sizes=h(1|1|50,64|64|15,8192|8192|15,65536|65536|15,1048576|1048576|5)KB"/>
    </workstage>
  </workflow>
</workload>

Expected results:
Performance should remain (mostly) consistent for the life of the run.
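For context, a back-of-the-envelope reading of the size histogram in the
prepare stage (assuming the h(size|size|weight) buckets are in KB with
percentage weights, as in the config above): the average object size works
out to roughly 0.50*1 + 0.15*64 + 0.15*8192 + 0.15*65536 + 0.05*1048576
≈ 63,500 KB, i.e. about 62 MB per object. With up to 5 containers of 400,000
objects each, the prepare stage writes on the order of 2M RGW objects,
roughly 120 TB of data.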
Created attachment 1452686 [details] perf stats from end of cluster fill workload job
Created attachment 1452717 [details] charts
I have added a new attachment which shows the perf chart from COSbench.
The slowdown hits hard around sample 1600. With a 5 sec sampling interval,
that correlates to roughly 133 minutes into the run.
Created attachment 1452719 [details] latency plot over cluster fill runtime
Created attachment 1452720 [details] throughput plot over cluster fill runtime
Using build RHCEPH-3.1-RHEL-7-20180530.ci.0

Also worth noting that once the performance drops off, it does not seem to
come back. Throughput drops from ~200 ops/sec to around 50 ops/sec.

At the job start 'ceph -s' shows:
    client: 4148 kB/s rd, 8910 MB/s wr, 5002 op/s rd, 4345 op/s wr
After 133 minutes of runtime it's:
    client: 1046 kB/s rd, 2988 MB/s wr, 1273 op/s rd, 1636 op/s wr

I ran a 50% cluster fill test as well. The perf results indicate the same
behaviour: after 133 minutes of runtime the client I/O rates drop hard.

Total time to do the cluster fills:
* 25% fill: 300 min
* 50% fill: 1200 min
So four times the runtime for 2x the capacity.
Created attachment 1452722 [details]
OSD log from c04-h17-6048r (osd.83) during the run

Random OSD log from the run
An RGW log from the run (too large to attach): http://file.rdu.redhat.com/~dfuller/ceph-rgw-c04-h17-6048r.log-20180618.gz
A lengthy but informative email thread has raised the likely possibility that
this slowdown is due to a known issue with filestore split/merge activity.
Previously filed as BZ1219974, BZ1332874 and BZ1367448.

Based on previous testing results which proved to mitigate the effect, I will
be re-testing using increased filestore split/merge values, 16 and 48
respectively.

For a further explanation see pgs 34-36 of this document:
https://www.redhat.com/cms/managed-files/st-ceph-storage-qct-object-storage-reference-architecture-f7901-201706-v2-en.pdf
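In ceph.conf terms that would look roughly like the sketch below (this is an
assumed mapping of the two numbers onto the option names, 16 as the split
multiple and 48 as the merge threshold, not the exact retest config):

[osd]
# assumed mapping: 16 -> filestore_split_multiple, 48 -> filestore_merge_threshold
filestore_split_multiple = 16
filestore_merge_threshold = 48
# with these values a PG subdirectory splits at abs(48) * 16 * 16 = 12288
# objects, versus abs(10) * 2 * 16 = 320 with the defaults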
Reassigned to RADOS given the current discussion.
Created attachment 1453018 [details] filestore 50% fill using RHCS 3.0 (but w/different object sizes)
Created attachment 1453019 [details] bluestore 50% fill using RHCS 3.0 (but w/different object sizes)
> What about 'is this a regression from 3.0'?

The RGW failure testing I did in the Scale Lab earlier this year used RHCS
3.0. However, most of that testing was on bluestore, and we were using a
different object size distribution and more containers (100 vs 5) - so be
careful using this as a direct comparison.

We did see a significant cluster fill rate improvement for bluestore over
filestore. Those results can be seen here (SEE SLIDE 3):
https://docs.google.com/presentation/d/1sapLXG0652iHL8QCjnrn83gvzh6gfJqB2zt0QFo40R8/edit?usp=sharing

I looked through the COSbench results for those cluster fill runs. I do see a
performance dropoff during the filestore runs which I don't see with
bluestore. It is not as dramatic as what I am currently seeing, but it is
there.

Comparing a 50% cluster fill for bluestore vs filestore, both jobs completed
4.3 million operations and wrote 91 TB:
* filestore (jobID w168)
* bluestore (jobID w186)

I have attached two screenshots which show performance rates for the above two
COSbench runs. Notice that the bluestore run delivers consistent throughput,
response time and bandwidth. The filestore screenshot shows a reduction as the
run continues, starting at around sample number 2200.
Performed a run using these commands (expected_num_objects=50M):
# ceph osd crush rule create-erasure default.rgw.buckets.data myprofile
# ceph osd pool create default.rgw.buckets.data 4096 erasure myprofile default.rgw.buckets.data 50000000

The job started at 17:16:15 and completed at 20:58:28.
The first split event on osd.256 was at 19:17:44 UTC - two hours into the run.

I have attached the screenshot with the COSbench charts
(w34ExpectedNumObjects50M).

I am going to kick off another run increasing expected_num_objects to
1 trillion and will update the BZ with those results later.
Created attachment 1453029 [details] cluster fill with expectNumObj=50M at pool creation
It is worth noting that the performance dropoff for run w34 is much more
subtle than what I was observing on earlier runs. You can definitely see it
(reduced throughput, increased resTime) in the screenshot, but it is nowhere
near the slowdown that was seen when running without precreation. Again, I
will add the results from the 1 trillion precreation run later.
The 1 trillion precreate failed, leaving the cluster in HEALTH_WARN with 140
OSDs down. I will be at an offsite tomorrow, so further testing will have to
wait until Thursday. It seems we need to find a more reasonable value.
At the completion of my 25% cluster fill test, 'ceph df' reports 26M objects
in the data pool. In my most recent run (w34) I precreated 50M objects and
directory splitting still occurred - why?

And the BIGGIE:
Given an RGW workload (# of RGW objects, size of RGW objects, ...), what is
the formula for determining how many objects need to be precreated?

Also, when using precreate, how should these values be set?
* filestore_merge_threshold (default 10) - currently using -1
* filestore_split_multiple (default 2) - using default
Are they still impactful?

thanks,
John
(In reply to John Harrigan from comment #25)
> At the completion of my 25% cluster fill test 'ceph df' reports 26M objects
> in the data pool. In my most recent run (w34) I precreated 50M objects and
> directory splitting still occurred, why?

A couple possibilities come to mind:
- other pools are getting split (perhaps some rgw metadata pool ends up with
  more objects than I thought)
- distribution of objects is skewed
- there's a bug in pre-splitting

Do you have the output of 'tree' on an osd after pre-splitting, or 'ceph df'
and 'ceph osd dump' to see the number of objects per pg in other pools?

> And the BIGGIE:
> Given an RGW workload (# of RGW objects, size of RGW objects, ...) what is
> the formula for determining how many need to be precreated?
>
> Also, when using precreate how should these values be set?
> * filestore_merge_threshold (default 10) - currently using -1
> * filestore_split_multiple (default 2) - using default
>
> are they still impactful?

These still matter - as the docs mention, the threshold at which directories
are split is abs(merge_threshold) * split_multiple * 16, so using -1 instead
of 10 as the merge threshold results in a deeper directory hierarchy.

The most urgent thing to determine is not the optimal settings though, but
whether this is a regression from 3.0 - how soon can we run the same workload
with the same settings on 3.0 to compare?
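To put a number on that threshold (an illustrative calculation): with the
defaults (merge_threshold=10, split_multiple=2) a PG subdirectory splits at
abs(10) * 2 * 16 = 320 objects, while with the merge_threshold=-1 used in
these runs it splits at abs(-1) * 2 * 16 = 32 objects per subdirectory, so
splitting starts sooner and recurses into more levels for the same object
count.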
Created attachment 1454404 [details] cluster fill with expectNumObj=500M at pool creation (w38)
Further testing with RHCEPH-3.1-RHEL-7-20180530.ci.0 proves that using a large
enough expected_num_objects at pool creation avoids the slowdown. See
attachment "cluster fill with expectedNumObjects=500M".

Comparing the first attachment "charts" with this recent attachment, you will
see how the runtime for the cluster fill is dramatically reduced. Also, the
responseTime does not increase as the run progresses.

There are questions that remain:
* provide field guidance on the value to set for expected_num_objects. Given a
  certain number and size of RGW objects, containers and what else - what is
  the formula to calculate it?
* what are the tradeoffs for setting expected_num_objects to a high value?
  What value is too high?
* could this be done automatically for RGW data pools?

I see no mention of this topic in the "Ceph Object Gateway for Production"
guide:
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_object_gateway_for_production/
(In reply to John Harrigan from comment #28)
> Further testing with RHCEPH-3.1-RHEL-7-20180530.ci.0 proves that
> using a large enough expected_num_objects at pool creation avoids the
> slowdown. See attachment "cluster fill with expectedNumObjects=500M"
>
> Comparing first attachment "charts" and this recent attachment you will
> see how the runtime for cluster fill is dramatically reduced. Also the
> responseTime does not increase as the run progresses.
>
> There are questions that remain:
> * provide field guidance on value to set for expected_num_objects. Given a
>   certain number and size of RGW objects, containers and what else - what is
>   the formula to calculate?

Buckets don't matter, only the number of objects and their size are factors.
The number of rados objects is roughly:
  number of rgw objects * ceil(avg rgw object size / rgw_max_chunk_size)

> * what are the tradeoffs for setting expected_num_objects to high value.
>   what value is too high?

You get a deeper directory hierarchy, and thus less efficient dentry cache
use, which can increase latency.

> * could this be done automatically for RGW data pools?

We could use a conservative estimate as a default to avoid at least a few
levels of splitting when setting up rgw pools with ceph-ansible. E.g. the
default rgw data pool could be created with
expected_num_objects = num_osds * 10M.

> I see no mention of this topic in the "Ceph Object Gateway for Production
> guide"
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-
> single/ceph_object_gateway_for_production/

These guidelines should certainly be added to the docs.
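As a rough, illustrative application of that formula (assuming the default
rgw_max_chunk_size of 4 MB and the size histogram from the original
description): the 1 KB and 64 KB objects map to one rados object each, the
8 MB objects to about 2, the 64 MB objects to about 16, and the 1 GB objects
to about 256. Weighted by the histogram (50/15/15/15/5 percent) that is
roughly 0.5*1 + 0.15*1 + 0.15*2 + 0.15*16 + 0.05*256 ≈ 16 rados objects per
rgw object, so the ~2M rgw objects in the fill workload come out to on the
order of 30M rados objects - in line with the ~26M objects 'ceph df' reported
after the 25% fill, and a reasonable lower bound when choosing
expected_num_objects.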
Now we print a warning instead of trying to guess a good value automatically. Guessing the initial value too high can cause suicide timeouts on the OSDs.
Looks like Matt moved this to 3.2 by mistake. It will be marked as verified
in 3.1.
The following commits are in the upstream PR and missing from the downstream
ceph-3.1-rhel-patches branch:
  "mon/OSDMonitor: Warn when expected_num_objects will have no effect"
  "mon/OSDMonitor: Warn if missing expected_num_objects"
These patches have been applied downstream.
I do not know why this was assigned to me as QA contact. Resetting.
To verify, you can create a pool with expectedNumObjs and look at the
directory structure on one of the OSD nodes, like this:

# cd /var/lib/ceph/osd; for D in */; do
    cd $D
    echo -n "${PWD##*/} "
    find . -type d -print | wc -l
    cd ..
  done

This will print the number of directories found under each of the OSDs.
On a freshly deployed cluster these counts will be low. After you create
a pool with expectedNumObjs, these counts should go up.

In addition, if you run a write-intense workload (like fillCluster.sh from
https://github.com/jharriga/GCrate/tree/master) it will perform consistently
throughout its runtime when the default.rgw.buckets.data pool is created with
expectedNumObjs. The GCrate scripts contain two versions for pool creation:
* resetRGW.sh does not use expectedNumObj
* resetRGW.shPRECREATE does use expectedNumObj

Any questions, let me know.
- John
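For reference, pool creation with expectedNumObjs follows the same form used
for the earlier w34/w38 runs in this BZ. A hedged example only - the rule,
profile and pool names below are placeholders, the erasure-code profile is
assumed to already exist, and the final argument is expected_num_objects:

# ceph osd crush rule create-erasure verify.rgw.buckets.data testprofile
# ceph osd pool create verify.rgw.buckets.data 1024 erasure testprofile verify.rgw.buckets.data 500000000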
(In reply to John Harrigan from comment #40)
> to verify you can create a pool with expectedNumObjs and look at the
> directory structure on one of the OSD nodes, like this
>
> # cd /var/lib/ceph/osd; for D in */; do
>     cd $D
>     echo -n "${PWD##*/} "
>     find . -type d -print | wc -l
>     cd ..
>   done
>
> This will print the number of directories found under each of the OSDs.
> On a freshly deployed cluster these counts will be low. After you create
> a pool with expectedNumObjs, these counts should go up.
>
> In addition, if you run a write intense workload (like fillCluster.sh from
> https://github.com/jharriga/GCrate/tree/master) it will perform consistently
> throughout its runtime when the default.rgw.buckets.data pool is created
> with expectedNumObjs. The GCrate scripts contain two versions for pool
> creation:
> * resetRGW.sh does not use expectedNumObj
> * resetRGW.shPRECREATE does use expectedNumObj
>
> any questions, let me know.
> - John

The other aspect of verifying this BZ has to do with the warning msgs cited in
comment #30 above and in the External Bug ID (Github):
https://github.com/ceph/ceph/pull/23072

Here is a suggested procedure:
Try to create a filestore pool with >=100 PGs per OSD and verify a warning
message is printed.

# ceph osd pool create testpool $(( `ceph osd ls |wc -l` * 100 ))

A pool should be created and a warning message should be printed.
Verified with build 12.2.5-39.el7cp with the scenarios below:

1. Create a pool with expectedNumObjs and check the directory structure:
The directory count on the OSD node increases dramatically compared to not
using expectedNumObj:

[root@gprfs041 ~]# cd /var/lib/ceph/osd; for D in */; do
    cd $D
    echo -n "${PWD##*/} "
    find . -type d -print | wc -l
    cd ..
  done
> ceph-13 20245
> ceph-16 20937
> ceph-19 17551
> ceph-2 16559
> ceph-20 16835
> ceph-24 13679
> ceph-25 24103
> ceph-26 20653
> ceph-28 12759
> ceph-33 21079
> ceph-5 18875
> ceph-8 19011

2.
* Run fillCluster 30% with use of expectedNumObject (resetRGW.shPRECREATE):
  Performance is consistent during the run.
  Using numOBJ=40k and expectednumobj=50M, workers=10
  Bandwidth=1.03GB/S  Avg-ResTime=513ms  99%-RT=9710ms
* Run fillCluster 30% without use of expectedNumObjects (resetRGW.sh):
  numOBJ=40k and expectednumobj=50M, workers=10
  Performance is consistent during the run:
  Bandwidth=758.4MB/S  Avg-ResTime=703ms  99%-RT=15460ms

3. Also, testing the message printed out when creating a pool: the message is
displayed and the pool is created, which is the correct behavior.

[root@c04-h01-6048r yum.repos.d]# ceph osd pool create testpool 1024
For better initial performance on pools expected to store a large number of
objects, consider supplying the expected_num_objects parameter when creating
the pool.
[root@c04-h01-6048r yum.repos.d]# ceph osd lspools
0 rbd,1 .rgw.root,2 default.rgw.control,3 default.rgw.data.root,4 default.rgw.gc,5 default.rgw.log,6 testpool,
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2819